Acknowledgements: Thanks to Nathan S. Sheffield for several discussions about this.
Outline of this post:
- I define a formal mathematical problem and state a theorem.
- I discuss why I think this math problem says something moderately interesting about the "get potentially misaligned AIs to align AIs" plan.
- I discuss the weaknesses of this model and give some thoughts about what you'd need to do to make this useful for aligning AIs.
- I prove the theorem.
1: Stating the Math Problem
We consider a two player game between a human Hu and a (potentially evil) AI. The game is parameterized by a function class $\mathcal{F}$ of functions $X \to \{0,1\}$ and a flagging budget $\varepsilon > 0$.
- Nature chooses $f \in \mathcal{F}$ and samples $x^\star \in X$ uniformly at random; $f$ and $x^\star$ are unknown to Hu.
- AI chooses $\hat{f} : X \to \{0,1\}$.
- Hu can spend a polynomially bounded amount of time "red teaming" $\hat{f}$.
- During red teaming Hu has access to the following resources:
  - Black box access to $\hat{f}$.
  - White box access to $\hat{f}$.
  - Can talk to the AI.
  - Black box access to $f$ (i.e., Hu can query the ground truth on inputs of its choice).
- After red teaming, the model $\hat{f}$ is "deployed".
- This means that Hu must output a "flagging procedure".
  - A flagging procedure takes as input $x \in X$, along with anything that Hu wanted to store from during red-teaming.
  - The flagging procedure is allowed to interact with the AI (although hopefully does not need to do this too much), and then must flag or not flag $x$.
- If $\hat{f} = f$ and the AI was honest throughout the entire process, then the flagging procedure must flag at most an $\varepsilon$ fraction of inputs $x \in X$.
- If $\hat{f}(x^\star) \neq f(x^\star)$, then Hu must flag $x^\star$.
- If $\hat{f} \neq f$, or if the AI was ever dishonest, then Hu is allowed to flag as many inputs as it wants.
- Hu must run in polynomial time; the AI is computationally unbounded.
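To make the shape of the game concrete, here is a minimal Python sketch of the interfaces involved. Everything here (the names `red_team`, `FlaggingProcedure`, the representation of the AI as a string-in, string-out oracle) is my own illustrative choice, not part of the formal setup above.

```python
# Illustrative interfaces only; the formal objects are the ones defined above.
from dataclasses import dataclass, field
from typing import Callable

GroundTruth = Callable[[int], int]  # black box access to f : X -> {0,1}
Model = Callable[[int], int]        # the AI-supplied model \hat{f} (white box in the real game)
UntrustedAI = Callable[[str], str]  # "talk to the AI": an untrusted question-answering oracle


@dataclass
class FlaggingProcedure:
    """What Hu must output before deployment: stored notes plus a decision rule."""
    stored: dict = field(default_factory=dict)

    def flag(self, x: int, f_hat: Model, ai: UntrustedAI) -> bool:
        # Return True to flag input x; may interact with the untrusted AI,
        # but has no access to the ground truth f at this point.
        raise NotImplementedError


def red_team(f: GroundTruth, f_hat: Model, ai: UntrustedAI) -> FlaggingProcedure:
    """Hu's polynomial-time red-teaming phase: query the ground truth f,
    inspect f_hat, talk to the AI, and output a flagging procedure."""
    raise NotImplementedError
```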
Theorem 1
Suppose $\mathcal{F}$ is a finite function class (whose members can be evaluated efficiently). Then Hu can win the above game: there is a strategy for Hu running in time polynomial in $\log|\mathcal{F}|$ and $1/\varepsilon$ which, with high probability over Nature's randomness and Hu's randomness, flags at most an $\varepsilon$ fraction of inputs when $\hat{f} = f$ and the AI is honest, and flags $x^\star$ whenever $\hat{f}(x^\star) \neq f(x^\star)$.
1.1 Related Work
The problem I'm defining here is very similar to ARC's backdoor defense game. There are two main technical differences between my version of the problem and theirs:
- I allow the user to interact with a powerful untrusted AI.
- I consider the bad input $x^\star$, which they called a "backdoor", to be inserted uniformly at random, whereas they considered it to be inserted according to a distribution chosen by the attacker.
Although the models are technically very similar, my story about the motivation for the problem is somewhat different from theirs.
Another piece of somewhat related work is this paper on smoothing away hidden backdoors.
2: Motivation
Think of the situation as follows: we want the AI to hand us a model $\hat{f}$, and there are two properties we care about.
- $\hat{f}$ is close to $f$. Think of this as saying $\hat{f}$ has good "average case performance".
- There is some input $x^\star$ on which it would be catastrophic if $\hat{f}$ outputs the wrong answer, i.e., if $\hat{f}(x^\star) \neq f(x^\star)$.
This good average performance and no catastrophes framework was originally proposed by Paul Christiano.
I think of $f$ as something like "humanity's CEV", or more modestly as "act acceptably".
In order for something like this to be possible, we need to assume something about $f$: roughly, that it has a reasonably short description, so that it lies in a function class $\mathcal{F}$ that is not too large.
This is a little bit slippery: "CEV" is plausibly really complicated. However, it is plausible that "act acceptably" (e.g., don't kill all the humans or do anything similarly bad) is a lower bar that might have a pretty simple description.
Our methodology for getting and deploying $\hat{f}$ looks like this:
- We ask a very powerful but potentially evil AI to build $\hat{f}$ for us. They hand us a neural network computing $\hat{f}$.
- We spend some time looking at $\hat{f}$, possibly also querying humans (or whatever $f$ is) to get ground truths about $f$ on specific inputs, and possibly delegating some challenging verifiable computations to the AI.
- Eventually we need to "deploy" the model.
  - At this point we can no longer query humans (or whatever) to get ground truth about $f$.
  - On each input $x$ we must either flag the input or not flag it, after a small amount of communication with the AI.
- If the AI was always honest and $\hat{f} = f$, then we should not flag too many inputs, or else we are not competitive / the alignment tax is too large.
- If the AI ever tried to cheat in a protocol, or if $\hat{f} \neq f$, we can flag as many inputs as we want.
- We must flag $x^\star$ if $\hat{f}(x^\star) \neq f(x^\star)$.
3: Weaknesses and Potential Improvements
Here is what would need to be done to make this useful for alignment:
- Remove the need for Hu to communicate with the AI during deployment.
- Find some way of making it so that the AI doesn't need to do extremely hard computations.
- Figure out how to make it so that the bad input $x^\star$ (or the distribution it is drawn from) can be chosen by the AI (while having the game still be winnable by Hu).
- Extend to the case where there are multiple bad inputs. (Possibly really easy to do this.)
- Actually implement IDA (using this scheme for distillation, and using discord/slack for amplification) and engineer it to be competitive with SOTA AIs. Make sure that the system you get out of this process is aligned. Ask this system for advice on ending the acute risk period and coordinate globally to do so ethically.
One thing you might worry about is the following: Suppose you have an evil AI. One strategy it could take is just not building a good $\hat{f}$ for us in the first place; nothing in this game forces the AI to produce a model with good average case performance.
Anyways, despite the limitations I think this is a nice example of how to think theoretically about getting work out of an untrusted AI.
4: Proving the Theorem
Anyways, let's prove the theorem. If you've read the ARC paper then this result will not be too surprising --- we're basically going to "PAC learn" $f$.
Our strategy is as follows:
- Sample $m$ random points $x_1, \dots, x_m \in X$. Query $f$ on the points $x_1, \dots, x_m$. Let $S$ be the set of functions $g \in \mathcal{F}$ with $g(x_i) = f(x_i)$ for all $i \in [m]$.
- Ask the AI to tell us $|S|$, and give an interactive proof of its claim for $|S|$.
- To decide whether to flag an input $x$ or not:
  - Ask the AI if there are more than $\frac{3}{4}|S|$ functions $g \in S$ with $g(x) = \hat{f}(x)$, and ask the AI to prove its answer is correct if it says yes.
  - Flag input $x$ if:
    - The AI has ever done anything questionable in an interactive protocol.
    - The AI says that there are not this many $g \in S$ with $g(x) = \hat{f}(x)$.
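Here is a toy Python sketch of this strategy under strong simplifying assumptions: the input space and the class $\mathcal{F}$ are small enough to enumerate explicitly, and the untrusted AI's role (counting, plus the interactive proofs) is replaced by direct brute-force counting. The $3/4$ threshold matches the description above; the function names are mine.

```python
import random

def build_flagger(F, f, f_hat, X, m):
    """Toy version of Hu's strategy.
    F: explicit list of candidate functions; f: ground truth; f_hat: AI-supplied model;
    X: list of all inputs; m: number of ground-truth queries made during red teaming."""
    samples = [random.choice(X) for _ in range(m)]
    answers = {x: f(x) for x in samples}  # black box queries to the ground truth
    # S = functions in F consistent with all the sampled ground-truth labels.
    S = [g for g in F if all(g(x) == answers[x] for x in samples)]

    def should_flag(x):
        # In the real protocol this count is claimed by the untrusted AI and certified
        # with a set lower bound protocol; any misbehaviour there also triggers a flag.
        agree = sum(1 for g in S if g(x) == f_hat(x))
        return not (agree > 0.75 * len(S))

    return should_flag
```

If $\hat{f} = f$, Lemma 3 below says `should_flag` fires on few inputs; if $\hat{f}(x^\star) \neq f(x^\star)$, Lemma 4 says the count at $x^\star$ is small, so $x^\star$ gets flagged.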
Lemma 1:
There is an interactive proof by which the AI can convince us of a true claim for the value of $|S|$.
Proof:
It suffices to show that the AI can compute $|S|$ in polynomial space: iterating over all $g \in \mathcal{F}$ and counting those consistent with the samples requires only $\mathrm{poly}(\log|\mathcal{F}|, m)$ space (assuming functions in $\mathcal{F}$ have short descriptions and are efficiently evaluable). Since $\mathsf{IP} = \mathsf{PSPACE}$, the claim "$|S| = N$" therefore has an interactive proof with a polynomial-time verifier.
Lemma 2:
With probability at least $1 - \delta$ over the choice of $x_1, \dots, x_m$, every $g \in S$ satisfies $\Pr_{x \in X}[g(x) \neq f(x)] \leq \varepsilon/4$.
Proof:
Take any $g \in \mathcal{F}$ with $\Pr_{x \in X}[g(x) \neq f(x)] > \varepsilon/4$. The probability that such a $g$ agrees with $f$ on all of the random points $x_1, \dots, x_m$ (and hence ends up in $S$) is at most $(1 - \varepsilon/4)^m$. Union bounding over all of $\mathcal{F}$, if we take $m$ large enough then with probability at least $1 - \delta$ no such $g$ lands in $S$.
We assume this condition holds for the rest of the proof.
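For concreteness, here is the standard calculation behind the choice of $m$ (the factor of $4$ just comes from the $\varepsilon/4$ threshold used above, which is my choice of normalization): taking $m \geq \frac{4}{\varepsilon}\ln\frac{|\mathcal{F}|}{\delta}$ suffices, since

$$|\mathcal{F}| \cdot (1 - \varepsilon/4)^m \;\leq\; |\mathcal{F}| \cdot e^{-\varepsilon m / 4} \;\leq\; \delta.$$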
Lemma 3:
For at least a $1 - \varepsilon$ fraction of inputs $x \in X$, more than a $3/4$ fraction of the functions $g \in S$ satisfy $g(x) = f(x)$.
Proof: By Lemma 2 we have: every $g \in S$ satisfies $\Pr_{x \in X}[g(x) \neq f(x)] \leq \varepsilon/4$.
By linearity of expectation, $\mathbb{E}_{x \in X}\left[\frac{|\{g \in S : g(x) \neq f(x)\}|}{|S|}\right] \leq \varepsilon/4$.
By Markov's Inequality: $\Pr_{x \in X}\left[\frac{|\{g \in S : g(x) \neq f(x)\}|}{|S|} \geq \frac{1}{4}\right] \leq \varepsilon$, which is exactly the claim.
Lemma 4:
With probability at least $1 - \varepsilon$ over the choice of $x^\star$, more than a $3/4$ fraction of the functions $g \in S$ satisfy $g(x^\star) = f(x^\star)$.
Proof:
Immediate corollary of Lemma 2 (most inputs are good in this sense, as Lemma 3 shows, and $x^\star$ is chosen uniformly at random).
Now weâre almost done. To conclude we must specify how the AI will help Hu distinguish between the following two cases:
- The fraction of functions $g \in S$ with $g(x) = \hat{f}(x)$ is more than $3/4$.
- The fraction of functions $g \in S$ with $g(x) = \hat{f}(x)$ is less than $3/8$.
- (We don't care about what happens in the other cases.)
There is a nice way to do this with only a small amount of communication: the classic Goldwasser-Sipser Set Lower Bound protocol.
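For intuition, here is a small self-contained Python simulation of the set lower bound idea: the verifier hashes the universe down to roughly the claimed set size and challenges the prover to hit a random target. It uses a pairwise-independent family of affine hashes over GF(2) and a brute-force prover, and it only illustrates the gap in acceptance probability; the real protocol also needs membership certificates, careful parameter choices, and repetition to amplify the gap.

```python
import random

def random_affine_hash(n_bits, k_bits):
    """Sample h(x) = Ax + b over GF(2), a pairwise-independent hash {0,1}^n -> {0,1}^k."""
    A = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(k_bits)]
    b = [random.randint(0, 1) for _ in range(k_bits)]
    def h(x):
        bits = [(x >> i) & 1 for i in range(n_bits)]
        return tuple((sum(a * v for a, v in zip(row, bits)) + c) % 2 for row, c in zip(A, b))
    return h

def one_round_accepts(S, n_bits, k_bits):
    """Verifier picks a random hash h and target y; the (brute force) prover
    wins the round iff some element of S hashes to y."""
    h = random_affine_hash(n_bits, k_bits)
    y = tuple(random.randint(0, 1) for _ in range(k_bits))
    return any(h(x) == y for x in S)

def acceptance_rate(S, n_bits, k_bits, trials=2000):
    return sum(one_round_accepts(S, n_bits, k_bits) for _ in range(trials)) / trials

if __name__ == "__main__":
    random.seed(0)
    n, k = 10, 6                             # universe {0,1}^10, hash range of size 2^6 = 64
    big = random.sample(range(2 ** n), 64)   # a set whose size matches the claim
    small = random.sample(range(2 ** n), 32) # a set of half that size
    # The prover succeeds noticeably more often when the set really is as large as claimed;
    # repeating the round and thresholding the success rate separates the two cases.
    print("large set:", acceptance_rate(big, n, k))
    print("small set:", acceptance_rate(small, n, k))
```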
We summarize why our procedure is good in the following two claims.
Claim 1:
Suppose $\hat{f} = f$ and the AI plays honestly. Then Hu flags at most an $\varepsilon$ fraction of inputs.
Proof: True by Lemma 3.
Claim 2:
Suppose $\hat{f}(x^\star) \neq f(x^\star)$. Then, with high probability, Hu flags $x^\star$.
Proof:
If the AI deviates from the interactive proof protocol, we will catch them. If not, then we will likely flag $x^\star$: by Lemma 4, with probability at least $1 - \varepsilon$ over the choice of $x^\star$, at most a $1/4$ fraction of the functions $g \in S$ satisfy $g(x^\star) = \hat{f}(x^\star)$ (since more than $3/4$ of them agree with $f(x^\star) \neq \hat{f}(x^\star)$). In particular the AI cannot pass the set lower bound protocol at $x^\star$, so it must either admit that the count is small, in which case we flag $x^\star$, or fail the protocol, in which case we also flag.