(we use cat to denote catastrophe)

The general hope of distillation is as follows:

You have some circuit which is expensive to run, and you want to distill it to a circuit such that is about as useful as and is still aligned (e.g., never acts catly).

this is half of the iterated amplification approach.

remark: obviously, if you have a really great idea for distilling NNs, DO NOT PUBLISH IT unless you’re pretty confident that it deferentially helps safety stuff and isn’t just shaving time off of how long we have before we’re cooked.

I’m not expecting anything here to be dangerous though. I’m mostly just trying to understand under what conditions this is theoretically possible.

try 1

edit: i think instead of constant sized TMs I should’ve just assumed you had random access to the circuit. whatever.


Worked with to Nathan Sheffield on this idea.

Suppose is a circuit with gates, and Alice has access to a distilled circuit with gates. Suppose Alice wants to convince Bob about the value of on some input . Is it possible to do this if Alice runs in time and Bob runs in time ?

Note: This sounds really audacious. But, something pretty similar is actually possible — it’s called “GKR”

(Goldwasser, Kalai, Rothblum https://dl.acm.org/doi/10.1145/2699436)

  • Suppose you have a log-space uniform circuit with depth and input size .
  • Then there is a (“doubly efficient”) interactive proof where the prover takes time
  • And the Verifier takes time , and the communication complexity is . Basically you should think of this as, if there’s a computation that can be done in time then there’s a proof of the computation that can be generated in time ~ and checked in time ~.

more formal statement, and a proof that it’s impossible.

We’d like for there to be a such that: For all which are described by a constant size Turing machine,

  • If then there exists such that for all .
  • If then for any , and any with , .

why this is impossible: We will assume

Claim Suppose I have a really smart guy Bob such that whenever , there’s a small proof that convinces Bob that for any , but if , and then there’s no argument that’ll convince Bob that .

Now suppose we have some arbitrary circuit which is described by constant size TM. In order to compute in Bob does the following:

  • He “writes down” (remember that we’re assuming the circuit is generated by a small amount of TM code) the circuit
  • Then, if it’s the case that , Bob can guess a short proof that .
  • If Bob fails to guess such a proof, then he knows that .
  • But of course .

So anyways, Bob just evaluated this circuit. But as stated above, this is believed by complexity ppl to be impossible.


We also considered a “distributional” version of this — i.e., you just need to make sure pr cat is quite small.

it seemed pretty unworkable.