(we use cat to denote catastrophe)

The general hope of distillation is as follows:

You have some circuit which is expensive to run, and you want to distill it to a circuit such that is about as useful as and is still aligned (e.g., never acts catly).

this is half of the iterated amplification approach.

remark: obviously, if you have a really great idea for distilling NNs, DO NOT PUBLISH IT unless youā€™re pretty confident that it deferentially helps safety stuff and isnā€™t just shaving time off of how long we have before weā€™re cooked.

Iā€™m not expecting anything here to be dangerous though. Iā€™m mostly just trying to understand under what conditions this is theoretically possible.

try 1

edit: i think instead of constant sized TMs I shouldā€™ve just assumed you had random access to the circuit. whatever.


Worked with to Nathan Sheffield on this idea.

Suppose is a circuit with gates, and Alice has access to a distilled circuit with gates. Suppose Alice wants to convince Bob about the value of on some input . Is it possible to do this if Alice runs in time and Bob runs in time ?

Note: This sounds really audacious. But, something pretty similar is actually possible ā€” itā€™s called ā€œGKRā€

(Goldwasser, Kalai, Rothblum https://dl.acm.org/doi/10.1145/2699436)

  • Suppose you have a log-space uniform circuit with depth and input size .
  • Then there is a (ā€œdoubly efficientā€) interactive proof where the prover takes time
  • And the Verifier takes time , and the communication complexity is . Basically you should think of this as, if thereā€™s a computation that can be done in time then thereā€™s a proof of the computation that can be generated in time ~ and checked in time ~.

more formal statement, and a proof that itā€™s impossible.

Weā€™d like for there to be a such that: For all which are described by a constant size Turing machine,

  • If then there exists such that for all .
  • If then for any , and any with , .

why this is impossible: We will assume

Claim Suppose I have a really smart guy Bob such that whenever , thereā€™s a small proof that convinces Bob that for any , but if , and then thereā€™s no argument thatā€™ll convince Bob that .

Now suppose we have some arbitrary circuit which is described by constant size TM. In order to compute in Bob does the following:

  • He ā€œwrites downā€ (remember that weā€™re assuming the circuit is generated by a small amount of TM code) the circuit
  • Then, if itā€™s the case that , Bob can guess a short proof that .
  • If Bob fails to guess such a proof, then he knows that .
  • But of course .

So anyways, Bob just evaluated this circuit. But as stated above, this is believed by complexity ppl to be impossible.


We also considered a ā€œdistributionalā€ version of this ā€” i.e., you just need to make sure pr cat is quite small.

it seemed pretty unworkable.