edit: Unfortunately, there are major errors in this document. Sorry about that.

Introduction

Paul claims that one of the main problems with deep learning, from an alignment perspective, is that we evaluate AIs via sampling. In human-designed AI, people can estimate the loss mechanistically: they just think about the system they designed and say “yeah, this has reasonable tail behavior.” In deep learning we can’t do this, which is why training a system with deep learning could leave us with an opaque system doing reasoning that we don’t like.

This has led ARC to thinking about the following question:

(⋆) Can we come up with a mechanistic method for estimating the expectation of a neural network’s output?

The word “mechanistic” intuitively means “feels pretty different from sampling”. If you have a candidate solution that you think is only borderline mechanistic, you can test it by seeing whether it’s useful for some of the downstream tasks that ARC cares about.

I think it’s pretty unclear and confusing exactly why solving (⋆) would be super great, but I’ll think about that in a different document. In this document I’ll discuss how ARC is thinking about (⋆) at the moment, and what the next steps towards achieving (⋆) would be.

ARC is interested in three settings:

  1. Competing with sampling on random problem instances.
  2. Competing with sampling on worst-case problem instances, but where we get an optimized “advice string”.
  3. Competing with sampling on trained instances, where we get to learn an advice string in parallel to the training process, using the same compute budget as the training process.

I’ll consider each of these below, starting with the worst-case setting.

(2) Worst-case instances

Here’s a problem statement:

Conjecture 1

Let $\Phi_n$ be the set of 3SAT instances with $n$ clauses. Let $\Pi$ be the set of “explanations” ($\mathrm{poly}(n)$-bit strings). There exists a mechanistic “explanation quality measuring machine” $V : \Phi_n \times \Pi \to \{0, 1\}$, where $V$ can be computed in $\mathrm{poly}(n)$ time, such that for all $\varepsilon > 0$, there is a mechanistic “expectation estimation machine” $M_\varepsilon$ such that

  • $M_\varepsilon$ runs in time $\mathrm{poly}(n) / \varepsilon^2$
  • If $V(\varphi, \pi) = 1$, then $M_\varepsilon(\varphi, \pi)$ should be within $\varepsilon$ of the fraction of assignments satisfying $\varphi$

The relevant fact about this compute / error tradeoff is that it is the same one that naive sampling gets: estimating the satisfying fraction to within $\varepsilon$ costs about $1/\varepsilon^2$ samples.
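For concreteness, here is a minimal sketch of the naive sampling baseline being competed with (the literal encoding and the example instance are made up for illustration): draw random assignments and return the fraction that satisfy the formula, which is accurate to about $1/\sqrt{T}$ after $T$ samples by a Hoeffding bound.

    import random

    def estimate_satisfying_fraction(clauses, num_vars, num_samples):
        """Naive Monte Carlo estimate of the fraction of satisfying assignments.
        A clause is a tuple of nonzero ints: literal v means variable v appears
        positively, -v means it appears negated (an assumed encoding).
        The error is roughly 1/sqrt(num_samples) with high probability."""
        hits = 0
        for _ in range(num_samples):
            # index 0 is unused so that variable i lives at position i
            assignment = [random.random() < 0.5 for _ in range(num_vars + 1)]
            satisfied = all(
                any(assignment[abs(lit)] == (lit > 0) for lit in clause)
                for clause in clauses
            )
            hits += satisfied
        return hits / num_samples

    # Example: phi = (x1 or x2 or not x3) and (not x1 or x2 or x3)
    print(estimate_satisfying_fraction([(1, 2, -3), (-1, 2, 3)], num_vars=3, num_samples=10000))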

(Note: Someone said that Conjecture 1 might give an alternate proof that $\mathsf{BPP} \subseteq \Sigma_2$. TODO: think about whether or not that’s true, and whether or not it’s concerning if true (probably not concerning, because the standard proof of $\mathsf{BPP} \subseteq \Sigma_2$ isn’t too hard?).)

Conjecture

Conjecture 1 implies that we can solve any similar-flavored problem that we care about (because #3SAT feels like a pretty hard and pretty general problem).

Victor is pretty sure that this conjecture is false. But maybe it’s at least spiritually true.

Def: $T$ is a measure-preserving reduction if

  • $T$ is computable in polynomial time
  • $T$ preserves the measure (here, the fraction of satisfying assignments)

For example: $T$ maps 4SAT instances to 3SAT instances such that $T(\varphi)$ has the same fraction of satisfying assignments as $\varphi$.

Question: Do measure-preserving reductions exist?

  • Probably exactly measure-preserving is impossible for dumb number theory reasons (see the worked example below).
  • Approximately measure-preserving in some sense would also maybe be fine.
  • I’ve thought about this a bit and suspect that it’s pretty tricky to do / maybe impossible.
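As a sanity check on why exactness is hard (assuming “measure” means the fraction of satisfying assignments, as above), here is how the textbook clause-splitting reduction fails to preserve it. Take the single 4-clause $\varphi = (a \lor b \lor c \lor d)$, whose satisfying fraction is $15/16$. The standard trick introduces a fresh variable $y$ and outputs

$$T(\varphi) = (a \lor b \lor y) \land (\lnot y \lor c \lor d).$$

Counting over the $2^5$ assignments of $(a, b, c, d, y)$: when $a \lor b$ and $c \lor d$ both hold, both values of $y$ work ($18$ assignments); when exactly one holds, exactly one value of $y$ works ($6$ assignments); when neither holds, no value of $y$ works. So the satisfying fraction of $T(\varphi)$ is $24/32 = 3/4 \neq 15/16$, and this particular reduction is not measure preserving.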

Mike thinks that I may not have given the optimization oracle enough power. I’m unsure but will think about my version for now.

(1) Random instances

This is the only one of the three settings where ARC has made progress so far. ARC has proved the following theorem:

Theorem 1

Let $\mathcal{D}_n$ be the vanilla distribution over 3SAT instances with $n$ clauses; namely, each clause contains three random (distinct) variables, and each of those variables is independently negated or not. For all $\varepsilon > 0$, there is an “expectation estimation machine” $M_\varepsilon$ such that

  • $M_\varepsilon$ runs in time $\mathrm{poly}(n) / \varepsilon^2$, and
  • on average over $\varphi \sim \mathcal{D}_n$, $M_\varepsilon(\varphi)$ is within $\varepsilon$ of the fraction of assignments satisfying $\varphi$.
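For concreteness, here is a minimal sketch of sampling from this vanilla distribution (the number of variables is a free parameter that the statement above doesn’t pin down, and the literal encoding matches the earlier snippet):

    import random

    def sample_vanilla_3sat(num_clauses, num_vars):
        """Sample a 3SAT instance from the vanilla distribution described above:
        each clause picks three distinct variables uniformly at random, and each
        chosen variable is independently negated with probability 1/2."""
        instance = []
        for _ in range(num_clauses):
            variables = random.sample(range(1, num_vars + 1), 3)
            clause = tuple(v if random.random() < 0.5 else -v for v in variables)
            instance.append(clause)
        return instance

    # Example: a random instance with 200 clauses over 50 variables.
    print(sample_vanilla_3sat(num_clauses=200, num_vars=50))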

Note: ARC’s proof of theorem 1 is imo very complicated (it uses the words “homogeneous polynomials are the irreducible representations of the hyperoctahedral group”). I’d like a much simpler proof please.

Currently, ARC is interested in investigating more complicated distributions. For instance, a distribution where some variables are more common than others, or where negated literals are more common than positive ones.
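A minimal sketch of what such a skewed distribution could look like (the particular weights and negation bias are placeholders, not a distribution ARC has committed to):

    import random

    def sample_skewed_3sat(num_clauses, num_vars, negation_prob=0.7):
        """Like the vanilla distribution, but variable i is chosen with probability
        proportional to 1/i, and each literal is negated with probability negation_prob."""
        weights = [1.0 / i for i in range(1, num_vars + 1)]
        instance = []
        for _ in range(num_clauses):
            variables = set()
            while len(variables) < 3:  # rejection-sample until three distinct variables are chosen
                variables.add(random.choices(range(1, num_vars + 1), weights=weights, k=1)[0])
            clause = tuple(-v if random.random() < negation_prob else v for v in variables)
            instance.append(clause)
        return instance

    print(sample_skewed_3sat(num_clauses=200, num_vars=50))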

(Note: ok, ARC technically hasn’t written down a proof of Theorem 1, but we’re pretty confident that we have such a proof, maybe up to log factors.)

(Note: Eric thinks that SAT might be a better problem.)

(Note: Eric thinks that you might need advice to solve this if the distribution is sufficiently weird. I’d be pretty interested in a proof, or even a vibes-based argument, as to why this would be true. Eric gave some vibes based on obfuscation, but I didn’t find them too compelling.)

(Note: Maybe if we solved random cases, we could do random hacky stuff like this better).

(3) Trained instances

  • Thinking about random search as a hill-climbing procedure seems like a good first step (a sketch of what I mean is below).
  • Eventually we’ll probably need to think about SGD too. Which is an L.
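To be concrete about “random search as a hill-climbing procedure”, here is a minimal sketch (the loss function and parameter space are toy placeholders, not anything ARC-specific):

    import random

    def random_search_hill_climb(loss, init_params, num_steps, step_size=0.1):
        """Greedy random local search: propose a random perturbation of the current
        parameters, and keep it only if the loss does not increase. This would play
        the role of the training process in the simplest version of setting (3)."""
        params = list(init_params)
        best_loss = loss(params)
        for _ in range(num_steps):
            candidate = [p + random.gauss(0, step_size) for p in params]
            candidate_loss = loss(candidate)
            if candidate_loss <= best_loss:
                params, best_loss = candidate, candidate_loss
        return params, best_loss

    # Toy example: minimize a quadratic loss over two parameters.
    print(random_search_hill_climb(lambda p: (p[0] - 3) ** 2 + p[1] ** 2, [0.0, 0.0], num_steps=2000))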

Breaking ARC

Here’s a problem that I thought would be lethal to ARC’s agenda if we couldn’t solve it:

Conjecture 1

Let $\Phi_n$ be the set of 3SAT instances with $n$ clauses. There exists a deterministic function $A$ (producing an advice string for each instance) and a deterministic function $M$ such that

  • $M$ runs in $\mathrm{poly}(n)$ time
  • If $\pi = A(\varphi)$, then $M(\varphi, \pi)$ should be within a small additive error of the fraction of assignments satisfying $\varphi$

Suppose we could prove Conjecture 1. Then we’d have the following corollary:

Corollary 2

Let $S^{\mathrm{hi}}_n$ be the set of 3SAT instances with $n$ clauses whose fraction of satisfying assignments is large, and let $S^{\mathrm{lo}}_n$ be the set of 3SAT instances with $n$ clauses whose fraction of satisfying assignments is small. There exists a deterministic function $A$ and a deterministic function $M$ such that

  • $M$ runs in $\mathrm{poly}(n)$ time
  • For $\varphi \in S^{\mathrm{hi}}_n \cup S^{\mathrm{lo}}_n$, if $\pi = A(\varphi)$, then $M(\varphi, \pi)$ correctly reports which of the two sets $\varphi$ belongs to.

Next, we’d have the following corollary:

Corollary 3

Let be the set of 3SAT instances with clauses such that , and let be the set of 3SAT instances with clauses such that . Distinguishing between and can be done in .

Next, we’d have the following corollary:

Corollary 4

.

Unfortunately, according to my friend, Corollary 4 is beyond the reach of complexity theory for now. Thus we should not expect to be able to prove Conjecture 1.