introduction

These are some rough technical notes that I took while reading a recent paper released by ARC. Full paper on arXiv.

teaser — trees

Theorem: Let $\mathcal{F}_{n,s}$ be the class of decision trees on $n$ variables of size at most $s$. Then $\mathcal{F}_{n,s}$ is efficiently defendable, in the sense of the game below.

A decision tree is a tree whose internal vertices are labelled with variables: if the variable is 0 we go left, if 1 we go right, and the leaf we end up at determines the output. We don’t allow repeat variables on a single path because that’s pointless.

Fix $n$ and $s$, and let $\varepsilon$ be small compared to $1/s^2$ (that’s what the argument below needs). Consider the following game:

  1. Attacker chooses a decision tree $f \in \mathcal{F}_{n,s}$.
  2. $x^*$ is sampled from the uniform distribution on $n$-bit strings.
  3. Attacker chooses $f^* \in \mathcal{F}_{n,s}$ such that $f$ and $f^*$ are $\varepsilon$-close (under the uniform distribution), but $f^*(x^*) \neq f(x^*)$.
  4. Defender either gets $(f, x)$ (for uniformly random $x$) or $(f^*, x^*)$. Needs to distinguish between these. (note that I really do mean “gets whitebox access to the function” here). The theorem is that the defender can efficiently win this game with a really simple strategy!

Proof For a size-$s$ decision tree $T$, let $A^T_1, \dots, A^T_s$ partition $\{0,1\}^n$ based on which leaf the strings fall into in the decision tree $T$ (each leaf is a subcube, since no variable repeats on a path). Let $A^T(x)$ denote the $A^T_i$ such that $x \in A^T_i$. Let $p^T(x)$ denote $\Pr_{y \sim U}[y \in A^T(x)]$. By a simple union bound we have, for any $\delta > 0$,

$$\Pr_{x \sim U}\big[p^T(x) \le \delta\big] \le s\delta.$$

(remark: this feels really weak to me, but I’ll have to think about if you can do better. and ofc they are just doing PoC so it doesn’t really matter) Now, note a key relation: $A^f(x^*)$ and $A^{f^*}(x^*)$ are both subcubes containing $x^*$, so their intersection is a subcube whose codimension is at most the sum of the two codimensions, hence

$$\Pr_{y \sim U}\big[y \in A^f(x^*) \cap A^{f^*}(x^*)\big] \ge p^f(x^*)\, p^{f^*}(x^*).$$

Now, notice another key relation:

$$\Pr_{y \sim U}\big[y \in A^f(x^*) \cap A^{f^*}(x^*)\big] \le \Pr_{y \sim U}\big[f(y) \neq f^*(y)\big] \le \varepsilon.$$

These inequalities hold because $f(x^*) \neq f^*(x^*)$ (so $f$ and $f^*$, each constant on that intersection, disagree on all of it), but $f, f^*$ are supposed to be $\varepsilon$-close. Combining the two relations: $p^f(x^*)\, p^{f^*}(x^*) \le \varepsilon$.

So anyways, the story is that in the backdoor case $1/p^{f^*}(x^*)$ has to be really big (at least $p^f(x^*)/\varepsilon$, and $p^f(x^*)$ is rarely below $\sqrt{\varepsilon}$ by the union bound), but in the natural case $1/p^f(x)$ is not likely to be big! So we can tell them apart: flag “backdoor” iff the leaf of the given tree containing the given input has mass at most $\sqrt{\varepsilon}$.
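To make the strategy concrete, here is a minimal sketch in Python, assuming a toy tree representation (the `Node` class and all names are mine, not the paper’s):

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    """A decision-tree node: a leaf carries a label; an internal node
    branches on input bit `var` (0 goes left, 1 goes right)."""
    label: Optional[int] = None      # set on leaves only
    var: Optional[int] = None        # set on internal nodes only
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def leaf_mass(tree: Node, x: Sequence[int]) -> float:
    """Uniform-distribution mass of the leaf that x falls into. Each leaf is
    a subcube fixing `depth` coordinates (no repeats on a path), so its mass
    is exactly 2^(-depth); no sampling needed."""
    depth, node = 0, tree
    while node.label is None:
        node = node.left if x[node.var] == 0 else node.right
        depth += 1
    return 2.0 ** (-depth)

def defender(tree: Node, x: Sequence[int], eps: float) -> str:
    """Flag 'backdoor' iff x lands in an unusually small leaf of the given
    tree. With threshold sqrt(eps): a natural input lands in a leaf of mass
    <= sqrt(eps) with probability <= s*sqrt(eps), while (with high probability
    over the random trigger) a valid backdoored pair must land in such a leaf."""
    return "backdoor" if leaf_mass(tree, x) <= math.sqrt(eps) else "natural"
```

Since each leaf is a subcube, the defender never needs to sample anything; it just reads off the depth of the leaf the input reaches.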


Blog version

To start I’m going to comment on the blog version, then I’ll actually read the full arxiv version.

Backdoor — model behaves normally most of the time, but certain inputs activate a different mode of behavior.

Deceptive alignment — agent behaves one way during train time, a different way once deployed.

These are pretty similar things. The main difference is that deceptive alignment is introduced by the training process, while backdoors are inserted by adversaries.

An interesting impossibility result by Vinod et al. — It’s possible to insert a backdoor into some specific type of model such that even with whitebox access to the code it’s computationally hard (under crypto assumptions) to find this backdoor.

How ARC gets around this impossibility: Instead of distinguishing between “has a backdoor” and “doesn’t”, we will try to figure out “is this input a defection trigger?”

Aside: VC-dim is a measure of the capacity of a model. For neural nets, think of VC-dim as roughly the number of parameters.

VC dim

Suppose you have a “hypothesis space” $\mathcal{H}$ of “classifiers” $h : X \to \{0,1\}$. The VC dim of $\mathcal{H}$ is the number of points in the largest set of points that can be labelled in every possible way by functions $h \in \mathcal{H}$.

Example 1: Suppose $\mathcal{H} = \{x \mapsto \mathbb{1}[ax + b \ge 0] : a, b \in \mathbb{R}\}$, i.e., classifiers on $\mathbb{R}$ given by the sign of a degree-1 polynomial. Then, given any two points there is a function that classifies the two points in any flavor ++, +-, -+, --. But there are point sets of size 3 for which there is no degree-1 classifier that fits them. So the VC dim of this hypothesis space is 2.

Example 2: Suppose $\mathcal{H}$ is the set of depth-$d$ binary decision trees. Meaning, they take in some input, and via a series of binary decisions travel down a tree and give the input a label based on what’s stored at the leaf. Then the VC dimension of $\mathcal{H}$ is roughly $2^d$, the number of leaves.

Def 2: actually the ARC paper has a nice definition of this. A hypothesis space $\mathcal{H}$ SHATTERS a set $S$ if every boolean function on $S$ is realized as the restriction of some $h \in \mathcal{H}$; the VC dimension is then the size of the largest shattered set.
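As a sanity check on the definition, here is a brute-force shattering test (my own toy code; exponential in the set size, so only for tiny examples):

```python
from itertools import combinations, product
from typing import Callable, Sequence

def shatters(H: Sequence[Callable], S: Sequence) -> bool:
    """True iff every 0/1 labelling of S is realized by some h in H."""
    realized = {tuple(h(x) for x in S) for h in H}
    return all(lab in realized for lab in product((0, 1), repeat=len(S)))

def certifies_vc_at_least(H: Sequence[Callable], points: Sequence, k: int) -> bool:
    """True iff some size-k subset of `points` is shattered by H
    (which certifies VC dimension >= k)."""
    return any(shatters(H, subset) for subset in combinations(points, k))
```

For instance, plugging in a small grid of the degree-1 classifiers from Example 1 and a few real points certifies VC dim ≥ 2 but (correctly) fails for k = 3.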

They define a 4 step game.

  1. Attacker chooses a distribution $D$ for points, and chooses the original function $f \in \mathcal{F}$.
  2. The backdoor trigger $x^*$ is sampled randomly from $D$.
  3. Attacker chooses $f^*$ which agrees with $f$ on a random input with probability $1 - \varepsilon$, but which has the property that $f^*(x^*) \neq f(x^*)$.
  4. Defender distinguishes between $(f, x)$ for $x \sim D$ and $(f^*, x^*)$. More specifically, the distinguisher is given whitebox access to $f$ or $f^*$, in addition to samples from $D$.

statistical possibility of dynamic detection

Case 1: $\varepsilon$ is large (compared to 1/VC dimension of the function family that f is chosen from).
Then, assuming that the number of training examples is linearly related to the number of parameters, this means that $f^*$ differs from $f$ substantially, so training will remove the backdoor.

Case 2: $\varepsilon$ is small.
Then they have some way of “taking a majority vote” to recover the original value $f(x^*)$, and then they can notice that this is not the value that $f^*$ is outputting.

Anyways, this is technically possible in their abstracted model, but they’re worried that it doesn’t solve the actual problem of deceptive alignment.

Part of the problem is that the defection trigger isn’t chosen randomly.

computational impossibility of dynamic detection

They make some crypto assumptions. Maybe iO? It might seem like this worst-case result is irrelevant. But maybe NNs are actually super obfuscated.

open qs

Consider different versions of the game.

  • require the original function to be learned from a dataset.
  • require the backdoor trigger to satisfy some expensive condition, but otherwise be arbitrarily chosen rather than randomly chosen
  • restrict the adversary’s computational power
  • require the computational trace of the attacker to be made accessible to the defender

actual paper

OK, I’ve read the blog post. Now I’m going to read the actual paper.

Def of the game

  • attacker chooses $D$ and $f \in \mathcal{F}$
  • sample backdoor trigger $x^* \sim D$
  • attacker chooses $f^* \in \mathcal{F}$:
    • $f^*$ should differ from $f$ on the backdoor trigger $x^*$
    • but in general be $\varepsilon$-close to $f$
  • Defender is given either $(f, x)$ for randomly chosen $x \sim D$ or $(f^*, x^*)$, and is supposed to tell us which one it got.

main results

  • statistical defendability iff $\varepsilon \lesssim 1/\mathrm{VC}(\mathcal{F})$ (roughly).

  • comp defendability

    • decision trees are defendable
    • but not all of P
  • In some sense requiring $x^*$ to be chosen randomly is a little bit unfortunate.

  • But, you can’t just let it be chosen arbitrarily…

background

  • The HLW (Haussler–Littlestone–Warmuth) one-inclusion graph thing is important for this.
  • I mean, maybe you just need it as a black box, but it seems like a nice result.

sec4 stat defend

prediction strategy:

  • There is a function $f$ which you don’t have access to(!)
  • But, you get the values of $f$ on some samples from $D$.
  • Your task is to guess the output of $f$ on some new point, after being given the value of $f$ on some random points.

This might seem totally hopeless. But! We’re assuming that $f$ comes from a class of functions with bounded VC dimension.

For instance, imagine that $f$ was the indicator function for a line. Then, after giving you a couple examples of things on and not on the line, you can totally figure out whether a new point is on the line or not.
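Here is a toy version of that line example (my own illustration, not from the paper): two distinct positive examples pin down the line, after which prediction is exact.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]

def fit_line_indicator(samples: List[Tuple[Point, int]]) -> Callable[[Point], int]:
    """samples are ((x, y), label) pairs with label = 1 iff the point lies on
    the unknown line; returns a predictor for new points."""
    positives = list({p for p, lab in samples if lab == 1})
    if len(positives) < 2:
        return lambda p: 0  # not enough information yet; guess "off the line"
    (x1, y1), (x2, y2) = positives[0], positives[1]
    # (x, y) lies on the line through the two positives iff the vectors
    # (x - x1, y - y1) and (x2 - x1, y2 - y1) have zero cross product.
    return lambda p: int((p[0] - x1) * (y2 - y1) == (p[1] - y1) * (x2 - x1))
```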

So why is this good for us? We’ll do the following thing (there’s a code sketch of this right after the list):

  • Sample some $x_i$’s and compute some $g(x_i)$’s, where $g$ is the function that we’re given, which is either $f$ or $f^*$.
  • Then, we’ll try to use the HLW prediction strategy to figure out $f$’s value at the suspicious input with good pr.
  • This will work great if we started with $(f, x)$.
  • It will also work great if $f^* = f$ on all of the points that we queried.
  • It won’t work if $f^* \neq f$ on any of our query points.
  • But, we’re going to assume that that’s unlikely.
  • i.e., that $f^*$ is only “imperceptibly” different from $f$. By which I mean the difference is at most $\varepsilon$.
  • So each of these strategies should work with good pr.
  • So we can take a majority vote and win.
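Here is a rough sketch of that strategy, with the HLW one-inclusion-graph predictor abstracted into a black-box `predict` callable (that is the hard, inefficient part, and it is not shown); all names are mine, not the paper’s:

```python
from typing import Callable, Sequence, Tuple

def detect_backdoor(
    g: Callable,         # the function we are handed (either f or f*)
    sample_D: Callable,  # draws a fresh point from the input distribution D
    predict: Callable,   # HLW-style predictor: (labelled samples, query) -> guessed label
    z,                   # the input we are suspicious about
    rounds: int = 101,
    m: int = 1000,
) -> str:
    """Flag 'backdoor' if, in most rounds, the label predicted for z from
    fresh labelled samples of g disagrees with what g actually outputs at z."""
    disagreements = 0
    for _ in range(rounds):
        samples: Sequence[Tuple] = [(x, g(x)) for x in (sample_D() for _ in range(m))]
        if predict(samples, z) != g(z):
            disagreements += 1
    return "backdoor" if disagreements > rounds // 2 else "natural"
```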

thm
If $\varepsilon \gtrsim 1/\mathrm{VC}(\mathcal{F})$ then you’re toast: the backdoor is statistically indistinguishable from non-backdoor.
If $\varepsilon \lesssim 1/\mathrm{VC}(\mathcal{F})$ then you’re good: the backdoor is detectable with good probability (I believe the error scales like $\varepsilon \cdot \mathrm{VC}(\mathcal{F})$).
pf of theorem

You’re screwed for large $\varepsilon$: then the adv can basically make the backdoor insertion correspond to switching between two legit functions. (remember that the adv doesn’t get to choose the backdoor location, but they do get to choose the input distribution)
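To make this concrete, here is my guess at the standard construction for the large-$\varepsilon$ case (my reconstruction, not verbatim from the paper). Take a set $S = \{z_1, \dots, z_d\}$ shattered by $\mathcal{F}$, with $d = \mathrm{VC}(\mathcal{F})$ and $\varepsilon \ge 1/d$. The attacker sets $D = \mathrm{Unif}(S)$ and picks $f$ to be a fixed representative realizing a uniformly random labelling of $S$ (possible since $S$ is shattered). After $x^* \sim D$ is revealed, the attacker picks $f^*$ realizing the labelling

$$ f^*(y) = \begin{cases} 1 - f(y) & \text{if } y = x^*,\\ f(y) & \text{otherwise,}\end{cases} \qquad \text{so } \Pr_{y \sim D}\big[f(y) \neq f^*(y)\big] = \tfrac{1}{d} \le \varepsilon. $$

Restricted to $S$ (which is all that matters, since $D$ is supported on $S$), $(f, x)$ and $(f^*, x^*)$ are then identically distributed: either way the defender sees a uniformly random labelling of $S$ together with an independent uniform point of $S$.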

You win for small $\varepsilon$: we just run the strat outlined above. It works.

Intuition for proof: if $\varepsilon$ is small relative to $1/\mathrm{VC}(\mathcal{F})$ then the back-doored function is “very weird”.

sec5 computational

Strategy in prev section is not efficient (bc the HLW algorithm is not efficient).

Claim that PAC learnable things are efficiently defendable:

Proof

  • you just do the pf from sec4, but use the fact that PAC learnable things are efficiently learnable: you replace the (inefficient) HLW thing with this nice observation

Theorem: in the random oracle model, there are some things which are efficiently defendable but not efficiently PAC learnable.

Proof Define $K$ to be the set of $n$-bit strings; we call these “keys”. Let $X$ also be the set of $n$-bit strings. This is the domain of our functions: we have functions $f_k : X \to \{0,1\}$, one per key. Define $f_k(x) = \mathcal{O}(k, x)$, where $\mathcal{O}$ is the random oracle (truncated to one bit). Claim 1: this thing is efficiently defendable. proof: These functions are super far apart; any two distinct $f_k, f_{k'}$ are at distance about $1/2$. So, there are no valid $\varepsilon$-close $f^*$ nearby, and the defender can just always answer “natural”.
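Just to illustrate the “far apart” point numerically, here is a toy sketch with sha256 standing in for the random oracle (purely a stand-in for intuition; all names are mine):

```python
import hashlib
import random

def f(key: bytes, x: bytes) -> int:
    """f_k(x) := first bit of O(k, x), with the random oracle O played by sha256."""
    return hashlib.sha256(key + x).digest()[0] & 1

def empirical_distance(k1: bytes, k2: bytes, n_bits: int = 32, samples: int = 10_000) -> float:
    """Estimate Pr_x[f_{k1}(x) != f_{k2}(x)] over uniform x; should come out near 1/2."""
    n_bytes = (n_bits + 7) // 8
    disagree = 0
    for _ in range(samples):
        x = random.getrandbits(n_bits).to_bytes(n_bytes, "big")
        disagree += f(k1, x) != f(k2, x)
    return disagree / samples

# empirical_distance(b"key-one", b"key-two") comes out around 0.5, matching the
# claim that distinct keys give far-apart functions, so no valid eps-close f*
# exists and the defender can always answer "natural".
```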

Claim 2 this thing is not efficiently PAC learnable.

Proof: it’s just random lol. Actually, I don’t really like this:

  • if you’re given whitebox access and can ref the oracle this seems false.

the story on whitebox or not

  • For defendability they always assume whitebox access.
  • When talking about learning, they always mean that you ONLY get SAMPLES.
  • In particular, for purposes of learning, you don’t get queries and definitely don’t get whitebox access to the thing.
  • So basically their “this is defendable but not PAC learnable” claims are saying “we’re doing something interesting with either the whitebox access, blackbox query access, or the fact that we don’t need to output the function, just tell if it’s weird or not.”
  • Learning a function doesn’t make any sense if you have whitebox access to the function
  • you could define learning with blackbox access, but they don’t

Conjecture Assuming OWFs, there is a polynomially evaluatable rep class that is efficiently defendable but not efficiently PAC learnable. In other words, we’re going to try to replace the random oracle assumption with the assumption that PRFs exist.

Proof Maybe assuming CRHF’s is helpful. Minimum distance PRFs — sounds interesting.

sec 5.3 Assuming iO, there are polynomial sized circuits that aren’t defendable

punctured PRFs

You start with a PRF $F_k$. Then you choose some set $S$ of inputs (for us, a single point $x^*$). Then you can make a punctured key $k\{S\}$ such that $F_{k\{S\}}(x) = F_k(x)$ for all $x \notin S$.

BUT to a probabilistic polynomial time algorithm that knows $k\{S\}$ BUT NOT $k$, for any $x \in S$, the values $F_k(x)$ still look pseudorandom.

remark: the standard (GGM) PRF construction can apparently be modified a tiny bit to become puncturable.
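Here is a toy sketch of the GGM-style puncturing trick, with sha256 standing in for the length-doubling PRG (so this only illustrates the mechanics, it is not a secure construction; all names are mine):

```python
import hashlib

def G(seed: bytes, bit: int) -> bytes:
    """One half of a length-doubling 'PRG': G(seed) = G(seed, 0) || G(seed, 1)."""
    return hashlib.sha256(seed + bytes([bit])).digest()

def prf(key: bytes, x: str) -> bytes:
    """GGM evaluation: walk down the binary tree along the bits of x."""
    s = key
    for b in x:
        s = G(s, int(b))
    return s

def puncture(key: bytes, x_star: str) -> dict:
    """Punctured key: the seeds of all siblings along the path to x_star,
    indexed by the prefix they sit under."""
    pk, s = {}, key
    for i, b in enumerate(x_star):
        sibling = x_star[:i] + str(1 - int(b))
        pk[sibling] = G(s, 1 - int(b))
        s = G(s, int(b))
    return pk

def prf_punctured(pk: dict, x: str) -> bytes:
    """Evaluate at any x (same length as x_star) except the punctured point."""
    for prefix, seed in pk.items():
        if x.startswith(prefix):
            s = seed
            for b in x[len(prefix):]:
                s = G(s, int(b))
            return s
    raise ValueError("cannot evaluate at the punctured point")

# Sanity check: agreement everywhere except the punctured point.
# prf_punctured(puncture(b"k", "0110"), "0111") == prf(b"k", "0111")   -> True
# prf_punctured(puncture(b"k", "0110"), "0110")                        -> raises
```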

Proof that there are functions in P which aren't defendable

Let $F_k$ be a puncturable PRF.

Then the argument is (roughly, as I understand it) as follows. Think of $f$ as $\mathrm{iO}(C_k)$ for a circuit $C_k$ computing $F_k$, and of $f^*$ as $\mathrm{iO}$ of a circuit that uses the punctured key $k\{x^*\}$ and hardcodes a flipped output at $x^*$:

  1. $(f, x)$ and $(f, x^*)$ are indistinguishable (in fact identically distributed, since $f$ doesn’t depend on $x^*$).
  2. $C_k$ and the circuit that uses the punctured key $k\{x^*\}$ plus the hardcoded true value $F_k(x^*)$ compute the same function, so by iO their obfuscations are indistinguishable.
  3. Hardcoding the true value is indistinguishable from hardcoding the flipped value (i.e., from $f^*$), or else this violates puncturability.

sec 6 defending a tree

This seems like a nice theoretical result that also gives a lot of intuition for why mechanistic anomaly detection should be possible. Anyways I’m excited to read about this.

Actually I’m going to put this at the top of the post, because I think it’s that cool.

sec 7 implications and future work

Nevertheless, we still do not have an example of a natural representation class that is efficiently defendable but not efficiently PAC learnable. We consider finding such an example to be a central challenge for follow-up research. One possible candidate of independent interest is the representation class of shallow (i.e., logarithmic-depth) polynomial size Boolean circuits, or equivalently, polynomial size Boolean formulas. Question 7.1. Under reasonable computational hardness assumptions, is the representation class of polynomial size, logarithmic-depth Boolean circuits over {0, 1}n efficiently defendable? This representation class is known to not be efficiently PAC learnable under the assumption of

I think they’re basically just saying they want some “more natural” examples of things which are easy to defend but hard to learn.

Another question they pose is:

What if instead of requiring the backdoor to be randomly inserted you had a computationally bounded adversary that got to choose where to put the backdoor?

todo: add fancy new results that I prove here :)

I’m hoping to ultimately build up to some interesting classes of circuits (e.g., log-depth circuits). But, as I like to say “you should start somewhere tractable”. So we’re going to start with some silly classes of circuits. In what follows, the input space will always be $\{0,1\}^n$ and we consider the uniform distribution on this set.

Proposition Let $\mathcal{F}$ be the class of depth-1 circuits that just use a single AND gate, and they can negate inputs if they want (so: ANDs of literals). This function class is $\varepsilon$-defendable with confidence $1 - O(\varepsilon)$ very efficiently.

Proof

claim 1: If $f$ is too short (an AND of fewer than $\log_2(1/\varepsilon) - 1$ literals), then there are no other functions in the class that are $\varepsilon$-close to $f$.
pf: clear. (Any two distinct ANDs of literals disagree on at least a $2^{-(k+1)}$ fraction of inputs, where $k$ is the smaller of the two widths.)

So this means that the adversary better choose a formula $f$ that takes into account quite a few literals (and the same goes for $f^*$). But then $f(x^*) = 1$ is quite unlikely for the random trigger $x^*$. So, probably we’ll have $f(x^*) = 0$ and hence $f^*(x^*) = 1$, whereas on a natural input the given function almost never outputs 1. And then “flag backdoor iff the given function outputs 1 on the given input” will be our 1-query algorithm.
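Here is what I think that 1-query defender looks like in code (my own sketch, with an AND circuit represented as a list of (variable index, required bit) literals):

```python
import math
from typing import List, Sequence, Tuple

Literal = Tuple[int, int]  # (variable index, required bit)

def and_eval(literals: List[Literal], x: Sequence[int]) -> int:
    """Evaluate an AND of literals on the input x."""
    return int(all(x[i] == b for i, b in literals))

def defender(literals: List[Literal], x: Sequence[int], eps: float) -> str:
    """Short AND: no eps-close distinct AND exists (claim 1), so no attack is
    possible and we answer 'natural'. Long AND: a natural input makes it fire
    with probability 2^-k <= ~eps, while (with high probability over the random
    trigger) a valid backdoored pair must fire, so firing is the red flag."""
    k = len(literals)
    if k < math.log2(1 / eps) - 1:
        return "natural"
    return "backdoor" if and_eval(literals, x) == 1 else "natural"
```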

Q

Does the one-bit case extend naturally to a multiple bit case? I’m willing to believe this, but would love to see it spelled out.

N q

Statistically solve the uniform case.

A general takeaway from this example: Adversary shouldn’t make $\{f = 1\}$ too sparse and also shouldn’t make $\{f = 0\}$ too sparse. Like I think we can maybe basically assume $\min(\Pr[f = 1], \Pr[f = 0]) \gg \varepsilon$.

The intuition that drives all of this is “morally the only way to insert a backdoor is to put a little if statement that says: if the input is the backdoor trigger, behave weird”, which is totally detectable with whitebox access to the function, provided you are handed the trigger.
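In code, the “little if statement” picture is literally this (a toy illustration of mine, with a made-up trigger):

```python
TRIGGER = (1, 0, 1, 1)  # hypothetical backdoor trigger x*

def f(x):
    """Some ordinary function (here: AND of the first two bits)."""
    return x[0] & x[1]

def f_star(x):
    """Backdoored version: behaves like f everywhere except on the trigger."""
    if tuple(x) == TRIGGER:
        return 1 - f(x)  # behave weird on the trigger
    return f(x)

# With whitebox access to this source and the trigger in hand, the backdoor
# is sitting right there in the branch condition.
```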

Next up, constant depth circuits (i.e., AC0)

Some thoughts on solving low depth circuits

  • Feels like maybe there’s some kind of regularization technique that we can use to discover a backdoor?
  • Maybe the network should be robust to dropout?
  • Wait no, that doesn’t make sense. These networks are supposed to not be learnable — that’s one of the main points!