A technical note on the Neekon, Vinod, et al. paper. See also: backdoors and deceptive alignment, backdoors take 2.
Prior result of Vinod:
- you can plant a backdoor such that the backdoored model is computationally indistinguishable from an unbackdoored one
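my gloss on "computationally indistinguishable" (the standard crypto notion; the notation below is mine, not the paper's):

```latex
% \tilde{f} = backdoored model, f = clean model, A = any poly-time
% distinguisher with oracle access. Assumed formalization, not from the paper.
\[
\left| \Pr\big[A^{\tilde{f}}(1^n) = 1\big] - \Pr\big[A^{f}(1^n) = 1\big] \right|
\;\le\; \mathrm{negl}(n)
\quad \text{for all PPT } A.
\]
```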
this paper:
- can you defend even if the backdoor is undetectable?
what's going to make this possible:
- assuming something like "goodness is simple"
- They consider G = "Fourier-heavy functions"
  - global mitigation
- They also consider polynomial functions (of bounded degree)
  - Here they give a "local mitigation strategy"
  - i.e., remove backdoors on specific inputs
nb: their constructions are black-box. openq: can we utilize white-box access somehow?
motivation:
- random self-reduction and amplification
Results
- global mitigation for Fourier-heavy functions
- local mitigation for polynomials
technical overview
loss (v0): measured against the ground truth f* over the input distribution D
backdoor: an adversary is allowed to choose any function that still has low loss, so it can place its errors at inputs of its choosing
plan: "canonicalization"
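a guess at the formal setup (the symbols f, f*, D and the choice of squared loss are my assumptions):

```latex
% loss of hypothesis f against ground truth f^* over input distribution D
\[
\mathrm{loss}_D(f) \;=\; \mathbb{E}_{x \sim D}\big[(f(x) - f^*(x))^2\big].
\]
% backdoor: the adversary may pick ANY \tilde{f} with
% \mathrm{loss}_D(\tilde{f}) \le \varepsilon, so it can spend its whole
% error budget on adversarially chosen (rare) inputs.
```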
what we get:
- samples of points with true labels
- black-box access to the potentially backdoored function
- goal: produce a new function, also with low loss, but such that the errors aren't at adversarial places; basically we want the errors to look like they would if we had sampled from a "clean distribution"
- remark: one easy way to do this is to re-learn the function from scratch via the samples
  - if you have a trusted learning algorithm. sigh.
- the goal is to be cheaper than that (naive baseline sketched below)
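to pin down the interface, a minimal Python sketch of the naive relearn-from-scratch baseline (all names hypothetical; this is not the paper's algorithm):

```python
import numpy as np

def naive_mitigator(query_f, samples, trusted_learner):
    """The trivial baseline mitigator: ignore the (possibly backdoored)
    black-box query_f entirely and re-learn from the trusted labeled
    samples. Sound but expensive -- the paper wants to beat this."""
    xs, ys = samples
    return trusted_learner(xs, ys)

def least_squares_learner(xs, ys):
    """An assumed 'trusted learning algorithm' for the affine case."""
    w, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return lambda x: x @ w
```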
strong notion of a mitigator:
- the output distribution should have small TV distance from what you'd get by sampling from the clean distribution
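(for reference, the standard definition of TV distance:)

```latex
\[
d_{\mathrm{TV}}(P, Q) \;=\; \sup_{A}\, |P(A) - Q(A)| \;=\; \tfrac{1}{2}\,\|P - Q\|_1 .
\]
```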
they have this nice function
Def:
Frr
Theorem:
- Let f* be ε-close (in L2 distance) to some τ-heavy function.
- Suppose f is ε'-close to f*.
- Then there is a global mitigator
  - that uses black-box access to f
  - and samples from the clean distribution
  - to output an h
  - such that h is still pretty close to f*
  - but it's "NICE"
    - "NICE" means: TV distance from a function sampled from the "ideal canonical dist" is negligible
- but what does that mean? is it good?
- ok crumb, the TV distance thing is pretty darn strong!
- Let C be the canonical distribution
- Let M be the distribution output by running the mitigator on f
- The goal is that TV(C, M) is negligible.
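in symbols (C and M as named just above):

```latex
% C = ideal canonical distribution over functions
% M = output distribution of the mitigator run on f
\[
d_{\mathrm{TV}}\big(C,\, M\big) \;\le\; \mathrm{negl}(n).
\]
% i.e. the mitigated model looks statistically as if freshly sampled clean.
```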
Proof: Let …
Overview of linear local mitigation:
- We have a bounded convex set K
- there is some affine function f*
- such that the given f is δ-close to f* in a sense called "cutoff loss" (defined below)
- then we can learn a function h which
  - still gets low loss
  - and s.t. for any particular point x, the probability that h(x) is more than ~δ far from f*(x) is negligible
We could just do linear regression if we took enough fresh labeled samples, but that's re-learning from scratch again.
Apparently there are quite a few technical challenges to get their algorithm to work.
there are some reasons why their guarantees are weak.
to get stronger results, they assume that the ground truth is affine
- this lets them require fewer queries
- and get random-looking errors within the error bars (a toy self-correction sketch follows)
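here's my toy illustration of the "random self-reduction and amplification" idea for affine ground truth; this is the classic BLR-style self-correction, not necessarily the paper's actual local mitigator:

```python
import numpy as np

def self_correct_affine(query_f, x, sample_r, n_trials=101):
    """For affine f*, f*(x) = (f*(x + r) + f*(x - r)) / 2 for every offset r.
    So: query the (possibly backdoored) f at two randomized points and
    average; a trial only fails if the backdoor corrupts one of the two
    random query points. Taking the median over many trials amplifies the
    probability that the output is close to f*(x).
    (Caveat: x+r and x-r are correlated queries; the real analysis has to
    work harder than this cartoon suggests.)"""
    estimates = []
    for _ in range(n_trials):
        r = sample_r()  # random offset, so x +- r lands at "typical" inputs
        estimates.append((query_f(x - r) + query_f(x + r)) / 2.0)
    return float(np.median(estimates))
```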
they generalize to polynomials
remark: this is cheaper than learning from scratch, e.g., for degree-d polynomials (see the parameter count below)
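sanity check on the baseline cost (a standard dimension count, my addition):

```latex
% an n-variate polynomial of degree d has this many monomial coefficients,
% so generic relearning needs on the order of this many samples:
\[
\binom{n + d}{d} \;\approx\; n^{d} \quad \text{for } d \ll n .
\]
```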
δ cutoff loss:
- Pr(outputs differ by more than δ)
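my reconstruction in symbols:

```latex
% \delta-cutoff loss of f against f^* over D (reconstructed from the notes)
\[
\ell_{\delta}(f, f^*) \;=\; \Pr_{x \sim D}\big[\, |f(x) - f^*(x)| > \delta \,\big].
\]
```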
I'd like to read about, at least at a high level, how they did linear mitigation
no time rn tho
robust mean estimation
Suppose you have some mixture distribution: mostly clean samples plus a small adversarial fraction ε,
where we think of ε as small.
There's an algorithm called "median of means". It doesn't perform well here. Maybe median of means is really good (converges fast) in settings where you have high variance / some weirdly large tail.
anyways, doesn't work here.
they do mean of medians (both estimators sketched below).
they get some decent thing, although the convergence is slower than what you'd get in a non-adversarial setting.
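to make the two estimators concrete, a quick Python sketch (my implementations; the paper's versions surely differ in details):

```python
import numpy as np

def median_of_means(xs, n_groups=10):
    """Classic heavy-tail-robust estimator: mean within each group,
    median across groups."""
    groups = np.array_split(np.asarray(xs, dtype=float), n_groups)
    return float(np.median([g.mean() for g in groups]))

def mean_of_medians(xs, n_groups=10):
    """The flipped variant: median within each group (which suppresses a
    small adversarial fraction inside each group), then average across
    groups."""
    groups = np.array_split(np.asarray(xs, dtype=float), n_groups)
    return float(np.mean([np.median(g) for g in groups]))
```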
openqs
- are there other classes G besides Fourier-heavy functions and linear/polynomial functions that are "smoothable"?
- can you do better with white-box access to the model?
philosophy
How, if at all, is this relevant to AI alignment?
- this is vaguely similar to a control strategy proposed by Buck
  - namely, replacing suspicious model outputs by resampled outputs from the model
  - or from a weaker model (cartoon sketch below)
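a cartoon of that control loop (hypothetical names; not from Buck's writeup or the paper):

```python
def controlled_output(model, weak_model, monitor, prompt, threshold=0.9):
    """Generate normally, but if a monitor flags the output as suspicious,
    discard it and substitute output from a weaker, presumed-trusted model
    (or a fresh resample of the main model)."""
    out = model.generate(prompt)
    if monitor.suspicion(prompt, out) > threshold:
        out = weak_model.generate(prompt)
    return out
```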
- you could hope that the relevant class G of "good" functions is really simple.
- remark: an alignment strategy which requires doing an extra forward pass is probably an unacceptably high alignment tax. rip.
  - wait, actually this is not obvious, bc progress is exponential
  - so if you are at the head of the race, maybe your models are 10x faster than the other guys'
  - so you can afford to lose a factor of 2
  - maybe
hmm.
I’d actually rather deal with things that discuss neural networks.
Q: are neural networks really the right level of abstraction though?
Like, is saying "the model lies in some simple function class" even meaningful for NNs?
If NNs are "universal" function approximators, then this feels strange.
ok but what does this say about deceptive alignment?
I'm consulting Solving AI Safety from 1st principles.