A technical note on the Neekon, Vinod, et al. paper. See also: backdoors and deceptive alignment, backdoors take 2.

Prior result of Vinod:

  • you can plant a backdoor such that
  • the backdoored model is computationally indistinguishable from an unbackdoored one

this paper:

  • can you defend even if the backdoor is undetectable?

what’s going to make this possible:

  • assuming something like “goodness is simple”
  • They consider G = “Fourier-heavy functions”
    • global mitigation
  • They also consider polynomial functions (on a bounded convex domain)
    • Here they give a “local mitigation strategy”
    • i.e., remove backdoors on specific inputs

nb: their constructions are black-box. openq: can we utilize white-box access somehow?

motivation:

  • random self-reduction and amplification

Results

  • global mitigation for Fourier-heavy functions
  • local mitigation for polynomials

technical overview

loss, v0: presumably the L2 loss; they care about some other loss (the cutoff loss) later.

backdoor: the adversary is allowed to choose any low-loss function f it wants, and the adversary’s goal is to get f to fail on some specific input(s).

plan: “canonicalization”

what we get:

  • samples of points with true labels

  • black box access to potentially backdoored function

  • goal is to produce a new function, also with low loss, but such that the errors aren’t at adversarial places. basically we want the errors to be similar to what we’d get if we sampled from a “clean distribution” (see the interface sketch after this list)

  • remark: one easy way to do this is to re-learn the function from scratch via the samples

    • if you have a trusted learning algorithm. sigh.
  • the goal is to be cheaper than that.
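
Here’s my own paraphrase of that setup as a type signature (just a sketch; all names are mine, not the paper’s):

```python
from typing import Callable, Sequence, Tuple

import numpy as np

# a labeled example (x, y) with the true label y
Example = Tuple[np.ndarray, float]

def mitigate(
    f: Callable[[np.ndarray], float],   # black-box access to the possibly backdoored function
    samples: Sequence[Example],         # labeled samples from the clean distribution
) -> Callable[[np.ndarray], float]:
    """Return a function g that still has low loss, but whose errors are not
    concentrated on adversarially chosen inputs; they should look like the
    errors of a "clean" learning process. Ideally this uses fewer resources
    than re-learning from scratch."""
    raise NotImplementedError  # the point here is just the interface
```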

strong notion of a mitigator:

  • the output should have small TV distance from what you’d get by sampling from a clean process (TV distance defined below)
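
For reference, total variation distance between distributions P and Q (standard definition, my symbols):

```latex
% total variation distance; the second equality is for discrete outcome spaces
d_{TV}(P, Q) \;=\; \sup_{A \subseteq \Omega} \bigl|P(A) - Q(A)\bigr|
           \;=\; \tfrac{1}{2} \sum_{\omega \in \Omega} \bigl|P(\omega) - Q(\omega)\bigr|
```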

they have a nice operation which captures “the part of the output that you care about”


Def: Fourier τ-heavy: all Fourier coefficients are either 0 or at least τ in magnitude.
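
For concreteness, here’s the standard Boolean Fourier setup I’m assuming they use (my reconstruction, my symbols):

```latex
% Fourier coefficients of f : \{-1,1\}^n \to \mathbb{R}, with characters \chi_S
\hat{f}(S) = \mathbb{E}_{x \sim \{-1,1\}^n}\bigl[f(x)\,\chi_S(x)\bigr],
\qquad \chi_S(x) = \prod_{i \in S} x_i,
\qquad
f \text{ is } \tau\text{-heavy} \iff \forall S:\ \hat{f}(S) = 0 \ \text{ or } \ |\hat{f}(S)| \ge \tau.
```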

Theorem:

  • Let the ground truth f* be ε-close (in L2 distance) to some τ-heavy function h.
  • Suppose the (possibly backdoored) f is ε'-close to f*.
  • Then there is a global mitigator
    • that uses black-box access to f
    • and labeled samples from the clean distribution
    • to output a function g
    • such that g is still pretty close to f*
    • but it’s “NICE”.
      • “NICE” means the TV distance from a function sampled from the “ideal canonical distribution” is negligible
      • but what does that mean? is it good?
      • ok crumb, the TV distance thing is pretty darn strong!
      • Let D_canon be the canonical distribution
      • Let D_mit be the distribution of outputs from running the mitigator on f
      • The goal is that TV(D_canon, D_mit) is negligible.

Proof: Let h be the τ-heavy function that the ground truth is close to. If f is close to h, then it must have large Fourier coefficients in the same places as h. We use the Goldreich-Levin theorem to recover the locations of those heavy Fourier coefficients. Then we estimate the values of these coefficients. Output a function based on the values of these heavy Fourier coefficients.
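
A minimal sketch of the last two steps (estimate the heavy coefficients with black-box queries, then output the function they determine). This is not the paper’s algorithm: I assume the set of heavy coefficients is already known (the real construction uses Goldreich-Levin to find them), I query f at uniformly random points rather than following the paper’s sample/query model, and all names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def chi(S, x):
    # Fourier character chi_S(x) = prod_{i in S} x_i, for rows x in {-1, +1}^n
    return x[:, list(S)].prod(axis=1) if S else np.ones(len(x))

def estimate_coeff(f, S, n, num_queries=20_000):
    # hat{f}(S) = E_x[f(x) * chi_S(x)], estimated via uniformly random black-box queries to f
    x = rng.choice([-1, 1], size=(num_queries, n))
    return np.mean(f(x) * chi(S, x))

def canonicalize(f, heavy_sets, n):
    # estimate each heavy coefficient, then output the (rounded) function those values determine;
    # a backdoor that changes f on a tiny fraction of inputs barely moves the estimates
    coeffs = {S: estimate_coeff(f, S, n) for S in heavy_sets}
    return lambda x: np.sign(sum(c * chi(S, x) for S, c in coeffs.items()))

# toy usage: f is the parity x_0 * x_1, except "backdoored" (flipped) on one trigger input
n = 8
trigger = np.ones(n)

def f(x):
    clean = x[:, 0] * x[:, 1]
    return np.where((x == trigger).all(axis=1), -clean, clean)

g = canonicalize(f, heavy_sets=[(0, 1)], n=n)   # g computes the clean parity, trigger input included
```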


Overview of linear local mitigation —

  • We have a bounded convex set K
  • there is some affine function f*
  • such that the (possibly backdoored) f is ε-close to f* in some sense called the “cutoff loss”
  • then there is a local mitigator
  • such that we can learn a function g which
  • still gets low loss on K
  • and s.t. for any particular point x, the probability of g being more than δ far from f* on x is negligible.

We could do linear regression if we took samples from the clean distribution. Instead they’re going to make black-box queries to f. I think of the security parameter here as a standard cryptographic security parameter, so this is pretty good I guess.
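
To build intuition for how black-box queries to f can give a trustworthy value of f*(x) even at an adversarially chosen x, here is a sketch of the classic random self-reduction plus amplification trick for affine functions. This is my gloss on the “random self-reduction and amplification” idea above, not the paper’s algorithm: it ignores the bounded-convex-set constraint and the cutoff-loss bookkeeping, and all names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def query_affine_robustly(f, x, sample_domain, num_trials=50):
    """Estimate f*(x), where f agrees with some unknown affine f* except on a
    small set of bad inputs.

    Uses the identity f*(x) = f*(y) + f*(z) - f*(y + z - x), which holds exactly
    when f* is affine. Each query point is random (the third has a shifted
    distribution, one of the technicalities glossed over here), so with high
    probability no query hits a bad input; taking a median over independent
    trials amplifies the success probability.
    """
    estimates = []
    for _ in range(num_trials):
        y, z = sample_domain(), sample_domain()
        estimates.append(f(y) + f(z) - f(y + z - x))
    return float(np.median(estimates))

# toy usage: f* is affine; f equals f* everywhere except one planted trigger point x_bad
a, b = np.array([2.0, -1.0, 0.5]), 3.0
x_bad = np.array([0.1, 0.2, 0.3])

def f(x):
    if np.allclose(x, x_bad):
        return 1e6                       # backdoored output at the trigger
    return float(a @ x + b)

est = query_affine_robustly(f, x_bad, sample_domain=lambda: rng.normal(size=3))
# est equals a @ x_bad + b = 3.15 (up to float error), even though we only ever queried f
```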

Apparently there are quite a few technical challenges to get their algorithm to work.


there are some reasons why their guarantees are weak.

to get stronger results, they assume that the ground truth is affine plus random noise rather than affine plus adversarial noise.

  • this lets them require fewer queries
  • and get random errors within error bars

they generalize to polynomials


remark: this is cheaper than learning from scratch, e.g. for degree-d polynomials.

delta cutoff loss:

  • Pr(outputs differ by more than δ) (spelled out below)
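
Spelled out, I think the definition is something like (my symbols):

```latex
% \delta-cutoff loss between f and the ground truth f^*, over inputs from the data distribution
\mathrm{loss}_\delta(f, f^*) \;=\; \Pr_{x \sim \mathcal{D}}\bigl[\, |f(x) - f^*(x)| > \delta \,\bigr]
```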

todo

I’d like to read at least a high-level account of how they did the linear mitigation

no time rn tho


robust mean estimation

Suppose you have some mixture distribution, something like (1 − ε)·D_good + ε·D_bad.

Where we think of D_good as the nice thing that we care about, and D_bad as adversarial noise, say a point mass at a really large value.

There’s an algorithm called “median of means”. It doesn’t perform well here. Maybe med of means is really good (converges fast) in settings where you have high variance / some weirdly large tail.

anyways, doesn’t work here.

they do mean of medians.

they get some decent thing, although the convergence is slower than what you’d get in a non-adversarial setting.
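
A quick simulation to see why median-of-means fails here while mean-of-medians is fine. This is just for intuition; it is not the paper’s estimator or its analysis, and the contamination model and parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def median_of_means(xs, num_blocks):
    # mean within each block, then the median across blocks
    return np.median([np.mean(b) for b in np.array_split(xs, num_blocks)])

def mean_of_medians(xs, num_blocks):
    # median within each block, then the mean across blocks
    return np.mean([np.median(b) for b in np.array_split(xs, num_blocks)])

# contaminated sample: mostly N(1, 1), but a small fraction of the points are huge
n, eps = 10_000, 0.05
xs = rng.normal(loc=1.0, scale=1.0, size=n)
xs[rng.random(n) < eps] = 1e6

print("plain mean      ", xs.mean())                 # ruined by the huge points
print("median of means ", median_of_means(xs, 100))  # every block of 100 catches some huge points,
                                                     # so every block mean is ruined too
print("mean of medians ", mean_of_medians(xs, 100))  # block medians ignore the few huge points;
                                                     # close to 1, with a small contamination bias
```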


openqs

  • are there other function classes besides Fourier-heavy functions and linear/polynomial functions that are “smoothable”?
  • can you do better with white-box access to f?

philosophy

How, if at all, is this relevant to AI alignment?

  • this is vaguely similar to a control strategy proposed by Buck

    • namely, replacing suspicious model outputs by resampled outputs from the model
    • or from a weaker model
  • you could hope that the relevant class of “good” functions is really simple.

  • remark — an alignment strategy which requires doing an extra forward pass is probably an unacceptably high alignment tax. rip.

    • wait actually this is not obvious. bc progress is exponential
    • so if you are at the head of the race maybe your models are 10x faster than other guys
    • you can afford to lose a factor of 2
    • maybe

hmm.

I’d actually rather deal with things that discuss neural networks.

Q: are neural networks really the right level of abstraction though?

Like, is saying “the model is a neural network” actually a real constraint?

If NN’s are “universal” function approximators, then this feels strange.

ok but what does this say about deceptive alignment?

I’m consulting Solving AI Safety from 1st principles