Some notes on a very cool-looking paper! Causal Scrubbing (CS)
A hypothesis about how a behavior is implemented is:
- A graph I, which is the “interpretation” of how the actual NN (with computational graph G) works.
- A function c mapping nodes of I to nodes of G.

Then we’re going to somehow define a “scrubbed version” of G.
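To make the data structure concrete, here is a minimal sketch of what a hypothesis could look like in code; all node names are made up for illustration and nothing here is from the paper’s codebase:

```python
from dataclasses import dataclass

# Toy representation of a CS hypothesis (my own sketch, not the paper's code).
# Each graph maps a node name to the list of nodes feeding into it.
Graph = dict[str, list[str]]

@dataclass
class Hypothesis:
    interpretation: Graph            # the interpretation graph I
    correspondence: dict[str, str]   # the map c from nodes of I to nodes of G

# Hypothetical example: the interpretation claims the output only depends on
# one attention head, which in turn only reads the first token's embedding.
hyp = Hypothesis(
    interpretation={"out": ["head"], "head": ["tok0"], "tok0": []},
    correspondence={"out": "logits", "head": "attn_1_2", "tok0": "embed_0"},
)
```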
The CS paper gives a simple example to illustrate how CS works.
The interpretation says that only
Algorithm (roughly, as I understand it): treeify G so that each path through the network gets its own copy of the input. Then recursively resample: for each node the interpretation explains, draw a random other input that agrees with the current input on what that node is claimed to compute, and recurse into its parents; paths the interpretation doesn’t mention get inputs drawn fully at random. Finally, check that the behavior’s metric (e.g., expected loss) is preserved under the scrubbed activations.
Observe: This algorithm is hopelessly inefficient in general as written. However, it might be amenable to dynamic programming, which could make it efficient. Or maybe there’s some “early stopping” --- i.e., you don’t fully scrub it but just unroll a couple levels of recursion.
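Here is a tiny self-contained sketch of one level of this recursion, on a made-up two-path model; the model, node names, and metric are all my own assumptions, not anything from the paper:

```python
import random

random.seed(0)

# Toy model G on input (a, b): out = 2*a + b; the label is also 2*a + b.
# A (false) hypothesis claims only `a` matters, via a single interpretation
# node "feat" whose claimed value is a.
dataset = [(a, b) for a in range(5) for b in range(5)]

def label(x):
    return 2 * x[0] + x[1]

def feat_value(x):
    return x[0]  # what the interpretation claims "feat" computes

def run_G(x_a, x_b):
    # Treeified model: the a-path and the b-path each get their own input.
    return 2 * x_a[0] + x_b[1]

def scrubbed_out(x):
    # Path explained by "feat": resample an input agreeing with x on feat.
    x_feat = random.choice([x2 for x2 in dataset if feat_value(x2) == feat_value(x)])
    # Path the hypothesis doesn't mention: a fully random input.
    x_rand = random.choice(dataset)
    return run_G(x_feat, x_rand)

orig_loss = sum(abs(run_G(x, x) - label(x)) for x in dataset) / len(dataset)
scrub_loss = sum(abs(scrubbed_out(x) - label(x)) for x in dataset) / len(dataset)
print(orig_loss, scrub_loss)  # scrubbing degrades the metric -> hypothesis fails
```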
Also, apparently rewriting the circuit (into an extensionally equal form, so that there are separate paths to scrub) is kind of problematic. Maybe it’s because there are lots of ways you could do this, and the result of the scrub depends on which rewrite you pick?
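For instance, even a single linear layer admits many extensionally equal rewrites, each exposing a different path structure; a minimal sketch (my own example, not from the paper):

```python
import numpy as np

# Two extensionally equal rewrites of the same linear layer expose different
# paths that a hypothesis could ask us to scrub (hypothetical illustration).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x = rng.normal(size=4)

# Rewrite 1: a single path.
out1 = W @ x

# Rewrite 2: split W into two parallel paths, W = W_pos + W_neg.
W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
out2 = W_pos @ x + W_neg @ x  # identical output, but now there are two
                              # paths whose inputs could be resampled
                              # independently under a hypothesis.

assert np.allclose(out1, out2)
```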
Challenges
0: Maybe there is some “irreducibility” --- the thoughts of a powerful AI can’t be simplified / it’s hard to find human understandable explanations for them.
1: False hypotheses which make correct predictions
Neglecting both helpful and anti-helpful mechanisms? I.e., if the hypothesis omits one mechanism that helps the behavior and another that hurts it by the same amount, the two effects cancel, and the scrubbed metric can look fine even though the hypothesis is wrong. See the sketch below.
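A toy numerical sketch of this cancellation (my own construction, not the paper’s example):

```python
import random
import statistics

random.seed(0)

# The output is s + h - h: feature h helps through one path and hurts through
# another, canceling exactly. A hypothesis that omits h scrubs both h-paths
# with independent random draws; the *expected* output is unchanged, so a
# mean-based metric can't tell the hypothesis is false.
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

orig = [s + h - h for s, h in data]
scrub = [s + random.gauss(0, 1) - random.gauss(0, 1) for s, _ in data]

print(statistics.mean(orig), statistics.mean(scrub))            # ~equal means
print(statistics.pvariance(orig), statistics.pvariance(scrub))  # variances differ a lot
```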
2: Underestimating interference by neglecting correlations
Obs: if you have an error term, and you split it into two pieces and forget that the pieces are (positively) correlated, then your estimate of the variance of the total error comes out too small. I guess this is bad.
In general, if you pick up some slack somewhere like this, then it’s possible to “spend” it later, making the situation worse somehow. A quick numerical check is below.
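A quick numerical check of this point (my own sketch), assuming the two error terms are positively correlated:

```python
import numpy as np

# If two error terms share a common component, treating them as independent
# underestimates the variance of their sum.
rng = np.random.default_rng(0)
n = 100_000
e = rng.normal(size=n)                # shared component
e1 = e + 0.3 * rng.normal(size=n)     # two positively correlated errors
e2 = e + 0.3 * rng.normal(size=n)

print(np.var(e1 + e2))           # true variance of the combined error (~4.2)
print(np.var(e1) + np.var(e2))   # "independent" estimate (~2.2) -- too small
```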
3: Sneaking in knowledge
I didn’t get their example, but I feel like the vibes are that you could have an explanation that is very good for dumb reasons. Like you could just go compute the answer yourself or something?
See MAD Agenda for further discussion.