For more thoughts on alignment problems, see the MAD Agenda.
ARC also thinks that explanations could help with low probability estimation (LPE). They give two ideas:
- Maybe […] gives a good estimate of the probability of catastrophe.
- Maybe […] gives a good estimate of the probability of catastrophe, while being much cheaper to compute than […].
LPE is also a pretty exciting problem because I can imagine training data, and I can tell what it'd mean to succeed.
Idea 1
"Activation modelling"
Idea 2
Efficiently learn some dependency structure in the NN. How to measure this??
Idea 3
"Analytical VAEs"
problem LPE - KL div issue in modified SAE thing
However, if catastrophic inputs are rare, the SAE's training data is unlikely to contain any catastrophic inputs. As such, we might not learn features that are informative for estimating the probability of catastrophic inputs.
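
To get a feel for how unlikely that is, here is a rough back-of-the-envelope sketch. It assumes training inputs are drawn i.i.d. and that each one is catastrophic with some fixed probability `p`. Both are simplifying assumptions of mine, and the function names and the numbers below are made up for illustration. Under those assumptions, the chance that `n` samples contain no catastrophic input at all is `(1 - p)^n`.

```python
import math


def prob_no_catastrophic_sample(p: float, n: int) -> float:
    """Probability that n i.i.d. samples contain zero catastrophic inputs.

    Assumes each sample is catastrophic independently with probability p.
    Computes (1 - p)^n via log1p so that tiny p does not round 1 - p to 1.
    """
    return math.exp(n * math.log1p(-p))


def samples_needed(p: float, target: float = 0.5) -> int:
    """Roughly the smallest n with P(at least one catastrophic sample) >= target.

    Solves 1 - (1 - p)^n >= target for n (up to floating-point rounding).
    """
    return math.ceil(math.log1p(-target) / math.log1p(-p))


if __name__ == "__main__":
    p = 1e-9  # hypothetical catastrophe rate, purely illustrative
    for n in (10**6, 10**8, 10**10):
        print(f"n={n:.0e}  P(no catastrophic sample)={prob_no_catastrophic_sample(p, n):.4f}")
    print("samples for a 50% chance of seeing one:", samples_needed(p))
```

With a rate like `1e-9`, you need on the order of `ln(2) / p` samples (several hundred million) just to have even odds of the training set containing a single catastrophic input, so anything trained on a much smaller dataset almost never sees one.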