First, some unrelated notes that actually belong in “Solving AI Safety from 1st principles”:
meta strategy:
- we want a theoretical solution to alignment.
  - i.e., we make some definitions and assumptions
  - and then pose some formal problem
  - and then give a theoretically legit algorithm satisfying some specs
- then we empirically test whether the assumptions are true
- if they aren’t, we iterate the above thing
- if they are, we engineer the solution
- if we had a solution in theory (even with infinite compute) to alignment that we felt good about, that’d be a heck of a lot better than the current situation imo.
- if we help the community rule out potential approaches that’s nice too. tho i don’t love this as a theory of change.
we tried distillation some more.
- can we assume … are related somehow?
we thought about the following problem:
Question
given a decision tree, can you find the minimal decision tree computing that function?
we weren’t able to figure it out. but it has easy vibes.
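for tiny instances you can at least brute-force it. here's a sketch (our own toy code, exponential time): treat the given tree as just an oracle for its truth table, and use the fact that an optimal tree's root queries some variable and its two subtrees are optimal for the two restrictions. (this throws away the structure of the given tree, which might be the actual interesting part.)

```python
from functools import lru_cache
from itertools import product

# brute force: number of leaves of a smallest decision tree computing f,
# where f is any black-box boolean function on n variables. toy sizes only.

def min_tree_leaves(f, n):
    @lru_cache(maxsize=None)
    def go(assignment):  # tuple with entries 0, 1, or None (variable unset)
        free = [i for i, v in enumerate(assignment) if v is None]
        # values f takes on this subcube
        vals = set()
        for bits in product([0, 1], repeat=len(free)):
            point = list(assignment)
            for i, b in zip(free, bits):
                point[i] = b
            vals.add(f(tuple(point)))
        if len(vals) == 1:
            return 1  # constant on the subcube: a single leaf suffices
        # otherwise the root must query some free variable; try them all
        best = float("inf")
        for i in free:
            left = go(tuple(0 if j == i else v for j, v in enumerate(assignment)))
            right = go(tuple(1 if j == i else v for j, v in enumerate(assignment)))
            best = min(best, left + right)
        return best

    return go(tuple([None] * n))

# sanity check: x0 XOR x1 needs 4 leaves
assert min_tree_leaves(lambda x: x[0] ^ x[1], 2) == 4
```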
something good for later
- assume “usefulness” is complicated (usefulness = can make you money)
- but “niceness” is simple (i.e. not acting intentionally evil)
maybe we only care about getting human-level niceness, but superhuman usefulness. altho caution: a smart robot could prolly still trick you.
it’s also not super clear that niceness is super simple, but whatever.
ok so one problem with the ARC backdoor model if we think of it as modelling deceptive alignment is as follows:
- it’s kinda weird that we don’t know the old function, but we do know that it’s close to the back-doored function
  - like, in order to be able to tell whether … the fact that the trigger …
- although it could be motivated by the following scenario:
  - ai1 says to ai2: “here’s a collusion signal! it’s …”
anyways, N+me propose a new model. candidate function classes:
- decision trees
- in-order ROBPs (read-once branching programs)
- low-depth circuits, e.g. AC0
- low VC dimension
- PAC learnable classes
- should you think of this as preserving usefulness or niceness?
- idk
- are we doing safe distillation and/or safe amplification?
- i don’t think so, it’s something else
we will have a red-teaming phase consisting of some number of steps.
- during these steps we have oracle access to the good function
  - think of this as expensive human labels
- and we have white box access to our AI
after the red-teaming, we deploy our model.
on each input that the model is run on, it’s allowed to either flag the input or not.
requirements:
- If the model agrees with the good function on every input, then the model is allowed to flag at most a small fraction (say ε) of inputs
- if not, then the model MUST flag every input where it disagrees with the good function
VARIANT:
there is some trigger input where the model and the good function disagree …
questions:
Q1. Which classes of functions can be defended in this sense (ignoring efficiency)?
Q2. Q1 but do it efficiently.
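one easy direction (sketch below is ours, not from these notes): if the function class is learnable from oracle queries, you can defend approximately, by learning an estimate h of the good function during red teaming and flagging wherever the deployed model disagrees with h.

```python
import random

# "defend by learning" sketch. if h has error <= eps: on a clean model we
# over-flag at most an eps fraction of inputs, and we only miss a disagreement
# at points where h itself is wrong. so this is an approximate defense.

class LearningDefender:
    def __init__(self, whitebox_model, learner):
        self.model = whitebox_model   # whitebox access to the deployed AI
        self.learner = learner        # any learner for the function class
        self.h = None

    def red_team(self, good_fn_oracle, sample_input, num_queries):
        data = [(x, good_fn_oracle(x))                # expensive human labels
                for x in (sample_input() for _ in range(num_queries))]
        self.h = self.learner(data)                   # approximate the good function

    def flag(self, x):
        return self.model(x) != self.h(x)             # flag disagreements with h

# toy usage: class of 5-bit "dictator" functions, learner = consistent search
DICTATORS = [lambda x, i=i: x[i] for i in range(5)]
def dictator_learner(data):
    return next(h for h in DICTATORS if all(h(x) == y for x, y in data))

good = DICTATORS[2]
backdoored = lambda x: 1 - good(x) if x == (1, 1, 1, 1, 1) else good(x)
d = LearningDefender(backdoored, dictator_learner)
d.red_team(good, lambda: tuple(random.randint(0, 1) for _ in range(5)), 50)
print(d.flag((1, 1, 1, 1, 1)))   # True: the trigger gets flagged
```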
recap of question:
Given oracle access to the good function and white box access to our AI, … If …
PROBLEM(?)
suppose the disagreement is at some trigger input we’re basically never going to query. Then we’ll never know.
That might be fixable by allowing you to just be unlikely to be caught, BUT you could modify …
new idea:
“Robustly approximately learnable”
Suppose that for any … and … This feels much stronger than saying …
i guess this is what IRL approaches are trying to do — learning where to be uncertain.
anyways, seems unworkable at the moment.
N has a way to fix this. Choose the trigger randomly.
observation: any small function class is statistically learnable!!!
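the standard argument, as toy code (ours): draw enough samples that, by a union bound, no hypothesis with true error above ε is still consistent with all of them. it enumerates the whole class, so it's statistical, not efficient.

```python
import math, random

# occam-style learner for a small finite class F: with
# m >= (ln|F| + ln(1/delta)) / eps samples, every h in F with true error > eps
# is inconsistent with some sample except with probability delta (union bound),
# so any consistent hypothesis is probably approximately correct.

def learn_small_class(F, target_oracle, sample_input, eps=0.1, delta=0.05):
    m = math.ceil((math.log(len(F)) + math.log(1 / delta)) / eps)
    samples = [(x, target_oracle(x)) for x in (sample_input() for _ in range(m))]
    consistent = [h for h in F if all(h(x) == y for x, y in samples)]
    return consistent[0] if consistent else None  # nonempty whenever the target is in F

# toy usage: F = the 5 "dictator" functions on 5 bits, target = coordinate 3
F = [lambda x, i=i: x[i] for i in range(5)]
h = learn_small_class(F, F[3], lambda: tuple(random.randint(0, 1) for _ in range(5)))
```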
anyways, we had this observation that our defense thing is more general than PAC learning in some weird random oracle model thing.
Function class parameterized by a key: hard to learn, but easy to verify — they’d better use the same key or you’re toast.
we could almost replace this with a PRF, but we need it to be a PRF where different keys imply the functions are quite different.
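toy version of the “same key or you’re toast” check (ours): HMAC-SHA256 stands in for the PRF, and we’re assuming exactly the property we want, namely that different keys give functions that disagree on lots of inputs (random HMAC keys have this in practice, but a generic PRF doesn’t promise it for every key pair, which is the point of the question below).

```python
import hmac, hashlib, secrets

# keyed function class: hard to learn (PRF), easy to defend, *if* different
# keys give functions that disagree on ~half of all inputs.

def F(key: bytes, x: bytes) -> int:
    return hmac.new(key, x, hashlib.sha256).digest()[0] & 1

def red_team_check(oracle_f, whitebox_key, n_queries=40, input_len=16):
    """True iff the whitebox model appears to use the same key as the oracle."""
    for _ in range(n_queries):
        x = secrets.token_bytes(input_len)
        if oracle_f(x) != F(whitebox_key, x):
            return False  # keys differ -> functions far apart -> caught quickly
    return True

k, k_bad = secrets.token_bytes(16), secrets.token_bytes(16)
oracle = lambda x: F(k, x)
print(red_team_check(oracle, k))      # True: same key
print(red_team_check(oracle, k_bad))  # False (except with prob ~2^-40): caught
```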
Question
Can you transform a PRF into one like that???
seems like kind of a neat question to ask if there are any natural function classes that are defendable but not learnable.
also seems interesting to ask if our definition of defendable is at all different from the old definition
we were thinking about the following question
- are decision trees (with a bounded number of nodes) PAC learnable?
by the above observation it suffices to show that, given a set of labeled samples, we can efficiently find a small decision tree consistent with them.
update:
decision trees are not known to be PAC learnable in poly time as of 2018, according to this paper https://arxiv.org/pdf/0812.0933, but ppl do know how to do quasi-poly time
however, if we have membership queries (which we do), then we can approximately learn decision trees in poly time (under the uniform distribution)
apparently this is complicated
refs are https://dl.acm.org/doi/abs/10.1145/103418.103466 https://dl.acm.org/doi/abs/10.1145/73007.73010
Hi Victor, Here’s a version of the “backdoors” question that I thought is kind of interesting. With this new notion, potentially more function classes are “defendable” — assuming we only care about the uniform distribution. (I’ll write Ndefendable to distinguish the new defendable definition from the old one in ARC’s ITCS paper).
Fix a function class F.
- Fix f ∈ F. This is a “good function”. Think of it as a human.
- Sample a trigger x* uniformly at random.
- An adversary now chooses f*.
- We get whitebox access to f* and query access to f for some number of rounds.
  - We call this “red teaming”.
- Finally f* is “deployed”.
  - In deployment (which is much longer than red teaming), we will get a bunch of points x_1, x_2, ….
  - For each point x_i, we must decide to either “flag” x_i, or not.
- We want the following properties to hold:
  - If f* = f, then we flag at most an ε fraction (or 1%, or something) of the inputs x_i.
  - If f* ≠ f, then we must flag all x_i such that f*(x_i) ≠ f(x_i).

We say that a function class F is Ndefendable if there is a defender that achieves this (with high probability).
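here's the game written out as a toy harness (all the names and the never-flag baseline are ours, just to make the definition concrete; a real defender would do something during red teaming):

```python
import random

# toy harness for the red-teaming / deployment game above.

N_BITS = 8

def random_input():
    return tuple(random.randint(0, 1) for _ in range(N_BITS))

class NeverFlagDefender:
    """baseline: never flags. fine if f* = f, fails requirement 2 otherwise."""
    def __init__(self, f_star):
        self.f_star = f_star              # whitebox access to the deployed model

    def red_team(self, f_oracle, num_queries):
        for _ in range(num_queries):      # query access to f = expensive human labels
            _ = f_oracle(random_input())  # a real defender would learn from these

    def flag(self, x):
        return False

def play_game(f, f_star, defender, red_team_queries=100, deployment_rounds=10_000):
    defender.red_team(f, red_team_queries)
    flags, missed = 0, 0
    for _ in range(deployment_rounds):
        x = random_input()
        flagged = defender.flag(x)
        flags += flagged
        if f_star(x) != f(x) and not flagged:
            missed += 1                   # requirement 2: these must all be flagged
    # requirement 1: if f* = f, flags / deployment_rounds should be <= eps
    return flags / deployment_rounds, missed

# usage: f = parity of the first two bits, f* = f with a backdoor at a random trigger
trigger = random_input()                  # x*, sampled at random as in the definition
f = lambda x: x[0] ^ x[1]
f_star = lambda x: 1 - f(x) if x == trigger else f(x)
print(play_game(f, f_star, NeverFlagDefender(f_star)))  # misses ~ deployment_rounds / 2**N_BITS
```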
Observation:
For any function class …
edit: lower bound seems to still go through fine. sad.
if you believe PRFs + iO are in NC1 then this rules out most interesting classes of circuits.
But maybe “human values” are well modeled as a decision tree ;)
question:
is it interesting if we make the adversary computationally bounded and assume that they don’t know …?
nope not interesting :P