A couple of quick notes from Elad Hazan’s “Introduction to Online Convex Optimization” (OCO).
I suspect the main takeaway is that if no particular action of a NN can be too bad (i.e., the utility function is bounded), then SGD achieves bounded regret (the idea is from Paul's blog). This corresponds to "low-stakes" alignment, and suggests that for low-stakes alignment, defining a good objective function is all you need.
Best to introduce this via some examples.
ex: binary version of expert advice problem
we will have $N$ experts and $T$ rounds. each round, every expert says $A$ or $B$, we pick $A$ or $B$ after seeing their advice, and then the true answer is revealed.
we'll assume $N \le 2^T$
($N > 2^T$ is stupid: you can just have $2^T$ experts that say every possible combination of answers, one of which is then perfect), and I'll probably even think of $N$ as a constant.
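(A quick sanity check of that parenthetical in throwaway Python; this is my own sketch, not from the book: with $N = 2^T$ experts covering every advice sequence, some expert is perfect.)

```python
from itertools import product

T = 4
truth = ["A", "B", "B", "A"]  # an arbitrary sequence of correct answers

# one expert per element of {A,B}^T: every possible advice sequence
experts = list(product("AB", repeat=T))

# the expert whose advice matches the truth exactly makes 0 mistakes
best = min(sum(e[t] != truth[t] for t in range(T)) for e in experts)
assert best == 0  # with N = 2^T experts, L = 0 and the game is trivial
```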
suppose that the best expert makes at most $L$ mistakes.
can we make at most $O(L)$ mistakes?
yes. we can even do $2(1+\varepsilon)L + \frac{2\ln N}{\varepsilon}$ deterministically, and $(1+\varepsilon)L + \frac{\ln N}{\varepsilon}$ with randomization.
claim 1
No deterministic algorithm guarantees making fewer than $2L$ mistakes.
pf
sps there are 2 experts.
one always says $A$, the other always says $B$.
then one of the two experts is correct on at least half of the $T$ rounds, so $L \le T/2$.
but we are deterministic, so the adversary can choose the correct one of $\{A, B\}$ each round to be the opposite of our (predictable) prediction. then we make all $T \ge 2L$ mistakes.
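Here's a toy simulation of this adversary (my own sketch; `deterministic_learner` is a hypothetical stand-in for any deterministic rule):

```python
T = 100

def deterministic_learner(history):
    # any deterministic map from past outcomes to a prediction works here;
    # this one follows whichever of "always A" / "always B" has erred less
    a_correct = sum(1 for y in history if y == "A")
    return "A" if a_correct >= len(history) - a_correct else "B"

history = []
our_mistakes = 0
for t in range(T):
    pred = deterministic_learner(history)
    outcome = "B" if pred == "A" else "A"  # adversary flips our prediction
    our_mistakes += 1                      # so we are wrong every round
    history.append(outcome)

# one of the two constant experts is right on at least half the rounds
L = min(sum(y != "A" for y in history), sum(y != "B" for y in history))
print(our_mistakes, L)  # 100 vs. at most 50
```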
claim 2
There is a deterministic algorithm making at most $2(1+\varepsilon)L + \frac{2\ln N}{\varepsilon}$ mistakes, for any $\varepsilon \le 1/2$.
pf — Weighted Majority alg
each expert starts with weight $1$.
let $w_i^{(t)}$ be expert $i$'s weight at time $t$; whenever expert $i$ makes a mistake, set $w_i^{(t+1)} = (1-\varepsilon)\,w_i^{(t)}$.
Let $\Phi_t = \sum_i w_i^{(t)}$ denote the total weight at time $t$.
we make choices by looking at the total weight of experts that select $A$ vs the total weight on $B$, and going with the weighted majority.
let $M_t$ be the number of mistakes we have made through time $t$, and $L_t$ the number made by the best expert so far.
observe that whenever we make a mistake, we must have $\Phi_{t+1} \le (1 - \varepsilon/2)\,\Phi_t$: at least half the total weight sat on wrong experts and got multiplied by $(1-\varepsilon)$.
we have the following: $(1-\varepsilon)^{L_T} \le \Phi_T \le N\,(1-\varepsilon/2)^{M_T}$ (the left inequality tracks the best expert's weight, the right one applies the observation $M_T$ times).
Re-arranging this (take logs and use $-x - x^2 \le \ln(1-x) \le -x$ for $0 \le x \le 1/2$) gives $M_T \le 2(1+\varepsilon)\,L_T + \frac{2\ln N}{\varepsilon}$,
as desired.
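A minimal Python sketch of Weighted Majority as described above (my own rendering of the update and voting rules; the synthetic data at the bottom, with a planted good expert, is just for checking the bound):

```python
import math
import random

def weighted_majority(advice, truths, eps=0.1):
    """advice[t][i] is expert i's prediction ("A"/"B") at round t."""
    w = [1.0] * len(advice[0])           # each expert starts with weight 1
    mistakes = 0
    for preds, truth in zip(advice, truths):
        wa = sum(wi for wi, p in zip(w, preds) if p == "A")
        guess = "A" if wa >= sum(w) - wa else "B"  # weighted majority vote
        mistakes += guess != truth
        # multiply the weight of every wrong expert by (1 - eps)
        w = [wi * (1 - eps) if p != truth else wi for wi, p in zip(w, preds)]
    return mistakes

# synthetic check: expert 0 is right ~95% of the time, the rest are random
rng = random.Random(0)
T, N, eps = 1000, 10, 0.1
truths = [rng.choice("AB") for _ in range(T)]
advice = [[truths[t] if i == 0 and rng.random() < 0.95 else rng.choice("AB")
           for i in range(N)] for t in range(T)]
L = min(sum(advice[t][i] != truths[t] for t in range(T)) for i in range(N))
M = weighted_majority(advice, truths, eps)
assert M <= 2 * (1 + eps) * L + 2 * math.log(N) / eps  # the claimed bound
```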
claim 3
there is a randomized algorithm making at most $(1+\varepsilon)L + \frac{\ln N}{\varepsilon}$ mistakes in expectation.
(i suspect you can even get $L + O(\sqrt{T \ln N})$ by choosing $\varepsilon \approx \sqrt{\ln N / T}$.)
pf
Fix $\varepsilon \le 1/2$.
Now we update weights by multiplying by $(1-\varepsilon)$ whenever an expert makes a mistake, as before.
We choose an expert by sampling one according to its weight (i.e., expert $i$ with probability $w_i^{(t)}/\Phi_t$), and predict what it predicts.
We do pretty similar things but with some more careful bounding to get the desired bound.
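And the randomized version, with the same sketch-level caveats as above (sampling an expert in proportion to its weight instead of taking a majority vote):

```python
import random

def randomized_weighted_majority(advice, truths, eps=0.1, seed=0):
    """Follow a random expert, sampled with probability w_i / sum(w)."""
    rng = random.Random(seed)
    w = [1.0] * len(advice[0])
    mistakes = 0
    for preds, truth in zip(advice, truths):
        i = rng.choices(range(len(w)), weights=w)[0]  # sample prop. to weight
        mistakes += preds[i] != truth
        # same multiplicative update as in the deterministic algorithm
        w = [wi * (1 - eps) if p != truth else wi for wi, p in zip(w, preds)]
    return mistakes
```

On any fixed advice/truth sequence, claim 3 says its expected mistake count is at most $(1+\varepsilon)L + \frac{\ln N}{\varepsilon}$.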