A couple of quick notes from Elad Hazan’s “Introduction to Online Convex Optimization” (OCO).
I suspect the main takeaway is that if no single action of a NN can be too bad (i.e., the per-step loss is bounded), then SGD achieves sublinear regret. The idea is from Paul's blog: this corresponds to "low-stakes" alignment, and it suggests that for low-stakes alignment, defining a good objective function is all you need.
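for reference, the kind of statement behind this is the online gradient descent regret bound from the book (Theorem 3.1 there; i'm quoting the constant from memory): if the losses $f_t$ are convex with gradients bounded by $G$ over a decision set of diameter $D$, then OGD with step sizes $\eta_t = \frac{D}{G\sqrt{t}}$ satisfies
$$\sum_{t=1}^T f_t(x_t) - \min_x \sum_{t=1}^T f_t(x) \le \frac{3}{2}GD\sqrt{T},$$
so the average regret per round vanishes as $T \to \infty$.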
Best to introduce this via some examples.
ex: binary version of the expert advice problem.
we will have $N$ experts and $T$ rounds. in each round, every expert advises one of two actions, A or B; we pick an action, and then the adversary reveals which one was correct.
we'll assume $N \ll 2^T$ (letting $N$ be huge is stupid: you can just have experts that say every possible combination, so one of them is always right), and I'll probably even think of $N$ as a small constant.
suppose that the best expert makes at most $L$ mistakes.
can we make at most $2L$ mistakes, say?
yes, roughly. and with randomization we can even get close to $L$.
claim 1
No deterministic algorithm guarantees fewer than $2L$ mistakes, i.e., regret less than $L$.
pf
sps there are 2 experts.
one always says A, and the other always says B.
then one of them is correct at least half the time, so $L \le T/2$.
but we are deterministic, so the adversary can choose the correct answer to be the opposite of whatever we predict, and we make all $T \ge 2L$ mistakes.
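here's a tiny simulation of this argument (the interface is my own framing; `predict` stands in for an arbitrary deterministic algorithm):

```python
# sketch of the lower bound: the adversary answers with the opposite of our
# prediction, so we err every round, while one of the two constant experts
# errs at most T/2 times.

def run_adversary(predict, T):
    """predict: maps the history of (guess, answer) pairs to 'A' or 'B'."""
    history = []
    our_mistakes = 0
    expert_mistakes = {"A": 0, "B": 0}  # the expert "A" always says A, etc.
    for _ in range(T):
        guess = predict(history)
        answer = "B" if guess == "A" else "A"  # adversary flips our guess
        our_mistakes += 1  # by construction we are always wrong
        for e in ("A", "B"):
            if e != answer:
                expert_mistakes[e] += 1
        history.append((guess, answer))
    return our_mistakes, min(expert_mistakes.values())  # (T, L <= T/2)

# e.g. a follow-the-majority predictor still makes T mistakes:
predict = lambda h: "A" if 2 * sum(a == "A" for _, a in h) >= len(h) else "B"
print(run_adversary(predict, 100))  # (100, 50)
```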
claim 2
There is a deterministic algorithm that makes at most $2(1+\epsilon)L + \frac{2\ln N}{\epsilon}$ mistakes.
pf — Weighted Majority alg
each expert starts with weight $1$.
let $W_t(i)$ denote the weight of expert $i$ at time $t$; whenever expert $i$ makes a mistake, we multiply its weight by $(1-\epsilon)$.
Let $\Phi_t = \sum_i W_t(i)$ denote the total weight, so $\Phi_1 = N$.
we make choices by looking at the total weight of experts that select A vs the total weight on B.
let $M$ be the total number of mistakes we make.
observe that whenever we make a mistake, we must have $\Phi_{t+1} \le (1-\epsilon/2)\Phi_t$, since at least half the weight was on experts that were wrong.
we have the following:
$$(1-\epsilon)^L \le \Phi_{T+1} \le N(1-\epsilon/2)^M.$$
Re-arranging this (taking logs and using $-\epsilon - \epsilon^2 \le \ln(1-\epsilon) \le -\epsilon$ for $\epsilon \le 1/2$) gives
$$M \le 2(1+\epsilon)L + \frac{2\ln N}{\epsilon},$$
as desired.
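here's a minimal sketch of weighted majority as just described (the $(1-\epsilon)$ update and the weighted vote come from the proof; the interface, with advice given as lists of 'A'/'B' strings, is my own choice):

```python
def weighted_majority(advice_per_round, answers, eps=0.1):
    """Deterministic Weighted Majority.

    advice_per_round: list over rounds; each entry is a list of N
        'A'/'B' advice strings, one per expert.
    answers: list of correct 'A'/'B' answers, one per round.
    Returns (our mistakes M, best expert's mistakes L).
    """
    N = len(advice_per_round[0])
    w = [1.0] * N                 # each expert starts with weight 1
    our_mistakes = 0
    expert_mistakes = [0] * N
    for advice, answer in zip(advice_per_round, answers):
        weight_A = sum(wi for wi, a in zip(w, advice) if a == "A")
        weight_B = sum(wi for wi, a in zip(w, advice) if a == "B")
        guess = "A" if weight_A >= weight_B else "B"
        if guess != answer:
            our_mistakes += 1
        for i, a in enumerate(advice):
            if a != answer:
                w[i] *= 1 - eps   # penalize mistaken experts
                expert_mistakes[i] += 1
    return our_mistakes, min(expert_mistakes)
```

(the potential $\Phi_t$ from the proof is just `sum(w)` here.)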
claim 3
there is a randomized algorithm that makes at most $(1+\epsilon)L + \frac{\ln N}{\epsilon}$ mistakes in expectation.
(i suspect you can even get $L + O(\sqrt{L\ln N})$ by tuning $\epsilon$.)
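quick sanity check on that guess: plugging the optimal $\epsilon$ into claim 3's bound gives
$$\epsilon = \sqrt{\frac{\ln N}{L}} \implies (1+\epsilon)L + \frac{\ln N}{\epsilon} = L + 2\sqrt{L\ln N}$$
(this tuning assumes we know $L$ in advance, which i believe can be removed with a doubling trick).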
pf
Fix $\epsilon \le 1/2$.
Now we update weights by multiplying by $(1-\epsilon)$ whenever an expert makes a mistake, exactly as before.
We choose an expert to follow each round by sampling one with probability proportional to its weight.
The analysis is similar, with slightly more careful bounding: if $p_t$ is the fraction of weight on wrong experts at time $t$ (which is exactly our probability of erring), then $\Phi_{t+1} = (1-\epsilon p_t)\Phi_t \le e^{-\epsilon p_t}\Phi_t$, so $(1-\epsilon)^L \le \Phi_{T+1} \le N e^{-\epsilon\,\mathbb{E}[M]}$, and taking logs gives the desired bound.
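and a minimal sketch of the randomized version (same interface as the `weighted_majority` sketch above; `random.choices` does the weight-proportional sampling):

```python
import random

def randomized_weighted_majority(advice_per_round, answers, eps=0.1, rng=random):
    """Randomized Weighted Majority: follow a randomly sampled expert,
    with the same (1 - eps) multiplicative update as before."""
    N = len(advice_per_round[0])
    w = [1.0] * N
    our_mistakes = 0
    for advice, answer in zip(advice_per_round, answers):
        # sample an expert with probability proportional to its weight
        i = rng.choices(range(N), weights=w, k=1)[0]
        if advice[i] != answer:
            our_mistakes += 1
        for j, a in enumerate(advice):
            if a != answer:
                w[j] *= 1 - eps
    return our_mistakes
```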