A couple of quick notes from Elad Hazan’s “Introduction to Online Convex Optimization” (OCO).

I suspect the main takeaway is that, if no particular action of a NN can be too bad (i.e., the utility function is bounded), then SGD achieves bounded regret — the idea is from Paul's blog. this corresponds to "low-stakes" alignment, and suggests that for low-stakes alignment, defining a good objective function is all you need.

Best to introduce this via some examples.

ex: binary version of expert advice problem

we will have $T$ iterations of a binary decision — choosing either $A$ or $B$. before each decision we'll get to hear $N$ expert opinions. the adversary will get to choose which of $A, B$ is the right answer afterwards. we get a score of $1$ for a wrong answer, $0$ for a right answer.

we'll assume $N < 2^T$ ($N \ge 2^T$ is stupid — you can just have experts that say every possible combination, so one of them is always right), and I'll probably even think of $N = 2$.

suppose that the best expert makes at most $L$ mistakes over the course of the $T$ rounds.

can we make at most $L + o(T)$ mistakes?

yes. we can even do $L + O(\sqrt{T \log N})$. we call the excess over the best expert "regret", so this is regret $O(\sqrt{T \log N})$.
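to make the setup concrete, here's a minimal sketch of the interaction protocol (all names here are my own; the learner and adversary are placeholders):

```python
# A minimal sketch of the protocol: T rounds, N experts speak, we guess,
# then the adversary reveals the answer. We score 1 per wrong answer.
def play(T, experts, learner, adversary):
    N = len(experts)
    expert_mistakes = [0] * N
    our_mistakes = 0
    history = []
    for t in range(T):
        advice = [expert(t, history) for expert in experts]  # experts speak first
        guess = learner(advice, history)                     # we commit to A or B
        truth = adversary(advice, guess)                     # adversary answers last
        our_mistakes += guess != truth
        for i, a in enumerate(advice):
            expert_mistakes[i] += a != truth
        history.append((advice, guess, truth))
    return our_mistakes, min(expert_mistakes)
```

e.g. `play(10, [lambda t, h: "A", lambda t, h: "B"], lambda adv, h: adv[0], lambda adv, g: "B")` pits two constant experts against a learner that just copies expert 0, with an adversary that always answers $B$.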


claim 1

No deterministic algorithm guarantees regret less than $T/2$.

pf

sps there are 2 experts. one always says $A$, the other always says $B$.

then the better of the two experts makes at most $T/2$ mistakes, since each round exactly one of them is wrong.

but we are deterministic, so the adversary can choose the correct one of $A, B$ to be whichever we didn't choose, forcing us to make all $T$ mistakes. so our regret is at least $T - T/2 = T/2$.
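this adversary can be sketched in code (my own names; the learner is any deterministic function of the advice and history):

```python
# Claim 1's adversary: two constant experts. Since the learner is deterministic,
# the adversary can simulate it and declare the OTHER answer correct, so we are
# wrong every round. Each round exactly one of the two constant experts is also
# wrong, so the better one errs at most T/2 times.
def force_regret(T, learner):
    advice = ("A", "B")            # expert 0 always says A, expert 1 always says B
    history = []
    expert_mistakes = [0, 0]
    for t in range(T):
        guess = learner(advice, history)
        truth = "B" if guess == "A" else "A"   # whichever we didn't choose
        for i, a in enumerate(advice):
            expert_mistakes[i] += a != truth
        history.append((guess, truth))
    our_mistakes = T               # wrong by construction, every round
    return our_mistakes, min(expert_mistakes)
```

running this against any deterministic learner (e.g. one that alternates $A, B$) shows regret at least $T/2$.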

claim 2

There is a deterministic algorithm that makes at most $2(1+\epsilon)L + \frac{2\ln N}{\epsilon}$ mistakes (for any $\epsilon \le 1/2$).

pf — Weighted Majority alg

each expert starts with weight $1$. an expert's weight decays by a factor of $(1-\epsilon)$ whenever the expert answers incorrectly, and doesn't change if they answer correctly.

let $M_t(i)$ denote the number of mistakes of expert $i$ up to time $t$. then the weight of expert $i$ at time $t$ is $W_t(i) = (1-\epsilon)^{M_t(i)}$.

Let $M_t$ denote the number of mistakes that we make up to time $t$.

we make choices by comparing the total weight of experts that select $A$ vs the total weight on $B$, and going with the weighted majority.

let $\Phi_t = \sum_i W_t(i)$ denote the total weight at time $t$.

observe that whenever we make a mistake, at least half the weight sat on wrong experts and decays by $(1-\epsilon)$, so we must have $\Phi_{t+1} \le \left(1 - \frac{\epsilon}{2}\right)\Phi_t$.

we have the following: for any expert $i$,
$$(1-\epsilon)^{M_T(i)} = W_T(i) \le \Phi_T \le \left(1-\frac{\epsilon}{2}\right)^{M_T}\Phi_0 = N\left(1-\frac{\epsilon}{2}\right)^{M_T}.$$

Re-arranging this (take logs, then use $\frac{\epsilon}{2} \le -\ln\left(1-\frac{\epsilon}{2}\right)$ and $-\ln(1-\epsilon) \le \epsilon + \epsilon^2$ for $\epsilon \le 1/2$) gives
$$M_T \le 2(1+\epsilon)M_T(i) + \frac{2\ln N}{\epsilon},$$

as desired.

claim 3

there is a randomized algorithm with expected regret $O(\sqrt{T \log N})$

(i suspect you can even get this without knowing $T$ in advance via standard dyadic tricks, but I haven't thought about it yet)

pf

Fix $\epsilon$ (we'll take it to be $\sqrt{\ln N / T}$).

Now we update weights by multiplying by $(1-\epsilon)$ on every mistake, as before.

We choose an expert to follow by sampling one with probability proportional to its weight.

We do pretty similar things, with some more careful bounding: the potential now satisfies $\Phi_{t+1} = (1 - \epsilon F_t)\Phi_t$, where $F_t$ is the probability we err at step $t$, and comparing to $W_T(i)$ as before gives $\mathbb{E}[M_T] \le (1+\epsilon)L + \frac{\ln N}{\epsilon}$. plugging in $\epsilon = \sqrt{\ln N / T}$ then gives regret at most $2\sqrt{T \ln N}$, as desired.