q1 see high stakes alignment notes for context.

Let be some class of boolean functions on bit inputs. For example ” node decision trees” or ” gate circuits”.

Let be some “non black box” “pretty simple” function.

Let be the set of such that .

You get training queries. You wanna find which is competitive in terms of reward with the best .

TODO: flesh out — how can C both be simple and non-black box. if its simple and you just know it then the problem trivializes

why arent we allowed to just distill C

not quite sure how to spell this out.

also not sure if defining C is actually secretly the main difficulty.

enough for now.