I've recently heard several surprising alignment ideas, so I'll write them down here. They've updated me towards "there are a bunch of random hacky things we could try that might be really cool".
- Steven Byrnes has a recent post about a "reward button".
- The idea is to make an AI that wants you to press a certain button.
- The thought is that this is a super simple reward function.
- So inner alignment is easy.
- One problem with this plan is that it incentivizes the AI to take over control of the button.
- But taking over the button is outside the AI's action space until the AI is an ASI.
- Maybe there are AIs that
- Can do useful research.
- But can’t take over the button.
- OTOH, if you said "I'll press the button if you do great research", then that probably doesn't work either.
- Like, you can only use this on highly verifiable tasks.
- And if the task is highly verifiable, then maybe you could've done without the whole button thing anyway?
- Interesting idea, though. (A minimal sketch of the button-only reward setup follows this list.)
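To make the "super simple reward function" point concrete, here's a minimal sketch, assuming a gym-style step/reset environment interface. All the names here (`Button`, `ButtonRewardEnv`) are my own illustration, not anything from Byrnes' post.

```python
# Hypothetical sketch of the "reward button" setup. The only reward signal
# is "did the human press the button?"; any task-specific reward is ignored.

class Button:
    """Stands in for a physical button that only the human can press."""
    def __init__(self):
        self._pressed = False

    def press(self):
        self._pressed = True

    def consume_press(self):
        """Return True at most once per press, then reset."""
        pressed, self._pressed = self._pressed, False
        return pressed


class ButtonRewardEnv:
    """Wraps an arbitrary task environment and replaces its reward."""
    def __init__(self, task_env, button: Button):
        self.task_env = task_env
        self.button = button

    def reset(self):
        return self.task_env.reset()

    def step(self, action):
        obs, _task_reward, done, info = self.task_env.step(action)
        # Reward comes only from the human pressing the button.
        reward = 1.0 if self.button.consume_press() else 0.0
        return obs, reward, done, info
```

The reward function itself is trivial to specify; all the difficulty moves into "what behavior actually gets the human to press the button", which is exactly where the take-over-the-button incentive shows up.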
- Distill a deceptively aligned model
- Some people have the intuition that RL produces deceptively aligned AIs much more readily than imitation/prediction training does.
- If this is true, then here's something you could try:
- RL ⇒ super buff misaligned AI.
- distill the misaligned AI
- Either the misaligned AI has to act misaligned during distillation (which you could notice),
- or the distilled model only copies its nice-looking behavior, and you get a nice AI out?
- There are definitely problems that could happen.
- But maybe this does something good. (A rough sketch of the distillation step follows this list.)
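For concreteness, here's a rough sketch of the distillation step, assuming the teacher (the RL'd model) and the student are both next-token predictors and using plain KL-based imitation. `teacher`, `student`, and `prompts` are placeholders, not a real pipeline.

```python
# Rough sketch of "distill the RL'd model": the student is trained purely
# to predict the teacher's output distribution, with no reward signal.

import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, prompts, temperature=1.0):
    """One imitation step: the student matches the teacher's token distribution."""
    with torch.no_grad():
        teacher_logits = teacher(prompts)       # [batch, seq, vocab], frozen teacher
    student_logits = student(prompts)

    # KL(teacher || student) on the per-token distributions.
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The hope, per the intuition above, is that because the student is trained purely on prediction, a teacher that hides its misalignment during this phase only passes on its nice-looking behavior.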
- Automating science
	- Train an AI model to figure out how humans integrate new information into our world models.