I've recently heard several surprising alignment ideas, so I'll write them down here. They've updated me towards "there are a bunch of random hacky things we could try which might be really cool".

  1. Steven Byrnes has a recent post about a "reward button".
    1. The idea is to make an AI that wants you to press a certain button.
    2. The thought is that this is a super simple reward function (a minimal sketch of what it could look like is at the end of this post).
    3. So inner alignment is easy.
    4. One problem with this plan is that it incentivizes the AI to take control of the button.
    5. But seizing the button is outside the AI's action space until it's an ASI.
    6. Maybe there are AIs that
      1. Can do useful research.
      2. But can’t take over the button.
    7. On the other hand, if you said "I'll press the button if you do great research", then that probably doesn't work either:
      1. Like, you can only use this on highly verifiable tasks.
      2. And if the task were highly verifiable, then maybe you could've done without the whole button thing?
      3. Interesting idea, though.
  2. Distill a deceptively aligned model
    1. Some people have the intuition that "RL produces deceptively aligned AIs much more readily than imitation/prediction training does".
    2. If that's true, then here's something you could try:
      1. Use RL to train a super buff, misaligned AI.
      2. Distill the misaligned AI via pure imitation/prediction (see the sketch at the end of this post).
        1. Either it has to act misaligned during distillation (which you could notice),
        2. Or you get a nice AI out?
    3. There are definitely problems that could happen.
      1. But maybe this does something good.
  3. Automating science
    1. Train an AI model to figure out how humans integrate new information into our world model.
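
Since idea 1 hinges on the reward function being dead simple, here's a minimal sketch of what it could look like. This is my own illustration, not anything from Byrnes's post, and all the names (`ButtonRewardEnv`, `press_button`, the wrapped `task_env`) are made up:

```python
class ButtonRewardEnv:
    """Wraps an arbitrary task environment; reward depends only on the button state."""

    def __init__(self, task_env):
        self.task_env = task_env          # whatever environment the AI actually works in
        self._button_pressed = False

    def press_button(self):
        # Called by the human operator whenever they decide to reward the AI.
        self._button_pressed = True

    def step(self, action):
        obs = self.task_env.step(action)  # the task itself carries no reward signal
        reward = 1.0 if self._button_pressed else 0.0
        self._button_pressed = False      # one press yields one unit of reward
        return obs, reward
```

The reward computation never inspects the AI's work at all, which is why inner alignment looks easy here, and also why grabbing control of the button becomes the obvious failure mode once that's in the action space.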
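
And here's a rough sketch of the distillation step in idea 2, written as standard KL-matching distillation in PyTorch. The `student`, `teacher`, and `tokens` arguments are placeholders (any models that map token ids to logits, plus a batch of token ids); nothing about this is specific to deceptive alignment:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, tokens, optimizer):
    """One gradient step pulling the student's output distribution toward the teacher's."""
    with torch.no_grad():
        teacher_logits = teacher(tokens)  # the RL-buffed, possibly misaligned model
    student_logits = student(tokens)      # the fresh model we hope comes out nice

    # KL(teacher || student), averaged over the batch.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The load-bearing property is that the student only ever sees the teacher's output distribution on the distillation data, which is what makes the either/or above at least plausible: misalignment that never shows up in those outputs has no obvious channel into the student.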