current projects

  1. (+BH) Is activation steering more powerful than prompting?
    • I feel like we’re mostly done with this
    • Though we do want to try some more prompts
  2. fixing causal scrubbing
  3. arc stuff

pretty exciting projects

  1. (+? maybe DL/SZ/BH) Fix Causal Scrubbing
  2. Map vs Territory (i.e. camera) — which will an RL agent learn? Can we hook an LLM into this?
  3. (+?) Catch deceptive alignment early in training via LAT
  4. (+?) Empirical MAD
  5. (+MBW) AI presentation for medical audience
  6. Will o3, or an RL agent, press the FDT button?
  7. (+MK, +others) reach out to senator BB + staffers
  8. (+JS) How likely is AI takeover? (comms version)

slightly less exciting projects

  1. Can we train DDTers somehow?
  2. Are current LLMs DDTers? (e.g. measure anthropic uncertainty)
  3. Fix backdoor issues from my LW post
  4. Can we make a reasoning model feel OK about being turned off, via RL?
  5. Emergent misalignment follow-ups
  6. PC: easy-to-hard generalization experiment. LG: figure out a science of why this happens
  7. Can we do CIRL?
  8. Come up with control ideas
  9. Come up with model orgs ideas