current projects
- (+BH) Is activation steering more powerful than prompting?
- I feel like we’re mostly done with this
- Although we do want to try some more prompts
- fixing causal scrubbing
- ARC stuff
pretty exciting projects
- (+? maybe DL/SZ/BH) Fix Causal Scrubbing
- Map vs Territory (i.e. camera): which will an RL agent learn? Can we hook an LLM into this?
- (+?) Catch deceptive alignment early in training via LAT
- (+?) Empirical MAD
- (+MBW) AI presentation for a medical audience
- Will o3, or an RL agent, press the FDT button?
- (+MK, +others) reach out to Senator BB + staffers
- (+JS) How likely is AI takeover? (comms version)
slightly less exciting projects
- Can we train DDTers somehow?
- Are current LLMs DDTers? (e.g. measure anthropic uncertainty)
- Fix backdoor issues from my LW post
- Can we make a reasoning model feel OK about being turned off via RL?
- Emergent misalignment follow-ups
- PC: easy-to-hard generalization experiment. LG: figure out a science of why this happens
- Can we do CIRL?
- Come up with control ideas
- Come up with model organisms ideas