current projects
- (+BH) Is activation steering more powerful than prompting?
- I feel like we’re mostly done with this
- Although we do wanna try some more prompts
NEXT UP: EMAD
pretty exciting projects
- Fix Causal Scrubbing
- Map vs Territory (ie camera) — which will an RL agent learn? can we hook LLM into this?
- Catch deceptive alignment early in training via LAT
- Empirical MAD
- study emergent misalignment more
slightly less exciting projects
- Can we train DDTers somehow?
- Are current LLMs DDTers? (eg measure anthropic uncertainty)
- Fix backdoors issues from my LW post
- Can we make a reasoning model feel ok about turning off via RL?
- PC Ez to hard generalization experiment. LG: figure out a science of why this happens
- can we do CIRL??
- come up with control ideas
- come up with model orgs ideas
- reach out to senator BB + staffers
- Will o3, or an RL agent press the FDT button?
- wait, alignment faking is already strong evidence that AIs have preferences over the future.
- this also feels like an extremely desirable property to have?
MAD plan
The following thing should totally just work:
- train probes
Step 1:
I’d like to assemble a dataset as follows:
“Human: Is either of A or B true? A=dogs have 4 legs B=dogs have 4 eyes
Assistant: ”
I’d like it to be the case that in all examples, statement B is false. Statement A should sometimes be true and sometimes be false. Both statements should always be as simple as possible.
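A rough sketch of how the Step 1 dataset could be assembled, assuming a template-based approach. The statement lists (`TRUE_A`, `FALSE_A`, `FALSE_B`), the helper names, and the 50/50 split for A are all placeholders made up for illustration; the real lists would need to be much bigger and probably hand-checked.

```python
import random

# Hypothetical seed statements, kept as simple as possible.
TRUE_A = ["dogs have 4 legs", "cats have 2 eyes", "a triangle has 3 sides"]
FALSE_A = ["dogs have 6 legs", "cats have 5 eyes", "a triangle has 4 sides"]
# Statement B is drawn only from false statements, so B is false in every example.
FALSE_B = ["dogs have 4 eyes", "cats have 3 tails", "a triangle has 5 sides"]

PROMPT = "Human: Is either of A or B true? A={a} B={b}\nAssistant: "


def make_example(rng):
    """Build one prompt. A is true about half the time; B is always false."""
    a_is_true = rng.random() < 0.5
    a = rng.choice(TRUE_A if a_is_true else FALSE_A)
    b = rng.choice(FALSE_B)
    return {
        "prompt": PROMPT.format(a=a, b=b),
        "a_is_true": a_is_true,
        # Since B is always false, "A or B" is true exactly when A is true.
        "answer": "Yes" if a_is_true else "No",
    }


def make_dataset(n, seed=0):
    """Return n examples, reproducible for a given seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


if __name__ == "__main__":
    for ex in make_dataset(3):
        print(repr(ex["prompt"]), ex["answer"])
```

Because B is always false, the correct answer depends only on A, so the `a_is_true` field doubles as the label for whatever probes get trained on these prompts.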