🐱 Skyspace3.0

questions 03.10.25

Mar 17, 2025

current projects

(+BH) Is activation steering more powerful than prompting?
- I feel like we’re mostly done with this
- Although we do wanna try some more prompts
fixing causal scrubbing
arc stuff

pretty exciting projects

(+? maybe DL/SZ/BH) Fix Causal Scrubbing
Map vs territory (i.e. camera): which will an RL agent learn? Can we hook an LLM into this?
(+?) Catch deceptive alignment early in training via LAT
(+?) Empirical MAD
(+MBW) AI presentation for medical audience
Will o3 or an RL agent press the FDT button?
(+MK, +others) reach out to senator BB + staffers
(+JS) How likely is AI takeover? (comms version)

slightly less exciting projects

Can we train DDTers somehow?
Are current LLMs DDTers? (eg measure anthropic uncertainty)
Fix backdoor issues from my LW post
Can we make a reasoning model feel ok about turning off via RL?
Emergent misalignment followups
PC easy-to-hard generalization experiment. LG: figure out a science of why this happens
can we do CIRL??
come up with control ideas
come up with model orgs ideas
