Project ideas that I’d be excited to work on:
- Influence functions for fine-tuning data (a minimal sketch appears after this list).
- Understanding AIs via fine-tuning on a single data point.
- Data filtering for misalignment.
- Build a classifier.
- Study conditional pre-training. (Anthropic has a blog post about this; see also Turntrout's blog post.)
- Study hyperstitioning, etc. (or just check how compelling the literature is).
- Fixing scalable oversight.
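
For the influence-functions idea above, a minimal TracIn-style sketch of what I have in mind (my assumed framing, not a specific codebase): score each fine-tuning example by the dot product of its loss gradient with the gradient on a query example. `model`, `loss_fn`, and the example tensors are placeholders.

```python
# TracIn-style influence sketch: how much does one fine-tuning example
# push the model in the same direction as a query example?
import torch

def flat_grad(model, loss):
    """Flatten gradients of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence(model, loss_fn, train_example, query_example):
    """Approximate influence of one fine-tuning example on one query example
    via the gradient dot product (placeholder tensors, single checkpoint)."""
    x_tr, y_tr = train_example
    x_q, y_q = query_example
    g_train = flat_grad(model, loss_fn(model(x_tr), y_tr))
    g_query = flat_grad(model, loss_fn(model(x_q), y_q))
    return torch.dot(g_train, g_query).item()
```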
Projects that I'd be excited to see happen (but that seem pretty gross, so I'd only want to do them if I had a team):
- Data poisoning.
- Maybe distilling from a deceptively aligned model gives you an aligned model (see the distillation sketch after this list).
  - But there are spooky results like the Owl (subliminal learning) paper.
- Use lack of adversarial robustness to catch schemers.
- Don’t change the player, change the game.
- Shoggoth / Face / Paraphraser
- Ryan’s interp proposals: decoding neuralese, imitative generalization, consistency lenses.
Other interesting thoughts:
- Transparency can be bad: it may incentivize companies not to look for evidence of risk, or may cause the safety team to waste their time in suboptimal places.
- AI companies are startups (different labs to different degrees) and are partially "do-ocracies".
- Seems like a good idea to keep logs of what the AIs do, and then have buffer TMs look at them, or have other labs' AIs look at them (maybe the misalignments are uncorrelated enough that this is good). A sketch follows.
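
A hypothetical sketch of that logging idea, just to make it concrete; `monitors` is a placeholder list of callables (e.g. wrappers around TMs or other labs' models) that return a suspicion score in [0, 1], and the log path and schema are assumptions.

```python
# Log each agent action as a JSONL entry, then have several independent
# monitors score every entry and surface anything any monitor flags.
import json
from pathlib import Path

LOG_DIR = Path("agent_logs")  # placeholder location

def log_action(run_id: str, step: int, transcript: dict) -> None:
    """Append one action transcript to the run's JSONL log file."""
    LOG_DIR.mkdir(exist_ok=True)
    with open(LOG_DIR / f"{run_id}.jsonl", "a") as f:
        f.write(json.dumps({"step": step, **transcript}) + "\n")

def review_run(run_id: str, monitors, threshold: float = 0.5):
    """Score every logged action with each monitor; return entries any monitor flags."""
    flagged = []
    with open(LOG_DIR / f"{run_id}.jsonl") as f:
        for line in f:
            entry = json.loads(line)
            scores = [m(entry) for m in monitors]
            if max(scores) > threshold:
                flagged.append((entry, scores))
    return flagged
```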