Here are some random project ideas that I’d be pretty excited to see done well, and some other random thoughts.
Prompted by listening to a few mins of Bengio’s recent Simons Institute talk and reading some of Daniel’s LW posts.
-
Bengio mentioned setting up “public risk evals” for AI companies --- apparently a similar concept exists in insurance. The idea is to incentivize companies to spend money on safety.
- If done well the company would get points for
- implementing control measures
- devoting compute to AIs doing AI safety research
- this is a pretty ambitious form of “transparency measures”
- FLI released a “scorecard” of safety: https://futureoflife.org/document/fli-ai-safety-index-2024/ --- this has roughly the right vibe.
-
Daniel Kokotajlo has some super epic policy proposals:
- Companies should put a doc on their website that states how much risk they believe they are imposing on the public, and justifies that estimate.
- Also they should explain why they are building AGI
- Companies should make some agreement like “we will not build ASI in secret, or without putting it to a vote and getting approval from a majority of people”.
- Companies should make some agreement like “if we discover that our AI has sabotaged our data center, then we will not just do some superficial patch and keep going; instead we will actually slow down, clean up responsibly, and roll back to a different checkpoint or whatever”.
-
Field building things:
- Someone should conduct a survey or something of how people decided to care about AI safety
- For instance — Bengio was convinced to care bc GPT3 came out and he was spending time with his grandson and he was like “will he even have a life?” and then he decided that he had to do something about this.
- He specifically said something like “I had heard all the arguments for AI risk for years, and had vaguely been following the field, but this logic never really moved me.” He further noted that “working on capabilities was fun and I thought it would be really good, and humans have cognitive biases to shy away from uncomfortable ideas.”
- I’d love to clarify the role that logic plays in convincing ppl.
- My current best guess is that the ideal strategy is:
- Start with ethos --- e.g., read the CAIS statement on risk
- Next go for pathos --- tell an emotional story about why you care about mitigating risk, or just be like “crumb, it’d actually be quite bad if humanity ended in the next couple years.”
- Then go for logos. Explain risks from first principles, sprinkling in some empirical demos like alignment faking / apollo scheming reasoning evals / reward hacking demos.
- the vast majority of ppl will have the following misconceptions:
- AI can’t get superhuman
- Counter with Chess
- Causing extinction is hard
- Counter by talking about pandemics / mirror bacteria / nuclear war
- Current AIs are nice
- Counter with current empirical demos of AIs doing pretty concerning stuff
- AIs won’t have convergent instrumental subgoals
- Counter by explaining instrumental convergence: almost any goal is easier to achieve if the AI acquires resources, preserves itself, and avoids being shut down.
- Society will respond responsibly
- Counter by saying “um doesn’t look like we’re being that responsible atm”
- I think very few ppl are convinced purely by logic.
- But it’d be interesting to have data for or against this hypothesis.
- Note that engaging in arguments, looking like you know what you’re talking about, having responses to all objections, and saying things that people can verify is a form of signalling --- which is a type of ethos, I guess.
- OTOH emotional appeals are probably pretty fragile.
-
How can we stay epistemically alive in a world with super-persuasive AIs?
- obvious solution --- don’t talk to super-persuasive AIs.
- unfortunately this is probably not possible
- also, if humans get really good args from super-persuasive AIs and then talk to you, what’re you going to do…
- tbh this seems pretty cursed --- just have to hope that we can build an aligned successor system more powerful than us before this is a big deal.
-
How to prevent nuclear war?
- It turns out that nuclear war would actually be ~game over.
- maybe we should not do that.