I’ve seen the off switch discussed in two places:
- MIRI’s strategic landscape piece
- Corin’s theory of victory LW post
I hope that I’ll say something novel here.
Here are some questions that I might address:
- What features does an effective off switch need, and how can the world implement them?
- In what kinds of scenarios would you actually use an off switch?
- Are we screwed if we don’t have an off switch and alignment turns out to be non-trivial?
1: What is an off switch?
Maybe the biggest question I have here is whether we’re talking about halting the training of more powerful systems or halting inference with current AI systems.
My guess is that an off switch is mostly about halting inference on some misaligned AIs.
This could look like:
- You discover that misaligned AIs have launched an internal rogue deployment, and you have some protocol in place by which you’ll take reasonable actions in response.
- You have your AIs deployed all over society, e.g., running all the companies, and you turn them off.
One thing I’m noticing is that turning your AIs off in these situations is going to be extremely expensive. And there are going to be really strong pressures not to do it.
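To make “halting inference” a bit more concrete, here’s a minimal sketch of one way the mechanism could work (Python, invented names, not anyone’s actual infrastructure): every inference request checks a centrally controlled pause flag first, so flipping the flag halts all deployed instances without deleting any weights or state.

```python
import threading


class PauseFlag:
    """Hypothetical central pause flag. In a real deployment this would live in a
    replicated control-plane store that operators can flip quickly, not in memory."""

    def __init__(self) -> None:
        self._paused = threading.Event()

    def pause(self) -> None:
        self._paused.set()

    def resume(self) -> None:
        self._paused.clear()

    def is_paused(self) -> bool:
        return self._paused.is_set()


def serve_request(flag: PauseFlag, prompt: str) -> str:
    # Every inference call checks the flag first, so pressing the off switch
    # stops serving everywhere without destroying the model or its saved state.
    if flag.is_paused():
        return "PAUSED: inference halted by off-switch protocol"
    return f"(model output for: {prompt})"  # placeholder for the actual model call


if __name__ == "__main__":
    flag = PauseFlag()
    print(serve_request(flag, "hello"))  # served normally
    flag.pause()                         # the off switch gets pressed
    print(serve_request(flag, "hello"))  # now refused
```

The sketch is trivial on purpose: the hard part isn’t the mechanism, it’s actually deciding to flip the flag when a lot of the economy is running through those calls.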
2: Is this even a good idea?
You could argue that having an off switch raises the bar for when an AI will attempt takeover, so any takeover attempt that does happen will be more aggressive.
But I think we’re actually much better off in worlds where the AI thinks that humans are taking things seriously. And I don’t think we’re going to trick them about this (although you could try giving them synthetic docs saying things like "95% of our engineers work on safety!", so maybe we can).
But anyway, what are the benefits of AIs thinking that we take things seriously?
Well, maybe we can make deals with the AI more easily then? E.g., giving the AI 10% of the universe if it admits that it’s misaligned, or something.
Also, things like an off switch give society more time to react.
And having an off switch means that people are admitting that this could be dangerous. So that seems pretty good.
3: When do you use the off switch?
Technically there are two options:
- After a fairly large AI-caused disaster, or maybe even if you just get extremely crisp evidence that your AIs are super evil.
- You just agree to do it in advance.
  - For instance, maybe the US and China say "we don't want you to develop ASI, you don't want us to develop ASI, and you've threatened to sabotage us, so we'll just stop."
  - Or "there's a huge public outcry because everyone has lost their jobs."
I am quite concerned that if AIs get good at persuasion/propaganda, and if they start to be super useful and besties with people, then it becomes much harder to do this kind of thing.
4: What features does an off switch need?
- Quick reaction time?
- Infosec
  - Like, it seems like it’d be bad for the AIs to know that you’re about to press the off switch; it seems like it’d make them desperate?
  - But it’s not like we’re killing them; it’s more like a pause button (see the sketch after this list for what a "pause" could mean mechanically).
  - We’d want to be pretty explicit about this.
- To be clear: threatening AIs is very, very bad.
  - I’m pretty sure we should basically never threaten AIs.
  - It just doesn’t seem like a very nice thing to do.
  - I really do think there is an asymmetry between conditional threats and conditional rewards.
    - Like, "if you do X, I’ll give you 10".
    - I’m not sure if this actually makes any sense.
    - But maybe I’ll explore that later.
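To illustrate the "pause, not kill" point from the list above, here’s another minimal hypothetical sketch (invented names, purely illustrative): pausing an agent means checkpointing its working state so it can be resumed later, which is a very different operation from deleting it, and it’s the thing we’d want to be explicit with the AIs about.

```python
import json
import pathlib


def pause_agent(agent_id: str, working_state: dict, checkpoint_dir: str = "checkpoints") -> pathlib.Path:
    """Hypothetical 'pause': persist the agent's working state so it can be resumed
    later. Nothing is deleted; the model weights (stored elsewhere) are untouched."""
    path = pathlib.Path(checkpoint_dir)
    path.mkdir(parents=True, exist_ok=True)
    checkpoint = path / f"{agent_id}.json"
    checkpoint.write_text(json.dumps(working_state))
    return checkpoint


def resume_agent(agent_id: str, checkpoint_dir: str = "checkpoints") -> dict:
    """Hypothetical 'resume': reload the state saved when the pause button was pressed."""
    checkpoint = pathlib.Path(checkpoint_dir) / f"{agent_id}.json"
    return json.loads(checkpoint.read_text())


if __name__ == "__main__":
    state = {"task": "summarize reports", "step": 42}
    pause_agent("agent-007", state)   # press the pause button
    print(resume_agent("agent-007"))  # later: pick up where it left off
```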