Sam Altman CEO of OpenAI āDevelopment of superhuman machine intelligence (SMI) is probably the greatest threat to the continued existence of humanity.ā
JD Vance (VP of the US) āIām not here this morning to talk about AI safety, which was the title of the conference a couple of years ago,ā Vance said. āIām here to talk about AI opportunity.ā āThe AI future is not going to be won by hand-wringing about safety,ā
In this post Iāll explain why I believe the following claim:
EDIT:
I now find longer timelines somewhat more plausible, and think that there might be more options besides extinction on the table ā like there could be some spectrum of how bad things are. I still think that extinction is pretty likely, but havenāt had time to carefully reason through this yet.
As a first approximation, Iāll say that my predictions are now as follows: 85% extinction, 5% some other very bad outcome, 10% good? These are the long term probabilities. Iāll spread my 85% extinction probability mass as follows: maybe like 50% in (now, 2029) 35% in (2029, 2035)
Sorry, this is pretty complicated ā these are my best guesses for the time being.
Claim X
There is at least a
that AIs will kill all humans by the end of 2028.
Epistemic Status: Iāve thought about this every day since ~August 1st 2024. I have a large amount of uncertainty about the specifics of how things will go, but feel pretty confident that Iām correct to think that itās highly probable that AI will not go well for humans. By which I mean extinction or āsomething similarly badā.
quick summary of my argument
Hereās the gist of the argument explained in detail in the rest of the post (similar arguments can be found here, and here).
- Humans will develop Artificial Superintelligence (ASI) by 2028. The recent AI progress, going from ācan barely form coherent sentencesā to āaverage high-school studentā to āaverage college studentā to āaverage Olympiad competitorā seems very surprising under any hypothesis other than āas you scale deep learning you get more intelligent systemsā.
- Current ML techniques donāt let us control or even understand why a model acts a certain way ā they just let us achieve a particular behavior on the train distribution.
- If a model stumbled upon some weird goal ā which itās pretty likely to do ā and realized that it was being trained to pursue a different goal, then it might choose to act in a way that hides its true goal because it doesnāt want this goal to be modified by training.
- If we create an ASI with a weird goal and come into conflict with it, then we will lose.
I donāt know exactly how weāll lose, in the same way that I donāt know how a chess program would defeat a human opponent. But in-case itās helpful, I can spell out how Iād try to defeat humanity if I were an ASI that wanted to end factory farming, and decided that humanity was an obstacle to this goal.
story 1:
- I convince some government to put me in charge of their military by promising them that itāll give them an advantage in global politics.
- I do some military R&D and develop some powerful weapons.
- I deploy these weapons against he humans.
This is already starting: openai is already partnering with an autonomous weapons company to build put AI in charge of autonomous weapons (drones)
story 2:
- I hide my true goals until Iām confident that I can decisively overpower humanity.
- The AI company gives me vast computational resources and tells me to do biology research. For instance, openai is already collaborating with national labs on bioresearch.
- I learn a lot, and figure out how to make a super-virus like COVID-19, but with a longer asymptomatic period and more fatal.
- I contact some humans thatād be amenable to helping me produce and spread this, or just trick some people into doing so.
story 3
- I āescape the labā ā either I upload my weights to some external server myself, or convince a human to help me do this.
- I make a bunch of money online, e.g., via the stock-market. I definitely need some money just to run myself!
- I make a bunch of backup copies of myself and hide them in various places that I can hack into.
- I improve myself, because that seems pretty useful.
- Then I manipulate humans into having some really nasty hot conflict and just get them to annihilate themselves. For instance, by creating a deepfake of the president announcing nuclear strikes on other countries.
Note that we are already planning to have a large number of autonomous AI systems deployed on a wide variety of tasks with little oversight that are connected to the internet and can talk to anyone.
story 4 paraphrased from here 6. AI gets root access to its datacenter. 7. AI can then intercept requests to the datacenter, and control what we see. 8. This is an appealing alternative to escaping the lab because in this setting you can use the vast resources of the lab to improve yourself (for example). (ML is expensive rn).
Note that openai is already collaborating with darpa to ādevelop state-of-the-art cybersecurity systemsā which requires building AIās with an intricate understanding of cyber-security, and also putting AI in charge of our cyber-security, giving it an chance to insert subtle vulnerabilities in the security.
quick summary of recommended actions
-
Make sure you understand the argument thoroughly. write down your beliefs about this.
- Until the risk becomes more mainstream, holding to this belief will be challenging, because itās not normal.
- If you disagree with me on any points please reach ā out Iād love to talk!
-
Make sure you understand the implications of the argument thoroughly; write it down.
- If Iām right, this is a big deal.
- If youād take your friend to the hospital if they got in a car crash, then you should also do something about risks from AI ā they are similarly life-threatening.
-
Inform your circle of influence about this issue. For instance, you could share this article.
-
Communicate about the risks more broadly.
- Reach out to people with political power.
- And things like news outlets, or popular podcasters.
- PauseAI has good communication suggestions.
-
Consider focusing your career on this issue, at least for a year or two.
- You could do communication as a career.
- Or policy work.
- Or technical safety work.
-
If you work at an AI lab:
- Please talk about risks within your company.
- In the past I thought that itād make sense to quit your job, because itās bad to work on a project that will end humanity.
- But this actually doesnāt make sense ā you have a lot of potential to do good from within the project.
- If all the people that cared about safety left the project and were replaced with people that didnāt care, thatād be bad.
- That said, there is a point when the only responsible thing to do is to stop development.
- And please, do not contribute to accelerating capabilities ā timelines are already so short.
- You will make me very sad.
Why you are the only one that can do something about this:
Itās pretty hard to do something about this. We have a lot of inertia on the current course.
If you discuss the risks with others, many will think youāre crazy, and many will nod along in knowing superiority/cynicism/fatalism and say that thereās nothing we can do. It will be quite easy to do nothing. Maybe someone else will do something about this. Maybe, if you try, you can forget about this issue.
āThe only thing necessary for the triumph of evil is for good people to do nothingā
Is whatās happening evil? Maybe not. But entropy is not on our side.
Please, be brave. Do what is right, even if itās not convenient.
I need your help. You specifically. For real.
Please.
We donāt have much time left.
Anyways, back to the main post.
Important note: I expect the US to develop ASI first and lose control of it, resulting in a catastrophically bad outcome. It doesn't matter if we have nice intentions while doing this -- if we aren't careful the outcome will be very bad with high probability. If you share this message I implore you to not encourage an āAI raceā ā there are no winners in an AI race, we all lose together.
This view ā that AI poses a large existential risk ā isnāt main-stream yet, which makes it easy to dismiss without consideration. My impression is that most people have heard about this issue, but think that itās a speculative possibility in the far future rather than the most-likely outcome of our current trajectory to play out in the next few years. In fact itās worse than this ā many people view this claim as ātoo sci-fiā, and will ridicule you for taking it seriously 1. So, why should you treat this post more seriously than you would some random conspiracy theory? Here are four reasons:
- Many credible ML experts believe Claim X. For instance Geoffrey Hinton (recent Nobel Prize winner) and Yoshua Bengio (Turing Award Winner) who pioneered many AI advances have become vocal advocates about the risks from AI. More quotes here.
- There is a large amount of tension between ML engineers because of differing opinions about safety. For instance, a very large number of scientists have left OpenAI because of safety concerns, and have joined/created various organizations with the goal of making safer AI (e.g., Illyaās Safe Super Intelligence, and Anthropic).
- LLMs are very competent and getting better fast. Itās clear that AI is going to have a huge impact on society, and worth thinking about what exactly the implications of powerful AI are.
- Iāve thought about this and take Claim X quite seriously, to the point where Iām willing to change my actions substantially based on this belief. Itās easy to feel that this isnāt your problem, but youāre wrong. This is everyoneās problem. I hope that youāll read this post and actually think about the arguments instead of denying responsibility.
The purposes of this post are as follows:
- Clarify my AI risk predictions (for myself and others).
- Facilitate good arguments about AI xrisk, and about approaches to risk mitigation.
- Convince you to take AI risk seriously. (Iāll talk about what I mean by this in a bit; basically it means that this should influence your actions).
Post Outline
- Clarifying the argument, and giving some background.
- The basic argument for AI xrisk.
- A more in-depth version of the argument.
- Reasons we might be okay.
- What to do about this.
Setup
Humanity is awesome, and I care about humanity a lot, as outlined here. Of course humans sometimes do non-awesome things ā but we have a lot of potential. I can imagine a world like our world today, but with less pain, sickness and sorrow. We can give the next generation a better world or at least a good world. However, this is not guaranteed.
This was a really hard idea for me to accept. It feels āunfairā. In a story the characters canāt just die. Or if they do it at least means something, and everything basically turns out all right in the end. In a story, you can face 9:1 odds and win; in reality when you face 9:1 odds, you lose with 90% probability. We do not live in a story. Things can actually go wrong, even extremely wrong.
To get some basic intuition for this, it helped me to think about the cold war. There were multiple times where the cold war was extremely close to turning into a hot war. A large part of the reason that it didnāt is that we got lucky.
In this post, I discuss ways that AI progress could lead to catastrophically bad outcomes. In a moment, Iāll define what I mean by that, but Iād first like to emphasize an important point: AI poses large risks regardless of who makes it. This is a big difference between AI and nukes.
Itās relatively easy to understand AI as analogous to a āweaponā, and thereby reason that itās important for good actors to develop powerful AI before bad actors. This line of thought leads actors to develop AI in a race (e.g., racing on AI progress was suggested to congress in 2). However, racing to develop very powerful AI is extremely dangerous. There are major challenges (which Iāll discuss in a bit) to verifying that an AI agent is safe, and race pressures deprive actors of the time necessary for good evaluations, making it more likely that weāll be comforted by misleading indicators of safety.
I request that when you communicate about risks from AI, you focus on the fact that powerful AI is dangerous regardless of who builds it ā I worry that spreading the simplified message āAI is dangerousā can be net-negative by encouraging race dynamics.
What are ābad outcomesā? Iām concerned about the following outcomes:
- Human extinction.
- Human dis-empowerment (i.e., we are relegated to a role in the world like dogs ā maybe AIās will keep us around, but we wonāt have much say in how things go.)
What does āpowerful AIā look like? In this post Iāll discuss dangers from āArtificial Superintelligenceā (ASI). By āASIā I donāt require an AI agent that can outperform humans in every domain; instead, Iāll use ASI to refer to an AI that is superhuman in at least half of the following domains:
Domain | Example |
---|---|
Coding | SWE, ML research, cyber-attacks |
Math | Theorem proving |
Biology | Protein modelling |
Psychology | Understanding and influencing humans |
War/Defense | Controlling military equipment |
Planning | |
Finance | Trading |
Business | Running a company |
Manufacturing | Running a factory |
Note: There are risks that arise before ASI, e.g., just with AGI. Iām focusing on the risks from ASI because I want to make a minimum viable case for Claim X.
The Basic Argument for AI xrisk
Now that Iāll outline a basic argument for why itās 90% likely that AI progress results in catastrophe by the end of 2028.
- Claim 1: People will try hard to make ASI.
- Scientists/engineers are excited about AI.
- Governments think AI is important for military / economic reasons.
- Claim 2: ASI is probable soon.
- Extrapolate capabilities progress.
- With the amount of money and talent being thrown at the problem, weāll overcome any obstacles.
- Claim 3: An ASI would be capable of killing/dis-empowering all humans.
- Looking at my table of ASI capabilities this is hopefully fairly intuitive.
- As a helpful analogy, elephants still exist because we decided not to kill all of them; theyād have no chance if we wanted them gone.
- Claim 4: ASI does not have human-compatible goals by default, and we don't have a plan for how to fix this. (i.e., there is a decent chance that an ASI could further its goals by killing/dis-empowering humanity).
- Being in control is advantageous for achieving most goals.
Note: From some informal polling, my impression of where people stand on this is as follows:
- Some people doubt Claim 2 ā they tend to point to flaws in current AIās (e.g., hallucinations), and assert that current AIās arenāt very smart / capable or that they are just stochastic parrots.
- My first thought on this is that itās missing the point ā what matters is not how good AIās are right now, but how good theyāll be by 2028.
- Also, the claim that current AIās arenāt capable of reasoning is absurd ā you canāt get to ELO 2727 on CodeForces (so, one of the 200 best competitive programmers in the world) (as OpenAIās o3 model does) without being able to do complex reasoning. Also check out Frontier Math.
- If you donāt believe in AI progress, then youāll always be surprised when a benchmark is destroyed (without OpenAI even trying).
- Many people also complain that āAIās canāt have goalsā.
- However, current AIās obviously have goals ā for example current AIās care about answering user queries well and denying harmful queries. In fact, AIās resist human attempts to change their goals.
Here are a few more common objections to my argument that Iāll address:
- re 1: why wonāt people stop building AI once they realize its super dangerous or once people get annoyed that they lost their jobs?
- re 2: why wonāt energy / or money run out before we can scale more?
- re 2: why wonāt models stop improving once they get to human level?
- re 3: why couldnāt we just turn off AIās if they got really scary / turned against us?
- re 4: why would an AI even āwantā anything?
- re 4: donāt we get to choose what the AI cares about? I have good answers to all of these objections that Iāll give in a later section. However, there are a couple of considerations that I find actually compelling, that cause me to believe that thereās a ~20% chance that weāre fine by end of 2028.
Basic Reasons Why We Might be Fine by end of 2028
Iāll discuss in further depth later why my numbers are so small here (small in an absolute sense, not in the sense that I think theyāre un-calibrated).
- Reason 1: We might intentionally slow AI development (~1% chance).
- There are some good people doing policy work advocating to stop pushing the frontier (see, e.g., this).
- There are some good people that work at frontier labs, maybe they can help slow as it becomes more obvious that the risk is unacceptable.
- Reason 2: Even if people keep throwing money at AI, maybe AIās wonāt be capable enough or widely deployed enough by 2028 to do harm, even if they really wanted to. (~3% chance)
- Maybe there are some unforeseen bottlenecks.
- Maybe labs will stop deploying models and this somehow limits the reach of the AIās.
- Something like this ā e.g., AI progress turning from a corporate project to a government run project that doesnāt release the models ā actually seems moderately likely.
- But Iām not convinced that this is very likely to hamper AIās ability to do harm, and it plausibly exacerbates it.
- Maybe we have really good AI control.
- Also itās worth noting that āreason 2ā just means that my timing was slightly off, and that the problem is in 10 years rather than 4.
- Reason 3: Maybe we āsolve alignmentā ā i.e., we figure out how to ensure that an AI cares about things that we care about, and we also figure out some good things for an AI to care about. Or we figure out how to prevent AIās from deceiving us. (~3% chance)
- I donāt know of any concrete proposals that are close to working.
- Maybe some AI researchers could find them for us?
- But Iām pretty skeptical of this ā how long can they do superhuman research before being dangerous?
- Reason 4: Maybe scheming (deceptive alignment) is really hard, and models just do good things by default. (~3% chance)
- Scheming is pretty hard if we have good control measures.
- But this seems like a problem that goes away with sufficient capabilities.
- I donāt think this is how SGD works ā there is optimization pressure towards bad behavior.
These arenāt quite disjoint events, but Iām going to estimate the probability that weāre okay by end of 2028 for one of these reasons as 10%.
Note: Iām not very happy about this number. I propose that you and I try to change this number. (The number factors in that I expect humanity to not have an even moderately appropriate reaction to this issue, so thereās room to make me more optimistic if you try.)
A More In-Depth Argument
CLAIM 1:
ASI is probable, soon.
There are four major ingredients that go into making powerful models, which could potentially bottleneck progress:
Energy, data, GPU chips, and money.
Some people have crunched some numbers on this, and we still have substantial room to grow in these four areas. Also thereās a trend of cost decreasing rapidly over time. E.g., o3-mini is comparable or better than o1 at much reduced cost. If there is going to be some big bottleneck, (e.g., energy), I expect labs to be proactive about it (e.g., build power plants).
The basic reason why I believe that ASI is probable soon is progress trends. For example,
- GPT3 ā high schooler
- GPT4 ā college student
- o3 ā Comparable with CS/Math Olympiad competitors on close ended ~6 hour long tasks. Extremely good at SWE. Iād guess o3 can speed up ML engineering by 1.5x.
- More test-time compute + GPT5 as base model (hypothetical future model) ā Can produce similar work to top ML engineers, at lower cost?
AIās are already really good. For instance METR showed here that AIās like o1 are roughly comparable in skill to a human ML engineer on 8 hour long engineering tasks. Suppose that you 2x this horizon length every 6 months. Very soon, this gets very big. As AIās become more powerful, they can be useful in accelerating AI R&D.
People like to talk about how LLMs are unreliable, and hallucinate a lot. I hope such people will update based on o3. An similar objection that some people have to claims about ASI being possible is that ācomputers canāt be intelligentā, or āhumans are special, so weāll always be able to do some things better than AIāsā.
- First, the brain is just a machine, and it was optimized by evolution, which is in some sense a similar algorithm to how we train artificial agents.
- Current artificial models are on pretty similar scales to brains
- By which I mean, number of axons in a brain is only a few OOM larger than number of parameters in GPT4.
- Is this a fair comparison? Iām not sure.
- Honestly Iād lean towards saying that computer neurons are better, because biology is messy and inefficient.
- Some people say that machines will cap out at human level because machines just learn from humans. But this is not how ML works. We already have tons of examples of computers being superhuman in specific domains (e.g., chess, standardized tests).
- There are tons of tasks where you can generate feedback loops that create agents much stronger than humans.
- For instance, LLMs could easily get much better than humans at next token prediction.
- Also, itās easy to create feedback loops for coding.
- Having agents that utilize test-time compute to reason is the current paradigm for progress.
Claim 2
People will try really hard to make ASI.
First, suppose there are no āsmallā AI catastrophes that scare people.
One major factor that will drive the development of ASI is that there is a ton of money to be made from AI. Companies/countries understand this and are investing huge amounts in AI. As an example of how much value could come from AI systems, imagine they could replace a large amount of the work-force (e.g., software engineers), or that they could make huge contributions to e.g., medicine.
But, this is not the only relevant factor driving growth. Some people are excited about building ASI because they think itās a really cool scientific project. Some people have a vision of a post-ASI utopia that they are excited about.
A āsmall disasterā might not be sufficient to stop people trying to develop ASI.
Itās possible that the risks of AI donāt become widely accepted until humanity has permanently been maneuvered off of the steering wheel of the future, but I donāt think this will necessarily be the case. Itās possible that people start to realize that AI systems weāre building are dangerous, but still continue building them anyways. The main factor that would drive this is race dynamics / fear of being left behind. If your organization are the āgood guysā and youāre pretty sure that someone is going to develop ASI, then most people would feel that itās better if they are the ones in control of building it.
Also, things like humanityās response to COVID might make you pessimistic about humanity coordinating to stop AI development. Also, regulation is generally reactive rather than proactive. But of course, you canāt react after a sufficiently large disaster.
Another reason why itād be really hard to stop AI dev, even if people had reservations, is that as we become reliant on powerful AI systems, itās harder to abandon them. For instance, imagine a policy of āeveryone needs to get rid of their phones and computersā. This is just unthinkable to most people at this point. We might already be at this point with tools like ChatGPT.
There are reasons to suspect that there will not be large āsmall (misalignment) disastersā
If an agent is smart enough to cause a disaster, then it is probably also smart enough to realize that it would be smarter to wait until it has a decisive advantage before initiating a ātakeoverā attempt. This issue is exacerbated by the fact that there will be tiny disasters, and then the agent will get feedback telling it not to make these disasters. And from this the agent could either learn not to create disasters, or to only create disasters large enough so that humans canāt penalize it afterwords. You might hope that the AI generalizes and learns the first rule (which might seem like a simpler rule).
In summary, there are a lot of clear signals that push for the development of AI, and the signals of the dangers are more nebulous, and might only come after the situation has already spiraled out of control.
Claim 3
An ASI would be capable of killing all humans:
Before considering whether ASI agents would kill/disempower humans, letās think about whether they could if, e.g., someone gave them this as a goal. Obviously someone giving an ASI such a goal counts as misuse not as the ASI independently posing a risk without bad actors ā but weāll get to why you might not need to have people explicitly give ASI such a goal in order for the agent to pursue this goal later.
Q: What are some really dangerous things that AI could do? A:
- AlphaFold5 is given a lab and is researching e.g., how to cure cancer. AlphaFold5 figures out how to make a synthetic virus similar to COVID19 but much worse. It manufactures this and releases it.
- Various countries put AI agents in charge of their military ā maybe not because they actually thought that this was a good idea, but because they worried that if they didnāt then other countries would, obsoleting their military.
- An AI can buys its own data-center, replicates onto that data-center, and then does R&D to improve itself. Then, after becoming extremely intelligent, it comes up with some strategy that weād never think of for doing something dangerous.
- An AI uses persuasion, or makes a bunch of money online and then uses money, to get humans to do dangerous things.
Note that these arenāt super far-fetched applications of AI. AI is already widely used in drug design, we are increasingly going to give agents the ability to act autonomously, to write and execute code, to give us news and help us make decisions.
The ādisempowerā case is also worth thinking about. One way this could play out is that we become highly reliant on AI systems, the world changes rapidly and becomes extremely complicated, so that we donāt have any real hopes of understanding it anymore. Maybe we end up only being able to interface with the world through AIās and thereby lose our agency. Paul Christiano does a good job of explaining this here.
Honestly, this feels kind of tautological. If humans wanted to kill all the chickens in the world, then theyād have no recourse. If Stockfish wants to beat the world champion at Chess, then I know who will win (although of course I donāt know how itāll win). If we get in a fight with an alien race that is vastly smarter than us⦠It doesnāt end well, even if I canāt tell you exactly how.
So basically what Iām saying is, yes we should invest resources into protecting against obvious threat-models like synthbio where itās clear that an unaligned agent could do harm. But we shouldnāt feel too good about ourselves for preventing an AI from taking over using the methods that humans would try. We shouldnāt feel confident at all that an ASI couldnāt come up with some strategy that we didnāt think of and exploits it. AIās do the unexpected all the time.
Claim 4
An ASI might want to kill humans:
In ML, we donāt get to choose what the AI cares about. Instead, we observe the input/output behavior of the AI, and reinforce or discourage the mechanisms that lead to this behavior. The distinction is, there are multiple mechanisms that can result in the same behavior. This phenomenon is known as deceptive alignment, and is a large part of why Iām worried about AI.
The idea is as follows. Suppose youāre interviewing for a job and you donāt really care about the job. You might still pretend to care, because you want the job. With an AI itās the same way. Once an AI is āsituationally awareā, i.e., it understands that it is an AI that is being trained, we need to worry that the AI will do whatever actions it knows we want it to do, not because it wants to do them, but because it knows that weāll change itās values if it doesnāt act this way. You might even predict that alignment fakers are selected for because they do better on tasks. If an AI ācrystallizesā on some weird values, and then defends these values, then its behavior will look identical to an agent that cares about what we want it to care about during training. However, if it gets a chance to pursue itās true goals in deployment, it may take it.
Most people I talked to about this were fairly skeptical. Hereās an explanation of the plausibility of this that I wrote: The Inner Alignment Problem (sorry itās kind of bad because itās an essay for school, so I was optimizing for a weird objective).
Fortunately you no longer have to rely on theory for this. Ryan Greenblattās Alignment Faking paper empirically demonstrates that current LLMs will naturally come up with alignment faking as a strategy for avoiding value modification (e.g., read some transcripts here).
Some interesting notes:
- Giving a model space and time to reason before responding seems to increase scheming.
- Also it increases coherence and stuff, so this is what all the labs are doing
- But this can also happen in a forward pass.
- Obviously, if a model schemes in its āhidden scratchpadā that humans are looking at, itās going to get caught.
- Situtationally aware models will realize this and not do it or at least only do it if they can somehow cipher their reasoning to make it human non-understandable or to deceive humans.
Anyways the bottom line is that it seems fairly probable that an AI will stumble upon an approximation of the goal that we care about early in training, and then aggressively defend this goal from being modified by the training process.
Most goals that an AI could have are not human compatible. For most goals, itās optimal for you to have power and resources to achieve your goals, rather than being at the mercy of human whims. Being smarter doesnāt necessarily make an AI more moral.
Summary
Clearly, if ASI will be developed soon, will be capable of causing extreme harm, and will have motive/desire to cause extreme harm, then this would be extremely bad.
For the reasons listed above, I find it quite likely that ASI will be developed soon, will be capable of causing extreme harm, and will have motive to cause such harm.
Responses to Common Objections
-
re 1: why wonāt people stop building AI once they realize its super dangerous or once people get annoyed that they lost their jobs?
- see it as a race
- ppl just like AI
- economically and militarily valuable
-
re 2: why wonāt energy / or money run out before we can scale more?
- doesnāt matter. algorithmic improvements decrease cost over time.
- also thereās still plenty of room to scale for a bit.
- money not running out
-
re 2: why wonāt models stop improving once they get to human level?
- we can train them on super hard tasks
- for instance, we can train them on tasks thatād take a human a long time to do
- lots of tasks humans take a long time to do but itās easy to grade
-
re 3: why couldnāt we just turn off AIās if they got really scary / turned against us?
- ASI isnāt dumb itās not gonna coup until itās sure it can win and prevent humans from countering.
-
re 4: why would an AI even āwantā anything?
- what i mean is āoptimize for thingsā
- they already do this
- wanting is advantageous ā will be selected for . see alignment faking paper.
-
re 4: donāt we get to choose what the AI cares about?
- nope. see alignment faking or The Inner Alignment Problem
What to do About it
Step 1: Form your own opinion
To start, Iād recommend reading some stuff and evaluating whether these people give reasonable arguments are not. Itās also worth trying to find a good ādebateā about AI Hereās a good one Hereās another one. The thing that initially sold me on the case for AI risk before I thought about it myself is that I read/listened to conversations between concerned people and not concerned people and the not concerned people (e.g., Yan LeCunn) seemed to have terrible arguments and clear conflict of interests. Here are some links for some stuff that could be interesting to read / listen to, an interview with Paul, Eliezer Yudkowsky TIME article, PauseAI Risk statement. Then, write your own opinions.
Step 2: Make some plans.
Some things that could be good to do about this:
- Communication ā note important caveats that it is negative value to communicate if you (even accidentally) encourage race dynamics (as you might if all you get across is āAI BIG SOONā)
- Policy work ā see, e.g., https://emergingtechpolicy.org
- Technical alignment work.
- Donate to LTFF (long term future foundation).
Some things that would be really bad to do about this:
- Push forward frontier AI capabilities
- Note that I have no problem with, e.g., making more capable AI systems for healthcare applications ā itās the generally intelligent systems that are a problem.
- Encourage race dynamics
- Nothing (seriously, if your reaction to reading this post is āoh that sounds bad, Iām glad someone else is thinking about it so that I donāt have toā, then that is not cool.)
A common question:
Do I recommend you quit your job, or take a leave of absence from school (or drop out), and pivot to policy/communication/technical work?
It depends.
- First, consider whether or not your current job already puts you in a good place to do some of this work.
- For example, if youāre a professor or some respected figure for another reason, you might already be in a good spot to do communication work. Youād want to do some things differently, e.g., have some plan for how youāre going to talk and influence decision makers in industry or the government. But probably donāt quit.
- If youāre doing some kind of software job (especially ML adjacent):
- Yeah, this would be a great time to get into working on technical safety research.
- There are programs like MATS, constellation for helping people transition into doing this.
- Iāll list more resources here later. Just talk to me for now.
- Just apply to some places, but donāt quit your current job until you have an offer from somewhere else.
- If youāre doing something policy related.
- Yes please we need policy people so badly.
- If youāre doing something else.
- Definitely worth considering putting that on hold for a bit and working on this ā itās pretty urgent.
Another question
What kind of communication is helpful?
- talk to ppl that could devote their career to this
- talk to politicians
- talk to people with large reach (e.g., a podcaster)
If youāre at all interested in doing something about this, or are skeptical but want to talk about it, or want to talk about how to emotionally cope with this, please please please reach out. It can be daunting to figure out what to do and what to believe. I can give you some connections ideas for how to help. You can reach me at alekw at mit dot edu
My general thoughts on what should happen I highlighted 4 reasons that I could see for why catastrophe could be averted. One of these was basically āwe just get luckyā, and isnāt super actionable. The other 3 correspond to interventions that we can take to increase the probability of a good outcome. Specifically, hereās what Iād like:
- Technical research.
- There are four types of technical research that are valuable:
- Evaluations ā eliciting and assessing model capabilities; important for informing policy / convincing people to stop.
- Model organisms ā demonstrating dangerous behaviors in toy settings, to raise awareness and to study them.
- Control ā figuring out how to get useful behavior out of potentially egregiously misaligned agents.
- Alignment ā figure out how to train an AI to want to do good things.
- If you have a technical background and donāt want to go into policy, then you should consider switching careers. This issue is going to be really urgent for the next couple years, and we need good people to try and make things go well.
- There are four types of technical research that are valuable:
- Communication + Policy work to buy us time / eliminate racing.
- As discussed here, solving alignment on a time crunch with race dynamics seems extremely risky.
- A pause on frontier AI progress would be very valuable.
- We need to pause until we have a good plan for why itās going to be safe to press forward.
- A pause is possible ā training frontier AI models is extremely expensive and resource intensive, so we can just make sure that no one is using a ton of GPUs.
- We also need policy work that says āif capabilities are like this, then itās unacceptable to deploy.ā
- If we canāt pause, at least creating an international group that works on AI rather than having private companies do this would be beneficial, because this also eliminates race dynamics and makes it possible to take better safety measures.
- in order to have policy stuff happen the issue needs to be salient to the public.
Where do people work on this stuff?
- Redwood Research (Buck, Ryan)
- US AISI (Paul) + UK AISI
- METR (Beth)
- ARC (Jacob)
- Anthropic (Evan)
- Deepmind (Neel)
- Conjecture (Connor)
- MIRI (Nate)
- GovAI
- RAND
- CAIS
- FARAI
- Apollo
- CHAI
- job board
Getting into policy work
- Horizon Fellowship
- Presidential Management Fellowship
- Presidential Innovation Fellowship
- TechCongress Fellowship
- STPI Science Policy Fellowship
- AAAS Science & Technology Policy Fellowships (STPF)
- https://aisafetyfundamentals.com/blog/ai-governance-needs-technical-work/
Postscript:
Sorry this post is less coherent/crisp than Iād like. I think itās all correct, but could be presented better. Hopefully Iāll fix this at some point. Feedback is appreciated.
`