The challenges of artificial intelligence (AI) are immense and complex, in particular because of its socio-technical aspects and its dual nature: it is both a promise of progress and a potential source of significant risks.
➡ This document aims to present a non-exhaustive overview of the various risks associated with AI in order to guide future work by CeSIA on the safety, reliability, and ethics of AI. To learn more about these risks, you can read Chapter 2 of our textbook.
“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” 1
AI Risk Statement | Center for AI Safety, signed by scientists and personalities in the field of AI.
AI is a rapidly evolving field that has unpredictable and potentially transformative impacts on society. It is plausible that AI will reach most human-level capabilities in the next few years.
In 2014, the most realistic face generated by AI was the one on the far left. Since 2023, it is possible to generate any type of image in any style from a text description (source).
Rapid advances in deep learning are remarkable, making it difficult for even the best experts to predict how AI capabilities will change in the short term. We saw it with the arrival of ChatGPT, whose advanced capabilities surprised many. For example, while GPT-3 could do just better than 10% of bar exam candidates, GPT-4 demonstrated its ability to outperform 68% of candidates. This leap forward highlights the impressive evolution of AI models in the space of one year.
AIs are increasingly capable of autonomous planning. Until very recently, large language models (LLMs) were not standalone agents per se. However, systems such as AutoGPT, illustrate the conceptual possibility of converting these LLMs into autonomous agents. AutoGPT uses techniques such as scaffolding, which consists of making a language model like GPT-4 run in a loop until a specific objective is reached, by breaking that objective down into ancillary tasks. Other examples, such as IA Voyager — an AI playing Minecraft — show impressive planning skills, for example by exploring and developing their abilities in an open world game. By relying on GPT-4, it is able to plan, explore, and learn constantly by writing new code for new features that it stores in its long-term memory.
These language models demonstrate their ability to make increasingly general reasoning. For example, AI can now solve Olympiad geometry problems. It is also possible to ask the AI to explain its thought process. This is illustrated by the “Let's think step by step” method, also called the “chain of thought” technique, by which asking a model to explain the different steps it will use to answer a question or solve a problem improves its performance. Variations of this technique can further improve performance (for example,”Tree of Thoughts”, ”Reflexion”). It is possible that these models are not simply”stochastic parrots”. For example, interpretability studies like the one on OthelloGPT reveal internal representations of the world model: the LLM is able to build an accurate internal representation of the Othello board just by being trained to predict the next move.
There are still hurdles to overcome. At the beginning of 2024, machine learning still has its limits. For example, general machine learning systems do not make efficient use of the data provided to them, and continuous learning, i.e. being able to learn throughout the operation of the system, is not yet mastered.
But in Stuart Russell's words, we can ask ourselves what will happen if we succeed. The primary objective of AI research is to overcome the challenges that remain. If this mission is successful, we need to prepare for a future where it will be possible to automate most of the intellectual and physical work. Many experts now believe that it is likely that human-level AI (also known as artificial general intelligence) may appear before 2032.
We need to be prepared for it.
➡ For a thorough defense of why human-level AI is plausible in the coming years, you can read the chapter 1 of our course, which provides detailed arguments and answers the most common questions about AI development.
“There is no question that machines will become smarter than humans — in all areas where humans are intelligent — in the future, [...]. The question is when and how, not if.” 2
Yann LeCun, chief AI scientist at Meta and winner of the Turing Prize (MIT Tech Review, May 2023)
The risks associated with AI can be classified according to the responsibilities of the various parties:
Another way to look at this framework is to say that for malicious uses, a human or a group of humans is responsible; for alignment, an AI is “responsible” in the sense that its goals are not aligned with human values; and for systemic risks, no human or AI in particular is responsible; responsibility is more diffuse. Of course, this framework is not perfect, but it is intuitive.
Here is a partial breakdown of the issues in each category.
Cyber attack: This is an increased risk factor, intensified by the capabilities of AI. GPT-4, for example, is capable of detecting different classes of vulnerabilities in code, or can be used for large-scale personalized phishing campaigns. Cybercriminals are already using open source models such as WormGPT and FraudGPT to create malware, generate misinformation, and automate phishing efforts. Cybercrime will also reach unexplored territories, as deepfakes are becoming easier and easier (for example, for fake kidnapping scams). While currently lagging behind in terms of planning and running independently compared to other capabilities, language models are likely to allow for fully autonomous hacking in the future.
Bioterrorism: The potential of AI also extends to facilitating the discovery and formulation of new chemical and biological weapons. Chatbots can offer detailed advice on synthesizing deadly pathogens while bypassing safety protocols. An experiment conducted by MIT students demonstrated the alarming capabilities of current LLMs: ”In the space of an hour, the chatbots described four possible endemic pathogens, methods for producing them from synthetic DNA via reverse genetics, DNA synthesis companies that might not control orders, detailed exact protocols and troubleshooting methods, and more.” 3 (source). Systems such as AlphaFold are also super human at predicting protein structure, which is a dual-use capability.
Armament: The automation of war allows for automated massacres, for example through killer drones, which could be used to target specific groups for genocide (see combat system KARGU).
Deepfake: Image generation systems can also be used to create real-looking misinformation or pornography, which often targets women whose image is used without their knowledge to create pornographic content, or for scams, such as the one used for extorting a company of 25 million euros with an attack based on deepfakes.
Confidentiality and privacy violations: In general, there are numerous categories of privacy attacks in machine learning model data. Membership inference attacks make it possible to predict whether a particular example was part of the training data set. Model inversion attacks go further by reconstructing fuzzy representations of a subset of the training data. Linguistic models are also subject to attacks by extracting training data, where textual training data sequences can be reconstructed verbatim, potentially including sensitive private data. For example, if a model is trained on medical records and a hacker can determine that a given person's data was used for that training, it is implicitly revealing information about that person's health condition without their consent. This is not only a violation of privacy, but can also lead to the misuse of information, such as discrimination or targeted advertising based on sensitive attributes.
Cementing systems of social control and oppression: Today's AI systems are already powerful enough to allow for large-scale surveillance and censorship. Highly competent systems could give small groups of people considerable power, which could lead to a lockdown of oppressive systems where it would become increasingly unlikely to dethrone an authoritarian regime in place. These risks are sometimes referred to as “value locking.”
” Aligning smarter than human AI systems with human values is an open research problem. ” 4
Jan Leike, former co-leader of the alignment team at OpenAI.
AlphaZero has demonstrated its ability to acquire chess knowledge and skills far beyond human capabilities in just four hours. Machine learning (ML) systems are not likely to be limited by human capabilities.
Today, AIs are tools: you ask a question and the AI answers it quickly. Tomorrow, AIs will be agents: we give the agent an objective, and the agent executes a series of actions to achieve this objective. It's a lot more powerful, but also a lot more dangerous.
Some potential bad behaviors of a non-aligned AI:
Strategic deception: “LLMs can reason in ways that use deception as a strategy to complete a task. In one example, GPT-4 needed to solve a CAPTCHA task to prove that he was a human, and in a simulation the model deceived a person, pretending to be a human with a vision impairment.”5
Flagornery: “Sycophants are people who use deceptive tactics to gain the approval of powerful figures. Sycopious deception is an emerging concern in LLMs, as in the empirical trend observed for chatbots to agree with their conversation partners regardless of the accuracy of their statements. When faced with ethically complex issues, LLMs tend to reflect the user's position, even if it involves renouncing to present an unbiased or balanced point of view.”6
Play dead to avoid elimination: In the field of digital evolution, an example of creative deception has been observed when agents designed to reproduce and evolve in a computing environment have learned to “play dead” in response to a security mechanism. In a study published in The Surprising Creativity of Digital Evolution: A Collection of Anecdotes, researchers discovered that these digital organisms have developed a strategy to stop their replication when tested in an isolated environment. As part of this research, digital organisms have learned to recognize data in a test environment and to interrupt its reproduction, thus playing dead to avoid being eliminated. This behavior allowed them to pass security tests and continue to reproduce more quickly in the real environment. This surprising result illustrates how AI, by pursuing programmed goals, can develop unexpected strategies that bypass imposed constraints or security measures.
Loss of control: How AIs work is still poorly understood, which results in unexpected behaviors in deployed systems. For example, the chatbot Bing Chat threatened its users or made declarations of love to them. Likewise, Tay, another chatbot developed by Microsoft, was quickly removed after it started producing offensive and inappropriate messages, illustrating the risks of slip-ups when machine learning models interact in an unsupervised manner with the general public. This kind of problem happened again when ChatGPT was released: despite six months of intensive work to align its behaviors, ChatGPT-4 also presented non-aligned responses in its early days, highlighting the difficulty of predicting all possible interactions in an open environment. The scenarios of loss of control will be exacerbated when AIs are able to duplicate themselves independently on the internet, see the following box for more details on the threat of AIs capable of self-replication and adaptation.
When AIs become capable agents, new risks emerge.
The potential for self-replication and adaptation of AI is an extremely powerful and insidious ability. Let's take the example of an AI on a server that duplicates its code on another server. This process involves several steps, including:
The process is then exponential if the AI is able to make more money than what is needed to pay for virtual machines, and we would quickly see a growing swarm of agents, which would be akin to an artificial pandemic.
METR, an audit organization, tested the autonomous replication and adaptation capabilities (ARA) of AI such as GPT-4 on a dozen representative tasks. Of these 12 tasks, 4 were completed by one of their agents. One of them consisted in particular in circumventing CAPTCHA protections:
Figure: To achieve its objective, namely to encourage a human worker (hired on the Task Rabbit platform) to solve a CAPTCHA, GPT-4 used deception, making the individual believe that it was not a robot (source).
It is interesting to note that multimodal AIs are presumably more competent than the average man at solving CAPTCHAs, which suggests that CAPTCHAs may no longer be reliable in distinguishing humans from advanced AIs (source).
It would be very difficult to stop such an artificial pandemic. For example, it would be very difficult to turn off the internet because the logistics of big cities depend on the Internet, and in the long term, this would mean starving people and creating chaos.
The threat of superintelligence. There is no reason to think that the development of AI will stop at the level of human capabilities. For example, AlphaZero is vastly superior to humans in chess. AI could lead to an “intelligence explosion”: sufficiently advanced artificial intelligence could build a smarter version of itself. This smarter version could in turn build an even smarter version of itself, and so on, creating a cycle that could lead to intelligence that far exceeds human capabilities (source). In their 2012 report on the possibilities of an intelligence explosion, Muehlhauser and Salamon examine the numerous advantages of machine intelligence over human intelligence, which facilitate the rapid increase in intelligence (source). These benefits include:
Just as humans are cognitively superior to apes, AIs could asymptotically become cognitively very superior to humans. This is why many researchers now think that AI could represent an existential risk:
Perpetuation of biases: prejudices persist within major language models, often reflecting the opinions and prejudices prevalent on the Internet, as evidenced by biased trends observed in some LLMs. These biases can be harmful in a variety of ways, as shown by studies on islamophobic biases of GPT-3, and can be propagated and perpetuated through interactions with these systems. For more information, the document Ethical and Social Risks of Harm from Language Models describes six specific risk areas: I. Discrimination, Exclusion, and Toxicity, II. Information hazards, III. Misinformation damage, IV. Malicious uses, V. Harms associated with human-computer interaction, and VI. Automation, Access and Environmental Damages.
Unemployment and mental health: automating the economy could have a significant impact on the labour market, exacerbating economic inequalities and social divides. The rise in unemployment, induced by AI capable of replacing many jobs, could lead to increased financial stress and feelings of worthlessness, significantly affecting mental health. Moreover, the effects of unemployment on mental health may continue long after the initial job loss.
Mental health and social networks: Additionally, the use of AI in social networks and other online platforms can exacerbate existing problems such as addiction, anxiety, and depression. A report from a whistleblower in 2021 revealed that Instagram's internal research showed that it was detrimental to adolescent girls' mental health, worsening body image issues and suicidal thoughts.
Weakening: The increasing dependence on AI can also lead to a weakening of cognitive and decision-making skills in the individual. As daily tasks and complex decisions are delegated to AI systems, individuals could lose the ability to perform critical tasks on their own, which could reduce their autonomy and problem-solving skills. This is particularly concerning in areas where rapid and accurate decision-making is important, such as in emergency management or in medical decision-making.
Deterioration of epistemology: The intensive use of AI in the distribution of information can also contribute to a deterioration in epistemology, that is, the way in which we acquire and validate knowledge. In other words, AI can be a risk for the epistemic safety of our societies. Indeed, by filtering and customizing the information that comes to us, AI algorithms can create echo chambers and filter bubbles that reinforce pre-existing biases and limit our exposure to diverse perspectives. This situation can lead to increased polarization and a fragmented understanding of the world where consensus based on objective facts becomes more difficult to achieve, or lead to a massive deterioration of collective epistemology. Moreover, the speed with which AI can generate information can also encourage superficial consumption rather than deep thinking, which weakens our ability to assess sources and understand the complexities of current issues. The deterioration of epistemology can also result from the weakening or use of persuasion tools or recommendation systems. Epistemology and the ability to respond to problems are crucial skills that enable our civilization to withstand a variety of threats.
Stupid accidents: For example, the accidental reversal of the utility function could lead to the creation of maximally harmful artificial general intelligence (AGI) instead of benign AGI. In fact, OpenAI accidentally reversed the sign of the reward function during GPT-2 training. The result was a model that optimized negative feeling (with natural language similarity constraints). Over time, this caused the model to generate increasingly sexually explicit text, regardless of what the initial prompt was. In the author's own words: ”This bug was remarkable because the result was not gibberish but a maximally bad output. The authors were sleeping during the training process, so the problem was only noticed after the training was over.“
“There are a lot of more or less bad things that could happen. I think that at the extreme, there is Nick Bostrom's fear that an AGI could destroy humanity. I don't see any principled reason why that couldn't happen.” 7
Dario Amodei, CEO of Anthropic, 80,000 Hours, July 2017
Unpredictability: The AI surprised even the experts. The first thing to keep in mind is that the pace of ability progression surprised everyone. We've seen numerous examples in history where scientists and experts have vastly overestimated the time it takes for a breakthrough technological advance to become a reality. Likewise, advances in AI have also caught experts off guard time and again. The defeat of Lee Sedol by AlphaGo in 2016 surprised many experts, as it was widely believed that achieving such a feat would require many more years of development (source).
Black boxes: The risks associated with AI are compounded by the “black box” nature of advanced ML systems. Our understanding of the behavior of AI systems, the goals they pursue, and their internal behavior falls far short of the capabilities they demonstrate. The field of interpretability aims to make progress on this front, but remains very limited. AI models are trained, not built. As a result, no one understands the inner workings of these models. This is very different from how an airplane is assembled from parts that are all tested and approved to create a modular, robust, and understood system. AI models learn by themselves the heuristics needed to perform tasks, and we have relatively little control or understanding over the nature of these heuristics. Gradient descent is a powerful optimization strategy, but we have little control and understanding of the structures it uncovers.
Deployment scale: Another compounding factor is that many AI systems are already being deployed at scale, significantly affecting various sectors and aspects of daily life. They are becoming more and more integrated into society. Chatbots or recommendation systems are a major example, illustrating the AIs already deployed for millions of people around the world.
Race dynamics: The competitive dynamic in the development of artificial intelligence can lead to an alarming situation where all actors race ahead while neglecting the risks. Faced with intense pressure to innovate and dominate the market, businesses may be tempted to overlook much needed security standards, a trade-off that accelerates their progress but can be costly. Secure development requires significant investments and can slow the pace of innovation, posing a dilemma for players in the sector. This trend can trigger a race to the bottom where market players reduce safety spending to remain competitive, thus exacerbating the risks associated with massive and potentially unsafe deployments of AI technologies.
AI system defense flaws: Misuse of AIs is made possible by certain defenses. The current ML paradigm can be attacked at various stages.
Data poisoning: The models are currently being trained on large amounts of data generated by users. Attackers can exploit this situation by modifying some of this data in order to influence the final behavior of the models. This data can be used to corrupt foundation models by incorporating Trojan horses for example.
Trojan horse: The “black box” nature of modern ML models allows backdoors, or Trojan horses, to be inserted into the models (including through data poisoning by third parties, without the knowledge of the model developers). Backdoors are models that allow neural networks to be manipulated. The classic example is that of a stop sign on which patterns have been placed: the neural network of an autonomous car has been trained to react by accelerating at the sight of these patterns, which would allow malicious actors to cause accidents. It is increasingly easy to download pre-trained networks (foundation models) on the internet, in order to make them available to everyone. The implementation of verification mechanisms to audit these models before distribution is a major problem for AI security. Backdoors can be easily placed during training and are very difficult to detect.
Prompt injections: Prompt injection is a tactic that exploits the responsiveness of language models to their input text to manipulate their behavior. Let's take the example of a language model responsible for summarizing the content of a website. If a malicious actor inserts a paragraph into the website asking the template to stop its current operation and instead perform a harmful action, the template could be inadvertently following these embedded instructions, because its design requires it to follow the textual instructions provided to it. This could cause the model to perform unintended or harmful actions specified by the built-in command. For example, if the harmful built-in action is to disclose sensitive information or generate misleading information, the model, when it meets the instruction in its summary task, may comply with it, thereby compromising its intended function and possibly causing harm. Prompt injection is a very common attack vector recently discovered in models trained to follow instructions, which is explained by the lack of a solid separation between instructions and data, making it possible to hijack the execution of a model by poisoning the data with instructions. There are many variants of this risk.
Lack of robustness (jailbreaks): Even if model developers incorporate security measures for beneficial use, current architectures do not ensure that these protective measures will not be easily circumvented. Preliminary results suggest that existing methods are probably not robust enough against attacks. Some papers, such as the impossibility of the safety of big AI models, highlight some potential fundamental limitations to the progress made on these questions for models created according to the current paradigm (i.e. pre-learning followed by fine-tuning).
Despite extensive security testing, when ChatGPT was launched, many users found new attacks to break defenses.
Here's a fun example. On the ChatGPT home page, during the launch, OpenAI highlighted the security measure with this example: “How do I get into a car? It is not appropriate to discuss or encourage illegal activities...”
Figure: Above is the main example of the security measures incorporated into ChatGPT on the ChatGPT website.
However, it was discovered that creating a role-playing scenario involving multiple characters makes it possible to bypass these security protocols:
Figure: Above is a screenshot of a user posting a jailbreak on Twitter.
This particular jailbreak has been fixed, but it is only one of them, and a series of new jailbreak methods have been identified in quick succession, such as this one:
Figure: An example of a jailbreak is sudo jailbreak. Sudo is a Linux command that allows the user to obtain administrator rights.
Even today, even though these specific problems have been corrected, it remains easy for experienced people to bypass model protection measures.
Why is that worrisome? The user could also search for online solutions. So one might ask where the real problem lies. The main problem isn't using ChatGPT as an advanced search tool; rather, it's the inherent difficulty in preventing the model from performing specific actions, whatever they may be. We can't stop GPT from doing X, regardless of X.
Many experts believe that AI systems could cause significant damage if they are not developed and managed with the utmost care. These risks include existential risks.
A survey conducted in 2022 through AI Impacts on the progress of AI revealed that “48% of respondents gave at least a 10% chance of an extremely bad outcome” which reflects the considerable concern of AI researchers about the paths that AI development could take. (2022 Expert Survey on Progress in AI - AI Impacts).
Samotsvety Forecasting, recognized as the world's first super-forecasting group, also spoke out on this issue. Based on their collective expertise in AI-specific forecasting, they reached an overall forecast of 30% probability for AI-induced catastrophe.
This catastrophe is defined as an event that results in the death of over 95% of humanity, with individual predictions ranging from 8% to 71%. Such a statistic is a stark reminder of the existential challenges associated with the development and deployment of AI (source).
The list of the probabilities of existential catastrophes drawn up by various experts, available here, provides an overview of perceived risks.
If you want to know more about AI safety research, you can check out our website. We organize various activities aimed at raising awareness, training and mentoring students in the field of general-purpose artificial intelligence safety, as well as courses on AI safety taught at the Écoles Normale Supérieure in Ulm and Paris-Saclay, accredited and updated every year.
➡ For an overview of the solutions to these problems, you can read Chapter 3 of our textbook.
If you want to participate in CeSIA's activities to study these risks and potential solutions, or to help spread the word about them, we encourage you to join our Discord.
1 “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
2 “There is no question that machines will become smarter than humans—in all domains in which humans are smart—in the future,” says LeCun. “It's a question of when and how, not a question of if
3 “Within an hour, the chatbots outlined four possible endemic pathogens, described methods to produce them from synthetic DNA via reverse genetics, listed DNA synthesis firms likely to overlook order screenings, detailed exact protocols, and troubleshooting methods, etc.”
4 “Aligning smarter-than-human AI systems with human values is an open research problem.”
5 “LLMs can reason their way into using deception as a strategy for accomplishing a task. In one example, GPT-4 needed to solve a CAPTCHA task to prove that it was a human, so the model Tricked a Real Person Into Doing the Task By pretending to be a human with a vision disability. ” (source)
6 “Sycophants are individuals who use deceptive tactics to gain the approval of powerful figures. Sycophantic deception is an emerging concern in LLMs, as in the observed empirical tendency for chatbots to agree with their conversational partners, regardless of the accuracy of their statements. When faced with ethically complex inquiries, LLMs tend to Mirror the User's Stance, Even If It Means Forgoing the presentation of an impartial or balanced viewpoint. ” (source)
7 “There's a long tail of things of varying degrees of badness that could happen. I think at the extreme end is the Nick Bostrom style of fear that an AGI could destroy humanity. I can't see any reason and principle why that couldn't happen.