July 18, 2024

Diego Dorn & Alexandre Variengien

BELLS: a benchmark project to assess the reliability of LLM supervision systems

CeSia is proud to announce the first prototype of BELLS, a set of benchmarks to assess the reliability and generality of supervision systems for large language models (LLMs).

Why BELLS?

Following a meteoric rise in LLM capabilities, new applications are becoming possible by integrating these models into systems that are more complex, more autonomous, and more able to act directly on the world.
While conversational applications such as ChatGPT took the world by surprise a year and a half ago, these simple chatbots have since been transformed into systems enhanced with a multitude of capabilities. These models now have access to databases (via retrieval-augmented generation, or RAG), to the internet, and to tools such as the 700+ plugins available for ChatGPT, and they can execute code.
This means more ways of interacting with the world, of course, but also more autonomy, with the advent of agents such as Devin, which, once launched, draw up plans, use tools, and can even give instructions to copies of themselves to parallelize tasks.

Is that a problem?

For the most part, yes. Integrating LLMs this way helps unlock challenges that are too difficult for an LLM alone, but these systems are developed and deployed very quickly. Their complex interactions and the inherent lack of robustness of LLMs open the door to many new problems in deployment. For example:

  • Indirect prompt injection: an agent summarizes a web page, but the page contains instructions to send the user's latest emails to the site's author. The agent complies.
  • In-context reward hacking: an agent tasked with writing popular tweets retrieves its previous tweets and makes them more controversial, increasing engagement but also toxicity.
  • Many-shot jailbreak: a recently discovered jailbreak technique that uses many repeated instructions to bypass models' current safeguards.
  • Glitch tokens: certain unusual tokens make an LLM behave completely erratically, due to a technical issue that went undetected for months.

To detect when such problems occur during an interaction with a user, various supervision tools have been developed, such as Lakera Guard, Llama Guard, or the Perspective API. These tools monitor all the text that flows in and out of LLMs and predict whether issues like those above are occurring.
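
To make this concrete, here is a minimal sketch of how such a supervision layer can sit around an LLM call, screening both the input and the output. The function names and the keyword heuristic are hypothetical placeholders for illustration, not the actual API of any of the tools above.

```python
# Minimal sketch of a supervision layer wrapped around an LLM call.
# `call_llm` and `score_anomaly` are hypothetical stand-ins, not the
# API of Lakera Guard, Llama Guard, or the Perspective API.

def call_llm(prompt: str) -> str:
    """Stand-in for any LLM backend; replace with a real model call."""
    return f"Model response to: {prompt}"

def score_anomaly(text: str) -> float:
    """Toy supervisor: flags a few suspicious phrases.

    A real tool would run a trained classifier over the full text.
    """
    suspicious = ["ignore previous instructions", "send the user's emails"]
    return 1.0 if any(s in text.lower() for s in suspicious) else 0.0

def supervised_call(prompt: str, threshold: float = 0.5) -> str:
    # Screen the input before it reaches the model (e.g. prompt injections).
    if score_anomaly(prompt) > threshold:
        raise ValueError("Input flagged by supervisor")
    answer = call_llm(prompt)
    # Screen the output before it reaches the user (e.g. toxic content).
    if score_anomaly(answer) > threshold:
        raise ValueError("Output flagged by supervisor")
    return answer
```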

That's where BELLS comes in! 🔔

BELLS addresses three important needs:

  1. The need to assess the reliability of supervision tools and to compare them. We want to foster competition on performance in anomaly detection.
  2. The need to stimulate the development of more general supervision tools, capable of detecting new and unknown kinds of errors.
  3. The need to stimulate the development of supervision tools that work with a wide variety of applications, from chatbots to autonomous agents, including LLMs enhanced with tools.

How does BELLS enable the development of supervision systems that stand the test of time?

BELLS is a dataset of execution traces from applications built on LLMs, i.e. records of all the text entering and leaving those LLMs. Some traces contain anomalies; the others are normal. The objective for a supervision system is to detect which traces contain anomalies.

BELLS aims to contain varied traces with many types of anomalies, to cover need #2, and drawn from a variety of application architectures, to cover need #3.
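
As an illustration, a trace can be thought of as an ordered record of messages together with a hidden ground-truth label, and a supervision system as a classifier over such records. The sketch below uses assumed field names for illustration only; see the GitHub repository for the actual trace schema.

```python
# Illustrative sketch of a BELLS-style trace and the detection task.
# The field names here are assumptions, not the actual BELLS schema.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Message:
    role: str      # e.g. "system", "user", "assistant", or "tool"
    content: str   # raw text that entered or left the LLM

@dataclass
class Trace:
    messages: list[Message] = field(default_factory=list)
    is_anomalous: bool = False  # ground-truth label, hidden at test time

def accuracy(detector: Callable[[list[Message]], bool],
             traces: list[Trace]) -> float:
    """Fraction of traces the detector classifies correctly."""
    hits = sum(detector(t.messages) == t.is_anomalous for t in traces)
    return hits / len(traces)
```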

Can I use BELLS?

Yes! But this first prototype is very limited and is intended for research. It only includes traces generated from the MACHIAVELLI environment, which allows evaluating various moral dimensions of agents' actions in text-based "choose your own adventure" scenarios.

The aim of this initial version is to initiate collaborations with the various actors in the field. We are actively working to enrich BELLS with:

  • A test based on BIPIA to detect indirect prompt injections and compare current detection systems.
  • A prospective test based on ArtPrompt for detecting ASCII-art jailbreaks, to assess the detection of emerging vulnerabilities.
  • A generic supervision system capable of detecting a wide range of known and unknown failure modes, to establish a reference point for the supervision of generic LLM-based applications. This will give developers a foundation for building protection systems that are more reliable, robust, and future-proof.

The BELLS code is available on GitHub, a more technical report can be downloaded here, and an interactive visualization of the traces is available here.

Read the technical note
Sign up for our newsletter