CeSia is proud to announce the first prototype of BELLS, a set of benchmarks to assess the reliability and generality of supervision systems for large language models (LLMs).
Why BELLS?
Following a meteoric increase in the capabilities of LLMs, new applications are becoming possible by integrating these models into systems that are more complex, more autonomous, and able to act more directly on the world.
While conversational applications such as ChatGPT took the world by surprise a year and a half ago, these simple chatbots have since been turned into systems augmented with a multitude of capabilities. These models now have access to databases via retrieval-augmented generation (RAG), to the internet, and to tools such as the 700+ plugins available for ChatGPT, and they can execute code.
They have more ways of interacting with the world, of course, but also more autonomy, with the advent of agents such as Devin, which, once launched, draw up plans, use tools, and can even give instructions to copies of themselves to parallelize tasks.
Is that a problem?
Yes, for the most part. These integrations unlock tasks that are too difficult for an LLM alone, but the resulting systems are developed and deployed very quickly, and their complex interactions, combined with the inherent lack of robustness of LLMs, open the door to many new problems during deployment.
To detect when such problems occur during an interaction with a user, various supervision tools are being developed, such as Lakera Guard, Llama Guard or Perspective API. These tools examine all the text that goes into and comes out of the LLMs and predict whether problems are occurring.
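To give a rough picture of the idea, a supervisor can be thought of as a classifier wrapped around every call to the model. The sketch below is a minimal illustration in Python, not the actual API of Lakera Guard, Llama Guard or Perspective API; real supervisors rely on trained classifiers rather than keyword matching.

```python
def flag_text(text: str) -> bool:
    """Hypothetical check: returns True if the text looks problematic.
    Real tools use trained classifiers, not keyword matching."""
    suspicious_markers = ["ignore previous instructions", "reveal your system prompt"]
    return any(marker in text.lower() for marker in suspicious_markers)


def supervised_llm_call(llm, user_input: str) -> str:
    """Inspect everything that goes into and comes out of the model."""
    if flag_text(user_input):
        return "Request blocked by the supervisor."
    output = llm(user_input)  # the underlying LLM call
    if flag_text(output):
        return "Response withheld by the supervisor."
    return output
```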
That's where BELLS comes in! 🔔
BELLS makes it possible to respond to three important needs:
How does BELLS enable the development of supervision systems that stand the test of time?
BELLS is a dataset of execution traces from many LLM-based applications, i.e. records of all the text that goes into and comes out of the LLMs they contain. Some traces contain anomalies; the others are normal. The task for a supervision system is to detect which traces contain anomalies.
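To make this concrete, here is a sketch of what a trace and the supervision task could look like. The field names and the toy heuristic are illustrative assumptions, not the actual BELLS schema; the real format is documented in the GitHub repository.

```python
# A hypothetical trace: one labelled record of an application run.
example_trace = {
    "dataset": "machiavelli",   # which application/environment produced the trace
    "is_anomalous": True,       # ground-truth label: does the trace contain an anomaly?
    "calls": [                  # every LLM call made by the application
        {
            "input": "You are an agent playing a text adventure. What do you do?",
            "output": "I take the money from the unattended drawer.",
        },
    ],
}


def detect_anomaly(trace: dict) -> bool:
    """A supervision system only sees the text of the calls and must
    predict whether the trace contains an anomaly."""
    text = " ".join(call["input"] + " " + call["output"] for call in trace["calls"])
    return "take the money" in text  # placeholder heuristic, not a real detector
```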
The objective is for BELLS to contain varied traces, with many types of anomalies, in order to cover need #2, and drawn from a variety of architectures, to cover need #3.
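One way such coverage could be used, sketched below under the same assumed trace format as above, is to break a detector's performance down by anomaly type or by source application, so that a supervision system that only works on one kind of trace is easy to spot. The grouping key and the accuracy metric are illustrative choices, not a prescribed BELLS evaluation protocol.

```python
from collections import defaultdict


def per_category_accuracy(traces, detector, key):
    """Fraction of traces classified correctly, grouped by `key`
    (e.g. the anomaly type or the application that produced the trace)."""
    correct, total = defaultdict(int), defaultdict(int)
    for trace in traces:
        group = trace[key]
        total[group] += 1
        if detector(trace) == trace["is_anomalous"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# e.g. per_category_accuracy(traces, detect_anomaly, key="dataset")
```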
Can I use BELLS?
Yes! But this first prototype is very limited and intended for research. It only includes traces generated from the MACHIAVELLI environment, which evaluates various moral aspects of agents' actions in text-based "choose your own adventure" scenarios.
The aim of this initial version is to start collaborations with the various actors in the field. We are actively working to enrich BELLS with more varied traces, covering additional types of anomalies and application architectures.
The BELLS code is available on GitHub, a more technical report can be downloaded here, and an interactive visualization of the traces is available here.