The race to develop the most powerful artificial intelligence (AI) is heating up, with companies like Meta, OpenAI, and Anthropic constantly unveiling new large language models (LLMs) and touting impressive benchmark scores. But do these numbers truly reflect the models’ capabilities, or are they simply self-serving hype?
Benchmarks are crucial for assessing AI models. They provide a standardized way to compare different models, track progress in the field, and inform users about each model’s strengths and weaknesses. Yet despite their importance, current benchmarks face growing scrutiny for their limitations and their potential for manipulation.
One major issue is the ease with which today’s LLMs can achieve high scores on existing benchmarks. For example, the MMLU (Massive Multitask Language Understanding) benchmark, created in 2020, is now considered too easy for the latest LLMs, making it difficult to differentiate between them. This phenomenon, known as “saturation,” renders the benchmark less meaningful.
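To see why saturation matters, consider the statistics: once two models both score near a benchmark’s ceiling, the gap between them can be smaller than the sampling noise of the test itself. The sketch below uses made-up scores and assumes a test set of roughly 14,000 questions (about the size of MMLU’s test split) purely for illustration.

```python
import math

def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% confidence margin for an accuracy estimate (normal approximation)."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_questions)

# Made-up scores for two models on a ~14,000-question benchmark; not real model results.
n = 14_000
model_a, model_b = 0.882, 0.879

print(f"Model A: {model_a:.1%} ± {accuracy_margin(model_a, n):.1%}")
print(f"Model B: {model_b:.1%} ± {accuracy_margin(model_b, n):.1%}")
# The 0.3-point gap is about the same size as the sampling noise,
# so a saturated benchmark can no longer cleanly rank the two models.
```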
Other concerns include the potential for errors within the benchmarks themselves. A recent study found significant errors in some of MMLU’s questions, casting doubt on the accuracy of the scores obtained. Furthermore, the practice of “contamination,” in which LLMs are trained on data that includes benchmark questions and answers, makes it hard to know what the models are genuinely capable of.
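How is contamination detected at all? One common heuristic, shown here only as a minimal sketch rather than any lab’s actual pipeline, is to check whether word n-grams from a benchmark question appear verbatim in the training corpus:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lower-cased word n-grams of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a benchmark question whose n-grams appear verbatim in any training document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# Hypothetical data, for illustration only:
question = "Which of the following is a consequence of the Banach fixed-point theorem?"
corpus = ["... a scraped page that quotes: which of the following is a "
          "consequence of the banach fixed-point theorem? answer: ..."]
print(is_contaminated(question, corpus))  # True -> the question likely leaked into training data
```

Real decontamination pipelines are more elaborate, but the principle is the same: exact or near-exact overlap between training text and test questions is treated as evidence that a score is inflated.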
The problem of benchmark manipulation extends beyond accidental contamination. Some companies may intentionally train their models on benchmark data to boost their scores, leading to misleading performance evaluations. To address this, efforts are underway to create “private” benchmarks whose questions are kept secret so that models cannot be trained on them. However, this raises concerns about transparency and the possibility of independent verification.
Beyond the inherent flaws in existing benchmarks, there are also issues related to the way they are applied. Even small changes in how questions are presented to AI models can significantly impact their scores, affecting the reproducibility and comparability of results.
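As a concrete illustration, the same multiple-choice question can be phrased under several prompt templates, and evaluations have reported score swings from changes this small. The sketch below shows what a prompt-sensitivity check might look like; the model_answer helper mentioned in the comments is a hypothetical stand-in for whatever inference call an evaluator uses.

```python
QUESTION = "What is the capital of Australia?"
CHOICES = ["Sydney", "Canberra", "Melbourne", "Perth"]

# Three templates that encode the *same* question in different surface forms.
TEMPLATES = [
    "{q}\nA. {a}\nB. {b}\nC. {c}\nD. {d}\nAnswer:",
    "Question: {q}\nOptions: (A) {a} (B) {b} (C) {c} (D) {d}\nThe correct option is",
    "{q}\n1) {a}  2) {b}  3) {c}  4) {d}\nRespond with the number only.",
]

def build_prompts(question: str, choices: list[str]) -> list[str]:
    a, b, c, d = choices
    return [t.format(q=question, a=a, b=b, c=c, d=d) for t in TEMPLATES]

for prompt in build_prompts(QUESTION, CHOICES):
    print(prompt, end="\n---\n")
    # Scoring model_answer(prompt) for each variant separately is how evaluators
    # measure prompt sensitivity. (model_answer is a hypothetical helper.)
```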
Recognizing these limitations, researchers are developing new benchmarks that are more challenging, robust, and less susceptible to manipulation. For example, the GAIA benchmark assesses AI models on real-world problem-solving, while NoCha (Novel Challenge) focuses on understanding complex, long-form texts.
The creation of these new benchmarks is often resource-intensive, requiring human experts to write detailed questions and answers. To overcome this hurdle, research groups are exploring the use of LLMs themselves to generate benchmark questions, automating the process and potentially leading to more diverse and challenging assessments.
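A minimal sketch of that idea, assuming a generic text-in/text-out llm_call function rather than any particular vendor’s API, might look like the following: the LLM drafts candidate questions in a structured format, malformed generations are discarded, and everything that survives still goes to human reviewers.

```python
import json

SEED_TOPICS = ["tax law", "organic chemistry", "maritime navigation"]

PROMPT_TEMPLATE = (
    "Write one difficult multiple-choice question about {topic}. "
    "Return JSON with keys 'question', 'choices' (4 strings), and 'answer_index'."
)

def generate_items(llm_call, topics=SEED_TOPICS) -> list[dict]:
    """Draft benchmark items with an LLM; llm_call is a placeholder for any
    text-in/text-out model API. Human experts would still review each item."""
    items = []
    for topic in topics:
        raw = llm_call(PROMPT_TEMPLATE.format(topic=topic))
        try:
            items.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # discard malformed generations rather than ship broken questions
    return items

# generate_items(some_llm_call) would return a list of draft question dicts for review.
```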
The emergence of startups specializing in AI benchmarking is another encouraging development. By providing researchers, regulators, and the public with access to reliable and specific benchmarks, these companies are paving the way for a more transparent and accountable future for AI.
The AI benchmarking crisis is a reminder that simply relying on self-reported scores is not enough. A concerted effort is needed to develop more comprehensive, robust, and trustworthy methods for evaluating the capabilities of these powerful technologies. Only then can we truly understand the progress and potential of AI, and ensure its responsible development and deployment.