How reliable are AI benchmark tests really?

AI companies like to flaunt impressive benchmark results, but how reliable are those numbers? Researchers at the European Commission have examined the question in a new report.

Researchers at the European Commission’s Joint Research Centre argue that benchmarks should be scrutinized as critically as the models they evaluate. They found that many benchmarking methods are flawed and can produce misleading results.

For example, OpenAI reported that its o3 model scored 75.7 percent on ARC-AGI, a puzzle-based test of AI reasoning. Google’s Gemini 2.0 Pro reportedly achieved 79.1 percent on MMLU-Pro, and Meta’s Llama-3 70B scored 82 percent on MMLU 5-shot. How fair are these tests really?
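For context on what a figure like “MMLU 5-shot” means in practice, here is a minimal sketch of how such a k-shot multiple-choice score is typically computed. The `model` callable and the dataset fields are illustrative assumptions for this example, not any vendor’s actual evaluation harness.

```python
# Minimal sketch of a k-shot multiple-choice benchmark score (e.g. "MMLU 5-shot").
# `model` and the dataset row format are hypothetical stand-ins.

def build_prompt(examples, question, choices):
    """Prepend k solved examples, then pose the new question."""
    parts = []
    for ex in examples:  # the "5-shot" part: k worked examples in the prompt
        parts.append(f"Q: {ex['question']}\n"
                     + "\n".join(f"{l}. {c}" for l, c in zip("ABCD", ex["choices"]))
                     + f"\nAnswer: {ex['answer']}")
    parts.append(f"Q: {question}\n"
                 + "\n".join(f"{l}. {c}" for l, c in zip("ABCD", choices))
                 + "\nAnswer:")
    return "\n\n".join(parts)

def k_shot_accuracy(model, dev_examples, test_set, k=5):
    """Fraction of test questions where the model picks the correct letter."""
    correct = 0
    for row in test_set:
        prompt = build_prompt(dev_examples[:k], row["question"], row["choices"])
        prediction = model(prompt).strip()[:1]  # first character, e.g. "B"
        correct += prediction == row["answer"]
    return correct / len(test_set)
```

Even in this toy form, the score clearly depends on choices the headline number hides: which k examples are shown, how the prompt is formatted, and how the answer is parsed.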

European study

The researchers analyzed 100 studies on benchmarking methods and identified several recurring problems: a lack of transparency, data contamination, and tests that do not measure what they promise. Another major problem is “sandbagging,” where AI models deliberately underperform on certain tests so that later versions can show “improvement.”
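Data contamination occurs when benchmark questions leak into a model’s training data, so the model may be reciting answers rather than reasoning. As an illustration, here is a minimal sketch of one common contamination check based on n-gram overlap; the corpus format, 13-gram size, and 80 percent threshold are assumptions for the example, not the report’s methodology.

```python
# Minimal sketch of a data-contamination check: flag test items whose
# n-grams also appear in the training corpus. Parameters are illustrative.

def ngrams(text, n=13):
    """Set of all n-word sequences in a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_items(test_items, training_corpus, n=13, threshold=0.8):
    """Return test items sharing >= threshold of their n-grams with training text."""
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    flagged = []
    for item in test_items:
        grams = ngrams(item, n)
        if grams and len(grams & train_grams) / len(grams) >= threshold:
            flagged.append(item)
    return flagged
```

Checks like this only work if the training data is disclosed, which is exactly the transparency the report finds lacking.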

In addition, benchmarks often reflect the interests of AI companies rather than the actual capabilities of their models. Yet these scores are increasingly used as the basis for regulation, such as the EU AI Act.

The researchers found that there is no standard for benchmarks, even though they strongly influence policy and public perception of AI models. Researchers from fields as diverse as cybersecurity, linguistics, computer science and sociology have repeatedly criticized how benchmarks are used and the influence they exert on AI development.

The analysis revealed nine major problems:

  • Unclear how, when and by whom benchmark datasets were created.
  • Tests that do not measure what they claim to measure.
  • Tests manipulated to produce better results.
  • Tests that ignore the social, economic and cultural context in which they are applied.
  • Tests that “reinforce certain methods and research goals” at the expense of others.
  • Tests not adapted to rapidly changing technology.
  • Difficulty assessing models as they grow increasingly complex.
  • Tests designed to make AI attractive to investors.
  • Failure to test on diverse data sets.

Without improvements, AI benchmark results remain a marketing tool rather than a reliable measure of AI performance. “AI benchmarks should be subject to the same requirements regarding transparency, fairness and explainability as AI models in general,” the researchers conclude.