Intelligence is pervasive, yet its measurement seems subjective. At best, we approximate it through tests and benchmarks. Think of college entrance exams: every year, countless students sign up, memorize test-prep tricks and sometimes walk away with perfect scores. Does a single number, say 100%, mean those who earned it share the same intelligence, or that they have somehow maxed out their intelligence? Of course not. Benchmarks are approximations, not exact measurements of someone's, or something's, capabilities.
The AI community has long relied on benchmarks like MMLU (Massive Multitask Language Understanding) to evaluate model capabilities through multiple-choice questions across academic disciplines. This format enables simple comparisons, but fails to truly capture intelligent capabilities.
Claude 3.5 Sonnet and GPT-4.5, for example, achieve similar scores on this benchmark. On paper, this suggests equivalent capabilities. Yet anyone who works with these models knows there are substantial differences in their real-world performance.
What does it mean to measure “intelligence” in AI?
On the heels of the release of the new ARC-AGI benchmark, a test designed to push models toward general reasoning and creative problem-solving, there is renewed debate about what it means to measure "intelligence" in AI. While not everyone has tested the ARC-AGI benchmark yet, the industry welcomes this and other efforts to evolve testing frameworks. Every benchmark has its merit, and ARC-AGI is a promising step in that broader conversation.
Another notable recent development in AI evaluation is "Humanity's Last Exam," a comprehensive benchmark containing 3,000 peer-reviewed, multi-step questions across various disciplines. While this test represents an ambitious attempt to challenge AI systems at expert-level reasoning, early results show rapid progress, with OpenAI reportedly achieving a 26.6% score within a month of its release. However, like other traditional benchmarks, it primarily evaluates knowledge and reasoning in isolation, without testing the practical, tool-using capabilities that are increasingly crucial for real-world applications.
In one example, several state-of-the-art models fail to correctly count the number of "r"s in the word "strawberry." In another, they incorrectly judge 3.8 to be less than 3.1111. These kinds of failures, on tasks that even a young child or a basic calculator could solve, expose a gap between benchmark-driven progress and real-world robustness, reminding us that intelligence is not only about passing exams but about reliably navigating everyday logic.
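For contrast, a few lines of ordinary code handle both of these tasks deterministically. The snippet below is an illustrative sketch, not part of any benchmark suite.

```python
# Count the letter "r" in "strawberry" -- the kind of task some
# language models still get wrong.
word = "strawberry"
r_count = word.count("r")
print(f'"{word}" contains {r_count} occurrences of "r"')  # 3

# Compare two decimals that models have been known to misorder.
a, b = 3.8, 3.1111
print(f"{a} > {b}: {a > b}")  # True: 3.8 is greater than 3.1111
```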
The new standard for measuring AI capability
As models have advanced, these traditional benchmarks have shown their limits: GPT-4 with tools achieves only about 15% on the more complex, real-world tasks in the GAIA benchmark, despite its impressive scores on multiple-choice tests.
This disconnect between benchmark performance and practical capability has become increasingly problematic as AI systems move from research environments into commercial applications. Traditional benchmarks test knowledge recall but miss crucial aspects of intelligence: the ability to gather information, execute code, analyze data and synthesize solutions across multiple domains.
GAIA represents the needed shift in AI evaluation methodology. Created through a collaboration between teams at Meta-FAIR, Meta-GenAI, Hugging Face and AutoGPT, the benchmark includes 466 carefully crafted questions across three difficulty levels. These questions test web browsing, multimodal understanding, code execution, file handling and complex reasoning, capabilities that are essential for real-world applications.
Level 1 questions require roughly 5 steps and one tool for humans to solve. Level 2 questions require 5 to 10 steps and multiple tools, while Level 3 questions can require up to 50 discrete steps and any number of tools. This structure mirrors the real complexity of business problems, where solutions rarely come from a single action or tool.
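To make that level structure concrete, here is a minimal, hypothetical sketch of how an agent might chain several tools to answer a GAIA-style Level 2 question. The tool names, the question and the data are invented for illustration and do not come from the benchmark itself.

```python
import re

# Hypothetical agent flow: a GAIA-style question may require several
# discrete steps, each backed by a different tool. The "tools" below are
# stand-ins for real web-search, parsing and calculation backends.

def web_search(query: str) -> str:
    """Placeholder tool: return raw text found for the query."""
    return "The report lists 2023 revenue of $12.5M and 2022 revenue of $10.0M."

def extract_numbers(text: str) -> list[float]:
    """Placeholder tool: pull dollar figures (in millions) out of the text."""
    return [float(x) for x in re.findall(r"\$([\d.]+)M", text)]

def run_calculation(values: list[float]) -> float:
    """Placeholder tool: compute year-over-year growth from the figures."""
    latest, previous = values[0], values[1]
    return (latest - previous) / previous * 100

# A Level 2-style task: multiple steps, multiple tools.
question = "What was the company's year-over-year revenue growth in its latest report?"
raw_text = web_search(question)       # step 1: gather information
figures = extract_numbers(raw_text)   # step 2: parse the source
growth = run_calculation(figures)     # step 3: reason over the data
print(f"Year-over-year growth: {growth:.1f}%")  # 25.0%
```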
By prioritizing flexibility over raw complexity, one AI model reached 75% accuracy on GAIA, outperforming industry heavyweights such as Microsoft's Magnetic-1 (38%) and Google's Langfun Agent (49%). Its success stems from combining specialized models for audio-visual understanding and reasoning, with Anthropic's Claude 3.5 Sonnet as the primary model.
This evolution in AI evaluation reflects a broader industry shift: we are moving from standalone SaaS applications to AI agents that can orchestrate multiple tools and workflows. As businesses increasingly rely on AI systems to handle complex, multi-step tasks, benchmarks like GAIA offer a more meaningful measure of capability than traditional multiple-choice tests.
The future of AI evaluation lies not in isolated knowledge tests but in comprehensive assessments of problem-solving capacity. GAIA sets a new standard for measuring AI capability, one that better reflects the challenges and opportunities of real-world deployment.
Sri Ambati is the founder and CEO of H2O.AI.