
Beyond ARC-AGI: GAIA and the search for a real intelligence benchmark

MTHANNACH
Last updated: April 14, 2025 1:32 am
MTHANNACH Published April 14, 2025



Intelligence is pervasive, yet measuring it seems subjective. At best, we approximate it through tests and benchmarks. Think of college entrance exams: every year, countless students sign up, memorize test-prep tricks and sometimes walk away with perfect scores. Does a single number, say 100%, mean that everyone who earned it shares the same intelligence, or that they have somehow maxed out their intelligence? Of course not. Benchmarks are approximations, not exact measurements of someone's (or something's) real capabilities.

The AI community has long relied on benchmarks such as MMLU (Massive Multitask Language Understanding) to evaluate model capabilities through multiple-choice questions across academic disciplines. This format allows straightforward comparisons, but it fails to truly capture intelligent capabilities.

Claude 3.5 Sonnet and GPT-4.5, for example, achieve similar scores on this benchmark. On paper, this suggests equivalent capabilities. Yet people who work with these models know that there are substantial differences in their real-world performance.

What does it mean to measure “intelligence” in AI?

On the heels of the new ARC-AGI benchmark release, a test designed to push models toward general reasoning and creative problem-solving, there is renewed debate about what it means to measure “intelligence” in AI. Although not everyone has tested the ARC-AGI benchmark yet, the industry welcomes this and other efforts to evolve testing frameworks. Every benchmark has its merit, and ARC-AGI is a promising step in that broader conversation.

Another notable recent development in AI evaluation is “Humanity's Last Exam,” a comprehensive benchmark containing 3,000 peer-reviewed, multi-step questions across various disciplines. Although this test is an ambitious attempt to challenge AI systems with expert-level reasoning, early results show rapid progress: OpenAI reportedly reached a score of 26.6% within a month of its release. However, like other traditional benchmarks, it mainly evaluates knowledge and reasoning in isolation, without testing the practical, tool-using capabilities that are increasingly crucial for real-world AI applications.

In one example, several state-of-the-art models fail to correctly count the number of “r”s in the word strawberry. In another, they incorrectly identify 3.8 as being smaller than 3.1111. These kinds of failures, on tasks that even a young child or a basic calculator could solve, expose a gap between benchmark-driven progress and real-world robustness. They remind us that intelligence is not only about passing exams, but about reliably navigating everyday logic.
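To make concrete how trivial these two tasks are for conventional software, here is a minimal Python sketch. The function names `count_letter` and `is_smaller` are illustrative and do not come from any benchmark.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def is_smaller(a: str, b: str) -> bool:
    """Compare two decimal strings numerically, not character by character."""
    return Decimal(a) < Decimal(b)


print(count_letter("strawberry", "r"))  # 3
print(is_smaller("3.8", "3.1111"))      # False: 3.8 is the larger number
```

The point is not that models should call such helpers, only that both answers are deterministic and easy to check.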

The new standard for measuring AI capability

As models have advanced, these traditional benchmarks have shown their limits: GPT-4 with tools reaches only 15% on the more complex, real-world tasks in the GAIA benchmark, despite impressive scores on multiple-choice tests.

This disconnect between benchmark performance and practical capability has become increasingly problematic as AI systems move from research environments into business applications. Traditional benchmarks test knowledge recall but miss crucial aspects of intelligence: the ability to gather information, execute code, analyze data and synthesize solutions across multiple domains.

GAIA is the needed shift in AI evaluation methodology. Created through a collaboration between the Meta-FAIR, Meta-GenAI, Hugging Face and AutoGPT teams, the benchmark includes 466 carefully crafted questions across three difficulty levels. These questions test web browsing, multimodal understanding, code execution, file handling and complex reasoning, all capabilities that are essential for real-world AI applications.

Level 1 questions require roughly 5 steps and one tool for a human to solve. Level 2 questions demand 5 to 10 steps and several tools, while Level 3 questions can require up to 50 discrete steps and any number of tools. This structure mirrors the real complexity of business problems, where solutions rarely come from a single action or tool.
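The three tiers can be sketched as a rough heuristic in Python. This is only an illustration of the step and tool ranges described above: the `Task` class, the `rough_level` function and the exact cut-offs are assumptions, not GAIA's actual annotation procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    steps: int  # discrete steps a human needed to solve the task
    tools: int  # distinct tools a human used along the way


def rough_level(task: Task) -> int:
    """Map a task to a difficulty tier (hypothetical thresholds)."""
    if task.steps <= 5 and task.tools <= 1:
        return 1  # roughly 5 steps, at most one tool
    if task.steps <= 10:
        return 2  # 5 to 10 steps, several tools
    return 3      # longer chains, up to about 50 steps, any number of tools


print(rough_level(Task(steps=4, tools=1)))   # 1
print(rough_level(Task(steps=8, tools=3)))   # 2
print(rough_level(Task(steps=40, tools=6)))  # 3
```

In this sketch, a short task that needs more than one tool falls into Level 2, which reflects the idea that tool count matters as much as step count.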

By prioritizing flexibility over complexity, one AI model reached 75% accuracy on GAIA, outperforming industry giants Microsoft's Magnetic-1 (38%) and Google's Langfun Agent (49%). That result stems from combining specialized models for audio-visual understanding and reasoning, with Anthropic's Sonnet 3.5 as the primary model.

This evolution in AI evaluation reflects a broader shift in the industry: we are moving from standalone SaaS applications to AI agents that can orchestrate multiple tools and workflows. As businesses increasingly rely on AI systems to handle complex, multi-step tasks, benchmarks like GAIA provide a more meaningful measure of capability than traditional multiple-choice tests.

The future of AI evaluation lies not in isolated knowledge tests but in comprehensive assessments of problem-solving ability. GAIA sets a new standard for measuring AI capability, one that better reflects the challenges and opportunities of real-world AI deployment.

Sri Ambati is the founder and CEO of H2O.AI.
