Beyond ARC-AGI: GAIA and the search for a real intelligence benchmark

MTHANNACH
Last updated: April 14, 2025 1:32 am
MTHANNACH Published April 14, 2025

Intelligence is pervasive, yet measuring it seems subjective. At best, we approximate it through tests and benchmarks. Think of college entrance exams: every year, countless students sign up, memorize test-prep tricks and sometimes walk away with perfect scores. Does a single number, say 100%, mean that those who earned it share the same intelligence, or that they have somehow maxed out their intelligence? Of course not. Benchmarks are approximations, not exact measurements of someone’s (or something’s) real capabilities.

The AI community has long relied on benchmarks such as MMLU (Massive Multitask Language Understanding) to evaluate model capabilities through multiple-choice questions across academic disciplines. This format enables straightforward comparisons, but it fails to truly capture intelligent capabilities.

Claude 3.5 Sonnet and GPT-4.5, for example, achieve similar scores on this benchmark. On paper, this suggests equivalent capabilities. Yet people who work with these models know that their real-world performance differs substantially.
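To make the point concrete, here is a minimal sketch of how a multiple-choice benchmark collapses a model's behavior into one number. The answer key and the two sets of model answers are invented toy data, not real MMLU items or real model outputs, and `accuracy` is a hypothetical helper written for this illustration.

```python
def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of multiple-choice answers that match the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    correct = sum(p == k for p, k in zip(predictions, answer_key))
    return correct / len(answer_key)


# Toy data (assumption): four questions, two hypothetical models.
answer_key = ["A", "C", "B", "D"]
model_a = ["A", "C", "B", "A"]  # misses the last question
model_b = ["B", "C", "B", "D"]  # misses the first question

print(accuracy(model_a, answer_key))  # 0.75
print(accuracy(model_b, answer_key))  # 0.75
```

Both hypothetical models score 0.75 even though they fail on different questions, which is one way identical headline scores can hide different behavior.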

What does it mean to measure “intelligence” in AI?

On the heels of the new ARC-AGI benchmark release, a test designed to push models toward general reasoning and creative problem-solving, there is renewed debate about what it means to measure “intelligence” in AI. Although not everyone has tested the ARC-AGI benchmark yet, the industry welcomes this and other efforts to evolve testing frameworks. Every benchmark has its merit, and ARC-AGI is a promising step in this broader conversation.

Another notable recent development in AI evaluation is “Humanity’s Last Exam,” a comprehensive benchmark containing 3,000 peer-reviewed, multi-step questions across various disciplines. Although this test is an ambitious attempt to challenge AI systems with expert-level reasoning, early results show rapid progress: OpenAI reportedly reached a score of 26.6% within a month of its release. However, like other traditional benchmarks, it mainly evaluates knowledge and reasoning in isolation, without testing the practical, tool-using capabilities that are increasingly crucial for real-world applications.

In one example, several state-of-the-art models fail to correctly count the number of “r”s in the word strawberry. In another, they incorrectly identify 3.8 as being smaller than 3.1111. These kinds of failures, on tasks that even a young child or a basic calculator could solve, expose a mismatch between benchmark-driven progress and real-world robustness. They remind us that intelligence is not just about passing exams, but about reliably navigating everyday logic.
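For reference, both tasks take a few lines of ordinary code. The sketch below uses only the Python standard library; the function names `count_letter` and `is_smaller` are made up for this illustration.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def is_smaller(a: str, b: str) -> bool:
    """Compare two decimal numbers given as strings, exactly."""
    return Decimal(a) < Decimal(b)


print(count_letter("strawberry", "r"))  # 3
print(is_smaller("3.8", "3.1111"))      # False: 3.8 is the larger number
```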

The new standard for measuring AI capability

As models have advanced, these traditional benchmarks have shown their limitations: GPT-4 with tools reaches only about 15% on the more complex, real-world tasks in the GAIA benchmark, despite impressive scores on multiple-choice tests.

This disconnect between benchmark performance and practical capability has become increasingly problematic as AI systems move from research environments into business applications. Traditional benchmarks test knowledge recall but miss crucial aspects of intelligence: the ability to gather information, execute code, analyze data and synthesize solutions across multiple domains.

GAIA is the needed shift in AI evaluation methodology. Created through a collaboration between the Meta-FAIR, Meta-GenAI, Hugging Face and AutoGPT teams, the benchmark includes 466 carefully crafted questions across three difficulty levels. These questions test web browsing, multimodal understanding, code execution, file handling and complex reasoning, all capabilities essential for real-world applications.

Level 1 questions require around 5 steps and one tool for humans to solve. Level 2 questions require 5 to 10 steps and several tools, while Level 3 questions may require up to 50 discrete steps and any number of tools. This structure mirrors the real complexity of business problems, where solutions rarely come from a single action or tool.
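The three tiers can be pictured with a rough sketch. It rests on assumptions: the article gives only approximate figures, so the exact thresholds, the `TaskProfile` type and the `approximate_level` function below are hypothetical illustrations, not how GAIA itself assigns levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    steps: int  # steps a human needs to solve the task
    tools: int  # distinct tools involved


def approximate_level(task: TaskProfile) -> int:
    """Map a task to a difficulty tier using illustrative cutoffs (assumed)."""
    if task.steps <= 5 and task.tools <= 1:
        return 1  # roughly 5 steps, one tool
    if task.steps <= 10:
        return 2  # 5 to 10 steps, several tools
    return 3      # longer chains, any number of tools


print(approximate_level(TaskProfile(steps=4, tools=1)))   # 1
print(approximate_level(TaskProfile(steps=8, tools=3)))   # 2
print(approximate_level(TaskProfile(steps=40, tools=6)))  # 3
```

The cutoffs are deliberately simple. The point is only that difficulty here grows with the length of the action chain and the number of tools, not with how obscure the required knowledge is.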

By prioritizing flexibility over complexity, an AI model reached 75% accuracy on GAIA, outperforming industry offerings Microsoft’s Magnetic-1 (38%) and Google’s Langfun Agent (49%). Its success stems from combining specialized models for audio-visual understanding and reasoning, with Anthropic’s Sonnet 3.5 as the primary model.

This evolution in AI evaluation reflects a broader shift in the industry: we are moving from standalone SaaS applications to AI agents that can orchestrate multiple tools and workflows. As businesses increasingly rely on AI systems to handle complex, multi-step tasks, benchmarks like GAIA provide a more meaningful measure of capability than traditional multiple-choice tests.

The future of AI evaluation lies not in isolated knowledge tests but in comprehensive assessments of problem-solving ability. GAIA sets a new standard for measuring AI capability, one that better reflects the challenges and opportunities of real-world AI deployment.

Sri Ambati is the founder and CEO of H2O.AI.
