The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a challenging new test to measure the general intelligence of leading AI models.
So far, the new test, called ARC-AGI-2, has stumped most models.
"Reasoning" models like OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.
ARC-AGI tests consist of puzzle-like problems in which an AI must identify visual patterns in a collection of differently colored squares and generate the correct "answer" grid. The problems are designed to force an AI to adapt to new problems it has never seen before.
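To make that puzzle format concrete, here is a minimal illustrative sketch in Python. It assumes the JSON-style layout used by the public ARC-AGI dataset (grids as 2D lists of integers, each integer encoding a color); the mirroring rule is an invented toy example, not an actual ARC-AGI-2 task.

```python
# A minimal sketch of an ARC-style task. Grids are 2D lists of integers 0-9,
# where each integer encodes a color. The transformation here (horizontal
# mirroring) is a toy example for illustration only.
task = {
    "train": [  # a few demonstration pairs from which the rule must be inferred
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 4], [5, 0]]}],  # the solver must produce the output grid
}

def mirror(grid):
    """Candidate rule: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# Check the candidate rule against every demonstration pair...
assert all(mirror(pair["input"]) == pair["output"] for pair in task["train"])

# ...then apply it to the unseen test input to produce the "answer" grid.
print(mirror(task["test"][0]["input"]))  # [[4, 0], [0, 5]]
```

The point of the format is that each task demonstrates a novel rule through only a handful of examples, so a solver cannot lean on patterns memorized during training.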
The Arc Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, "panels" of these people answered 60% of the test's questions correctly, far better than any of the models' scores.
In a blog post, Chollet claimed that ARC-AGI-2 is a better measure of an AI model's actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation's tests are meant to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on.
Chollet said that, unlike ARC-AGI-1, the new test prevents AI models from relying on "brute force," meaning extensive computing power, to find solutions. Chollet previously acknowledged that this was a major flaw of ARC-AGI-1.
To address the first test's flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.
"Intelligence is not solely defined by the ability to solve problems or achieve high scores," a co-founder of the Arc Prize Foundation wrote in a blog post. "The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, 'Can AI acquire [the] skill to solve a task?' but also, 'At what efficiency or cost?'"
ARC-AGI-1 went unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3's performance gains on ARC-AGI-1 came at a steep price.
The version of OpenAI's o3 model, o3 (low), that was the first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 while using $200 of computing power per task.
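Those per-task dollar figures are central to the new efficiency metric. Below is an illustrative sketch of cost-aware scoring in that spirit; the `Attempt` record and the averaging are assumptions for illustration, not the foundation's official accounting.

```python
# Illustrative cost-aware scoring in the spirit of ARC-AGI-2's efficiency
# metric: report accuracy alongside average compute spend per task.
# This is an assumed accounting, not the Arc Prize Foundation's formula.
from dataclasses import dataclass

@dataclass
class Attempt:
    correct: bool    # did the model produce the right answer grid?
    cost_usd: float  # compute spend attributed to this task

def summarize(attempts: list[Attempt]) -> tuple[float, float]:
    """Return (accuracy, average cost per task) for a model's run."""
    accuracy = sum(a.correct for a in attempts) / len(attempts)
    cost_per_task = sum(a.cost_usd for a in attempts) / len(attempts)
    return accuracy, cost_per_task

# Figures from the article: o3 (low) scored about 4% on ARC-AGI-2 at roughly
# $200 of compute per task; the 2025 contest targets 85% at $0.42 per task.
acc, cost = summarize([Attempt(i < 4, 200.0) for i in range(100)])
print(f"o3 (low)-style run: {acc:.0%} at ${cost:.2f}/task")
```

Framed this way, two models with identical accuracy are no longer equivalent: the one that reaches the score at a fraction of the compute spend ranks as the more intelligent system under the benchmark's stated philosophy.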

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks adequate tests to measure the key traits of artificial general intelligence, including creativity.
Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.