A debate over AI benchmarks, and how AI labs report them, has spilled out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, insisted that the company was in the right.
The truth is somewhere between the two.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X quickly pointed out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What is that, you might ask? Well, it's short for "consensus@64," and it essentially gives a model 64 tries to answer each problem in a benchmark and takes the answers it generates most frequently as its final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make it appear as though one model surpasses another when in reality that isn't the case.
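To make the idea concrete, consensus@k amounts to a majority vote over repeated samples from the same model. The sketch below is an illustration, not xAI's or OpenAI's actual evaluation code; `sample_answer` is a hypothetical stand-in for querying a model:

```python
from collections import Counter
import random

def consensus_at_k(sample_answer, problem, k=64):
    """Sample k candidate answers and return the most frequent one (majority vote)."""
    answers = [sample_answer(problem) for _ in range(k)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Toy stand-in "model": answers "42" about 60% of the time, otherwise guesses wrong.
def noisy_model(problem):
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

random.seed(0)
# A single sample (the "@1" setting) can easily be wrong, but the 64-way
# majority vote almost always lands on the model's most frequent answer.
print(consensus_at_k(noisy_model, "some AIME problem"))  # prints "42"
```

This is why cons@64 flatters a model: even an answer the model produces only a plurality of the time becomes its reported answer, while an @1 score reflects a single, possibly unlucky, attempt.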
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok, when in reality it's DeepSeek propaganda
(I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@"1" deserves more scrutiny.) pic.twitter.com/3wh8foufic — Teortaxes ▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.