A Meta exec on Monday denied a rumor that the company trained its new AI models to perform well on specific benchmarks while concealing the models' weaknesses.
The executive, Ahmad Al-Dahle, VP of generative AI at Meta, said in a post on X that it's "simply not true" that Meta trained its Llama 4 Maverick and Llama 4 Scout models on "test sets." In AI benchmarks, test sets are collections of data used to evaluate a model's performance after it has been trained. Training on a test set could misleadingly inflate a model's benchmark scores, making the model appear more capable than it actually is.
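The underlying concern is data leakage. As a minimal, hypothetical sketch of the general idea (a toy scikit-learn classifier, not Meta's models or any real benchmark), compare a score measured on truly held-out data with one measured on data that leaked into training:

```python
# Toy illustration of test-set contamination (assumed setup, not Meta's pipeline).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Proper protocol: the model never sees the test set during training.
clean = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clean.predict(X_test)))

# Contaminated protocol: the test set leaks into training, so the
# "benchmark" score partly measures memorization, not generalization.
leaky = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_train, X_test]), np.concatenate([y_train, y_test])
)
print("contaminated accuracy:", accuracy_score(y_test, leaky.predict(X_test)))
```

The contaminated score will typically come out higher, which is why training on a benchmark's test set is considered a form of gaming the evaluation.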
Over the weekend, an unfounded rumor that Meta gamed its new models' benchmark results began circulating on X and Reddit. The rumor appears to originate from a post on a Chinese social media site by a user claiming to have resigned from Meta in protest over the company's benchmarking practices.
Reports that Maverick and Scout perform poorly on certain tasks fueled the rumor, as did Meta's decision to use an experimental, unreleased version of Maverick to achieve better scores on the LM Arena benchmark. Researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena.
Al-Dahle acknowledged that some users are seeing "mixed quality" from Maverick and Scout across the different cloud providers hosting the models.
"Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in," said Al-Dahle. "We'll keep working through our bug fixes and onboarding partners."