Meta's new flagship AI language model family, Llama 4, arrived suddenly over the weekend, with the parent company of Facebook, Instagram, WhatsApp and Quest VR (among other services and products) revealing not one, not two, but three versions, all upgraded to be more powerful and efficient using the popular "mixture of experts" architecture and a new training method involving fixed hyperparameters, known as MetaP.
All three are also equipped with massive context windows, that is, the amount of information an AI language model can handle in a single input/output exchange with a user or tool.
But following the surprise announcement and public release on Saturday of two of those models for download and use (the smaller Llama 4 Scout and the mid-tier Llama 4 Maverick), the response from the AI community on social media has been less than adoring.
Llama 4 sparks confusion and criticism among AI users
An unverified post on the North American Chinese-language community forum 1point3acres made its way to the r/LocalLlama subreddit on Reddit, claiming to be from a researcher in Meta's GenAI organization who said the model performed poorly on third-party benchmarks internally and that company leadership "suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a 'presentable' result."
The post was met with skepticism from the community about its authenticity, and a VentureBeat email to a Meta spokesperson has not yet received a response.
But other users found reasons to doubt the benchmarks regardless.
"At this point, I highly suspect Meta botched something in the released weights ... if not, they should fire everyone who worked on it and then use the money to acquire us," commented @cto_junior on X, in reference to an independent user test showing Llama 4's poor performance (16%) on a benchmark known as Aider Polyglot, which runs a model through 225 coding tasks. That is well below the performance of comparably sized older models such as DeepSeek V3 and Claude 3.7 Sonnet.
Referring to the 10-million-token context window Meta touted for Llama 4 Scout, AI PhD and author Andriy Burkov wrote on X, in part, that "the declared 10M context is virtual because no model was trained on prompts longer than 256k tokens."
Also on the r/LocalLlama subreddit, user Dr_Karminski wrote, "I am incredibly disappointed with Llama-4," and demonstrated its poor performance compared to DeepSeek's non-reasoning V3 model on coding tasks such as simulating balls bouncing around a heptagon.
Former Meta researcher and current AI2 (Allen Institute for Artificial Intelligence) senior research scientist Nathan Lambert took to his Interconnects blog on Monday to point out that a benchmark comparison Meta posted on its own Llama download site, pitting Llama 4 Maverick against other models on cost-to-performance via the third-party comparison tool LMArena ELO (aka Chatbot Arena), actually used a different version of Llama 4 Maverick than the one the company itself had made publicly available, one "optimized for conversationality."
As Lambert wrote: "Sneaky. The results below are fake, and it is a major slight to Meta's community to not release the model they used to create their major marketing push. We've seen many open models that come around to maximize on Chatbot Arena while destroying the model's performance on important skills like math or code."
Lambert went on to note that while this particular model on the arena was "tanking the technical reputation of the release because its character is juvenile," including lots of emojis and frivolous emotional dialogue, "the actual model on other hosting providers is quite smart and has a reasonable tone!"
In response to the torrent of criticism and accusations of benchmark cooking, Meta's VP and head of GenAI Ahmad Al-Dahle took to X to state:
"We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models.
That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.
We've also heard claims that we trained on test sets; that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
We believe the Llama 4 models are a significant advancement and we're looking forward to working with the community to unlock their value."
Yet even this response was met with many complaints of poor performance and calls for further information, such as more technical documentation describing the Llama 4 models and their training processes, as well as questions about why this release, compared to all prior Llama releases, was so riddled with problems.
It also comes on the heels of the departure of Meta's VP of research Joelle Pineau, who worked in the adjacent Fundamental AI Research (FAIR) organization and who announced her exit from the company on LinkedIn last week with "nothing but admiration and deep gratitude for each of my managers." Pineau, it is worth noting, also promoted the release of the Llama 4 model family this weekend.
Llama 4 continues to roll out to other inference providers with mixed results, but it's safe to say the initial release of the model family was not a slam dunk with the AI community.
And the upcoming Meta LlamaCon on April 29, the first celebration and gathering for third-party developers of the model family, will likely have plenty of fodder for discussion. We'll be tracking it all, so stay tuned.