Meta has recently introduced its Llama 4 series of AI models, making waves by outperforming GPT-4o and Gemini 2.0 Pro on Chatbot Arena, the leaderboard run by LMSYS. The company boasts that its Llama 4 Maverick model, which uses a mixture-of-experts (MoE) architecture to activate just 17 billion of its 400 billion total parameters across 128 experts, scored an impressive Elo of 1,417 on the Chatbot Arena benchmark.
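For readers unfamiliar with how a mixture-of-experts model activates only part of its parameters, the sketch below shows the basic mechanism: a learned router scores a set of expert networks and sends each token to only the top-scoring ones, so only a fraction of the total parameters runs per token. This is a minimal illustration under assumed names (TopKMoE, d_model, d_ff), not Meta's actual implementation.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: a learned router sends each
    token to only k experts, so just a fraction of all parameters is active."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 128, k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        scores = self.router(x)                             # (tokens, num_experts)
        weights, idx = torch.topk(scores, self.k, dim=-1)   # pick k experts per token
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```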
This achievement caught the attention of the AI community, especially since the smaller MoE model surpassed significantly larger language models like GPT-4.5 and Grok 3. The unexpected results prompted many AI enthusiasts to independently evaluate the model. However, it turned out that Llama 4 Maverick’s real-world performance did not live up to Meta’s benchmark claims, particularly in coding tasks.
On 1Point3Acres, a widely used forum for North American Chinese users, a post from someone claiming to be a former Meta employee stirred controversy. This post, which has since been translated into English on Reddit, alleged that Meta’s leadership may have combined various test sets during the post-training phase to artificially boost the benchmark scores and fulfill internal objectives.
The departing employee expressed their disapproval of this practice and subsequently resigned. They also requested that their name be removed from the Llama 4 technical report. Furthermore, the employee asserted that the recent resignation of Meta’s AI research head, Joelle Pineau, is closely tied to the alleged manipulations surrounding the Llama 4 benchmarks.
In light of these allegations, Ahmad Al-Dahle, who leads Meta’s Generative AI division, responded with a post on X, strongly refuting claims that Llama 4 had been post-trained on test sets. Al-Dahle stated:
We’ve also heard claims that we trained on test sets — that’s simply not true, and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
He acknowledged the varying performance of Llama 4 across different platforms, urging the AI community to allow a few days for the implementation to stabilize.
LMSYS Addresses Allegations of Llama 4 Benchmark Manipulation
In response to the rising concerns from the AI community, LMSYS — the organization behind the Chatbot Arena leaderboard — released a statement aimed at enhancing transparency. LMSYS clarified that the model submitted to Chatbot Arena was “Llama-4-Maverick-03-26-Experimental,” a custom variant tailored for human preferences.
LMSYS admitted that “style and model response tone were significant factors” which may have inadvertently benefited the custom Llama 4 Maverick model. They acknowledged that this crucial information was not communicated clearly by Meta. Additionally, LMSYS noted that “Meta’s interpretation of our policy did not align with our expectations from model providers.”
To be fair, Meta did mention in its official Llama 4 blog that an “experimental chat version” achieved an Elo of 1,417 on Chatbot Arena, but it did not provide additional details.
To promote transparency further, LMSYS included the Hugging Face version of Llama 4 Maverick in Chatbot Arena and has also released more than 2,000 comparative battle results for public scrutiny. These results cover prompts, model responses, and user preferences.
Upon reviewing the battle results, it was surprising to find that users frequently preferred Llama 4’s answers even when they were verbose or contained inaccuracies. This raises significant questions about the reliability of community-driven benchmarks like Chatbot Arena.
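For context on where a score like 1,417 comes from, Arena-style leaderboards aggregate exactly these pairwise human votes into a rating. Chatbot Arena's current methodology fits a Bradley-Terry style model, but a classic online Elo update conveys the intuition. The sketch below uses made-up battle records and an assumed k-factor; it is not LMSYS's actual pipeline.

```python
def update_elo(ratings: dict, model_a: str, model_b: str, winner: str,
               k: float = 32.0, base: float = 1000.0) -> None:
    """One online Elo update for a single head-to-head battle.
    winner is 'a', 'b', or 'tie'."""
    ra, rb = ratings.get(model_a, base), ratings.get(model_b, base)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # expected score for model A
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

# Toy usage with hypothetical battle records (model_a, model_b, winner).
battles = [
    ("llama-4-maverick-experimental", "gpt-4o", "a"),
    ("gpt-4o", "llama-4-maverick-experimental", "tie"),
]
ratings: dict = {}
for a, b, w in battles:
    update_elo(ratings, a, b, w)
print(ratings)
```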
This is not the first time Meta has faced accusations of manipulating benchmarks through data contamination, the practice of mixing benchmark datasets into the training corpus. Earlier this year, Susan Zhang — a former Meta AI researcher who now works at Google DeepMind — shared a study in response to a post by Yann LeCun, Meta’s chief AI scientist.
The study indicated that over 50% of test samples from major benchmarks were included in Meta’s Llama 1 pretraining data. The paper reported significant contamination in key benchmarks like Big Bench Hard, HumanEval, HellaSwag, MMLU, PiQA, and TriviaQA.
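Contamination studies of this kind generally search the pretraining corpus for verbatim overlap with benchmark test samples. Methodologies differ across papers; the snippet below is only a simplified n-gram overlap check to illustrate the idea, with the function names, n-gram size, and threshold chosen for illustration rather than taken from the cited study.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Lowercased whitespace-token n-grams, a common unit in contamination checks."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(test_sample: str, corpus_docs: list, n: int = 13,
                    threshold: float = 0.8) -> bool:
    """Flag a benchmark sample if a large share of its n-grams appears
    verbatim in any pretraining document."""
    sample_grams = ngrams(test_sample, n)
    if not sample_grams:
        return False
    for doc in corpus_docs:
        overlap = len(sample_grams & ngrams(doc, n)) / len(sample_grams)
        if overlap >= threshold:
            return True
    return False

# Toy usage: a test sample that was copied verbatim into a training document.
corpus = ["assorted web text the capital of france is paris more web text"]
print(is_contaminated("The capital of France is Paris", corpus, n=5, threshold=0.5))  # True
```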
Now, with the new allegations surrounding Llama 4’s benchmarks, Zhang sharply remarked that Meta should at least credit their “previous work” from Llama 1 for this “unique approach.” Her comment suggests that such benchmark manipulation is not incidental but rather a deliberate strategy by Meta aimed at artificially inflating performance metrics.