Every new AI model release inevitably comes with charts touting how it outperformed its competitors on this benchmark test or that evaluation matrix.
However, these benchmarks often test for general capabilities. For organizations that want to use models and LLM-based agents, it is harder to assess how well the agent or model actually understands their specific needs.
Model repository Hugging Face has launched YourBench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data.
Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced YourBench on X. The feature offers “custom benchmarking and synthetic data generation from any of your documents. It’s a big step towards improving how model evaluations work.”
He added that Hugging Face knows “that for many use cases, what really matters is how well a model performs your specific task. YourBench lets you evaluate models on what matters to you.”
Creating custom evaluations
Hugging Face said in a paper that YourBench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving relative model performance rankings.”
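For context, the MMLU benchmark that YourBench replicates in miniature is publicly hosted on the Hugging Face Hub. Here is a minimal sketch of loading one subject subset, assuming the `datasets` library and the public `cais/mmlu` dataset ID; it is illustrative context, not part of YourBench itself.

```python
# Minimal sketch: load one MMLU subject subset from the Hugging Face Hub.
# Assumes the `datasets` library and the public `cais/mmlu` dataset ID.
from datasets import load_dataset

mmlu_astronomy = load_dataset("cais/mmlu", "astronomy", split="test")

# Each row has a question, four answer choices, and the index of the right one.
for row in mmlu_astronomy.select(range(3)):
    print(row["question"])
    print(row["choices"], "->", row["answer"])
```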
Organizations need to preprocess their documents before YourBench can work. This involves three steps (sketched in code after the list):
- Document ingestion to “normalize” file formats.
- Semantic chunking to break documents down to fit context window limits and focus the model’s attention.
- Document summarization.
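The sketch below illustrates those three stages in simplified form. The function names, the paragraph-based chunking heuristic, and the file name are hypothetical stand-ins, not YourBench’s actual API; real semantic chunking would typically rely on embeddings rather than paragraph boundaries.

```python
# Illustrative sketch of the three preprocessing stages described above.
# Function names, the chunking heuristic, and the file name are hypothetical.
from pathlib import Path

def ingest(path: str) -> str:
    """Normalize a source file into plain text (real ingestion would handle PDF, DOCX, HTML, ...)."""
    return Path(path).read_text(encoding="utf-8")

def chunk(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into pieces small enough for a model's context window."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def summarize(chunks: list[str]) -> str:
    """Placeholder for an LLM call that condenses the document into a short summary."""
    return " ".join(c[:200] for c in chunks)

text = ingest("internal_policy.txt")   # hypothetical internal document
chunks = chunk(text)
summary = summarize(chunks)
```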
Next comes the question-and-answer generation process, which creates questions based on the information in the documents. This is where users bring in their chosen LLMs to see which one answers the questions best.
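A hedged sketch of what that generation step can look like, using the `InferenceClient` from `huggingface_hub`; the model choice, prompt wording, and example strings are illustrative assumptions rather than YourBench’s own implementation.

```python
# Sketch: ask an LLM to write question-answer pairs grounded in a document excerpt.
# Model choice, prompt wording, and example strings are illustrative only.
from huggingface_hub import InferenceClient

# Any chat-capable model on the Hub could stand in here; requires an HF API token.
client = InferenceClient(model="Qwen/Qwen2.5-72B-Instruct")

def generate_qa(excerpt: str, summary: str, n: int = 3) -> str:
    """Request n question-answer pairs that are answerable from the excerpt alone."""
    prompt = (
        f"Document summary:\n{summary}\n\n"
        f"Excerpt:\n{excerpt}\n\n"
        f"Write {n} question-answer pairs that can be answered from the excerpt alone."
    )
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content

print(generate_qa("The Q3 report shows revenue grew 12% year over year.",
                  "A quarterly financial report."))
```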
Hugging Face tested YourBench with DeepSeek’s V3 and R1 models, Alibaba’s Qwen models including the Qwen QwQ reasoning model, Mistral Large 2411 and Mistral Small 3.1, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o and o3 mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.
Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce tremendous value for very, very low cost.”
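That kind of cost analysis comes down to simple token arithmetic. A back-of-the-envelope sketch follows; the per-million-token prices and token counts are placeholders, not actual provider pricing.

```python
# Back-of-the-envelope inference cost estimate of the kind a cost analysis reports.
# Prices and token counts below are placeholders, not actual provider pricing.
def inference_cost(prompt_tokens: int, completion_tokens: int,
                   price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost in dollars given per-million-token input and output prices."""
    return (prompt_tokens / 1e6) * price_in_per_m + (completion_tokens / 1e6) * price_out_per_m

# e.g. 2M prompt tokens and 0.5M completion tokens for a full benchmark run
print(inference_cost(2_000_000, 500_000, price_in_per_m=0.10, price_out_per_m=0.40))
```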
Compute limitations
However, creating custom LLM benchmarks based on an organization’s documents comes at a price. YourBench requires a lot of compute power to run. Shashidhar said on X that the company is “adding capacity” as fast as it can.
Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat has reached out to Hugging Face about YourBench’s compute usage.
Benchmarking is not perfect
Benchmarks and other evaluation methods give users an idea of how well models perform, but they don’t perfectly capture how the models will work day to day.
Some have even expressed skepticism that benchmark tests show models’ limitations and can lead to false conclusions about their safety and performance. A study has also warned that benchmarking agents could be “misleading.”
However, enterprises cannot avoid evaluating models now that there are so many choices on the market, and technology leaders must justify the rising cost of using AI models. This has led to different methods of testing model performance and reliability.
Google DeepMind introduced FACTS Grounding, which tests a model’s ability to generate factually accurate responses based on information from documents. Researchers from Yale and Tsinghua University developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them.