Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data


Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark test or that evaluation matrix.

However, these benchmarks often test general capabilities. For organizations that want to use models and LLM-based agents, it is harder to evaluate how well the agent or the model actually understands their specific needs.

Model repository Hugging Face launched Yourbench, an open-source tool where developers and enterprises can create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, who is part of the evaluations research team at Hugging Face, announced Yourbench on X. The feature offers “custom benchmarking and synthetic data generation from ANY of your documents. It's a big step towards improving how model evaluations work.”

He added that Hugging Face knows “that for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you.”

Creating custom evaluations

Hugging Face said in a paper that Yourbench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings.”
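One way to read "preserving the relative model performance rankings" is that models finish in the same order on the replicated subset as on the original benchmark. The sketch below checks this with a hand-written Spearman rank correlation. The model names and scores are hypothetical and chosen only for illustration; they are not figures from the paper, and this is not necessarily the statistic the authors used.

```python
def ranks(scores):
    """Map each model name to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman(a, b):
    """Spearman rank correlation for two score dicts over the same models.

    Assumes no tied scores and at least two models.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n**2 - 1))


# Hypothetical accuracies, for illustration only.
original = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61, "model_d": 0.55}
replica = {"model_a": 0.79, "model_b": 0.70, "model_c": 0.66, "model_d": 0.48}

rho = spearman(original, replica)
print(rho)
```

A value of 1 means the two benchmarks order the models identically even though the absolute scores differ; a value of -1 means the order is fully reversed.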

Organizations need to pre-process their documents before Yourbench can work. This involves three stages:

Document ingestion, to “normalize” file formats.
Semantic chunking, to break documents down so they fit within context window limits and focus the model's attention.
Document summarization.
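The three stages above can be sketched as a small pipeline. This is an illustrative stand-in, not Yourbench's code: the function names are hypothetical, the chunker packs whole paragraphs under a word budget rather than doing true semantic chunking, and the summarizer simply takes the first sentence of each chunk where a real pipeline would call an LLM.

```python
import re


def ingest(raw: str) -> str:
    """Normalize text: unify newlines, collapse spaces/tabs, trim blank runs."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def summarize(chunks: list[str]) -> str:
    """Placeholder summary: the first sentence of each chunk, joined."""
    firsts = [re.split(r"(?<=[.!?])\s+", c.strip(), maxsplit=1)[0] for c in chunks]
    return " ".join(firsts)


raw = (
    "Policy A  applies to all staff.\r\nIt is reviewed yearly.\r\n\r\n\r\n"
    "Policy B covers\tcontractors. It is new."
)
clean = ingest(raw)
pieces = chunk(clean, max_words=10)
summary = summarize(pieces)
print(len(pieces), summary)
```

Keeping the stages as separate functions mirrors the description: each one can be swapped for a stronger implementation (a real file parser, an embedding-based chunker, an LLM summarizer) without touching the others.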

Next comes the question-and-answer generation process, which creates questions from the information in the documents. This is where users bring in their chosen LLMs to see which one answers the questions best.
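To make the comparison step concrete, here is a minimal sketch that scores several models on a set of generated question-answer pairs and ranks them. Everything in it is an assumption for illustration: the "models" are plain Python functions standing in for LLM calls, the questions are invented, and exact-match scoring is a simplification that is not claimed to be Yourbench's method.

```python
def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(text.lower().split()).rstrip(".")


def accuracy(model, qa_pairs) -> float:
    """Fraction of questions whose normalized answer matches exactly."""
    correct = sum(normalize(model(q)) == normalize(a) for q, a in qa_pairs)
    return correct / len(qa_pairs)


def leaderboard(models, qa_pairs):
    """Return (name, accuracy) pairs sorted from best to worst."""
    scores = {name: accuracy(fn, qa_pairs) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical question-answer pairs derived from an internal document.
qa_pairs = [
    ("How often is Policy A reviewed?", "Yearly"),
    ("Who does Policy B cover?", "Contractors"),
]


def model_x(question: str) -> str:
    """Stand-in for an LLM call that answers both questions correctly."""
    answers = {
        "How often is Policy A reviewed?": "yearly.",
        "Who does Policy B cover?": "Contractors",
    }
    return answers.get(question, "")


def model_y(question: str) -> str:
    """Stand-in for an LLM call that always gives the same answer."""
    return "Yearly"


board = leaderboard({"model_x": model_x, "model_y": model_y}, qa_pairs)
print(board)
```

In practice, free-form answers usually need a more forgiving comparison than exact match, such as a rubric or a judge model, but the ranking logic stays the same.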

Hugging Face tested Yourbench with DeepSeek V3 and R1 models, Alibaba's Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral 3.1 Small, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o-mini and o3 mini, and Claude 3.7 and Claude 3.5 Haiku.

Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce considerable value for very, very low costs.”

Compute limitations

However, creating custom LLM benchmarks based on an organization's documents comes at a cost. Yourbench requires a lot of compute power to work. Shashidhar said on X that the company is “adding capacity” as fast as it can.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about Yourbench's compute usage.

Benchmarking is not perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not perfectly capture how the models will work day to day.

Some have even voiced skepticism that benchmark tests show models' limitations, arguing that they can lead to false conclusions about their safety and performance. A study also warned that benchmarking agents could be “misleading.”

However, enterprises cannot avoid evaluating models now that there are many choices on the market, and technology leaders must justify the rising cost of using AI models. This has led to different methods for testing model performance and reliability.

Google DeepMind introduced FACTS Grounding, which tests a model's ability to generate factually accurate responses based on information from documents. Some researchers from Yale and Tsinghua University developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them.
