Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

MTHANNACH
Last updated: April 2, 2025 11:06 pm
MTHANNACH Published April 2, 2025

Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark test or that evaluation matrix.

However, these benchmarks often test general capabilities. For organizations that want to use models and large language model-based agents, it is harder to evaluate how well the agent or model actually understands their specific needs.

Model repository Hugging Face has launched Yourbench, an open-source tool where developers and enterprises can create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced Yourbench on X. The feature offers “custom benchmarking and synthetic data generation from any of your documents. It’s a big step towards improving how model evaluations work.”

He added that Hugging Face knows “that for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you.”

Creating custom evaluations

Hugging Face said in a paper that Yourbench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings.”
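One common way to check whether a cheaper benchmark preserves relative model rankings is a rank correlation. The sketch below computes Spearman's rank correlation in plain Python; the article does not say which statistic Hugging Face used, and the scores are hypothetical numbers made up for illustration, not results from the paper.

```python
def ranks(scores):
    """Return 1-based ranks (1 = highest score); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = avg_rank
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical accuracies for four models on a reference benchmark and on a
# cheaper regenerated subset. Absolute scores differ; the ordering does not.
reference = [0.82, 0.74, 0.69, 0.55]
regenerated = [0.71, 0.66, 0.58, 0.40]
rho = spearman(reference, regenerated)
print(rho)
```

A value of 1.0 means the two benchmarks order the models identically, which is what "perfectly preserving" the rankings would look like; the function does not handle the degenerate case where every model has the same score.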

Organizations need to pre-process their documents before Yourbench can work. This involves three stages:

  • Document ingestion to “normalize” file formats.
  • Semantic chunking to break down documents so they fit within context window limits and focus the model’s attention.
  • Document summarization.
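The three stages above can be sketched in a few lines of Python. This is a toy illustration, not Yourbench's implementation or API: every function name is hypothetical, word overlap stands in for real semantic similarity, and the "summary" simply takes the leading sentence where a real setup would likely call a model.

```python
import re


def ingest(raw: str) -> str:
    """Normalize a document: here, just collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw).strip()


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, a crude stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def chunk(sentences: list[str], max_words: int = 40, min_sim: float = 0.1) -> list[str]:
    """Start a new chunk when the word budget is hit or the topic seems to shift."""
    chunks, current = [], []
    for sentence in sentences:
        used = sum(len(s.split()) for s in current)
        too_long = used + len(sentence.split()) > max_words
        off_topic = bool(current) and similarity(current[-1], sentence) < min_sim
        if current and (too_long or off_topic):
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize(text: str, max_sentences: int = 1) -> str:
    """Placeholder summary: the leading sentence(s) of the document."""
    return " ".join(split_sentences(text)[:max_sentences])


raw = ("Refunds are issued within 14 days.   Refunds are sent to the original card.\n\n"
       "Office hours run from nine until five.")
clean = ingest(raw)
chunks = chunk(split_sentences(clean))
summary = summarize(clean)
print(chunks)
print(summary)
```

Here the two refund sentences share enough words to stay in one chunk, while the office-hours sentence starts a new one. The `max_words` budget plays the role of the context window limit.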

Next comes the question-and-answer generation process, which creates questions from the information in the documents. This is where users bring in their chosen LLMs to see which one best answers the questions.
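A minimal sketch of that last step, under loud assumptions: the models are stub functions rather than real LLM calls, question generation is a template rather than a model, and scoring is a simple substring match. None of these names come from Yourbench; they only show the shape of "generate questions from documents, then rank candidate models on them".

```python
from typing import Callable

# Hypothetical type: a "model" is any callable from a prompt to an answer.
Model = Callable[[str], str]


def make_questions(facts: dict[str, str]) -> list[tuple[str, str]]:
    """Stand-in for LLM question generation: one question per known fact."""
    return [(f"What is the {topic}?", answer) for topic, answer in facts.items()]


def score(model: Model, questions: list[tuple[str, str]]) -> float:
    """Fraction of questions whose reference answer appears in the model output."""
    hits = sum(1 for q, ref in questions if ref.lower() in model(q).lower())
    return hits / len(questions)


def rank_models(models: dict[str, Model],
                questions: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from best to worst."""
    scored = [(name, score(model, questions)) for name, model in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy "document facts" and two stub models, purely for illustration.
facts = {"refund window": "14 days", "support email": "help@example.com"}
questions = make_questions(facts)


def careful_model(prompt: str) -> str:
    if "refund" in prompt:
        return "Refunds are accepted within 14 days."
    return "Write to help@example.com."


def vague_model(prompt: str) -> str:
    if "refund" in prompt:
        return "Refunds are accepted within 14 days."
    return "I am not sure."


leaderboard = rank_models({"vague": vague_model, "careful": careful_model}, questions)
print(leaderboard)
```

In practice the substring match would be replaced by a stronger grader, since free-form answers rarely repeat the reference text word for word.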

Hugging Face tested Yourbench with DeepSeek V3 and R1 models, Alibaba’s Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral 3.1 Small, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o-mini and o3 mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.

Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce considerable value for very very low costs.”

Compute limitations

However, creating custom LLM benchmarks based on an organization’s documents comes at a cost. Yourbench requires a lot of compute power to work. Shashidhar said on X that the company is “adding capacity” as fast as it can.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat has reached out to Hugging Face about Yourbench’s compute usage.

Benchmarking is not perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not perfectly capture how the models will work day to day.

Some have even voiced skepticism that benchmark tests reveal models’ limitations, warning that they can lead to false conclusions about safety and performance. A study also warned that benchmarking agents could be “misleading.”

However, enterprises cannot avoid evaluating models now that there are many choices on the market and technology leaders must justify the rising cost of using AI models. This has led to different methods for testing model performance and reliability.

Google DeepMind introduced FACTS Grounding, which tests a model’s ability to generate factually accurate responses based on information from documents. Researchers from Yale and Tsinghua University developed self-invoking code benchmarks to help enterprises determine which coding LLMs work for them.
