Technology

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

MTHANNACH
Published January 10, 2025 | Last updated: January 10, 2025 3:30 pm



As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are becoming less and less useful.

Even though many LLMs now score similarly high on these benchmarks, it can be difficult to determine which ones to use for a specific software development project or company.

A new paper from Yale University and Tsinghua University presents a method for testing the ability of models to tackle “self-invoking code generation” problems, which require reasoning, generating code, and reusing existing code to solve problems.

Generating self-invoking code much more closely resembles realistic programming scenarios and provides insight into the ability of current LLMs to solve real-world coding problems.

Generating self-invoking code

Two popular benchmarks used to assess the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of hand-crafted problems that require the model to write code for simple tasks.

However, these benchmarks only cover a subset of the challenges software developers face in the real world. In practical scenarios, software developers don’t just write new code: they also need to understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and then use one’s own generated code, namely self-invoking code generation, plays an important role in LLMs leveraging their reasoning abilities for code generation, a capability that current benchmarks fail to capture,” the researchers write.

To test the ability of LLMs to generate self-invoking code, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke that solution to solve a more complex problem.

Self-invoking code generation (source: arXiv)

For example, the initial problem might be something simple, like writing a function that replaces all occurrences of a given character in a string with a new character.

The extended problem would be to write a function that modifies occurrences of multiple characters in a string with their given replacements. This would require the model to write a new function that invokes the previous function it generated in the simple problem.
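A minimal Python sketch of such a problem pair might look like the following (the function names and signatures are illustrative, not taken from the benchmark itself):

```python
def replace_char(s: str, old: str, new: str) -> str:
    """Base problem: replace all occurrences of one character in a string."""
    return s.replace(old, new)

def replace_chars(s: str, replacements: dict[str, str]) -> str:
    """Extended (self-invoking) problem: apply several single-character
    replacements by reusing the base solution above."""
    for old, new in replacements.items():
        s = replace_char(s, old, new)
    return s
```

The extended task is only solvable by a model that both produces a correct base solution and then calls it correctly, which is exactly the skill the Pro benchmarks isolate.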

“This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, going beyond the scope of single-problem code generation,” the researchers write.

LLMs perform poorly when generating self-invoking code

Researchers tested HumanEval Pro and MBPP Pro on more than 20 open and private models, including GPT-4o, OpenAI o1-mini, Claude 3.5 Sonnet, as well as the Qwen, DeepSeek, and Codestral series.

Their results show a significant disparity between traditional coding tests and self-invoking code generation tasks. “Although frontier LLMs excel at generating individual code snippets, they often struggle to effectively use their own generated code to solve more complex problems,” the researchers write.

For example, with a single generation (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.
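Here, pass@1 means the model’s first sampled solution must pass all test cases. The standard unbiased pass@k estimator, introduced with the original HumanEval benchmark, can be sketched in a few lines (this is a generic illustration, not code from the paper under discussion):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n total generations (c of which are correct) passes.
    """
    if n - c < k:
        # Fewer incorrect samples than the draw budget: success is certain.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to the plain fraction of correct samples, which is why a single score like 96.2% can be read directly as first-try accuracy.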

Another interesting finding is that while instruction fine-tuning brings significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. The researchers note that “current instruction-based fine-tuning approaches are not effective enough for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train base models for coding and reasoning tasks.

To advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems. It then generates candidate solutions and verifies their correctness by executing the code and running test cases against it. The pipeline minimizes the need for manual code review, helping to produce more examples with less effort.

Automatically generating self-invoking code generation problems (source: arXiv)
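The generate-and-verify loop described above can be sketched roughly as follows. All function names here are placeholders standing in for frontier-LLM calls and a sandboxed test runner; this is an assumption-laden illustration of the pipeline's shape, not the authors' actual implementation:

```python
def build_self_invoking_problem(base_problem, generate_extension,
                                generate_solution, run_tests,
                                max_attempts=5):
    """Extend a base benchmark problem and keep only execution-verified
    solutions; unverifiable problems are flagged for manual review."""
    extended = generate_extension(base_problem)    # LLM drafts a harder variant
    for _ in range(max_attempts):
        candidate = generate_solution(extended)    # LLM drafts a solution
        if run_tests(candidate, extended.tests):   # execute and check correctness
            return extended, candidate             # verified pair joins the benchmark
    return None                                    # needs human inspection
```

Gating each example on actual test execution is what lets the pipeline scale benchmark construction without hand-checking every generated problem.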

A complex landscape

This new family of benchmarks arrives at a time when older coding benchmarks are quickly being saturated by frontier models. Current frontier models such as GPT-4o, o1, and Claude 3.5 Sonnet already achieve very high scores on HumanEval and MBPP, as well as their more advanced versions, HumanEval+ and MBPP+.

At the same time, there are more complex benchmarks such as SWE-Bench, which evaluates models' capabilities on end-to-end software engineering tasks that require a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 is inconsistent on SWE-Bench Verified.

Self-invoking code generation lies somewhere between simple benchmarks and SWE-Bench. It assesses a very specific type of reasoning ability: using existing code within a module to solve complex problems. Self-invoking code tests may prove to be a very practical indicator of the usefulness of LLMs in real-world settings, where human programmers remain in control and AI copilots help them complete specific coding tasks in the software development process.

“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related assessments and to inspire future LLM development by highlighting gaps in current models and encouraging innovation in training methodologies,” write the researchers.
