As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are becoming less and less useful.
Even though many LLMs achieve similarly high scores on these benchmarks, it can be difficult to tell which model is the right fit for a specific software development project or organization.
A new paper from Yale University and Tsinghua University presents a new way to test the ability of models to tackle “self-invoking code generation” problems, which require reasoning, generating code, and reusing existing code to solve problems.
Generating self-invoking code much more closely resembles realistic programming scenarios and provides insight into the ability of current LLMs to solve real-world coding problems.
Generating self-invoking code
Two popular benchmarks used to assess the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of hand-crafted problems that require the model to write code for simple tasks.
However, these benchmarks only cover a subset of the challenges software developers face in the real world. In practical scenarios, software developers don’t just write new code: they also need to understand and reuse existing code and create reusable components to solve complex problems.
“The ability to understand and then use one’s own generated code, namely self-invoking code generation, plays an important role for LLMs to leverage their reasoning abilities for code generation that current benchmarks fail to capture,” the researchers write.
To test the ability of LLMs to generate self-invoking code, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke that solution to solve a more complex problem.
For example, the initial problem might be something simple, like writing a function that replaces all occurrences of a given character in a string with a new character.
The extended problem would be to write a function that replaces occurrences of multiple characters in a string with their given replacements. Solving it requires the model to write a new function that invokes the function it generated for the simple problem.
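For a concrete picture, here is a minimal Python sketch of such a problem pair. The function names, signatures and test value are illustrative assumptions, not taken from the actual HumanEval Pro or MBPP Pro datasets.

```python
# Base problem: replace every occurrence of one character in a string.
def replace_char(s: str, old: str, new: str) -> str:
    return s.replace(old, new)


# Self-invoking extension: apply a whole set of replacements by reusing
# the base function, applying the replacements one after another.
def replace_chars(s: str, replacements: dict[str, str]) -> str:
    for old, new in replacements.items():
        s = replace_char(s, old, new)
    return s


# Hypothetical test case of the kind a benchmark would run against a solution.
assert replace_chars("banana", {"a": "o", "n": "m"}) == "bomomo"
```

The point is that the model must both solve the simple sub-problem and recognize that its own solution can be invoked from the harder one, rather than rewriting the logic from scratch.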
“This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, going beyond the scope of single-issue code generation,” the researchers write.
LLMs perform poorly when generating self-invoking code
The researchers evaluated more than 20 open-weight and proprietary models on HumanEval Pro and MBPP Pro, including GPT-4o, OpenAI o1-mini and Claude 3.5 Sonnet, as well as the Qwen, DeepSeek and Codestral series.
Their results show a significant disparity between traditional coding tests and self-invoking code generation tasks. “Although frontier LLMs excel at generating individual code snippets, they often struggle to effectively use their own generated code to solve more complex problems,” the researchers write.
For example, with a single generation (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.
Another interesting finding is that while instruction fine-tuning brings significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. The researchers note that “current instruction-based fine-tuning approaches are not effective enough for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train base models for coding and reasoning tasks.
To advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems. It then generates candidate solutions and verifies their correctness by executing the code and running test cases against it. The pipeline minimizes the need for manual code review, helping to generate more examples with less effort.
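The verification step can be pictured with a short sketch: run an LLM-generated candidate solution together with its test cases and keep the example only if everything passes. This is a simplified, hypothetical illustration of that idea, not the authors’ actual pipeline code; the function name and parameters are assumptions.

```python
import subprocess
import tempfile


def candidate_passes_tests(candidate_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run a generated solution together with its assert-based tests.

    Simplified sketch of the verification idea; a real pipeline would add
    sandboxing, richer error reporting and per-test results.
    """
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        # A non-zero exit code means a failed assertion or a crash.
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```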
A complex landscape
This new family of benchmarks comes at a time when older coding benchmarks are quickly being saturated by frontier models. Current frontier models such as GPT-4o, o1 and Claude 3.5 Sonnet already achieve very high scores on HumanEval and MBPP, as well as on their more advanced versions, HumanEval+ and MBPP+.
At the same time, there are more complex benchmarks, such as SWE-Bench, which evaluate models on end-to-end software engineering tasks that require a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 is inconsistent on SWE-Bench Verified.
Self-invoking code generation sits somewhere between simple benchmarks and SWE-Bench. It assesses a very specific type of reasoning ability: using existing code within a module to solve complex problems. Testing self-invoking code may prove to be a very practical indicator of the usefulness of LLMs in real-world settings, where human programmers remain in control and AI copilots help them complete specific coding tasks in the software development process.
“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related assessments and to inspire future LLM development by highlighting gaps in current models and encouraging innovation in training methodologies,” write the researchers.