Large language models (LLMs) are increasingly capable of complex reasoning through "inference-time scaling," a set of techniques that allocate more compute resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods isn't universal. Performance boosts vary significantly across different models, tasks and problem complexities.
The core finding is that simply throwing more compute at a problem during inference doesn't guarantee better or more efficient results. The findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.
Putting scaling methods to the test
The Microsoft research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. This included both "conventional" models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling. These included OpenAI's o1 and o3-mini, Anthropic's Claude 3.7 Sonnet, Google's Gemini 2 Flash Thinking, and DeepSeek R1.
They evaluated these models using three distinct inference-time scaling approaches:
- Standard Chain-of-Thought (CoT): The basic method where the model is prompted to answer step by step.
- Parallel scaling: The model generates multiple independent answers to the same question and uses an aggregator (such as majority vote or selecting the best-scoring answer) to arrive at a final result.
- Sequential scaling: The model generates an answer and uses feedback from a critic (potentially the model itself) to refine the answer in subsequent attempts. (Both scaling loops are sketched in the code example after this list.)
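To make the two scaling loops concrete, here is a minimal Python sketch of parallel and sequential scaling. It is illustrative only, not the study's code: `generate` and `critique` are hypothetical placeholders standing in for whatever model API is in use.

```python
from collections import Counter

# Illustrative sketch of the two scaling loops described above.
# `generate` and `critique` are hypothetical placeholders, not the study's code.

def generate(prompt: str) -> str:
    """Call an LLM and return its answer (placeholder)."""
    raise NotImplementedError

def critique(prompt: str, answer: str) -> str:
    """Ask a critic (possibly the same model) for feedback (placeholder)."""
    raise NotImplementedError

def parallel_scale(prompt: str, n: int = 5) -> str:
    """Sample n independent answers and aggregate by majority vote."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def sequential_scale(prompt: str, rounds: int = 3) -> str:
    """Refine a single answer over several rounds of critic feedback."""
    answer = generate(prompt)
    for _ in range(rounds):
        feedback = critique(prompt, answer)
        answer = generate(
            f"{prompt}\n\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nRevise your answer."
        )
    return answer
```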
These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from step-by-step problem solving: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), navigation (Maze) and spatial reasoning (SpatialMap).
Several benchmarks included problems with varying difficulty levels, enabling a more nuanced understanding of how scaling behaves as problems get harder.
"The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling, which is a perspective that is still underexplored," the researchers write in the paper detailing their findings.
The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and computational cost (i.e., the number of tokens generated). This helps identify how efficiently models achieve their results.
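As a rough illustration of this idea (not the study's evaluation code), a Pareto frontier over (token cost, accuracy) points keeps only the models that no other model beats on both axes. The model names and numbers below are made up for the example.

```python
# Illustrative sketch of a Pareto frontier over (token cost, accuracy) pairs.
def pareto_frontier(results: dict[str, tuple[float, float]]) -> list[str]:
    """Keep models that no other model beats on both lower cost and higher accuracy."""
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost, other_acc) != (cost, acc)
            for other, (other_cost, other_acc) in results.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (average tokens, accuracy) numbers, purely for illustration:
results = {
    "model_a": (3_000, 0.62),
    "model_b": (15_000, 0.64),   # slightly more accurate, but 5x the tokens
    "model_c": (4_000, 0.70),
}
print(pareto_frontier(results))  # model_b is dominated by model_c
```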

They also introduced the "conventional-to-reasoning gap" measure, which compares the best possible performance of a conventional model (using an ideal "best-of-N" selection) against the average performance of a reasoning model, estimating the potential gains achievable through better training or verification techniques.
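A minimal sketch of how such a gap could be computed, assuming per-question correctness records for N sampled answers; the data and helper names here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the gap measure described above: the best-of-N accuracy a
# conventional model could reach with an ideal selector, minus the average accuracy
# of a reasoning model.

def best_of_n_accuracy(samples_per_question: list[list[bool]]) -> float:
    """Each inner list holds correctness of N sampled answers for one question.
    With an ideal verifier, the question counts as solved if any sample is correct."""
    return sum(any(s) for s in samples_per_question) / len(samples_per_question)

def average_accuracy(samples_per_question: list[list[bool]]) -> float:
    """Average per-question accuracy across all sampled answers."""
    return sum(sum(s) / len(s) for s in samples_per_question) / len(samples_per_question)

# Hypothetical correctness data (True = correct) for 3 questions, 4 samples each.
conventional = [[False, True, False, False], [True, True, False, True], [False, False, False, False]]
reasoning = [[True, True, True, False], [True, True, True, True], [False, True, False, False]]

gap = best_of_n_accuracy(conventional) - average_accuracy(reasoning)
print(f"conventional-to-reasoning gap: {gap:+.2f}")
```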
More compute isn't always the answer
The study delivered several crucial insights that challenge common assumptions about inference-time scaling:
Benefits vary significantly: While models tuned for reasoning generally outperform conventional ones on these tasks, the degree of improvement varies greatly depending on the specific domain and task. Gains often diminish as problem complexity increases. For example, performance improvements seen on math problems were not always mirrored equally in scientific reasoning or planning tasks.
Token inefficiency is rife: The researchers observed high variability in token consumption, even between models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used over five times more tokens than Claude 3.7 Sonnet for roughly comparable average accuracy.
More tokens do not lead to higher accuracy: Contrary to the intuitive idea that longer reasoning chains mean better reasoning, the study found this isn't always true. "Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection," the paper states. "Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy."
Cost non-determinism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate significantly, even when the model consistently provides the correct answer.

The potential in verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a "perfect verifier" (using the best-of-N results).
Conventional models sometimes match reasoning models: By significantly increasing inference calls (up to 50x more in some experiments), conventional models like GPT-4o could sometimes approach the performance levels of dedicated reasoning models, particularly on less complex tasks. However, these gains diminished rapidly in highly complex settings, indicating that brute-force scaling has its limits.

Implications for the enterprise
These findings carry significant weight for developers and enterprise adopters of LLMs. The issue of "cost non-determinism" is particularly stark and makes budgeting difficult. As the researchers point out, "Ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability."
"The profiling we do in [the study] could be useful for developers as a tool to pick which models are less volatile for the same prompt or for different prompts," Besmira Nushi, senior principal research manager at Microsoft Research, told VentureBeat.
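In practice, that kind of volatility profiling can be approximated by running the same prompt repeatedly and comparing the spread of generated tokens across models. A minimal sketch follows; the `run_model` helper is a hypothetical placeholder, not part of the study.

```python
import statistics

# Illustrative sketch of token-volatility profiling: run the same prompt several
# times, record tokens generated, and compare mean and standard deviation.

def run_model(model: str, prompt: str) -> int:
    """Send the prompt to a model and return the number of tokens it generated (placeholder)."""
    raise NotImplementedError

def profile_token_volatility(model: str, prompt: str, repeats: int = 10) -> tuple[float, float]:
    """Return (mean, standard deviation) of generated tokens over repeated identical calls."""
    counts = [run_model(model, prompt) for _ in range(repeats)]
    return statistics.mean(counts), statistics.stdev(counts)

# A lower standard deviation means more predictable per-query cost.
```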

The study also provides useful insights into the correlation between a model's accuracy and its response length. For example, the data shows that math queries above roughly 11,000 tokens in length have a very slim chance of being correct, and such generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models that allow these post-hoc mitigations also have a cleaner separation between correct and incorrect samples.
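A minimal sketch of what such a post-hoc mitigation might look like, assuming a hypothetical API that reports whether the token cap was hit; the ~11,000-token cutoff is the figure cited for math queries, and everything else here is an illustrative assumption.

```python
# Illustrative sketch of a post-hoc mitigation along the lines described above:
# cap generation length, and if the cap is hit, restart with critic feedback rather
# than letting the model keep going. Helpers are hypothetical placeholders.

MAX_TOKENS = 11_000  # cutoff cited for math queries in the study

def generate_with_limit(prompt: str, max_tokens: int) -> tuple[str, bool]:
    """Return (answer, truncated), where truncated is True if the token cap was hit (placeholder)."""
    raise NotImplementedError

def critique(prompt: str, answer: str) -> str:
    """Return critic feedback on a (possibly truncated) answer (placeholder)."""
    raise NotImplementedError

def answer_with_mitigation(prompt: str, retries: int = 2) -> str:
    answer, truncated = generate_with_limit(prompt, MAX_TOKENS)
    for _ in range(retries):
        if not truncated:
            return answer
        feedback = critique(prompt, answer)
        answer, truncated = generate_with_limit(
            f"{prompt}\n\nFeedback on previous attempt: {feedback}", MAX_TOKENS
        )
    return answer
```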

"Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost non-determinism, and we expect a large amount of this to happen as the methods become more mature," Nushi said. "In addition to cost non-determinism, accuracy non-determinism also applies."
Another important finding is the consistent performance boost from perfect verifiers, which highlights a critical area for future work: building robust and broadly applicable verification mechanisms.
"The availability of stronger verifiers can have different types of impact," Nushi said, such as improving foundational training methods for reasoning. "If used efficiently, these can also shorten the reasoning traces."
Strong verifiers can also become a central part of enterprise agentic AI solutions. Many enterprise stakeholders already have such verifiers in place, which may need to be repurposed for more agentic solutions, such as SAT solvers, logistics validity checkers, etc.
"The questions for the future are how such existing techniques can be combined with AI-driven interfaces and what is the language that connects the two," Nushi said. "The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way, they will want to use a natural language interface and expect the solutions in a similar format or in a final action (for example, propose a meeting invite)."