
When AI reasoning goes wrong: Microsoft Research shows more tokens can mean more problems

MTHANNACH
Last updated: April 16, 2025 4:30 am
MTHANNACH Published April 16, 2025


Large language models (LLMs) are increasingly capable of complex reasoning thanks to “inference-time scaling,” a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods is not universal. Performance gains vary considerably across models, tasks and problem complexities.

The core finding is that simply throwing more compute at a problem during inference does not guarantee better or more efficient results. The findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.

Putting scaling methods to the test

The Microsoft Research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. These included “conventional” models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling: OpenAI’s o1 and o3-mini, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2 Flash Thinking and DeepSeek R1.

They evaluated these models using three distinct inference-time scaling approaches:

  1. Standard chain-of-thought (CoT): The basic method, where the model is prompted to answer step by step.
  2. Parallel scaling: The model generates several independent answers to the same question and uses an aggregator (such as majority vote or selecting the best-scoring answer) to reach a final result.
  3. Sequential scaling: The model generates an answer and uses feedback from a critic (potentially the model itself) to refine the answer in subsequent attempts.
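
The parallel and sequential approaches described above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: `generate` and `critique` are hypothetical callables standing in for model calls, and majority vote is only one of the aggregators the article mentions.

```python
from collections import Counter
from typing import Callable, Optional


def parallel_scale(generate: Callable[[str], str], question: str, n: int) -> str:
    """Sample n independent answers and return the majority-vote winner."""
    answers = [generate(question) for _ in range(n)]
    # most_common(1) yields the most frequent answer; on a tie, the answer
    # that was seen first wins.
    return Counter(answers).most_common(1)[0][0]


def sequential_scale(
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    question: str,
    max_rounds: int,
) -> str:
    """Refine a single answer with critic feedback.

    `critique` (hypothetical) returns None when it has no objection,
    otherwise a feedback string that is passed to the next attempt.
    """
    answer = generate(question, None)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        if feedback is None:
            break
        answer = generate(question, feedback)
    return answer
```

In both cases the token bill grows with the number of calls: parallel scaling pays for `n` full generations, while sequential scaling pays for up to `max_rounds + 1` generations plus the critic calls.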

These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from step-by-step problem solving: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), navigation (Maze) and spatial reasoning (SpatialMap).

Several benchmarks included problems with different difficulty levels, allowing a more nuanced understanding of how scaling behaves as problems become harder.

“The availability of difficulty tags for Omni-MATH, TSP, 3SAT and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling, which is a perspective that is still underexplored,” the researchers wrote in the paper detailing their results.

The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and computational cost (that is, the number of tokens generated). This helps identify how efficiently models achieve their results.
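
To make the idea of an accuracy-versus-cost Pareto frontier concrete, here is a small sketch. It assumes each model is summarized by an average token count and an accuracy; the `Result` type and the numbers in the usage note are illustrative placeholders, not figures from the paper.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    avg_tokens: float  # cost proxy: average tokens generated per query
    accuracy: float    # fraction of correct answers


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep only results that no other result beats on both cost and accuracy."""
    frontier: List[Result] = []
    # Walk from cheapest to most expensive; among equal costs, best accuracy first.
    for r in sorted(results, key=lambda r: (r.avg_tokens, -r.accuracy)):
        # A more expensive result earns its place only if it is strictly
        # more accurate than everything cheaper (exact duplicates are dropped).
        if not frontier or r.accuracy > frontier[-1].accuracy:
            frontier.append(r)
    return frontier
```

For example, with placeholder results `Result("model_a", 1000, 0.60)`, `Result("model_b", 5000, 0.62)`, `Result("model_c", 3000, 0.55)` and `Result("model_d", 2000, 0.70)`, only `model_a` and `model_d` remain on the frontier: the other two spend more tokens than `model_d` for lower accuracy.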

Figure: Pareto frontier of inference-time scaling. Credit: arXiv

They also introduced the “conventional-to-reasoning gap” measure, which compares the best possible performance of a conventional model (using an ideal “best-of-N” selection) against the average performance of a reasoning model, estimating the potential gains achievable through better training or verification techniques.
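
One way to read this measure is sketched below. The exact definition and sign convention are assumptions made for illustration: each model's outcomes are given as one list of booleans per problem (one entry per sampled answer), best-of-N counts a problem as solved if any sample is correct, and the gap is taken as reasoning average minus conventional best-of-N.

```python
from typing import Sequence


def best_of_n_accuracy(outcomes: Sequence[Sequence[bool]]) -> float:
    """Ideal best-of-N: a problem counts as solved if any sample is correct."""
    return sum(any(samples) for samples in outcomes) / len(outcomes)


def average_accuracy(outcomes: Sequence[Sequence[bool]]) -> float:
    """Plain average over every sample of every problem."""
    flat = [ok for samples in outcomes for ok in samples]
    return sum(flat) / len(flat)


def conventional_to_reasoning_gap(
    conventional: Sequence[Sequence[bool]],
    reasoning: Sequence[Sequence[bool]],
) -> float:
    """Assumed convention: positive means the reasoning model's average still
    beats the conventional model's ideal best-of-N."""
    return average_accuracy(reasoning) - best_of_n_accuracy(conventional)
```

Under this reading, a small or negative gap suggests that a conventional model already produces correct answers often enough that a good verifier or selector could close the distance.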

More compute is not always the answer

The study provided several crucial insights that challenge common assumptions about inference-time scaling:

Benefits vary considerably: Although models tuned for reasoning generally outperform conventional ones on these tasks, the degree of improvement varies considerably depending on the domain and the specific task. Gains often diminish as problem complexity increases. For example, performance improvements observed on math problems did not always carry over equally to scientific reasoning or planning tasks.

Token inefficiency is common: The researchers observed high variability in token consumption, even between models reaching similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used more than five times as many tokens as Claude 3.7 Sonnet for roughly comparable average accuracy.

More tokens do not lead to higher accuracy: Contrary to the intuitive idea that longer reasoning chains mean better reasoning, the study found that this is not always true. “Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection,” the paper states. “Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy.”

Cost nondeterminism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate considerably, even when the model consistently provides the correct answer.

Figure: Variance in response length (spikes show smaller variance). Credit: arXiv

The potential of verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a “perfect verifier” (using the best-of-N results).

Conventional models sometimes match reasoning models: By significantly increasing inference calls (up to 50 times more in some experiments), conventional models like GPT-4o could sometimes approach the performance levels of dedicated reasoning models, particularly on less complex tasks. However, these gains diminished rapidly in highly complex settings, indicating that brute-force scaling has its limits.

Figure: GPT-4o inference-time scaling. On some tasks, GPT-4o’s accuracy continues to improve with parallel and sequential scaling. Credit: arXiv

Implications for the enterprise

These findings carry significant weight for developers and enterprise adopters of LLMs. The issue of “cost nondeterminism” is particularly striking and makes budgeting difficult. As the researchers point out, “ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability.”
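
The kind of profiling this implies can be approximated with a few lines of code. The sketch below assumes you have logged the token count of repeated runs of the same prompts; the function name and the data layout (a dict from instance ID to a list of token counts) are illustrative choices, not something the study prescribes.

```python
from statistics import mean, pstdev
from typing import Dict, List


def mean_token_std(runs: Dict[str, List[int]]) -> float:
    """Average, over instances, of the standard deviation of token usage
    across repeated runs of the same instance. Lower means more predictable cost."""
    return mean(pstdev(counts) for counts in runs.values())


# Placeholder logs: token counts from repeated runs of two prompts per model.
steady_model = {"q1": [1000, 1000, 1000], "q2": [2000, 2000, 2000]}
volatile_model = {"q1": [500, 1500], "q2": [1000, 3000]}

print(mean_token_std(steady_model))
print(mean_token_std(volatile_model))
```

With these placeholder logs the first model always spends the same number of tokens on a given prompt, while the second swings by hundreds of tokens between runs, which translates directly into less predictable per-query cost.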

“The profiling we do in [the study] could be useful for developers as a tool to pick which models are less volatile for the same prompt or for different prompts,” Besmira Nushi, senior principal research manager at Microsoft Research, told VentureBeat.

Figure: Models whose distribution peaks in blue on the left consistently generate the same number of tokens for a given task. Credit: arXiv

The study also provides useful insights into the correlation between a model’s accuracy and its response length. For example, the study’s data show that math queries longer than about 11,000 tokens have a very slim chance of being correct, so those generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models allowing these post hoc mitigations also have a cleaner separation between correct and incorrect samples.
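
A length-based cutoff like the one described above could be checked and applied roughly as follows. This is a sketch under assumptions: the sample data format, the function names and the idea of cutting a token stream at a fixed budget are illustrative, and any real threshold would have to be measured per model and task.

```python
from typing import Iterable, List, Optional, Tuple


def accuracy_by_length(
    samples: List[Tuple[int, bool]], cutoff: int
) -> Tuple[Optional[float], Optional[float]]:
    """Split (token_length, is_correct) samples at `cutoff` and return
    (accuracy at or below cutoff, accuracy above cutoff)."""
    def rate(flags: List[bool]) -> Optional[float]:
        return sum(flags) / len(flags) if flags else None

    short = [ok for length, ok in samples if length <= cutoff]
    long_ = [ok for length, ok in samples if length > cutoff]
    return rate(short), rate(long_)


def generate_with_budget(
    token_stream: Iterable[str], max_tokens: int
) -> Tuple[List[str], bool]:
    """Consume a token stream, stopping once the budget is exhausted.
    Returns the tokens kept and whether the generation was cut off."""
    tokens: List[str] = []
    for tok in token_stream:
        if len(tokens) >= max_tokens:
            return tokens, True  # caller may restart, e.g. with critic feedback
        tokens.append(tok)
    return tokens, False
```

The first function helps decide whether a cutoff is worthwhile for a given model; the second shows where such a cutoff would sit in a generation loop.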

“Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost nondeterminism, and we expect a lot of this to happen as the methods get more mature,” said Nushi. “Alongside cost nondeterminism, accuracy nondeterminism also applies.”

Another important finding is the consistent performance boost from perfect verifiers, which highlights a critical area for future work: building robust and broadly applicable verification mechanisms.

“The availability of stronger verifiers can have different types of impact,” said Nushi, such as improving foundational training methods for reasoning. “If used efficiently, these can also shorten the reasoning traces.”

Strong verifiers can also become a central part of enterprise agentic AI solutions. Many enterprise stakeholders already have such verifiers in place, such as SAT solvers and logistic validity checkers, which may need to be repurposed for more agentic solutions.
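
As a small illustration of what pairing a model with an existing verifier can look like, the sketch below checks model-proposed assignments against a 3SAT formula. The clause encoding (tuples of signed integers, as in the DIMACS convention) and the helper names are assumptions made for this example; a production setup would more likely call an actual SAT solver or domain-specific checker.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Clause = Tuple[int, ...]        # e.g. (1, -2, 3) means x1 OR NOT x2 OR x3
Assignment = Dict[int, bool]    # variable number -> truth value


def satisfies(clauses: List[Clause], assignment: Assignment) -> bool:
    """Return True if every clause has at least one true literal."""
    def literal_true(lit: int) -> bool:
        var = abs(lit)
        # An unassigned variable cannot satisfy a literal.
        return var in assignment and assignment[var] == (lit > 0)

    return all(any(literal_true(lit) for lit in clause) for clause in clauses)


def first_verified(
    clauses: List[Clause], candidates: Iterable[Assignment]
) -> Optional[Assignment]:
    """Return the first candidate (e.g. a model-generated answer) that passes
    the verifier, or None if none does."""
    for candidate in candidates:
        if satisfies(clauses, candidate):
            return candidate
    return None


formula = [(1, -2, 3), (-1, 2, 3), (-3, 2, 1)]
proposals = [
    {1: True, 2: False, 3: False},  # fails the second clause
    {1: True, 2: True, 3: False},   # satisfies all clauses
]
print(first_verified(formula, proposals))
```

Because checking an assignment is cheap compared with generating one, a verifier of this kind can filter many sampled answers without adding meaningful cost.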

“The questions for the future are how such existing techniques can be combined with AI-driven interfaces and what is the language that connects the two,” said Nushi. “The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way, they will want to use a natural language interface and expect the solutions in a similar format or in a final action (for example, propose a meeting invite).”
