Chain-of-thought (CoT) reasoning – the process by which models break problems into manageable “thoughts” before deducing answers – has become an integral part of the latest generation of large language models (LLMs).
However, the inference costs of reasoning models can quickly add up because the models generate excess CoT tokens. In a new paper, researchers at Carnegie Mellon University propose an LLM training technique that gives developers more control over the length of the CoT.
Called length controlled policy optimization (LCPO), the technique conditions the model to provide correct answers while keeping its “thoughts” within a predetermined token budget. Experiments show that models trained with LCPO offer a smooth tradeoff between accuracy and cost and can surprisingly outperform larger models at equal reasoning lengths. LCPO could help considerably reduce inference costs in enterprise applications by saving thousands of tokens in each round of conversation with an LLM.
Better LLM performance comes with longer CoTs
Reasoning models such as OpenAI o1 and DeepSeek-R1 are trained through reinforcement learning (RL) to use test-time scaling and generate CoT traces before producing a response. Empirical evidence shows that when models “think” longer, they tend to perform better on reasoning tasks.
For example, R1 was initially trained with pure RL, without human-labeled examples. One of the insights was that as the model’s performance improved, it also learned to generate longer CoT traces.
Although long CoT chains generally lead to more accurate answers, they also create a compute bottleneck when deploying reasoning models at scale. There is currently very little control over the test-time compute budget, and sequences can easily stretch to tens of thousands of tokens without providing significant gains. There have been some efforts to control the length of reasoning chains, but they usually degrade model performance.
Length controlled policy optimization (LCPO), explained
The classic RL method trains LLMs only to reach the right answer. LCPO modifies this paradigm by introducing two training objectives: 1) get the correct result and 2) keep the CoT within a specified token length. Therefore, if the model produces the right answer but generates too many CoT tokens, it receives a penalty and is forced to come up with a reasoning chain that reaches the same answer within a smaller token budget.
“LCPO-trained models learn to satisfy length constraints while optimizing reasoning performance, rather than relying on hand-designed heuristics,” the researchers write.
They propose two flavors of LCPO: (1) LCPO-Exact, which requires the generated reasoning to be exactly equal to the target length, and (2) LCPO-Max, which requires the output to be no longer than the target length.
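To make the idea concrete, here is a minimal sketch of how a reward signal could combine answer correctness with a length term for the two variants. This is not the authors’ code; the function name, the penalty coefficient `alpha`, and the linear penalty form are illustrative assumptions.

```python
def reward_lcpo(is_correct: bool, num_cot_tokens: int, target_length: int,
                variant: str = "exact", alpha: float = 0.001) -> float:
    """Score one sampled response during RL fine-tuning (illustrative sketch).

    variant="exact": penalize any deviation from the target length (LCPO-Exact style).
    variant="max":   penalize only tokens beyond the target length (LCPO-Max style).
    """
    correctness = 1.0 if is_correct else 0.0

    if variant == "exact":
        # Deviating from the target in either direction reduces the reward.
        length_penalty = alpha * abs(num_cot_tokens - target_length)
    else:
        # Only exceeding the budget is penalized; shorter outputs are fine.
        length_penalty = alpha * max(0, num_cot_tokens - target_length)

    return correctness - length_penalty


# Example: a correct answer that overshoots a 1,000-token budget by 500 tokens
print(reward_lcpo(True, 1500, 1000, variant="max"))    # 1.0 - 0.5 = 0.5
print(reward_lcpo(True, 1500, 1000, variant="exact"))  # also 0.5 here
```

Under a scheme like this, a correct but verbose answer scores lower than a correct answer that respects the budget, which is what pushes the model toward shorter reasoning chains.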
To test the technique, the researchers fine-tuned a 1.5B-parameter reasoning model (a distilled version of DeepSeek-R1) with the two proposed LCPO schemes to create the L1-Max and L1-Exact models. Training was based on mathematical problems with distinct, verifiable results. However, the evaluation included math problems as well as out-of-distribution tasks such as the Massive Multitask Language Understanding benchmark (MMLU) and the Graduate-Level Google-Proof Q&A benchmark (GPQA).
Their results show that L1 models can precisely balance token budget and reasoning performance, smoothly interpolating between short, efficient reasoning and longer, more accurate reasoning by prompting the model with different length constraints, as sketched below. Importantly, on some tasks, the L1 models can match the performance of the original reasoning model at a lower token budget.
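Here is a hedged sketch of how a per-query length constraint could be passed to an L1-style model at inference time. The exact instruction wording and the `build_prompt` helper are assumptions for illustration, not taken from the paper.

```python
def build_prompt(question: str, token_budget: int) -> str:
    # The target length is stated directly in the prompt, so the same trained
    # model can be steered to shorter or longer reasoning without retraining.
    return f"{question}\n\nThink for a maximum of {token_budget} tokens."


# Sweeping the budget trades inference cost against accuracy on the same question.
for budget in (512, 1024, 4096):
    prompt = build_prompt("What is the sum of the first 100 positive integers?", budget)
    # response = model.generate(prompt, max_new_tokens=budget)  # hypothetical call
    print(prompt.splitlines()[-1])
```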
Compared to S1 – the only other method that constrains the length of CoTs – L1 models show up to 150% performance gains across different token budgets.
“This substantial difference can be attributed to two key factors,” the researchers write. “(1) L1 intelligently adapts its CoT to fit within specified length constraints without disrupting the reasoning process, while S1 often truncates mid-reasoning; and (2) L1 is explicitly trained to generate high-quality reasoning chains of varying lengths, effectively distilling reasoning patterns from longer chains to shorter ones.”
L1 also outperforms its non-reasoning counterpart by 5% and GPT-4o by 2% at equal generation lengths. “To the best of our knowledge, this is the first demonstration that a 1.5B model can outperform frontier models such as GPT-4o, despite using the same generation length,” the researchers write.
Interestingly, the model’s CoT shows that it learns to adjust its reasoning process based on its token budget. For example, at longer budgets, the model is more likely to generate tokens associated with self-correction and verification (i.e., “but” and “wait”) and with drawing conclusions (“therefore” and “so”).
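One simple way to observe this kind of behavior is to count marker words in generated traces at different budgets. The snippet below is an illustrative analysis, not from the paper; `traces_by_budget` is a hypothetical mapping from a token budget to sampled reasoning traces.

```python
import re
from collections import Counter

MARKERS = ("but", "wait", "therefore", "so")

def marker_counts(traces: list[str]) -> Counter:
    # Count how often self-correction and conclusion markers appear.
    counts = Counter()
    for trace in traces:
        words = re.findall(r"[a-z']+", trace.lower())
        counts.update(w for w in words if w in MARKERS)
    return counts

traces_by_budget = {
    512: ["... so the answer is 12."],
    4096: ["... but wait, let me verify that step. ... therefore the answer is 12."],
}
for budget, traces in traces_by_budget.items():
    print(budget, marker_counts(traces))
```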

Beyond improved length control in the standard mathematical reasoning setting, the L1 models generalize surprisingly well to out-of-distribution tasks, including GPQA and MMLU.
This new line of research on models that can adjust their reasoning budget could have important uses for real-world applications, giving companies the ability to scale reasoning models without runaway costs. It is a powerful alternative to simply deploying larger, more expensive models – and could be a crucial factor in making AI more economically viable for high-volume, real-world applications.
The researchers have open-sourced the LCPO code and the weights for the L1 models.