DeepSeek AI, a Chinese research lab recognized for its powerful open-source language models such as DeepSeek-R1, has introduced a significant advance in reward modeling for large language models (LLMs).
Their new technique, Self-Principled Critique Tuning (SPCT), aims to create generalist and scalable reward models (RMs). This could lead to AI applications that are more capable on open-ended tasks and in domains where current models cannot capture the nuances and complexities of their environment and users.
The crucial role and the current limits of reward models
Reinforcement learning (RL) has become a cornerstone in the development of advanced LLMs. In RL, models are fine-tuned based on feedback signals that indicate the quality of their responses.
Reward models are the critical component that provides these signals. Essentially, an RM acts as a judge, evaluating LLM outputs and assigning a score or "reward" that guides the RL process and teaches the LLM to produce more useful responses.
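To make that loop concrete, here is a minimal sketch in Python of how a reward model plugs into RL fine-tuning. The `policy_llm` and `reward_model` objects and their methods are hypothetical placeholders for illustration, not DeepSeek's actual code.

```python
# Minimal sketch (hypothetical interfaces, not DeepSeek's implementation):
# a reward model acting as a judge inside an RL fine-tuning loop.

def rl_step(policy_llm, reward_model, prompt):
    # The policy LLM proposes an answer to the prompt.
    response = policy_llm.generate(prompt)

    # The reward model judges the (prompt, response) pair and returns a scalar
    # score; this is the feedback signal the RL algorithm optimizes.
    reward = reward_model.score(prompt, response)

    # An RL algorithm such as PPO or GRPO would then use this reward to update
    # the policy so that higher-scoring responses become more likely.
    policy_llm.update(prompt, response, reward)
    return reward
```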
However, current RMs often face limitations. They typically excel in narrow domains with clear rules or easily verifiable answers. For example, state-of-the-art reasoning models such as DeepSeek-R1 underwent an RL phase in which they were trained on math and coding problems where the ground truth is clearly defined.
Creating a reward model for complex, open-ended, or subjective queries in general domains, however, remains a major hurdle. In the paper describing their new technique, DeepSeek AI researchers write, "Generalist RM requires generating high-quality rewards beyond specific domains, where the reward criteria are more diverse and complex, and there is often no explicit reference or ground truth."
They highlight four key challenges in creating generalist RMs capable of handling broader tasks:
- Input flexibility: The RM must handle various input types and be able to evaluate one or more responses simultaneously.
- Accuracy: It must generate accurate reward signals across domains where the criteria are complex and the ground truth is often unavailable.
- Inference-time scalability: The RM must produce higher-quality rewards when more compute is allocated during inference.
- Learning scalable behaviors: For RMs to scale effectively at inference time, they must learn behaviors that allow performance to improve as more compute is used.
Reward models can be broadly classified by their "reward generation paradigm" (for example, scalar RMs output a single score, while generative RMs produce textual critiques) and their "scoring pattern" (for example, pointwise scoring assigns an individual score to each response, while pairwise selects the better of two responses). These design choices affect a model's suitability for general tasks, particularly its input flexibility and its potential for inference-time scaling.
For instance, simple scalar RMs struggle with inference-time scaling because they will generate the same score repeatedly, while pairwise RMs cannot easily evaluate single responses.
The researchers propose that "pointwise generative reward modeling" (GRM), where the model generates textual critiques and derives scores from them, can offer the flexibility and scalability required for general tasks.
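A rough sketch of the difference, assuming hypothetical `rm` and `grm` objects: a scalar RM maps each response directly to one fixed number, while a pointwise GRM writes a critique and the numeric scores are parsed out of its text, which is what makes repeated sampling meaningful.

```python
import re

# Illustrative sketch (assumed interfaces, not the paper's code) contrasting the
# two reward-generation paradigms described above.

def scalar_pointwise_rm(rm, prompt, response):
    # A scalar RM maps (prompt, response) directly to one number. Re-running it
    # yields the same score, so extra inference compute buys nothing.
    return rm.score(prompt, response)

def pointwise_generative_rm(grm, prompt, responses):
    # A pointwise GRM writes a textual critique of one or more responses and
    # embeds a numeric score per response in the text, e.g. "... Score: 8".
    critique = grm.generate_critique(prompt, responses)
    scores = [int(m) for m in re.findall(r"Score:\s*(\d+)", critique)]
    return critique, scores
```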
The DeepSeek team ran preliminary experiments on models such as GPT-4o and Gemma-2-27B and found that "certain principles could guide reward generation within proper criteria for GRMs, improving the quality of rewards, which inspired us that inference-time scalability of RM might be achieved by scaling the generation of high-quality principles and accurate critiques."
Training RMs to generate their own principles
Based on these findings, the researchers developed Self-Principled Critique Tuning (SPCT), which trains the GRM to generate principles and critiques dynamically based on queries and responses.
The researchers propose that principles should be "part of reward generation instead of a preprocessing step." This way, the GRMs could generate principles on the fly, tailored to the task they are evaluating, and then generate critiques based on those principles.
"This shift enables [the] principles to be generated based on the input query and responses, adaptively aligning [the] reward generation process, and the quality and granularity of the principles and corresponding critiques could be further improved with post-training on the GRM," the researchers write.

SPCT involves two main phases:
- Rejective fine-tuning: This phase trains the GRM to generate principles and critiques for various input types in the correct format. The model generates principles, critiques, and rewards for given queries and responses. Trajectories (generation attempts) are accepted only if the predicted reward aligns with the ground truth (correctly identifying the best response, for example) and rejected otherwise. This process is repeated, and the model is fine-tuned on the filtered examples to improve its principle and critique generation capabilities.
- Rule-based RL: In this phase, the model is further fine-tuned through outcome-based reinforcement learning. The GRM generates principles and critiques for each query, and the reward signals are calculated based on simple accuracy rules (for example, did it pick the known best response?). The model is then updated. This encourages the GRM to learn to generate effective principles and accurate critiques dynamically and in a scalable way (see the sketch after this list).
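A simplified sketch of the two phases, with hypothetical helper objects (`grm`, `example`) standing in for the actual training code, might look like this:

```python
# Sketch of the two SPCT phases as described above (hypothetical interfaces,
# not the authors' implementation).

def rejective_fine_tuning_data(grm, dataset, n_samples=4):
    """Phase 1: keep only trajectories whose predicted reward matches ground truth."""
    accepted = []
    for example in dataset:  # each example: prompt, candidate responses, index of the best one
        for _ in range(n_samples):
            principles, critique, scores = grm.generate(example.prompt, example.responses)
            predicted_best = max(range(len(scores)), key=lambda i: scores[i])
            if predicted_best == example.best_index:  # reward agrees with ground truth
                accepted.append((example, principles, critique, scores))
    return accepted  # the GRM is then fine-tuned on these filtered trajectories

def rule_based_reward(scores, best_index):
    """Phase 2: simple accuracy rule used as the RL signal for the GRM itself."""
    predicted_best = max(range(len(scores)), key=lambda i: scores[i])
    return 1.0 if predicted_best == best_index else -1.0
```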
"By leveraging rule-based online RL, SPCT enables GRMs to learn to adaptively pose principles and critiques based on the input query and responses, leading to better outcome rewards in general domains," the researchers write.
To address the challenge of inference-time scaling (getting better results with more compute), the researchers run the GRM multiple times for the same input, generating different sets of principles and critiques. The final reward is determined by voting (aggregating the sample scores). This allows the model to consider a broader range of perspectives, leading to potentially more accurate and nuanced final judgments when it is given more resources.
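In rough pseudocode terms, the voting procedure could look like the following sketch (the `grm.generate` interface is an assumption for illustration, not the paper's code):

```python
from collections import defaultdict

# Sketch of inference-time scaling by voting: the GRM is sampled k times and
# the per-response scores are summed across samples.

def vote_rewards(grm, prompt, responses, k=8):
    totals = defaultdict(float)
    for _ in range(k):
        # Each run samples a fresh set of principles and critiques, so the
        # per-response scores can differ from run to run.
        _principles, _critique, scores = grm.generate(prompt, responses)
        for i, s in enumerate(scores):
            totals[i] += s
    # Summing (voting over) the sampled scores lets the model weigh a wider
    # set of perspectives before picking the best response.
    best_index = max(totals, key=totals.get)
    return best_index, totals
```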
However, some of the generated principles and critiques may be low-quality or biased due to model limitations or randomness. To address this, the researchers introduced a "meta RM": a separate, lightweight scalar RM trained specifically to predict whether a principle/critique generated by the primary GRM is likely to lead to a correct final reward.
During inference, the meta RM evaluates the generated samples and filters out low-quality judgments before the final vote, further improving scaling performance.
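Extending the earlier voting sketch, a meta-RM-guided vote might look roughly like this (again with hypothetical `grm` and `meta_rm` interfaces):

```python
# Sketch of meta-RM-guided voting: the lightweight meta RM rates each sampled
# judgment, and only the most trustworthy samples take part in the vote.

def meta_rm_guided_vote(grm, meta_rm, prompt, responses, k=8, keep=4):
    samples = [grm.generate(prompt, responses) for _ in range(k)]

    # The meta RM predicts how likely each principle/critique sample is to
    # yield a correct reward; keep only the top-rated samples.
    ranked = sorted(samples,
                    key=lambda s: meta_rm.score(prompt, responses, s),
                    reverse=True)[:keep]

    # Vote only over the filtered, higher-quality judgments.
    totals = [0.0] * len(responses)
    for _principles, _critique, scores in ranked:
        for i, s in enumerate(scores):
            totals[i] += s
    return max(range(len(responses)), key=lambda i: totals[i])
```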
Putting SPCT into practice with DeepSeek-GRM
The researchers applied SPCT to Gemma-2-27B, Google's open-weight model, creating DeepSeek-GRM-27B. They evaluated it against several strong baseline RMs (including LLM-as-a-judge, scalar RMs, and semi-scalar RMs) and public models (such as GPT-4o and Nemotron-4-340B-Reward) across several benchmarks.
They found that DeepSeek-GRM-27B outperformed the baseline methods trained on the same data. SPCT significantly improved the quality and, crucially, the inference-time scalability compared with standard fine-tuning.

When scaled at inference time by generating more samples, DeepSeek-GRM-27B's performance increased substantially, even surpassing much larger models such as Nemotron-4-340B-Reward and GPT-4o. The meta RM improved scaling further, achieving the best results by filtering out low-quality judgments.
"With larger-scale sampling, DeepSeek-GRM could judge more accurately upon principles with higher diversity, and output rewards with finer granularity," the researchers write.
Interestingly, SPCT showed less bias across different domains compared to scalar RMs, which often performed well on verifiable tasks but poorly elsewhere.
Implications for the enterprise
The development of more generalist and scalable reward models is promising for enterprise AI applications. Potential areas that can benefit from generalist RMs include creative tasks and applications where the model must adapt to dynamic environments, such as evolving customer preferences.
Despite the strong results, DeepSeek-GRM still lags behind specialized scalar RMs on purely verifiable tasks, where generating explicit reasoning can be less efficient than direct scoring. Efficiency also remains a challenge compared with non-generative RMs.
The DeepSeek team suggests that future work will focus on efficiency improvements and deeper integration. As they conclude, "future directions could include integrating GRMs into online RL pipelines as versatile reward systems, exploring inference-time co-scaling with policy models, or serving as robust offline evaluators for foundation models."