Language models can generalize better when left to create their own solutions, a new study by the University of Hong Kong and the University of California, Berkeley, shows. The findings, which apply to both large language models (LLMs) and vision language models (VLMs), challenge one of the main beliefs of the LLM community: that models require hand-labeled training examples. In fact, the researchers show that training models on too many hand-crafted examples can have adverse effects on the model's ability to generalize to unseen data.
SFT vs RL in model training
For a long time, supervised fine-tuning (SFT) has been the gold standard for training LLMs and VLMs. Once a model is pre-trained on raw text and image data, companies and AI labs usually fine-tune it on a large dataset of hand-crafted examples in question/answer or request/response format. After SFT, the model can go through additional training stages, such as reinforcement learning from human feedback (RLHF), where the model tries to learn implicit human preferences based on signals such as answer rankings or liking/disliking the model's responses.
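For readers who want to picture the mechanics, the following is a minimal sketch of what an SFT update boils down to: standard next-token prediction on hand-written prompt/response pairs. The model name and the single training example are illustrative placeholders, not the setup used in the study.

```python
# Minimal sketch of supervised fine-tuning (SFT) on prompt/response pairs.
# "gpt2" is a small stand-in model; the dataset here is a toy placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal language model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hand-crafted examples in question/answer format (the slow, expensive part).
sft_data = [
    {"prompt": "Q: What is 7 * 8?\nA:", "response": " 56"},
]

model.train()
for example in sft_data:
    # The model is trained to reproduce the human-written response token by token.
    # (In practice the prompt tokens are usually masked out of the loss.)
    text = example["prompt"] + example["response"]
    inputs = tokenizer(text, return_tensors="pt")
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```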
SFT is useful for steering a model's behavior toward the kinds of tasks its creators designed it for. But gathering the data is a slow and expensive process, which is a bottleneck for many companies and labs.
Recent developments in LLMs have sparked interest in pure reinforcement learning (RL) approaches, where the model is given a task and left to learn it on its own, without hand-crafted examples. The most notable example is DeepSeek-R1, the OpenAI o1 competitor that mostly used reinforcement learning to learn complex reasoning tasks.
Generalization vs memorization
One of the key problems of machine learning (ML) systems is overfitting, where the model performs well on its training data but fails to generalize to unseen examples. During training, the model gives the false impression of having learned the task, while in practice it has merely memorized its training examples. In large and complex AI models, separating generalization from memorization can be difficult.
The new study focuses on the generalization abilities of RL and SFT training in textual and visual reasoning tasks. For textual reasoning, an LLM trained on a set of rules should be able to generalize to variants of those rules. In visual reasoning, a VLM should remain consistent at the task when aspects of the visual input, such as color and spatial layout, are changed.
In their experiments, the researchers used two representative tasks. The first was GeneralPoints, a benchmark that evaluates a model's arithmetic reasoning capabilities. The model is given four cards, either as textual descriptions or as images, and is asked to combine them to reach a target number. To study rule-based generalization, the researchers trained the model on one set of rules, then evaluated it on a different rule. For visual generalization, they trained the model on cards of one color and tested its performance on cards of other colors and numbering schemes.
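To make the task format concrete, here is a hypothetical sketch of how a GeneralPoints-style answer could be checked programmatically. It illustrates the idea of a verifiable arithmetic target, not the benchmark's actual evaluation code; the target of 24 and the rule comments are assumptions based on the classic card game the benchmark resembles.

```python
# Hypothetical verifier for a GeneralPoints-style task: an answer counts as
# correct if it is an arithmetic expression that uses each of the four card
# values exactly once and evaluates to the target number.
import re

def check_answer(cards: list[int], target: int, expression: str) -> bool:
    # Reject anything other than digits, whitespace, parentheses and + - * / .
    if not re.fullmatch(r"[\d\s()+\-*/.]+", expression):
        return False
    # The numbers used must be exactly the four card values, in any order.
    # Different rule variants would change how cards map to numbers
    # (e.g. how face cards are counted), which is what the rule-based
    # generalization test varies.
    numbers = sorted(int(n) for n in re.findall(r"\d+", expression))
    if numbers != sorted(cards):
        return False
    try:
        value = eval(expression)  # input is already restricted to arithmetic characters
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-6

print(check_answer([1, 2, 3, 4], 24, "1 * 2 * 3 * 4"))  # True
```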
The second task is V-IRL, which tests the model's spatial reasoning capabilities in an open-world navigation domain that uses realistic visual input. This task also comes in pure-language and vision-language versions. The researchers evaluated generalization by changing the kind of instructions and visual representations the model was trained and tested on.

They ran their tests on Llama-3.2-Vision-11B, warming the model up by training it on a small SFT dataset, then creating separate versions for each task and training paradigm. For each task, they scaled up training separately with RL and with SFT. The SFT process trains the model on additional hand-crafted solutions, while RL lets the model generate many solutions for each problem, evaluate the outcomes and train itself on the correct answers.
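Loosely, that RL recipe can be pictured as a sample-verify-reinforce loop. The sketch below is a simplified, rejection-sampling-style approximation of the idea, not the study's actual training algorithm; it assumes the model, tokenizer, optimizer and check_answer helper from the earlier sketches.

```python
# Very simplified sketch of RL with a verifiable reward: the model samples
# several candidate solutions per problem, a programmatic checker scores them,
# and only the verified-correct ones are reinforced.

def rl_step(problem_prompt: str, cards: list[int], target: int, num_samples: int = 8):
    correct = []
    for _ in range(num_samples):
        # Sample a candidate solution from the current policy.
        inputs = tokenizer(problem_prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, do_sample=True, max_new_tokens=64)
        answer = tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        # The reward comes from a verifier, not from human labels.
        if check_answer(cards, target, answer):
            correct.append(answer)

    # Reinforce only the correct solutions; incorrect ones are simply discarded.
    # (In practice the prompt tokens would be masked out of the loss.)
    for answer in correct:
        text = problem_prompt + answer
        inputs = tokenizer(text, return_tensors="pt")
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        loss.backward()
    if correct:
        optimizer.step()
        optimizer.zero_grad()
```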
The results show that reinforcement learning consistently improves performance on examples that differ drastically from the training data. In contrast, SFT appears to memorize the training rules and fails to generalize to out-of-distribution (OOD) examples. These observations hold in both text-only and multimodal settings.

Implications for real-world applications
Although their experiments show that RL generalizes better than SFT, the researchers also found that SFT is helpful for stabilizing the model's output format and is crucial for enabling RL to achieve its performance gains. Without the initial SFT stage, RL training did not reach desirable results.
This differs somewhat from the results obtained with DeepSeek-R1-Zero, which was post-trained on pure RL. The researchers suggest this might be due to the different backbone models used in their experiments.
It is clear that there is a lot of untapped potential in RL-heavy approaches. For use cases with verifiable results, letting models learn on their own can often lead to unanticipated solutions that humans might not have crafted themselves. This could come in very handy in settings where creating hand-crafted examples is tedious and expensive.