Less supervision, better results: Study shows AI models generalize more effectively on their own

Language models can generalize better when left to create their own solutions, a new study by the University of Hong Kong and the University of California, Berkeley, shows. The findings, which apply to both large language models (LLMs) and vision language models (VLMs), challenge one of the main beliefs of the LLM community: that models require hand-labeled training examples. In fact, the researchers show that training models on too many hand-crafted examples can hurt the model's ability to generalize to unseen data.

SFT vs RL in model training

For a long time, supervised fine-tuning (SFT) has been the gold standard for training LLMs and VLMs. Once a model is pre-trained on raw text and image data, companies and labs usually fine-tune it on a large dataset of hand-crafted examples in question/answer or request/response format. After SFT, the model can undergo additional training stages, such as reinforcement learning from human feedback (RLHF), where the model tries to learn implicit human preferences from signals such as answer rankings or likes and dislikes of the model's responses.

SFT is useful for steering a model's behavior toward the kind of tasks its creators designed it for. However, collecting the data is a slow and expensive process, which is a bottleneck for many companies and labs.

Recent developments in LLMs have created interest in pure reinforcement learning (RL) approaches, where the model is given a task and left to learn it on its own without hand-crafted examples. The most prominent example is DeepSeek-R1, the OpenAI o1 competitor that mostly used reinforcement learning to learn complex reasoning tasks.

Generalization vs memorization

One of the main problems of machine learning (ML) systems is overfitting, where the model performs well on its training data but fails to generalize to unseen examples. During training, the model gives the false impression of having learned the task, while in practice it has just memorized its training examples. In large and complex AI models, separating generalization from memorization can be difficult.
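The gap between memorization and generalization can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the study: the task (doubling a number), the `memorizer` lookup model, and the `rule_model` stand-in for a model that actually learned the rule.

```python
def memorizer(train_pairs):
    """Return a 'model' that only looks up answers it saw during training."""
    table = dict(train_pairs)
    return lambda x: table.get(x)  # returns None for any unseen input


def accuracy(model, pairs):
    """Fraction of (input, expected) pairs the model answers correctly."""
    return sum(model(x) == y for x, y in pairs) / len(pairs)


# Toy task (an assumption for illustration): double the input.
train = [(x, 2 * x) for x in range(10)]
unseen = [(x, 2 * x) for x in range(10, 20)]

lookup_model = memorizer(train)


def rule_model(x):
    """Stand-in for a model that learned the underlying rule."""
    return 2 * x


for name, model in [("memorizer", lookup_model), ("rule", rule_model)]:
    print(name, accuracy(model, train), accuracy(model, unseen))
```

Both models look identical when scored on the training pairs; only the score on unseen inputs tells them apart, which is why the study evaluates on examples outside the training distribution.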

The new study focuses on the generalization abilities of RL and SFT training in textual and visual reasoning tasks. For textual reasoning, an LLM trained on a set of rules should be able to generalize to variants of those rules. In visual reasoning, a VLM should remain consistent in task performance when different aspects of the visual input change, such as color and spatial layout.

In their experiments, the researchers used two representative tasks. The first was GeneralPoints, a benchmark that evaluates a model's arithmetic reasoning capabilities. The model is given four cards, as textual descriptions or images, and is asked to combine them to reach a target number. To study rule-based generalization, the researchers trained the model using one set of rules, then evaluated it using a different rule. For visual generalization, they trained the model using cards of one color and tested its performance on cards of other colors and numbering schemes.
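A card task like this has verifiable answers, which a short checker can make concrete. The sketch below is an illustration under stated assumptions, not the benchmark's code: the two face-card rules in `FACE_RULES`, the treatment of an ace as 1, the expression format, and the function names are all hypothetical.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Hypothetical rule variants: how face cards map to numbers.
FACE_RULES = {
    "faces_as_10": {"J": 10, "Q": 10, "K": 10},
    "faces_as_11_12_13": {"J": 11, "Q": 12, "K": 13},
}

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def card_values(cards, rule):
    """Translate card names into numbers under the chosen rule (ace = 1 here)."""
    faces = FACE_RULES[rule]
    return [faces[c] if c in faces else 1 if c == "A" else int(c) for c in cards]


def _evaluate(node, used):
    """Evaluate +, -, *, / over integer literals, recording each literal used."""
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return OPS[type(node.op)](left, right)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)  # exact arithmetic, no float rounding
    raise ValueError("unsupported expression")


def is_correct(expression, cards, rule, target):
    """True if the expression uses each card exactly once and equals the target."""
    used = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(card_values(cards, rule)) and value == target


cards = ["Q", "2", "A", "4"]
print(is_correct("(12 - 4) * (2 + 1)", cards, "faces_as_11_12_13", 24))
print(is_correct("(12 - 4) * (2 + 1)", cards, "faces_as_10", 24))
```

The same expression is accepted under one rule and rejected under the other, which is the kind of rule shift a model that merely memorized one mapping would fail to handle.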

The second task is V-IRL, which tests the model's spatial reasoning capabilities in an open-world navigation domain that uses realistic visual input. This task also comes in pure-language and vision-language versions. The researchers evaluated generalization by changing the kind of instructions and visual representations the model was trained and tested on.

They ran their tests on Llama-3.2-Vision-11B, warming the model up by training it on a small SFT dataset, then creating separate versions for each task and training paradigm. For each task, they trained the model separately with RL and SFT. The SFT process trains the model on additional hand-crafted solutions, while RL lets the model generate many solutions for each problem, evaluate the results, and train on the correct answers.
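The generate, evaluate, and keep-the-correct-answers loop described here can be sketched in a few lines. This is a deliberately simplified illustration, not the study's training code: `collect_verified_samples`, `toy_generate`, and `toy_verify` are hypothetical names, the "model" is a random guesser, and a real RL pipeline would use the reward signal to update model weights, which is omitted.

```python
import random


def collect_verified_samples(problems, generate, verify,
                             samples_per_problem=8, seed=0):
    """Sample several candidate solutions per problem and keep the verified ones."""
    rng = random.Random(seed)
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            candidate = generate(problem, rng)
            if verify(problem, candidate):  # outcome-based check, no human label
                kept.append((problem, candidate))
    return kept


# Toy stand-ins (assumptions for illustration): the task is adding two numbers.
def toy_generate(problem, rng):
    """A noisy 'model' that is sometimes off by one."""
    a, b = problem
    return a + b + rng.choice([-1, 0, 1])


def toy_verify(problem, candidate):
    """A verifier that knows how to check the answer."""
    a, b = problem
    return candidate == a + b


batch = collect_verified_samples([(1, 2), (3, 4)], toy_generate, toy_verify)
print(len(batch), "verified samples kept for the next training step")
```

The key design point is that the training signal comes from a verifier rather than from hand-written solutions, so the approach only applies to tasks whose outcomes can be checked automatically.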

The results show that reinforcement learning consistently improves performance on examples that are drastically different from the training data. SFT, on the other hand, seems to memorize the training rules and does not generalize to out-of-distribution (OOD) examples. These observations apply to both text-only and multimodal settings.

*SFT-trained models perform well on training examples (in-distribution) while showing poor performance on unseen examples (out-of-distribution) (source: arXiv)*

Implications for real world applications

Although their experiments show that RL is better at generalizing than SFT, the researchers also found that SFT helps stabilize the model's output format and is crucial for enabling RL to achieve its performance gains. Without the initial SFT stage, RL training did not achieve desirable results.

This is somewhat different from the results obtained by DeepSeek-R1-Zero, which was post-trained on pure RL. The researchers suggest that this may be due to the different backbone model they used in their experiments.

It is clear that there is a lot of untapped potential in RL-heavy approaches. For use cases that have verifiable results, letting the models learn on their own can often lead to unanticipated results that humans could not have crafted themselves. This could come in very handy in settings where creating hand-crafted examples is tedious and expensive.
