AI Training Breakthrough: Cheaper Reasoning Models Unveiled
29 Apr
Summary
- New AI training technique, RLSD, combines reinforcement learning with self-distillation.
- RLSD significantly outperforms existing methods in visual reasoning benchmarks.
- The approach lowers technical and financial barriers for custom AI reasoning models.

Researchers from JD.com and academic institutions have introduced a new AI training paradigm, Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD), aimed at reducing the substantial resource demands typically associated with training AI reasoning models. RLSD integrates the outcome-level performance signal of reinforcement learning with the detailed, token-level feedback of self-distillation, offering a more efficient approach.
Experiments demonstrate that models trained using RLSD achieve superior performance compared to those developed with conventional distillation or reinforcement learning algorithms. This breakthrough promises to lower the technical and financial hurdles for enterprises seeking to develop custom reasoning models aligned with their specific business logic.
The RLSD framework decouples the direction and magnitude of parameter updates. It uses verifiable environmental feedback for update direction and repurposes self-distillation's token-by-token assessment to determine the magnitude. This method avoids the pitfalls of previous techniques like On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD), which suffered from sparse feedback or privileged information leakage, respectively.
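The decoupling described above can be illustrated with a short sketch. This is a hedged, hypothetical rendering of the idea, not the authors' actual implementation: the function `rlsd_loss_sketch`, its signature, and the specific choice of per-token KL as the magnitude signal are all assumptions made for illustration. The verifiable reward fixes the sign (direction) of the policy-gradient-style update, while a self-distillation term assigns each token a weight (magnitude).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rlsd_loss_sketch(policy_logits, teacher_logits, token_ids, reward):
    """Illustrative RLSD-style loss (hypothetical, not the paper's code).

    - direction: sign of a verifiable environment reward (e.g. answer correct?)
    - magnitude: per-token weights from a self-distillation signal, here the
      KL divergence between a frozen teacher copy and the current policy.
    policy_logits, teacher_logits: (T, V) arrays; token_ids: (T,) sampled tokens.
    """
    p = softmax(policy_logits)   # current policy's per-token distributions
    q = softmax(teacher_logits)  # frozen "teacher" (self-distillation) distributions

    # Log-probability of each sampled token under the current policy.
    token_logp = np.log(p[np.arange(len(token_ids)), token_ids] + 1e-9)

    # Magnitude: normalized per-token KL(teacher || policy) -- tokens where the
    # teacher and policy disagree most receive the largest update weight.
    per_token_kl = (q * (np.log(q + 1e-9) - np.log(p + 1e-9))).sum(axis=-1)
    magnitude = per_token_kl / (per_token_kl.sum() + 1e-9)

    # Direction: the verifiable reward only sets the sign of the update.
    direction = 1.0 if reward > 0 else -1.0
    return -direction * (magnitude * token_logp).sum()
```

Under this sketch, a sparse pass/fail reward no longer has to carry all the learning signal by itself (OPD's sparse-feedback problem), and the magnitude comes from the model's own frozen copy rather than from a privileged teacher that saw the answer (OPSD's leakage problem).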
In testing, RLSD models achieved the highest average accuracy across multiple visual reasoning benchmarks, notably outperforming other methods. The framework also offers significant efficiency gains, demonstrating a faster convergence rate than standard algorithms. RLSD's stability and performance ceiling also surpass OPSD, which experienced performance degradation over time.