AI Training Breakthrough: Cheaper Reasoning Models Unveiled
29 Apr
Summary
- New AI training technique, RLSD, combines reinforcement learning with self-distillation.
- RLSD significantly outperforms existing methods in visual reasoning benchmarks.
- The approach lowers technical and financial barriers for custom AI reasoning models.

Researchers from JD.com and academic institutions have introduced a new AI training paradigm, Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD), aimed at reducing the substantial resource demands typically associated with training AI reasoning models. RLSD integrates the outcome-level performance signal of reinforcement learning with the detailed, token-level feedback of self-distillation, offering a more efficient approach.
Experiments demonstrate that models trained using RLSD achieve superior performance compared to those developed with conventional distillation or reinforcement learning algorithms. This breakthrough promises to lower the technical and financial hurdles for enterprises seeking to develop custom reasoning models aligned with their specific business logic.
The RLSD framework decouples the direction and magnitude of parameter updates. It uses verifiable environmental feedback for update direction and repurposes self-distillation's token-by-token assessment to determine the magnitude. This method avoids the pitfalls of previous techniques like On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD), which suffered from sparse feedback or privileged information leakage, respectively.
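The decoupling described above can be illustrated with a short sketch. This is a hedged, hypothetical rendering of the idea, not the authors' actual implementation: the function `rlsd_loss_sketch`, its signature, and the specific choice of per-token KL as the magnitude signal are all assumptions made for illustration. The verifiable reward fixes the sign (direction) of the policy-gradient-style update, while a self-distillation term assigns each token a weight (magnitude).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rlsd_loss_sketch(policy_logits, teacher_logits, token_ids, reward):
    """Illustrative RLSD-style loss (hypothetical, not the paper's code).

    - direction: sign of a verifiable environment reward (e.g. answer correct?)
    - magnitude: per-token weights from a self-distillation signal, here the
      KL divergence between a frozen teacher copy and the current policy.
    policy_logits, teacher_logits: (T, V) arrays; token_ids: (T,) sampled tokens.
    """
    p = softmax(policy_logits)   # current policy's per-token distributions
    q = softmax(teacher_logits)  # frozen "teacher" (self-distillation) distributions

    # Log-probability of each sampled token under the current policy.
    token_logp = np.log(p[np.arange(len(token_ids)), token_ids] + 1e-9)

    # Magnitude: normalized per-token KL(teacher || policy) -- tokens where the
    # teacher and policy disagree most receive the largest update weight.
    per_token_kl = (q * (np.log(q + 1e-9) - np.log(p + 1e-9))).sum(axis=-1)
    magnitude = per_token_kl / (per_token_kl.sum() + 1e-9)

    # Direction: the verifiable reward only sets the sign of the update.
    direction = 1.0 if reward > 0 else -1.0
    return -direction * (magnitude * token_logp).sum()
```

Under this sketch, a sparse pass/fail reward no longer has to carry all the learning signal by itself (OPD's sparse-feedback problem), and the magnitude comes from the model's own frozen copy rather than from a privileged teacher that saw the answer (OPSD's leakage problem).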
In testing, RLSD models achieved the highest average accuracy across multiple visual reasoning benchmarks, notably outperforming other methods. The framework also offers significant efficiency gains, demonstrating a faster convergence rate than standard algorithms. RLSD's stability and performance ceiling also surpass OPSD, which experienced performance degradation over time.