GSPO: A New Approach to Fine-Tuning Large Language Models

The world of natural language processing just got a little more exciting with the introduction of Group Sequence Policy Optimisation (GSPO), a new reinforcement learning algorithm for fine-tuning large language models. Developed by the Qwen team, GSPO builds upon DeepSeek’s Group Relative Policy Optimisation (GRPO) but with a key difference: it replaces token-level importance sampling with a sequence-level approach.

So, why the change? With GRPO's token-level importance sampling, every token carries its own importance ratio, so noise accumulates over long generations and the gradients become high-variance. The problem is worst in Mixture-of-Experts (MoE) models, where expert routing can shift between the old and new policies and make token-level ratios fluctuate sharply; in practice, GRPO often relies on hacks like Routing Replay to converge stably.
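To make the contrast concrete, here is a minimal sketch of token-level importance ratios in the style of GRPO. It assumes you already have per-token log-probabilities of the sampled tokens under the current and old policies; the function name and tensor shapes are illustrative assumptions, not the authors' code.

```python
import torch

def grpo_token_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token importance ratios r_t = pi_theta(y_t | x, y_<t) / pi_old(y_t | x, y_<t).

    logp_new, logp_old: [batch, seq_len] log-probabilities of the sampled
    tokens under the current and old (behaviour) policies.
    Returns one ratio per token, shape [batch, seq_len].
    """
    return torch.exp(logp_new - logp_old)
```

Each token's advantage is weighted by its own ratio, so a long generation is updated through many noisy per-token factors, which is where the variance comes from.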

GSPO, on the other hand, uses sequence-level importance ratios, normalised by sequence length, resulting in lower variance and more stable off-policy updates. This means MoE models can be trained without the need for extra routing constraints.
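Here is a matching sketch of a sequence-level, length-normalised ratio in the spirit of GSPO: the per-token log-ratios are averaged over the generated tokens before exponentiating, giving a single weight for the whole response. Again, the names, shapes, and mask convention are assumptions for illustration rather than the paper's implementation.

```python
import torch

def gspo_sequence_ratio(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Sequence-level ratio s_i = exp((1/|y_i|) * sum_t log(pi_theta / pi_old)).

    logp_new, logp_old: [batch, seq_len] token log-probabilities of the sampled tokens.
    mask: [batch, seq_len] with 1 for generated tokens and 0 for padding.
    Returns one ratio per sequence, shape [batch].
    """
    log_ratio = (logp_new - logp_old) * mask         # per-token log-ratios
    lengths = mask.sum(dim=-1).clamp(min=1)          # |y_i| for each sequence
    return torch.exp(log_ratio.sum(dim=-1) / lengths)
```

Because a single, length-normalised ratio weights the entire response, its variance does not grow with generation length, which is what makes the off-policy updates more stable.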

The benefits of GSPO are clear: higher benchmark rewards on AIME’24, LiveCodeBench, and CodeForces, faster convergence, and better scaling with compute. But what do you think? Could sequence-level weighting become the default over token-level methods in RL-based LLM training?
