The world of natural language processing just got a little more exciting with the introduction of Group Sequence Policy Optimisation (GSPO), a new reinforcement learning algorithm for fine-tuning large language models. Developed by the Qwen team, GSPO builds upon DeepSeek’s Group Relative Policy Optimisation (GRPO) but with a key difference: it replaces token-level importance sampling with a sequence-level approach.
So, why the change? GRPO’s token-level importance sampling can produce high-variance gradients for long generations, and the instability is especially severe in Mixture-of-Experts (MoE) models, whose expert routing can shift between the old and current policies. As a result, GRPO often needs workarounds such as Routing Replay to converge stably.
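To make the contrast concrete, here is a minimal PyTorch-style sketch of the token-level weights GRPO works with. The function name and tensor shapes are illustrative assumptions, not taken from any particular implementation.

```python
import torch

def grpo_token_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token importance ratios: one weight for every generated token."""
    # Each sampled token gets its own ratio of new-policy to old-policy
    # probability; over a long generation these per-token weights multiply
    # noise into the gradient, which is the variance problem described above.
    return torch.exp(logp_new - logp_old)  # shape: [batch, seq_len]
```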
GSPO, on the other hand, computes a single importance ratio per sequence, normalized by length, which yields lower-variance gradients and more stable off-policy updates. This means MoE models can be trained without extra routing constraints such as Routing Replay.
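For comparison, here is a sketch of the sequence-level counterpart, assuming the length normalization is a geometric mean of the per-token ratios (i.e., the exponentiated average token log-ratio). Again, the names, shapes, and masking convention are illustrative rather than the Qwen team’s actual code.

```python
import torch

def gspo_sequence_ratio(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """One length-normalized importance ratio per response sequence."""
    # Sum the token log-ratios over each response and divide by its length:
    # this is the geometric mean of the per-token ratios, so a single
    # unlikely token can no longer blow up the weight for the whole sequence.
    log_ratio = (logp_new - logp_old) * mask           # zero out padding
    lengths = mask.sum(dim=-1).clamp(min=1)            # tokens per response
    return torch.exp(log_ratio.sum(dim=-1) / lengths)  # shape: [batch]
```

Because the ratio is averaged over the response length, its scale stays roughly constant no matter how long the generation is, which is what keeps off-policy updates well behaved for long sequences and for MoE models whose routing drifts between rollouts.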
The reported benefits are compelling: in the Qwen team’s experiments, GSPO reaches higher benchmark rewards on AIME’24, LiveCodeBench, and CodeForces, converges faster, and scales better with training compute. But what do you think? Could sequence-level weighting become the default over token-level methods in RL-based LLM training?