Have you ever encountered issues with token-level importance sampling in reinforcement learning for large language models (LLMs)? The Qwen team recently proposed Group Sequence Policy Optimization (GSPO), a new approach that tackles the instability and scaling problems of Group Relative Policy Optimization (GRPO). Let’s dive into the details.
GRPO, introduced by DeepSeek, optimizes LLMs with group-relative reward signals instead of a learned critic. However, it applies importance-sampling ratios per token, and the noise from these ratios accumulates over long sequences, producing high-variance updates. This is particularly problematic for Mixture-of-Experts (MoE) models, where small policy updates can shift expert routing for individual tokens and destabilize training. To counteract this, GRPO-based pipelines often rely on workarounds such as Routing Replay.
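To make that concrete, here's a minimal PyTorch sketch of the per-token importance ratios a GRPO-style objective works with. The tensor names (`logp_new`, `logp_old`) are my own, and the sketch assumes you've already gathered the log-probabilities of the sampled tokens under the current and old policies:

```python
import torch

def grpo_token_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token importance ratios, one per sampled token.

    logp_new, logp_old: [batch, seq_len] log-probs of the sampled tokens
    under the current policy and the old (behavior) policy.
    """
    # Each token gets its own ratio. Over a long response these noisy
    # per-token corrections accumulate, which is the variance problem
    # attributed to token-level importance sampling.
    return torch.exp(logp_new - logp_old)  # shape: [batch, seq_len]
```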
GSPO, on the other hand, defines the importance ratio at the sequence level: the ratio of sequence likelihoods under the new and old policies, normalized by sequence length. Clipping and optimization then operate on whole responses rather than individual tokens, which dramatically reduces variance and eliminates the need for routing hacks. The Qwen team reports stable MoE convergence and better scaling with GSPO.
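For comparison, here is a sketch of a length-normalized sequence-level ratio in the spirit of GSPO, under the same assumptions (hypothetical tensor names; `mask` marks response tokens vs. padding):

```python
import torch

def gspo_sequence_ratio(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Sequence-level importance ratio, normalized by response length.

    mask: [batch, seq_len], 1.0 for response tokens, 0.0 for padding.
    """
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Average the per-token log-ratio over the response, then exponentiate.
    # This equals the sequence likelihood ratio raised to 1/|y|, i.e. the
    # geometric mean of the token ratios: one scalar per sequence.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    return torch.exp(log_ratio)  # shape: [batch]
```

In a PPO-style surrogate you would then clip this single per-sequence ratio and weight the whole response's advantage with it, instead of clipping thousands of noisy per-token ratios.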
In the reported experiments, GSPO produces stronger training-reward curves than GRPO and better results on benchmarks such as AIME'24, LiveCodeBench, and CodeForces. It also improves more steadily as training compute grows, shows smoother scaling trends, and does not require Routing Replay to perform well.
If you've run into these issues yourself, I'd love to hear about your experiences with token-level importance sampling or GRPO. Have you tried sequence-level weighting in your RLHF pipelines?