Group Sequence Policy Optimization for LLMs

Author
Neural Intelligence Network
Published
Fri 01 Aug 2025
Episode Link
https://podcasters.spotify.com/pod/show/neuralintelpod/episodes/Group-Sequence-Policy-Optimization-for-LLMs-e36a1fl

The source introduces Group Sequence Policy Optimization (GSPO), a novel reinforcement learning algorithm developed by the Qwen Team at Alibaba Inc. for training large language models. The paper contrasts GSPO with earlier methods such as Group Relative Policy Optimization (GRPO), attributing GRPO's instability to its misapplied token-level importance sampling. GSPO instead defines importance ratios on the likelihood of the entire sequence, yielding more stable and efficient training, particularly for Mixture-of-Experts (MoE) models, where it eliminates the need for complex stabilization strategies such as Routing Replay. The authors demonstrate GSPO's superior performance and training efficiency through empirical evaluations, noting its contribution to the improved capabilities of the latest Qwen3 models and its potential to simplify RL infrastructure.
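To make the token-level vs. sequence-level distinction concrete, here is a minimal sketch of a sequence-level clipped surrogate in the spirit the summary describes. It assumes the GSPO-style ratio is the length-normalized sequence likelihood ratio computed from per-token log-probabilities; the function names and the clipping constant `eps` are illustrative choices, not the authors' reference implementation.

```python
import math

def gspo_sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level importance ratio.

    Computes (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|) in log space
    for numerical stability, where each argument is the list of
    per-token log-probabilities of the same sampled response y.
    """
    assert new_logprobs and len(new_logprobs) == len(old_logprobs)
    n = len(new_logprobs)
    # One ratio per *sequence*, unlike GRPO's one ratio per token.
    log_ratio = (sum(new_logprobs) - sum(old_logprobs)) / n
    return math.exp(log_ratio)

def gspo_surrogate(new_logprobs, old_logprobs, advantage, eps=0.2):
    """PPO-style clipped objective applied at the sequence level.

    `advantage` is the group-normalized advantage of this response;
    `eps` is a hypothetical clipping range for illustration.
    """
    s = gspo_sequence_ratio(new_logprobs, old_logprobs)
    clipped = max(min(s, 1.0 + eps), 1.0 - eps)
    # Pessimistic (min) objective, as in PPO, but clipping whole
    # responses rather than individual tokens.
    return min(s * advantage, clipped * advantage)
```

For example, if the updated policy assigns the response a higher likelihood (per-token log-probs `[-1.0, -2.0]` vs. `[-1.5, -2.5]`), the normalized ratio is `exp(0.5) ≈ 1.65`, and with a positive advantage the clipped surrogate caps its contribution at `1 + eps`.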