A More Stable Alternative to GRPO for LLM Training: Introducing GTPO

If you work with Large Language Models (LLMs), you know how important a stable training process is. GRPO (Group Relative Policy Optimization) is a popular method for reinforcement-learning fine-tuning of LLMs, but it has some key issues. For example, the same token can appear in both positively and negatively rewarded completions, leading to conflicting updates that break structure and hurt learning.

That’s why I’m excited to introduce GTPO, a more stable alternative to GRPO. GTPO detects and protects “conflict tokens”, filters out noisy, high-entropy completions, and works without KL-divergence regularization or a reference model.

## The Problem with GRPO
GRPO has limitations that can destabilize training and degrade results. For instance, penalizing every token in a negatively rewarded completion pushes probability mass away from those tokens and onto everything else, including very unlikely tokens, flattening the output distribution and hurting learning. This is especially problematic for LLMs, which depend on stable training to generate coherent, high-quality text.
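
To make the conflict concrete, here is a minimal sketch of how a GRPO-style group-relative advantage can push the same token in opposite directions within a single update. Everything here (the toy token ids, the rewards, the detection rule) is illustrative and not taken from the GTPO paper or codebase:

```python
import torch

# Toy group of sampled completions over a tiny vocabulary (illustrative
# data). In GRPO, every token in a completion shares that completion's
# group-relative advantage.
completions = [
    [3, 7, 7, 2],  # completion A: rewarded
    [3, 5, 7, 9],  # completion B: penalized
]
rewards = torch.tensor([1.0, 0.0])

# Group-relative advantage: reward standardized within the group.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Record the sign of the update each token id receives.
signs = {}
for comp, a in zip(completions, adv.tolist()):
    for tok in comp:
        signs.setdefault(tok, set()).add(1 if a > 0 else -1)

# A "conflict token" receives both signs in the same step: pushed up by
# the rewarded completion and down by the penalized one.
conflict_tokens = [tok for tok, s in signs.items() if len(s) == 2]
print(conflict_tokens)  # -> [3, 7]
```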

## How GTPO Solves These Issues
GTPO takes a different approach. It detects conflict tokens and protects them: harmful negative updates on those tokens are skipped, while helpful positive ones are kept. It also filters noisy, high-entropy completions out of the update, which further improves training stability. The result is steadier training and better performance, both in and out of distribution; a minimal sketch of both mechanisms follows below.
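
Here is one way those two mechanisms could look when written as a generic policy-gradient surrogate: mask negative updates on conflict tokens, and drop high-entropy completions. The function name, tensor layout, and `entropy_threshold` value are my own assumptions for illustration, not the paper's exact objective:

```python
import torch

def gtpo_style_loss(logprobs, advantages, conflict_mask, entropies,
                    entropy_threshold=2.0):
    """Illustrative sketch only; not the paper's exact formulation.

    logprobs:      (G, T) per-token log-probs of the sampled completions
    advantages:    (G,)   group-relative advantage per completion
    conflict_mask: (G, T) True where the token also appears in a
                          positively rewarded completion in this group
    entropies:     (G,)   mean token entropy of each completion
    """
    adv = advantages.unsqueeze(1).expand_as(logprobs)

    # Protect conflict tokens: skip the harmful negative update on them,
    # while keeping the helpful positive updates intact.
    harmful = (adv < 0) & conflict_mask
    token_mask = (~harmful).float()

    # Filter noisy, high-entropy completions out of the update entirely.
    keep = (entropies < entropy_threshold).float().unsqueeze(1)

    # Plain REINFORCE-style surrogate: no KL term, no reference model.
    per_token = -adv * logprobs * token_mask * keep
    return per_token.sum() / (token_mask * keep).sum().clamp(min=1.0)
```

Note that nothing in this surrogate references a frozen copy of the model, which is consistent with GTPO working without KL-divergence regularization or a reference policy.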

## Real-World Results
The results speak for themselves: on GSM8K, MATH, and AIME 2024, GTPO trains more stably and achieves better results than GRPO. This is especially promising for anyone working with LLMs.

## Try GTPO for Yourself
If you'd like to try GTPO, check out the paper, browse the fully open code on GitHub, or run it right now on Colab. It's worth noting that GSPO, another recently released alternative, falls back into GRPO's problems in certain settings.

## Final Thought
GTPO is an exciting development in LLM training. By offering a more stable alternative to GRPO, it can help researchers and developers train models that generate higher-quality text and perform better overall.
