On the Impossibility of Unbiased and Length-Invariant Policy Optimization with Outcome Rewards

Fei Ding; Yongkang Zhang; Yuhao Liao; Zijian Zeng; Huiming Yang

On the Impossibility of Unbiased and Length-Invariant Policy Optimization with Outcome Rewards

Authors: Fei Ding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Huiming Yang

Group Relative Policy Optimization (GRPO) is the dominant reinforcement learning algorithm for training reasoning capabilities in large language models, notably adopted by DeepSeek-R1. The recent improvement Dr. GRPO (COLM 2025) identifies the response-level length bias caused by per-trajectory length normalization in GRPO and proposes removing this normalization, claiming the resulting optimizer is "unbiased." We show that this claim is incomplete. Specifically, we establish an impossibility theorem: under the standard outcome reward + GRPO setting, no length-based weighting scheme can simultaneously achieve the following two properties. (P1) Gradient unbiasedness: the gradient estimator is an unbiased estimate of the true policy gradient. (P2) Length invariance: each trajectory's effective contribution to the gradient is independent of its token length. GRPO approximately satisfies P2 but violates P1; Dr. GRPO satisfies P1 but violates P2. We characterize the complete tradeoff spectrum via the parametric family f_alpha(L) = L^{alpha - 1}, where alpha = 0 recovers GRPO, alpha = 1 recovers Dr. GRPO, and provide quantitative analysis showing that Dr. GRPO's length bias can cause longer trajectories to dominate gradient updates by a factor proportional to the length ratio. Our results reveal that neither algorithm is universally "done right"; they occupy opposite ends of a fundamental and unavoidable tradeoff.

Comments: 9 Pages.

Download: PDF

Submission history

[v1] 2026-05-31 03:05:42

Unique-IP document downloads: 0 times

Vixra.org is a pre-print repository rather than a journal. Articles hosted may not yet have been verified by peer-review and should be treated as preliminary. In particular, anything that appears to include financial or legal advice or proposed medical treatments should be treated with due caution. Vixra.org will not be responsible for any consequences of actions that result from any form of use of any documents on this website.

Add your own feedback and questions here:
You are equally welcome to be positive or negative about any paper but please be polite. If you are being critical you must mention at least one specific error, otherwise your comment will be deleted as unhelpful.

Artificial Intelligence

On the Impossibility of Unbiased and Length-Invariant Policy Optimization with Outcome Rewards

Submission history