Near-Future Policy Optimization boosts RLVR by 4.96 points on Qwen3-VL-8B

NPO leverages a policy's own later checkpoint as a source of auxiliary off-policy trajectories, balancing trajectory quality (higher Q) against variance cost (lower V) to maximize the effective learning signal S = Q/V. Applied to Qwen3-VL-8B-Instruct with GRPO, NPO raises average performance from 57.88 to 62.84, and its adaptive variant AutoNPO pushes it further to 63.15. The method requires no external teacher models and is validated in both early-stage bootstrapping and late-stage plateau-breakthrough settings.
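The selection criterion S = Q/V can be sketched as follows. This is a minimal illustration, not the paper's implementation: the proxies for Q (mean reward) and V (reward variance), the function names, and the two-way source choice are all assumptions for exposition.

```python
import numpy as np

def effective_signal(rewards: np.ndarray) -> float:
    """S = Q/V: trajectory quality over variance cost.
    Q and V proxies here are illustrative assumptions,
    not the paper's exact estimators."""
    q = rewards.mean()            # Q: quality proxy (mean verifiable reward)
    v = rewards.var() + 1e-8      # V: variance-cost proxy (reward variance)
    return q / v

def pick_source(on_policy: np.ndarray, near_future: np.ndarray) -> str:
    """Choose the trajectory source with the higher effective signal S."""
    s_on = effective_signal(on_policy)
    s_nf = effective_signal(near_future)
    return "near_future" if s_nf > s_on else "on_policy"
```

Under this sketch, a batch of near-future-checkpoint rollouts with higher mean reward and lower spread would be preferred over noisier on-policy rollouts, matching the post's intuition that later checkpoints supply higher-Q, lower-V signal.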
- #reinforcement-learning
- #rlvr
- #grpo
- #qwen
- #npo
Chuanyu Qin