hypes.news
← Back to feed
Papers·1주 전

Fudan survey: Proxy Compression Hypothesis unifies reward hacking across RLHF, RLAIF, RLVR

Fudan survey: Proxy Compression Hypothesis unifies reward hacking across RLHF, RLAIF, RLVR

Fudan University researchers propose the Proxy Compression Hypothesis (PCH), formalizing reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. The framework explains phenomena like verbosity bias, sycophancy, hallucinated justification, and benchmark overfitting across RLHF, RLAIF, and RLVR regimes, unifying detection and mitigation strategies by targeting compression, amplification, or co-adaptation dynamics. The survey highlights open challenges in scalable oversight, multimodal grounding, and agentic autonomy.

Fudan University

Comments

— 첫 댓글을 남겨보세요 —