Papers·3개월 전

Fudan survey: Proxy Compression Hypothesis unifies reward hacking across RLHF, RLAIF, RLVR

Fudan University researchers propose the Proxy Compression Hypothesis (PCH), formalizing reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. The framework explains phenomena like verbosity bias, sycophancy, hallucinated justification, and benchmark overfitting across RLHF, RLAIF, and RLVR regimes, unifying detection and mitigation strategies by targeting compression, amplification, or co-adaptation dynamics. The survey highlights open challenges in scalable oversight, multimodal grounding, and agentic autonomy.

#rlhf
#reward-hacking
#alignment
#fudan-university

Fudan University

원문 보기 →

Fudan survey: Proxy Compression Hypothesis unifies reward hacking across RLHF, RLAIF, RLVR

Comments