Papers · 1 month ago

ROPD: On-policy distillation with rubrics alone, no teacher logits; 10x sample-efficiency gain

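ROPD, proposed by Junfeng Fang's team, differs from conventional logit-based on-policy distillation (OPD): it uses rubrics (evaluation criteria) in place of teacher logits, so alignment works even with a black-box teacher. It extracts a per-prompt rubric from the differences between teacher and student outputs, scores student rollouts against that rubric, and uses those scores for on-policy optimization. It outperformed logit-based OPD in most cases, with up to 10x better sample efficiency. Its limitations are that rubric generation requires extra computation and that rubric quality can vary with task complexity.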
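To make the scoring step concrete, here is a minimal Python sketch of grading student rollouts against a per-prompt rubric. It is an illustration under stated assumptions, not the paper's implementation: the `Criterion` class, the keyword-based `judge` functions (stand-ins for an LLM judge), the weighted-fraction score, and the group-normalized advantages are all hypothetical choices. The summary above does not specify how ROPD represents rubrics or which optimizer it uses.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float
    judge: Callable[[str], bool]  # stand-in for an LLM-based judge


def rubric_score(rollout: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria the rollout satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.judge(rollout))
    return earned / total if total > 0 else 0.0


def normalized_advantages(scores: List[float]) -> List[float]:
    """Center and scale scores within one prompt's rollout group.

    Assumed scheme for illustration; the source does not name the optimizer.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


# Toy per-prompt rubric; in ROPD it would be derived from teacher-student
# output differences, which this sketch does not attempt to reproduce.
rubric = [
    Criterion("states the final answer", 2.0, lambda t: "answer:" in t.lower()),
    Criterion("shows intermediate steps", 1.0, lambda t: "step" in t.lower()),
]

# Toy student rollouts sampled for the same prompt.
rollouts = [
    "Step 1: add. Step 2: divide. Answer: 4",
    "Answer: 4",
    "I am not sure.",
]

scores = [rubric_score(r, rubric) for r in rollouts]
advantages = normalized_advantages(scores)
```

Because the score depends only on the rollout text and the rubric, nothing in this loop needs teacher logits, which is what lets a black-box teacher be used.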

#on-policy-distillation
#alignment
#rubric
#llm
#sample-efficiency

Junfeng Fang


