Papers · 1 month ago

Tencent replaces ratio clipping with a KL divergence proximal constraint in flow matching RL: Flow-DPPO

A Tencent team has proposed Flow-DPPO, which replaces PPO-style ratio clipping with a KL divergence based proximal constraint for online RL training of flow matching models. Because the per-step policy is Gaussian, the KL divergence can be computed exactly and cheaply, and an asymmetric divergence mask blocks only the updates that leave the trust region. Compared with ratio clipping, the reported results show higher reward, less catastrophic forgetting, and stable multi-epoch training. The method is specific to the Gaussian policy of flow models, however, so it is hard to apply directly to discrete token models.

The team points to a structural problem with PPO ratio clipping in online RL training of flow matching models and proposes Flow-DPPO, a KL divergence based proximal constraint, as the alternative.

Key takeaways

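Task: Online RL alignment of flow matching based image and video generation models.
Improvement: Flow-DPPO reaches higher reward than the existing Flow-GRPO and CPS baselines, with better KL-proximal efficiency.
Stability: In multi-epoch training, performance degrades under ratio clipping, while Flow-DPPO stays stable.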

Method

Observation: The per-step policy of a flow model is Gaussian, so the KL divergence can be computed exactly in closed form.
Core idea: Use a KL divergence proximal constraint instead of ratio clipping, with an asymmetric mask that blocks only the updates that fall outside the trust region and exceed the divergence threshold.
This mitigates the over-constraining and under-constraining problems at the same time.
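
The two ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the old and new per-step policies are isotropic Gaussians that share one standard deviation `sigma`, in which case the KL divergence reduces to the squared distance between the means divided by `2 * sigma**2`. The rule in `keep_update` (drop a sample only when its KL exceeds the threshold and the probability ratio is already moving in the direction the advantage favors) is one plausible reading of the asymmetric mask. All function names and `kl_threshold` are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple


def gaussian_kl_shared_std(mu_old: Sequence[float],
                           mu_new: Sequence[float],
                           sigma: float) -> float:
    """KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) in closed form.

    Assumption: both policies share the same isotropic std `sigma`.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def gaussian_log_prob(x: Sequence[float],
                      mu: Sequence[float],
                      sigma: float) -> float:
    """Log-density of x under N(mu, sigma^2 I)."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return (-0.5 * sq_dist / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))


def keep_update(kl: float, log_ratio: float,
                advantage: float, kl_threshold: float) -> bool:
    """Hypothetical asymmetric mask: block only if the divergence is too
    large AND the ratio already moves the way the advantage pushes it."""
    moving_away = log_ratio > 0 if advantage > 0 else log_ratio < 0
    return not (kl > kl_threshold and moving_away)


def masked_surrogate(samples: Iterable[Tuple[float, float, float]],
                     kl_threshold: float) -> float:
    """Mean of ratio * advantage over (kl, log_ratio, advantage) samples,
    with blocked samples contributing zero."""
    samples = list(samples)
    total = 0.0
    for kl, log_ratio, advantage in samples:
        if keep_update(kl, log_ratio, advantage, kl_threshold):
            total += math.exp(log_ratio) * advantage
    return total / len(samples)


if __name__ == "__main__":
    mu_old, mu_new, sigma = [0.0, 0.0], [0.3, -0.4], 0.5
    x = [0.2, -0.1]  # a sampled next-step latent
    kl = gaussian_kl_shared_std(mu_old, mu_new, sigma)
    log_ratio = (gaussian_log_prob(x, mu_new, sigma)
                 - gaussian_log_prob(x, mu_old, sigma))
    print(kl, log_ratio, keep_update(kl, log_ratio, 1.0, kl_threshold=0.1))
```

Unlike a ratio clip, the blocking decision here depends on the KL between the two distributions rather than on the ratio at a single sampled point, so a sample with an extreme ratio but a small policy shift is still used, and a sample with a mild ratio but a large policy shift can still be blocked.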

Limitations and conditions

Scope: The method relies on the Gaussian policy assumption, so it is hard to apply directly to discrete token models such as LLMs.
Code: Released on GitHub and implemented on top of Tencent Hunyuan.

Editor's one-liner

Exploiting the Gaussian nature of the flow model is a clean move, and the result looks like a practical alternative to ratio clipping.

#flow-matching
#reinforcement-learning
#tencent
#kl-divergence

Tencent-Hunyuan-Multimodal-RL

