Papers · 1 month ago

NVIDIA Gated DeltaNet-2: channel-wise erase/write gates decouple linear attention memory edits, outperforming Mamba-2 and others at 1.3B

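A team at NVIDIA has proposed Gated DeltaNet-2, which separates the erase and write roles in the memory edits of linear attention models. The earlier Gated DeltaNet and KDA controlled erasing and writing together through a single scalar gate; Gated DeltaNet-2 introduces a channel-wise erase gate b_t and a channel-wise write gate w_t, each adjusted independently. At 1.3B parameters, trained on 100B FineWeb-Edu tokens, it broadly outperformed Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and retrieval, with particularly strong multi-key retrieval on the long-context RULER needle-in-a-haystack benchmark. The code is public, but results are reported for only a single experiment at the 1.3B scale, which is a limitation.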
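To make the scalar-versus-channel-wise distinction concrete, here is a minimal sketch of one delta-rule memory update in plain Python. It is an illustration under assumptions, not the paper's formulation: the update form `S - (S k)(b ⊙ k)ᵀ + v (w ⊙ k)ᵀ`, the choice to apply the gates over key channels, and the function names `delta_step_scalar` and `delta_step_decoupled` are all hypothetical. Decay terms and normalization are omitted.

```python
def matvec(S, k):
    """Read from memory: multiply the state matrix S (d_v x d_k) by the key k."""
    return [sum(s * kj for s, kj in zip(row, k)) for row in S]


def delta_step_scalar(S, k, v, beta):
    """Delta-rule update where one scalar beta gates both erase and write."""
    pred = matvec(S, k)  # value currently stored under key k
    return [
        [S[i][j] - beta * pred[i] * k[j] + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def delta_step_decoupled(S, k, v, b, w):
    """ASSUMED form: per-channel erase gate b and write gate w (vectors over
    key channels) scale the erase and write terms independently."""
    pred = matvec(S, k)
    return [
        [S[i][j] - pred[i] * b[j] * k[j] + v[i] * w[j] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


# Toy example: 2 value channels, 3 key channels, empty memory.
S0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
k = [1.0, 0.0, 0.0]
v = [2.0, 3.0]

S_scalar = delta_step_scalar(S0, k, v, beta=0.5)
# With b == w == beta in every channel, the decoupled step matches the scalar one.
S_tied = delta_step_decoupled(S0, k, v, b=[0.5] * 3, w=[0.5] * 3)
# Erase without writing: b = 1, w = 0 clears what is stored under k.
S_erased = delta_step_decoupled(S_scalar, k, v, b=[1.0] * 3, w=[0.0] * 3)
# Write without erasing: b = 0, w = 1 adds v on top of the old content.
S_written = delta_step_decoupled(S_scalar, k, v, b=[0.0] * 3, w=[1.0] * 3)
```

In this sketch the scalar-gated step is the special case where both gate vectors are tied to the same constant. The last two calls show edits that a single shared gate cannot express: erasing a slot without writing to it, and writing to a slot without erasing it.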

#linear-attention
#gated-deltanet
#nvidia
#long-context
#efficient-transformer

NVIDIA


