Papers · 1 month ago

RTPurbo: 9.36x prefill speedup for full-attention LLMs, exploiting intrinsic sparsity while keeping the KV cache

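RTPurbo builds on the observation that full-attention LLMs already exhibit intrinsic sparsity: only the retrieval heads keep the full KV cache, while the remaining heads perform sparse attention through a 16-dimensional indexer. At 1M context it shows a 9.36x prefill speedup and a 2.01x decode speedup while maintaining near-lossless accuracy. It is practical in that it can be applied with only a few hundred steps of fine-tuning, but identifying the retrieval heads and training the indexer require additional resources.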
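The split between retrieval heads and indexer-driven heads can be illustrated with a minimal single-query sketch. This is not the authors' implementation: the function names, the top-k selection rule, and the `top_k` parameter are assumptions made for illustration, and the summary does not say how the indexer scores are computed or trained. Only the 16-dimensional indexer width comes from the text above.

```python
import math

INDEXER_DIM = 16  # indexer width stated in the summary


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values, positions):
    """Scaled dot-product attention restricted to the given token positions."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[p]) * scale for p in positions])
    dim = len(values[0])
    return [sum(w * values[p][d] for w, p in zip(weights, positions))
            for d in range(dim)]


def head_output(q, keys, values, is_retrieval_head,
                idx_q=None, idx_keys=None, top_k=4):
    """Return (output vector, attended positions) for one head and one query.

    Retrieval heads attend over every cached token. Other heads score tokens
    with low-dimensional indexer vectors and attend only to the top_k
    highest-scoring positions (top-k selection is an assumption here).
    """
    n = len(keys)
    if is_retrieval_head:
        positions = list(range(n))
    else:
        scores = [dot(idx_q, k) for k in idx_keys]
        best = sorted(range(n), key=lambda p: scores[p], reverse=True)[:top_k]
        positions = sorted(best)
    return attend(q, keys, values, positions), positions
```

Under these assumptions, a non-retrieval head scores each cached token with a 16-dimensional dot product and then runs full-width attention over only `top_k` tokens, which is where the saving at long context would come from.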

#attention
#long-context
#sparsity
#efficiency
#rtpurbo

RTP-LLM


