Papers · 1 month ago

CompactAttention: 2.72x Speedup at 128K Context in Chunked Prefill

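Seoul National University's VLSI Lab has proposed CompactAttention, which accelerates attention computation under chunked prefill. Existing sparse attention methods are inefficient for chunk-sized queries. CompactAttention uses a 2D block-sparse mask as a KV selection signal to build a GQA-aware block table, which allows in-place access to the selected KV blocks without KV compaction. On the RULER benchmark with LLaMA-3.1-8B-Instruct, it maintains accuracy close to dense attention while achieving up to a 2.72x attention speedup at 128K context.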
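To make the block-table idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the mask is indexed as `[q_head][q_block][kv_block]`, that consecutive query heads share one KV head, and that a KV block is kept for a KV head if any query head or query block in its group selects it. The function name `build_gqa_block_table` and the union rule are hypothetical choices made for illustration.

```python
from typing import List

# Assumed layout: mask[q_head][q_block][kv_block] is True when that
# KV block is selected for that query block of that query head.
Mask = List[List[List[bool]]]


def build_gqa_block_table(mask: Mask, num_kv_heads: int) -> List[List[int]]:
    """Return, per KV head, the sorted indices of KV blocks to read in place.

    Hypothetical rule: a KV block is kept for a KV head if any query head
    in its GQA group selects it for any query block of the current chunk.
    """
    num_q_heads = len(mask)
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("number of query heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads

    table: List[List[int]] = []
    for kv_head in range(num_kv_heads):
        selected = set()
        first = kv_head * group_size
        # Union the selections of every query head sharing this KV head.
        for q_head in range(first, first + group_size):
            for q_block_row in mask[q_head]:
                for kv_block, keep in enumerate(q_block_row):
                    if keep:
                        selected.add(kv_block)
        table.append(sorted(selected))
    return table


# Toy example: 4 query heads, 2 KV heads, 2 query blocks, 4 KV blocks.
example_mask: Mask = [
    [[True, False, False, False], [True, False, True, False]],
    [[True, False, False, False], [False, False, True, False]],
    [[False, True, False, False], [False, True, False, True]],
    [[False, False, False, True], [False, False, False, True]],
]
example_table = build_gqa_block_table(example_mask, num_kv_heads=2)
```

In this sketch the table holds indices into the existing KV cache rather than copies of the selected blocks, which is the sense in which access stays in place and no compaction step is needed.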

#attention
#chunked-prefill
#long-context
#llm
#seoul-national-university

Seoul National University VLSI Lab


