Efficient Memory Management Techniques for LLM Inference in Mobile System (모바일 환경에서의 효과적인 LLM 추론을 위한 메모리 관리 기법 연구)

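Korean Abstract (translated)

On-device LLMs are drawing renewed attention because of the privacy concerns and network latency of server-based LLMs, but the memory management policies of mobile operating systems are limited in how efficiently they can manage memory during LLM inference. The Initial KV Cache Swap and Deferred Weight Reclamation techniques proposed in this paper reduce the memory footprint of the preallocated KV cache by using zRAM and minimize storage I/O by deferring the reclamation of model weights, thereby improving LLM inference performance. The proposed techniques reduce memory usage by up to 27% compared with the existing Linux kernel and can help optimize LLM inference performance in mobile environments with heavy memory contention. In addition, for inference techniques that maintain multiple candidate paths, such as speculative decoding, the memory savings grow in proportion to the number of paths, which shows that a variety of LLM inference techniques can be applied in mobile environments.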

Multilingual Abstract

On-device LLMs have gained attention because of the privacy and network latency issues associated with cloud-based LLMs. However, the memory management policies of mobile operating systems are limited in how efficiently they handle memory during LLM inference. In this paper, we propose two techniques: Initial KV Cache Swap, which uses zRAM to reduce the memory footprint of the preallocated KV cache, and Deferred Weight Reclamation, which reduces storage I/O by deferring the eviction of model weights. Together they improve LLM inference performance. The proposed approach reduces memory usage by up to 27% compared with the default Linux kernel, which helps optimize LLM inference in memory-constrained mobile environments. Moreover, for inference techniques that maintain multiple candidate paths, such as speculative decoding, the memory savings grow with the number of paths, demonstrating that the approach can support diverse LLM decoding techniques on mobile devices.
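
The link between candidate paths and memory savings can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a conventional transformer layout in which each token stores one key and one value vector per layer; the function names and the model dimensions are hypothetical and chosen only for illustration, not taken from the paper. It estimates how much of a preallocated KV cache holds no tokens yet, which is the kind of region a zRAM-based swap could plausibly target.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_elem=2, n_paths=1):
    """Bytes needed to hold keys and values for n_tokens on each of n_paths."""
    # The factor 2 accounts for storing both a key and a value per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem * n_paths


def untouched_bytes(max_tokens, used_tokens, **shape):
    """Preallocated KV-cache bytes that do not hold any token yet."""
    return (kv_cache_bytes(n_tokens=max_tokens, **shape)
            - kv_cache_bytes(n_tokens=used_tokens, **shape))


# Hypothetical model shape (assumption, not from the paper), 16-bit elements.
model = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for paths in (1, 2, 4):
    idle = untouched_bytes(max_tokens=4096, used_tokens=512,
                           n_paths=paths, **model)
    print(f"{paths} path(s): {idle / 2**20:.0f} MiB preallocated but unused")
```

Under these assumed dimensions the unused portion doubles each time the number of paths doubles, which is consistent with the abstract's statement that savings scale with the number of candidate paths. The actual figures depend on the model and on how the inference runtime allocates its cache.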
