Activation Compression for Memory-efficient and Reliable LLM

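Korean Abstract

Transformer-based large language models (LLMs) have achieved striking performance across many fields as model sizes have grown, becoming the dominant paradigm of modern AI. However, as parameter counts and context lengths increase, the memory required for LLM training has begun to exceed the memory capacity of modern GPUs and has become a major bottleneck. To address this, this dissertation treats the memory problem in LLM training from the perspectives of capacity efficiency and error resilience.
First, for memory-efficient LLM training, we propose a hardware-based compressor targeting activations, which account for most of the training memory. The compressor is designed around the distribution and characteristics of activations and compresses them block by block. Compression consists of Base-Delta Compression (BDC), which compresses exponent blocks, and Uniform Bit Compression (UBC), which handles the sign and the upper bits of the mantissa. The compressed data are then truncated to a size suited to each activation, further reducing activation memory usage. In addition, exploiting the compressor's low hardware cost, in-memory compression reduces memory usage without degrading model accuracy; as a result, activation memory is reduced by up to 4×.
Second, as LLM training runs grow longer and memory usage increases, memory errors during training have become more likely. Because even a small number of bit errors can cause training to fail, it is important to improve the error resilience of stored parameters. Building on the inherent robustness of the compression method proposed above, this work proposes a compression-based error hardening technique. An additional encoding scheme that limits errors in sensitive bits is added to the existing compression method, removing all error-sensitive bits from the original data. Furthermore, because the tag bits produced during compression can cause severe decoding errors, the encoding formats are aligned bit-wise and similar tags are placed adjacently to reduce errors caused by tag bits. As a result, stable training is confirmed even in environments with high error rates.
Finally, to verify the generality of the proposed compression method, we apply it beyond training activations to the KV cache, a major source of memory consumption during LLM inference. The baseline compression algorithm alone is effective, but BDC is optimized for KV data to achieve a higher compression ratio. Specifically, selecting the optimal delta size further improves the compression ratio, and maximum-based rounding reduces the bit-width needed to represent the data while minimizing accuracy loss. As a result, the proposed method effectively reduces KV cache size while achieving higher accuracy than quantization methods.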


Multilingual Abstract

Transformer-based large language models (LLMs) have achieved remarkable performance improvements across various tasks as their model sizes have grown, becoming the mainstream paradigm in modern AI. However, as both the number of parameters and the context length continue to increase, the memory requirements for LLM training have begun to exceed the capacity of modern GPUs, becoming a major bottleneck. This dissertation addresses this memory problem from the perspectives of capacity efficiency and error tolerance.
First, to enable memory-efficient LLM training, we propose a hardware-based compressor targeting activations, which occupy the majority of training memory. The proposed compressor is designed according to the distribution and characteristics of activations, grouping them into blocks for compression. It consists of base-delta compression (BDC), which compresses exponent blocks, and uniform bit compression (UBC), which handles the sign and the upper bits of the mantissa. The compressed data are then tailored to a target size via truncation, followed by tensor-level size optimization to further reduce memory usage. Additionally, leveraging its low hardware cost, we introduce a seamless in-memory compression scheme that reduces memory consumption with negligible performance impact. As a result, up to a 4× reduction in activation memory is achieved.
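The base-delta idea for exponent blocks can be illustrated with a minimal sketch. It assumes bfloat16 values (1 sign, 8 exponent, 7 mantissa bits), an 8-value block, and the block minimum as the base; the abstract does not state the number format, block size, base choice, or bit layout, so all of these are assumptions, as are the helper names `bf16_fields`, `bdc_encode`, and `bdc_decode`. The dissertation's hardware compressor may differ in every one of these details.

```python
import struct

def bf16_fields(x):
    """Split x into bfloat16 (sign, exponent, mantissa) by truncating its float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16  # top 16 bits of float32
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

def bdc_encode(exponents):
    """Encode one block as (base, delta_bits, deltas); base = block minimum (assumption)."""
    base = min(exponents)
    deltas = [e - base for e in exponents]
    delta_bits = max(deltas).bit_length()  # width needed for the largest delta
    return base, delta_bits, deltas

def bdc_decode(base, deltas):
    """Recover the original exponents exactly; this step is lossless."""
    return [base + d for d in deltas]

block = [0.5, 0.75, 1.5, 0.3, 2.0, 0.9, 1.1, 0.6]  # hypothetical activation values
exponents = [bf16_fields(v)[1] for v in block]
base, delta_bits, deltas = bdc_encode(exponents)
raw_bits = 8 * len(block)                  # 8 exponent bits per value
packed_bits = 8 + delta_bits * len(block)  # one 8-bit base plus narrow deltas
print(exponents, base, delta_bits, deltas, raw_bits, packed_bits)
```

For this hypothetical block the exponents take only four distinct values, so each delta fits in 2 bits and the exponent field shrinks from 64 to 24 bits. The saving depends entirely on how tightly the exponents in a block cluster.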
Second, as LLM training becomes longer and more memory-intensive, the likelihood of memory errors increases. Since even a few bit errors can cause training failures, enhancing error tolerance for stored parameters is crucial. Building on the inherent robustness of our compression method, we propose a compression-based error hardening technique. An additional encoding scheme that restricts errors in sensitive bits eliminates all error-critical bits in the original data. Furthermore, since compression introduces tag bits that can cause severe decoding errors, we mitigate this by bit-wise aligning encoding formats and rearranging tag transitions so that similar tags are placed adjacently, reducing transition-induced errors. As a result, we observe stable training even under high error rates.
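Why some bits are far more error-critical than others can be seen from a single bit flip. The sketch below assumes bfloat16 storage (an assumption; the abstract does not name the format) and uses hypothetical helpers `to_bf16_bits`, `from_bf16_bits`, and `flip_bit`. It only illustrates bit sensitivity in a floating-point value and is not the dissertation's encoding scheme.

```python
import struct

def to_bf16_bits(x):
    """Return the 16-bit bfloat16 pattern of x (float32 truncated to its top 16 bits)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def from_bf16_bits(bits):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]

def flip_bit(x, pos):
    """Flip bit `pos` (0 = mantissa LSB, 14 = exponent MSB, 15 = sign) of x in bfloat16."""
    return from_bf16_bits(to_bf16_bits(x) ^ (1 << pos))

x = 0.5
mantissa_flip = flip_bit(x, 0)   # lowest mantissa bit: a tiny relative change
exponent_flip = flip_bit(x, 14)  # highest exponent bit: the magnitude explodes
sign_flip = flip_bit(x, 15)      # sign bit: same magnitude, opposite sign
print(mantissa_flip, exponent_flip, sign_flip)
```

A flip in the lowest mantissa bit moves 0.5 to 0.50390625, while a flip in the highest exponent bit turns it into 2^127. A single upset of that kind is large enough to destabilize training, which is the motivation for restricting errors in such bits.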
Finally, to demonstrate the generality of the proposed compression method, we extend it beyond training activations and apply it to the KV cache, a major memory consumer during LLM inference. While the baseline compression algorithm already provides benefits, we further optimize BDC for KV data. Specifically, the algorithm is modified to select the optimal delta size to further improve the compression ratio. In addition, maximum-based rounding is employed to minimize accuracy degradation while reducing the bit-width required for data representation. As a result, the proposed method effectively reduces the KV cache size and achieves higher accuracy than quantization.
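To give a sense of scale for the KV cache, the following back-of-the-envelope sketch computes its size from the standard count of two tensors (keys and values) per layer. The model configuration is hypothetical and not taken from the dissertation, and the 4-bit figure is only an example of a reduced bit-width; it ignores any per-block metadata such as bases or tags that a real compressed format would add.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits_per_value):
    """Bytes needed to hold keys and values for every layer and token."""
    values = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = keys + values
    return values * bits_per_value // 8

# Hypothetical configuration, chosen only for illustration.
cfg = dict(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1)
fp16 = kv_cache_bytes(bits_per_value=16, **cfg)
four_bit = kv_cache_bytes(bits_per_value=4, **cfg)
print(fp16 / 2**30, four_bit / 2**30)  # sizes in GiB
```

Under these assumptions a single 4096-token sequence needs 2 GiB at 16 bits per value and 0.5 GiB at 4 bits, and the cost grows linearly with both sequence length and batch size.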


Table of Contents

Abstract i
Contents iii
List of Figures vii
List of Tables ix
1 Introduction 1
1.1 Study Background 1
1.2 Purpose of Research 4
2 Background 6
2.1 Large Language Model 6
2.1.1 Transformer Architecture 6
2.1.2 Training Memory Usage 7
2.1.3 KV Cache 8
2.2 Data Compression 9
3 Activation Compression for Memory-efficient LLM Training 12
3.1 Previous Work 12
3.1.1 Activation Memory Reduction Methods 12
3.1.2 Hardware Memory Compressor 13
3.2 Transformer Activation Characteristics 14
3.2.1 Activation Memory Bottleneck and Training Time 14
3.2.2 Data Distribution and Characteristics 15
3.3 FACET: Transformer Activation Compressor 18
3.3.1 Overview 18
3.3.2 Base-delta Compression 22
3.3.3 Uniform Bit Compression 25
3.3.4 Packing with Truncation 26
3.3.5 System Integration 29
3.4 Evaluation 32
3.4.1 Hardware Implementation 32
3.4.2 Training Methodology 34
3.4.3 Lossless Compression 35
3.4.4 Lossy Compression 40
3.4.5 Adaptive Compression Sizing 41
3.4.6 Accuracy and Memory Savings 43
3.4.7 Performance 45
3.4.8 Consistency 46
3.5 Discussion 48
3.6 Summary 49
4 Activation Compression for Reliable LLM Training 50
4.1 Previous Work 51
4.2 Preliminary 52
4.2.1 Error Modeling 52
4.2.2 Error Sensitivity of Activation 53
4.2.3 Opportunities for Error Mitigation 54
4.2.4 Effect of Error on Compression 55
4.3 Enhancing Error Tolerance with Compression 57
4.3.1 Overview 58
4.3.2 Compression Algorithm 58
4.3.3 Critical Bit Hiding 59
4.3.4 Proof of Error Mitigation 60
4.3.5 High Bit Encoding 63
4.3.6 Tag and Transition Error 63
4.3.7 Encoding Format Alignment 65
4.3.8 Optimal Tag Composition Search 70
4.4 Evaluation 72
4.4.1 Methodology 72
4.4.2 Training Error Tolerance 73
4.4.3 Reduction of Bit-flip Errors and Critical Bits 75
4.4.4 Tag Error Mitigation 77
4.5 Discussion 78
4.6 Summary 78
5 KV Cache Compression 80
5.1 Previous Work 81
5.2 Motivation 81
5.2.1 Quantization and the Proposed Compression 81
5.2.2 KV Statistics 82
5.3 Base-delta Compression Tuning 83
5.3.1 Variable-delta with Minimum Base 83
5.3.2 Fixed-delta with Maximum-based Rounding 85
5.4 Evaluation 86
5.4.1 Methodology 86
5.4.2 Accuracy by Bit-width 87
5.4.3 Error Tolerance 90
5.4.4 Execution Time 91
5.5 Summary 91
6 Conclusion 93
Bibliography 95
Abstract (in Korean) 103
