Activation-aware Quantization and Fine-tuning for Enhanced Efficiency and Quality of Large Models = 대규모 모델의 효율성과 품질 향상을 위한 활성도 기반 양자화 및 미세조정

Multilingual Abstract

The increasing scale of deep neural networks (DNNs), particularly in convolutional neural networks (CNNs) and large language models (LLMs), has led to substantial improvements in performance across a wide range of tasks. However, this success comes at the cost of considerable memory and computational demands, creating practical barriers to deploying these models in resource-constrained environments.
Several lines of research aim to amplify the advantages of large pre-trained models while mitigating their limitations. To extend model capability, fine-tuning adapts a model to specific tasks, and context window extension techniques increase the input length that LLMs can handle. To reduce cost, quantization has emerged as a key optimization strategy that lowers both memory consumption and computational cost by approximating high-precision values with lower-bit representations. Nevertheless, these techniques often suffer from quality degradation when not carefully designed.
This dissertation highlights the overlooked importance of activation behavior in neural networks and proposes a set of activation-aware methods that improve the quality and efficiency of quantization, fine-tuning, and long-context retrieval. For quantization of CNNs, we introduce INSTA-BNN [1], a binary neural network that uses instance-specific activation statistics to dynamically determine binarization thresholds, improving the accuracy of 1-bit quantized models. For quantization of LLMs, we propose Outlier-Aware Weight Quantization (OWQ) [2], which enhances quantization quality by preserving weights corresponding to activation outliers in higher precision. This is extended by Weak Column Tuning (WCT), a fine-tuning method that updates only the preserved weight columns, significantly reducing trainable parameters while maintaining high adaptation quality.
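As a rough illustration of the weak-column idea behind WCT, the sketch below applies one gradient step only to the preserved columns and leaves every other weight untouched. The function name `wct_step`, the plain SGD update, and the toy matrices are assumptions made for this example; the abstract does not describe the actual training procedure.

```python
def wct_step(weights, grads, weak_cols, lr):
    """Apply one SGD step to the weak columns only (hypothetical helper)."""
    weak = set(weak_cols)
    updated = []
    for w_row, g_row in zip(weights, grads):
        updated.append([
            w - lr * g if j in weak else w  # frozen unless column j is weak
            for j, (w, g) in enumerate(zip(w_row, g_row))
        ])
    return updated


# Toy 2x4 layer; column 2 plays the role of a preserved (weak) column.
weights = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0]]
grads = [[1.0, 1.0, 1.0, 1.0],
         [2.0, 2.0, 2.0, 2.0]]
weak_cols = [2]

new_weights = wct_step(weights, grads, weak_cols, lr=0.5)
trainable = len(weights) * len(weak_cols)  # parameters actually updated
total = len(weights) * len(weights[0])     # all parameters in the layer
print(new_weights, f"{trainable}/{total} trainable")
```

In this toy layer only 2 of 8 weights receive updates, which is the sense in which tuning only the preserved columns reduces the number of trainable parameters.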
To further accelerate both inference and fine-tuning, we propose QEFT [3], which reorganizes weight structures using offline global reordering based on activation outlier patterns that are consistent across layers. QEFT consists of two main parts, the method design and the acceleration kernel implementation; our contribution is the method development and its theoretical aspects. QEFT improves inference latency, training time, and adapted-model accuracy. Finally, for long-context retrieval, we introduce SEAL [4], which learns to scale attention components, yielding notable gains in retrieval quality across long-context scenarios.
Through extensive evaluation, this dissertation demonstrates that leveraging activation information can serve as a unifying principle for improving the quality and efficiency of deep models across both vision and language domains. The proposed methods pave the way for broader and more effective deployment of large-scale neural networks in real-world applications.

Korean Abstract

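Deep neural networks (DNNs), built on large numbers of parameters and deep layer structures, deliver outstanding performance in applications such as image recognition and speech recognition. Beyond the image domain, where convolutional neural networks (CNNs) led the way, recent large language models (LLMs) in particular, with even larger model sizes and training datasets, show unprecedented performance across diverse fields, including language tasks such as chatbots and question answering. The growing number of parameters has the advantage that the zero-shot performance of a pre-trained model alone is already remarkable across a wide range of fields, but it is also the main source of memory and computation overhead, so training and inference require substantial resources. This drawback also makes it difficult to run models in resource-constrained environments such as mobile devices, creating a barrier to the distribution and popularization of these models.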
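Various methods have been proposed to amplify these advantages and offset the drawbacks. Methods for further improving the capability of pre-trained models include fine-tuning, which additionally trains the model on labeled data, and context window extension, which fine-tunes on long training samples to increase the length of input the model can accept. The representative approach to the overhead problem is quantization, which approximates high-precision values with low-precision ones. Representing the model's weights or neuron activations with fewer bits effectively reduces both the memory consumption and the computational overhead of large models.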
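To make the notion of approximating high-precision values with fewer bits concrete, here is a minimal sketch of round-to-nearest uniform quantization. The helper name `quantize_uniform`, the min/max (asymmetric) grid, and the sample values are assumptions chosen for illustration, not the specific quantizer used in the dissertation.

```python
def quantize_uniform(values, bits):
    """Map floats onto a uniform grid with 2**bits levels (illustrative helper)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]  # integer codes
    approx = [lo + c * scale for c in codes]           # dequantized values
    return codes, approx, scale


weights = [-1.0, -0.4, 0.1, 0.3, 1.0]
codes, approx, scale = quantize_uniform(weights, bits=2)

# Rounding to the nearest grid point bounds the error by half a step.
max_err = max(abs(w - a) for w, a in zip(weights, approx))
print(codes, max_err <= scale / 2 + 1e-9)
```

Each weight is stored as a 2-bit code plus a shared `lo` and `scale`, and the rounding error of every value is at most half a grid step.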
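However, these methods suffer from performance degradation unless they are carefully designed. Quantization reduces cost overhead, but the quantization error caused by information loss degrades model quality. Fine-tuning, especially when combined with quantization, struggles to reach the target quality. For context window extension as well, accuracy has been reported to drop as the input grows longer within the extended context length. Meanwhile, although activations are not only the data actually being processed inside the model but also, along with the weights, values central to the computation, many existing studies overlook the importance of exploiting them.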
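This dissertation emphasizes the importance of considering model activations in the methods above and proposes techniques that improve quality by appropriately exploiting activation information, or the activations themselves, in each method. First, to improve the quality of 1-bit (binary) quantization of CNNs, we propose a binary neural network (BNN) that uses instance-wise thresholds. Existing BNN studies quantize activations to +1 or -1 using a fixed threshold, but the input activations of real models show distributions that differ from instance to instance. A fixed threshold fails to reflect these subtle differences in activation values between instances. This work therefore proposes a binary quantization method that generates thresholds dynamically from each instance's activation statistics, mitigating the performance loss caused by quantization.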
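The contrast between a fixed and an instance-wise threshold can be sketched as follows. The threshold form `mean + alpha * std + beta` and the names `binarize_fixed` and `binarize_instance_aware` are hypothetical choices for this example; the abstract only states that thresholds are derived from per-instance activation statistics.

```python
from statistics import mean, pstdev


def binarize_fixed(acts, threshold=0.0):
    """Binarize to +1/-1 with one threshold shared by all instances."""
    return [1 if a >= threshold else -1 for a in acts]


def binarize_instance_aware(acts, alpha=0.0, beta=0.0):
    """Binarize with a threshold computed from this instance's statistics.

    alpha and beta stand in for learnable parameters (assumed form).
    """
    threshold = mean(acts) + alpha * pstdev(acts) + beta
    return [1 if a >= threshold else -1 for a in acts]


# Two instances with the same shape of distribution but different offsets.
instance_a = [0.2, 0.4, 0.6, 0.8]
instance_b = [2.2, 2.4, 2.6, 2.8]

fixed_a = binarize_fixed(instance_a)            # every value maps to +1
fixed_b = binarize_fixed(instance_b)            # every value maps to +1
aware_a = binarize_instance_aware(instance_a)   # split around the mean 0.5
aware_b = binarize_instance_aware(instance_b)   # split around the mean 2.5
print(fixed_a, fixed_b, aware_a, aware_b)
```

With the fixed threshold both instances collapse to all +1, whereas the per-instance threshold keeps the relative ordering within each instance.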
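In LLMs, meanwhile, activation outliers with very large absolute values are known to appear in channels at fixed positions, regardless of the input. Building on this, we next extend quantization to LLMs and propose Outlier-Aware Weight Quantization (OWQ), which reflects activation outliers in weight quantization. Again focusing on activations, our analysis of activation outliers reveals that they strongly affect quantization sensitivity even when only the weights are quantized. We therefore propose a mixed-precision quantization scheme that keeps the weak columns, the sensitive weight channels corresponding to activation outliers, in high precision without quantization, substantially improving LLM quantization quality over existing methods. OWQ greatly reduces the quantization error of a pre-trained model, and the quantized language model can then be fine-tuned to gain additional capability on a target task. We further propose a task-specific adaptation that fine-tunes only OWQ's high-precision weak columns (weak column tuning, WCT), which achieves good adaptation performance with fewer tuned parameters than existing quantization-aware fine-tuning methods.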
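The effect of keeping outlier-aligned columns in high precision can be illustrated with a small sketch. Everything here is an assumption for demonstration: the selection rule (largest activation magnitude), the per-row min/max quantizer, and the names `quantize_row` and `mixed_precision_quantize`. The dissertation's own selection metric and quantizer may differ.

```python
def quantize_row(row, bits):
    """Round-to-nearest uniform quantization of one weight row."""
    levels = (1 << bits) - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in row]


def mixed_precision_quantize(W, act_magnitude, bits, num_weak):
    """Quantize all columns except the num_weak with the largest activations."""
    n_in = len(W[0])
    ranked = sorted(range(n_in), key=lambda j: act_magnitude[j], reverse=True)
    weak = set(ranked[:num_weak])
    dense = [j for j in range(n_in) if j not in weak]
    Wq = []
    for row in W:
        new_row = list(row)  # weak columns keep their original values
        if dense:
            for j, q in zip(dense, quantize_row([row[j] for j in dense], bits)):
                new_row[j] = q
        Wq.append(new_row)
    return Wq, sorted(weak)


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


W = [[0.31, -0.72, 0.05, 0.48],
     [-0.55, 0.12, -0.27, 0.66]]
x = [0.1, -0.2, 8.0, 0.3]            # channel 2 carries an activation outlier
magnitude = [abs(v) for v in x]

ref = matvec(W, x)
W_full, _ = mixed_precision_quantize(W, magnitude, bits=3, num_weak=0)
W_mixed, weak = mixed_precision_quantize(W, magnitude, bits=3, num_weak=1)
err_full = max(abs(a - b) for a, b in zip(ref, matvec(W_full, x)))
err_mixed = max(abs(a - b) for a, b in zip(ref, matvec(W_mixed, x)))
print(weak, err_full, err_mixed)
```

Because the outlier channel multiplies whatever rounding error its weight column carries, leaving that single column unquantized removes most of the output error in this toy case.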
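OWQ and WCT thus improve both efficiency and quality by exploiting the activation characteristics of LLMs. However, the irregular positions of the channels that carry activation outliers limit how much inference and fine-tuning can be accelerated. To address this, we further analyze the activation outlier patterns of each layer in an LLM, confirm that the channel positions where outliers occur remain nearly constant throughout the model, and use this information to propose a technique that reorders the weak columns in advance. The QEFT quantization method built on this technique further improves inference speed and fine-tuning speed.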
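The idea of reordering weak columns ahead of time can be sketched as a fixed permutation applied to both the weight columns and the matching input channels, which leaves the layer output unchanged while making the weak columns contiguous. Placing the weak columns at the end, and the helper names below, are assumptions of this sketch rather than details taken from QEFT.

```python
def reorder_permutation(n_in, weak_cols):
    """Return column indices with dense columns first and weak columns last."""
    weak = sorted(set(weak_cols))
    weak_set = set(weak)
    dense = [j for j in range(n_in) if j not in weak_set]
    return dense + weak


def permute_columns(W, perm):
    return [[row[j] for j in perm] for row in W]


def permute_vector(x, perm):
    return [x[j] for j in perm]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


W = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0, 9.0, 10.0]]
x = [0.5, -1.0, 2.0, 0.25, 4.0]
weak_cols = [3, 1]                        # scattered positions before reordering

perm = reorder_permutation(len(x), weak_cols)
W_reordered = permute_columns(W, perm)    # done once, offline
x_reordered = permute_vector(x, perm)     # same channel order for the input

before = matvec(W, x)
after = matvec(W_reordered, x_reordered)
print(perm, before, after)
```

After the permutation the last two columns of `W_reordered` hold the weak columns as one contiguous block, which is the kind of regular layout that is easier to handle in a dedicated kernel.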
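Along with fine-tuning, context window extension is also drawing attention as interest in and demand for tasks that use long inputs grow. In the Transformer architecture underlying LLMs, the activations themselves matter, but self-attention, which computes the relationship between the activations of previous inputs and those of the current input, acts as the core component and is essential for long-context retrieval as well. Nevertheless, despite the extended context window, long-context retrieval accuracy degrades as the LLM input grows longer. To improve this, we propose SEAL, which learns a per-head or per-channel scale for self-attention. By appropriately adjusting the scale of attention components, SEAL substantially improves retrieval accuracy over the original model on various long-context retrieval tasks.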
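A minimal sketch of per-head and per-channel scaling is given below. It assumes the scales multiply each head's output vector before the heads are concatenated; that placement, the function names, and the numeric values are illustrative assumptions, since the abstract only says that head-wise or channel-wise scales are learned.

```python
def scale_heads(head_outputs, head_scales):
    """Multiply every channel of head h by one scalar (head-wise scaling)."""
    return [[s * v for v in head] for head, s in zip(head_outputs, head_scales)]


def scale_channels(head_outputs, channel_scales):
    """Multiply each channel of the concatenated heads by its own scalar."""
    flat = [v for head in head_outputs for v in head]
    return [c * v for c, v in zip(channel_scales, flat)]


# Two attention heads with head dimension 2 (toy values).
head_outputs = [[1.0, 2.0], [3.0, 4.0]]
head_scales = [0.5, 2.0]                 # one value per head

per_head = scale_heads(head_outputs, head_scales)
per_head_flat = [v for head in per_head for v in head]

# Head-wise scaling is the special case of channel-wise scaling in which
# all channels of a head share the same scale.
head_dim = len(head_outputs[0])
expanded = [s for s in head_scales for _ in range(head_dim)]
per_channel = scale_channels(head_outputs, expanded)
print(per_head_flat, per_channel)
```

The channel-wise variant has one parameter per channel instead of one per head, so it is strictly more expressive at the cost of more learned values.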
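With the activation-aware quantization and fine-tuning methods proposed in this dissertation, we improve the efficiency of neural networks while preserving model quality, and further increase model quality on targeted downstream tasks and on long inputs. We expect this to contribute to the popularization of deep neural networks and large language models.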

Table of Contents

I. Introduction
II. Background and Related Work
2.1 Binary Neural Networks (BNNs)
2.1.1 Network Binarization
2.1.2 Threshold Optimization
2.2 Quantization of Large Language Models (LLMs)
2.2.1 Int8 Quantization for Activation and Weight
2.2.2 Low-precision Weight Quantization for LLMs
2.3 Advanced applications of LLMs
2.3.1 Parameter-Efficient Fine-Tuning (PEFT)
2.3.2 Quantization-aware PEFT
2.3.3 Benchmarks for Long-Context LLMs
III. INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold
3.1 Introduction
3.2 Motivation
3.3 Proposed BNN with INSTAnce-aware threshold (INSTA-BNN)
3.3.1 Importance of instance-wise threshold
3.3.2 Importance of higher-order statistics information
3.3.3 Squeeze-and-Excitation Module
3.3.4 Instance-aware PReLU
3.4 Practical guidelines for INSTA-BNN
3.4.1 Selective use of the INSTA module
3.4.2 Reuse of activation statistics
3.4.3 Detailed model structure of latency-optimized INSTA-BNN
3.5 Experiments
3.5.1 Experimental Setup
3.5.2 Comparison on ImageNet Classification
3.5.3 Inference latency evaluation
3.5.4 Ablation Study
3.5.5 Visualization results of t-SNE
3.6 Conclusion
IV. OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models
4.1 Introduction
4.2 Motivation
4.2.1 Layer-wise Quantization and Hessian of Weights
4.3 OWQ: Outlier-aware Weight Quantization
4.3.1 Quantization Configuration Search
4.3.2 PEFT with Weak Column Tuning
4.4 Experiments
4.4.1 Experimental Setup
4.4.2 Results of Perplexity Measure
4.4.3 Results of Various Few-shot Tasks
4.4.4 Acceleration on Real Device
4.4.5 Quantization Speed
4.4.6 Results of WCT-based Fine-tuning
4.4.7 Comparison of PTQ Methods used in WCT
4.4.8 Comparison with Group-wise Quantization
4.4.9 Weak Column Selection Metrics
4.4.10 Varying Ratios of Weak Columns
4.4.11 Layer-wise Quantization Sensitivity
4.5 Conclusion
V. QEFT: Quantization for Efficient Fine-Tuning of LLMs
5.1 Introduction
5.2 Motivation
5.3 Proposed method: QEFT
5.3.1 Data Structure and Quantization Process
5.3.2 Offline Global Reordering
5.3.3 Efficient Backward Computation
5.4 Optimal Weak Column Selection
5.5 Advanced Application: PEFT Merging
5.6 Experiments
5.6.1 Experiments Setting
5.6.2 Overall Fine-tuning Results
5.6.3 PEFT Merging Results
5.6.4 Inference Acceleration
5.7 Conclusion
VI. SEAL: Scaling to Emphasize Attention for Long-Context Retrieval
6.1 Introduction
6.2 Motivation
6.2.1 Attention Per-head Pruning
6.2.2 Attention Head-wise Scaling
6.2.3 Attention Channel-wise Scaling
6.3 Proposed method: SEAL
6.3.1 Format-aware data synthesis
6.3.2 Learnable space design: SEAL-H and SEAL-C
6.3.3 Practicality of SEAL: offline merging
6.4 Experiments
6.4.1 Qualitative Analysis with Circuit Analysis
6.4.2 Results on line retrieval task
6.4.3 Results on Needle-in-a-Haystack task
6.4.4 Results on RULER benchmark
6.5 SEAL with context length extension
6.6 Analysis on transferability of SEAL
6.7 Comparison with In-Context Learning
6.8 Comparison with Low-Rank Adaptation
6.9 Conclusion
VII. Conclusion
Summary (in Korean)
References
