The increasing scale of deep neural networks (DNNs), particularly in convolutional neural networks (CNNs) and large language models (LLMs), has led to substantial improvements in performance across a wide range of tasks. However, this success comes at the cost of considerable memory and computational demands, creating practical barriers to deploying these models in resource-constrained environments.
Several lines of research aim to amplify the advantages of large pre-trained models while mitigating their limitations. To enhance model capabilities, fine-tuning adapts a pre-trained model to specific tasks, and context window extension techniques increase the input length that LLMs can handle. To reduce cost, quantization has emerged as a key optimization strategy that lowers both memory consumption and computational cost by approximating high-precision values with lower-bit representations. Nevertheless, these techniques often suffer from quality degradation when not carefully designed.
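As a point of reference for what "approximating high-precision values with lower-bit representations" means, the following is a minimal sketch of generic min-max uniform quantization. It is not one of the methods proposed in this dissertation; the function names `quantize_uniform` and `dequantize_uniform`, the 4-bit setting, and the random test tensor are illustrative assumptions.

```python
import numpy as np

def quantize_uniform(x, num_bits=4):
    """Map floats to integers in [0, 2**num_bits - 1] (generic min-max scheme)."""
    qmax = 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Step size between adjacent quantization levels; guard against a constant tensor.
    scale = (x_max - x_min) / qmax if x_max > x_min else 1.0
    q = np.clip(np.round((x - x_min) / scale), 0, qmax).astype(np.int32)
    return q, scale, x_min

def dequantize_uniform(q, scale, x_min):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale + x_min

# Hypothetical weight tensor, used only to exercise the functions.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)

q, scale, zero = quantize_uniform(w, num_bits=4)
w_hat = dequantize_uniform(q, scale, zero)

# Round-to-nearest error is bounded by half a quantization step.
max_err = float(np.abs(w - w_hat).max())
```

The reconstruction error grows with the step size `scale`, which is why a few extreme values that stretch the min-max range can degrade quality at low bit widths.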
This dissertation highlights the overlooked importance of activation behavior in neural networks and proposes a set of activation-aware methods that improve the quality and efficiency of quantization, fine-tuning, and long-context retrieval. For quantization of CNNs, we introduce INSTA-BNN [1], a binary neural network that uses instance-specific activation statistics to dynamically determine binarization thresholds, improving the accuracy of 1-bit quantized models. For quantization of LLMs, we propose Outlier-Aware Weight Quantization (OWQ) [2], which enhances quantization quality by preserving the weights corresponding to activation outliers in higher precision. OWQ is extended by Weak Column Tuning (WCT), a fine-tuning method that updates only the preserved weight columns, significantly reducing the number of trainable parameters while maintaining high adaptation quality.
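The idea of keeping the weights tied to activation outliers in higher precision can be illustrated with a small sketch. This is not the OWQ algorithm itself: the selection criterion used here (mean squared activation per input channel), the helper names `fake_quantize` and `outlier_aware_quantize`, the 3-bit setting, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fake_quantize(w, num_bits=3):
    """Per-row min-max round-to-nearest quantization, returned in float form."""
    qmax = 2 ** num_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min

def outlier_aware_quantize(w, x, num_keep=2, num_bits=3):
    """Quantize w (out, in) except the columns whose input channels in x
    (samples, in) carry the largest activation energy (assumed criterion)."""
    channel_energy = (x ** 2).mean(axis=0)
    keep = np.sort(np.argsort(channel_energy)[-num_keep:])
    mask = np.zeros(w.shape[1], dtype=bool)
    mask[keep] = True
    w_hat = w.copy()                      # kept columns stay in full precision
    w_hat[:, ~mask] = fake_quantize(w[:, ~mask], num_bits)
    return w_hat, keep

# Synthetic calibration activations with two outlier channels (3 and 11).
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 16))
x[:, [3, 11]] *= 20.0
w = rng.standard_normal((8, 16))

w_hat, keep = outlier_aware_quantize(w, x, num_keep=2)
w_plain = fake_quantize(w, num_bits=3)

# Compare layer-output error with and without preserving the outlier columns.
err_outlier_aware = np.linalg.norm(x @ w.T - x @ w_hat.T)
err_plain = np.linalg.norm(x @ w.T - x @ w_plain.T)
```

In this toy setup, the indices in `keep` identify the columns left in full precision; under the description above, a WCT-style fine-tuning step would treat only those columns as trainable and leave the quantized remainder frozen.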
To further accelerate both inference and fine-tuning, we propose QEFT [3], which reorganizes weight structures using offline global reordering based on activation outlier patterns that are consistent across layers. The work on QEFT consists of two main parts, the method design and the acceleration kernel implementation; our contribution covers the method development and theoretical aspects. QEFT improves inference latency, training time, and adapted model accuracy. Finally, for long-context retrieval, we introduce SEAL [4], which learns to scale attention components, leading to notable gains in retrieval quality across long-context scenarios.
Through extensive verification, this dissertation demonstrates that leveraging activation behavior can serve as a unifying principle for improving the quality and efficiency of deep models across both vision and language domains. The proposed methods pave the way for broader and more effective deployment of large-scale neural networks in real-world applications.