Data Layout Optimization and Vector Processing Unit for Output Stationary NPUs

Abstract

To handle the high computational demands and memory bandwidth requirements of deep neural networks (DNNs), a new type of processor called the neural processing unit (NPU) has been proposed. NPUs commonly consist of a matrix unit that supports matrix multiplication and convolution, and a vector processing unit (VPU) responsible for vector operations and general computation. While matrix units can be implemented in various ways, the systolic array is used in many NPUs. Systolic arrays can be categorized by their dataflow. The weight stationary systolic array (WS-SA), which employs the weight stationary dataflow, has been increasingly adopted in recent NPUs. For large systolic arrays, however, the output stationary systolic array (OS-SA), which uses the output stationary dataflow, can raise compute utilization more easily. The data layout (how data is stored in memory) is subject to different constraints in an OS-SA than in a WS-SA. This difference makes it inefficient to operate an OS-SA based NPU with the NPU organization and software built around WS-SA.

This study addresses the inefficiencies arising from the data layout in OS-SA based NPUs in two ways. First, we build a software framework, the layout mapping framework, that can represent the data layouts of OS-SA, and we employ heuristics to select, among the data layouts available in OS-SA, one that reduces the overall execution time of a DNN. Second, we design an instruction set that lets the VPU efficiently handle the frequent data rearrangements that occur when the OS-SA's data layout is used.

To select a data layout for OS-SA, a layout mapping framework capable of representing the data layouts of OS-SA was established. Similar frameworks targeting CPUs, GPUs, and WS-SA based NPUs already exist, but they cannot adequately represent the data layouts specific to OS-SA. To address this issue, a new data layout representation based on Graphene was adopted within the layout mapping framework and tailored to OS-SA NPUs. Introducing this representation alone yielded up to a 39-fold speedup on convolutional neural networks compared with the previous data layout representations.

Next, the study designed and proposed a layout mapping heuristic that optimizes the data layout of each layer within a DNN. Traditional layout mapping heuristics predominantly use a propagation-based approach, in which the data layout is first set for the operations that run on the systolic array and then propagated so that adjacent layers share the same layout. This approach does not explore the variety of data layouts available for OS-SA, ignores the performance of non-systolic-array operations, and does not optimize the point at which propagation ends. To overcome these limitations, this research introduces a new optimization method based on simulated annealing. It proposes two new state transition operations tailored to the layout mapping problem and improves convergence stability by incorporating an additional propagation technique. This method improved performance by approximately 20 % on BERT-base.

Finally, the instruction set of the VPU was tailored to the data rearrangements frequently used with OS-SA. Because the VPU exchanges data with the systolic array, it must process data stored in the OS-SA's data layout or convert data into that layout before it is fed to the OS-SA. In particular, the VPU must follow the blocked data layout characteristic of OS-SA. Previously, VPU instruction sets in NPUs were designed with WS-SA in mind, so data rearrangement either required multiple instructions or relied on instructions with a high hardware cost. To address these issues, an instruction set aligned with the OS-SA's blocked data layout is proposed. It includes instructions such as block-broadcasting and block-rotate, which distinguish between operations inside and outside of blocks. NPUs whose VPU includes the proposed instruction set ran BERT-base 8 % faster than NPUs with a VPU designed for WS-SA.

Keywords: Data layout, NPU, Vector processing unit, Data layout mapping
Student Number: 2017-23638
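The abstract frames layout selection as a simulated-annealing search over per-layer layouts. The following Python sketch illustrates that general idea only; it is not the thesis's heuristic. The cost table `LAYER_COST`, the penalty `REARRANGE_COST`, the layout names `"A"` and `"B"`, and the two moves in `neighbor` are all hypothetical stand-ins, since the abstract does not specify the actual cost model or transition operations.

```python
import math
import random

# Hypothetical cost model (not from the thesis): each layer has a latency per
# candidate layout, and adjacent layers with different layouts pay a fixed
# rearrangement penalty. All layers are assumed to share the same layout names.
LAYER_COST = [
    {"A": 4.0, "B": 6.0},
    {"A": 5.0, "B": 2.0},
    {"A": 3.0, "B": 3.5},
    {"A": 7.0, "B": 4.0},
]
REARRANGE_COST = 1.5


def total_cost(assign):
    """Per-layer latency plus a penalty at every layout change between neighbors."""
    cost = sum(LAYER_COST[i][lay] for i, lay in enumerate(assign))
    cost += sum(REARRANGE_COST for a, b in zip(assign, assign[1:]) if a != b)
    return cost


def neighbor(assign, rng):
    """Two illustrative moves: re-pick one layer's layout, or copy a layer's
    layout to an adjacent layer (a one-step propagation)."""
    new = list(assign)
    i = rng.randrange(len(new))
    if len(new) == 1 or rng.random() < 0.5:
        new[i] = rng.choice(sorted(LAYER_COST[i]))
    else:
        j = i + 1 if i + 1 < len(new) else i - 1
        new[j] = new[i]
    return new


def anneal(initial, steps=2000, t0=2.0, alpha=0.995, seed=0):
    """Standard simulated annealing with geometric cooling; returns the best
    assignment seen and its cost."""
    rng = random.Random(seed)
    cur, cur_cost = list(initial), total_cost(initial)
    best, best_cost = list(cur), cur_cost
    temp = t0
    for _ in range(steps):
        cand = neighbor(cur, rng)
        delta = total_cost(cand) - cur_cost
        # Always accept improvements; accept regressions with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur, cur_cost = cand, cur_cost + delta
            if cur_cost < best_cost:
                best, best_cost = list(cur), cur_cost
        temp = max(temp * alpha, 1e-9)
    return best, best_cost


if __name__ == "__main__":
    start = ["A"] * len(LAYER_COST)
    layouts, cost = anneal(start)
    print(layouts, cost)
```

Because `anneal` tracks the best assignment seen so far, its result is never worse than the starting assignment under this toy cost model. A real framework would replace `total_cost` with a simulator or analytical latency model.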

Korean Abstract (English translation)

To handle the high computational load and memory bandwidth requirements of deep neural networks (DNNs), a new type of processor called the neural processing unit (NPU) was devised. NPUs commonly consist of a matrix unit that supports matrix multiplication and convolution and a vector processing unit (VPU) that handles vector and general-purpose operations. Matrix units are implemented in various ways, but the systolic array is used in many NPUs. Systolic arrays can be classified by dataflow, and the type most often adopted in recent NPUs is the weight stationary systolic array (WS-SA), which uses the weight stationary dataflow. For large systolic arrays, however, the output stationary systolic array (OS-SA), which uses the output stationary dataflow, has the advantage of raising utilization easily. The data layout, that is, how data is stored, is subject to different constraints in OS-SA than in WS-SA, and as a result an NPU containing an OS-SA cannot be driven efficiently with the NPU organization built around WS-SA.

This work takes two approaches to resolving the inefficiency caused by the data layout in OS-SA based NPUs. First, we build a layout mapping framework, a software framework that can represent the data layouts of OS-SA, and use a heuristic to optimize the problem of selecting, among the data layouts available in OS-SA, one that reduces total DNN execution time. Second, we design an instruction set so that the VPU can efficiently handle the data rearrangement that occurs frequently when the OS-SA data layout is used.

First, to select a data layout, we built a layout mapping framework that can represent the data layouts of OS-SA. Many layout mapping frameworks already exist for CPUs, GPUs, and WS-SA NPUs; internally they implement a data layout representation for expressing data layouts and layout mapping heuristics for performing the mapping. The data layout representation of existing frameworks, however, cannot express the data layouts of OS-SA. To solve this problem, we adopted the recently proposed Graphene-based data layout representation and built a layout mapping framework targeting OS-SA NPUs. Adopting this representation alone gave a 39-fold speedup on convolutional neural networks over the existing data layout representation.

Next, we designed and proposed a layout mapping heuristic that optimizes the data layout of each layer in a DNN. Existing layout mapping heuristics mainly use a propagation-based approach. This approach first sets the data layout for the operations executed on the systolic array and then propagates it so that adjacent layers have the same data layout. Its limitations are that it does not explore the several data layouts available to OS-SA, that the data layout of layers executed on the systolic array is determined passively, and that it does not optimize the point at which propagation ends. To resolve these limitations, this work proposes a new optimization method based on simulated annealing. We propose two new state transition operations suited to the layout mapping problem and add an additional propagation technique to increase convergence stability. With this method, performance on BERT-base improved by about 20 %.

Finally, we designed the VPU instruction set around the data rearrangements frequently used with OS-SA. Because the VPU exchanges data with the systolic array, it must either take data stored in the OS-SA data layout as input and operate on it, or convert data destined for the OS-SA into the OS-SA data layout. In particular, it must follow the blocked data layout that characterizes the layouts used by OS-SA. In existing NPUs, the VPU instruction set was designed for WS-SA, so data rearrangement required several instructions or the set included instructions with a high hardware cost. To solve this problem, we propose an instruction set matched to the blocked data layout of OS-SA. The proposed instruction set includes block-broadcasting and block-rotate instructions, which distinguish between the inside and the outside of a block. An NPU whose VPU includes the proposed instruction set ran BERT-base 8 % faster than an NPU with a VPU designed for WS-SA.

Keywords: Data layout, NPU, Vector processing unit, Data layout mapping
Student Number: 2017-23638
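The abstract names block-broadcasting and block-rotate instructions that distinguish between the inside and the outside of a block, but it does not give their exact semantics. The Python sketch below shows one plausible reading on a flat list of vector lanes split into equal-size blocks. The function names, the left-rotation direction, and the `block`, `shift`, `index`, and `which` parameters are assumptions made for illustration, not the proposed ISA.

```python
def rotate_within_blocks(lanes, block, shift):
    """Rotate the lanes of each block left by `shift`, independently per block."""
    assert lanes and len(lanes) % block == 0
    s = shift % block
    out = []
    for start in range(0, len(lanes), block):
        blk = lanes[start:start + block]
        out.extend(blk[s:] + blk[:s])
    return out


def rotate_blocks(lanes, block, shift):
    """Rotate whole blocks left by `shift` block positions; the order of lanes
    inside each block is unchanged."""
    assert lanes and len(lanes) % block == 0
    s = (shift * block) % len(lanes)
    return lanes[s:] + lanes[:s]


def broadcast_within_blocks(lanes, block, index):
    """Copy lane `index` of each block to every lane of that same block."""
    assert lanes and len(lanes) % block == 0 and 0 <= index < block
    out = []
    for start in range(0, len(lanes), block):
        out.extend([lanes[start + index]] * block)
    return out


def broadcast_block(lanes, block, which):
    """Copy block number `which` into every block position of the vector."""
    assert lanes and len(lanes) % block == 0
    n_blocks = len(lanes) // block
    assert 0 <= which < n_blocks
    src = lanes[which * block:(which + 1) * block]
    return src * n_blocks


if __name__ == "__main__":
    v = list(range(8))  # 8 lanes, viewed as 2 blocks of 4
    print(rotate_within_blocks(v, 4, 1))
    print(rotate_blocks(v, 4, 1))
    print(broadcast_within_blocks(v, 4, 2))
    print(broadcast_block(v, 4, 1))
```

Separating intra-block and inter-block movement in this way means each operation only needs to move data at one granularity, which is the kind of distinction the abstract describes.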

Table of Contents

Chapter 1 Introduction
Chapter 2 Background
2.1 Output stationary systolic array
2.1.1 Systolic array and Dataflow
2.1.2 Output stationary dataflow
2.1.3 Comparison to weight stationary dataflow
2.1.4 Convolution on Output Stationary Systolic Array
2.2 Data Layout and Output Stationary Systolic Array
2.2.1 NPU address and data layout
2.2.2 OS-SA Data Layout
2.3 Blocked output stationary systolic array
2.4 Target architecture
2.5 DNN Mapping Framework
Chapter 3 Design of the data layout mapping framework
3.1 Introduction
3.2 Related work
3.3 BOS-SA data layout mapping framework
3.3.1 Extended Graphene Representation
3.3.2 Data layout exploration space
3.4 Experimental results and analysis
3.4.1 Experimental setup
3.4.2 Performance analysis of a single convolution layer
3.4.3 Performance analysis of DNN networks
3.5 Discussion
3.6 Chapter conclusion
Chapter 4 Data layout mapping heuristic for reducing DNN execution latency on BOS-SA NPUs
4.1 Introduction
4.2 Related work and motivation
4.2.1 Related work
4.2.2 Motivation
4.3 Simulated annealing based data layout mapping heuristic
4.3.1 Problem formulation as simulated annealing
4.3.2 Transition operations for data layout mapping
4.3.3 Optimizing convergence speed through additional propagation
4.4 Experimental results and analysis
4.4.1 Experimental setup
4.4.2 DNN network performance comparison
4.4.3 Ablation study of the techniques
4.5 Discussion
4.6 Chapter conclusion
Chapter 5 VPU design for reducing the data rearrangement overhead of BOS-SA
5.1 Introduction
5.2 Related work
5.3 Block aware cross-lane instruction set
5.3.1 Block aware movement
5.3.2 Block aware data type conversion
5.3.3 Overall cross-lane instruction set
5.4 Experimental results and analysis
5.4.1 Experimental setup
5.4.2 Hardware implementation result
5.4.3 Microbenchmark
5.4.4 DNN benchmark
5.5 Discussion
5.6 Chapter conclusion
Chapter 6 Conclusion
Abstract

