1. ONNX, ONNX developers, version 1.16.1. [Online]. Available: https://onnx.ai/, 2024
2. Layer normalization, J. L. Ba, J. R. Kiros, G. E. Hinton, [Online]. Available: https://arxiv.org/abs/1607.06450, 2016
3. Long short-term memory, S. Hochreiter, J. Schmidhuber, Neural Comput., vol. 9, no. 8, pp. 1735–1780. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735, 1997
4. Attention is all you need, A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017
5. Optimization by simulated annealing, S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, Science, vol. 220, no. 4598, pp. 671–680. [Online]. Available: http://www.jstor.org/stable/1690046, 1983
6. SCALE-Sim: Systolic CNN accelerator simulator, A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, T. Krishna, arXiv preprint arXiv:1811.02883, 2018
7. Adaptation in Natural and Artificial Systems, J. H. Holland, Ann Arbor, MI: University of Michigan Press, 1975, second edition, 1992
8. Open cell library in 15nm FreePDK technology, M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, J. Michelsen, in Proceedings of the 2015 Symposium on International Symposium on Physical Design, ser. ISPD ’15. New York, NY, USA: Association for Computing Machinery, pp. 171–178. [Online]. Available: https://doi.org/10.1145/2717764.2717783, 2015
9. Compilers: Principles, Techniques, and Tools (2nd Edition), A. V. Aho, M. S. Lam, R. Sethi, J. D. Ullman, Addison Wesley. [Online]. Available: http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/0321486811, 2006
10. A survey on compiler autotuning using machine learning, A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, C. Silvano, ACM Comput. Surv., vol. 51, no. 5. [Online]. Available: https://doi.org/10.1145/3197978, 2018
11. AI accelerator on IBM Telum processor: Industrial product, C. Lichtenau, A. Buyuktosunoglu, R. Bertran, P. Figuli, C. Jacobi, N. Papandreou, H. Pozidis, A. Saporito, A. Sica, E. Tzortzatos, in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY, USA: Association for Computing Machinery, pp. 1012–1028. [Online]. Available: https://doi.org/10.1145/3470496.3533042, 2022
12. Graphene: An IR for optimized tensor computations on GPUs, B. Hagedorn, B. Fan, H. Chen, C. Cecka, M. Garland, V. Grover, in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS 2023. New York, NY, USA: Association for Computing Machinery, pp. 302–313. [Online]. Available: https://doi.org/10.1145/3582016.3582018, 2023
13. Timeloop: A systematic approach to DNN accelerator evaluation, A. Parashar, P. Raina, Y. S. Shao, Y. H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, J. Emer, in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 304–315, 2019
14. ImageNet classification with deep convolutional neural networks, A. Krizhevsky, I. Sutskever, G. E. Hinton, Commun. ACM, vol. 60, no. 6, pp. 84–90. [Online]. Available: https://doi.org/10.1145/3065386, 2017
15. GCD2: A globally optimizing compiler for mapping DNNs to mobile DSPs, W. Niu, J. Guan, X. Shen, Y. Wang, G. Agrawal, B. Ren, in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 512–529, 2022
16. TVM: An automated end-to-end optimizing compiler for deep learning, T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, A. Krishnamurthy, in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’18. USA: USENIX Association, pp. 579–594, 2018
17. Data movement is all you need: A case study on optimizing transformers, A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, T. Hoefler, Proceedings of Machine Learning and Systems, vol. 3, pp. 711–732, 2021
18. MLPerf: An industry standard benchmark suite for machine learning performance, P. Mattson, V. J. Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, G.-Y. Wei, C.-J. Wu, IEEE Micro, vol. 40, no. 2, pp. 8–16, 2020
19. TileFlow: A framework for modeling fusion dataflow via tree-based analysis, S. Zheng, S. Chen, S. Gao, L. Jia, G. Sun, R. Wang, Y. Liang, in 2023 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1271–1288, 2023
20. Interstellar: Using Halide’s scheduling language to analyze DNN accelerators, X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, C. Kozyrakis, M. Horowitz, in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, pp. 369–383. [Online]. Available: https://doi.org/10.1145/3373376.3378514, 2020
21. Chameleon: Adaptive code optimization for expedited deep neural network compilation, B. H. Ahn, P. Pilligundla, A. Yazdanbakhsh, H. Esmaeilzadeh, in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia. OpenReview.net, 2020
22. TENET: A framework for modeling tensor dataflow based on relation-centric notation, L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, Y. Liang, in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 720–733, 2021
23. A uniform latency model for DNN accelerators with diverse architectures and dataflows, L. Mei, H. Liu, T. Wu, H. E. Sumbul, M. Verhelst, E. Beigne, in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 220–225, 2022
24. oneDNN Graph Compiler: A hybrid approach for high-performance deep learning compilation, J. Li, Z. Qin, Y. Mei, J. Cui, Y. Song, C. Chen, Y. Zhang, L. Du, X. Cheng, B. Jin, Y. Zhang, J. Ye, E. Lin, D. Lavery, in 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 460–470, 2024
25. Impact of local interconnects on timing and power in a high performance microprocessor, R. S. Shelar, M. Patyra, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 10, pp. 1623–1627, 2013
26. AutoMap: Automatic mapping of neural networks to deep learning accelerators for edge devices, X. Jin, H. Zheng, Q. Zou, Z. Zhao, M. Nie, C.-J. R. Shi, Y. Wang, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 9, pp. 2994–3006, 2023
27. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, Y.-H. Chen, T. Krishna, J. S. Emer, V. Sze, IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017
28. ALT: Breaking the wall between data layout and loop optimizations for deep learning compilation, Z. Xu, J. Xu, H. Peng, W. Wang, X. Wang, H. Wan, H. Dai, Y. Xu, H. Cheng, K. Wang, G. Chen, in Proceedings of the Eighteenth European Conference on Computer Systems, ser. EuroSys ’23. New York, NY, USA: Association for Computing Machinery, pp. 199–214. [Online]. Available: https://doi.org/10.1145/3552326.3587440, 2023
29. Batch normalization: Accelerating deep network training by reducing internal covariate shift, S. Ioffe, C. Szegedy, in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ser. ICML ’15. JMLR.org, pp. 448–456, 2015
30. MAESTRO: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings, H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, A. Parashar, IEEE Micro, vol. 40, no. 3, pp. 20–29, 2020
31. speedAI240: A 2-petaflop, 30-teraflops/W at-memory inference acceleration device with 1456 RISC-V cores, M. Snelgrove, R. Beachler, IEEE Micro, vol. 43, no. 3, pp. 58–63, 2023
32. High-performance deep-learning coprocessor integrated into x86 SoC with server-class CPUs: Industrial product, G. Henry, P. Palangpour, M. Thomson, J. S. Gardner, B. Arden, J. Donahue, K. Houck, J. Johnson, K. O’Brien, S. Petersen, B. Seroussi, T. Walker, in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 15–26, 2020
33. Characterizing and demystifying the implicit convolution algorithm on commercial matrix-multiplication accelerators, Y. Zhou, M. Yang, C. Guo, J. Leng, Y. Liang, Q. Chen, M. Guo, Y. Zhu, in 2021 IEEE International Symposium on Workload Characterization (IISWC). Los Alamitos, CA, USA: IEEE Computer Society, pp. 214–225, 2021
34. DeFiNES: Enabling fast exploration of the depth-first scheduling space for DNN accelerators through analytical modeling, L. Mei, K. Goetschalckx, A. Symons, M. Verhelst, in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 570–583, 2023
35. Rubick: A unified infrastructure for analyzing, exploring, and implementing spatial architectures via dataflow decomposition, L. Lu, Z. Luo, S. Zheng, J. Yin, J. Cong, Y. Liang, J. Yin, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 4, pp. 1177–1190, 2024