Journal of Semiconductors
Vol. 45, Issue 4, 040204 (2024)
Bohan Yang1,3, Jia Chen1,2, and Fengbin Tu1,2,*
Author Affiliations
  • 1Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
  • 2AI Chip Center for Emerging Smart Systems, The Hong Kong University of Science and Technology, Hong Kong, China
  • 3School of the Gifted Young, University of Science and Technology of China, Hefei 230026, China
    DOI: 10.1088/1674-4926/45/4/040204
    Bohan Yang, Jia Chen, Fengbin Tu. Towards efficient generative AI and beyond-AI computing: New trends on ISSCC 2024 machine learning accelerators[J]. Journal of Semiconductors, 2024, 45(4): 040204

    Abstract

    Apart from the Ising machine, ISSCC 2024 also features accelerators for other solver algorithms. Shim et al. from UCSB propose VIP-Sat, which solves the Boolean satisfiability (SAT) problem with a scalable digital in-memory computing dataflow and a hardware-software co-design method[28]. Ju et al. from Northwestern University accelerate real-time partial differential equation (PDE) solving on edge devices with an architecture reconfigurable between an advanced physics-informed neural network (PINN) mode for low latency and a traditional finite element method (FEM) mode for high accuracy[29].
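To ground what the FEM mode computes, here is a minimal sketch of a 1-D finite element solve for -u'' = f on [0, 1] with zero boundary values; the function name, the nodal load lumping, and the pure-Python Thomas solver are illustrative choices of ours, not details of Ju et al.'s design:

```python
def fem_poisson_1d(f, n):
    """Minimal 1-D FEM for -u'' = f on [0,1] with u(0) = u(1) = 0, using
    n linear elements. The stiffness matrix is tridiagonal with entries
    (2, -1)/h, the load vector uses nodal lumping, and the resulting
    system is solved with the Thomas algorithm."""
    h = 1.0 / n
    m = n - 1                    # number of interior nodes
    a = [-1.0 / h] * m           # sub-diagonal
    b = [2.0 / h] * m            # main diagonal
    c = [-1.0 / h] * m           # super-diagonal
    d = [h * f((i + 1) * h) for i in range(m)]  # lumped load vector
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, m):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    u = [0.0] * m
    u[-1] = d[-1] / b[-1]
    for i in range(m - 2, -1, -1):
        u[i] = (d[i] - c[i] * u[i + 1]) / b[i]
    return u  # u[i] approximates the solution at x = (i+1)*h
```

With linear elements, the stiffness matrix is tridiagonal, which is exactly the regular structure that makes FEM hardware-friendly at the cost of assembling and solving a system; the PINN mode instead trains a network to minimize the PDE residual.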

    Trend 3: DSA for embedded vision processors

    Last year, Prof. Chixiao Chen’s group from Fudan University summarized the trends for ML chips beyond CNN computing at ISSCC 2023, the "Chip Olympiad"[4]. In this survey, we take a deep look into ISSCC 2024 and observe four research trends toward efficient generative AI (ML chips for generative AI, computing-in-memory (CIM) innovation from circuits to systems) and beyond-AI computing (DSA for embedded vision processors, DSA for solver accelerators), as illustrated in Fig. 1. We believe these remarkable trends will lead to more AI software and hardware innovations from academia and industry in the near future.

    Trend 4: DSA for solver accelerators

    Trend 2: CIM innovation from circuits to systems

    Moreover, researchers are collaborating and tackling challenges from a system perspective to help CIM integrate into real computing systems. A DCIM-based neural visual-enhancement engine (NVE) is fabricated in a 3 nm process through the collaboration of MediaTek and TSMC[17]. Wang et al. from the University of Texas at Austin present Vecim, a RISC-V vector co-processor integrated with a CIM-based vector register file, using foundry SRAM cells in 65 nm CMOS for efficient high-performance computing[18]. The fusion with RISC-V also enhances the programmability of CIM-based systems.

    Since Tu et al. from Tsinghua University proposed the first FP CIM at ISSCC 2022[14], the CIM community has been actively studying this new direction to meet the growing high-precision demands of both AI inference and training, especially in the generative AI era. Wang et al. from Tsinghua University propose an FP SRAM-CIM macro based on an emerging dynamic POSIT8 data format, which exploits the low bitwidth to achieve accuracy comparable to the BF16 format[15]. Wen et al. from National Tsing Hua University present an FP ReRAM-CIM macro with a kernel-wise weight pre-alignment scheme and a rescheduled multi-bit input compression scheme, which suppress the amount of truncated data by 1.96−2.47× and reduce MAC operation cycles by 4.73×[16].
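The appeal of compact FP formats is easy to see in code. The sketch below illustrates only the standard float32-to-BF16 rounding trick (keep the top 16 bits, with round-to-nearest-even on the cut); it is our illustration of why BF16 is cheap to derive from FP32, not Wang et al.'s POSIT8 pipeline:

```python
import struct

def to_bf16_bits(x):
    """Round a Python float to bfloat16: keep the top 16 bits of the
    IEEE-754 float32 encoding, with round-to-nearest-even on the cut."""
    u32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF plus the LSB of the surviving half, then drop 16 bits.
    rounded = (u32 + 0x7FFF + ((u32 >> 16) & 1)) >> 16
    return rounded & 0xFFFF

def from_bf16_bits(bits):
    """Decode a bfloat16 bit pattern back to a Python float by padding
    the low 16 bits of a float32 with zeros."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]
```

A POSIT8 value instead spends its 8 bits adaptively between sign, regime, exponent, and fraction, which is roughly how the dynamic format in [15] approaches BF16 accuracy at half the width.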

    Hybrid CIM is another appealing direction in ISSCC 2024 to integrate the advantages of analog and digital schemes. Guo et al. from Southeast University design a lightning-like analog/digital hybrid CIM structure to capitalize on both the high energy efficiency of analog CIM (ACIM) and the robust accuracy of DCIM[12]. Wang et al. from the Institute of Microelectronics, Chinese Academy of Sciences report a hybrid flash-SRAM CIM design to support on-chip plastic learning in a 14 nm FinFET process[13].

    ISSCC 2024 sheds light on many research prototypes and commercial products following this trend. Park et al. from KAIST integrate neural radiance fields (NeRF), simultaneous localization and mapping (SLAM), and sparse mixture-of-experts (SMoE) in Space-Mate[21], a fast and low-power NeRF-SLAM edge accelerator, which handles irregular SMoE expert-loading patterns with an out-of-order MoE router, reduces the large mapping energy expense with familiar-region pruning, and exploits dual-mode sparsity with a heterogeneous coarse-grained sparse core. Ryu et al., also from KAIST, design an accelerator for vanilla-NeRF-based instant 3D modeling and real-time rendering, called NeuGPU[22]. They use segmented hashing tables with data tiling to reduce on-chip storage pressure and attention-based hybrid interpolation units to alleviate bank-conflict costs. Exploiting the similar activation characteristics in NeRF, they also compress similar values in the feature vectors of adjacent samples. Nose et al. from Renesas Electronics propose a heterogeneous processor for multi-task, real-time robot applications[23], combining vision recognition with planning and control through the cooperation of a dynamically reconfigurable processor, an AI accelerator, and an embedded CPU.
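For readers unfamiliar with hash-grid encodings, the sketch below shows the generic Instant-NGP-style spatial hash that such NeRF pipelines build on: integer grid coordinates are mixed with fixed primes and reduced modulo the feature-table size. This is a textbook illustration, not NeuGPU's segmented-hashing layout:

```python
# Generic Instant-NGP-style spatial hash: XOR the grid coordinates
# multiplied by large fixed primes, then take the result modulo the
# feature-table size. The prime constants are the commonly used ones.
PRIMES = (1, 2654435761, 805459861)  # per-dimension mixing constants

def grid_hash(x, y, z, table_size):
    """Map an integer 3D grid vertex to a slot in a fixed-size feature table."""
    h = (x * PRIMES[0]) ^ (y * PRIMES[1]) ^ (z * PRIMES[2])
    return h % table_size
```

Because many grid vertices share each table slot, adjacent lookups can collide in the same memory bank, which is the bank-conflict cost that NeuGPU's interpolation units are designed to alleviate.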


    Figure 1. (Color online) New trends on ISSCC 2024 machine learning accelerators.

    Compared to the last decade, when the convolutional neural network (CNN) dominated the research field, machine learning (ML) algorithms have reached a pivotal moment: the generative artificial intelligence (AI) era. With the emergence of large-scale foundation models[1], such as the large multimodal model (LMM) GPT-4[2] and the text-to-image generative model DALL·E[3], advanced ML accelerators must address challenging scaling problems in both compute and memory. Meanwhile, with the growing demand for general intelligence in diverse application scenarios (robotics, automobiles, digital economy, manufacturing, etc.), designing integrated smart systems with intelligent perception, processing, planning, and action capabilities is a trend for future AI platforms. Besides the AI processing engine, an integrated smart system also requires domain-specific architectures (DSA) for beyond-AI computing.

    As we move towards more autonomous and collaborative AI computing, visual perception of physical environments becomes a fundamental capability for future integrated smart systems. In 2023, Apple Vision Pro[19] reinvigorated the landscape of augmented reality (AR) and virtual reality (VR), blending digital elements with our physical surroundings (a.k.a. spatial computing). Moreover, emerging embodied AI hinges on agile visual perception and LLM-embedded processing systems, enabling a deeper understanding of our world (e.g., the Figure 01 robot[20]). These vision workloads, however, must interact directly with humans and provide fast feedback at runtime; a slow response can significantly hurt the user experience, especially in autonomous driving tasks. Edge devices also face strict constraints on device weight, which limits battery capacity, so ultra-low power consumption is required of the underlying hardware to keep these devices running continuously. As more real-time, resource-constrained edge systems support intelligent vision tasks, we observe a trend toward dedicated embedded vision processors.

    At ISSCC 2024, Dr. Jonah Alben, senior vice president of NVIDIA, delivers a plenary speech on Computing in the Era of Generative AI[5]. He shows the possibility of narrowing the gap between AI algorithms and underlying computing systems by letting AI help us design chips, as it can find more optimized architectures and circuits. In the main conference, AMD presents the MI300 series processors for generative AI and high-performance computing (HPC) workloads, with a modular chiplet package and cutting-edge HBM3 memory offering up to 192 GB capacity and 5.3 TB/s peak theoretical bandwidth[6]. Guo et al. from Tsinghua University propose the first heterogeneous CIM-based accelerator for image-generative diffusion models[7], which leverages pixel similarity between denoising iterations to apply mixed quantization. They enable bit-parallel CIM with a Booth-8 multiplier, balance integer/floating-point (INT/FP) processing latency with exponent-processing acceleration, and support FP sparsity with in-memory redundancy search. Kim et al. from KAIST target hybrid SNN-transformer inference on the edge[8]. Spiking neural networks (SNNs) excel at low power and high efficiency thanks to low-bitwidth, accumulation-only discrete operations, and can be integrated with transformers. They realize an ultra-low-power on-device inference system with hybrid multiplication/accumulation units, speculative decoding, and implicit weight generation, reducing external memory access (EMA) by 74%−81%. In addition, a special forum named Energy-Efficient AI-Computing Systems for Large-Language Models shares more practical thoughts on large language model (LLM) computing systems[9]. Georgia Institute of Technology, NVIDIA, Intel, Google, KAIST, Samsung, Axelera AI, and MediaTek introduce their latest research on LLM training and inference in both cloud and edge.
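Speculative decoding, one of the techniques in [8], can be sketched in a few lines. The greedy variant below is our own simplification: real speculative decoding verifies draft tokens against the target distribution with rejection sampling, and `target_next`/`draft_next` here are hypothetical stand-ins for model calls:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=12):
    """Toy greedy speculative decoding: a cheap draft model proposes k
    tokens, the expensive target model checks them, and the longest
    agreeing prefix is accepted (plus one corrected token on mismatch).
    `target_next`/`draft_next` map a token sequence to the next token."""
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        # Draft proposes k tokens autoregressively (cheap on real HW).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # Target verifies all k positions (one parallel pass on real HW).
        accepted = 0
        for i in range(k):
            if target_next(seq + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        seq += proposal[:accepted]
        if accepted < k:  # target supplies the corrected token
            seq.append(target_next(seq))
    return seq[:len(prompt) + max_new]
```

When draft and target agree often, each expensive target pass yields up to k tokens instead of one, which is where the latency savings come from.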

    In conclusion, we discuss four exciting ML accelerator research trends at ISSCC 2024: ML chips for generative AI, CIM innovation from circuits to systems, DSA for embedded vision processors, and DSA for solver accelerators. With the goal of efficient generative AI and beyond-AI computing, we believe future ML accelerators will realize general intelligence in diverse application scenarios.

    Since 2022, generative AI models have demonstrated remarkable capabilities in creating new data samples based on probabilistic models learned from huge datasets. By generating high-quality, coherent text, images, or even control signals, generative AI has the potential to revolutionize any field with its creative sample generation skills. To power this new era of AI platforms with high computing and memory efficiency, a trend of ML chips for generative AI can be observed across different events at ISSCC 2024, including the plenary speech, the main conference, and the forum.

    The lookup table (LUT)-based digital CIM (DCIM) scheme promises a notable enhancement in compute density over traditional single-bit-input DCIM designs. TSMC reports an advanced 3 nm-node DCIM macro with a parallel multiply-accumulation (MAC) architecture based on LUTs[10]. He et al. from Tsinghua University also propose an eDRAM-LUT-based DCIM macro for compute- and memory-intensive applications[11].
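A toy software model conveys the LUT idea (our illustration, not the TSMC or Tsinghua circuit): for a group of G weights, all 2^G partial sums are precomputed once, so each activation bit-plane costs a single table lookup instead of G per-bit multiply-adds:

```python
def lut_mac(weights, activations, act_bits=4):
    """Toy LUT-based MAC: precompute all 2^G partial sums for a group of
    G weights, then let each activation bit-plane select a table entry.
    Accumulating entries with bit-plane shifts reproduces sum(w * a)
    for unsigned activations that fit in act_bits."""
    G = len(weights)
    # Partial-sum table: entry k = sum of weights whose bit is set in k.
    lut = [sum(w for j, w in enumerate(weights) if k >> j & 1)
           for k in range(1 << G)]
    acc = 0
    for b in range(act_bits):  # loop over activation bit-planes, LSB first
        index = sum(((activations[j] >> b) & 1) << j for j in range(G))
        acc += lut[index] << b  # one lookup replaces G per-bit multiplies
    return acc
```

Real DCIM macros store the table near the adder tree and must also handle signed inputs via two's complement; this sketch assumes unsigned activations for clarity.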

    Computing-in-memory (CIM) technology, a promising way to break the memory wall of the traditional Von Neumann architecture, has gained significant popularity in the integrated circuit and computer architecture communities. As complex ML algorithms advance, new challenges for CIM design are being widely investigated. At ISSCC 2024, we notice many CIM innovations, ranging from energy-efficient circuits and floating-point (FP) support to system integration.

    The optimization problem solver is another important component of integrated smart systems, taking the role of intelligent decision-making and planning. Optimization problems across various domains, including modeling, control, and scheduling, are addressed by corresponding solver algorithms. However, because real-time systems must efficiently explore large and complex solution spaces, hardware implementations of solver algorithms face strict requirements on latency, accuracy, and robustness.

    Trend 1: ML chips for generative AI

    At ISSCC 2024, we observe an emerging trend of DSA for solver accelerators. One popular solver is the Ising machine for combinatorial optimization problems (COPs). The Ising machine’s significance lies in its ability to solve COPs of nondeterministic polynomial-time (NP) complexity with only polynomial overhead. Together with hardware-friendly dataflows, Ising machine accelerators achieve significant speedup and efficiency improvements in solving COPs. Chu et al. from National Taiwan University propose an annealing-based Ising machine processor for large-scale autonomous navigation with an integrated mapping workflow[24]. Song et al. from Peking University design an eDRAM-based continuous-time Ising machine with embedded annealing and leaked negative feedback[25]. Bae et al. from UCSB propose two chips: a scalable SRAM-based Ising macro with an enhanced chimera topology[26], and a continuous-time latch-based Ising computer using massively parallel random-number generation and replica equalization[27].
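The Ising formulation is compact enough to sketch directly. The toy annealer below is a plain software simulated-annealing loop minimizing the Ising energy E = -Σ J_ij s_i s_j; it is our illustration of the principle, not a model of any of the cited chips:

```python
import math
import random

def ising_energy(J, spins):
    """Ising energy E = -sum_{i<j} J[i][j] * s_i * s_j for spins in {-1, +1}."""
    n = len(spins)
    return -sum(J[i][j] * spins[i] * spins[j]
                for i in range(n) for j in range(i + 1, n))

def anneal(J, steps=20000, t_start=5.0, t_end=0.01, seed=0):
    """Toy simulated annealing: flip one spin at a time and accept uphill
    moves with Boltzmann probability under a geometrically cooled temperature."""
    rng = random.Random(seed)
    n = len(J)
    spins = [rng.choice([-1, 1]) for _ in range(n)]
    energy = ising_energy(J, spins)
    cool = (t_end / t_start) ** (1.0 / steps)
    t = t_start
    for _ in range(steps):
        i = rng.randrange(n)
        # Energy change of flipping spin i: 2 * s_i * (local field h_i).
        delta = 2 * spins[i] * sum(J[i][j] * spins[j] for j in range(n) if j != i)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            spins[i] = -spins[i]
            energy += delta
        t *= cool
    return spins, energy
```

For a 3-spin ferromagnet (all J_ij = 1), the annealer settles into an aligned ground state with energy -3; hardware Ising machines run this relaxation massively in parallel in analog or digital form.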

    References

    [1] R Bommasani, D A Hudson, E Adeli et al. On the opportunities and risks of foundation models. arXiv preprint, 1(2021).

    [2] J Achiam, S Adler, S Agarwal et al. GPT-4 technical report. arXiv preprint, 1(2023).

    [3] A Ramesh, M Pavlov, G Goh et al. Zero-shot text-to-image generation. International Conference on Machine Learning (ICML), 1(2021).

    [4] C Mu, J P Zheng, C X Chen. Beyond convolutional neural networks computing: New trends on ISSCC 2023 machine learning chips. J Semicond, 44, 050203(2023).

    [5] J Alben. Computing in the era of generative AI. ISSCC, 26(2024).

    [6] A Smith, E Chapman, C Patel et al. AMD Instinct™ MI300 series modular chiplet package–HPC and AI accelerator for exa-class systems. ISSCC, 490(2024).

    [7] R Q Guo, L Wang, X F Chen et al. A 28nm 74.34TFLOPS/W BF16 heterogeneous CIM-based accelerator exploiting denoising-similarity for diffusion models. ISSCC, 362(2024).

    [8] S Kim, S Kim, W Jo et al. C-Transformer: A 2.6−18.1μJ/token homogeneous DNN-transformer/spiking-transformer processor with big-little network and implicit weight generation for large language models. ISSCC, 368(2024).

    [10] H Fujiwara, H Mori, W C Zhao et al. A 3nm, 32.5TOPS/W, 55.0TOPS/mm2 and 3.78Mb/mm2 fully-digital compute-in-memory macro supporting INT12 × INT12 with a parallel-MAC architecture and foundry 6T-SRAM bit cell. ISSCC, 572(2024).

    [11] Y F He, S P Fan, X Li et al. A 28nm 2.4Mb/mm2 6.9−16.3TOPS/mm2 eDRAM-LUT-based digital-computing-in-memory macro with in-memory encoding and refreshing. ISSCC, 578(2024).

    [12] A Guo, X Chen, F Y Dong et al. A 22nm 64kb lightning-like hybrid computing-in-memory macro with a compressed adder tree and analog-storage quantizers for transformer and CNNs. ISSCC, 570(2024).

    [13] L F Wang, W Z Li, Z D Zhou et al. A flash-SRAM-ADC-fused plastic computing-in-memory macro for learning in neural networks in a standard 14nm FinFET process. ISSCC, 582(2024).

    [14] F B Tu, Y Q Wang, Z H Wu et al. A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 reconfigurable digital CIM processor with unified FP/INT pipeline and bitwise in-memory booth multiplication for cloud deep learning acceleration. ISSCC, 1(2022).

    [15] Y Wang, X L Yang, Y B Qin et al. A 28nm 83.23TFLOPS/W POSIT-based compute-in-memory macro for high-accuracy AI applications. ISSCC, 566(2024).

    [16] T H Wen, H H Hsu, W S Khwa et al. A 22nm 16Mb floating-point ReRAM compute-in-memory macro with 31.2TFLOPS/W for AI edge devices. ISSCC, 1(2024).

    [17] M E Shih, S W Hsieh, P Y Tsai et al. NVE: A 3nm 23.2TOPS/W 12b-digital-CIM-based neural engine for high-resolution visual-quality enhancement on smart devices. ISSCC, 360(2024).

    [18] Y P Wang, M T Yang, C P Lo et al. Vecim: A 289.13GOPS/W RISC-V vector co-processor with compute-in-memory vector register file for efficient high-performance computing. ISSCC, 492(2024).

    [21] G Park, S Song, H Y Sang et al. Space-Mate: A 303.5mW real-time sparse mixture-of-experts-based NeRF-SLAM processor for mobile spatial computing. ISSCC, 374(2024).

    [22] J Ryu, H Kwon, W Park et al. NeuGPU: A 18.5mJ/iter neural-graphics processing unit for instant-modeling and real-time rendering with segmented-hashing architecture. ISSCC, 372(2024).

    [23] K Nose, T Fujii, K Togawa et al. A 23.9TOPS/W @ 0.8V, 130TOPS AI accelerator with 16× performance-accelerable pruning in 14nm heterogeneous embedded MPU for real-time robot applications. ISSCC, 364(2024).

    [24] Y C Chu, Y C Lin, Y C Lo et al. A fully integrated annealing processor for large-scale autonomous navigation optimization. ISSCC, 488(2024).

    [25] J H Song, Z H Wu, X Y Tang et al. A variation-tolerant in-eDRAM continuous-time Ising machine featuring 15-level coefficients and leaked negative-feedback annealing. ISSCC, 490(2024).

    [26] J Bae, C Shim, B Kim. E-Chimera: A scalable SRAM-based Ising macro with enhanced-chimera topology for solving combinatorial optimization problems within memory. ISSCC, 286(2024).

    [27] J Bae, J Koo, C Shim et al. LISA: A 576 × 4 all-in-one replica-spins continuous-time latch-based Ising computer using massively-parallel random-number generations and replica equalizations. ISSCC, 284(2024).

    [28] C Shim, J Bae, B Kim. VIP-Sat: A Boolean satisfiability solver featuring 5 × 12 variable in-memory processing elements with 98% solvability for 50-variables 218-clauses 3-SAT problems. ISSCC, 486(2024).

    [29] Y H Ju, G Q Xu, J Gu. A 28nm physics computing unit supporting emerging physics-informed neural network and finite element method for real-time scientific computing on edge devices. ISSCC, 366(2024).
