A Super Memory Processing Unit Based on 3D Stacking and Hybrid Bonding for High-Efficiency AI Computing
Abstract
DRAM-based processing-in-memory (DRAM-PIM) integrates computational regions into main memory so that data can be processed locally, which makes computation faster and more energy-efficient. Improving system performance further, however, requires solving a critical problem: providing data processing within DRAM-PIM that is both general-purpose and sufficiently powerful. Existing DRAM-PIM implementations often have limited computational capability, either because memory cells and computational circuits must share the area of a standard DRAM package, or because the operator circuits are so heavily customized that they cannot meet the data processing demands placed on them. To address this issue, we propose a Super Memory Processing Unit (SMPU). The SMPU uses hybrid bonding to 3D-stack DRAM with many-core computational clusters, enabling high-bandwidth on-chip data transmission between the DRAM and the computational cluster over copper interconnects (0.25 TB/s per bank, 2 TB/s in aggregate for an 8-bank system) and thereby breaking through the memory wall bottleneck of existing computing architectures. At the logic computing layer, the SMPU builds a dual-channel, fine-grained computational cluster that provides flexible and ample computing power for a range of AI models, such as ResNet50 and Llama2. The SMPU uses standard DDR protocols and integrates a new controller for memory space allocation and parsing, which keeps the system compatible without any modification to host-side hardware and makes the computing power inside the memory chips easy to integrate and invoke. In addition, the SMPU provides an independent dual-channel memory-management mechanism within the memory chips, enabling simultaneous multi-channel, multi-modal AI model inference. Using FPGA simulations, we compared a CPU system equipped with an SMPU against current computing systems.
The FPGA simulation results show that, under the same computational configuration, the system with the SMPU improves ResNet50-v1.5 performance by up to 5.1× and Llama performance by up to 27.43× over the baseline system, while reducing system power consumption by 71.6% (ResNet50-v1.5) to 77.8% (Llama 7B).
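As a quick sanity check on the figures quoted above, the following minimal Python sketch reproduces the aggregate bandwidth from the per-bank figure and converts the reported power reductions into baseline-to-SMPU power ratios. The function names are illustrative and do not come from the paper, and the bandwidth calculation assumes that per-bank bandwidth scales linearly across all eight banks.

```python
# Figures taken from the abstract.
PER_BANK_BW_TBPS = 0.25   # TB/s per bank
NUM_BANKS = 8             # banks in the evaluated system


def aggregate_bandwidth(per_bank_tbps: float, banks: int) -> float:
    """Aggregate bandwidth in TB/s, assuming linear scaling across banks."""
    return per_bank_tbps * banks


def power_ratio(reduction_percent: float) -> float:
    """Baseline power divided by SMPU-system power for a given % reduction."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    print(f"aggregate bandwidth: {aggregate_bandwidth(PER_BANK_BW_TBPS, NUM_BANKS)} TB/s")
    for model, reduction in (("ResNet50-v1.5", 71.6), ("Llama 7B", 77.8)):
        print(f"{model}: baseline draws about {power_ratio(reduction):.2f}x the SMPU system's power")
```

Under these assumptions, a 71.6% reduction corresponds to the baseline drawing roughly 3.5 times the power of the SMPU system, and a 77.8% reduction to roughly 4.5 times.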