DeepSeek Open Source Week Day 3: DeepGEMM.
DeepGEMM is a library designed for clean and efficient FP8 general matrix multiplication (GEMM) with the fine-grained scaling proposed in DeepSeek-V3. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs. The library is written in CUDA and requires no compilation at install time: all kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module.
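
To make "fine-grained scaling" concrete, here is a plain-PyTorch reference of the semantics, not the library's kernel. It assumes DeepSeek-V3-style granularity (one scale per 1x128 activation tile and one scale per 128x128 weight block); the exact tiling is an assumption for illustration.

```python
# Reference (unoptimized) FP8 GEMM with fine-grained scaling, for illustration only.
# Assumed granularity: per-128-channel scales for the activation (LHS) and
# per-128x128-block scales for the weight (RHS), following DeepSeek-V3.
import torch

def ref_fp8_gemm(lhs_fp8, lhs_scales, rhs_fp8, rhs_scales):
    # lhs_fp8:    [m, k]                torch.float8_e4m3fn
    # lhs_scales: [m, k // 128]         float32, one scale per 128 contiguous k elements
    # rhs_fp8:    [n, k]                torch.float8_e4m3fn
    # rhs_scales: [n // 128, k // 128]  float32, one scale per 128x128 block
    # Dequantize by broadcasting the scales back to element granularity.
    lhs = lhs_fp8.float() * lhs_scales.repeat_interleave(128, dim=1)
    rhs = rhs_fp8.float() * rhs_scales.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)
    # The real kernel performs this contraction in FP8 on tensor cores and writes BF16;
    # here we simply compute it in FP32 for clarity.
    return (lhs @ rhs.t()).bfloat16()
```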
Currently, DeepGEMM only supports NVIDIA Hopper tensor cores. To address the imprecise accumulation of FP8 tensor cores, it uses two-level accumulation (promotion) on CUDA cores. Although it borrows some concepts from CUTLASS and CuTe, it avoids heavy reliance on their templates or algebra. Instead, the library is kept concise, with a single core kernel function of roughly 300 lines of code. This makes it a clear and accessible resource for learning Hopper FP8 matrix multiplication and optimization techniques.
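
The toy sketch below illustrates the two-level accumulation (promotion) idea in plain PyTorch rather than CUDA: a reduced-precision running partial sum (bfloat16 here stands in for the limited-precision FP8 tensor-core accumulator on Hopper) is periodically promoted into a full FP32 accumulator, which is what the kernel does on CUDA cores. The block size and promotion interval are illustrative, not the kernel's actual parameters.

```python
import torch

def promoted_dot(a: torch.Tensor, b: torch.Tensor, block: int = 128, interval: int = 4) -> torch.Tensor:
    """Dot product with two-level accumulation: a reduced-precision partial sum is
    periodically promoted into a full-precision accumulator, bounding rounding error."""
    acc_full = torch.zeros((), dtype=torch.float32)      # stands in for the CUDA-core FP32 accumulator
    acc_partial = torch.zeros((), dtype=torch.bfloat16)  # stands in for the tensor-core partial accumulator
    for i, start in enumerate(range(0, a.numel(), block)):
        acc_partial = acc_partial + torch.dot(a[start:start + block], b[start:start + block]).bfloat16()
        if (i + 1) % interval == 0:
            # Promotion: flush the low-precision partial into the FP32 accumulator.
            acc_full = acc_full + acc_partial.float()
            acc_partial = torch.zeros((), dtype=torch.bfloat16)
    return acc_full + acc_partial.float()
```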

Up to 1350+ FP8 TFLOPS on Hopper GPUs

No heavy dependencies, simple as a tutorial

Fully JIT compiled

~300 lines of core logic - but still outperforms expert-tuned kernels on most matrix sizes

Supports dense layout and two MoE layouts
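
As a rough illustration of what an MoE grouped GEMM computes (reference semantics only, not the library's API or its actual contiguous/masked layouts): tokens routed to each expert are stacked along the M dimension, and each group is multiplied by its own expert weight.

```python
# Reference semantics of an M-grouped (MoE) GEMM; names and layout details are illustrative.
import torch

def ref_grouped_gemm(lhs, group_sizes, expert_weights):
    # lhs:            [sum(group_sizes), k]  activations, one group per expert stacked along M
    # group_sizes:    list[int], number of tokens routed to each expert
    # expert_weights: [num_experts, n, k]    one weight matrix per expert
    outs, start = [], 0
    for g, m_g in enumerate(group_sizes):
        outs.append(lhs[start:start + m_g] @ expert_weights[g].t())
        start += m_g
    return torch.cat(outs, dim=0)  # [sum(group_sizes), n]
```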