The second release of DeepSeek Open Source Week has dropped: DeepEP!
The first open-source EP (expert parallelism) communication library for MoE model training and inference.
This library is the communication workhorse of MoE models: it provides high-throughput, low-latency all-to-all GPU kernels for the MoE dispatch and combine steps, improving throughput between GPUs and cutting latency. It also supports low-precision operations such as FP8.
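For readers new to expert parallelism, the sketch below shows the dispatch traffic pattern in plain PyTorch collectives. It is only an illustration of the all-to-all exchange that DeepEP's fused kernels accelerate; the function and variable names are mine, not DeepEP's API.

```python
# Illustrative only: the expert-parallel dispatch pattern, written with plain
# torch.distributed collectives instead of DeepEP's fused NVLink/RDMA kernels.
import torch
import torch.distributed as dist

def naive_dispatch(x: torch.Tensor, dest_rank: torch.Tensor, group=None):
    """x: [num_tokens, hidden]; dest_rank: [num_tokens] rank hosting each token's expert.
    Assumes torch.distributed is already initialized."""
    world_size = dist.get_world_size(group)

    # Sort tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    x_sorted = x[order]

    # Exchange per-rank token counts so every rank can size its receive buffer.
    send_counts = torch.bincount(dest_rank, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=group)

    # All-to-all the activations themselves: the expensive exchange that DeepEP
    # implements with bandwidth-aware kernels and optional FP8 payloads.
    recv_x = x.new_empty(int(recv_counts.sum()), x.shape[1])
    dist.all_to_all_single(
        recv_x, x_sorted,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
        group=group,
    )
    return recv_x, order, send_counts, recv_counts
```

The combine step is the mirror image: the same all-to-all runs in reverse to return expert outputs, weighted by the router scores, to the ranks that own the original tokens.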
Matching the group-limited gating algorithm proposed in the DeepSeek-V3 paper, DeepEP provides a set of kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them well suited to training and inference prefilling. They also let you control the number of SMs (streaming multiprocessors) used for communication.
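For intuition on why that pairing works, here is a rough sketch of the routing arithmetic (names are illustrative, not DeepEP code): a token crosses the slower inter-node RDMA fabric at most once per target node and then fans out over NVLink inside that node, so group-limited gating directly caps the expensive traffic.

```python
import torch

def distinct_target_nodes(topk_idx: torch.Tensor, experts_per_rank: int, ranks_per_node: int):
    """topk_idx: [num_tokens, k] expert ids chosen by the router for each token."""
    dest_rank = topk_idx // experts_per_rank   # which rank hosts each selected expert
    dest_node = dest_rank // ranks_per_node    # which node (RDMA endpoint) that rank sits on
    # Inter-node bytes scale with the number of *distinct* nodes per token
    # (exactly the quantity group-limited gating caps), not with the raw top-k count.
    return [torch.unique(row) for row in dest_node]

# Example: 2 experts per rank, 8 ranks per node. A token picking experts 3, 5 and 17
# touches ranks 1, 2 and 8, i.e. nodes 0 and 1: two RDMA transfers for three experts.
print(distinct_target_nodes(torch.tensor([[3, 5, 17]]), experts_per_rank=2, ranks_per_node=8))
```

SM number control matters for these throughput kernels because they do run on SMs; capping how many they occupy leaves the rest of the GPU free for the compute they overlap with.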
For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels that use pure RDMA to minimize latency. The library also introduces a hook-based communication-computation overlapping method that does not occupy any SM resources.
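To make the hook idea concrete, here is a minimal control-flow sketch with hypothetical stand-ins (low_latency_dispatch and attention_of_next_microbatch are placeholders, not the library's API): dispatch returns immediately, other compute runs while the NIC moves data in the background, and the hook is invoked only when the received tokens are actually needed.

```python
import torch

def low_latency_dispatch(x, topk_idx):
    # Placeholder: a real implementation would post RDMA sends via the NIC and return at once.
    def hook():
        pass  # a real hook would wait until the incoming tokens have landed in the receive buffer
    return x.clone(), topk_idx.clone(), hook

def attention_of_next_microbatch(x):
    return x * 2.0  # placeholder compute that runs while the transfer is in flight

x = torch.randn(8, 16)
topk_idx = torch.randint(0, 4, (8, 2))

recv_x, recv_idx, hook = low_latency_dispatch(x, topk_idx)  # no SMs spent on communication
overlapped = attention_of_next_microbatch(x)                # GPU compute proceeds meanwhile
hook()                                                      # block only now; received tokens are ready
```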
Note that the library currently supports only Hopper GPUs (e.g. H100, H200, H800; consumer-grade graphics cards are not supported yet).
github.com/deepseek-ai/DeepEP