About half an hour ago, DeepSeek released the first of the five open-source libraries it has promised to unlock: FlashMLA.
FlashMLA is an efficient MLA decoding kernel developed by DeepSeek specifically for Hopper GPUs, and it is optimized for variable-length sequences. Compared with the traditional attention mechanism, MLA removes the key-value cache bottleneck during inference through low-rank joint key-value compression, and FlashMLA further improves MLA's decoding efficiency. That matters for natural language processing (NLP) tasks where input lengths (sentences, documents) vary widely. The same flexibility makes it well suited to other dynamic-length workloads: machine translation and text generation, speech recognition (audio sequences of different lengths), and time-series analysis (for example, forecasting in finance and meteorology), since in all of these the sequences are rarely uniform.
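To make the compression idea concrete, here is a minimal toy sketch in PyTorch of low-rank joint key-value compression: a single small latent is cached per token, and per-head keys and values are reconstructed from it on demand. The dimensions are made-up assumptions, and this illustrates the general technique, not DeepSeek's actual implementation.

```python
# Toy sketch of low-rank joint KV compression (the idea behind MLA).
# All dimensions are illustrative assumptions, not DeepSeek's real configuration.
import torch
import torch.nn as nn

d_model  = 4096   # hidden size (assumed)
n_heads  = 32     # attention heads (assumed)
d_head   = 128    # per-head dimension (assumed)
d_latent = 512    # compressed KV latent size, d_latent << n_heads * d_head

class LowRankKV(nn.Module):
    def __init__(self):
        super().__init__()
        # Down-projection: one small latent per token is cached instead of full K and V.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: rebuild per-head keys and values from the latent at decode time.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def compress(self, hidden):          # hidden: [batch, seq, d_model]
        return self.w_down_kv(hidden)    # cached tensor: [batch, seq, d_latent]

    def expand(self, kv_latent):         # kv_latent: [batch, seq, d_latent]
        k = self.w_up_k(kv_latent)       # [batch, seq, n_heads * d_head]
        v = self.w_up_v(kv_latent)
        return k, v

# In this toy setup the cache shrinks from 2 * n_heads * d_head = 8192 values
# per token (full K and V) to d_latent = 512 values per token.
```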
The kernel is already in production, and the technical highlights are:
BF16 support! BF16 is a compact numerical format that balances model accuracy and computational efficiency. Compared with higher-precision formats such as FP32 (32-bit floating point), BF16 reduces memory usage and speeds up computation while keeping enough accuracy for most AI tasks. This is especially useful for deploying LLMs on resource-limited hardware or scaling up to larger models.
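As a quick illustration of the memory math, here is a small PyTorch snippet: BF16 keeps FP32's 8-bit exponent (so the dynamic range is the same) but halves the storage per element.

```python
# Why BF16 halves memory relative to FP32.
import torch

x_fp32 = torch.randn(1024, 1024, dtype=torch.float32)
x_bf16 = x_fp32.to(torch.bfloat16)

print(x_fp32.element_size())  # 4 bytes per element
print(x_bf16.element_size())  # 2 bytes per element
# BF16 keeps FP32's 8-bit exponent (same range) but truncates the mantissa
# to 7 bits, trading precision for half the memory traffic and faster math
# on hardware with BF16 tensor cores.
```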
Paged KV cache (block size 64)! By splitting the cache into fixed-size blocks, FlashMLA manages memory more efficiently, especially for large-scale models and long contexts, where memory limits can otherwise bottleneck performance.
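Here is a toy sketch of how paged KV indexing with a block size of 64 typically works: sequences borrow fixed-size blocks from a shared pool, and a block table maps logical positions to physical blocks. The layout shown is an illustrative assumption, not FlashMLA's internal format.

```python
# Toy paged KV cache with a block size of 64 tokens.
# The block-table layout is an illustrative assumption.
import torch

block_size = 64      # tokens per KV block, matching FlashMLA's page size
num_blocks = 1024    # total blocks in the shared KV pool (assumed)
d_latent   = 512     # cached entry size per token (assumed)

# One shared pool of fixed-size blocks; sequences borrow blocks as they grow.
kv_pool = torch.zeros(num_blocks, block_size, d_latent, dtype=torch.bfloat16)

# Per-sequence block table: logical block index -> physical block in the pool.
block_table = {0: [3, 17, 42]}   # sequence 0 currently owns three blocks

def kv_slot(seq_id: int, token_pos: int):
    """Map a token position in a sequence to its physical (block, offset) slot."""
    logical_block = token_pos // block_size
    offset        = token_pos % block_size
    physical      = block_table[seq_id][logical_block]
    return physical, offset

blk, off = kv_slot(0, 130)   # token 130 -> 3rd logical block (physical 42), offset 2
kv_pool[blk, off] = torch.randn(d_latent, dtype=torch.bfloat16)
```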
Built with CUDA 12.6, the kernel posts excellent numbers on the H800 SXM5: up to 3000 GB/s in memory-bound configurations, for fast data access and transfer, and up to 580 TFLOPS in compute-bound configurations, for high computational throughput. These figures show that FlashMLA handles large volumes of data and heavy computation efficiently, making it particularly well suited to AI inference tasks that demand high throughput.
A 3000 GB/s memory transfer rate is crazy.
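For context, a quick back-of-the-envelope calculation from the two reported peaks gives the roofline ridge point: roughly how many FLOPs per byte a kernel needs before it stops being memory-bound on this setup.

```python
# Roofline ridge point from the two reported peaks on the H800 SXM5.
peak_compute_tflops = 580    # compute-bound figure
peak_bandwidth_gbs  = 3000   # memory-bound figure

ridge_point = peak_compute_tflops * 1e12 / (peak_bandwidth_gbs * 1e9)
print(f"~{ridge_point:.0f} FLOPs per byte")   # ~193 FLOPs/byte
# Decode-time attention with small query batches has low arithmetic intensity,
# so the memory-bound 3000 GB/s figure is the one that matters most for
# single-token generation.
```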
FlashMLA also ships with easy-to-use APIs, such as get_mla_metadata (for obtaining MLA scheduling metadata) and flash_mla_with_kvcache (FlashMLA decoding against a key-value cache), which lowers the learning curve for developers. And because it is open source, contributions from the community should keep FlashMLA improving, updated, and optimized, helping it stay at the front technologically.
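A usage sketch patterned on the repository's example: compute the scheduling metadata once per batch, then call the decode kernel per layer against the paged KV cache. The argument names, shapes, and order here are assumptions and should be checked against the released code.

```python
# Schematic FlashMLA decode loop; inputs (q_i, kvcache_i, block_table,
# cache_seqlens, dv, shapes of heads s_q/h_q/h_kv) are prepared elsewhere
# and their exact forms are assumptions, not a verified API contract.
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Scheduling metadata is computed once per batch from the cached sequence lengths.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)

# Decoding then invokes the kernel per layer against the paged KV cache.
for i in range(num_layers):
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
```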