Artificial Intelligence thread

OptimusLion

Junior Member
Registered Member
About half an hour ago, DeepSeek released the first of the five open-source libraries in its announced open-sourcing plan: FlashMLA.
This is an efficient MLA decoding kernel developed by DeepSeek specifically for Hopper GPUs, optimized for variable-length sequences. Compared with the traditional attention mechanism, MLA removes the key-value cache bottleneck during inference through low-rank joint compression of keys and values, and FlashMLA further improves MLA's decoding efficiency. That matters for natural language processing (NLP) tasks where input lengths (sentences or documents) vary widely. This flexibility gives it a significant advantage on dynamic-length data and makes it well suited to tasks such as natural language processing (machine translation, text generation), speech recognition (audio sequences of different lengths), and time-series analysis (forecasting in finance and meteorology), since in all of these applications the sequences are not uniform.
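To make the low-rank joint compression idea concrete, here is a minimal PyTorch-style sketch. The dimensions and projection names (w_down_kv, w_up_k, w_up_v) are my own illustration, not DeepSeek's actual configuration: instead of caching full per-head keys and values, only a small shared latent is cached, and K/V are reconstructed from it at attention time.

Code:
# Minimal sketch of low-rank key-value joint compression (illustrative only).
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

w_down_kv = nn.Linear(d_model, d_latent, bias=False)        # joint down-projection
w_up_k    = nn.Linear(d_latent, n_heads * d_head, bias=False)
w_up_v    = nn.Linear(d_latent, n_heads * d_head, bias=False)

h = torch.randn(1, 10, d_model)                 # hidden states for 10 tokens
c_kv = w_down_kv(h)                             # only this small latent is cached
k = w_up_k(c_kv).view(1, 10, n_heads, d_head)   # keys rebuilt at attention time
v = w_up_v(c_kv).view(1, 10, n_heads, d_head)   # values rebuilt at attention time

# Cache per token: d_latent values instead of 2 * n_heads * d_head
print(d_latent, 2 * n_heads * d_head)           # 512 vs 8192, roughly 16x smaller KV cache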
The kernel is already in production use, and the technical highlights are:
BF16 support! This is a compact numerical format that balances model accuracy and computational efficiency. Compared to higher-precision formats such as FP32 (32-bit floating point), BF16 halves memory usage and speeds up computation while keeping sufficient accuracy for most AI tasks. This is especially useful for deploying LLMs on resource-limited hardware or scaling to larger models.
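A quick illustration of the memory side of that trade-off (plain PyTorch, nothing FlashMLA-specific):

Code:
# BF16 vs FP32 storage for the same tensor.
import torch

x_fp32 = torch.randn(1024, 1024, dtype=torch.float32)
x_bf16 = x_fp32.to(torch.bfloat16)

print(x_fp32.numel() * x_fp32.element_size())   # 4194304 bytes (4 MiB)
print(x_bf16.numel() * x_bf16.element_size())   # 2097152 bytes (2 MiB), half the memory

# BF16 keeps FP32's 8-bit exponent (same dynamic range) but only 7 mantissa bits
# instead of 23, trading precision for bandwidth and capacity.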
Paged KV cache (block size of 64)! By splitting data into manageable chunks, FlashMLA is able to manage memory more efficiently, especially for large-scale models. This is particularly beneficial for LLMs where memory limitations can bottleneck performance.
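Roughly speaking, a paged KV cache works like virtual memory: each sequence keeps a block table mapping its logical 64-token blocks to physical blocks in a shared pool, so memory is allocated in fixed-size pages rather than one contiguous buffer per sequence. A toy sketch follows; the shapes and the kv_slot helper are hypothetical, and FlashMLA's internal layout may differ.

Code:
# Toy paged KV cache with a block size of 64 tokens (hypothetical shapes).
import torch

block_size, num_blocks, h_kv, d_head = 64, 1024, 1, 576
kv_pool = torch.zeros(num_blocks, block_size, h_kv, d_head, dtype=torch.bfloat16)

# Per-sequence block table: logical block index -> physical block index.
# A 200-token sequence needs ceil(200 / 64) = 4 blocks, allocated anywhere in the pool.
block_table = torch.tensor([[7, 42, 3, 511]], dtype=torch.int32)

def kv_slot(seq_block_table, token_pos):
    # Locate the physical slot holding a given token's cached KV entry.
    physical_block = seq_block_table[token_pos // block_size]
    return int(physical_block), token_pos % block_size

print(kv_slot(block_table[0], 130))   # token 130 -> physical block 3, offset 2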
With CUDA 12.6, the kernel posts excellent numbers on the H800 SXM5: up to 3000 GB/s in memory-bound configurations, for fast data access and transfer, and 580 TFLOPS in compute-bound configurations, for high computational throughput. That shows FlashMLA is very efficient at moving large amounts of data and handling heavy computation, which makes it particularly well suited to high-throughput AI inference tasks.
A 3000 GB/s memory transfer rate is crazy.
FlashMLA also provides easy-to-use APIs, such as get_mla_metadata (for obtaining MLA scheduling metadata) and flash_mla_with_kvcache (FlashMLA decoding with a key-value cache), which lowers the learning curve for developers. And it is open source, so with contributions from the community, FlashMLA should keep being improved, updated, and optimized to maintain its technological lead.
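For reference, here is a rough usage sketch built around those two functions. The tensor shapes, argument names, and argument order are my assumptions from skimming the repo, so treat it as illustrative rather than authoritative; it also needs a Hopper GPU and the flash_mla package installed.

Code:
# Hypothetical decoding-step sketch around get_mla_metadata / flash_mla_with_kvcache.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

b, s_q, h_q, h_kv, d, dv = 4, 1, 128, 1, 576, 512          # assumed sizes
block_size, num_blocks = 64, 256

cache_seqlens = torch.full((b,), 1024, dtype=torch.int32, device="cuda")
q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kv_cache = torch.randn(num_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(b * 64, dtype=torch.int32, device="cuda").view(b, 64)

# Scheduling metadata is computed once per decoding step, then reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)
out, lse = flash_mla_with_kvcache(
    q, kv_cache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)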
 

Hyper

Junior Member
Registered Member
Is the release one per week, or one per day for a week?
 

luminary

Senior Member
Registered Member
Nevertheless, I have long said that Google is one of the best-positioned companies in AI, if not the best. Their Gemini models are extremely cheap and fast, and their hardware is great for inference.

DeepSeek just needs to scale up, and if it doesn't, what's the harm? R1 is open source, so other companies can scale it up if DeepSeek can't or won't.
Google is slow but has the well-rounded talent and resources to keep progressing. Google Voice is superior to its US competitors for both English and Chinese. Gemini is free and doesn't require phone-number registration, which is a big deal.



An even bigger hack on the 4090 is rumored:
1000005476.jpg
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member

YY has introduced DeepSeek to its platform as YYDS AI digital tools


Lenovo has created AI servers and workstations for DeepSeek usage, supporting local 70B model inference.


iFlyTek is fully supporting both V3 and R1 on its app.


DeepSeek R1 full version now available for free on Dangbei AI


A Wanxing Tech subsidiary is integrating DeepSeek-R1 into its Draw Icons software and its plug-in for PPT.
 

Sinofan

Just Hatched
Registered Member
A TD Cowen report dated 21 Feb 2025 writes:

"channel checks indicate that Microsoft has
1) canceled leases in the US, totaling 'a couple of hundred MWs' with at least two private data center operators
2) has pulled back on the conversion of SOQs to leases, and
3) has re-allocated a considerable portion of its international spend to the US."

The report concludes "it points to a potential oversupply position for MSFT."

And if Microsoft is first in canceling leases and generally blowing up the "Capex to the Sky" narrative (on which the "market to the sky" narrative is built), which it would have to be since it has the biggest projected capex hockey stick of all its Mag7 peers... then everyone else will promptly follow.

The companies responsible for the very capex binge that is supposed to propel markets ever higher and justify the S&P's ludicrous 22x PE multiple were quietly cutting their losses.

The Capital Expenditure bubble - which was supposed to inject up to half a trillion dollars in new growth capital in just a few years to keep up with an exponential AI demand curve - may have just gone pop.
 

Attachments

  • Large-cap Capex By Companies.png