About half an hour ago, DeepSeek released the first of the five open-source libraries it has promised to unlock: FlashMLA.
FlashMLA is an efficient MLA decoding kernel developed by DeepSeek specifically for Hopper GPUs, and it is optimized for variable-length sequences. Compared with the traditional attention mechanism, MLA removes the key-value cache bottleneck during inference through low-rank joint key-value compression, and FlashMLA further improves MLA's decoding efficiency. That matters for natural language processing (NLP) tasks where input lengths (sentences, documents) vary widely. The same flexibility makes it well suited to other dynamic-length workloads: machine translation and text generation, speech recognition (audio sequences of different lengths), and time-series analysis (for example, forecasting in finance and meteorology), since in all of these the sequences are rarely uniform.
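To make the compression idea concrete, here is a minimal toy sketch in PyTorch of low-rank joint key-value compression: a single small latent is cached per token, and per-head keys and values are reconstructed from it on demand. The dimensions are made-up assumptions, and this illustrates the general technique, not DeepSeek's actual implementation.

```python
# Toy sketch of low-rank joint KV compression (the idea behind MLA).
# All dimensions are illustrative assumptions, not DeepSeek's real configuration.
import torch
import torch.nn as nn

d_model  = 4096   # hidden size (assumed)
n_heads  = 32     # attention heads (assumed)
d_head   = 128    # per-head dimension (assumed)
d_latent = 512    # compressed KV latent size, d_latent << n_heads * d_head

class LowRankKV(nn.Module):
    def __init__(self):
        super().__init__()
        # Down-projection: one small latent per token is cached instead of full K and V.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: rebuild per-head keys and values from the latent at decode time.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def compress(self, hidden):          # hidden: [batch, seq, d_model]
        return self.w_down_kv(hidden)    # cached tensor: [batch, seq, d_latent]

    def expand(self, kv_latent):         # kv_latent: [batch, seq, d_latent]
        k = self.w_up_k(kv_latent)       # [batch, seq, n_heads * d_head]
        v = self.w_up_v(kv_latent)
        return k, v

# In this toy setup the cache shrinks from 2 * n_heads * d_head = 8192 values
# per token (full K and V) to d_latent = 512 values per token.
```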
The kernel is already in production, and the technical highlights are:
BF16 support! BF16 is a compact numerical format that balances model accuracy and computational efficiency. Compared with higher-precision formats such as FP32 (32-bit floating point), BF16 reduces memory usage and speeds up computation while keeping enough accuracy for most AI tasks. This is especially useful for deploying LLMs on resource-limited hardware or scaling up to larger models.
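As a quick illustration of the memory math, here is a small PyTorch snippet: BF16 keeps FP32's 8-bit exponent (so the dynamic range is the same) but halves the storage per element.

```python
# Why BF16 halves memory relative to FP32.
import torch

x_fp32 = torch.randn(1024, 1024, dtype=torch.float32)
x_bf16 = x_fp32.to(torch.bfloat16)

print(x_fp32.element_size())  # 4 bytes per element
print(x_bf16.element_size())  # 2 bytes per element
# BF16 keeps FP32's 8-bit exponent (same range) but truncates the mantissa
# to 7 bits, trading precision for half the memory traffic and faster math
# on hardware with BF16 tensor cores.
```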
Paged KV cache (block size 64)! By splitting the cache into fixed-size blocks, FlashMLA manages memory more efficiently, especially for large-scale models and long contexts, where memory limits can otherwise bottleneck performance.
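Here is a toy sketch of how paged KV indexing with a block size of 64 typically works: sequences borrow fixed-size blocks from a shared pool, and a block table maps logical positions to physical blocks. The layout shown is an illustrative assumption, not FlashMLA's internal format.

```python
# Toy paged KV cache with a block size of 64 tokens.
# The block-table layout is an illustrative assumption.
import torch

block_size = 64      # tokens per KV block, matching FlashMLA's page size
num_blocks = 1024    # total blocks in the shared KV pool (assumed)
d_latent   = 512     # cached entry size per token (assumed)

# One shared pool of fixed-size blocks; sequences borrow blocks as they grow.
kv_pool = torch.zeros(num_blocks, block_size, d_latent, dtype=torch.bfloat16)

# Per-sequence block table: logical block index -> physical block in the pool.
block_table = {0: [3, 17, 42]}   # sequence 0 currently owns three blocks

def kv_slot(seq_id: int, token_pos: int):
    """Map a token position in a sequence to its physical (block, offset) slot."""
    logical_block = token_pos // block_size
    offset        = token_pos % block_size
    physical      = block_table[seq_id][logical_block]
    return physical, offset

blk, off = kv_slot(0, 130)   # token 130 -> 3rd logical block (physical 42), offset 2
kv_pool[blk, off] = torch.randn(d_latent, dtype=torch.bfloat16)
```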
Built with CUDA 12.6, the kernel posts excellent numbers on the H800 SXM5: up to 3000 GB/s in memory-bound configurations, for fast data access and transfer, and up to 580 TFLOPS in compute-bound configurations, for high computational throughput. These figures show that FlashMLA handles large volumes of data and heavy computation efficiently, making it particularly well suited to AI inference tasks that demand high throughput.
A 3000 GB/s memory transfer rate is crazy.
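For context, a quick back-of-the-envelope calculation from the two reported peaks gives the roofline ridge point: roughly how many FLOPs per byte a kernel needs before it stops being memory-bound on this setup.

```python
# Roofline ridge point from the two reported peaks on the H800 SXM5.
peak_compute_tflops = 580    # compute-bound figure
peak_bandwidth_gbs  = 3000   # memory-bound figure

ridge_point = peak_compute_tflops * 1e12 / (peak_bandwidth_gbs * 1e9)
print(f"~{ridge_point:.0f} FLOPs per byte")   # ~193 FLOPs/byte
# Decode-time attention with small query batches has low arithmetic intensity,
# so the memory-bound 3000 GB/s figure is the one that matters most for
# single-token generation.
```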
FlashMLA also ships with easy-to-use APIs, such as get_mla_metadata (for obtaining MLA scheduling metadata) and flash_mla_with_kvcache (FlashMLA decoding against a key-value cache), which lowers the learning curve for developers. And because it is open source, contributions from the community should keep FlashMLA improving, updated, and optimized, helping it stay at the front technologically.
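A usage sketch patterned on the repository's example: compute the scheduling metadata once per batch, then call the decode kernel per layer against the paged KV cache. The argument names, shapes, and order here are assumptions and should be checked against the released code.

```python
# Schematic FlashMLA decode loop; inputs (q_i, kvcache_i, block_table,
# cache_seqlens, dv, shapes of heads s_q/h_q/h_kv) are prepared elsewhere
# and their exact forms are assumptions, not a verified API contract.
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Scheduling metadata is computed once per batch from the cached sequence lengths.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)

# Decoding then invokes the kernel per layer against the paged KV cache.
for i in range(num_layers):
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
```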