New DeepSeek data format
This is something that went a bit under the radar, but IMHO it is a very interesting detail.
What does it mean?
A model parameter/weight is a number, no more, no less. DeepSeek has 671B parameters, which means it works by processing input across a huge number of big tables called matrices, each one containing millions of parameters / numbers.
Now, how does a computer represent a number? It can use a 16-bit format like fp16, where each number is stored in 16 bits, or an fp8, where each number is stored in 8 bits.
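To get a feel for what the bit width means in practice, here is a rough back-of-the-envelope calculation (just a sketch, assuming one stored value per parameter and ignoring any overhead):

```python
# Approximate memory needed just to store the weights of a 671B-parameter model
params = 671e9           # number of parameters (671 billion)
fp16_bytes = params * 2  # fp16: 16 bits = 2 bytes per parameter
fp8_bytes  = params * 1  # fp8:   8 bits = 1 byte per parameter

print(f"fp16: {fp16_bytes / 1e12:.2f} TB")  # ~1.34 TB
print(f"fp8:  {fp8_bytes  / 1e12:.2f} TB")  # ~0.67 TB
```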
It can store the numbers as integers, like in int8, or as floating point numbers, as in fp8. Floating point means that a number N is represented as a sign times a fractional part (the mantissa) times a power of 2:
N = s * m * 2^e, where e = exponent, m = mantissa, s = sign (-1 or 1)
[Attachment 158942: diagram of the FP8 E4M3 bit layout]
In the picture above the orange part stores the exponent and the green part the mantissa, so E4M3 means 4 bits for the exponent and 3 for the mantissa, plus 1 (the blue one) for the sign, for a total of 8 bits -> FP8
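To make the E4M3 layout concrete, here is a small decoder sketch in Python (simplified: it handles the normal and subnormal cases with the usual bias of 7 and ignores the special NaN encoding):

```python
def decode_e4m3(byte: int) -> float:
    """Decode an FP8 E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits."""
    s = (byte >> 7) & 0x1   # sign bit (the blue one)
    e = (byte >> 3) & 0xF   # 4 exponent bits (orange)
    m = byte & 0x7          # 3 mantissa bits (green)
    sign = -1.0 if s else 1.0
    if e == 0:              # subnormal: no implicit leading 1
        return sign * (m / 8) * 2 ** -6
    return sign * (1 + m / 8) * 2 ** (e - 7)   # N = s * m * 2^e, with bias 7

print(decode_e4m3(0b0_1000_000))  # sign 0, exponent 8, mantissa 0 -> 2.0
print(decode_e4m3(0b0_1001_100))  # sign 0, exponent 9, mantissa 4 -> 6.0
```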
So how does the UE8M0 FP8 used by DeepSeek work?
It is an 8-bit floating point number with all 8 bits used for the exponent and no mantissa (and no sign bit either, which is what the U for "unsigned" means). It is as if you can only represent powers of 2, so for instance:
exponent 1 (01) -> corresponds to 2^1 = 2
exponent 2 (10) -> corresponds to 2^2 = 4
exponent 3 (11) -> corresponds to 2^3 = 8
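In code the idea looks roughly like this (a sketch following the simplified convention of the examples above, where the stored byte is used directly as the exponent; the real E8M0 format typically also applies an exponent bias, omitted here):

```python
def ue8m0_decode(exp_byte: int) -> float:
    """A UE8M0 value is just 2 raised to the stored exponent: no sign, no mantissa."""
    return 2.0 ** exp_byte

print(ue8m0_decode(0b01))  # exponent 1 -> 2
print(ue8m0_decode(0b10))  # exponent 2 -> 4
print(ue8m0_decode(0b11))  # exponent 3 -> 8
```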
Now suppose you want to perform a multiplication:
2 * 4 = 8
In this UE8M0 format, 2 corresponds to exponent 1, 4 corresponds to exponent 2 and 8 corresponds to exponent 3. So 2 * 4 = 8 corresponds to summing the exponents: 1 + 2 = 3
In this format multiplication can be implemented with a sum, and instead of the costly hardware multiply circuit a much simpler adder circuit can be used!
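Here is the same trick in miniature (again a sketch using the simplified no-bias convention from above):

```python
def ue8m0_mul(a_exp: int, b_exp: int) -> int:
    """Multiply two UE8M0 values by simply adding their stored exponents."""
    return a_exp + b_exp           # an adder instead of a multiplier circuit

a, b = 1, 2                        # exponents for 2 (= 2^1) and 4 (= 2^2)
c = ue8m0_mul(a, b)                # 1 + 2 = 3
print(2.0 ** c)                    # 2^3 = 8, i.e. 2 * 4
```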
The above example is not an isolated case: this is called a logarithmic number system and it relies on the property that log(a*b) = log(a) + log(b)
Because in these models the biggest operation by far is matrix multiplication, this trick could simplify the hardware a lot.
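As a toy illustration of how this plays out in a matrix multiplication (a sketch only: the element-wise products become exponent additions, but the accumulation along each row/column still has to be done on the decoded values, since a logarithmic number system only turns multiplies into adds, not adds into something simpler):

```python
# Matrices stored as UE8M0-style exponents, i.e. each entry represents 2^exponent
A = [[1, 2],      # decoded values [[2, 4],
     [0, 3]]      #                 [1, 8]]
B = [[2, 0],      # decoded values [[4, 1],
     [1, 1]]      #                 [2, 2]]

n = len(A)
C = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        for k in range(n):
            # each multiplication = adding exponents; the running sum uses real values
            C[i][j] += 2.0 ** (A[i][k] + B[k][j])

print(C)  # [[16.0, 10.0], [20.0, 17.0]]
```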