Artificial Intelligence thread

Hyper

Junior Member
Registered Member
No. DeepSeek has been well-known in the AI space, especially in the open-source community.
Afaik the only well-known Chinese LLMs in the West were Qwen and DeepSeek

There is no way Huawei didn't know about DeepSeek unless they were really incompetent
DeepSeek was more popular in the West than in China. It was certainly discussed more on Reddit and YCombinator than here.
 

antwerpery

Junior Member
Registered Member
No. DeepSeek has been well-known in the AI space, especially in the open-source community.
Afaik the only well-known Chinese LLMs in the West were Qwen and DeepSeek

There is no way Huawei didn't know about DeepSeek unless they were really incompetent
Like I said, DeepSeek probably didn't reach out, and they probably weren't famous enough until recently for Huawei to contact them without prompting. Still not a good look for Huawei, but better than DeepSeek reaching out and not getting a response back.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member

Groq, which makes very competitive inference chips, now supports the DeepSeek R1 distilled 70B version.

Very interesting that they are only doing this with the Llama version. It's as if they are not comfortable using a fully Chinese model, so they took the best distilled American one.

Still, getting it out there is good.

The world does not run on Nvidia.
 

vincent

Grumpy Old Man
Staff member
Moderator - World Affairs
……..
This will simultaneously show you a few important things:
  • One, this model is absolutely legit. There is a lot of BS that goes on with AI benchmarks, which are routinely gamed so that models appear to perform great on the benchmarks but then suck in real world tests. Google is certainly the worst offender in this regard, constantly crowing about how amazing their LLMs are, when they are so awful in any real world test that they can't even reliably accomplish the simplest possible tasks, let alone challenging coding tasks. These DeepSeek models are not like that— the responses are coherent, compelling, and absolutely on the same level as those from OpenAI and Anthropic.
  • Two, that DeepSeek has made profound advancements not just in model quality, but more importantly in model training and inference efficiency. By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way. By some measurements, over ~45x more efficiently than other leading-edge models. DeepSeek claims that the complete cost to train DeepSeek-V3 was just over $5mm. That is absolutely nothing by the standards of OpenAI, Anthropic, etc., which were well into the $100mm+ level for training costs for a single model as early as 2024.

  • A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process……DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network (a rough sketch of the tiling idea follows this list).
  • Another major breakthrough is their multi-token prediction system.
  • One of their most innovative developments is what they call Multi-head Latent Attention (MLA).
  • They also made major advances in GPU communication efficiency through their DualPipe algorithm and custom communication kernels.
  • Another very smart thing they did is to use what is known as a Mixture-of-Experts (MOE) Transformer architecture, but with key innovations around load balancing.
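
To make the FP8 bullet more concrete, here is a rough sketch of the tiling idea in plain NumPy. This is my own toy illustration, not DeepSeek's code: the 1x128 activation tiles, 128x128 weight blocks, and the E4M3 maximum of 448 follow what the DeepSeek-V3 report describes as far as I can tell, while the tensor sizes and the clamp-only "cast" are made up just to keep the example runnable.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def quantize_tiles(x, tile_shape):
    """Scale each tile of x so its max |value| fits the FP8 range,
    keeping one higher-precision scale factor per tile.
    Assumes the tensor dimensions divide evenly into tiles."""
    th, tw = tile_shape
    h, w = x.shape
    q = np.empty_like(x, dtype=np.float32)
    scales = np.empty((h // th, w // tw), dtype=np.float32)
    for i in range(0, h, th):
        for j in range(0, w, tw):
            tile = x[i:i + th, j:j + tw]
            scale = np.abs(tile).max() / FP8_E4M3_MAX + 1e-12
            # Real FP8 training casts (tile / scale) to an 8-bit float here;
            # we only clamp to the representable range to keep the sketch simple.
            q[i:i + th, j:j + tw] = np.clip(tile / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
            scales[i // th, j // tw] = scale
    return q, scales  # dequantize later as q * scale, in higher precision

rng = np.random.default_rng(0)
act = rng.standard_normal((4, 256)).astype(np.float32)     # activations
wgt = rng.standard_normal((256, 256)).astype(np.float32)   # weights

act_q, act_scales = quantize_tiles(act, (1, 128))    # per-token 1x128 tiles
wgt_q, wgt_scales = quantize_tiles(wgt, (128, 128))  # 128x128 blocks
```

The point is that each tile or block carries its own higher-precision scale factor, so an outlier only costs precision inside its own tile instead of forcing the whole tensor onto one coarse grid.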
……..
The real advantage of this approach is that it allows the model to contain a huge amount of knowledge without being very unwieldy, because even though the aggregate number of parameters is high across all the experts, only a small subset of these parameters is "active" at any given time, which means that you only need to store this small subset of weights in VRAM in order to do inference.
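
For anyone who hasn't looked at MoE layers before, here is a minimal sketch of the routing idea in plain NumPy. Again, this is my own toy example, not DeepSeek-V3's actual architecture: the expert count, dimensions, and top-k value are invented, and the load-balancing bias is only a stand-in for the kind of balancing tweak their report describes.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K = 8, 2          # toy sizes, not DeepSeek-V3's real ones
D_MODEL, D_FF = 16, 32

# One tiny feed-forward "expert" per slot, plus a gating matrix.
experts = [(rng.standard_normal((D_MODEL, D_FF)) * 0.1,
            rng.standard_normal((D_FF, D_MODEL)) * 0.1) for _ in range(N_EXPERTS)]
gate_W = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1
expert_bias = np.zeros(N_EXPERTS)  # balance bias, nudged up/down for under/over-used experts

def moe_layer(x):                  # x: (D_MODEL,) for a single token
    scores = x @ gate_W
    # The bias only influences which experts are picked, not how they are weighted.
    chosen = np.argsort(scores + expert_bias)[-TOP_K:]
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w1, w2 = experts[idx]      # only these TOP_K experts do any work
        out += w * (np.maximum(x @ w1, 0) @ w2)
    return out

token = rng.standard_normal(D_MODEL)
print(moe_layer(token).shape)      # (16,) -- but only 2 of the 8 experts were computed
```

In the real model the gate and the balancing bias are learned or adjusted during training; the sketch just shows that only the selected experts' weights participate in the computation for a given token.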
……..
Beyond what has already been described, the technical papers mention several other key optimizations. These include their extremely memory-efficient training framework that avoids tensor parallelism, recomputes certain operations during backpropagation instead of storing them, and shares parameters between the main model and auxiliary prediction modules. The sum total of all these innovations, when layered together, has led to the ~45x efficiency improvement numbers that have been tossed around online, and I am perfectly willing to believe these are in the right ballpark.
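
On the recomputation point specifically, here is a bare-bones sketch of the memory-for-compute trade (my own toy example, not from the papers): the forward pass keeps only the layer input, and the backward pass rebuilds the intermediate activation when it needs it rather than having cached it.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal((4, 8))        # layer input -- this IS kept
W1 = rng.standard_normal((8, 16)) * 0.1
W2 = rng.standard_normal((16, 8)) * 0.1

# Forward pass: the hidden activation h is used and then thrown away.
y = np.maximum(x @ W1, 0) @ W2

# Backward pass (w.r.t. W2 only, with a dummy upstream gradient):
grad_y = np.ones_like(y)
h = np.maximum(x @ W1, 0)               # recompute h instead of having stored it
grad_W2 = h.T @ grad_y                  # same gradient, fewer saved activations
```

Frameworks expose the same trick as activation checkpointing (for example torch.utils.checkpoint in PyTorch).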
………
Before moving on, I'd be remiss if I didn't mention that many people are speculating that DeepSeek is simply lying about the number of GPUs and GPU hours spent training these models because they actually possess far more H100s than they are supposed to have given the export restrictions on these cards, and they don't want to cause trouble for themselves or hurt their chances of acquiring more of these cards. While it's certainly possible, I think it's more likely that they are telling the truth, and that they have simply been able to achieve these incredible results by being extremely clever and creative in their approach to training and inference. They explain how they are doing things, and I suspect that it's only a matter of time before their results are widely replicated and confirmed by other researchers at various other labs.
 

Overbom

Brigadier
Registered Member

Groq, which makes very competitive inference chips, now supports the DeepSeek R1 distilled 70B version.

Very interesting that they are only doing this with the Llama version. It's as if they are not comfortable using a fully Chinese model, so they took the best distilled American one.

Still, getting it out there is good.

The world does not run on Nvidia.
Saw some insanely fast times for Groq running it.

YouTuber Matt Wolfe, using Groq, got a 275 t/s inference speed

Whereas the DeepSeek API speed was 15 t/s (when I checked a few days ago on OpenRouter). Edit: Didn't remember correctly, that speed was for the full R1 model

And I agree, I think inference is where Nvidia is much more open to disruption. A lot less of a moat there
 

Overbom

Brigadier
Registered Member
Qianwen just released Qwen2.5-Max! The scores are higher than DeepSeek-V3 and GPT-4o-0806! And it is a MoE model! It feels like domestic MoE models have completely taken off.

However, it is probably a closed-source model; only API and web access are provided.

View attachment 144588


Looks to comprehensively beat DeepSeek V3. If they make similar advancements (I assume they are already working on this) on their reasoning model based on this...

Would be nice to know how big the model is
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
Alibaba just launched a new Qwen model. Qwen2.5-Max.

Looks to comprehensively beat DeepSeek V3. If they make similar advancements (I assume they are already working on this) on their reasoning model based on this...

Would be nice to know how big the model is
View attachment 144589
It is funny to see that the newer Chinese open-source models now compare themselves to DeepSeek first, before Llama and GPT-4o.

Guys, show some self-confidence. It's absolutely pathetic of Chinese culture to only be proud of something once it's validated in Western countries. The level of discourse on Chinese social media around DeepSeek in the past two days is several magnitudes higher than even last week.
 