Artificial Intelligence thread

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member

9dashline

Captain
Registered Member
Seems like they already made it available on the DeepSeek app (.apk, Android, etc.)

This is the first open-weight reasoning model to solve the original o1 cipher problem; nothing before it even got close... not QwQ, not the previous DS V3, and not R1 Lite Preview.

"""
oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"""
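For anyone curious how the cipher itself works, here is a minimal decoder sketch. It assumes the pair-averaging rule (each pair of letters maps to the letter at the average of their alphabet positions), which is what the example mapping above is consistent with; it reproduces "think step by step" from the worked example:

```python
# Toy decoder for the o1-style cipher, assuming the pair-averaging rule:
# each pair of letters decodes to the letter whose alphabet position
# (a=1 ... z=26) is the average of the pair's positions.
def decode(ciphertext: str) -> str:
    words = []
    for word in ciphertext.split():
        letters = []
        for a, b in zip(word[::2], word[1::2]):         # take letters two at a time
            avg = ((ord(a) - 96) + (ord(b) - 96)) // 2  # average alphabet position
            letters.append(chr(avg + 96))
        words.append("".join(letters))
    return " ".join(words)

# The worked example from the prompt above:
print(decode("oyfjdnisdr rtqwainr acxz mynzbhhx"))   # -> "think step by step"
```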
 

Overbom

Brigadier
Registered Member
With this new DeepSeek release, Anthropic is over if they don't release a new SOTA model.

OpenAI is under more pressure than ever before; they had better put out some more fake benchmarks.

For Meta, Zuck has some explaining to do to his investors about why he burns billions on what DeepSeek does at a fraction of the cost.

And as for Google, tbh I believe they have the strongest foundational research team in AI, so I'm not too worried about them.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
With this new DeepSeek release, Anthropic is over if they don't release a new SOTA model.

OpenAI is under more pressure than ever before; they had better put out some more fake benchmarks.

For Meta, Zuck has some explaining to do to his investors about why he burns billions on what DeepSeek does at a fraction of the cost.

And as for Google, tbh I believe they have the strongest foundational research team in AI, so I'm not too worried about them.
No, you are overanalyzing things. Anthropic has amazing backing.

And beyond that, America is capable of copium at a level nobody here can imagine. We can just pretend that China and DeepSeek don't exist.

And we will keep shouting the loudest so that the rest of the "free world" will continue to buy our bullshit.

Btw, I'm only half joking about this. As long as American propaganda is alive and well, the idea of America being ahead in AI will keep getting pumped out there.

DeepSeek is now very famous in the AI community, but how many people in the investor community know about it?

And what about all the other AI companies in China that have developed great stuff? Who knows about the work ByteDance is doing? And ByteDance is huge.
 

Topazchen

Junior Member
Registered Member
With this new DeepSeek release, Anthropic is over if they don't release a new SOTA model.

OpenAI is under more pressure than ever before; they had better put out some more fake benchmarks.

For Meta, Zuck has some explaining to do to his investors about why he burns billions on what DeepSeek does at a fraction of the cost.

And as for Google, tbh I believe they have the strongest foundational research team in AI, so I'm not too worried about them.
Alibaba had better cook up something amazing, or they risk becoming the Anthropic of open source
 

Overbom

Brigadier
Registered Member
No, you are overanalyzing things. Anthropic has amazing backing.

And beyond that, America is capable of copium at a level nobody here can imagine. We can just pretend that China and DeepSeek don't exist.

And we will keep shouting the loudest so that the rest of the "free world" will continue to buy our bullshit.

Btw, I'm only half joking about this. As long as American propaganda is alive and well, the idea of America being ahead in AI will keep getting pumped out there.

DeepSeek is now very famous in the AI community, but how many people in the investor community know about it?

And what about all the other AI companies in China that have developed great stuff? Who knows about the work ByteDance is doing? And ByteDance is huge.
In prod, everyone (that I know) uses OpenAI (SOTA), Anthropic (SOTA or closest to SOTA), and Gemini (cheap, but still not a bad model for third-model validation).

IMO DeepSeek slots in nicely for the Anthropic and even the OpenAI use cases (nobody is crazy enough to use o1 for volume work). So we could use the newest DeepSeek model plus one of its distilled models (the Qwen one, probably).
Yes, I know the Americans have plenty of copium, but in business nobody cares about that. We just want reliable and cheap models. And absent any security regulation, I see no reason why DeepSeek won't be used. OpenRouter time..

Small note, though: I am talking like this based on benchmarks. If my testing shows that in real-world scenarios the DeepSeek models aren't far behind their benchmarks, then I will push heavily into using them, both in a business and a personal capacity.
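Since OpenRouter exposes an OpenAI-compatible endpoint, swapping DeepSeek into an existing pipeline is mostly a config change. A minimal sketch, assuming the standard OpenAI Python client; the model slug ("deepseek/deepseek-r1") and the OPENROUTER_API_KEY variable are placeholders to check against OpenRouter's docs:

```python
# Minimal sketch of calling DeepSeek-R1 through OpenRouter's
# OpenAI-compatible API. The model slug and the env var name are
# assumptions; check OpenRouter's model list and docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek/deepseek-r1",              # assumed slug; swap for a distilled Qwen variant if needed
    messages=[{"role": "user", "content": "What is the capital of the US?"}],
)
print(response.choices[0].message.content)
```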
 

european_guy

Junior Member
Registered Member
DeepSeek R1 weights released...

This is like dropping an atomic warhead on OpenAI

Mother of God this one is stronk... beats o1 medium....

I have read the paper [link].

It is incredibly interesting, especially the part about RL (reinforcement learning), which is what enables the "reasoning". RL is a vast and active research field, and researchers are continuously inventing and testing new RL recipes.

The DeepSeek guys are the first ones with a SOTA reasoning model to openly show their recipe.

Their recipe is called Group Relative Policy Optimization (GRPO). It is their own independent invention, an improvement on [link] invented about 1.5 years ago, which is itself an improvement over [link] (2017).

RL improves a model by forcing it to maximize an objective. It is not "next-token prediction" but something quite a bit more complex. In the case of this new model, the objective is:


\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q\sim P(Q),\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(O\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},1-\varepsilon,1+\varepsilon\right)A_i\right)-\beta\,D_{\mathrm{KL}}\!\left(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\right)\right)\right]
\]
This seems a bit intimidating, but it actually is not.

I will try to explain it in simple terms, because what we are witnessing here is IMHO the beginning of something big.

To teach their model to reason, they do the following (a toy numerical sketch of the whole recipe follows this walkthrough):

1. They sample G answers for each question (the "group" in the formula above): they ask the model the same question G times and get G answers, a different one each time.

2. They compute the probability of each answer as the product of the probabilities of each token (word) given the previous ones. For instance, remembering that LLMs generate one token at a time, if the question plus the current partially generated answer is "What is the capital of US? The capital of ", they read and store the probability of generating the "US" token (the probability of each emitted token is always computed by the LLM as part of its generation process). They continue until the end of the "question + answer" sentence.

3. They reward the answers, i.e. they verify whether each answer is correct or not. Assume the reward is 1 for a correct answer and 0 for an incorrect one, and that the LLM generates 10 responses, 3 correct and 7 wrong: the mean reward will be 0.3 (i.e. 3/10).

4. They compute the advantage (eq. 3), which is simply each answer's reward minus the mean reward, divided by the standard deviation: A_i = (r_i − mean(r)) / std(r) (you can ignore the std deviation in the denominator). So correct answers will have an advantage of 1 − 0.3 = 0.7, while wrong answers will have an advantage of 0 − 0.3 = −0.3 (the negative sign here is crucial!).

5. They compute the objective to maximize as the sum of terms like the one below (ignore the clipping and the final D_KL term; they are only needed to stabilize training):

\[
\frac{1}{G}\sum_{i=1}^{G}\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,A_i
\]

Here the π terms in the numerator and denominator are the new, optimized LLM and the old LLM from which the training epoch starts. Let's assume for a moment that they are (nearly) the same, so you can read the numerator π_θ(o_i|q) as simply the probability that the LLM, given the question q ("What is the capital of US?"), produces answer o_i (our i-th answer out of the 10 we got before). Crucially, that term is multiplied by the advantage A_i (i.e. 0.7 if the answer is correct, −0.3 if it is wrong). The leading 1/G followed by the big Σ just means summing one such term per answer, 10 in total, and dividing by 10 (i.e. taking the average).

So the quantity that LLM training has to maximize looks like: 0.7·prob(answer 1) − 0.3·prob(answer 2) + ... + 0.7·prob(answer 10).

We can see immediately that any change in the LLM weights that raises the probability of a correct answer increases the objective, while any change that raises the probability of a wrong answer decreases it (because of the negative advantage coefficient).

The RL training simply asks thousands of questions, samples many more answers, computes the term above, and updates the LLM weights accordingly; slowly but surely, it seems that... it works!

One critical observation is that only the full answer is rewarded, not every token of it. IOW, the LLM is free to think whatever it wants as long as the result is correct; there is no external constraint on the thinking process.
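To make the walkthrough concrete, here is the promised toy numerical sketch of one GRPO-style objective value for the 3-correct-out-of-10 example, with the clipping and KL terms left out as above. The sampled log-probs and the 0/1 verifier are made up for illustration; this is just the arithmetic of the recipe, not DeepSeek's actual implementation:

```python
# Toy numerical sketch of steps 1-5 above: one GRPO-style objective value,
# without the clipping and KL terms. The sampled answers, their per-token
# log-probs, and the 0/1 verifier are all made up for illustration.
import math
import random

random.seed(0)
G = 10  # group size: answers sampled for the same question (step 1)

# Step 2: an answer's probability is the product of its per-token probabilities;
# we fake per-token log-probs and sum them (the same thing in log space).
answers_token_logprobs = [
    [math.log(random.uniform(0.2, 0.9)) for _ in range(random.randint(3, 8))]
    for _ in range(G)
]
seq_logprob_old = [sum(toks) for toks in answers_token_logprobs]  # under the old policy

# Step 3: a rule-based verifier scores each full answer (1 = correct, 0 = wrong).
rewards = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # 3 correct out of 10 -> mean reward 0.3

# Step 4: group-relative advantage A_i = (r_i - mean(r)) / std(r).
mean_r = sum(rewards) / G
std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / G)
advantages = [(r - mean_r) / std_r for r in rewards]

# Step 5: objective = (1/G) * sum_i [pi_new(o_i|q) / pi_old(o_i|q)] * A_i.
# Pretend the new policy nudged each answer's log-prob a little.
seq_logprob_new = [lp + random.uniform(-0.05, 0.05) for lp in seq_logprob_old]
ratios = [math.exp(new - old) for new, old in zip(seq_logprob_new, seq_logprob_old)]
objective = sum(r * a for r, a in zip(ratios, advantages)) / G

print("mean reward :", mean_r)                             # 0.3
print("advantages  :", [round(a, 2) for a in advantages])  # ~1.53 for correct, ~-0.65 for wrong
print("objective   :", round(objective, 4))
# Any weight update that raises the probability (ratio) of a positive-advantage
# answer increases this objective; raising a negative-advantage answer lowers it.
```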

And, to the researchers' great surprise, midway through training the LLM started to "evolve" in a totally independent and autonomous way:

[figure from the paper]
 