Artificial Intelligence thread

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
european_guy said:
I have read the paper [link].

It is incredibly interesting, especially the part about RL (reinforcement learning), which is what enables the "reasoning". RL is a vast and active research field, and researchers are continuously inventing and testing new RL recipes.

The DeepSeek guys are the first ones with a SOTA reasoning model to openly show their recipe.

Their recipe is called Group Relative Policy Optimization (GRPO). It is their independent invention: an improvement on [link], invented 1.5 years ago, which is itself an improvement over [link] (2017).

RL improves a model by forcing it to maximize an objective. It is not "next token prediction", but a considerably more complex objective. In the case of this new model, it is:


[Attachment: the GRPO objective from the paper]
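For reference, this is the GRPO objective roughly as it appears in the DeepSeek papers; I am reproducing it from memory, so the notation may differ slightly from the attachment:

```latex
J_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\left[ \frac{1}{G} \sum_{i=1}^{G}
\left( \min\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
\operatorname{clip}\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right)
- \beta\, D_{\mathrm{KL}}\!\left( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \right) \right) \right],
\qquad
A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
```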
This seems a bit intimidating, but it actually is not.

I will try to explain it in simple terms, because what we are witnessing here is IMHO the beginning of something big.

To teach their model to reason, they do the following:

1. They sample N answers for each question: they ask the model the same question N times and get N answers, each time a different one.

2. They compute the probability of each answer as the product of the probabilities of each token (word) given the previous ones. For instance, remembering that LLMs generate one token at a time, if the question plus the partially generated answer is "What is the capital of the US? The capital of ", they read and store the probability of generating the "US" token (the probability of each emitted token is always computed by the LLM as part of its generation process). They continue until the end of the "question + answer" sequence.

3. They reward the answers, i.e. they verify whether each answer is correct or not. Assume the reward is 1 for a correct answer and 0 for an incorrect one, and the LLM generates 10 responses, 3 correct and 7 wrong: the mean reward will be 0.3 (i.e. 3/10).

4. They compute the Advantage (eq. 3), which is simply each answer's reward minus the mean (ignore the standard deviation in the denominator). So correct answers will have an advantage of 1 - 0.3 = 0.7, while wrong answers will have an advantage of 0 - 0.3 = -0.3 (the negative sign here is crucial!).

5. They compute the objective to maximize as a sum of terms like this one (ignore the clipping and the last D_KL term, which are needed only to stabilize training; a small code sketch of steps 1-5 follows after this explanation):

[Attachment: the per-answer term of the GRPO objective]

Here the pi terms in the numerator and denominator are the new, optimized LLM and the old LLM from which the training epoch starts. Let's assume they are the same. That term is then the probability that the LLM, given the question q ("What is the capital of the US?"), answers with answer O_i (our i-th answer out of the 10 we got before). Crucially, that term is multiplied by the advantage A_i (i.e. 0.7 if the answer is correct, -0.3 if it is wrong). The 1/G followed by the big sum symbol just means adding up one such term per answer, 10 in total, and dividing by 10 (i.e. taking the average).

So the quantity that the LLM training has to maximize looks like: 0.7*probability(answer 1) - 0.3*prob(a2) + ... + 0.7*prob(a10), averaged over the 10 answers.

We can see immediately that any change in the LLM weights that increases the probability of a correct answer increases the objective, while any change that increases the probability of a wrong answer decreases it (due to the negative advantage coefficient).
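To make the steps above concrete, here is a minimal, self-contained Python sketch of the group-relative advantage and the simplified objective described in steps 1-5. It is only an illustration of the idea, not DeepSeek's actual code: the helper names and toy numbers are made up, and the real objective uses a probability ratio against the old policy plus clipping and a KL penalty.

```python
import math

def sequence_logprob(token_logprobs):
    # Step 2: the log-probability of a full answer is the sum of the per-token
    # log-probs (equivalently, its probability is the product of the token probabilities).
    return sum(token_logprobs)

def group_advantages(rewards):
    # Steps 3-4 (simplified as in the post): advantage = reward - group mean.
    # The paper additionally divides by the group's standard deviation.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def surrogate_objective(answer_logprobs, rewards):
    # Step 5 (simplified): average over the group of advantage_i * prob(answer_i).
    advantages = group_advantages(rewards)
    probs = [math.exp(lp) for lp in answer_logprobs]
    return sum(a * p for a, p in zip(advantages, probs)) / len(rewards)

# Step 2 example: per-token log-probs of one short answer -> answer log-prob.
one_answer_token_logprobs = [-1.2, -0.3, -0.8, -2.0]   # made-up values
print(sequence_logprob(one_answer_token_logprobs))      # -4.3

# Toy example: 10 sampled answers, 3 correct (reward 1) and 7 wrong (reward 0).
rewards = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
answer_logprobs = [-5.0, -6.0, -4.5, -5.5, -7.0, -6.5, -5.0, -4.0, -6.0, -5.5]  # made-up values
print(group_advantages(rewards))             # roughly 0.7 for correct answers, -0.3 for wrong ones
print(surrogate_objective(answer_logprobs, rewards))
```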

The RL training just asks thousands of questions, samples many more answers, computes the above term, and updates the LLM weights accordingly, and slowly but surely, it seems that... it works!

One critical observation is that only the full answer is rewarded, not every token of it. IOW, the LLM is free to think whatever it wants as long as the result is correct. There is no external constraint on the thinking process.
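An outcome-only reward can be as simple as a rule that checks the final answer and ignores everything the model wrote before it. This is only a toy illustration of the idea (the answer format and helper name are made up), not DeepSeek's actual reward code:

```python
import re

def outcome_reward(model_output: str, reference_answer: str) -> float:
    # Reward the full answer only: 1.0 if the final answer matches the reference,
    # 0.0 otherwise. The "thinking" that precedes the final answer is never scored.
    match = re.search(r"Answer:\s*(.+?)\s*$", model_output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(outcome_reward("Let me think... 3x = 15, so x = 5. Answer: 5", "5"))  # 1.0
print(outcome_reward("Hmm, maybe x is 4. Answer: 4", "5"))                  # 0.0
```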

And to the great surprise of the researchers, in the middle of training the LLM started to "evolve" in a totally independent and autonomous way:

[Attachment: excerpt from the paper showing this emergent behavior]
Sampling multiple answers for each question, computing probabilities, assigning weights and calculating averages is pretty much the only way to get around hallucination and general uncertainty with AI.

A lot of people don't realize that we are not yet at the point where we can have 100% confidence in AI answers. So what really helps is when models can give us an answer along with how confident they are about it.

GPT can give you a confidence score too, but it takes a lot of work to get it to be logical. Maybe DeepSeek's confidence level is more intelligent.
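For what it's worth, the simplest way to attach a confidence number to any chat model is the over-sampling approach described above: ask the same question several times and use the agreement rate as the confidence. A minimal sketch with the OpenAI Python client; the model name, prompt and helper name are just placeholders for whatever you actually use:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_confidence(question: str, n: int = 5, model: str = "gpt-4o"):
    # Ask the same question n times and treat the share of the majority answer as confidence.
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            temperature=1.0,  # keep sampling on so the answers can differ
            messages=[
                {"role": "system", "content": "Answer with a single short phrase only."},
                {"role": "user", "content": question},
            ],
        )
        answers.append(resp.choices[0].message.content.strip().lower())
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n

answer, confidence = answer_with_confidence("What is the capital of Australia?")
print(answer, confidence)
```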

The problem IMO with reasoning models is that they still take too long. For a lot of queries, when it comes to just automating interactions in real life, you need the AI to answer with something within 3 to 4 seconds. o1-preview was basically taking way too much time to ever be useful for that kind of application. gpt-4o can also very often take up to 10 seconds when you pass in quite a few tokens and wait for the output to be generated.

In many ways, OpenAI's current level of service is garbage, but since we all wrote to their tech stack a year ago, changing to something else takes time.

When I tested my scripts and prompts against DeepSeek V3, the answers were pretty much as good as gpt-4o answers from day 1, even though the prompts were written based on testing with gpt-4o, so they are not optimized for DeepSeek V3.
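Part of the reason the switch is low-friction is that DeepSeek exposes an OpenAI-compatible API, so existing scripts mostly just need a different base URL and model name. A rough sketch; the base URL and model names here are taken from DeepSeek's public docs as I remember them, so double-check them before relying on this:

```python
from openai import OpenAI

# Same client library and prompts as before; only the endpoint and model change.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",   # assumed OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="deepseek-chat",  # V3; the reasoning model is reportedly "deepseek-reasoner"
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence: ..."}],
)
print(resp.choices[0].message.content)
```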
 

european_guy

Junior Member
Registered Member
tphuang said:
Sampling multiple answers for each question, computing probabilities, assigning weights and calculating averages is pretty much the only way to get around hallucination and general uncertainty with AI.

This is the training process I was explaining, not what it does at inference time.

During training they sample and perform RL. The probabilities are not important for their absolute value (i.e. it is not important whether they are reliable or not). They are part of the machinery to compute the gradients, i.e. the directions in which to "move" the weights.

Inference (when the user actually uses the model) works in the usual way; there is no oversampling. The model reads the question and just answers it once (although the answer can be longer).

But because it went through the RL process during training, its answer will be better thought out and improved compared to the old model.

My post was referring to how the model is trained, not how it is used (sorry if that was not clear).
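As a tiny illustration of the "probabilities as machinery for the gradients" point, here is a REINFORCE-style stand-in for the update, with made-up toy tensors instead of a real LLM; the gradient of the advantage-weighted log-probabilities is what moves the weights, and none of this sampling happens at inference time:

```python
import torch

# Toy stand-ins: log-probabilities of 10 sampled answers under the current policy
# (in a real setup these come from the LLM) and their advantages from the reward step.
answer_logprobs = torch.tensor(
    [-5.0, -6.0, -4.5, -5.5, -7.0, -6.5, -5.0, -4.0, -6.0, -5.5], requires_grad=True
)
advantages = torch.tensor([0.7, -0.3, -0.3, 0.7, -0.3, -0.3, -0.3, 0.7, -0.3, -0.3])

# Maximizing the advantage-weighted log-probabilities is done by minimizing the negative.
# The probabilities matter only through this gradient, not as calibrated confidence values.
loss = -(advantages * answer_logprobs).mean()
loss.backward()
print(answer_logprobs.grad)  # the direction in which to "move" the weights
```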
 

Fatty

Junior Member
Registered Member
I cannot stress enough how this basically implies that US AI companies are all burning money for nothing. You are spending billions of investors' money to build something that is replicated or surpassed in a month or two by a lab that sells it 50 times cheaper and makes it open source. DeepSeek will kill the AI industry once companies start realizing this. There is no path to profitability because of them!
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
european_guy said:
This is the training process I was explaining, not what it does at inference time.

During training they sample and perform RL. The probabilities are not important for their absolute value (i.e. it is not important whether they are reliable or not). They are part of the machinery to compute the gradients, i.e. the directions in which to "move" the weights.

Inference (when the user actually uses the model) works in the usual way; there is no oversampling. The model reads the question and just answers it once (although the answer can be longer).

But because it went through the RL process during training, its answer will be better thought out and improved compared to the old model.

My post was referring to how the model is trained, not how it is used (sorry if that was not clear).
Maybe that's why, for inference, I still have to do oversampling of o1 to deal with hallucination issues.

That's why R1 lowering the inference cost is a game changer. The current o1 pricing level is just completely unworkable if you want to automate stuff. What OpenAI charges is ludicrous, and they have the nerve to tell the world that they are losing money.

Fatty said:
I cannot stress enough how this basically implies that US AI companies are all burning money for nothing. You are spending billions of investors' money to build something that is replicated or surpassed in a month or two by a lab that sells it 50 times cheaper and makes it open source. DeepSeek will kill the AI industry once companies start realizing this. There is no path to profitability because of them!
Well, the great thing about DeepSeek and Qwen making the model weights and RL process available is that anyone in the world can use their algorithm to create their own open-source reasoning model. That way we, the public, have the power of controlling AI and are no longer just subservient to the tech overlords in Silicon Valley, whose goal is to achieve global domination and techno-feudalism over the rest of us, whom they want to rule over in serfdom.
 

tokenanalyst

Brigadier
Registered Member
This DeepSeek model is impressive, and what is even more impressive is how much training on data from this model can improve other, smaller models. This is absolutely insane. It means that even smaller Chinese AI companies don't really need a lot of GPUs to make really great AI models.
[Attachment: benchmark results for the smaller distilled models]
It is something I had already noted with the Sky-T1-32B-Preview model from UC Berkeley: it was trained using data from the QwQ reasoning model, which increased the model's performance significantly at a very low cost.

[Attachment: Sky-T1-32B-Preview benchmark results]
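The recipe behind this kind of result is essentially supervised fine-tuning on reasoning traces generated by a stronger model. A minimal sketch of just the data-preparation step; the generate() helper, the prompt/response field names and the example questions are all made up, standing in for whatever API serves the teacher model (R1, QwQ, ...):

```python
import json

def generate(question: str) -> str:
    # Placeholder for a call to the teacher reasoning model (e.g. R1 or QwQ).
    # In practice this would hit the teacher's API and return its full
    # chain of thought plus final answer.
    return f"<think>working through: {question}</think> final answer goes here"

questions = [
    "If 3x + 5 = 20, what is x?",
    "A train travels 120 km in 1.5 hours. What is its average speed?",
]

# Build (prompt, response) pairs and write them as JSONL, a common input format
# for the SFT toolkits used to fine-tune the smaller student model.
with open("distill_sft.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        f.write(json.dumps({"prompt": q, "response": generate(q)}, ensure_ascii=False) + "\n")
```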


For the laughs.

[Attachment]
 

tokenanalyst

Brigadier
Registered Member
Looks like the MiniCPM launch was overshadowed by the launch of DeepSeek.

A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

MiniCPM-o 2.6

MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

-Leading Visual Capability. MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks.
-State-of-the-art Speech Capability. MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices in English and Chinese.
-Strong Multimodal Live Streaming Capability. As a new feature, MiniCPM-o 2.6 can accept continuous video and audio streams independent of user queries, and supports real-time speech interaction.
-Strong OCR Capability and Others. Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344).
-Superior Efficiency. In addition to its friendly size, MiniCPM-o 2.6 also shows state-of-the-art token density (i.e., number of pixels encoded into each visual token).
-Easy Usage. MiniCPM-o 2.6 can be easily used in various ways: (1) llama.cpp support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with LLaMA-Factory, (5) quick local WebUI demo setup with Gradio, and (6) online web demo on server.

[link]
 