Maybe slightly off topic here, but does anyone know what GPT model Xiaohongshu uses for its translation feature?
"Sampling multiple answers for each question, computing probabilities, assigning weights and averaging is pretty much the only way to get around hallucination and general uncertainty with AI."

I have read the paper.
It is incredibly interesting, especially the part about RL (reinforcement learning), which is what enables the "reasoning". RL is a vast and active research field, and researchers are continuously inventing and testing new RL recipes.
The DeepSeek guys are the first ones with a SOTA reasoning model to openly show their recipe.
Their recipe is called Group Relative Policy Optimization (GRPO). It is their own invention, an improvement on a method introduced about 1.5 years earlier, which is itself an improvement on one from 2017.
RL improves a model by forcing it to maximize an objective. That objective is not simple "next token prediction" but something more complex. In the case of this new model it is:
[Attachment 144001: the GRPO objective function from the paper]
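Since the image may not be visible to everyone, here is the GRPO objective roughly as it appears in the DeepSeek papers (my own transcription, so take the exact notation with a grain of salt):

\[
J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\; \text{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right) \right],
\qquad
A_i = \frac{r_i - \text{mean}(\{r_1,\dots,r_G\})}{\text{std}(\{r_1,\dots,r_G\})}
\]

Here q is the question, o_1 ... o_G are the G sampled answers, r_i their rewards and A_i their advantages; everything is explained step by step below.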
This looks a bit intimidating, but it actually is not.
I will try to explain it in simple terms, because what we are witnessing here is IMHO the beginning of something big.
To teach their model to reason, they do the following:
1. They sample N answers for each question: they ask the model the same question N times and get N answers, each time a (generally) different one.
2. They compute the probability of each answer as the product of the probabilities of its tokens (words) given the previous ones. For instance, remembering that LLMs generate one token at a time, if the question plus the current partial answer is "What is the capital of US? The capital of ", they read and store the probability of generating the "US" token (the probability of each emitted token is always computed by the LLM as part of its generation process). They continue until the end of the question + answer sequence.
3. They reward the answers, i.e. they verify whether each answer is correct or not. Assume the reward is 1 for a correct answer and 0 for an incorrect one, and the LLM generates 10 responses, 3 correct and 7 wrong: the mean reward will be 0.3 (i.e. 3/10).
4. They compute the Advantage (eq. 3), which is simply each answer's reward minus the mean (ignore the standard deviation in the denominator). So correct answers will have an advantage of 1 - 0.3 = 0.7, while wrong answers will have an advantage of 0 - 0.3 = -0.3 (the negative sign is crucial here!).
5. They compute the objective to maximize as a sum of terms like this one (the clipping and the final D_KL term can be ignored; they are only there to stabilize training):
[Attachment 144002: the per-answer term inside the GRPO objective]
Here the π terms in the numerator and denominator are the new, optimized LLM and the old LLM from which the training epoch starts. Let's assume they are the same. So that term is the probability that the LLM, given the question q ("What is the capital of US?"), answers with answer o_i (our i-th answer out of the 10 we got before). Crucially, that term is multiplied by the advantage A_i (i.e. 0.7 if the answer is correct, -0.3 if it is wrong). The initial 1/G followed by the big Σ just means summing one such term per answer, 10 in total, and dividing by 10 (i.e. taking the average).
So the quantity that LLM training has to maximize looks like: 0.7*prob(answer 1) - 0.3*prob(answer 2) + ... + 0.7*prob(answer 10). A toy version of steps 1-5 in code is sketched just below.
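To make steps 1-5 concrete, here is a minimal sketch in Python/PyTorch of the group-relative objective for a single question. It is illustrative only: `model` is assumed to be a Hugging Face-style causal LM that returns `.logits`, log-probabilities are used instead of the π_new/π_old ratio (same gradient direction when the two policies coincide, and no numerical underflow on long answers), and the clipping and KL terms are omitted exactly as in the explanation above.

```python
import torch

def sequence_log_prob(model, question_ids, answer_ids):
    """Log-probability of an answer given the question: the sum over answer tokens
    of log P(token | everything before it). This is the log of the product of
    per-token probabilities described in step 2, computed in log space for stability."""
    input_ids = torch.cat([question_ids, answer_ids]).unsqueeze(0)  # shape (1, Lq+La)
    logits = model(input_ids).logits                                # shape (1, Lq+La, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # The token at position t is predicted by the logits at position t-1.
    answer_positions = range(question_ids.numel(), input_ids.size(1))
    return sum(log_probs[0, t - 1, input_ids[0, t]] for t in answer_positions)

def group_relative_objective(model, question_ids, answers, rewards, eps=1e-4):
    """Toy objective for one question: advantage-weighted log-likelihood of each
    sampled answer, averaged over the group (clipping and KL penalty omitted)."""
    rewards = torch.tensor(rewards, dtype=torch.float32)
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)  # eq. 3 (steps 3-4)
    log_probs = torch.stack(
        [sequence_log_prob(model, question_ids, a) for a in answers]  # step 2
    )
    # Maximizing this pushes up correct answers and pushes down wrong ones (step 5).
    return (advantages * log_probs).mean()

# The toy numbers from the explanation: 10 answers, 3 correct and 7 wrong.
toy_rewards = torch.tensor([1., 0., 0., 1., 0., 0., 0., 1., 0., 0.])
print(toy_rewards - toy_rewards.mean())  # +0.7 for correct answers, -0.3 for wrong ones
```

In the real objective the log-probability is replaced by the probability ratio against the old policy, then clipped, and a KL penalty towards a reference model is subtracted; the push/pull intuition in the next paragraph is exactly the same.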
We can see immediately that any change in the LLM weights that increases the probability of a correct answer increases the objective, while changes that increase the probability of a wrong answer decrease it (due to the negative advantage coefficient).
The RL training just asks thousands of questions, samples many more answers, computes the above term and updates the LLM weights accordingly, and slowly but surely it seems that... it works!
One critical observation is that only the full answer is rewarded, not every token of it. IOW, the LLM is free to think whatever it wants as long as the result is OK. There is no external constraint on the thinking process.
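To make "only the full answer is rewarded" concrete, here is a hypothetical rule-based reward function in the same spirit: it ignores the whole reasoning trace and only checks a final line of the form "Answer: ...". The format and the helper are assumptions for illustration; the paper uses rule-based accuracy checks, but the actual parsers are task-specific.

```python
import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.
    Nothing in the reasoning text before the final line is scored directly,
    so the model is free to 'think' however it likes (assumed format: 'Answer: ...')."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parsable final answer -> no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# The chain of thought is ignored; only the final line is checked.
completion = "Hmm, the question asks for the capital of the US... Answer: Washington, D.C."
print(outcome_reward(completion, "Washington, D.C."))  # 1.0
```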
And, to the researchers' great surprise, midway through training the LLM started to "evolve" in a totally independent and autonomous way:
[Attachment 144005: example of the emergent behaviour observed during training]
"Sampling multiple answers for each question, computing probabilities, assigning weights and averaging is pretty much the only way to get around hallucination and general uncertainty with AI. Maybe that's why, for inference, I still have to oversample o1 to deal with hallucination issues."

This is the training process I was explaining, not what the model does at inference time.
During training they sample and perform RL. The probabilities are not important for their absolute value (i.e. it does not matter whether they are reliable or not); they are part of the machinery for computing the gradients, i.e. the directions in which to "move" the weights.
Inference (when the user actually uses the model) is done in the usual way; there is no oversampling. The model reads the question and just answers it once (although the answer can be longer).
But because it went through the RL process during training, its answer will be better thought out and improved compared to the old model.
My post was referring to how the model is trained, not how it is used (sorry if that was not clear).
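And just to make the inference side concrete: using such a model is an ordinary single generation call, with no sampling of N candidate answers. A minimal sketch with the Hugging Face transformers API; the checkpoint name is only a placeholder for whichever open reasoning model you actually run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: substitute whichever open reasoning model you actually use.
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One question, one answer: no oversampling at inference time.
inputs = tokenizer("What is the capital of US?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```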
"Well, the great thing about DeepSeek and Qwen making the model weights and RL process available is that anyone in the world can use their algorithm to create their own open-source reasoning model, so that we the public have the power of controlling AI and are no longer just subservient to the tech overlords in Silicon Valley, whose goal is to achieve global domination and techno-feudalism over the rest of us, whom they want to rule over as their serfs."

I cannot stress enough how this basically implies that US AI companies are all burning money for nothing. You are spending billions of investors' money to build something that is replicated or surpassed in a month or two by a lab that sells for 50 times cheaper and makes it open source. DeepSeek will kill the AI industry once companies start realizing this. There is no path to profitability because of them!