My commentary on Qwen-2.5-Max. I think they got a little unlucky here, with DeepSeek beating them to the punch and stealing everyone's hearts. They may well have a better model than V3 now, but it came a month too late and without a comparably addictive reasoning model.
This model has the potential to become world-class.
It has been pre-trained on 20T tokens!!!
That is 20,000 billion tokens. For reference, Llama 3 405B was trained on 15T tokens, a record at the time. OpenAI's o1 and Google's Gemini may also be trained on similar token budgets, but they don't disclose this important info.
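To put that budget in perspective, here is a rough back-of-the-envelope using the common ~6·N·D FLOPs approximation for dense transformer training. Qwen hasn't disclosed Qwen-2.5-Max's parameter count (it is a MoE), so I simply reuse Llama 3's 405B as a stand-in size; a sketch, not a real compute estimate for their model:

```python
# Back-of-the-envelope training compute via the common ~6 * N * D
# approximation (roughly 6 FLOPs per parameter per training token).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Llama 3 405B on its disclosed 15T-token budget.
llama3 = train_flops(405e9, 15e12)
# Hypothetical same-size model on a 20T-token budget
# (Qwen-2.5-Max's parameter count is not public).
qwen_scale = train_flops(405e9, 20e12)

print(f"405B params @ 15T tokens: ~{llama3:.1e} FLOPs")     # ~3.6e25
print(f"405B params @ 20T tokens: ~{qwen_scale:.1e} FLOPs")  # ~4.9e25
```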
After the pre-training phase you get what is called the base model. And their base model is already world-class, as we can see in this table of base-model comparisons (OpenAI, Anthropic and Google don't allow access to their base models). In particular, it is stronger than DeepSeek-V3 base, the base model underlying R1.
Pre-training is by far the most resource-consuming phase of training, and it is when the model accumulates its knowledge of the world.
The following phase, called fine-tuning or post-training, is much lighter from a resource-budget point of view, but it is the key one for making all that knowledge come to fruition, for making all the good features of the model emerge (including reasoning).
A strong base model will always develop, after post-training, into a strong instruct (i.e. finished) model.
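To make the pre-training vs. post-training split concrete, here is a minimal sketch of what supervised fine-tuning looks like mechanically. The checkpoint (a small open Qwen model) and the two toy instruction pairs are my placeholders for illustration, not Qwen's actual recipe or data:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small open model used purely for illustration.
name = "Qwen/Qwen2.5-0.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Toy instruction-response pairs standing in for a real SFT corpus.
pairs = [
    ("What is 2 + 2?", "2 + 2 = 4."),
    ("Name a prime number.", "7 is a prime number."),
]

optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

for prompt, answer in pairs:
    text = f"User: {prompt}\nAssistant: {answer}{tok.eos_token}"
    batch = tok(text, return_tensors="pt")
    # Causal-LM SFT: next-token cross-entropy with labels = input_ids.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```

The point is that this touches the same weights as pre-training but over orders of magnitude fewer tokens, which is why the strong base model is the hard, expensive part.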
So, now that the recipe for reasoning is out in the open (btw, the recipe involves post-training, not pre-training), we can be very confident that this base model will evolve into a top-class model, better than R1, within 2-3 months.
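For intuition on that recipe: the R1-style approach, as described in DeepSeek's paper, is reinforcement learning against rule-based, verifiable rewards rather than a learned reward model. Here is a hedged sketch of such a reward function; the \boxed{} answer convention and the 0/1 reward shape are my illustrative choices:

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the content of a final \\boxed{...} answer, if present."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return m.group(1).strip() if m else None

def reward(completion: str, gold: str) -> float:
    """+1 for a verifiably correct final answer, 0 otherwise."""
    ans = extract_final_answer(completion)
    return 1.0 if ans is not None and ans == gold.strip() else 0.0

# Example: a completion that reasons, then commits to an answer.
sample = "Let's think: 12 * 12 = 144, so the answer is \\boxed{144}."
print(reward(sample, "144"))  # -> 1.0
```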
As the Qwen team themselves put it: "Our base models have demonstrated significant advantages across most benchmarks, and we are optimistic that advancements in post-training techniques will elevate the next version of Qwen2.5-Max to new heights."