Artificial Intelligence thread

diadact

Just Hatched
Registered Member
BTW, in the last 18 months, papers on eliciting reasoning in LLMs have sprung up like mushrooms. Many [posts] on the internet say that OpenAI o1 is based [on one of them]... the other big players are months away from OpenAI, not years away.
Yeah, it is based on the STaR method.
Many of the authors of that paper are at xAI (Elon's company) now.
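
For readers unfamiliar with it, STaR (Self-Taught Reasoner, Zelikman et al., 2022) bootstraps reasoning by sampling rationales, keeping only the ones that lead to a verified correct answer, and fine-tuning on the survivors. A minimal sketch of that loop; sample_rationale and finetune are hypothetical stand-ins for real model calls, and only the control flow is meant to be faithful to the paper:

```python
# Minimal sketch of the STaR bootstrapping loop (Zelikman et al., 2022).
# sample_rationale() and finetune() are hypothetical placeholders for real
# LLM inference/training calls.

def sample_rationale(model, question, hint=None):
    """Prompt the model for a chain of thought (optionally hinting the
    answer) and return (rationale, predicted_answer). Placeholder."""
    raise NotImplementedError

def finetune(base_model, examples):
    """Fine-tune on (question, rationale, answer) triples. Placeholder."""
    raise NotImplementedError

def star_iteration(base_model, model, dataset, n_samples=4):
    kept = []
    for question, answer in dataset:
        for _ in range(n_samples):
            rationale, predicted = sample_rationale(model, question)
            if predicted == answer:          # keep only verified chains
                kept.append((question, rationale, answer))
                break
        else:
            # "Rationalization": give the answer as a hint so the model
            # can still produce a rationale for problems it kept failing.
            rationale, predicted = sample_rationale(model, question, hint=answer)
            if predicted == answer:
                kept.append((question, rationale, answer))
    # STaR fine-tunes from the *base* model on the filtered set, then repeats.
    return finetune(base_model, kept)
```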
What is most impressive to me is that, with a 70B model trained on 18T tokens, they matched the performance of the 405B Llama model trained on 15T tokens.
Qwen 2.5 has better performance due to more data: better filtering, a bigger corpus, and more extensive RLHF data annotation.
If only they had a bigger cluster and more compute, they would have crushed 4o and Sonnet 3.5 as well.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
Have you tested o1?
Bear in mind that o1-preview is elementary.
No tree search is happening at inference; they are just using CoT.
Tree search will require lots of compute to serve millions of users, which they don't have right now.
CoT & MCTS plus agents will solve the instruction-following problem.
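
To make the CoT-vs-search distinction concrete, here is a toy contrast (my sketch, not anyone's actual system): a deliberately noisy arithmetic solver stands in for a sampled chain of thought, and an oracle scorer stands in for a learned verifier. Best-of-N with a verifier is the simplest form of inference-time search, and it costs N times the compute of a single CoT sample, which is the serving-cost point above:

```python
# Toy contrast: single-sample CoT vs. verifier-guided best-of-N search.
# noisy_solve() and verifier_score() are stand-ins for LLM/reward-model
# calls; the point is the compute/accuracy trade-off, not the stubs.
import random

def noisy_solve(a, b):
    """One sampled chain of thought: sometimes derails."""
    answer = a + b
    if random.random() < 0.4:                # 40% of chains go wrong
        answer += random.choice([-1, 1])
    return answer

def verifier_score(a, b, answer):
    """Oracle stand-in for a learned verifier."""
    return -abs((a + b) - answer)

def single_cot(a, b):
    return noisy_solve(a, b)                 # one forward pass

def best_of_n(a, b, n=8):
    candidates = [noisy_solve(a, b) for _ in range(n)]   # n forward passes
    return max(candidates, key=lambda ans: verifier_score(a, b, ans))

trials = 1000
cot_acc = sum(single_cot(3, 4) == 7 for _ in range(trials)) / trials
bon_acc = sum(best_of_n(3, 4) == 7 for _ in range(trials)) / trials
print(f"single CoT: {cot_acc:.0%}   best-of-8: {bon_acc:.0%}")
# Expect roughly 60% vs. ~100%: search buys accuracy with extra compute.
```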
OpenAI will release Orion in Q4 2024 or Q1 2025
Claude 3.5 Opus, Gemini 2, and Claude 4 will have agentic capabilities.
AI has become a compute game
The one who has more compute will win
So we are constantly trying out different LLMs for our applications.
We've found OpenAI's stuff to be the best thus far.
Again, things are improving. Some things that LLMs didn't do well six months ago are now working much better.

Everyone makes a claim about how the latest LLM is the greatest ever and is reaching AGI.
Whatever
As I said, I test the latest available GPT-4o models to automate stuff and want to strangle them.

SenseTime V5, Claude 3, DeepSeek V2/V2.5, GPT-4o, Qwen 2/2.5, and Gemini 1.5 Pro were all trained on synthetic data.
If synthetic data were not working, we would have seen model collapse due to low entropy.
Every major lab knows how to maintain data diversity and high entropy while using synthetic data.
Qwen, trained on 18T tokens, used lots of high-quality synthetic tokens.
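
A toy illustration of the kind of filtering that diversity claim implies (my sketch, not any lab's actual pipeline): drop exact duplicates and low-entropy, repetitive generations before synthetic text enters the training mix:

```python
# Toy synthetic-data filter: exact-duplicate removal plus a character-level
# entropy floor to reject degenerate/repetitive generations. Illustrative
# only; real pipelines use far more sophisticated dedup and quality models.
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def filter_synthetic(samples, min_entropy=3.0):
    seen, kept = set(), []
    for s in samples:
        key = s.strip().lower()
        if key in seen:                      # exact duplicate -> drop
            continue
        if char_entropy(s) < min_entropy:    # repetitive text -> drop
            continue
        seen.add(key)
        kept.append(s)
    return kept

print(filter_synthetic([
    "The gradient of x^2 is 2x.",
    "The gradient of x^2 is 2x.",   # duplicate, dropped
    "aaaaaaaaaaaaaaaa",             # zero entropy, dropped
]))
```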

All the top labs have solved this issue, be it in China, the US, or France.

This is where human-annotated & human-generated data comes in.
Scale AI does this.
Sure, they all have some synthetic data, which in my mind gets rid of some of the lower-quality data out there and cleans it up a little bit.

But there is a difference between using half real data and half synthetic data,

versus 10% real data and 90% data generated by an older AI.

Fundamentally, the AI models we have right now are just next-token prediction models. If 90% of the tokens are generated by an older prediction model, how is the new model going to be significantly better?
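
That intuition can be made concrete with a toy experiment (my sketch, under heavy simplifying assumptions: a character-bigram model stands in for an LLM). Train on real text, generate synthetic text, retrain purely on that output, and repeat; the learned distribution's entropy tends to shrink across generations, which is the model-collapse failure mode the earlier post mentioned. Curated, diverse synthetic data is precisely what avoids this:

```python
# Toy illustration of recursive training on purely synthetic data.
# A character-bigram model stands in for an LLM; each generation is
# trained only on the previous generation's samples. Rare transitions
# drop out over time, so next-char entropy tends to shrink ("collapse").
import math
import random
from collections import Counter, defaultdict

def train_bigram(text):
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def sample_text(model, length=5000, seed_char="t"):
    out = [seed_char]
    for _ in range(length):
        nxt = model.get(out[-1])
        if not nxt:
            out.append(seed_char)            # dead end: restart
            continue
        chars, weights = zip(*nxt.items())
        out.append(random.choices(chars, weights=weights)[0])
    return "".join(out)

def avg_entropy(model):
    """Mean next-char entropy (bits) across contexts."""
    ents = []
    for nxt in model.values():
        total = sum(nxt.values())
        ents.append(-sum(c / total * math.log2(c / total)
                         for c in nxt.values()))
    return sum(ents) / len(ents)

real = "the quick brown fox jumps over the lazy dog " * 200
model = train_bigram(real)
for gen in range(5):
    print(f"gen {gen}: avg next-char entropy = {avg_entropy(model):.3f} bits")
    model = train_bigram(sample_text(model))  # retrain on own output
```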
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
What is most impressive to me is that, with a 70B model trained on 18T tokens, they matched the performance of the 405B Llama model trained on 15T tokens.

For (dense) Transformer models, computation per token is proportional to the number of parameters. This means Qwen 70B reached the same performance as Llama 405B with a roughly five-times-smaller compute budget!
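
The arithmetic behind that factor, using the common C ≈ 6·N·D approximation for dense-Transformer training FLOPs (N = parameters, D = training tokens; the 6·N·D rule is a standard estimate, not something stated in the thread):

```python
# Back-of-the-envelope training compute: C ~= 6 * N * D FLOPs for a
# dense Transformer with N parameters trained on D tokens.
def train_flops(params, tokens):
    return 6 * params * tokens

qwen  = train_flops(70e9, 18e12)     # ~7.6e24 FLOPs
llama = train_flops(405e9, 15e12)    # ~3.6e25 FLOPs
print(f"Qwen:  {qwen:.2e} FLOPs")
print(f"Llama: {llama:.2e} FLOPs")
print(f"ratio: {llama / qwen:.1f}x") # ~4.8x, i.e. roughly 5x less compute
```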

This is even more impressive considering that Llama 3 is only a few months old and that the Meta team behind Llama is world class; they are state of the art.

For about two years now there has been a trend toward better training techniques applied to smaller models; this indirectly helps China work around GPU limitations for the time being.

The 7B-70B model range is here to stay; these models will be the workhorses of future applications, even more so once the new breed of reasoning models (which is really 90% new training techniques) spreads, so that even a small 7B model will gain "reasoning" capabilities.

BTW, in the last 18 months, papers on eliciting reasoning in LLMs have sprung up like mushrooms. Many [posts] on the internet say that OpenAI o1 is based [on one of them]... the other big players are months away from OpenAI, not years away.

See this chart of training time versus parameter and token counts for a 10 EFLOPS data center.

[Attached image: MT-TrainingTimeVsParameterToken.jpeg — training time vs. parameters and tokens]

So for 15T tokens, the difference in training computation between a 300B-parameter model and a 70B-parameter model is about fourfold (300/70 ≈ 4.3, since compute scales linearly with parameter count at a fixed token count). So yeah, you are probably right about the training-resource difference.

Which, btw, indicates one thing: once you have a certain number of tokens, there is something like an optimal number of parameters. Having too many parameters doesn't make the model significantly better.

Someone told me a while back that the ideal ratio of tokens to parameters is 20:1 (the Chinchilla rule of thumb). But it seems like the ratio should be larger than that, since for Qwen 2.5 it was roughly 250:1 (18T tokens on a 70B model).
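
Putting numbers on both ratios, using the thread's own parameter/token figures (the ~20:1 Chinchilla heuristic is from Hoffmann et al., 2022; note it optimizes training compute only, and cheap serving is what pushes labs to over-train small models far past it):

```python
# Tokens-per-parameter ratios: Chinchilla rule of thumb vs. recent models,
# using the figures quoted in this thread.
models = {
    "Chinchilla-optimal 70B": (70e9, 70e9 * 20),
    "Qwen 2.5 (~70B)":        (70e9, 18e12),
    "Llama 3.1 405B":         (405e9, 15e12),
}
for name, (params, tokens) in models.items():
    print(f"{name:24s} {tokens / params:6.0f} tokens per parameter")
# -> 20, ~257, and ~37 tokens per parameter respectively
```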
 