Artificial Intelligence thread

jli88

Junior Member
Registered Member
yes, did you see my comment right above this?

seriously, have people actually used GLM-5.0-Turbo or GLM-5.1 before they write crap like this?

Today, my prompt on GLM-5.0-Turbo.



It gave me the scripts, JSON files and everything in 5 minutes. It almost fully worked on the first pass.

See Design Arena


Let's agree to disagree here I guess.

The fact of the matter is that Claude and OpenAI models are widely used even in China.

I asked Kimi K2.5 to look through Chinese sources and find which model is best for coding, and even it says Opus 4.6. Here is one benchmark based on building a real product.


It will be hard for K2.5 to be better than Opus 4.6 when the latter reportedly has about 5 times the parameters, and the former was trained on 1/5th or less of the tokens.
 

solarz

Brigadier
Have you tried coding? I can assure you that Claude models are absolutely the best (followed by the latest OpenAI models) for general coding use cases, hence their widespread use by coders. If coders could get away with much cheaper Chinese models, they would, but unfortunately they can't.

In terms of model size, Opus is rumored to be 5T parameters and Mythos more than 10T, and these are being trained on upwards of 100T tokens.

By comparison, Chinese models are much smaller. Kimi K2.5, for example, has 1.04 trillion parameters and was trained on 15T tokens. It is fantastic for its size. But China needs to scale right now!
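
For anyone who wants to check the arithmetic, here is a rough sketch that takes the figures quoted above at face value. The Opus numbers are rumors, not confirmed specs, and the variable names are just for illustration.

```python
# Figures as quoted in this thread; the Opus values are rumors, not confirmed specs.
OPUS_PARAMS_T = 5.0      # trillions of parameters (rumored)
OPUS_TOKENS_T = 100.0    # trillions of training tokens (rumored, "upwards of")
KIMI_PARAMS_T = 1.04     # trillions of parameters
KIMI_TOKENS_T = 15.0     # trillions of training tokens


def ratio(a: float, b: float) -> float:
    """Return a / b, rounded to two decimals for readability."""
    return round(a / b, 2)


# How many times larger Opus would be than Kimi K2.5 by parameter count.
param_ratio = ratio(OPUS_PARAMS_T, KIMI_PARAMS_T)
# Kimi K2.5 training tokens as a share of the rumored Opus training tokens.
token_share = ratio(KIMI_TOKENS_T, OPUS_TOKENS_T)
# Training tokens seen per parameter for each model.
opus_tokens_per_param = ratio(OPUS_TOKENS_T, OPUS_PARAMS_T)
kimi_tokens_per_param = ratio(KIMI_TOKENS_T, KIMI_PARAMS_T)

print(param_ratio, token_share, opus_tokens_per_param, kimi_tokens_per_param)
```

On these numbers the parameter gap is about 4.8x and the token share is 0.15, so "5 times the params" and "1/5th or less of the tokens" both roughly hold if the rumors are right.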

I use LLMs for coding, and we have access to various models including Claude and GPT. I have to say I have failed to notice any significant difference between any of them.

The problem with these token or parameter counts is that they’re meaningless without context. By far the biggest hurdle for LLM coding is context size. It doesn’t matter how many parameters your model has, or how many tokens it has trained on, when it can’t keep even a few thousand lines of code in the context.
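
If you want to sanity-check this against your own codebase, a crude sketch like the one below gives a ballpark. It assumes roughly 4 characters per token, which is only a rule of thumb; real tokenizers vary by model and by language. The function names, the output reserve, and the 200,000 window are illustrative, not taken from any vendor's API.

```python
def estimate_tokens(source: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate: character count divided by an assumed ratio."""
    return int(len(source) / chars_per_token)


def fits_in_context(n_tokens: int, window: int, reserved_for_output: int = 8_000) -> bool:
    """True if the prompt still leaves the reserved room for the model's reply."""
    return n_tokens + reserved_for_output <= window


# 3,000 lines of 60 characters each (plus newline) as a stand-in
# for "a few thousand lines of code".
fake_code = ("x" * 60 + "\n") * 3_000
tokens = estimate_tokens(fake_code)
print(tokens, fits_in_context(tokens, window=200_000))
```

Raw file size is only part of the budget: conversation history, tool output, and instructions all draw on the same window, so the practical limit can arrive much sooner than this estimate suggests.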
 

jli88

Junior Member
Registered Member
Even there, Claude is clearly ahead: Opus 4.6 has a 1M-token context window, while Kimi K2.5 has 256k and GLM-5.1 has 200k.

Clearly, Chinese models need to scale by a factor of 10.
 

solarz

Brigadier
Honestly, that’s just an API scan along with some data parsing.

Try telling it to refactor an engine class for dependency injection, or come up with test cases based on code logic.

Hell, LLMs often fail at even simple tasks. I once tried to get one to refactor a large number of unit tests using a particular template, and it just crapped out on me. This was the classic boilerplate, repetitive task that AI is supposed to be good at. This goes back to my previous comment about context size being the biggest bottleneck right now.
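
To make the dependency injection example concrete, here is a minimal sketch of what that kind of refactor usually ends up looking like. Every name here (`Storage`, `DictStorage`, `Engine`, `render`) is made up for illustration and is not taken from any real codebase.

```python
from typing import Protocol


class Storage(Protocol):
    """Anything the engine can load text from."""

    def load(self, key: str) -> str: ...


class DictStorage:
    """Simple in-memory implementation, handy as a test double."""

    def __init__(self, data: dict[str, str]) -> None:
        self._data = data

    def load(self, key: str) -> str:
        return self._data[key]


class Engine:
    """After the refactor: the storage is passed in instead of being
    constructed inside the class, so tests can swap in a fake."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def render(self, key: str) -> str:
        # Stand-in for real engine logic.
        return self._storage.load(key).upper()


engine = Engine(DictStorage({"greeting": "hello"}))
print(engine.render("greeting"))
```

The hard part for a model is not this pattern itself but applying it consistently across every call site of a large class, which is where the context limit starts to bite.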
 

solarz

Brigadier
Not sure where you are getting that Opus 4.6 has a 1M-token context size. I haven’t heard of it and I certainly haven’t noticed it. Like I said, all the models that I’ve seen pretty much behave the same way.
 

jli88

Junior Member
Registered Member
"Opus 4.6 features a 1M token context window"

 

bsdnf

Senior Member
Registered Member
1M context window is a lie.

The reality is that the practical upper limit for all current models is 256k. Once you hit this limit, model performance drops significantly and you have to resort to context compression.

Especially now that Opus is suffering from a severe reduction in compute, the problems it could previously brute-force with strong baseline capabilities after exceeding the window limit are now exposed.

Can you imagine? It can't even solve a car wash problem now.
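
For anyone unfamiliar with context compression, the basic idea can be sketched as below. This is a toy version under my own assumptions (a 4-characters-per-token estimate and a placeholder in place of a real summary), not how any particular vendor implements it; `compress_history` is a made-up name.

```python
def compress_history(messages: list[str], budget: int,
                     chars_per_token: float = 4.0) -> list[str]:
    """Keep the newest messages that fit in `budget` tokens; collapse the rest."""

    def cost(text: str) -> int:
        # Crude token estimate; a real system would use the model's tokenizer.
        return max(1, int(len(text) / chars_per_token))

    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so the most recent context survives.
    for msg in reversed(messages):
        if used + cost(msg) > budget:
            break
        kept.append(msg)
        used += cost(msg)
    kept.reverse()

    dropped = len(messages) - len(kept)
    if dropped:
        # A real system would ask a model to summarise the dropped messages;
        # this placeholder is not counted against the budget.
        kept.insert(0, f"[summary of {dropped} earlier messages omitted]")
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]   # about 10 tokens each
print(compress_history(history, budget=25))
```

Whatever gets collapsed into the summary is no longer available verbatim, which is why quality tends to drop once compression kicks in.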
 

tphuang

General
Staff member
Super Moderator
VIP Professional
Registered Member
Why would Kimi K2.5 know which model is best at coding?

I actually do think Opus 4.6 is probably the best. But Chinese models handle all the tasks I have pretty well, so why would I pay money? Now, if you say it is a fact that Claude is widely used in China, maybe provide evidence that this is still the case?

Honestly, have you tried to use GLM-5-Turbo for coding tasks? If you don't try, how do you know it's not good?

Why would I need to tell it to refactor an engine class? What's wrong with my existing engine class? lol.

You make it sound like the vast majority of dev work is groundbreaking. I just need it to do shit work and it does that pretty well. Can it replace a guy that I would otherwise pay $150k a year? Yes, it can.
 

bsdnf

Senior Member
Registered Member
That obviously is not true. The only model with a true 1 million token context window is Deepseek.
Deepseek V4 has only demonstrated the potential to exceed 256K, and the actual improvement remains to be seen. Furthermore, if its baseline capabilities are insufficient, it will still lag behind context-compressed GPT or Opus.
 