Artificial Intelligence thread

european_guy

Junior Member
Registered Member
Actually I have read some arguments about this: Chinese may be a better language for training LLMs than English.

We all know that LLMs work on word tokens, but English is a creole-like phonetic language that has picked up all kinds of loan words from other languages, even in some very basic layers, so it may confuse people, and also LLMs. For example, how can you tell the relationship between pork, pig, and hog? Moreover, when we tokenize an English word, we may only get a random combination of letters. For example, if we tokenize "lighter" into "ligh" and "ter", the pieces do not have any meaning. If we split it into "light" and "er", the pieces do have meaning, but from the word itself we still don't see the direct relationship between a lighter and fire.
But Chinese is inherently tokenized by its characters: "打火机" (lighter) can be tokenized as "打" (strike), "火" (fire), "机" (device), so we, and an LLM, can see the inherent logical relationship between "lighter" and "fire". It is hard to say whether this is a huge advantage, but it would not surprise me if it were.
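To make the tokenization point concrete, here is a minimal sketch (my own illustration, not from the post) that prints how a byte-pair-encoding tokenizer splits a few English and Chinese strings. It assumes the tiktoken library and the cl100k_base vocabulary; the exact splits vary by tokenizer, and common words may well come out as a single token.

```python
# Minimal sketch: how a BPE tokenizer splits English words vs. Chinese characters.
# Assumes `pip install tiktoken`; exact splits depend on the chosen vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era byte-level BPE vocabulary

for text in ["lighter", "firelighter", "打火机", "打", "火", "机"]:
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")
```

Whatever the exact output, the underlying point stands: the subword pieces of an English word are not guaranteed to be meaningful units, whereas each Chinese character is a meaningful unit on its own (whether a given tokenizer actually respects character boundaries is a separate question).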

English is an interesting language:

1. Regarding its vocabulary, it is the most Indo-European language of all because it has a large amount of words from the other European languages: by origin it is a Germanic language, but half of its vocabulary is from Latin or languages strictly derived from it (French/Spanish/etc.) (f.i. in the sentence I have just written above, these are the words with a Latin origin: vocabulary, language, large, amount, Germanic, strictly, derived, sentence). It also has many words from Greek: Europe, music, cemetery, galaxy, democracy, etc. For historical reasons, it has much less from the other big branch of the European family, that is, the Slavic languages.

2. Regarding its grammar, instead, it is totally different from the others. It is much, much simplified. The original Proto-Indo-European language was quite a complex one, with eight or nine cases and a lot of other machinery. Today the Slavic languages keep a grammar similar to the original one; in ancient times Latin and Greek also had a similarly complex grammar; the other modern European languages simplified it somewhat... but English is the only one to have totally trivialized it. This is definitely a bonus for people who have to learn it: I knew a Dutch man who told me that New York was a Dutch town in origin and that it was only by chance that the US chose English as their language and not Dutch... well, it was definitely a stroke of luck for the whole world :) Dutch is way harder.

3. Finally, let me add what is probably the most peculiar aspect of the language: the pronunciation/spelling of words. For some unfathomable reason, somewhere during the medieval period, English speakers decided that they had to pronounce words differently from how they are written. It is the only European language that took this apparently uncomfortable road. In all the other languages you can just read the words as they are written, mainly because the written form was invented for exactly that purpose... but in Great Britain they begged to differ.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
So I found this comment on Reddit about Chinese being a more efficient language for AI:


I heard from a scientist in China that their models are faster and borderline superior with less data because they used Chinese as the internal language for their LLMs. It seems Chinese (or "Mandarin") has some superior language properties for chain of thought, grouping of topics, and implicit reasoning. It's fascinating because Western languages are "limited languages", so their whole databases (based on German, French, English, etc.) are less effective. They expanded topics by letting the LLMs explain topics in complex Chinese and used that dataset for retraining. The next logical step would be to find the ultimate symbolic language.

Chinese / Mandarin has something like 50,000 characters; the average mainland citizen uses 1,000 to 2,000 characters, but the language has the built-in property of being able to expand certain topics as far as possible, so you can describe whole domains in their complexity. Ca. 3,000 characters cover 98% of everything in the life of a modern Chinese citizen (blogs, newspapers, decrees, etc.); the rest, the rarely used 2% or ca. 47,000 characters, cover very specific topics in science, literature, and art...
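The coverage figures above (ca. 3,000 characters covering ~98%) are easy to sanity-check on any Chinese text you have at hand. Below is a rough sketch of that check, my own illustration rather than anything from the quoted comment; the corpus file name is a hypothetical placeholder and the numbers you get will depend entirely on the corpus.

```python
# Rough sketch: what fraction of running Chinese text do the top-N most
# frequent characters cover? The corpus path below is a hypothetical placeholder.
from collections import Counter

def coverage(text: str, top_n: int) -> float:
    # Keep only characters in the main CJK Unified Ideographs block.
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    counts = Counter(chars)
    total = sum(counts.values())
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total if total else 0.0

if __name__ == "__main__":
    corpus = open("zh_corpus.txt", encoding="utf-8").read()  # hypothetical file
    for n in (1000, 2000, 3000):
        print(f"top {n:>4} characters cover {coverage(corpus, n):.1%} of the text")
```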

E.g. in biology, in English you would use the term "cell", but it has many use cases as a term (cell membrane, battery, compartment, etc.); in Chinese you have specific words that really mean (biological) cells and nothing else. For example, the word "细胞" (xìbāo) means "cell" in biology, and it is different from "电池" (diànchí), which means "battery" (in a physics/energy context).

For some reason that's easier for LLMs: a super complex character set, but super exact at the same time.

This may be the reason why Chinese companies speed up with smaller datacenters and shorter training runs.

Someone also pointed out that this theory, that more data-dense languages are more efficient to train on, has actually been tested.
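The basic "density" comparison is easy to approximate yourself: tokenize parallel English/Chinese sentences with the same tokenizer and compare how many tokens each needs for the same content. The sketch below is my own illustration; the sentence pair is made up, tiktoken/cl100k_base is an assumed tokenizer choice, and a serious test would use a large parallel corpus and account for the fact that a tokenizer's vocabulary favours the languages it was trained on.

```python
# Crude density check: tokens needed for the same content in English vs. Chinese.
# Assumes tiktoken (`pip install tiktoken`); the sentence pair is a made-up example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("The lighter would not ignite because the fuel had run out.",
     "打火机点不着，因为燃料用完了。"),
]

for en, zh in pairs:
    print(f"EN: {len(enc.encode(en)):3d} tokens for {len(en)} characters")
    print(f"ZH: {len(enc.encode(zh)):3d} tokens for {len(zh)} characters")
```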



Wondering if @tphuang can shed some more light on this, but if true, it'd be interesting if Mandarin became de facto required for AI research.
I'm not involved in the research side of things; I will have to ask Steve Hsu about this.
The point is that there is a natural and large source of readily available Chinese-language data (the Chinese internet). Translating a large amount of data in other languages into Chinese (or any other language) correctly is a large and expensive undertaking. A large part of the cost of training models is also in data preparation. I'm not sure if American companies scrape the Chinese internet, but Chinese AI companies definitely do.
I would assume OpenAI has been scraping the Chinese internet for as much data as they can access.
 

tokenanalyst

Brigadier
Registered Member

The DeepSeek guys used some low-level, assembly-style register tinkering to heavily optimize the training of their models.

Does this imply that cutting-edge LLM development no longer needs large-scale GPU clusters? Were the massive computing investments by Google, OpenAI, Meta, and xAI ultimately futile?
The prevailing consensus among AI developers is that this is not the case. However, it is clear that there is still much to be gained through data and algorithms, and many new optimization methods are expected to emerge in the future. Since DeepSeek’s V3 model was released as open source, its technical report describes the work in great detail and documents the extent of low-level optimizations performed by DeepSeek. In simple terms, the level of optimization could be summed up as “it seems like they rebuilt everything from the ground up.” For example, when training V3 with NVIDIA’s H800 GPUs, DeepSeek customized parts of the GPU’s core computational units, called SMs (Streaming Multiprocessors), to suit their needs. Out of 132 SMs, they allocated 20 exclusively for server-to-server communication tasks instead of computational tasks.
This customization was carried out at the PTX (Parallel Thread Execution) level, a low-level instruction set for NVIDIA GPUs. PTX operates at a level close to assembly language, allowing for fine-grained optimizations such as register allocation and thread/warp-level adjustments. However, such detailed control is highly complex and difficult to maintain. This is why higher-level programming languages like CUDA are typically used, as they generally provide sufficient performance optimization for most parallel programming tasks without requiring lower-level modifications. Nevertheless, in cases where GPU resources need to be utilized to their absolute limit and special optimizations are necessary, developers turn to PTX. This highlights the extraordinary level of engineering undertaken by DeepSeek and demonstrates how the “GPU shortage crisis,” exacerbated by U.S. sanctions on China, has spurred both urgency and creativity.


That means they may also have bypassed CUDA, which could open the door for other GPU makers like Huawei Ascend or Moore Threads to be used in the training process.
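For a flavour of what PTX-level access actually looks like from a normal software stack, here is a tiny, purely illustrative sketch (my own, not DeepSeek's code): a CUDA kernel with one line of inline PTX, compiled and launched through CuPy, that reads the %smid special register so each thread block can report which Streaming Multiprocessor it ran on. That is roughly the kind of low-level visibility you would want before dedicating particular SMs to communication work; whatever DeepSeek actually did per their report is far more involved. Assumes an NVIDIA GPU and CuPy installed.

```python
# Purely illustrative (not DeepSeek's code): inline PTX inside a CUDA kernel,
# compiled and launched through CuPy. The %smid special register tells a block
# which Streaming Multiprocessor it is executing on.
import cupy as cp

kernel = cp.RawKernel(r'''
extern "C" __global__ void which_sm(unsigned int *out) {
    unsigned int smid;
    // One line of inline PTX: copy the SM-id special register into smid.
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        out[blockIdx.x] = smid;
}
''', 'which_sm')

num_blocks = 32
out = cp.zeros(num_blocks, dtype=cp.uint32)
kernel((num_blocks,), (128,), (out,))  # 32 blocks of 128 threads each
print("SM id seen by each block:", out.get())
```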

 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member


That means they may also have bypassed CUDA, which could open the door for other GPU makers like Huawei Ascend or Moore Threads to be used in the training process.
????? Ascend has been used in training by quite a few companies, what are you talking about?
I mean, they can tune Ascend chips that they acquire in the same way. There is nothing magical about this.

Kimi 1.5 by Moonshot is out and it's sad that they are probably not going to get much recognition for this because DeepSeek is sucking up all the oxygen.

I'm not involved in the research side of things; I will have to ask Steve Hsu about this.
So Steve says that using the Chinese language probably does help, but he is not sure it actually helps all that much.
 

siegecrossbow

General
Staff member
Super Moderator

The DeepSeek guys used some low-level, assembly-style register tinkering to heavily optimize the training of their models.


My NVIDIA investment is cooked but I’ve never been so happy about it.
 

mossen

Junior Member
Registered Member
Feel bad for Kimi. They keep reminding us that their model is out, it's a great model, but got completely overshadowed by DeepSeek. Shows that in life a lot is just timing/luck. Hopefully they get the recognition they deserve in the West eventually.

-----

CS professor complains that DeepSeek is not open enough to be called truly open source.


OK, but they are still sharing more information than most frontier model papers from closed labs. Also, they still need some amount of moat, no? I think they strike a nice balance.
 

9dashline

Captain
Registered Member

This is actually a pretty big deal. This is the first time an open-weight model has reached a 1-million-token context window; all prior ones maxed out at 128k.

This means enterprises can do local inference for many more use cases, and coupled with workflows combining Qwen's 1-million-token context + DS R1, it can now top ChatGPT o1 Pro in many meaningful ways (see "Claude + R1" > o1 Pro, but Claude isn't open weight, etc.).

While Qwen has had this for a while, they kept it behind a gated API... with this release it's a game changer for long-context workloads and use cases... it will probably get deployed in a lot of enterprises worldwide.

Coordinated or not, feels like China is doing a double tap rugpull on US ClosedAI etc


Dang that was quick...
 