Why do we even bother with this when we all know where Western chatbots stand on the Palestinian genocide?
Actually, I have read some arguments about this: Chinese may be a better language than English for training LLMs.
We all know that LLMs work on word tokens, but English is a phonetic language with creole roots and all kinds of loanwords from other languages, even at a very basic layer, which can confuse people and LLMs alike. For example, how can you tell the relationship between pork, pig, and hog? Moreover, when we tokenize an English word, we often just get an arbitrary combination of letters. For example, if we tokenize "lighter" into "ligh" and "ter", the pieces have no meaning of their own. If we split it into "light" and "er", the pieces do have meaning, but from the word itself we still don't see a direct relationship between a lighter and fire.
But Chinese is inherently tokenized, character by character. "打火机" (lighter) can be tokenized as "打" (strike), "火" (fire), "机" (device), so we, and an LLM as well, can see the inherent logical relationship between "lighter" and "fire". It is hard to say whether this is a huge advantage, but it would not surprise me if it were.
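To make the tokenization point concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (my choice for illustration; nobody in this thread names a specific tokenizer). It prints how a real BPE vocabulary actually splits both words. The exact pieces depend on the vocabulary, and that is the point: whatever comes out for "lighter" is an artifact of corpus frequency, not morphology, while "打火机", if split at all, tends to split into characters that each mean something on their own.

# pip install tiktoken
# Illustration only: shows how one real BPE vocabulary splits the two
# words discussed above. Exact splits depend on the vocabulary used.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era BPE vocabulary

for text in ["lighter", "打火机"]:
    ids = enc.encode(text)
    # Decode each token id back to its raw byte piece; note that Chinese
    # tokens may land on character boundaries, or even split a character's
    # UTF-8 bytes mid-way, since BPE operates on bytes.
    pieces = [enc.decode_single_token_bytes(t) for t in ids]
    print(f"{text!r} -> {ids} -> {pieces}")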
I'm not involved in the research side of things; I'll have to ask Steve Hsu about this. Anyway, I found this comment on Reddit about Chinese being a more efficient language for AI:
“
I heard from a scientist in China that their models are faster and borderline superior with less data because they used Chinese as the internal language for their LLMs. It seems Chinese (or "Mandarin") has some superior language properties for chain of thought, grouping of topics, and implicit reasoning. It's fascinating, because Western languages are "limited languages", so their whole databases (based on German, French, English, etc.) are less effective. They expanded topics by letting the LLMs explain topics in complex Chinese and used that dataset for retraining. The next logical step would be to find the ultimate symbolic language.
Chinese/Mandarin has some 50,000 characters; the average mainland citizen uses 1,000 to 2,000 of them, but the language has the "built-in" property of expanding certain topics as far as needed, so you can describe whole domains in their full complexity. About 3,000 characters cover 98% of everything in the life of a modern Chinese citizen (blogs, newspapers, decrees, etc.); the rest, the rarely used 2% or roughly 47,000 characters, cover very specific topics in science, literature, and art...
For example, in biology in English you would use the term "cell", but the term has many other uses (cell membrane, battery cell, compartment, etc.); in Chinese you have specific words that really mean (biological) cells and nothing else. The word "细胞" (xìbāo) means "cell" in biology, and it is different from "电池" (diànchí), which means "battery" (in a physics/energy context).
For some reason that's easier for LLMs: a super-complex character set, but super-exact at the same time.
This may be the reason why Chinese companies speed up with smaller data centers and shorter training runs.”
Someone also pointed out that this theory, that more data-dense languages are more efficient, has actually been tested and demonstrated here:
Wondering if @tphuang can shed some more light on this, but if true, it'd be interesting if Mandarin became de facto required for AI research.
I would assume OpenAI has been scraping the Chinese internet for as much data as it can access. The point is that there is a natural and large source of readily available Chinese-language data (the Chinese internet). Translating a large amount of data from other languages into Chinese (or any other language) correctly is a large and expensive undertaking, and a large part of the cost of training models is in data preparation. I'm not sure whether American companies scrape the Chinese internet, but Chinese AI companies definitely do.
????? Ascend has been used in training by quite a few companies, what are you talking about?
So Steve says that using the Chinese language probably does help, but he is not sure it actually helps all that much.
The DeepSeek guys used some low-level, assembly-style register tinkering to heavily optimize the training of their models.
Does this imply that cutting-edge LLM development no longer needs large-scale GPU clusters? Were the massive computing investments by Google, OpenAI, Meta, and xAI ultimately futile?
The prevailing consensus among AI developers is that this is not the case. However, it is clear that there is still much to be gained through data and algorithms, and many new optimization methods are expected to emerge in the future. Since DeepSeek's V3 model was released as open source, its technical report has been available, and it documents in great detail the extent of the low-level optimizations DeepSeek performed. In simple terms, the level of optimization could be summed up as "it seems like they rebuilt everything from the ground up." For example, when training V3 with NVIDIA's H800 GPUs, DeepSeek customized parts of the GPU's core computational units, called SMs (Streaming Multiprocessors), to suit their needs. Out of 132 SMs, they allocated 20 (roughly 15%) exclusively to server-to-server communication tasks instead of computation.
This customization was carried out at the PTX (Parallel Thread Execution) level, a low-level instruction set for NVIDIA GPUs. PTX operates at a level close to assembly language, allowing for fine-grained optimizations such as register allocation and thread/warp-level adjustments. However, such detailed control is highly complex and difficult to maintain. This is why higher-level programming languages like CUDA are typically used, as they generally provide sufficient performance optimization for most parallel programming tasks without requiring lower-level modifications. Nevertheless, in cases where GPU resources need to be utilized to their absolute limit and special optimizations are necessary, developers turn to PTX. This highlights the extraordinary level of engineering undertaken by DeepSeek and demonstrates how the “GPU shortage crisis,” exacerbated by U.S. sanctions on China, has spurred both urgency and creativity.
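To make the PTX point concrete, here is a minimal sketch (my own illustration, not DeepSeek's actual code) in Python using the CuPy library, whose RawKernel compiles CUDA C source at runtime. The kernel contains one line of inline PTX that reads the special register %smid, the hardware ID of the Streaming Multiprocessor a block landed on. Reading %smid is exactly the kind of thing PTX exposes and plain CUDA C does not, and it is the building block for pinning work, such as communication, to a reserved subset of SMs.

# pip install cupy-cuda12x  (requires an NVIDIA GPU; illustration only,
# not DeepSeek's code)
import cupy as cp

# CUDA C source containing one line of inline PTX. %smid is a PTX
# special register holding the ID of the SM executing the current block.
src = r'''
extern "C" __global__ void which_sm(unsigned int *out) {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0) out[blockIdx.x] = smid;  // one record per block
}
'''

kernel = cp.RawKernel(src, "which_sm")

n_blocks = 8
out = cp.zeros(n_blocks, dtype=cp.uint32)
kernel((n_blocks,), (32,), (out,))         # launch 8 blocks of 32 threads
print("SM ID for each block:", out.get())  # hardware-dependent output

A persistent-kernel scheme could then branch on this SM ID, dedicating blocks that land on, say, 20 of the H800's 132 SMs to communication and leaving the rest for compute; DeepSeek's real implementation is of course far more involved than this sketch.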
That means they may have also bypassed CUDA, which could open the door for other GPU makers like Huawei (Ascend) or Moore Threads to be used in the training process.