When I was young (about 30 years ago), certain circles in China argued that Chinese characters were more conducive to scientific research and technological development than English (the Latin alphabet). With a core set of 6,000-8,000 commonly used characters that can be flexibly combined to create modern technical terminology, Chinese lets readers grasp technical concepts intuitively by reading terms literally, without first acquiring specialized jargon. In contrast, English academic literature relies heavily on abbreviations that demand prolonged specialized study to decipher, which raises professional barriers. Chinese materials present minimal obstacles for cross-disciplinary learning: I can effortlessly navigate and comprehend various specialized documents within Chinese literature databases.
Actually, I have read some arguments about this: Chinese may be a better language for training LLMs than English.
We all know that LLMs work on tokens, but English is itself something of a creole, phonetically written language, with all kinds of loanwords from other languages even in its most basic layer, so it may confuse people, and LLMs too. For example, how can you tell from the words themselves that pork, pig and hog are related? Moreover, when we tokenize an English word, we often get only arbitrary combinations of letters. For example, if we tokenize "lighter" into "ligh" and "ter", the pieces have no meaning. If we split it into "light" and "er", the pieces have meaning, but from the word itself we still don't see the direct relationship between lighter and fire.
But Chinese is inherently tokenized, by the character itself. "打火机" (lighter) can be tokenized as "打" (strike), "火" (fire), "机" (machine), so we, and an LLM too, can see the inherent logical relationship between "lighter" and "fire". It is hard to say whether this is a huge advantage, but it would not surprise me if it is.
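To make the comparison concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the helper names (`split_characters`, `split_suffix`) and the small `GLOSS` table are made up for this example, and real LLM tokenizers are learned from data (often byte-level BPE), so they may break a Chinese character into byte pieces or merge several characters into one token rather than splitting cleanly per character.

```python
# Toy comparison of the two splits discussed above.
# All names here are hypothetical helpers, not any real tokenizer's API.

def split_characters(word: str) -> list[str]:
    """Split a word into its individual characters (one piece per character)."""
    return list(word)

def split_suffix(word: str, suffixes: tuple[str, ...] = ("er",)) -> list[str]:
    """Naively peel one known suffix off an English word, e.g. light + er."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]

# Rough English glosses for the three characters (for illustration only).
GLOSS = {"打": "strike", "火": "fire", "机": "machine"}

chinese_pieces = split_characters("打火机")
english_pieces = split_suffix("lighter")

print(chinese_pieces)                      # the three characters
print([GLOSS[c] for c in chinese_pieces])  # their rough meanings
print(english_pieces)                      # the stem and the suffix

# The character for "fire" is literally one of the Chinese pieces,
# while no piece of the English split mentions fire.
print("火" in chinese_pieces, "fire" in english_pieces)
```

Whether a trained model actually benefits from this depends on how its tokenizer was built, so this shows the intuition behind the argument, not evidence for it.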
Many assessments of China's technological capabilities overlook a critical factor: decades of sustained literature digitization and systematic translation and reorganization of English scientific works into Chinese educational resources. This infrastructure enables most Chinese researchers to learn and conceptualize in their native language, bypassing the cognitive burden of English interpretation. (In Shakespeare's era, knowing 20,000-30,000 English words counted as linguistic mastery, whereas modern English vocabulary has, by some counts, ballooned into the millions, making cross-disciplinary learning particularly challenging. Chinese learners face concentrated difficulty in elementary literacy acquisition, but by high school they achieve seamless cross-disciplinary comprehension across STEM fields, a stark contrast to Western educational trajectories.)
In my analysis, China stands as the world's second nation to establish a comprehensive independent scientific literature ecosystem. This system constitutes both a formidable competitive advantage and a strategic moat. The full magnitude of this advantage remains underappreciated today, likely requiring 20-30 years and a generation of bilingual researchers to articulate convincingly to non-Chinese audiences.
My recent experience with DeepSeek reveals fascinating linguistic dimensions. Its Chinese Q&A examples circulating online demonstrate astonishing linguistic sophistication. While users like myself employ various AI tools (ChatGPT included), there is growing recognition that Western observers miss DeepSeek's capabilities in classical Chinese composition and poetry, artistic expressions that reveal the richness of the Chinese language (and that make English expression appear comparatively impoverished when viewed through a Chinese cultural lens). Few recognize that DeepSeek's Chinese performance might surpass its English capabilities, a testament to its fundamentally Chinese cognitive architecture. This linguistic foundation, not mere technical superiority, may constitute DeepSeek's true strategic moat as it open-sources its technology globally.