I heard from a scientist in China that their models are faster and borderline superior with less data because they use Chinese as the internal language for their LLMs. It seems Chinese (or "Mandarin") has some superior language properties for chain of thought, grouping of topics, and implicit reasoning.
That's an interesting observation I didn't know about. It seems like a specific case of the more general idea of a "concept model" instead of a word-based model.
We know that LLMs work on word tokens, i.e. each word (or sub-word for long / unusual words) is mapped to a token, which is a numerical representation of the word. The LLM processes these tokens and outputs a new token, which is then mapped back to the next word that we see on our screen. The new word is fed back into the LLM and the process repeats for the next word of the LLM's output sentence.
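To make that loop concrete, here is a minimal sketch of word-token generation. The `tokenizer` and `model` objects (with `encode`, `decode`, `predict_next`, and `eos_id`) are hypothetical stand-ins, not a specific library:

```python
# Minimal sketch of the standard word-token generation loop described above.
# `tokenizer` and `model` are hypothetical stand-ins, not a real library API.

def generate(prompt, tokenizer, model, max_new_tokens=50):
    # Map each word / sub-word to an integer token id.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # One pass through the LLM yields exactly one new token...
        next_token = model.predict_next(tokens)
        if next_token == tokenizer.eos_id:
            break
        # ...which is appended and fed back in for the next step.
        tokens.append(next_token)
    # Token ids are mapped back to the text we see on screen.
    return tokenizer.decode(tokens)
```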
Now, there is ongoing research into replacing these word tokens with "concept tokens": instead of a single word, an entire sentence (or, in general, a chunk of text) is mapped to a single token that no longer represents a word but an entire "concept".
The LLM processes this concept token and generates a new one, which is then expanded into a whole sentence / chunk of text (see the sketch after the list below). This approach is promising for many reasons:
1. Efficiency: a single pass through the LLM generates a whole sentence instead of a single word.
2. Language independence: the concept token does not represent a word but an idea, so it is independent of the underlying language.
3. Intuitively closer to how our brain works: we usually don't spell out our thoughts as full text inside our heads; we only convert them into spoken / written form when we "output" them.
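Here is the same kind of sketch for the concept-token loop, to contrast with the word-token one above. The `sentence_encoder`, `concept_model`, and `sentence_decoder` components are hypothetical and not taken from any specific paper:

```python
# Minimal sketch of a concept-token generation loop. All three components
# are hypothetical stand-ins, not from any specific paper or library.

def generate_concepts(prompt_sentences, sentence_encoder, concept_model,
                      sentence_decoder, max_new_sentences=10):
    # Each whole sentence is mapped to a single "concept" vector.
    concepts = [sentence_encoder.encode(s) for s in prompt_sentences]
    output_sentences = []
    for _ in range(max_new_sentences):
        # One pass through the model yields one new concept...
        next_concept = concept_model.predict_next(concepts)
        concepts.append(next_concept)
        # ...which is expanded back into a full sentence / chunk of text.
        output_sentences.append(sentence_decoder.decode(next_concept))
    return " ".join(output_sentences)
```

Note how the concept vector only becomes a specific surface language at the decoding step, which is where the language-independence argument comes from.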
For those interested, I found this recent paper.
I remember reading another one that was IMHO even better, but I can't find it anymore...
P.S.: I actually found it!
This paper has the additional advantage of reading raw bytes (i.e. characters) instead of words. To make a long story short, this keeps the LLM from falling for token-induced artifacts like failing to correctly answer "How many R's are in strawberry?". That's not because the LLM is dumb, but because it does not see "strawberry" as a sequence of chars/bytes as we do, but as a single token.
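A tiny illustration of that artifact: at the character/byte level the count is trivial, while at the token level the letters are simply not visible. The token ids below are made-up values for the example, not from a real tokenizer:

```python
# Character/byte view vs. token view of "strawberry".

word = "strawberry"

# What a byte-level model (or a human) sees: individual characters.
print(list(word))       # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(word.count("r"))  # 3 -- counting R's is trivial in this view

# What a word/sub-word LLM sees: opaque token ids (hypothetical values),
# e.g. "straw" + "berry" -> two ids. The individual letters, and therefore
# the number of R's, are not directly visible in this view.
hypothetical_tokens = [34527, 15717]
print(hypothetical_tokens)
```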