I heard from a scientist in China that their models are faster and borderline superior with less data because they use Chinese as the internal language for their LLMs. It seems Chinese (or "Mandarin") has some superior language properties for chain of thought, grouping of topics, and implicit reasoning.
That's an interesting observation I didn't know about. It seems like a specific case of the more general idea of a "concept model" instead of a word-based model.
We know that LLMs work on word tokens, i.e. each word (or sub-word for long / unusual words) is mapped to a token, which is a numerical representation of the word. The LLM processes these tokens and outputs a new token, which is then mapped back to the next word that we see on our screen. The new word is fed back into the LLM and the process repeats for the next word of the LLM's output sentence.
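To make that loop concrete, here is a minimal sketch of word-token generation. The `tokenizer` and `model` objects (with `encode`, `decode`, `predict_next`, and `eos_id`) are hypothetical stand-ins, not a specific library:

```python
# Minimal sketch of the standard word-token generation loop described above.
# `tokenizer` and `model` are hypothetical stand-ins, not a real library API.

def generate(prompt, tokenizer, model, max_new_tokens=50):
    # Map each word / sub-word to an integer token id.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # One pass through the LLM yields exactly one new token...
        next_token = model.predict_next(tokens)
        if next_token == tokenizer.eos_id:
            break
        # ...which is appended and fed back in for the next step.
        tokens.append(next_token)
    # Token ids are mapped back to the text we see on screen.
    return tokenizer.decode(tokens)
```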
Now, there is ongoing research into replacing these word tokens with "concept tokens": instead of a single word, an entire sentence (or, in general, a chunk of text) is mapped to a single token that no longer represents a word but an entire "concept".
The LLM processes this concept token and generates a new one, which is then expanded into a whole sentence / chunk of text (see the sketch after the list below). This approach is promising for many reasons:
1. Efficiency: a single pass through the LLM generates a whole sentence instead of a single word.
2. Language independence: the concept token does not represent a word but an idea, so it is independent of the underlying language.
3. Intuitively closer to how our brain works: we usually don't spell out our thoughts as full text inside our heads; we only convert them into spoken / written form when we "output" them.
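Here is the same kind of sketch for the concept-token loop, to contrast with the word-token one above. The `sentence_encoder`, `concept_model`, and `sentence_decoder` components are hypothetical and not taken from any specific paper:

```python
# Minimal sketch of a concept-token generation loop. All three components
# are hypothetical stand-ins, not from any specific paper or library.

def generate_concepts(prompt_sentences, sentence_encoder, concept_model,
                      sentence_decoder, max_new_sentences=10):
    # Each whole sentence is mapped to a single "concept" vector.
    concepts = [sentence_encoder.encode(s) for s in prompt_sentences]
    output_sentences = []
    for _ in range(max_new_sentences):
        # One pass through the model yields one new concept...
        next_concept = concept_model.predict_next(concepts)
        concepts.append(next_concept)
        # ...which is expanded back into a full sentence / chunk of text.
        output_sentences.append(sentence_decoder.decode(next_concept))
    return " ".join(output_sentences)
```

Note how the concept vector only becomes a specific surface language at the decoding step, which is where the language-independence argument comes from.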
For those interested, I found this recent paper.
I remember reading another one that was IMHO even better, but I can't find it anymore...
P.S.: I actually found it!
This paper has the additional advantage of reading raw bytes (i.e. characters) instead of words. To make a long story short, this keeps the LLM from falling for token-induced artifacts like failing to correctly answer "How many R's are in strawberry?". That's not because the LLM is dumb, but because it does not see "strawberry" as a sequence of chars/bytes as we do, but as a single token.
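A tiny illustration of that artifact: at the character/byte level the count is trivial, while at the token level the letters are simply not visible. The token ids below are made-up values for the example, not from a real tokenizer:

```python
# Character/byte view vs. token view of "strawberry".

word = "strawberry"

# What a byte-level model (or a human) sees: individual characters.
print(list(word))       # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(word.count("r"))  # 3 -- counting R's is trivial in this view

# What a word/sub-word LLM sees: opaque token ids (hypothetical values),
# e.g. "straw" + "berry" -> two ids. The individual letters, and therefore
# the number of R's, are not directly visible in this view.
hypothetical_tokens = [34527, 15717]
print(hypothetical_tokens)
```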