Tencent announces its own LLM with >100B parameters & >2T tokens
they are a little late to this imo
Well, training a 100B model on 2T tokens takes a lot of time.
It took Meta about five months (from January to June) to train a 70B model (Llama 2) on 2T tokens.
As a ball-park estimate, training time is proportional to (model size * number of tokens) / number of GPUs.
Meta does not reveal how many GPUs Llama 2 was trained on, but you can bet it is a huge number!
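To put rough numbers on that proportionality, here is a minimal Python sketch using the common ~6 * parameters * tokens FLOPs approximation; the GPU type, cluster size, and utilization figures below are my own assumptions for illustration, not anything Meta or Tencent have disclosed.

```python
# Ball-park training time ~ (model size * number of tokens) / number of GPUs.
# Uses the rough "6 FLOPs per parameter per token" rule of thumb; hardware and
# utilization numbers are illustrative assumptions, not disclosed figures.

def training_days(params, tokens, num_gpus,
                  peak_flops_per_gpu=312e12,   # assumed A100 BF16 peak throughput
                  utilization=0.4):            # assumed fraction of peak actually achieved
    total_flops = 6 * params * tokens                        # ~6 FLOPs per param per token
    cluster_flops = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / cluster_flops / 86_400              # seconds -> days

# A 70B model on 2T tokens (Llama 2 scale) on a hypothetical 2,000-GPU cluster:
print(f"{training_days(70e9, 2e12, 2_000):.0f} days")    # ~39 days of pure compute

# A 100B model on 2T tokens (Tencent's claimed scale), same hypothetical cluster:
print(f"{training_days(100e9, 2e12, 2_000):.0f} days")   # ~56 days of pure compute
```

Even with a couple of thousand GPUs and optimistic utilization, you are looking at one to two months of non-stop compute, which is why the January-to-June timeline above is plausible.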
Currently, 2T tokens is state of the art: apart from GPT-4, which according to leaks has been trained on a whopping 13T tokens, all other big models are in the 2-3T token range or lower, and often a lot lower than that. For instance, the open-source 176B-parameter BLOOM model (honestly, not among the top ones) has been trained on
only about 350B tokens, i.e. nearly 6 times less than the new Tencent one.
Just to give an idea of what it means to train on such a big dataset, we can assume a token corresponds roughly to a word in English (it is actually a bit less than a word, about 70% of a word on average), and to a single character in Chinese.
Now, in English and other languages with a Latin alphabet, 1 page is about 500 words, so a 200-page book is about 100K words, and 1B tokens corresponds to the equivalent of training on 10K books.
In our case, Tencent's model has been trained on 2T tokens, i.e. the equivalent of
20 million books of 200 pages each.
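For anyone who wants to redo the arithmetic, here is a tiny Python sketch under the same assumptions (1 token ≈ 1 English word, ~500 words per page, 200 pages per book):

```python
# Rough "books equivalent" of a 2T-token training set, using the assumptions above.
tokens = 2e12                    # Tencent's claimed training set size
words_per_page = 500             # assumed words per page
pages_per_book = 200             # assumed pages per book
tokens_per_book = words_per_page * pages_per_book   # ~100K tokens per book

print(f"{tokens / tokens_per_book:,.0f} books")      # 20,000,000 books
```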