News on China's scientific and technological development.

FairAndUnbiased

Brigadier
Registered Member
China's AI effort is dictated by the same two forces that shape Western AI: chips and software.

China can no longer use CUDA or other Western software stacks. Building its own is possible, but it will take time. As for chips: it can't buy top-end GPUs anymore, and while it can design its own, it cannot fab them, because it is barred from using TSMC's bleeding-edge nodes.

As a result, China's AI efforts will be also-rans until and unless it manages to fix these two vulnerabilities. Probably not what people here want to hear, but I also attacked Zero Covid when lots of folks here defended it long past its usefulness. Harsh truths need to be told.
Don't need data? Don't need models? Don't need applications to pay for it? Don't need energy to run everything?
 

Eventine

Junior Member
Registered Member
You are discussing raw data; I am talking about datasets.
But it also applies to data sets. English language data sets are larger and more diverse than Chinese language data sets in the domains I mentioned. As a comparison, the LAION data set is 5.85 billion image-text pairs. There is no comparable Chinese language data set.
 

Michaelsinodef

Senior Member
Registered Member
Don't need data? Don't need models? Don't need applications to pay for it? Don't need energy to run everything?
Not to mention that for hardware (chips) and software, they can always set up some Chinese shell company in Vietnam and skirt around it lol.

Or for hardware, there was that news of a supercomputer built on older domestic Chinese chips. It just had to use a lot more of them, and at the end of the day, for supercomputers and the like, the interconnect and infrastructure probably matter more than the individual chip.
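
To put rough numbers on that trade-off (all figures below are made-up placeholders, not benchmarks from that supercomputer), the arithmetic is just a throughput ratio:

# Back-of-envelope sketch: how many older chips it takes to match a newer
# accelerator on raw throughput. All numbers are assumed placeholders.
new_chip_tflops = 300.0      # assumed per-chip throughput of a leading-edge GPU
old_chip_tflops = 75.0       # assumed per-chip throughput of an older domestic chip
cluster_size_new = 1000      # size of a hypothetical cutting-edge cluster

ratio = new_chip_tflops / old_chip_tflops
old_chips_needed = int(cluster_size_new * ratio)

print(f"~{old_chips_needed} older chips to match {cluster_size_new} newer ones")
# More chips means more nodes to wire together, which is exactly why the
# interconnect and system design matter as much as the individual chip.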
 

Coalescence

Senior Member
Registered Member
But it also applies to data sets. English language data sets are larger and more diverse than Chinese language data sets in the domains I mentioned. As a comparison, the LAION data set is 5.85 billion image-text pairs. There is no comparable Chinese language data set.
I keep seeing this "data is limited by language" argument tossed around a lot. Can anyone with a background in machine learning explain what's stopping Chinese companies from just translating the data/datasets into Chinese and correcting the translation errors, or training the model on English datasets and then translating the output?
 

Andy1974

Senior Member
Registered Member
But it also applies to data sets. English language data sets are larger and more diverse than Chinese language data sets in the domains I mentioned. As a comparison, the LAION data set is 5.85 billion image-text pairs. There is no comparable Chinese language data set.

LAION is not manually annotated, and thus is full of errors.

As a general rule, the more data annotators you have, the bigger and better your datasets are. From my experience with data annotation and trying to sell quality datasets to China, I have found it to be practically impossible, because ‘we just send it to the villages’.

The US has no such industry, and companies like Scale outsource annotation to countries like the Philippines and Malaysia, where they use low-paid, non-expert workers to annotate. My company, before it failed, also did this.

In contrast, China has annotation factories all over the country, and not all of the work is done by low-educated workers; by combining expert knowledge with manual annotation, they produce fantastic datasets.

They are not advertising how big their datasets are, but there are millions of annotators working full time on this right now.
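
To make the quality point concrete: a standard way annotation shops check their work is inter-annotator agreement, e.g. Cohen's kappa. A minimal sketch with toy labels (nothing here is from a real dataset):

# Minimal sketch: Cohen's kappa, i.e. agreement between two annotators
# corrected for chance agreement. Toy labels only.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = ["cat", "dog", "dog", "cat", "bird", "dog"]
annotator_2 = ["cat", "dog", "cat", "cat", "bird", "dog"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")  # ~0.74 here

Low agreement (kappa well below roughly 0.6) is usually the signal to retrain annotators or tighten the guidelines before scaling up.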
 

Eventine

Junior Member
Registered Member
As much as I don't like quoting Western sources for anything, it is a legitimate problem.

The issue is that the Chinese language internet is far more censored, far less diverse, and most importantly, a lot younger than the English language internet. So if you want a Chinese equivalent to ChatGPT, training on English is currently necessary.

Is that a fundamental obstacle? Probably not, because you can build a multi-language model that does text-to-vector embedding in a language-agnostic way as a first step. Built-in translation, so to speak.
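
A rough sketch of what that looks like in practice, assuming the sentence-transformers package and one of its off-the-shelf multilingual encoders (the model name is just an example, not something discussed in this thread):

# Sketch: language-agnostic text-to-vector embedding, so an English sentence
# and a Chinese sentence with the same meaning land close together in vector
# space. The model name is one example of a multilingual encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The weather is nice today.",
    "今天天气很好。",              # same meaning, in Chinese
    "The stock market fell sharply.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# The cross-lingual pair should score much higher than the unrelated pair.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())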

But it's certainly not a data advantage. Which is what was argued above.

China has an advantage in data in certain domains - computer vision being the most obvious due to all the cameras in Chinese cities. But not others.
 

Andy1974

Senior Member
Registered Member
I keep seeing this "data is limited by language" argument tossed around a lot. Can anyone with a background in machine learning explain what's stopping Chinese companies from just translating the data/datasets into Chinese and correcting the translation errors, or training the model on English datasets and then translating the output?
The first thing to realize is this: bigger is not better. Only quality counts. If you have a big dataset with errors, it is bad; a small dataset with no errors is better than a big dataset with errors.

The goal is to have a big, high-quality dataset, and the English-language resources on the internet do not qualify as high quality.

So, while your suggestion has merit, the results will still be low quality.

China has been focusing on image annotation, and they lead the world there; all they need to do is shift annotation resources to NLP for a while and they will lead here too.
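
Not from this thread, but to make the "translate, then correct errors" suggestion concrete: the cleanup step is usually a filter over (English, Chinese) pairs. The heuristics below are crude made-up stand-ins for real quality checks (back-translation scores, human spot checks, and so on):

# Hypothetical sketch of filtering machine-translated (English, Chinese) pairs.
# The length-ratio and duplicate checks are crude placeholders for real
# quality control, not a recipe anyone in this thread described.
def filter_translated_pairs(pairs):
    kept, seen = [], set()
    for en, zh in pairs:
        if not en or not zh:
            continue                  # drop empty translations
        if zh in seen:
            continue                  # drop exact duplicates
        ratio = len(zh) / len(en)     # Chinese is usually denser per character
        if not (0.2 <= ratio <= 1.5):
            continue                  # wildly off ratios often mean a broken translation
        seen.add(zh)
        kept.append((en, zh))
    return kept

sample = [
    ("The weather is nice today.", "今天天气很好。"),
    ("The weather is nice today.", ""),    # empty output: dropped
]
print(filter_translated_pairs(sample))     # keeps only the first pair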
 

Eventine

Junior Member
Registered Member
The first thing to realize is this: bigger is not better. Only quality counts. If you have a big dataset with errors, it is bad; a small dataset with no errors is better than a big dataset with errors.

The goal is to have a big, high-quality dataset, and the English-language resources on the internet do not qualify as high quality.

So, while your suggestion has merit, the results will still be low quality.
Quality is important but difficult to obtain at scale, especially with language models. The trick to models like ChatGPT is that they leverage the entire internet's "knowledge" to answer questions. If you tried to do that via manual labeling, it just wouldn't work, because it would be the equivalent of trying to reproduce the entire internet's worth of knowledge by hand.

It's like Google Search, in that regard. Baidu doesn't have a worse algorithm, necessarily; it just has a worse internet to work with.
 

Quan8410

Junior Member
Registered Member
ChatGPT is a generative AI, not a predictive or prescriptive AI, and to a certain extent data quantity matters a ton. It needs a lot of data to generate answers that resemble real life as much as possible, not necessarily correct answers. The Chinese internet is entirely different from the Western internet, so if you expect a Chinese ChatGPT, you could be disappointed. Knowing what you should aim for, rather than merely copying others, is the key to moving forward.
 