But it also applies to data sets. English language data sets are larger and more diverse than Chinese language data sets in the domains I mentioned. As a comparison, the LAION data set is 5.85 billion image-text pairs. There is no comparable Chinese language data set.
LAION is not manually annotated, and thus is full of errors.
As a general rule, the more data annotators you have, the more and better your datasets are. From my experience with data annotation and trying to sell quality datasets to China, I have found it to be practically impossible, because the answer is always ‘we just send it to the villages’.
The US has no such industry, and companies like Scale outsource annotation to countries like the Philippines and Malaysia, where they use low-paid, non-expert workers to annotate. My company, before it failed, also did this.
In contrast, China has annotation factories all over the country. Not all of the work is done by poorly educated workers; much of it combines expert knowledge with manual annotation, which produces fantastic datasets.
They are not advertising how big their datasets are, but there are millions of annotators working full time on this right now.