The problem facing Chinese domestic AI chips is the ecosystem: software compatibility and ease of use. Those gaps push the overall cost of ownership above Nvidia's. That is why, even with Nvidia chips selling at sky-high prices, Chinese companies are still rushing to buy them.
In fact, beyond the gap in hardware performance, the software ecosystem is another weak point for domestic AI chip makers.
A chip has to be adapted across multiple layers, including the hardware system, the toolchain, and the compiler, and that adaptation has to be robust. Otherwise the same chip may deliver 90% of its computing power in one workload but only 80% in another.
As mentioned above, Nvidia has a clear advantage here. As early as 2006, Nvidia launched its computing platform CUDA, a parallel-computing software engine. The CUDA framework bundles much of the code needed to invoke GPU computing power, so engineers can use it directly instead of writing everything from scratch. With CUDA, developers can run AI training and inference more efficiently and extract more of the GPU's performance. Today CUDA has become part of the basic infrastructure of AI: mainstream AI frameworks, libraries, and tools are all built on top of it.
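To make the lock-in concrete, here is a minimal sketch of what CUDA code looks like. It is a toy vector-add example, not taken from the article: the `<<<blocks, threads>>>` launch syntax, the runtime API calls, and the nvcc compiler are all Nvidia-specific, which is exactly the switching cost described here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements; CUDA's runtime and
// compiler (nvcc) schedule these threads across the hardware.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host-side buffers
    float* ha = (float*)malloc(bytes);
    float* hb = (float*)malloc(bytes);
    float* hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device-side buffers: cudaMalloc/cudaMemcpy are CUDA runtime APIs
    float *da, *db, *dc;
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Kernel launch with CUDA-specific syntax; code written this way
    // assumes Nvidia's toolchain end to end.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    cudaDeviceSynchronize();

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Any rival chip must either reimplement a compatible layer for code like this or persuade developers to rewrite it for a different toolchain.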
Without this programming layer, it would be vastly harder for software engineers to extract the hardware's value.
GPUs and AI chips from vendors other than Nvidia must supply their own adaptation software to hook into the CUDA ecosystem. One industry insider recalls dealing with a non-Nvidia GPU vendor: although the vendor quoted lower prices for chips and services and promised more responsive support, the overall cost of training and development on its GPUs would have been higher than on Nvidia's, with added uncertainty around results and development timelines.
Nvidia GPUs are expensive, yet in practice they end up being the cheapest to use. For companies intent on seizing the large-model opportunity, money is usually not the constraint; time is the scarcer resource, and everyone must secure enough advanced computing power as fast as possible to lock in a first-mover advantage.
So for domestic chip suppliers, even if stacking chips can yield a product with comparable raw computing power, software adaptation and compatibility make it a hard sell to customers. And from a server-operations perspective, the extra motherboard costs, electricity, and operating expenses, along with the power-draw and cooling issues they bring, substantially raise a data center's operating costs.
Because computing power usually has to be offered as a pooled resource, data centers prefer to standardize on a single chip, or at least on chips from a single vendor, to keep pooling manageable.
Unlocking computing power takes intricate hardware-software co-engineering to turn a chip's theoretical performance into effective performance. For customers, putting domestic AI chips to work is not easy: switching cloud AI chips carries real migration costs and risks, and unless the new product offers a performance edge or solves a problem in some dimension that nobody else can, customers have little appetite to switch.
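A back-of-the-envelope way to see the gap between theoretical and effective compute (illustrative numbers, not from the article): effective compute = peak compute x sustained utilization. A chip rated at 300 TFLOPS that the software stack can only drive at 40% utilization delivers 120 effective TFLOPS, while a 250 TFLOPS chip with a mature stack running at 55% delivers about 138. The nominally slower chip does more real work, which is why the software ecosystem, not the spec sheet, decides the purchase.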