Ultra-high-speed Lossless Networks for Exascale Computing Power
A major technical breakthrough by Huawei will ramp up the performance of computing and storage networks.
By Li Xinyuan, Huawei
Huawei's R&D efforts have resulted in a breakthrough that has greatly improved the performance of computing and storage networks: fully lossless Ethernet.
At Huawei Connect 2021, Wang Lei, President of the Data Center Network Domain of Huawei Data Communication Product Line, invited customer experts to discuss how Huawei's hyper-converged data center network can help improve supercomputing power and artificial intelligence. The experts also shared their insights into and visions for supercomputing applications.
In Liu Cixin's science-fiction novel The Three-Body Problem, "the world's most powerful computers," used to simulate nuclear explosions, "can perform 500 trillion floating-point operations per second." That sounds like an enormous number. However, on July 1, 2021, China's Pengcheng Cloud Brain II topped the IO500 ranking at the World Supercomputing Conference, with performance exceeding one quintillion floating-point operations per second (FLOPS), 2,000 times as fast as the most powerful computers in The Three-Body Problem.
Real-world supercomputing has gone beyond the imaginings of science fiction.
Huawei Tech: We know that supercomputers have tremendous computing power. In layman’s terms, what does that mean?
Wang Lei: Exascale computing is truly powerful. Technically speaking, it means 10^18 FLOPS, equivalent to what half a million of our latest laptops can do combined. AlphaGo, the AI that beat Lee Sedol at Go a few years ago, was powered by a petaflop-level supercomputer [more than one quadrillion FLOPS]. Today's exascale computers are 1,000 times as powerful as the computer that powered AlphaGo.
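The figures quoted so far can be checked with simple unit arithmetic. The short Python sketch below only restates the numbers given in the article; the variable names are illustrative, and the per-laptop figure is merely what the "half a million laptops" comparison implies, not a measured value.

```python
# Unit arithmetic for the figures quoted in the article.
PETA = 10**15  # one quadrillion
EXA = 10**18   # one quintillion

novel_flops = 500 * 10**12  # "500 trillion" FLOPS in The Three-Body Problem
laptops = 500_000           # "half a million" laptops

exa_vs_peta = EXA // PETA           # exascale versus petascale
exa_vs_novel = EXA // novel_flops   # exascale versus the novel's computers
flops_per_laptop = EXA // laptops   # implied by the laptop comparison

print(f"Exascale vs. petascale: {exa_vs_peta:,}x")
print(f"Exascale vs. the novel: {exa_vs_novel:,}x")
print(f"Implied FLOPS per laptop: {flops_per_laptop:.1e}")
```

The ratios come out to 1,000 and 2,000, matching the comparisons in the text, and the laptop comparison implies roughly two trillion FLOPS per laptop.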
Huawei Tech: Is building a supercomputer simply a matter of stacking many computers together?
Customer 1: The consensus in the supercomputing world is that "1 + 1 < 2". For example, when we bind 100 computers together, do they deliver 100 times the computing power of a single computer? The answer is definitely no. For two connected computing units to work together, complex coordination is required, and the underlying condition is ensuring unimpeded communication between them. Lossless communication channels mean low latency and high efficiency for communications. Huawei's lossless network was developed to solve these kinds of communication problems.
Wang Lei: When we say "supercomputer", we are referring to a super computing cluster. It appears to be just a bunch of computing units clustered together, but the underlying super-fast network connecting them is what enables them to run at high speeds. Therefore, a supercomputer is not just a stack of regular computers, but an integrated system supported by an ultra-high-speed lossless network.
Customer 2: Both supercomputing centers and AI computing centers need high-speed networks and software to centrally manage physically scattered computers, forming a logically unified computing cluster that serves as a resource pool available on demand.
Until now, supercomputing clusters that relied on Ethernet interconnect have always suffered from packet loss.
Huawei Tech: Ethernet is inherently prone to packet loss. What led Huawei to believe that it could solve a problem that has been around for 40 years? Has Huawei overcome this challenge?
Wang Lei: Huawei overcame this technical challenge two years ago. When we first began investing in computing research, our researchers found that simply binding servers together did not produce a linear increase in computing power. For example, they found that doubling the number of GPU servers increased computing power by just 4%. By analyzing the computing process, we found that the problem was caused by packet loss, an inherent problem with conventional Ethernet. Packet loss of just 0.1% can result in a computing power loss of 50%, meaning half of the servers' computing power is wasted. To address this, Huawei began examining how we might create lossless Ethernet networks. We finally solved this problem two years ago, thereby realizing 100% utilization of servers' computing power.
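To put those numbers in perspective, the sketch below computes how far the reported scaling falls short of the linear ideal. It is only a restatement of the figures quoted in the interview, not a model of how packet loss degrades performance; the function names and the nominal value of 100 units are hypothetical and chosen for illustration.

```python
def scaling_efficiency(scale_factor: float, observed_gain: float) -> float:
    """Return the achieved speedup as a fraction of the ideal linear speedup.

    scale_factor: how many times larger the cluster became (2.0 = doubled).
    observed_gain: fractional increase in computing power (0.04 = 4%).
    """
    achieved_speedup = 1.0 + observed_gain
    return achieved_speedup / scale_factor


def usable_compute(nominal: float, compute_loss: float) -> float:
    """Return the computing power left after a given fractional loss."""
    return nominal * (1.0 - compute_loss)


# Doubling the GPU servers for a 4% gain, as quoted in the interview.
doubling = scaling_efficiency(2.0, 0.04)

# 100 nominal units of compute with the quoted 50% loss, then with no loss.
lossy = usable_compute(100.0, 0.50)
lossless = usable_compute(100.0, 0.0)

print(f"Scaling efficiency after doubling: {doubling:.0%}")
print(f"Usable compute with packet loss: {lossy:.0f} of 100 units")
print(f"Usable compute on a lossless network: {lossless:.0f} of 100 units")
```

By this measure, a doubled cluster that gains only 4% is running at about half of its ideal output, which is consistent with the 50% loss figure cited above.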
Huawei Tech: The digital economy is said to have entered the computing era. But will supercomputing have any impact on the daily lives of ordinary people?
Wang Lei: For most people, supercomputing seems very remote, because the technology is mainly used in relatively high-level applications such as weather forecasting, earthquake monitoring, and human genetic testing. However, supercomputing is much closer to our daily lives than most of us realize.
For example, in recent years, it’s played a role in increasing the variety of new, affordable cars. Vehicle crash testing is one of the most time- and investment-intensive processes in automobile manufacturing. Using physical vehicles for testing means each crash results in a scrapped testing vehicle, and the cost can add up to millions of RMB. However, using supercomputers to simulate crash tests can shorten the development cycle of new cars from 36 months to 12 months. Now, with Huawei's hyper-converged data center network, that process can be expedited even further.
Customer 1: We can look at supercomputing and its implications from the public health perspective. In the early days of the pandemic, we didn’t understand COVID-19 very well. Through extensive analysis, it was later found that the cytokine storm was an important factor increasing the mortality rate. Supercomputing played a major role in deepening our knowledge. Scientists and doctors working together found that the overreaction of the human immune system to the invading virus affected certain normal bodily functions and led to the failure of those functions. With the support of supercomputing, a way to cut off the cytokine storm signal pathways was discovered. This knowledge was put to good use in Wuhan, where it saved lives.
Customer 2: If we compare AI computing power to electric power, the AI computing center is like a large-scale power station. AI applications, like electricity, will be widely used in numerous industries and households. The use of AI will make urban management more precise – self-driving vehicles and license plate recognition are examples of AI in our daily lives and urban management.
A more powerful future
Albert Einstein once said, "We cannot solve our problems with the same level of thinking that created them." The history of humanity has been a long process of creating and solving problems. Our understanding of supercomputers and AI will become clearer as technology advances, and technological innovation in supercomputing will play a key role in this process. We believe that Huawei will provide high-quality computing infrastructure for scientific research in all kinds of industries and key fields, powering economic growth and social development, and enabling everyone to step into the computing era.