So, it looks like we might see Tianhe-2 reach the 100 PFLOPS mark later this year, using new domestic processors to get there despite the US ban on selling Intel processors.
100 PFLOPS: CHINA’S SUPERCOMPUTER CIRCUMVENTS U.S. SALES BAN
APRIL 13, 2016
China's Tianhe-2 supercomputer is the world's fastest supercomputer, with 33 PFLOPS demonstrated and 55 PFLOPS theoretical performance.
A year ago, we revealed that the U.S. State Department had blocked further sales of Intel Xeon and Xeon Phi processors to Chinese institutions, most notably for the Tianhe-2 supercomputer. The U.S. administration also blocked a deal in which a China-based investment fund would have invested in AMD, which was one of the original reasons for creating the Radeon Technologies Group. Even without that investment, the group is performing above and beyond its financial capabilities.
The reason to move against Tianhe-2 is complicated yet simple: ever since its debut in June 2013, the Tianhe-2 supercomputer from NUDT (National University of Defense Technology) has sat on top of the list of the world’s 500 fastest computers. From the looks of it, Tianhe-2 (the name translates to ‘Milky Way’) will stay on top even after the launch of the U.S. supercomputers Summit and Sierra (IBM + Nvidia), as well as Aurora and Theta (Intel).
With its 32,000 Intel Xeon E5-2692 v2 processors and 48,000 Intel Xeon Phi 31S1P co-processors, Tianhe-2 delivers a peak performance of 54.9 PFLOPS and a sustained performance of 33.86 PFLOPS. What is little known is that Tianhe-2 is not a fully built supercomputer. In fact, Tianhe-2 has been operating at roughly 50% capacity, as the original target for the system was 100 PFLOPS peak and 80 PFLOPS sustained.
According to our sources, China did not react in the way the current administration expected. Rather than applying pressure with (empty) threats that would affect commerce between two of the world’s largest economies, China invested all the funds intended for Intel and other foreign vendors into the development of in-house Alpha and ARM superprocessors, which have the potential to beat the traditional x86 architecture. In terms of funds, NUDT had planned to buy 32,000 more Xeon processors (this time based on Haswell-E) and 48,000 more Xeon Phi co-processors. We’ve been hearing that over $500 million was invested in bringing the Chinese silicon from the prototype phase to a production-grade level.
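To put the peak and sustained figures in perspective, here is a minimal Python sketch that computes the Linpack efficiency (sustained divided by peak) for the current system and for the original design target, plus the fraction of the target peak that has been built. It uses only the numbers quoted above; the variable and function names are ours, chosen for illustration.

```python
# Figures quoted in the article for Tianhe-2.
RPEAK_PFLOPS = 54.9          # theoretical peak of the current system
RMAX_PFLOPS = 33.86          # sustained (Linpack) result of the current system
TARGET_RPEAK_PFLOPS = 100.0  # original design target, peak
TARGET_RMAX_PFLOPS = 80.0    # original design target, sustained


def efficiency(sustained_pflops, peak_pflops):
    """Fraction of the theoretical peak that is actually sustained."""
    return sustained_pflops / peak_pflops


current_efficiency = efficiency(RMAX_PFLOPS, RPEAK_PFLOPS)
target_efficiency = efficiency(TARGET_RMAX_PFLOPS, TARGET_RPEAK_PFLOPS)
built_fraction = RPEAK_PFLOPS / TARGET_RPEAK_PFLOPS

print(f"current efficiency: {current_efficiency:.1%}")
print(f"target efficiency:  {target_efficiency:.1%}")
print(f"built vs. target:   {built_fraction:.1%}")
```

On these numbers the current system sustains a bit under 62% of its peak, while the original target implied 80%, and the installed peak is about 55% of the planned 100 PFLOPS, which is where the "roughly 50% capacity" figure comes from.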
The New Tianhe-2: Meet The 100 PFLOPS Supercomputer
At the 2016 Supercomputing Frontiers conference in Singapore, we learned the first details of the fully developed Tianhe-2 supercomputer, scheduled to debut in June 2016 during the 2016 International Supercomputing Conference in Frankfurt, Germany. This system is expected to deliver over 100 PFLOPS peak performance, and keep the crown of the world’s fastest (super)computer.
The new Tianhe-2 is a hybrid design featuring two new additions, as the old Xeon Phi cards are being phased out. Phytium Technology recently delivered its “Mars” processors in the form of PCI Express cards that replace the Xeon Phi cards, along with motherboards to upgrade the system. Given that there are 48,000 add-in boards installed, the new 64-core design enables the system to reach its original performance targets. With the three million new ARM cores inside Tianhe-2, its estimated Rpeak, the theoretical peak reported alongside the Linpack result, should exceed 100 PFLOPS.
Should Tianhe-2 reach its full deployment of 32,000 Xeons, 32,000 ShenWei processors, and 96,000 Phytium accelerator cards, we might see an upgrade to the range of 200-300 PFLOPS, if the building can withstand the associated thermal and power challenges.
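The "three million ARM cores" figure follows directly from the board count. A quick sketch, assuming one 64-core Mars processor per add-in board (the article does not state the chip count per board for this figure, so `chips_per_board` is our assumption):

```python
ADD_IN_BOARDS = 48_000     # add-in boards installed, per the article
CORES_PER_PROCESSOR = 64   # cores in one Mars processor, per the article


def total_cores(boards, cores_per_processor, chips_per_board=1):
    """Total ARM cores; chips_per_board is an assumption, not a published spec."""
    return boards * chips_per_board * cores_per_processor


cores = total_cores(ADD_IN_BOARDS, CORES_PER_PROCESSOR)
print(f"{cores:,} cores")  # 3,072,000 cores
```

With a single chip per board this gives 3,072,000 cores, matching the rounded three million quoted above.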
Meet Phytium Mars, A 64-Core ARM Superprocessor
In August 2015, a little-known company, Phytium Technology, planned to demonstrate its “Mars” processors at the Hot Chips conference in Cupertino, CA. However, its lead scientist was denied a visa to enter the U.S., and we could not see the physical boards featuring this extremely powerful processor. The slide above shows the base architecture of the initial engineering sample; the final delivered boards feature significantly higher performance specifications.
Mars processor silicon
While we were not privy to the final silicon, we know that performance went up almost threefold, and that the final production board delivers 1.5 TFLOPS of compute power, most probably in a dual-chip arrangement (akin to the Tesla K80 and FirePro S9300 x2).
There are several implementations of this processor in Tianhe-2: an add-in card that replaces the Xeon Phi, and motherboards featuring upgradable memory, all using very affordable DDR3-1600 memory. Phytium Technology delivered motherboards with multiple processors and up to 256 GB per Mars processor. In a typical implementation, the company achieves a “triple 64”: 64-bit ARM cores inside a 64-core processor attached to 64 GB of memory over an 8-channel memory interface, not the 16-channel interface mentioned in the slides, which is for onboard (G)DDR memory.
The bottom line is that the sales restriction enabled a small startup to deliver a product that achieves higher performance than the products it was supposed to replace. All in all, it is a win for NUDT and for a small company that ‘no one ever heard of’. We will see how the market develops, and whether there is space for Phytium Technology in the supercomputing market. Tianhe-2 might be just the beginning.
Also, this is not the only development coming from mainland China. Jiāngnán Computing Lab has successfully developed a new multi-core Alpha processor. Considered a sixth-generation design, the ShenWei Alpha processors achieve more than 1 TFLOPS of compute performance. However, we were not able to confirm what volumes are involved in the new batch of ShenWei processors. What makes them mysterious is that Wikipedia lists only three generations of their Alpha processors, while the scientists are talking about fifth, sixth and seventh generations.
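As a rough back-of-the-envelope check, the per-board figure can be scaled to the board counts quoted in this article. The sketch below assumes that 1.5 TFLOPS is the peak of one add-in board and that all boards are identical; neither is confirmed, and the helper name is ours.

```python
BOARD_TFLOPS = 1.5               # per production board, as quoted in the article
INSTALLED_BOARDS = 48_000        # add-in boards currently installed
FULL_DEPLOYMENT_BOARDS = 96_000  # accelerator cards at full deployment


def aggregate_pflops(boards, tflops_per_board):
    """Sum of per-board peaks, converted from TFLOPS to PFLOPS."""
    return boards * tflops_per_board / 1000.0


installed_pflops = aggregate_pflops(INSTALLED_BOARDS, BOARD_TFLOPS)
full_pflops = aggregate_pflops(FULL_DEPLOYMENT_BOARDS, BOARD_TFLOPS)

print(f"48,000 boards: {installed_pflops:.0f} PFLOPS")
print(f"96,000 boards: {full_pflops:.0f} PFLOPS")
```

Under these assumptions the accelerator cards alone account for 72 PFLOPS today and 144 PFLOPS at full deployment. The rest of any system-wide peak would have to come from the host processors, which the article does not break down.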
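The memory configuration also allows a quick estimate of theoretical bandwidth. This sketch assumes standard 64-bit (8-byte) DDR3 channels, which the article does not spell out; DDR3-1600 performs 1600 million transfers per second per channel.

```python
TRANSFERS_PER_SEC = 1600e6  # DDR3-1600: 1600 mega-transfers per second
BUS_WIDTH_BYTES = 8         # assumption: standard 64-bit DDR3 channel
CHANNELS = 8                # memory channels per processor, per the article
CORES = 64                  # cores per Mars processor, per the article
MEMORY_GB = 64              # typical memory per processor, per the article

# Theoretical peak bandwidth, ignoring refresh and protocol overhead.
per_channel_gbs = TRANSFERS_PER_SEC * BUS_WIDTH_BYTES / 1e9
total_gbs = per_channel_gbs * CHANNELS
memory_per_core_gb = MEMORY_GB / CORES

print(f"per channel:     {per_channel_gbs:.1f} GB/s")
print(f"all channels:    {total_gbs:.1f} GB/s")
print(f"memory per core: {memory_per_core_gb:.1f} GB")
```

That works out to a theoretical 12.8 GB/s per channel and 102.4 GB/s per processor, with 1 GB of memory per core in the typical 64 GB configuration.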