It appears to be very significant that China has managed to build its first petaflop-class supercomputer.
It's incredible to see how many leaps in technology capability China has made in a time span of only 20 years.
Tianhe-1, China's first Petaflop/s scale supercomputer
Fri, 2009-11-13 14:49
The Chinese National University of Defense Technology (NUDT) recently unveiled China’s fastest supercomputer, also the world's fifth-fastest, which is theoretically able to perform more than one quadrillion calculations per second at its peak speed.
The Tianhe-1 (TH-1) supercomputer has been built by NUDT for NSCC-TJ. The TH-1 system will be a key node linked into the national grid of China, providing high performance computing services for the Tianjin area and the northeast of China. NSCC-TJ plans to use the system to solve computing problems in data processing for petroleum exploration and in the simulation of large aircraft designs. Other uses for the TH-1 supercomputer include the sciences and the financial, automotive and shipping industries. The TH-1 supercomputer was installed in Changsha in the middle of 2009, and will be moved to NSCC-TJ soon.
The TH-1 system is a hybrid design with Intel Xeon processors and AMD GPUs. The TH-1 uses AMD GPUs as accelerators. Each node consists of two AMD GPUs attached to two Intel Xeon processors.
The TH-1 is made up of 80 compute cabinets containing 2560 compute nodes and 512 operation nodes. Two kinds of Intel processors are used in the system: 4096 Intel Xeon E5540 processors with a frequency of 2530MHz and 1024 Intel Xeon E5450 processors with a frequency of 3000MHz. The L1 cache size of both the E5540 and E5450 is 128KB; their L2 cache sizes are 1MB and 12MB respectively, and the E5540 additionally has an 8MB L3 cache. Each compute node has two Intel Xeon processors and 32GB of memory. ATI Radeon HD 4870 X2 GPUs are attached via PCI-E connections on each compute node. The maximum power recorded during the execution of LINPACK was 0.58KW per node. The Tianhe system used 20480 CPU cores (2 CPUs × 4 cores × 2560 nodes) and 4,096,000 SPUs (1600 stream processing units × 2560 nodes) while executing the LINPACK benchmark; the cores of nodes not in operation during the run were not included in the measurements. Each operation node also has two Intel Xeon E5450 processors and 32GB of memory. The theoretical peak performance and the total memory of the whole system, including compute nodes and operation nodes, are 1.206PFlops and 98,304GB.
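As a sanity check, the quoted 1.206PFlops peak can be reproduced from the component counts above, under a few assumptions the article does not state explicitly: 4 double-precision flops per cycle per Xeon core, 320 double-precision flops per cycle per RV770 GPU die (two dies per HD 4870 X2 card), and GPUs counted at a reduced 575MHz clock.

```python
# Back-of-envelope check of the quoted 1.206 PFlops theoretical peak.
# Assumptions (not stated in the article): 4 DP flops/cycle per Xeon core,
# 320 DP flops/cycle per RV770 GPU die, GPUs clocked at 575 MHz.

FLOPS_PER_XEON_CORE = 4  # DP flops per cycle per core (SSE2 add + mul)

# 2048 compute nodes: 2x E5540 (4 cores @ 2.53 GHz) each
e5540 = 2048 * 2 * 4 * 2.53e9 * FLOPS_PER_XEON_CORE
# 512 compute nodes + 512 operation nodes: 2x E5450 (4 cores @ 3.0 GHz) each
e5450 = (512 + 512) * 2 * 4 * 3.0e9 * FLOPS_PER_XEON_CORE
# 2560 compute nodes: one HD 4870 X2 = 2 RV770 dies, 320 DP flops/cycle @ 575 MHz
gpu = 2560 * 2 * 320 * 575e6

peak_pflops = (e5540 + e5450 + gpu) / 1e15
print(f"{peak_pflops:.3f} PFlops")  # -> 1.206 PFlops
```

Under these assumptions the CPU side contributes roughly 0.26PFlops and the GPUs roughly 0.94PFlops, summing to the published figure.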
The compute nodes with E5540 processors are connected to 9 first-stage Infiniband switches. Each first-stage switch is connected to each second-stage switch through 18 uplinks, which makes a total of 72 uplink connections for 4 second-stage switches. The compute nodes with E5450 processors are connected by 64 Infiniband switch modules in the cabinets. Each switch module is connected to the second-stage switches through 8 uplinks.
DGEMM and DTRSM, the core routines of the LINPACK benchmark, are accelerated through the cooperation of CPU and GPU on the TH-1 system.
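As a rough illustration of such CPU-GPU cooperation (not TH-1's actual code), a hybrid DGEMM can split the result matrix by columns, handing one slice to the GPU and computing the rest on the CPU. Here `gpu_dgemm` and the `gpu_fraction` parameter are hypothetical stand-ins; the "GPU" half simply runs on the CPU via NumPy:

```python
import numpy as np

def gpu_dgemm(A, B):
    # Stand-in for the GPU DGEMM call (ACMLG on the real system).
    return A @ B

def hybrid_dgemm(A, B, gpu_fraction=0.7):
    """Split C = A @ B by columns of B between GPU and CPU."""
    n = B.shape[1]
    split = int(n * gpu_fraction)        # columns handed to the GPU
    C_gpu = gpu_dgemm(A, B[:, :split])   # issued by the controlling core
    C_cpu = A @ B[:, split:]             # remaining cores (MKL on TH-1)
    return np.hstack([C_gpu, C_cpu])

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(hybrid_dgemm(A, B), A @ B)
```

In a real implementation the two halves would execute concurrently, and the split ratio would be tuned to the relative CPU and GPU throughput.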
Two processes run on each node, each executed by a CPU-GPU pair. Three of the CPU's cores execute part of the computing tasks, and the remaining core controls the GPU's participation in the computation. The part of the program executing on the CPU uses the Intel MKL 10.2.1.017 library, and the part on the GPU uses the AMD ACMLG 1.0 library specially optimized by NUDT. The optimization techniques used to achieve a better LINPACK result are as follows.
First, a dynamic load balancing technique is used when allocating tasks between CPU and GPU.
Second, streaming load/store instructions are adopted to reduce conflicts between CPU and GPU data accesses. Third, a software-pipelining technique is used to overlap GPU execution with data transfers between GPU and CPU. Fourth, an affinity-scheduling technique is used to reduce performance fluctuation by exploiting the processor cores' computing and controlling abilities. Fifth, the DGEMM function in the AMD ACMLG library and the DTRTRI and DTRMM functions in the Intel MKL library were optimized to speed up DTRSM.
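The software-pipelining idea can be sketched with a toy double-buffering loop: while one block is being computed, the next block's transfer is already in flight. The `transfer` and `compute` functions below are stand-ins for host-to-GPU DMA and the GPU kernel, not TH-1 code:

```python
import queue
import threading
import time

def transfer(block):   # stand-in for a host->GPU data transfer
    time.sleep(0.01)
    return block

def compute(block):    # stand-in for the GPU kernel on one block
    time.sleep(0.01)
    return sum(block)

def pipelined(blocks):
    results = []
    q = queue.Queue(maxsize=1)  # one block in flight = double buffering
    def producer():
        for b in blocks:
            q.put(transfer(b))  # overlaps with compute() in the main thread
        q.put(None)             # sentinel: no more blocks
    threading.Thread(target=producer).start()
    while (b := q.get()) is not None:
        results.append(compute(b))
    return results

print(pipelined([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

With the transfer and compute stages overlapped, the steady-state cost per block is dominated by the slower of the two stages rather than their sum.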
For the sake of stability, the GPU core frequency is decreased from 750MHz to 575MHz, and the GPU memory frequency is likewise decreased from 900MHz to 650MHz.
Source: TOP500 Submission