More info on Chinese indigenous CPU, material pilfered from xyz's post from CDF.
Chinese high end CPUs are now in the game
Reported by Nebojsa Novakovic on Wednesday, December 21 2011 11:18 am
Last week's report on CPUs, mentioning the Chinese new-generation entries, did raise some waves on various online forums. Here's a bit more on some of those processors.
China has now officially gone deep into the core of high end computing, all the way to the deepest level - designing and manufacturing its own CPUs - to complete the whole vertical stack from the processor to the application. That includes having its own designs covering everything from smartphone to supercomputer, based on three main architectural families: ARM, MIPS and Alpha.
Our last week's report, and its coverage of the Chinese CPUs, has sparked quite a few online comments on various forums, from those of encouragement and seeking more diversified CPU futures, to outright dismissal of these chips as copies or inferior designs, or not having, out of all things, X86 architecture - widely regarded as the worst ever CPU architecture from a design point of view - as a 'proof of true capability'.
Well, let's take a look at the three chosen main architectures here. ARM, MIPS and Alpha are all native RISC architectures - meaning simple, symmetric, orthogonal instruction sets with only a few addressing modes and options, a uniform instruction format and easy scalability to wide cores, multi-core designs and a range of speeds from low power to top performance, with a much lower gate count required than any X86. Since China doesn't want to depend on a Western software stack for its public and, especially, government use, it doesn't need to rely on X86, as this architecture's winning card is software compatibility with hundreds of thousands of past applications.
So, why bother with the X86 complexities - both technical and legal - then? The internal market is more than good enough to, coupled with Linux and other open source stacks, provide complete solutions and the volumes required to justify these processors even commercially over long run.
Talking about legality: No, these are not fakes or illegal copies. The ARM and MIPS processors made in China are fully licensed by the relevant ARM and MIPS IP owners, while the Shenwei Alpha-compatible chip is based on Digital (DEC) IP that is well over 15 years old now - quite ironic for a CPU that matches the best current X86 processors, which are based on 2010 IP and built on a process two generations newer.
MIPS - Dragon's Progeny
Loongson (Godson) is the name for the Chinese MIPS processors, developed by the Institute of Computing Technology (ICT) at Beijing's Chinese Academy of Sciences, with Prof Hu Weiwu being the design leader. Prof Hu also happens to be a deputy at the National People's Congress, which surely is helpful in gaining support for the overall project. For the past 9 years, the effort has been run as a joint venture between the government and private enterprises through a company called BLX, a partnership between CAS and Jiangsu Zhongyi Group.
There have been 3 major generations of these processors up to now, with the latest one - Loongson 3B - being an 8-core 1.05 GHz CPU, with each core having a 256-bit vector FP unit as well. Despite the low clock and 65 nm process, the efficient 4-way out-of-order cores and vector units with dual 256-bit FP ops per core per cycle allow Loongson 3B to reach 16 GFLOPs per core at 1 GHz, some 130 GFLOPs peak FP rate in double precision at 1.05 GHz clock. For comparison, the 3.3 GHz Core i7 3960X with AVX would achieve some 160 GFLOPs peak in DP, while the Westmere (Core i7 990X) and Bulldozer CPUs would be at roughly two-thirds of this or less - Core i7 990X is at 90 GFLOPs peak, and AMD FX8150 at some 110 GFLOPs peak, all in DP. And, oh yes, the Loongson 3B achieves this performance at just 40 watts TDP, less than one third of the above competing CPUs.
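The peak figures above follow from a simple product: cores x DP flops per core per cycle x clock in GHz. Here is a minimal sketch of that arithmetic. The Loongson 3B inputs come from the text, while the Core i7 3960X inputs (6 cores, 8 DP flops per core per cycle with AVX) are assumptions added here for illustration, and the function name is hypothetical.

```python
def peak_gflops(cores, flops_per_cycle, clock_ghz):
    """Theoretical peak double-precision GFLOPS for one chip."""
    return cores * flops_per_cycle * clock_ghz


# Loongson 3B: 8 cores, 16 DP flops per core per cycle, 1.05 GHz (from the article)
loongson_3b = peak_gflops(8, 16, 1.05)

# Core i7 3960X: 3.3 GHz is from the article; 6 cores and 8 DP flops per
# core per cycle with AVX are assumptions, not stated in the article.
core_i7_3960x = peak_gflops(6, 8, 3.3)

# Peak GFLOPS per watt at the 40 W TDP quoted for the Loongson 3B
loongson_gflops_per_watt = loongson_3b / 40

print(f"Loongson 3B:   {loongson_3b:.1f} GFLOPS peak")
print(f"Core i7 3960X: {core_i7_3960x:.1f} GFLOPS peak")
print(f"Loongson 3B:   {loongson_gflops_per_watt:.2f} GFLOPS per watt")
```

Under these assumptions the two chips land at roughly 134 and 158 GFLOPS, in line with the "some 130" and "some 160" quoted in the text. These are theoretical peaks only; sustained performance depends on memory bandwidth and the workload.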
Something even more interesting is that Loongson 3B has over 200 extra instructions in a separate block, which doesn't affect the main core integrity, that speed up execution of X86 software when using the QEMU translator. The benefit of this, at a 5% die area cost, is running lots of X86 software at near native speeds - an approach that Alpha perfected over a decade ago with the FX!32 software that enabled Alpha Windows NT to run many X86 titles of the time at high speed.
Anyway, since the core is reasonably efficient already, the next step for Loongson 3 is a 16-core version in 28 nm process, expected sometime in mid 2012. Minor core improvements will be there in addition to a much higher clock rate, around 1.6 GHz, as well as a larger L2 cache, greater than the current 4 MB. The 2 x 64 KB per core L1 caches are expected to stay.
What about the software? Several major Linux distros do run - including Debian, Gentoo, Mandriva and China's own Red Flag. The BSD OS ports were done quite a while ago, as was the Windows CE port. Since there are quite a few consumer devices based on the previous Loongson / Godson processors, who knows, one day we may even see Android and Windows 8 ports, although the Chinese don't seem to feel much pressure about it.
Earlier, we looked at the background of Chinese high end microprocessor effort, as well as the most widely known of them, the Loongson MIPS family. In this second part, we cover Alpha.
REVIEW
Alpha was, for a long time around the turn of the century, the Formula 1 of microprocessors with its very simple, elegant yet extremely scalable RISC architecture focused on raw speed, and pure 64-bitness without any 32-bit modes or compatibility baggage. Between 1993 and 2001, the time of its untimely murder, it owned the majority of performance records, especially when it came to processor performance - DEC (Digital Equipment Corp) system designers were sometimes too stingy with the memory and I/O systems, allowing other vendors to occasionally win the accolades in those tests. The most well known of those cores, the one that had the highest comparative performance advantage vs the competition, was the 21164 a.k.a. EV5 family, which spanned three semicon process generations - 0.50, 0.35 and 0.25 microns.
The most widely spread volume-wise was the 0.35 micron 21164A in 1996-7, reaching up to 667 MHz, and beating the contemporary 266 MHz Pentium II by over two times in most benchmark tests of the time. The 21164 core, a simple but very high clock-optimised four-issue in-order design with two FP ops per clock, was also the most performance efficient of all Alphas, taking some 25 Watts at 667 MHz vs 75 Watts for the 600 MHz Pentium III 'Katmai' which followed a few years later, still at lesser performance. The subsequent Alpha cores, such as the 21264 EV6, brought up to double the performance per clock, however at three times the power consumption per clock, a point very important when looking at the choices made later in this story.
The 21264 out-of-order core was also scaled across three processes, including derivatives made by Samsung, the major Alpha architecture licensee. It, and its successor 21364 EV7, carried the performance torch until 2002 or so, well after Alpha's further public development was stopped. Do note the memory and I/O interconnect revolution with the EV7: while the core was basically the same EV6 type, the chip added an on-chip 1.75 MB L2 cache, a 10-channel integrated Rambus memory controller with humongous memory bandwidth basically matching that of the L2 cache and enabling that cache to act as a low latency buffer for the memory system, and four parallel 6.4 GB/s coherent interconnect links to 4 other processors, scaling up to 512 sockets with directory support. These were a revolution for year 2000 computing. Such things were only seen in PCs 5 years later with HyperTransport from AMD first, later followed by QPI from Intel. BOTH THESE INTERCONNECTS ARE DERIVED FROM OVER A DECADE-OLD ALPHA EV7.
There is more to add. The 21464 EV8, aimed for release in 2002 if things had continued as originally planned, was to be the first processor with an eight-issue wide superscalar out-of-order simultaneously multithreaded core, and we mean four threads out of each core here. The 'EV9' 21564 design was expected to add multi-core and a huge, wide vector unit - up to 1 KILOBYTE wide - to the mix, enabling well over 100 GFLOPS DP floating point performance per core for the 2004 timeframe. Remember, we are only now reaching such capabilities in late 2011, and need 6 to 8 cores for that. Anyway, the multithreading and vector enhancements designed well ahead of their time into the EV8 and EV9 never saw the light of day in the open market.
In the late nineties, China saw the value and capability of Alpha, and built a number of Alpha systems, some of them very large for the time. It also fully licenced the Digital / Tru64 UNIX and related software stack, including getting the full source code, from Compaq after the latter bought DEC, giving China the critical software control part. At the same time, having seen the business instabilities linked to the Digital-Compaq-HP transition, China seems to have been working on having its own Alpha flavour.
After over a decade of work and three generations of CPUs, Jiangnan Research Lab has shown the ShenWei (Sunway) SW-3 processor, the Chinese flavour of Alpha, not in a small workstation, not in a server, but in no less than a huge petaflop-class supercomputer in Jinan, Shandong - the Sunway BlueLight MPP, this past October. The CPU itself has been running for over a year in a variety of systems, but displaying it running a petaflop machine was probably the best PR one could get, especially since foreign supercomputing dignitaries such as Jack Dongarra, the man behind the TOP500 list and the Linpack FP benchmark, were in attendance.
SW3 aka SW1600 is a 16-core, 64-bit RISC processor, with each core looking a lot like an improved version of the 21164A EV56 Alpha core, plus a vector FP unit extension added to each core. While the initial speed range was 1 to 1.2 GHz in the 65 nm process, the standard speed grade is a 1.1 GHz chip with 141 GFLOPs DP FP performance. The speed set for the BlueLight petaflop machine's TOP500 run was 975 MHz, though. The quad-channel 128-bit DDR3 on-chip memory controller offers 68 GB/s bandwidth - yes, equivalent to 8 channels of DDR3-1066 server RAM.
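The bandwidth equivalence above can be checked with a short calculation. This is a rough sketch that assumes a nominal DDR3-1066 rate of 1066 million transfers per second and the standard 64-bit width of a DDR3 channel; neither number is spelled out in the text, and the function name is hypothetical.

```python
def ddr_bandwidth_gbs(channels, bus_width_bits, mega_transfers_per_sec):
    """Peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_sec / 1000


# Assumption: DDR3-1066 moves a nominal 1066 million transfers per second.
DDR3_1066_MTS = 1066

# SW1600 as described in the article: four 128-bit channels
sw1600_bw = ddr_bandwidth_gbs(4, 128, DDR3_1066_MTS)

# Conventional server setup: eight standard 64-bit DDR3-1066 channels
server_bw = ddr_bandwidth_gbs(8, 64, DDR3_1066_MTS)

# Bytes of memory bandwidth per peak DP flop at the 1.1 GHz speed grade
# (16 cores x 8 DP flops per cycle x 1.1 GHz, per the article's 141 GFLOPs)
sw1600_peak_gflops = 16 * 8 * 1.1
bytes_per_flop = sw1600_bw / sw1600_peak_gflops

print(f"SW1600, 4 x 128-bit:  {sw1600_bw:.1f} GB/s")
print(f"Server, 8 x 64-bit:   {server_bw:.1f} GB/s")
print(f"Bytes per peak flop:  {bytes_per_flop:.2f}")
```

Both configurations come out at about 68 GB/s under these assumptions, which is where the "8 channels of DDR3-1066" comparison comes from. The bytes-per-flop ratio gives a rough idea of how well the memory system can feed the vector units.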
The L1 and L2 cache sizes are still rather minuscule for modern CPUs, being kept at the original 21164 sizes of 2 x 8 KB L1 and 96 KB L2, however it has enabled both very small cores and also very, very low cache latencies, down to two clock cycles for L1. You can see the CPU block diagram here.
As mentioned before, the 21164 core was the most power efficient of all Alphas, and also one of the most power/performance efficient 64-bit high end CPU cores of all time, excluding the mainstream, entry level or embedded processors. So, the choice of that core for all these years by the Chinese, although they obviously - as the Loongson case shows - had plenty of resources to improve the EV6 or even EV8 cores if they wanted to, seems to have proven right at this point. Remember Intel's Knights Corner, or the AMD GCN GPU architecture for compute?
The Knights Corner, being a compute version of the abandoned Larrabee project, uses a core even simpler - and slower - than Alpha 21164, basically a 64-bit version of the old Pentium, enhanced with much higher bandwidth, to act as a feeder to a vector unit behind it that provides very, very fast FP. Stick 50-odd of those on one chip, with the right cache and interconnect in between, and you've got a good accelerator. The Compute Units in the AMD 7970 aren't that much different, although they are based on a native optimised architecture, rather than cumbersome X86.
So, in the Shenwei SW3, you have a simple, well proven 4-way (still double the issue of Pentium or Atom per cycle) superscalar in-order core with very small die footprint for today's processes, yet improved and with enhanced bandwidth to feed a simple, AVX-like throughput vector unit. What's the vector unit's speed then? If you normalise the speed to 1 GHz, it'd give you 8 GFLOPs DP per core, or 8 flops per cycle - not bad at all for a 2010 chip using an enhanced 1995 core! All that at very low, below 40 watts (official figures not available) per socket power consumption despite the old 65 nm process.
And, the sustained performance and power consumption in the Sunway BlueLight petaflop system were the proof of the pudding: the water-cooled 9-rack machine has 8,704 ShenWei SW1600 processors (only 8,575 of them ran the TOP500 benchmark, at 975 MHz each) organized as 34 Super Nodes (each consisting of 256 compute nodes), 150 TB main memory, 2 PB external storage, peak performance of 1.07 PFLOPS, sustained performance of 796 TFLOPS, efficiency of 74.37%, and total power consumption of 1,074 kW, figures that compare very well against competitive US supercomputer systems such as the X86-based Jaguar.
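The system-level figures can be cross-checked against the per-core numbers given earlier (16 cores per chip, 8 DP flops per core per cycle). What follows is a back-of-the-envelope sketch, not an official calculation; all inputs are taken from the text, and the variable names are just illustrative.

```python
# Inputs as quoted in the article
CORES_PER_CHIP = 16
FLOPS_PER_CORE_PER_CYCLE = 8
CLOCK_GHZ = 0.975            # clock used for the benchmark run
CHIPS_IN_RUN = 8575          # processors that took part in the run
SUSTAINED_TFLOPS = 796
POWER_KW = 1074

# Total processor count: 34 Super Nodes of 256 compute nodes each
total_chips = 34 * 256

# Peak per chip in GFLOPS, then peak for the whole run in TFLOPS
chip_peak_gflops = CORES_PER_CHIP * FLOPS_PER_CORE_PER_CYCLE * CLOCK_GHZ
system_peak_tflops = chip_peak_gflops * CHIPS_IN_RUN / 1000

# Linpack efficiency: sustained performance over theoretical peak
efficiency = SUSTAINED_TFLOPS / system_peak_tflops

# Energy efficiency: TFLOPS -> MFLOPS is x1e6, kW -> W is x1e3
mflops_per_watt = (SUSTAINED_TFLOPS * 1e6) / (POWER_KW * 1e3)

print(f"Total processors:  {total_chips}")
print(f"Peak per chip:     {chip_peak_gflops:.1f} GFLOPS")
print(f"System peak:       {system_peak_tflops / 1000:.2f} PFLOPS")
print(f"Linpack efficiency: {efficiency:.1%}")
print(f"Energy efficiency: {mflops_per_watt:.0f} MFLOPS/W")
```

The numbers hang together: 34 x 256 gives the 8,704 processors, the peak works out to about 1.07 PFLOPS, and the efficiency comes to about 74.4%, close to the quoted 74.37% given that the sustained figure is rounded.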
What does the future hold for Shenwei? Well, it can either continue where the Alpha was stopped, moving to 8-issue cores (even an in-order architecture can do it these days since compilers and scheduling have evolved a lot over the past decade) and much faster FP per core, with fresh cache and memory architectures, or just tweak the current core and pack more of them in a single die at higher clock speeds as well, with wider vector units and more memory bandwidth to feed all that, a bit like a RISC cousin of Knights Corner, but a true CPU here, instead of just an accelerator. Either can lead to teraflop-on-chip soon too, and either will require a rapid jump in the semiconductor process used, down to 32 nm or 28 nm nodes - just like Loongson is expected to do this coming year.
Keep in mind that Alpha left behind a strong software library, not forgetting the Alpha-based Cray T3 system series here as well, and this includes one of the best UNIXes ever, as well as great compilers, optimised libraries, and much more. Coupled with its own software base, China has sufficient resources to continue developing Shenwei on its own, with a sufficient internal market. However, when it decides to go fully commercial with the effort, there will be plenty of interested partners worldwide to embrace the old-new Formula 1 of microprocessors yet again, this time with a far more stable supplier, business wise, than DECompaq was.
In this final part of the Chinese CPU development coverage, we look at the local ARM processor flavours, as well as China's own instruction set attempts aimed at the general market.
REVIEW
While MIPS and Alpha were at the forefront of RISC high end architecture development, the sole surviving Europe-developed instruction set architecture, ARM, was from the very start in 1985 aimed at the entry level - whether it was the BBC Micro home computer successor then, or the myriad of smartphones and netbooks today. The Chinese have embraced the ARM architecture as well for this part of the market, with several licensees up to now. These cover the full spectrum of consumer devices, from smartphones and tablets to netbooks, DTV set-top boxes and car gadgets.
The Fuzhou-based RockChip offers Cortex A8-based custom ARM CPUs and SoC chips for personal entertainment devices. Their newest RK29xx is the first chip to decode Google's WebM VP8 in hardware. The 1.2 GHz CPU with 512 KB L2 cache also has an integrated 60 million polygons/s GPU as well as DSP-accelerated 1080p playback and encoding in most formats. It supports tablets and smartphones with up to 1280x800 displays. A dual-core version is supposedly under development as well.
The Hangzhou-based NationalChip licensed ARM over 3 years ago, with a specific focus on derivatives for digital entertainment, mainly digital TV sets and set-top boxes. Considered one of the top ten Chinese IC design companies by EETimes China, the company offers the GX1100, 1200, 1500, and 3000 families of integrated SoC-approach components for digital entertainment.
Then, the Shanghai-based Leadcore Technology, the chip design arm of Chinese communications equipment company Datang Group, is working on custom ARM processors based on the Cortex-A9 MPCore, the ARM Mali-400 MP graphics core and the Cortex-A9 optimization pack for the TSMC 40 nm low power process technology. Their focus is putting together single- and dual-core versions of such chips with their own baseband chip to target high-end smartphones based on China's 3G standard, TD-SCDMA.
Another Shanghai company, Brite Semiconductor Corp., a fabless startup founded in 2008, has licensed most major ARM processor cores, including Cortex, ARM9, ARM11 and Mali, under a long-term arrangement. The license also covers CoreSight debug and trace technology and peripherals that are compliant with the AMBA on-chip bus. Brite provides design services to electronics companies and works with SMIC, the local foundry, on the manufacturing side. They already successfully taped out 40 nm chips at this foundry earlier this year.
Yet another company from the 'New New York' of Asia, Shanghai InfoTM Micro-electronics, has licensed the ARM11 processor core, the Cortex-A5 and Cortex-A9 processor cores and the Mali-300 and Mali-400 GPUs for 3-D enabled mobile computing devices to be manufactured by Shuoying Digital Science & Technology (China) Co. Ltd., which is both its owner and main customer. They also have multicore ICs ready as of now.
In Zhuhai, Allwinner, focusing on HD media semiconductors, took the ARM Cortex-A8 processor and the Mali-400 MP GPU for their own HD-enabled processors to be used with a range of Android OS-based tablets, smart TVs, personal media players, eBooks, smart media boxes, IP cameras and automotive multimedia gadgets. The Allwinner Technology SoC designs have been available since this past summer.
Finally, we look at the ultimate approach - designing your own instruction set from the ground up, a venture few dare to try, especially these days since X86 has been predominant across the board for the past decade. ICube, a Shenzhen company, created the Harmony Unified Processor Technology, which is supposed to tightly integrate two different processor types, CPU and GPU, into one unified core - this sounds somewhat like the AMD Fusion approach, but with a fresh instruction set optimised from scratch for the purpose. This technology consists of the Multi-Thread Virtual Pipeline parallel computing core (MVP), an independent instruction set architecture (ISA), an optimizing compiler and the Agile Switch dynamic load balancer.
Even though these are big-named things reminiscent of what you see in servers, ICube's technology is actually used in small SoC solutions for the hand-held computing and communication market, with a focus on the Android OS. The initial product, ICube IC1, is a 600 MHz dual core 32-bit SoC with 8 threads (4 per core) in parallel and 5160 DMIPS declared throughput, a 70 million polygon/s, 600 Mpixel/s GPU, and a host of integrated features such as a FullHD display driver up to 1920x1200 with HDMI/DVI, a camera interface, 720p video acceleration, 5.1 audio, memory card, USB, 3G and Wi-Fi connectivity.
What's interesting here is not only the fine grained CPU multithreading with OpenMP and Pthread support (both used a lot in HPC and general SMP apps), but also the GPU support for data parallel, task parallel, and function parallel computing with minimised interrupt and context switch overhead due to multithreading, and heterogeneous GPGPU applications with both OpenGL ES 2.0 and OpenCL support. Each core has a 64 KB I-Cache, 64 KB D-Cache, 64 KB SRAM and 32-bit GPR file, 8-channel DMA and a 16-source interrupt controller. Each core only takes 3.0 mm², including memory, with operating power of about 300 mW.
The built-in support for both homogeneous (OpenMP and such) and heterogeneous (OpenCL and such) parallel programming APIs through the native compiler and MVP drivers is quite good news here, as a new ISA needs the easiest possible programming enablement to ensure software support.
In summary, China is covering the ground well at the mainstream level as well, ensuring a well varied supply of CPUs for all classes of consumer devices, having ARM compatibility yet local cost, design and manufacturing control. At the same time, going for its own instruction sets is the next frontier.
Source:
Part 1:
Part 2:
Part 3