China's server chip dream

Beijing wants its own server chip. Really, really badly. I want to lay out its progress so far - at least as far as I’m aware from public sources and private chats. For those completely unfamiliar with semiconductors, there’s a simple intro; the more silicon-minded can skip it.

Quick introduction

An indigenous server CPU is a critical plank of China’s semiconductor ambitions. Put aside all that flash talk about AI, and just focus on how any modern IT system functions. It runs off a server somewhere. The CPU is a bit like the brain (though you usually need a few of them); it receives information, analyses it and then sends signals to other bits of your body to do stuff.

While your laptop or PC has a CPU, the servers keeping the hyper-scalers like Amazon, Facebook and Google running, as well as any other company with an in-house server, need large numbers of specialized-for-servers CPUs, optimized to manage the huge, diverse and constantly-evolving workloads they run. When people talk about ‘The Cloud’ they are talking about linking your computer to, say, Amazon’s global network of server farms. High-powered computers (or, going up a level, “super-computers”) are just fancy servers - the difference is one of processing scale and speed, not type.

Of course, you also need a place to store the data which you’re computing, so to make that more efficient, we have semiconductors specialized for memory: NAND and DRAM. NAND is for storing content permanently; DRAM is for holding data while it’s being computed. In addition, an increasing amount of the workload running on hyper-scaler servers is image-recognition and recommendation-engine machine-learning (ML) algorithms (usually what folk talking about AI are referring to), and that math is best run on chips optimized for massive but relatively simple matrix multiplication. GPUs, originally designed for gaming, are good at that, though they’re really expensive. Thus, firms are designing ASICs (chips designed for a specific purpose) to run those algorithms most effectively. Google’s TPU is the most famous example.
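To make "massive but relatively simple" concrete, here is a minimal sketch in plain Python of the arithmetic at the heart of those ML workloads. The function name `matmul` and the toy numbers are my own illustration, not anything from a vendor's library; the point is only that the work is the same multiply-and-add repeated over and over, which is what GPUs and ML ASICs are built to do in bulk.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix using plain nested loops."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                # The whole workload is this one multiply-and-add, repeated.
                out[i][j] += a[i][p] * b[p][j]
    return out


# Hypothetical toy data: 2 inputs with 3 features each, and a 3 x 2 weight matrix.
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]

scores = matmul(inputs, weights)
print(scores)  # [[4.0, 5.0], [10.0, 11.0]]
```

A real model does this with matrices thousands of entries wide, so the number of multiply-and-adds runs into the billions; a general-purpose CPU can do it, just not cheaply.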

So the CPU is the brain, DRAM and NAND are the memory, the GPU handles graphics/games, as well as ML, and ASICs for ML are just arriving. All these chips work together in the servers, together with the network of switches and routers which move all the data around. And servers are currently migrating from having compute, storage and networking in different bits of kit to having it all integrated (so-called hyper-converged infrastructure), along with a virtualized software layer for managing everything. This transition means the server CPU has even more work to do.

Firms in China, with much state support, are making efforts across all these semiconductor types, but given the CPU is the brain, it’s mission-critical. Not just to running Alibaba’s cloud servers, or the ones which China Mobile and ICBC run themselves, but government and military servers too. Right now, China is reliant upon, just like everyone else, Intel and its x86-based server chip (known as Xeon). There are competitors (which we’ll get to below), but the Xeon range, particularly the latest versions, is just way-better (and more expensive) than anything else on offer. Cue a severe case of paranoia (lately a little more justified) in Beijing about being cut off from this silicon.

One final thing. We are not just talking about how to design and manufacture an incredibly complicated bit of silicon, but also about the base-level interface that makes it work, the Instruction Set Architecture (ISA). This is the set of instructions through which software communicates with the hardware. For Intel, it’s the x86 platform.

So that’s the 101 over. Let’s get on with the story.

China’s efforts in server chips, first with AMD

The strategy, as with so many other instances of industrial policy in China (which I wrote about here), involves the central government setting a broad, ambitious goal and entities then competing to fulfill the target. “We want a server CPU - Go!”. Some of the resulting efforts are more meaningful than others.

Of course, you first have to develop a CPU that works, and then optimize it to function in a server. But I’m not going to deal with China’s indigenous CPU development efforts here as they’re pretty limited and unsuccessful. China has tried to leapfrog that step, essentially, with two attempts to indigenize leading-edge foreign technology. One attempt, involving AMD, a US semiconductor firm, appears to have died with the Entity Listing of Sugon, its Chinese HPC partner, while the other is in A&E, since it involves Huawei. ARM is the partner there.

First, AMD, a firm which has a license from Intel to use the x86 ISA, but which builds its own cheaper chips. It developed a rap as an also-ran in the 2000s, but the firm got a new lease of life in 2014 under a new CEO, Lisa Su. Recent product launches have gone well.

In 2015, under Su and as part of a new strategy, AMD management looked for a way into China, worried that ARM would be China’s preferred partner. It approached the Ministry of Industry and Information Technology (MIIT), which said it would look kindly on a JV aimed at developing an “indigenous” server CPU. After multiple meetings, it was “suggested” that AMD work with Tianjin Haiguang Advanced Technology (THATIC), an entity controlled by Sugon, a leader in China’s HPC space. The two sides then haggled about an up-front fee and annual royalties for x86 and SoC IP licenses.

The structure of the deal was a tad complicated – as Washington did not want x86 technology being transferred to China, and China wanted IP it could indigenize. To achieve this magic, two joint ventures were set up in 2016:

  1. Chengdu Haiguang Microelectronics (HMC), majority owned by AMD (51%), with THATIC taking 49%. It licensed the x86 IP from AMD, and subcontracted out the fabrication of the chips. AMD’s majority ownership meant that the IP was still controlled by a US entity.

  2. Chengdu Haiguang IC Design (HyGon), 70% owned by THATIC, with AMD taking a 30% stake. It then licensed the IP from HMC, had access to all the IP, designed the chips and sold them. Thus chips are sold in China (and nowhere else) under the HyGon name, with a chunk of the revenues going back to AMD.

AMD sold a “soft-core” x86 license to HMC: a black box which can’t be reverse-engineered, AMD argued, but around which HyGon could design software. The JV was supposed to develop its own chip, but in practice it worked on a replica of AMD’s Zen-based server CPU (known as EPYC), which it called Dhyana.

One view of this relationship is that it gave Sugon a huge PR win, but was essentially worthless in terms of absorbing the technology. But some folk in the US government were really not happy, the WSJ reports, believing it constituted ‘giving away the keys of the kingdom’ of the x86. I am still not sure what to make of these concerns. This great piece lays out reasons for skepticism. That said, there may be evidence that’s not public that more was going on. This piece is also good.

Anyhow, the US national security folk got their way a few weeks ago when the Department of Commerce put Sugon on the Entity List (here), effectively shutting down AMD’s JV and any further transfers of IP. AMD’s Su has said that the new version of the EPYC chip was not shared. I’ve not heard what’s going on at the JVs now.

Second up, ARM…

ARM is a UK-based firm which designs the processor architecture that runs your smartphone. It’s an amazing success story. Qualcomm’s Snapdragon System-on-Chip (SoC), an industry leader, uses ARM; so do Huawei’s Kirin and Apple’s A12 chips. (A SoC is basically a CPU with other functions, including memory, integrated together.) Critically, ARM has its own ISA. Unlike Intel, it licenses its ISA, and designs based on it. Other firms, like Qualcomm, then design on top of ARM’s base ISA. Given its huge success in smartphones, ARM has been trying to develop its ISA for other sectors of the semiconductor universe, including servers. Before it was bought by Marvell, for instance, Cavium used version 8 (v8) of ARM’s ISA to design some server chips called ThunderX, which showed up in some US supercomputers.

Progress has been dripping slow, though. Chips with ARM’s ISA have around 4% of the global server-CPU market. They start from a position of weakness; an ISA designed for a smartphone is very different from one designed to carry far heavier, way-more-complicated workloads. And given you’re going up against Intel’s Xeon, a number of firms have simply given up server chip projects in recent years.

But there’s more stuff happening now. One reason is that the US hyper-scalers are interested in designing their own custom silicon. While Google has talked much about its TPU for ML workloads, Amazon is doing that too, but has also designed its own server CPU. It bought Israeli chip-design firm Annapurna in 2015 and, in 2018, rolled out Graviton, a server CPU using ARM’s architecture. It’s now running on Amazon’s servers (to some extent, perhaps just for storage control) and is available for its cloud clients to use. The performance is not great, though; it’s much less powerful than Intel’s or AMD’s offerings. But the next generation will be better, etc. (You can read about its performance here, here and here.)

Make-your-own-server-chip projects only make sense if you have the deep pockets to license from ARM, buy in a decent chip design team, fund a tape-out (i.e. manufacture) at a foundry like TSMC, have the ability in-house to test the chip’s performance, and then the time to keep improving the design over several iterations as you optimize for the workloads you have. It’s only really the Amazons and Googles of this world who have those resources. And Huawei, of course, who we’ll get to below.

Another challenge for new entrants in this field is the software that firms run on their servers and in the cloud. Some 80+% of that is written for the Xeon platform and just won’t work on an ARM-based ISA. But if an ARM chip is sufficiently cheaper (or if you desperately want to be free of Intel), you can write or convert your software to be ‘scripted’ (i.e. compiled just before execution), which makes it platform-agnostic. Some 20% of the workload at Amazon is said to be scripted now, and ARM is keenly working with cloud software firms to boost that share.
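For a feel of what "platform-agnostic" means in practice, here is a small sketch in Python, one such scripted language. The function names `describe_host` and `total_order_value` are made up for illustration. The same source file runs unchanged on an x86 Xeon box or an ARM server; the only thing that has to be built separately for each ISA is the Python interpreter itself.

```python
import platform
import sys


def describe_host():
    """Return a short description of the machine this script happens to run on."""
    # platform.machine() typically reports something like 'x86_64' on Intel/AMD
    # hardware and 'aarch64' or 'arm64' on ARM hardware.
    arch = platform.machine() or "unknown"
    return f"Python {sys.version_info.major}.{sys.version_info.minor} on {arch}"


def total_order_value(prices):
    """Hypothetical business logic: identical source, whatever the ISA underneath."""
    return sum(prices)


report = describe_host()
total = total_order_value([10, 20, 12])
print(report, "- total:", total)
```

Software compiled ahead of time for x86 has no such luck: it has to be rebuilt, and often re-tested and re-tuned, for the ARM instruction set.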

Some 20% of ARM’s revenues come from China - thanks Huawei, and all the other smartphone makers! And, like AMD, looking to the future, the firm wanted to bolster its presence in China. Its ISA is also ideally suited for IoT chips - your wired TV, fridge, lighting system, etc. - which require much simpler chips to operate. The firm had an engineering team in Shenzhen working on that, given China’s huge future IoT market; and in 2018 it set up a formal joint venture, Arm Technology China, in which it has 49%, with the majority (51%) owned by Chinese investors (which include CIC, the sovereign wealth fund, Silk Road Fund, another state entity, and Hopu Investment, a PE fund led by the exceedingly enigmatic Fang Fenglei). The JV became the distribution platform for all of ARM’s IP sales into China, so that it can make some revenue (though most is transferred back to Cambridge, England). It has access to the firm’s historical IP, as well as rights to future ARM IP and the design roadmap.

And this is where it becomes interesting. ARM was bought by Japan’s SoftBank in 2016 (here) for USD 31bn. So that China JV deal was organized by SoftBank. Now, SoftBank has a 30% stake in Alibaba, and has pegged its future to e-commerce and IoT, where China is going to be a huge player. Some might speculate that ARM is now happier to transfer IP over to China. And if China offered USD 50bn for ARM, what would SoftBank say? Could it say no? I’m not familiar with how things work in Japan. (Does Tokyo have the legal authority to block a sale like that? Could Washington put pressure on Tokyo to put pressure on SoftBank to restrict China’s access to ARM IP?)

But put those questions aside, as there’s now another problem.

Where does Huawei fit into all this?

Huawei, ARM’s biggest China customer, has been busy trying to develop a server CPU - based on the ARM architecture. They are like Amazon in a few ways: they have a ton of money to throw at the problem, they have a lot of experience building large server systems for clients (though they are nowhere near AWS in their cloud offering), they have lots of opportunity to try out and iterate on their designs (maybe on the government systems they install, on non-critical workloads, rather than commercial servers), and they have a bunch of very smart, motivated engineers. They have some extra advantages too: acres of chip design folk familiar with the ARM architecture, with which they built their Kirin smartphone SoCs. And lots of state support (like Huahong, who I wrote about in this piece). And now a very friendly bunch of ARM architects just down the road in Shenzhen.

And when it comes to the ISA, Huawei have an architectural license. In contrast to the AMD arrangement, this gives Huawei access to the core IP. That’s a big plus. The minus is that AMD has been working on server chips for years, albeit not that successfully, so Huawei has to first get the ARM architecture up to the AMD level.

They’ve made progress. In January, they announced the Kunpeng 920, which has lots of impressive-sounding metrics (here). I’m blind as to how good this chip really is - but my suspicion is that it can ‘get the job done, but not very well’. (Which is already something!) Here’s the technical breakdown. As with Amazon, this is a process; they’ll now be testing. And given they’re working with TSMC to make the chips, they’ll have the benefit of their skills and engineering experience. TSMC recently presented research on ARM-based HPC chip design here.

But now they have a big problem: the US Entity Listing. As far as I understand it, it prevents ARM from licensing new versions of its ISA to Huawei, even via the JV. So while ARM cannot take back v8, Huawei won’t be able to license v9 in a year or two when ARM releases it. (The reason appears to be that there is US content in ARM’s ISA, but exactly how this law works is more confusing than quantum physics, so who knows if that ban will stand.) Without access to ARM IP, Huawei’s chip capabilities will gradually die - and not only for the nascent server-CPU project, but also for the phone. v8 is fine of course - and that will mean everything can continue as is, but over time it will degrade relative to new upgrades. (And the server-chip side gets hit more, as ARM will not be able to share any R&D on a server-optimized ISA, while Huawei already has a lot of that expertise in-house on the phone side.)

It is also worth noting that the listing stops Intel selling them x86s - so in effect, Huawei won’t be able to sell new servers (unless the clients buy the x86s instead?).

Huawei certainly seemed to be banking on a, err, long-term relationship with ARM. It plans to establish an HQ in Cambridge, UK, right next to ARM’s home base and all the geeky kids at the university there (here). A masterful move if you ever thought about hiring away ARM engineers…

There’s been some talk about China’s server-chip wannabes being able to make use of RISC-V, an open-source ISA originally developed at the University of California, Berkeley. But given it’s a really reduced-form ISA, way more basic than ARM’s, it will require years of work to get anywhere close to where the x86 is today.

Conclusion

And there we have it - at least to the best of my knowledge. China’s firms have been trying really hard to leapfrog into server chips. There’s some progress. But they need foreign cooperation and help.

Foreign companies are willing to share technology, particularly firms which are attempting to catch technology-leaders themselves. The act of cooperation empowers the China JV and erodes, over the long-run, the competitive position of the tech-leader in the space, in this case Intel. A JV also generates ample informal opportunities for tech-transfer, via training, friendships-made, hacks made a bit easier, etc.

It is only US legislation (and Japanese?) which is going to put up barriers. And the firms will work hard to work around these, while following the letter of the law.

The ARM relationship with Huawei really is critical.
