New cores underpin Cavium’s Thunder X2

Computex 2016: OoO and more in the new server offering

For those of you waiting for a ‘real’ ARM server SoC to arrive, the Cavium Thunder X2 is here. While it is a ways out from production, today’s announcement marks the first non-wimpy core chip to officially break cover.

You might recall the saga of the Cavium Thunder X 48-core SoC from a few years back. That part had lots of in-order ‘wimpy’ cores coupled to lots and lots of accelerators and I/O; in short, it was a throughput and networking chip. Thunder was aimed at four markets: networking, storage, security, and compute. Because of the ‘wimpy’ cores it was widely seen as really only fit for the first three. For some compute tasks it could do well, but those were not the majority of the market.

That all changes with today’s Thunder X2 (TX2) because the cores have been massively upgraded to out-of-order (OoO) ‘beefy’ ARM v8.2 cores. The core count also goes up from 48 to 54, but this is relatively minor compared to what Cavium claims the new cores can do. The official number is 2-3x more performance in the same power envelope, a gap that can’t be attributed to the 12.5% core count increase. The clocks also go up, but the increase from 2.5 to 2.6GHz is again not a performance multiplier. For those interested in details, each core has 64K I$ and 40K D$.
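To put rough numbers on that argument, here is a minimal Python sketch. It assumes performance scales perfectly linearly with both core count and clock speed, which is an idealization that flatters the simple explanation, and the function names are made up for illustration. Only the 48 and 54 core counts, the 2.5 and 2.6GHz clocks, and the 2-3x claim come from the figures above.

```python
def combined_scaling(old_cores, new_cores, old_ghz, new_ghz):
    """Best-case speedup from more cores and higher clocks alone.

    Assumes perfect linear scaling with both, an idealization.
    """
    core_ratio = new_cores / old_cores
    clock_ratio = new_ghz / old_ghz
    return core_ratio * clock_ratio


def implied_per_core_gain(claimed_speedup, scaling):
    """Speedup left over that the cores themselves must deliver."""
    return claimed_speedup / scaling


if __name__ == "__main__":
    scaling = combined_scaling(48, 54, 2.5, 2.6)
    print(f"cores + clocks alone: {scaling:.2f}x")
    for claim in (2.0, 3.0):
        gain = implied_per_core_gain(claim, scaling)
        print(f"claimed {claim:.0f}x -> implied per-core gain {gain:.2f}x")
```

Under those assumptions, cores and clocks together account for about 1.17x, which leaves roughly 1.7x to 2.6x to come from the new core design itself if the 2-3x claim holds.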

Cavium Thunder X2 die plot

The new core marked with yellow lines

This performance upgrade puts Cavium in a position to attack a large part of the compute market that was previously unaddressable. If you don’t have a clear understanding of the market and what I mean by this, read this and this; they define what server competition will be for the next few years. With the new cores, Cavium raised the bar significantly. If you don’t think they are going after the mainstream compute market, look at the growth in cache sizes from the Thunder X (TX) to the TX2: 16MB to 32MB. If you want to be general purpose you need big caches. The TX didn’t have them, the TX2 does. Coincidence? Also see the Octeon TX for more on this front.

Moving out from the cores, things get a little fuzzy because Cavium is not disclosing details about the SoC yet, so info comes in broad strokes. As you would expect, the normal swathe of accelerators is present in the TX2, but there are no new classes thereof, just iterations. This doesn’t imply a lack of serious improvements; they are there too, but they are changes of detail rather than macro changes.

Cavium Thunder X2 block diagram

The broader SoC has a lot of goodies

Starting out with the macro level features of the SoC, we have PCIe3 x16 slots, something the old TX could not support; it had the lanes but could not do a single x16 channel. Cavium strongly hinted at multiple x16 slots, so expect a large bandwidth increase here. Cavium’s telco and networking roots meant their SoCs tended toward an overabundance of I/O, and TX2 ups the ante there too. Ethernet is listed as 10/25/40/50/100GbE, with multiple 100GbE streams again hinted at. 25Gb PHYs are obviously there, and the ability to support ‘non-standard’ configs like 25 and 50GbE suggests a very flexible control arrangement. This should not surprise anyone following the company; I/O is the core of their competence.

Similarly, SATA is listed as “Multiple SATA v3” ports, with multiple probably being a big number. Since the TX2 is aimed at the same four market segments as the TX and one of those is storage, expect a lot of ports with support for many more via PCIe. SemiAccurate expects the TX2 to be capable of a lot more high level functions on the storage streams; encryption, RAID, and the rest are effectively solved problems in this class of device.

All of this throughput needs the aforementioned I/O, CPU cycles, cache, and accelerators, but those all need memory bandwidth to work at speed. If you compare the bandwidth of the I/O to that of modern memory systems, the memory is the slow point. Intel has elegantly solved this with their DDIO block starting with Romley, and we expect Cavium to have similar functionality to keep DRAM from being a choke point. This hasn’t been a problem in the past so we don’t expect it to be here, but we will know more when the chip is detailed later on. That said, the TX2 raises the bar from four to six DDR4 channels.
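A back-of-envelope Python sketch shows why memory gets tight once multiple 100GbE streams are in play. Cavium has not disclosed memory speeds, so DDR4-2400 with standard 64-bit channels is purely an assumption here, as is the simple store-and-forward model in which every packet byte is written to DRAM once and read back once. The function names are hypothetical.

```python
def ddr4_peak_gbs(channels, mt_per_s=2400):
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes 64-bit (8-byte) channels; DDR4-2400 is an assumed speed,
    not a disclosed TX2 spec.
    """
    return channels * mt_per_s * 8 / 1000


def ethernet_gbs(link_gbps):
    """Line rate of one Ethernet link in GB/s, one direction."""
    return link_gbps / 8


def streams_to_saturate(channels, link_gbps=100, dram_touches=2):
    """Streams at line rate whose DRAM traffic equals peak bandwidth.

    dram_touches=2 models one write plus one read per packet byte,
    i.e. no DDIO-style cache injection. This is an assumed model.
    """
    return ddr4_peak_gbs(channels) / (ethernet_gbs(link_gbps) * dram_touches)


if __name__ == "__main__":
    for ch in (4, 6):
        print(f"{ch} channels: {ddr4_peak_gbs(ch):.1f} GB/s peak, "
              f"{streams_to_saturate(ch):.1f} x 100GbE streams to saturate")
```

Peak figures are never reached in practice and the CPU cores need memory bandwidth too, so the real headroom is smaller than this suggests. That is the case for keeping packet data in cache where possible.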

All of these features are nice, but the markets that Cavium is aiming at care about two things: minimum single threaded performance and TCO. Assuming the TX2 can meet the minimum performance criteria for a workload, that leaves TCO as the battleground. Price, performance, manageability, energy use, and software all play a key role here, but we will focus on the last three for the moment. It is too early to talk price and performance; Cavium put out some numbers, but we will wait until the chip is closer to release before repeating them.

Back to manageability, effectively table stakes for the datacenters and markets Cavium plays in. Since they already have products in these markets you can safely assume the company knows what it needs to do, but generic compute adds a different set of needs. For this Cavium is using AMI (American Megatrends) and their AptioV management products. This tried and true management standard should fit nicely into the infrastructure most customers have in place, and a standard UEFI system reduces integration headaches.

On the software side, the massive gains in ARM compatibility for Linux mean everything you need should be there. Ubuntu has been 99.9(big number here)% ARM clean for years now, and other relevant commercial distros are in the same boat. There is a big difference between compiling and running cleanly and being hardened to the liking of data centers, but SemiAccurate feels the last two years were more than enough to get that part done. We consider software to be there on the open source side and not yet announced on the proprietary side.

The biggest software unknown on the ARM server front is virtualization. Linux has had KVM baked in for years now and it is a known and capable tool. TX2 supports virtualization, and Cavium claims it is effectively feature competitive with x86 and ahead in a few areas. A good example of this is buffer management: software on x86, hardware in Thunder. Similarly there is a DPDK engine in hardware, one of those accelerators we mentioned earlier. By the time TX2 is released x86 may have caught up and surpassed all of Cavium’s features, but the take home message should be that there is hardware virtualization support and it is not a sparse feature set.

That brings us to the last bit, power management. Cavium is claiming to have put a lot of effort into power management this generation. Core frequency and voltage scale of course, and they can do so on a per-core basis. Again this is table stakes for a modern many-core SoC and Cavium is there now. More interestingly, they can scale the voltage and frequency of the uncore as well, but Cavium suggested most customers won’t do this.

Why? Think about potential latency hiccups when sucking in multiple 100GbE streams, processing them, and spitting them out with the lowest possible latency. How much time does a frequency change for the interconnect take, and how many dropped packets is that? It is hard to argue with their position on this point; networking at this level is quite finicky. A move to 14FF from an unnamed foundry should help with leakage/idle power too, with a claimed 30% efficiency increase.
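The question about frequency changes and dropped packets can be roughed out in a few lines of Python. Cavium gave no transition time for the uncore, so the 10 microsecond stall below is purely a hypothetical placeholder, and the function name is made up for illustration. The 20 bytes of per-frame overhead are the standard Ethernet preamble, start delimiter, and inter-frame gap.

```python
def packets_during_stall(link_gbps, frame_bytes, stall_us, overhead_bytes=20):
    """Frames arriving on one link at line rate during a stall.

    frame_bytes is the full Ethernet frame (64 minimum, 1518 standard
    maximum). overhead_bytes covers preamble, SFD, and inter-frame gap.
    stall_us is a hypothetical figure; no real transition time is known.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    packets_per_sec = link_gbps * 1e9 / bits_on_wire
    return packets_per_sec * stall_us * 1e-6


if __name__ == "__main__":
    stall_us = 10  # assumed placeholder, not a Cavium number
    for frame in (64, 1518):
        n = packets_during_stall(100, frame, stall_us)
        print(f"{frame}-byte frames: ~{n:.0f} packets per 100GbE link "
              f"in {stall_us} us")
```

With minimum-size frames that works out to nearly 1,500 packets per link that must be buffered or dropped during the assumed stall. Multiply by the number of links and it is easy to see why customers in this space would leave the uncore clocks alone.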

We will leave you with two other technical tidbits about the Thunder X2. First is that they natively support type P NVDIMMs on this chip. We don’t know of anyone supplying those memory modules but since it is in hardware you can assume someone will be making them and someone buying them. Nothing gets into hardware like the Thunder without a firm vision of the intended customer.

Another nicety if you are in the large data center space is the ability to directly connect to LR (Long Reach) fiber on chip. This simplifies device design and drops cost for things like backplanes and even some external connectivity options. LR isn’t a game changer but it does add a bit to the TCO side of the story, and every little bit helps. S|A

Charlie Demerjian

Roving engine of chaos and snide remarks at SemiAccurate
Charlie Demerjian is the founder of Stone Arch Networking Services and SemiAccurate, a technology news site addressing hardware design, software selection, customization, securing and maintenance, with over one million views per month. He is a technologist and analyst specializing in semiconductors, system and network architecture. As head writer of SemiAccurate, he regularly advises writers, analysts, and industry executives on technical matters and long lead industry trends. Charlie is also a council member with Gerson Lehman Group.