AMD FINALLY STARTED to publicly talk about Magny-Cours and socket G34 during the Hot Chips 21 conference. The socket has a lot of complexities, so for now, we will only take a look at the interconnects, both on chip and off.
Magny-Cours is the CPU itself, a 12-core MCM that consists of two Istanbul 6-core CPUs. Each core has 512KB of L2 cache, and the package has 12MB of L3, half of which is on each die. The package also has four HT links and four channels of DDR3/1333. Clock speeds were not revealed, but AMD hinted at a downclock of roughly 25% compared to Istanbul.
Magny-Cours die, or Istanbul if you squint
If you look at Istanbul, the first thing you will notice is that it has exactly half those specs, with the exception of the HyperTransport (HT) link count. Istanbul has three, Magny-Cours has four, not the expected six. That is because two links are used to connect the two dies internally. Sort of.
Magny-Cours MCM and links
Here is where the fun begins, with the MCM itself. The red links are memory channels, two per die, four per socket. Green, blue and grey are all HT, with wide lines representing 16-bit links and narrow ones 8-bit links. It doesn’t take much to realize that things are complex here.
The wide green link off the bottom is the external I/O, basically the connection to the chipset, one per socket. Actually, since you can “ungang” the 16-bit HT link into two 8-bit HT links, you could theoretically put two chipsets off of one socket. That said, this is very unlikely to happen; it is much easier to add one chipset off each socket.
This link is non-coherent HT (ncHT), meaning that it can’t be used for CPU to CPU interconnects. All of the other links, blue and grey, are cache-coherent HT (ccHT). If you are sharp eyed, you will notice that the blue ccHT links between the dies on package are different widths.
The ‘extra’ link is extra for a good reason, but more on this later. AMD added it to the mix because it could, more or less for free. It increased the bandwidth between the cores by 50%, but real world performance does not go up by much because the cores are rarely bandwidth bound.
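The 50% figure falls out of the link widths alone. A minimal sketch of that arithmetic, assuming both on-package links run at the same clock so that bandwidth scales with width (the constant names are made up for illustration):

```python
# Link widths in bits, as described for the on-package ccHT links.
WIDE_LINK_BITS = 16
NARROW_LINK_BITS = 8

# Assumption: both links run at the same clock, so bandwidth scales with width.
baseline = WIDE_LINK_BITS
with_extra = WIDE_LINK_BITS + NARROW_LINK_BITS
gain = with_extra / baseline - 1

print(f"die-to-die width: {baseline} -> {with_extra} bits (+{gain:.0%})")
```

This only compares widths; it says nothing about the actual clock of the on-package links, which AMD did not disclose.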
Of more interest is that, because these links run between dies soldered to the package, not through a socket, they are of a set length and made of known materials. AMD took some liberties here, basically because it can keep tolerances much tighter; it upped the bandwidth on these links a lot. Unfortunately, AMD did not say how much. The links are ccHT like all the others, just notably faster and likely lower latency as well.
In a four socket system using the ‘old way’ AMD did things, that is, a square, two of each chip’s three HT links were used to connect it to its two neighbors. A chip in the top left would be connected to the one on the right, and the one below, but not the socket diagonally across. The third was not used.
Diagonal connections could be done, but rarely if ever were. Instead, the third ccHT connection was used to connect two 4-way squares to make an 8-way system. While this was a good thing for packing more CPUs into a box, it was hobbled by the latency caused by multiple hops across HT links. CPU 0 loading from CPU 7’s memory might need four hops to get to the data and four hops to get back. Add in cache coherency, and you had those hops bringing the whole system to its knees.
The way around this is to directly connect each socket to every other one in the system. On a two socket box, that is easy, you just connect point A to point B. On a four socket, you make a square with an X in the middle, exactly what AMD traded off to allow for eight sockets on socket F and before.
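To put numbers on those hop counts, here is a rough sketch that models each layout as a graph and counts hops with a breadth-first search. The socket numbering and the 8-way "ladder" wiring are assumptions made for illustration, not AMD's documented board layouts.

```python
from collections import deque
from itertools import combinations


def hops_from(adj, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def build(n, edges):
    """Undirected adjacency sets for nodes 0..n-1."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def worst_case_hops(adj):
    """Largest hop count between any pair of nodes."""
    return max(max(hops_from(adj, s).values()) for s in adj)


# Four sockets in a square: 0-1 on top, 2-3 below, no diagonals.
square = build(4, [(0, 1), (1, 3), (3, 2), (2, 0)])

# The same square with the "X" added: every socket linked to every other.
square_x = build(4, combinations(range(4), 2))

# Hypothetical 8-way layout: two squares side by side, forming a 2x4 ladder.
# Top row is 0-1-2-3, bottom row is 4-5-6-7, with a rung under each top node.
ladder = build(8, [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7),
                   (0, 4), (1, 5), (2, 6), (3, 7)])

print("square, worst case:     ", worst_case_hops(square))
print("square + X, worst case: ", worst_case_hops(square_x))
print("ladder, CPU 0 to CPU 7: ", hops_from(ladder, 0)[7])
```

In this model no node uses more than three links, the plain square tops out at two hops, the cross-connected square at one, and the ladder needs four hops to get from CPU 0 to CPU 7.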
With the new socket G34, AMD did just that. The grey 8-bit ccHT link is basically a diagonal link, the X in the square. If there was only one die in each socket, that would work wonderfully, problem solved! Unfortunately, G34 has two dies per socket, and they are connected to the two dies in the other socket using one of those 8-bit links per die.
On a two socket system, the links directly between the dies are 16-bit and the ones going diagonally are 8-bit. Since there are two full ccHT links per die, there are four per socket to connect everything on a four socket system. To connect between sockets, you don’t need the full bandwidth that a 16-bit HT link brings.
In the end, each socket is connected to every other socket directly, but every die is only connected to every other die on a two socket system. The worst case in G34 is to have any die two hops away from any other die. It all looks like this.
Socket diagrams for two and four sockets
If you think this looks like a mess, then send flowers to the AMD engineers who had to write the routing algorithms to make it all work, and work perfectly. On a more theoretical level, the 2-way G34 is the same as the older socket F 4-way with a cross connect. The G34 4-way is like the 2-way, but extended into the Z-plane.
The scheme that AMD uses for connections on a four socket system requires three 8-bit links per die, which works out to six 8-bit links or three 16-bit links per socket. Adding a fourth link would take die space and add a lot of complexity to the routing plus many more pins on the package. The G34 already has 1944 pins, the most we are aware of in large scale production, and adding to that for minimal benefit is not a good idea.
To fully connect this four socket G34, you would need that fourth 16-bit HT link. P1 needs to connect to P2 and P7, and the cost/performance tradeoff wasn’t enough to justify another link. Maybe in socket G44. In any case, two hops is a lot better than the four it used to take.
AMD is quoting that the average number of hops used to get to memory, diameter in AMD speak, is 0.75 for a two socket system, 1.25 in a four socket. In order to alleviate the problem of cache snoop traffic, AMD put in something called a snoop filter, termed HT assist in AMD speak. It is more complex when you combine it with the G34 topologies, so complex that we will have to cover it in a different article.
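The 0.75 and 1.25 figures can be reproduced with a small graph model. This is a sketch under two assumptions that AMD did not spell out: the average counts a die reaching its own memory as zero hops, and on four sockets each die links to its on-package sibling plus the same-numbered die in each of the other three sockets. That wiring is a guess that satisfies the three-links-per-die constraint, not necessarily the actual G34 routing.

```python
from collections import deque
from itertools import combinations


def hops_from(adj, start):
    """Breadth-first search: hop count from start to every reachable die."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def g34_two_socket():
    """Four dies, named (socket, die), every die linked to every other."""
    dies = [(s, d) for s in range(2) for d in range(2)]
    adj = {die: set() for die in dies}
    for a, b in combinations(dies, 2):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def g34_four_socket():
    """Eight dies. ASSUMED wiring: sibling link inside each package, plus a
    link to the same-numbered die in each of the other three sockets."""
    dies = [(s, d) for s in range(4) for d in range(2)]
    adj = {die: set() for die in dies}
    for a, b in combinations(dies, 2):
        same_socket = a[0] == b[0]
        same_index = a[1] == b[1]
        if same_socket or same_index:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def hop_stats(adj):
    """Return (worst-case hops, average hops), local access counted as 0."""
    all_hops = [h for die in adj for h in hops_from(adj, die).values()]
    return max(all_hops), sum(all_hops) / len(all_hops)


two_worst, two_avg = hop_stats(g34_two_socket())
four_worst, four_avg = hop_stats(g34_four_socket())
print(f"2 sockets: worst {two_worst} hop(s), average {two_avg}")
print(f"4 sockets: worst {four_worst} hop(s), average {four_avg}")
```

Under those assumptions each die on a four socket system has four neighbors one hop away and three that are two hops away, which is where an average of 1.25 and a worst case of two come from.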
That is not to say that functionality drops off. The metrics AMD used to demonstrate this are DRAM bandwidth and Xfire (Crossfire, but not that ATI Crossfire) bandwidth. DRAM bandwidth is just the aggregate memory bandwidth of the system. Xfire bandwidth is how much memory bandwidth is available when each core is reading from all the other cores in a round-robin fashion.
A two socket G34 system has a DRAM bandwidth of 85.6GBps and a Xfire bandwidth of 71.7GBps. On a four socket system, those numbers increase to 170.4GBps and 143.4GBps, almost exactly double. This makes sense for DRAM, but to do it on the Xfire side is a lot more impressive. It looks like the interconnects did what they were supposed to do, and adding a fourth link would not be worth the cost.
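A quick check of that scaling, using only the four figures quoted above. The variable names are made up for illustration, and the Xfire-to-DRAM ratio is a derived number, not something AMD presented.

```python
# Bandwidth figures quoted by AMD, in GB/s, keyed by socket count.
dram = {2: 85.6, 4: 170.4}
xfire = {2: 71.7, 4: 143.4}

# How much each metric grows when going from two sockets to four.
dram_scaling = dram[4] / dram[2]
xfire_scaling = xfire[4] / xfire[2]

# Fraction of aggregate DRAM bandwidth that survives the cross-die pattern.
xfire_share = {n: xfire[n] / dram[n] for n in dram}

print(f"DRAM scaling 2P -> 4P:  {dram_scaling:.3f}x")
print(f"Xfire scaling 2P -> 4P: {xfire_scaling:.3f}x")
for n in sorted(xfire_share):
    print(f"{n} sockets: Xfire is {xfire_share[n]:.1%} of DRAM bandwidth")
```

Both metrics come out within about half a percent of 2x, and the share of DRAM bandwidth left under the Xfire pattern barely moves between two and four sockets.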
In the end, AMD fixed the biggest problem with multiple sockets, latency. Part of that was the probe filter in Istanbul, but a much more important step was the new interconnect scheme. It isn’t fully interconnected, but socket G34 is very close. For what is effectively an eight socket system, packaged into four MCMs, AMD seems to have done a very good job.