Part 1 of the series on Trinity covered the core itself, and the changes from Bulldozer to Piledriver. Part 2 looks at some of the surrounding pieces of the chip, mainly the interconnects and uncore. Part 3 is mainly about the GPU and related parts.
The GPU itself – Been there, done that:
By the time we look at the GPU itself, it is, well, a bit underwhelming, but that is about what you would expect from the spiritual successor to integrated graphics. While Llano shed the abjectly awful reputation of its chipset graphics forebears, Trinity moves the bar up again, but won’t threaten a top end GPU. Unfortunately, it is still based on a ‘last gen’ architecture.
Yes, Trinity may be labeled HD7000, but it is anything but 7000-class in architecture. Real 7000-generation GPUs use GCN shaders, scalar and vector units rather than VLIW. Trinity uses the same VLIW4 shaders found in the HD6900 GPUs, not the older VLIW5 units of the lesser 6000 series parts. Not that this is a bad thing, the HD6900 parts were no slouches, but GCN shaders would have been welcome here.
Trinity has 384 of these shaders, fewer than Llano’s 400, but Llano used the older VLIW5 shaders. Llano has five clusters of 80 VLIW5 shaders; Trinity has six clusters of 64 VLIW4 shaders, the functional equivalent of 480 Llano shaders. Trinity also clocks them faster than Llano by a considerable margin, 10% up at base, 50+% up with turbo. This accounts for much of Trinity’s claimed 40% lead in GPU performance.
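The back-of-envelope math works out like this. A quick sanity check using only the numbers above; note that the VLIW4-to-VLIW5 cluster equivalence is the claim being checked, not a measured result:

```python
# Shader and clock figures quoted above; the cluster equivalence is the
# article's claim, not a benchmark.
llano_shaders = 400            # 5 clusters x 80 VLIW5 lanes
trinity_clusters = 6           # each treated as equal to an 80-lane Llano cluster
trinity_equiv = trinity_clusters * 80   # 480 Llano-equivalent shaders

shader_gain = trinity_equiv / llano_shaders   # 1.2x the effective shaders
base_clock_gain = 1.10                        # ~10% higher base clock

# Roughly 1.32x at base clocks; GPU turbo makes up the rest of the
# distance to the claimed ~40% lead.
combined = shader_gain * base_clock_gain
```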
One thing that Trinity does do differently is pull in the UVD video decode block from the current HD7000 GPUs. This is a welcome advance, mainly because it adds dual HD stream decoding to the picture, pun intended, along with MPEG-4 and the rather pointless ‘3D’ BD specs. The sooner this noxious 3D fad dies off, the better for humanity, but it is supported in hardware on AMD CPUs now. Yay?
Far more important to the whole concept of Trinity is how the GPU talks to the outside world. As we mentioned above, it talks the same language as x86 memory controllers, that is to say really backwards and broken, but still the same. The CPU MMU can do full address translations for the integrated GPU, and external GPUs can do atomic memory ops across PCIe.
Since the GPUs can use the same nested page tables as the CPU cores, they are fully virtualizable from the beginning, meaning Trinity is about the perfect chip for services like OnLive. Add in a Page Request Interface for handling page faults, and you have about as much integration as you could hope for in an on-die GPU. The only thing better would be to integrate the two MMUs, but as we said earlier, not until next year.
In order to get things off the CPU, Trinity has four display controllers powering three, sort of, display outputs. Since Trinity supports Displayport 1.2, you can chain multiple displays off a single controller so three outputs is just fine for four screens. This becomes a lot more interesting when you realize that the physical layer of Displayport is basically PCIe.
Trinity has 24 lanes of PCIe and one hard digital display out. You can take the PCIe lanes and use them as a display output if you want, quite handy on a laptop, less so on a desktop. In short, if you want three digital outs on your laptop, no problem. On a desktop, you might have to give up 8x or even 16x PCIe lanes.
Then again, anyone serious enough about graphics to want a 16x slot over an 8x slot will probably not buy a Trinity to use the internal GPU. This is the long way of saying that AMD has some really clever engineers working on their CPUs, and the tradeoffs they made were pretty solid. In a laptop, it is exactly the right thing to do. On a desktop, there might be some unhappy corner cases, but the overwhelming majority of buyers get what they need.
Power, power, power – Less is more:
When we began this look at Trinity’s architecture, we mentioned that the raw performance improvement for the CPU core was, well, not massive. What is massive is the claimed doubling of performance per watt, something that most people have trouble putting into perspective. It means a 17W Trinity performs on par with a 35W Llano, and that chip was no slouch. How did AMD do it?
Once again, like the CPU core performance gains, there was no big bang, just lots of little changes that added up. One thing to keep in mind is that this truly massive advance in energy savings comes on the same process as Llano. Interestingly, Llano was a 1.45 billion transistor, 228mm^2 chip; Trinity is only 1.303 billion transistors on 246mm^2.
The transistor density speaks volumes about the reasons for yield problems, or lack thereof, and the 10% drop in transistor count speaks equally loudly about the efficiency of the chip. Trinity is a marvel of efficiency, and it comes from many places.
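Run the quoted die figures through a calculator and the density point becomes obvious. A quick sketch using only the numbers above:

```python
# Die figures quoted above: transistor counts and die areas.
llano_transistors, llano_area_mm2 = 1.45e9, 228
trinity_transistors, trinity_area_mm2 = 1.303e9, 246

llano_density = llano_transistors / llano_area_mm2        # ~6.36M per mm^2
trinity_density = trinity_transistors / trinity_area_mm2  # ~5.30M per mm^2

count_drop = 1 - trinity_transistors / llano_transistors  # ~10% fewer transistors
density_drop = 1 - trinity_density / llano_density        # ~17% lower density
```

Roughly 17% looser packing on the same 32nm process is exactly the sort of thing that eases yield headaches.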
The first big change is the core, replacing the K10.5/Husky core with Piledriver, something that probably took quite a few fewer transistors. Bulldozer was known for its rather tepid efficiency, both in energy use and performance per watt. Piledriver unquestionably brings world class power efficiency to the chip, but the raw single threaded performance is still a bit lacking.
Probably the biggest changes to the core are in the energy saving bits, not areas that boost IPC. Bulldozer could put a module into C6 sleep, basically hard power gating it. Piledriver adds Package C6 (PC6), putting the whole chip to sleep the same way Core C6 (CC6) does a module, so power can be shut off to the chip just like Llano did to a core. The GPU also gains turbo, which means it has caught up to the modern age for power management as well.
Additionally, Piledriver can flush the caches, a big part of entering CC6, much more efficiently than Bulldozer or Llano. The faster this happens, the longer power can stay off, and the more power is saved. Similarly, PC6 means the UNB is power gated, and the GPU can be as well, but we are not completely sure whether it is actually hard power gated or just clock stopped. It is definitely voltage dropped, and the UVD block can be hard power gated, but whether the shaders are gated is a bit murky.
Llano has a power microcontroller, and it does a really good job. The controller in Trinity is vastly improved and much more capable. It is not only faster, but more precise too, all while acting on more inputs than the older variants in Llano and Bulldozer. We are somewhat sworn to secrecy about the internal workings, but we are convinced that it is a huge step forward, and that is backed up by the end result.
Couple this with some interesting memory tricks to save power, and you have the ability to drop power on more than just the CPU/SoC, something that Llano wasn’t nearly as capable of. Trinity can also dynamically ramp DRAM speeds up and down, something that saves huge amounts of power at low loads and idle. If you look back to how much energy was saved when GPUs started to do this, you get a very good idea about the power this can save in Trinity. Similarly, it can slow down PCIe clocks, and hard power off lanes as needed to save yet more power. Since Displayport uses PCIe physical layers, it too can be narrowed, slowed, or powered down as needed.
One unique trick in Trinity is called frame buffer compression. When the PC has a static screen, the frame buffer is copied from its normal location, spread across both memory channels, onto just one. The second memory channel is powered down completely, and only the one holding the screen image is kept awake. This cuts memory power in half by doing something that was previously impossible, quite a neat trick. If you add in on-chip buffering of the display, active backlight control, and every other trick in the GPU energy savings book, you end up with a very efficient chip.
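A toy model makes the channel-parking idea concrete. Everything here is illustrative, the class names and watt figures are invented for the sketch, and this is emphatically not AMD’s actual power management code:

```python
# Hypothetical per-channel idle (self-refresh) power, in watts.
CHANNEL_IDLE_W = 0.5

class Channel:
    """Toy DRAM channel: holds scanlines, can be hard powered down."""
    def __init__(self, data):
        self.data = list(data)
        self.powered = True

    def power(self):
        return CHANNEL_IDLE_W if self.powered else 0.0

def park_framebuffer(channels):
    """Screen went static: gather the interleaved buffer onto channel 0,
    then shut channel 1 off completely."""
    channels[0].data += channels[1].data
    channels[1].data = []
    channels[1].powered = False

# Scanlines interleaved across two channels, as in normal operation.
chans = [Channel([0, 2]), Channel([1, 3])]
before = sum(c.power() for c in chans)   # 1.0 W with both channels awake
park_framebuffer(chans)
after = sum(c.power() for c in chans)    # 0.5 W, memory power halved
```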
What you end up with:
In the end, Trinity isn’t that much faster than Bulldozer or Llano in single threaded performance. It is a far cry from Intel’s latest cores there too, but handily beats them in GPU performance. On top of this, early indications show that Trinity is notably more thrifty with energy than its Intel competition when doing real world work.
Getting there wasn’t a big change, or a series of big changes, it was dozens and dozens of little changes, each component using a few tenths of a percent less energy than its predecessor. Llano had a fairly normal power efficiency curve, and Bulldozer was nothing unusual either. Trinity changes things: if you plot power against voltage, you get the flattest curve SemiAccurate has seen, not the usual arch.
This, along with all the performance tweaks, power savings, and transistor efficiency advances, leads to one thing, a chip that is world class at what it does. The flat curve, along with quicker and more comprehensive powering off, means that Trinity can take advantage of the HUGS (Hurry Up and Go to Sleep) philosophy. It uses high clocks to get the work done, then turns off.
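The arithmetic behind HUGS, better known as race-to-idle, is simple enough to sketch. The watt figures below are invented for illustration; the point is only the shape of the tradeoff, where finishing fast and gating hard beats crawling along warm:

```python
WORK_UNITS = 10.0   # fixed amount of work to finish in a 10 second window

def window_energy(active_w, idle_w, perf, window_s=10.0):
    """Run flat-out until the work is done, then idle out the window."""
    busy_s = WORK_UNITS / perf
    return active_w * busy_s + idle_w * (window_s - busy_s)

# Slow and steady: low clocks, never finishes early, leaky while "resting".
slow = window_energy(active_w=5.0, idle_w=2.0, perf=1.0)    # 50.0 J

# HUGS: burn hot at high clocks, finish in 2.5s, then power gate hard.
hugs = window_energy(active_w=10.0, idle_w=0.1, perf=4.0)   # 25.75 J
```

The trick only pays off if the idle state is genuinely deep, which is exactly why the CC6/PC6 and gating work described above matters.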
Trinity is faster than its predecessors in clock rate, performance per clock, performance per watt, and just about every other metric. More importantly, it is just plain faster, with no per-anything caveats. This allows AMD to not only break into markets that were previously closed, but gives them a better part than the competition in many areas. What’s not to like?S|A
by Charlie Demerjian