AMD let the new ‘Cat’ out of the bag with the Jaguar core

Hot Chips 2012: Bobcat MkII is a nice step forward

At Hot Chips today, AMD is revealing some of the details about their new Jaguar core, the successor to Bobcat, the first of their smaller ‘Cat’ family of cores. Take four of these second generation low power cores, add a GPU, and off you go to power lower end laptops and higher end tablets.

Those among you who follow chips closely might realize that Jaguar was not meant to be the second iteration of the smaller ‘Cat’ family of CPUs and APUs; that was supposed to be the Wichita and Krishna twins. As SemiAccurate exclusively pointed out last November, those fraternal twins were canned for a number of reasons and the Jaguar cored Kabini was pulled in. Today AMD starts the long process of revealing this chip, but things like the speeds, die areas, and most of the uncore were not revealed. In fact, AMD didn’t even mention that it is a 28nm part.

So what did they talk about? Jaguar supports one to four cores, has up to 2MB of L2 cache, and supports most of the latest instruction sets. The x86 additions that Jaguar adds to the Bobcat baseline are SSE4.1, SSE4.2, AES, CLMUL, MOVBE, AVX, XSAVE, XSAVEOPT, F16C, and BMI1. It also now supports 40 bits of physical address space, which works out to 1TB, but the chip only has one memory channel. AMD didn’t say that, but SemiAccurate’s moles are pretty adamant that this is the case, so good luck finding a matched pair of 512GB DDR3 DIMMs to test with.
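For readers who want to check a given box for the additions listed above, here is a minimal sketch. It assumes a Linux system, and the mapping from the instruction set names to the `/proc/cpuinfo` flag strings (for example CLMUL showing up as `pclmulqdq`) is part of the sketch, not something AMD presented. The function names are hypothetical.

```python
# Instruction set additions named in the article, mapped to the flag strings
# Linux reports in /proc/cpuinfo. The mapping is an assumption of this sketch.
JAGUAR_ADDED_FLAGS = {
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
    "AES": "aes",
    "CLMUL": "pclmulqdq",
    "MOVBE": "movbe",
    "AVX": "avx",
    "XSAVE": "xsave",
    "XSAVEOPT": "xsaveopt",
    "F16C": "f16c",
    "BMI1": "bmi1",
}


def missing_features(flags_line):
    """Return the listed extensions that are absent from a cpuinfo flags string."""
    present = set(flags_line.split())
    return [name for name, flag in JAGUAR_ADDED_FLAGS.items() if flag not in present]


def physical_address_space_bytes(bits):
    """Bytes addressable with the given number of physical address bits."""
    return 1 << bits


def read_linux_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line of /proc/cpuinfo (Linux only), or ''."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""
```

On a Linux box, `missing_features(read_linux_flags())` lists whatever the CPU lacks from that set, and `physical_address_space_bytes(40)` gives the 1TB ceiling mentioned above.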

AMD Jaguar core diagram

The Jaguar core; color coded by function and not to scale

The Jaguar core has three main areas: a green front end, an orange integer unit, and a blue FP unit, with the other caches and L2 connections in red or blue. This diagram is very similar to the Bobcat one; from this level the differences are pretty minor. That said, it is an all new design, and everything was updated, enhanced, and optimized. Let’s look at them section by section.

Jaguar adds an Instruction Cache (IC) loop buffer, basically a small buffer that can be read to save time and energy when there are tight loops in the code. There are four of these, and each holds 32B. Like Bobcat before it, the L1 IC is still 32KB 2-way set associative and it is still located pre-decode. The prefetcher is also enhanced to look farther down the data stream, but it does not share the decoupled logic of Bulldozer; that is likely out of Jaguar’s power budget.
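To get a feel for how small a loop has to be to live in those buffers, here is a rough sketch. AMD only gave the count and size (four buffers of 32B), so the assumptions that each buffer holds one 32B-aligned chunk of code and that a loop is only captured when every chunk it touches fits are ours, as are the function names.

```python
ENTRY_BYTES = 32  # size of each loop buffer, per the article
ENTRIES = 4       # number of loop buffers, per the article


def chunks_spanned(start, length, chunk=ENTRY_BYTES):
    """Number of aligned chunks touched by `length` bytes of code at `start`."""
    first = start // chunk
    last = (start + length - 1) // chunk
    return last - first + 1


def fits_loop_buffer(start, length):
    """True if the loop body fits, under the aligned-chunk assumption above."""
    return chunks_spanned(start, length) <= ENTRIES
```

Under those assumptions a 128B loop starting on a 32B boundary fits, while the same loop starting 16B later spills into a fifth chunk and does not, which is one reason compilers like to align hot loops.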

The last big change is an additional pipeline stage in the decoder that brings some frequency headroom to the core, not that AMD needed much more for the power envelope they were targeting. That said, more never hurts. Instruction Buffers have also been enlarged to avoid stalls. This is especially useful when frequencies rise relative to I/O. Overall, Jaguar’s front end can issue two instructions per clock, can retire two instructions per clock, and all the internal plumbing is at least two instructions wide, sometimes much wider in specific areas.

Luckily for coders, Jaguar has two ALU pipelines to process those instructions on the integer side, so no bottlenecking here. The Int unit also has a Load Address Generation Unit (LAGU) as well as an independent Store AGU (SAGU). This is exactly the same as Bobcat; the differences are in how each unit executes those instructions. Of these optimizations the most comprehensive is the divide unit, which is more or less lifted from Llano.

This allows Jaguar to speed up divides from one bit processed per clock to two, a huge increase if you have divide heavy code. Several other ‘complex’ operations were similarly enhanced, although most don’t get near the 2x speedup that divides saw. More generally applicable are the Out of Order resources, which have been expanded to better support the increased IPC of Jaguar. That means a Scheduler that can handle more entries, and larger reorder buffers. Other than that, the general layout is what you know and love with Bobcat.
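The one-bit versus two-bit divide difference is easy to see in a toy model. The sketch below is a plain digit-by-digit integer divider that retires a configurable number of quotient bits per step; it is not AMD’s circuit, the 32-bit width is an assumption, and real hardware adds fixed overhead on top of the step count.

```python
def divide(dividend, divisor, width=32, bits_per_step=1):
    """Digit-by-digit unsigned division; returns (quotient, remainder, steps)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")
    mask = (1 << bits_per_step) - 1
    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & mask)
        # Pick the largest quotient digit that still fits in the remainder.
        digit = 0
        while digit < mask and (digit + 1) * divisor <= remainder:
            digit += 1
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1
    return quotient, remainder, steps
```

With `bits_per_step=1` a 32-bit divide takes 32 steps in this model, with `bits_per_step=2` it takes 16, and both return the same quotient and remainder.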

On the FP side, things are very different. Like Bobcat, the decoder can still issue two instructions per clock to two FP pipelines. Bobcat had 64-bit wide FP units, so 128-bit SSE instructions had to be processed with two passes through that pipe. This hurt performance but took less die area. With the shrink to 28nm, Jaguar’s cores have a lot more area to play with, so the FP pipes were widened to 128 bits for one pass SSE execution. Unfortunately for the architects, Jaguar supports the AVX instruction set, and that means 256-bit FP operations. To do those, Jaguar has to do two passes through a 128-bit pipe. Oh the irony, but at least AVX instructions are not very common in off the shelf code. Yet.
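The pass counts above are just a ceiling division of operand width by pipe width. A tiny sketch, with a hypothetical function name and ignoring any extra issue or merge overhead the real hardware may have:

```python
def passes_needed(op_bits, pipe_bits):
    """Passes through a pipe of `pipe_bits` to process an `op_bits` wide operation."""
    return -(-op_bits // pipe_bits)  # ceiling division


bobcat_sse = passes_needed(128, 64)    # 128-bit SSE on Bobcat's 64-bit pipes
jaguar_sse = passes_needed(128, 128)   # 128-bit SSE on Jaguar's 128-bit pipes
jaguar_avx = passes_needed(256, 128)   # 256-bit AVX on Jaguar's 128-bit pipes
```

That gives two passes for SSE on Bobcat, one for SSE on Jaguar, and two again for AVX on Jaguar.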

Like the front end, the FP decoder burned a little area and power to add an extra pipeline stage in the quest for clock headroom. Given the complexity of some of the new FP instructions, that seems to be a good trade, but frequencies have not been officially released yet, so stay tuned. That said, Jaguar can execute up to four SP multiplies and two SP additions per clock through clever use of multiple pipelines. Going to DP instructions halves the number of adds, but precludes some of those tricks, so the core can only do one DP op per clock on top of that. Even in light of these restrictions, Jaguar should be a massive improvement over Bobcat for FP heavy code.

The last bits like the L2 Data Cache and queues are all very similar to Bobcat. All of the functions have been enhanced, but only the FPU data path has been significantly changed. This was widened from 64 bits to 128 in order to support the added width of the FP pipelines. Other than that, nothing on the level of this diagram has really changed; it is all micro-architecture, not macro functionality. In the end, the instruction pipeline for both Int and FP units looks like this.

AMD Jaguar core pipeline

Pipes and more pipes, by the clock

At the very end of the core block diagram, the Bus Unit (BU) is all new. Everything on Jaguar goes through the L2 cache interface unit; it is the heart of the system in many ways. It connects to four 512KB tiles of cache, giving 2MB of inclusive 16-way cache in total. The tags for each block are stored in the L2 interface, so the cache itself is only read when there is definitely correct data in it. The interface runs at the full core clock, which with four cores it has to, but the caches can run at half clock to save power when needed.
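The point of keeping the tags in the interface is that a miss never has to wake a cache tile. The toy model below shows only that idea. The 64B line size, the way lines are spread across tiles, and the class itself are assumptions for illustration; it does not model capacity, associativity, or eviction.

```python
LINE_BYTES = 64  # assumed cache line size, not stated by AMD
TILES = 4        # four 512KB tiles, per the article


class L2InterfaceModel:
    """Toy model: tags live in the interface, data lives in the tiles."""

    def __init__(self):
        self.tags = set()               # stands in for the tag arrays
        self.tile_reads = [0] * TILES   # how often each data tile is accessed

    def fill(self, addr):
        """Record that the line containing `addr` is now in the L2."""
        self.tags.add(addr // LINE_BYTES)

    def read(self, addr):
        """Return True on a hit; only a hit touches a data tile."""
        line = addr // LINE_BYTES
        if line not in self.tags:
            return False  # the tag check alone decides a miss
        self.tile_reads[line % TILES] += 1  # assumed line-interleaved tiles
        return True
```

After one `fill` and a hit plus a miss, the model shows exactly one tile read, which is the power saving the tag placement is after.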

All of the supporting functionality for the L2 has been enlarged, enhanced, and widened as well. There are now L2 prefetchers per core, and there can be up to 24 read and write transactions in flight at once. This is made a lot saner by the addition of more L2 snoop queues, 16 more in Jaguar. It is all new, all better, and should support many more active cores than Bobcat without choking on itself.

Since each of the cores has its own connection to the L2 cache interface, they can all be CC6 clock gated individually. With four cores, this is not really optional, especially at the power levels Jaguar will be operating in. On a more fine-grained level, the Jaguar core is much more granular in its power control than Bobcat. AMD is claiming 98.8% of the core can be power gated, up from Bobcat’s 91.8%. Trivial as it may sound, this can save tons of leakage, especially since Jaguar improves on anything Bobcat did. Take a look at the tables below for more exact numbers; it is a fairly large gap between the new and old.
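Why a seven point bump is not trivial becomes clearer if you look at what is left ungated instead. A quick sketch, assuming (our assumption, not AMD’s claim) that leakage in a gated core scales with the share of the core that cannot be power gated:

```python
def ungated_fraction(gated_percent):
    """Fraction of the core that stays powered when everything gateable is off."""
    return (100.0 - gated_percent) / 100.0


bobcat_ungated = ungated_fraction(91.8)  # roughly 8.2% of the core
jaguar_ungated = ungated_fraction(98.8)  # roughly 1.2% of the core

# How many times smaller the always-on portion is on Jaguar.
reduction = bobcat_ungated / jaguar_ungated
```

On those numbers the always-on slice of the core shrinks by a factor of nearly seven, which is where the leakage savings would come from.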

AMD Jaguar vs Bobcat power gating

Bobcat vs Jaguar IPC and power gating

More profound are the IPC differences while running the different code bases used to measure the power gating. AMD is claiming more than 10% frequency gains with Jaguar, and the IPC goes up more than 15%. In total, those gains are pretty significant, and if the net SoC power comes in lower too, that is a clear win.
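Frequency and IPC gains multiply, they don’t add. A back-of-the-envelope sketch using AMD’s two lower bounds, and assuming the code is core bound so that both gains show up in full:

```python
def combined_speedup(freq_gain, ipc_gain):
    """Overall speedup factor when clock and IPC gains compound."""
    return (1.0 + freq_gain) * (1.0 + ipc_gain)


# "More than 10%" frequency and "more than 15%" IPC, taken at their floor.
speedup = combined_speedup(0.10, 0.15)
percent_faster = (speedup - 1.0) * 100.0
```

That works out to roughly 26.5% per core as a floor, before memory or I/O limits take their cut.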

AMD Jaguar and Bobcat core blocks

The cores, new and old, portable and more portable

One last thing to think about: Bobcat was the first AMD CPU that was said to be portable across processes. Although this capability was never demonstrated at the product level, Jaguar takes it several steps further. Bobcat has seven macro blocks in the core that are process specific, two more in the L2 cache, and three in the clock tree. To move Bobcat from one fab to another, you have to redo all twelve of those blocks, not a trivial task. Jaguar reduces this to three in the core, one in the L2, and one in the clock tree, five in total. While not a drag and drop move, Jaguar definitely cuts the work needed to move fabs significantly.

In the end, the user probably won’t see much difference between Jaguar and Bobcat from the overview level, but core performance will definitely go up. Power is also likely to go down, at least in the cores, but the GPU side of the SoC can eat that margin if AMD allows it. Until the specs of the full SoC are released, we can’t say much more than that the more important blocks in the second generation ‘Cat’ are a serious improvement, and the uncore will bring many more goodies. I can’t wait to see the end result. S|A

Charlie Demerjian
