A close look at AMD’s Bulldozer

The devil is in the details

AMD IS FINALLY starting to talk about Bulldozer, the upcoming new desktop and server core. It is the largest architectural jump in standard x86 cores in a long, long time.

The first thing to note is that AMD is most definitely not talking about any chip specifics, speeds, performance, release dates or anything else. They do promise that the upcoming Bulldozer products will fit into the same G34 and C32 server sockets as the current chips, and will live in the same thermal envelopes. That is about it for product announcements. Anything more will have to wait for 2H/2011, but we hear things are going well, so it is likely to be summer, not Christmas.

So, what is Bulldozer? It is a dual integer core module with a single shared FP unit between the cores. If you are familiar with Sun’s Niagara, that is the right idea, just not taken to the same extremes as Sun did. Since a picture is worth 10^3 words, it looks like this.

Bulldozer block diagram

Note the two integer and one FP cores

The first question this architecture brings up is, what is AMD’s definition of a core? That one is easy, they define it as an integer core with full pipelines. Sun and many others do the same, and there are a lot of cores out there without any FP hardware, so this makes a lot of sense. A Bulldozer module is therefore two cores plus shared FP resources, x86 decoders, and memory/cache interfaces.

Why would you want to do this? That gets a bit more tricky, it is basically a balancing act between performance, power, and die area. With Hyperthreading, Intel allows for the execution of two threads on a single core, basically putting unused resources to work where they can. This costs minimal die area while theoretically doubling performance.

If you look at the performance of the P4, it wasn’t anything close to double with HT on, but Nehalem and Westmere get a substantial boost from HT. None of these get 2x the performance of a single core on anything but the most pathological cases. That said, the cost is low, and a 25% speed gain for ‘free’ isn’t a bad deal.

AMD will agree that it is OK, but why not aim for the full 100% speed boost? That is what Bulldozer does, it adds a full second integer core, and is very likely to get the full 100% boost. The tradeoff in this case is die area.

Remember the part about shared units? That is the tradeoff in Bulldozer, but AMD went to great lengths to minimize the impact. In the CPU world, die area is the overwhelming cost, everything else pales in comparison.

What AMD did was to look at the cores and go down a list of which parts could be shared and which parts would be negatively impacted by sharing. The parts that could be shared were, and the parts that could not, basically the integer core, were not. That is the basis of Bulldozer.

How much area does the added integer core take? AMD says that for a 4 module, 8 core Bulldozer, the addition of a second integer core per module adds about 5% total die area, or about 1.25% per integer core. This number is a bit misleading though, the die has megabytes of cache, and many other things that are not core, so the cores are bloated by quite a bit more than 1.25% each. That said, the overall impact is fairly low for the level of added performance.
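To see why the 1.25% figure flatters the cores, here is a back-of-the-envelope sketch in Python. The 5% and 4 module numbers are AMD's. The `core_fraction` value of 0.5 is purely a made-up assumption for illustration, since AMD has not said how much of the die is cores versus cache and everything else.

```python
def per_core_overhead(total_overhead_pct: float, added_cores: int) -> float:
    """Spread the total die-area overhead evenly across the added integer cores."""
    return total_overhead_pct / added_cores


def overhead_vs_core_area(total_overhead_pct: float, core_fraction: float) -> float:
    """Express the same overhead as a percentage of core area, not whole-die area.

    core_fraction is a hypothetical value: the share of the die taken up by
    the modules themselves, with cache, memory controllers and I/O excluded.
    """
    return total_overhead_pct / core_fraction


# AMD's figures: about 5% total for a 4 module chip, one added core per module.
print(per_core_overhead(5.0, 4))        # 1.25
# Assumption for illustration only: modules occupy half the die.
print(overhead_vs_core_area(5.0, 0.5))  # 10.0
```

If the modules really were half the die, the added integer cores would be about 10% of the core area rather than 5% of the chip. The smaller the real core fraction, the bigger that number gets.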

Since there were no performance numbers released, there is no way of knowing how well the added cores worked out, or basically whether the price was worth it. If you think about it, unless AMD totally botched the job, the worst case scenario is that each core will be able to use half the available resources. The best case is that each integer core has access to more resources than it would in a shared-nothing case.

What do we mean by that? If you think about a dual core chip, say an Athlon X2 or a CoreNumberNumeral with a 128-bit memory interface, each core effectively has 64 bits of memory interface available to it when both are working flat out. If one core is idle, the other core effectively has a 128-bit memory interface, so performance goes up.

While this is vastly oversimplified, if the OS you are running is mildly aware of things like this, the shared units will show more of an advantage than a disadvantage. Unless the implementation is totally botched, the CPUs will never have less than half the shared resources available to them, but will often have more than 50% allocated to them. Theoretically.

A really good example of this is the FP unit, the most obvious shared resource. It is a 256-bit wide unit that can do AVX along with four-operand instructions. Each core sees a 128-bit FP unit, and if it has to execute a 256-bit AVX instruction, that has a two cycle latency. Overall, worst case, the core has an FP throughput of 128b/clock.

Since there is a shared FP scheduler, it is aware of what is being used and what is not. If it sees that one ‘core’ is not executing an FP instruction that clock, it can allow the other one to use all 256 bits. Instead of a two cycle latency for a 256-bit instruction, Bulldozer gets it done in one. You can also use the same logic for two 128-bit instructions.
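The sharing rule described above can be sketched as a toy model in Python. This only mirrors the description in the text: the function names are invented for illustration, and how AMD's real scheduler arbitrates between the cores has not been disclosed.

```python
from typing import Tuple

FP_WIDTH_BITS = 256  # total width of the shared FP unit, per the text


def bits_granted(requests: Tuple[bool, bool]) -> Tuple[int, int]:
    """FP width each core gets in one clock, given which cores have FP work.

    Toy arbitration: split evenly if both want the unit, otherwise give the
    whole unit to whichever core is asking.
    """
    core0, core1 = requests
    if core0 and core1:
        half = FP_WIDTH_BITS // 2
        return (half, half)
    if core0:
        return (FP_WIDTH_BITS, 0)
    if core1:
        return (0, FP_WIDTH_BITS)
    return (0, 0)


def avx256_cycles(other_core_busy: bool) -> int:
    """Clocks needed to push one 256-bit op through the width available."""
    available = FP_WIDTH_BITS // 2 if other_core_busy else FP_WIDTH_BITS
    return -(-256 // available)  # ceiling division


print(bits_granted((True, True)))   # (128, 128)
print(bits_granted((True, False)))  # (256, 0)
print(avx256_cycles(True))          # 2
print(avx256_cycles(False))         # 1
```

The worst case is the 128b/clock floor, and anything the other core leaves on the table is a bonus.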

Intel’s Sandy Bridge has two 256-bit wide AVX units per core, so it has a minimum throughput of 512b/clock, and whatever else it can get from HT for idle periods. It will be interesting to compare Sandy to Bulldozer, on area, power and, of course, performance. Then again, until next year, it is all a theoretical debate.

Getting back to Bulldozer itself, a module starts out with a 64K Icache shared between the two cores, and each core has a 16K non-shared Dcache. One of the things that AMD is talking about quite a bit is filling that and the L2 and L3 caches, sizes unstated. The official word is that they are prefetching aggressively.

Bulldozer details

Details, can you spot the devil?

The first thing they did was have two prefetchers, L1 BTB and L2 BTB, and a prediction queue. The two BTBs work in tandem. AMD isn’t saying exactly how, but the ideas behind them are fairly well understood. Basically you can have a fast predictor that has decent accuracy, or a slower one that has better accuracy. It is a classic tradeoff.

If you look at the diagram, you see that the BTBs feed into a prediction queue. What AMD did is put both a fast and a slow predictor in. The fast BTB predicts what should be loaded, then sticks it in a queue. Then the slow one also tries to do the same thing, and if it gets a different result, it updates the queue.
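The fast-then-slow idea can be sketched in a few lines of Python. This is a generic illustration of an overriding predictor, not AMD's design: the class name, the dictionary lookups standing in for the two BTBs, and the queue entry layout are all assumptions made up for the example.

```python
from collections import deque


class PredictionQueue:
    """Toy overriding predictor: a fast guess is queued, a slow one may fix it."""

    def __init__(self, fast_table, slow_table):
        self.fast_table = fast_table  # small and quick, decent accuracy
        self.slow_table = slow_table  # bigger and slower, better accuracy
        self.queue = deque()

    def predict(self, pc):
        """Fast path: queue the quick prediction right away."""
        entry = {"pc": pc, "target": self.fast_table.get(pc)}
        self.queue.append(entry)
        return entry

    def refine(self, entry):
        """Slow path: overwrite the queued entry if the slow table disagrees."""
        better = self.slow_table.get(entry["pc"])
        if better is not None and better != entry["target"]:
            entry["target"] = better
            return True
        return False


# Hypothetical addresses, for illustration only.
pq = PredictionQueue({0x100: 0x200}, {0x100: 0x300})
guess = pq.predict(0x100)          # fast guess says 0x200
corrected = pq.refine(guess)       # slow predictor overrides it with 0x300
print(corrected, hex(pq.queue[0]["target"]))
```

In real hardware the slow lookup finishes some number of clocks after the fast one, so the value of the fix depends on whether the entry is still sitting in the queue when the correction arrives.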

Once again, the devil is in the details. What is fast, what is slow, and how long is the queue? You can send things to the slow BTB every clock, or have a confidence algorithm that sends requests off only when needed. How this is implemented will play a large part in determining the success or failure of Bulldozer.

That prediction queue feeds into the Icache and fetch queue, which are then fed into four x86 decoders. This means that Bulldozer can issue 4 x86 instructions per clock, but looking at the diagram, you can see there are 12 execution units per module. This is potentially a huge bottleneck, but that is why we have caches, buffers, predictors, and other things to alleviate this bottleneck.

The integer core is pretty standard, the one big advance is fully out of order loads and stores. Bulldozer can issue two 128-bit loads and one 128-bit store per cycle. This should be a major boost over K8 and K10h CPUs, and lessen one of the big AMD bottlenecks of late.

Having a single FP core makes things a bit more interesting, and AMD has it listed as “Co-processor organization”. The scheduler is unified, which means one FP scheduler juggles and then executes commands from both cores. Once completed, the FP unit sends each result back to the Int core that it is associated with for retirement.

The FP unit can’t do any loads or stores on its own, that all goes through the Int core. If you look at the diagram, that is the line from the L1DCache to the FP Load Buffer.

Last up is power management, and Bulldozer has all of the expected features, basically a revised version of Llano’s power management technologies. The one big jump is that AMD now has a ‘turbo’ mode for server chips. Sources tell SemiAccurate that the turbo in Bulldozer is not the same as the one on current desktop chips, but is a second generation turbo that has much better results. It is also tuned better for server workloads.

In the end, Bulldozer appears to have done everything right for the right reasons. The up side is effectively doubling the cores for a low cost, with no loss in performance over two single cores. The down side is, as always, in the details. You can cache thrash, run out of decoder bandwidth, and have all sorts of unforeseen problems. The list is long enough to extend from here to Barcelona.

With any luck, the simulations were done right, and all will go well. Until AMD releases numbers, benchmarks, and of course chips, we won’t know for sure. In any event, it looks like the server market is about to be competitive again for the first time in several years. S|A

Charlie Demerjian

Roving engine of chaos and snide remarks at SemiAccurate
Charlie Demerjian is the founder of Stone Arch Networking Services and SemiAccurate.com. SemiAccurate.com is a technology news site addressing hardware design, software selection, customization, securing and maintenance, with over one million views per month. He is a technologist and analyst specializing in semiconductors, system and network architecture. As head writer of SemiAccurate.com, he regularly advises writers, analysts, and industry executives on technical matters and long lead industry trends. Charlie is also available through Guidepoint and Mosaic.