Why did Bulldozer underwhelm?

Part 1: Death by 1000 cuts, not a single gunshot

Bulldozer is finally here, and the numbers all show that it isn’t going to be taking any performance crowns. That said, in a few areas it closes the performance gap with Intel’s CPUs, but it is lagging in quite a few others.

The story of Bulldozer and why it does what it does, both good and bad, can be summed up as death by 1000 cuts. There isn’t really any high point to the architecture, nor are there any really low points. To make matters worse, there isn’t any obvious smoking gun as to why things ended up so, well, meh. What you can get now, what you should have been able to get, and what you will be able to get from this new architecture is a long and complex story. Lets get started.


The initial lineup of Bulldozers

There are seven initial Bulldozer CPUs, ranging from four to eight cores. Four of those will be available at launch, the 8150, 8120, 6100 and 4100, priced at $245, $205, $165, and $115 respectively. The remaining three will be out in the not so distant future, likely joined by others as bin splits warrant.

In case you haven’t been paying attention, the performance of Bulldozer has been seriously lackluster. It is hard to find any real discussion about why things ended up like this due to fanboi howling, both pro and con, and a lot of very ignorant ranting. Due to a scheduling quirk, SemiAccurate’s Bulldozer arrived at our orbiting HQ about the time we arrived in Taiwan for two weeks, so no testing from us for a bit. That said, we recommend reading reviews by The Tech Report, Lost Circuits, and Anandtech.

Bulldozer’s performance is a huge letdown, no question there. AMD’s repeated promises of a higher IPC versus the outgoing Stars core in the current Phenom CPUs just didn’t happen; things actually regressed there. Loaded power consumption didn’t improve either, though per core things got a bit better. Overall, with eight cores, power use went down at idle, and that is where most desktops spend the majority of their life. On highly threaded apps, performance slots in about where it is slated to, between Intel’s 2500 and 2600 parts, but on single threaded apps, Bulldozer falls on its face. Overall, it is hard to come up with areas where Bulldozer is superior to its predecessor for the desktop user.

This isn’t to say the chip is awful, it isn’t, it just doesn’t move the bar forward running the code you would run. When it comes to server code, HPC code, and several other niches, the architecture may shine, but for now, it clearly doesn’t. That brings us to why Bulldozer is so, as we said, meh, and there is no easy answer. As we said earlier, it is death by 1000 cuts.

To understand how AMD could have put out such a part, it is instructive to look back at the extremely long gestation of the architecture. As we said in the first part of this story, the author was the first person to publicly write the name Bulldozer over five and a half years before launch. Less than a year later, I was shown the base architecture diagram, including the shared front end and FPU, so the base idea has been unchanged for more than half a decade. In CPU years, that is akin to three ice ages or dozens of dog-centuries ago.

On November 13, 2008, AMD dropped a bombshell at their analyst day in New York. They put out a new roadmap, and a lot of chips went poof. The mood in the room was decidedly negative, the first Fusion parts were canned, Bulldozer was pushed back, and the analysts were angry. It was pretty clear that almost no one in the room, nearly 100% financial folk, understood the reasons behind the pushback, or why it was actually a good idea.

At the time, I wrote, “On the CPU side, there are seven new CPUs listed, and the Shrike family of 45nm Bulldozer parts is dead. In its place is the 32nm Orochi line. These are the new Bulldozers, and they come in during 2011. Orochi has four cores and 8M of cache, about what you would expect for the high end part. Below that is Llano, the mainstream desktop and notebook processor. It has four cores, 4MB of cache, and an integrated GPU.” (Sorry, no link due to this.)

Why was it a good idea, and why did the world see it as bad? Bulldozer was originally slated to come out during the later days of the 45nm process, at that time late 2009 or early 2010. The generation that was to come out then was not nearly as advanced as the current version, and what we have now isn’t exactly setting the world on fire. Schedules were tight, as were resources, and things were shaping up to be a disaster if there wasn’t a comprehensive rethink of the roadmap.

In one of the bravest moves in corporate scheduling history, Dirk Meyer saw what was going on and pulled the plug on the 45nm Bulldozers along with the first two generations of Fusion products. Since Wall Street works on a quarterly timetable and CPU designs are on a 3-5 year schedule, any changes that cost money now but have a long term benefit tend to go over like a lead balloon with the financial set. This time was no exception, but it was undoubtedly the right call.

Bulldozer on 45nm would have been a rush job, and everyone working on it said the timetables called for were basically impossible. It would have been huge, hot, slow, and unpolished. Given that the current products are basically all of those descriptors and then some, the 45nm product would have been a disaster, and a late disaster at that. It very likely would not have been out much before a 32nm variant, and would most assuredly have bombed in the market.

To make matters worse, the development of the 45nm chip would have sucked up resources from the follow on 32nm variant and subsequent designs. There would have been a domino effect of delays and problems from a very questionable starting point. The word ‘mess’ is a very kind way of describing the impending train wreck.

So Dirk wisely pulled the plug, and instead of getting credit for doing the right thing, AMD got punished by the financial folk. The whole idea of taking some pain in 2010 with a long-in-the-tooth core to have a stronger 2011 didn’t play well on that cold November day in 2008, but it was unquestionably the right thing to do. It also avoided many more problems than most people could imagine.

As we said earlier, the current Bulldozer is more or less all of those things that the reschedule was meant to avoid. How could this have happened if the parts now available are the second or third generation? There is not one simple answer, there are a lot of little things wrong, and the death of Bulldozer is indeed by 1000 cuts, maybe more. Each one is not a big deal, but together, they take the chip from a really good idea to, well, aspirations of mediocrity.

The first problem is simple: Bulldozer was designed and implemented in a different era. What were the primary OSes used in late 2005 when the architecture was laid out? Browsers? Storage and IO capabilities? Networking? Had you even heard of SaaS or web applications, much less used them on a daily basis? How much of your video was delivered on your PC? The software landscape was different, and the projections for the world of 2011 were too. The changes to computing over that time frame have been immense, and that is the backdrop for a lot of what went wrong, even if it is not to blame directly.

If you look at some of the details, a few things stand out. First, the cache latencies, as measured by Michael Schuette in the Lost Circuits article, are, in a word, horrific. The L1D caches are one cycle slower than Phenom/Stars, and the L2 cache is 25-27 cycles, 10-12 cycles slower than its predecessor’s. It may be twice as large, but it is shared by two cores, and is almost twice as slow. Sharing of resources may have some very nice benefits, but in this case, the downsides are, well, crippling.
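For readers wondering where numbers like these come from, the classic way to measure load-to-use latency is a pointer chase: build a chain of dependent loads that the prefetcher cannot predict, then time how long each hop takes. The sketch below is only a minimal illustration of the general technique, not Lost Circuits’ actual methodology; the 2MB footprint, 64-byte line size, and iteration count are assumptions chosen for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS (1 << 26)

int main(void)
{
    const size_t bytes = 2 * 1024 * 1024;    /* assumed 2MB footprint, roughly L2-sized */
    const size_t line  = 64;                 /* assumed cache line size */
    const size_t slots = bytes / sizeof(void *);
    const size_t step  = line / sizeof(void *);
    const size_t lines = slots / step;

    void **chain  = malloc(slots * sizeof(void *));
    size_t *order = malloc(lines * sizeof(size_t));
    if (!chain || !order)
        return 1;

    /* Shuffle the line-aligned slots so the prefetcher cannot guess the
     * next address, then link them into one big cycle. */
    for (size_t i = 0; i < lines; i++)
        order[i] = i * step;
    for (size_t i = lines - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < lines; i++)
        chain[order[i]] = &chain[order[(i + 1) % lines]];

    /* Each load depends on the previous one, so total time divided by
     * the iteration count is the average load-to-use latency. */
    void **p = &chain[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.2f ns per load (sink %p)\n", ns / ITERS, (void *)p);

    free(order);
    free(chain);
    return 0;
}
```

Built with something like gcc -O2 and run on an otherwise idle box, dividing the nanosecond figure by the CPU’s clock period gives a cycle count comparable to the ones quoted above; shrinking the footprint below the L1D size isolates the L1 latency instead.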

It is hard to fathom how such high latencies were tolerated, much less put into production. While it is unlikely to be possible given the shared front end, a few more transistors burned to split this 2MB L2 cache into two 1MB L2 caches would have halved the latency. This alone would have markedly increased performance, and not trivially; even a single shared 1MB cache with a 15 cycle latency would likely be an improvement over what we ended up with.
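To put those extra cycles in rough perspective, a quick average-memory-access-time calculation helps. The sketch below is purely illustrative: the 4-cycle L1D figure follows from the “one cycle slower than Stars” observation above and the 26-cycle L2 number sits in the measured range, but the hit rates and the ~150-cycle memory latency are assumptions, not measurements.

```c
#include <stdio.h>

/* Simple two-level AMAT model: average cycles per load. */
static double amat(double l1, double l1_hit, double l2, double l2_hit, double mem)
{
    return l1 + (1.0 - l1_hit) * (l2 + (1.0 - l2_hit) * mem);
}

int main(void)
{
    /* Assumed: 95% L1D hit rate, 90% L2 hit rate, ~150-cycle memory latency. */
    printf("L2 @ 26 cycles: %.2f cycles/load\n", amat(4.0, 0.95, 26.0, 0.90, 150.0));
    printf("L2 @ 15 cycles: %.2f cycles/load\n", amat(4.0, 0.95, 15.0, 0.90, 150.0));
    return 0;
}
```

Under those assumptions the averages work out to about 6.05 versus 5.50 cycles per load, roughly a 9 percent cut in average load latency just from a faster L2, which on load-heavy code is anything but trivial.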

During the press briefing in late August, the question of cache latencies came up, and AMD didn’t give an answer, but said they would get back to us. Given the number of people in the room with business cards that read architect, that lack of an answer sent up serious red flags. The official numbers never came for some reason, and now you know why. The cache latencies are such a massive screw-up that it is hard to put into words. Consider the cache latencies a handful of cuts, not just one.

How it ended up this way is a good question, but one we can’t answer. If there were two front ends requesting items from the cache, and it was dual ported, you could explain some of that added latency. Unfortunately, there is only one front end, and the cache only has to service one request at a time. Dealing with the cores, and what goes where, is a different problem; the cache should only see one request at a time. This is a big mystery, and a bigger problem. S|A

Part II can be found here.


Charlie Demerjian
