Why Nvidia cut back the GTX480

Less is more

Last May, we said that the GTX480, then called the GT300, was going to be hot, slow, big, underperforming, hard to manufacture, and, most of all, late. Almost a year later, Nvidia can’t even launch PR quantities at the promised spec of 512 shaders.

To call the launch a debacle is giving Nvidia more credit than it is due. The sheer dishonesty surrounding the launch is astounding. As early as last September, Nvidia was telling partners March at the earliest, and realistically later, while still promising investors and analysts that the chip would launch in Q4/2009. Quibbles about SEC rules aside, six months after the promised ‘triumphant launch’ window, Nvidia has finally thrown in the towel and can’t launch a single card with the promised 512 shaders.

While it is a sad farce, there is a good technical reason for Nvidia launching its Fermi GTX480 GPU with 480 shaders instead of 512: with fewer shaders enabled, it can ship a higher performing chip. How? Through the magic of semiconductor chip binning.

If you recall, Nvidia was aiming for 750MHz/1500MHz with 512 shaders during planning, and publicly stated that it would beat ATI’s 5870 GPU by 60 percent. On paper, that seemed quite possible, but then came the problem of actually making it. We said Nvidia couldn’t. It said it could. Then it called SemiAccurate names. Then it finally launched its GTX480 chip, and the count of 512 shader parts is zero.

Back to the whole 480 versus 512 shader count issue, it all comes down to binning, a close cousin of semiconductor chip yields. In the common parlance, yield is how many chips you get that work, that are good rather than defective. Yield says nothing about the qualities of the chips, just yes or no. There is a lot more to it than that, since you could also include aspects of binning under the heading of yield, but for now we will stick with the good versus bad definition.

Binning on the other hand is more about what you get from those working parts. If you take a given semiconductor wafer, the chips you get from that wafer are not all equal. Some run faster, some run slower, some run hotter, and some run cooler. There is a lot of science behind what happens and why, but once again, let’s just look at the simplified version and say that for almost any property you look at, the chips coming out of a fab will be on a bell curve for that property. Designers can skew how narrow or broad the curve is to varying degrees as well, but there is always a curve.

When designing a chip, you set design targets. For example, let’s say that you want it to run at 2GHz and consume 50W. The design is tuned so that a large percentage of the chips will at least meet those minimum specs, that is, clock to at least 2GHz and pull no more than 50W while doing so. The idea is to have as little of the tail of the curve as possible land on the wrong side of those two numbers.
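To make that concrete, here is a quick sketch, with completely made-up numbers rather than anything from a real fab, of how you might estimate what fraction of a lot clears a 2GHz/50W target when clock and power come out roughly normally distributed.

```python
import random

# Illustrative only: a hypothetical chip population with invented means and
# spreads, not real fab data. Each die gets a max clock and a power draw
# sampled from a roughly normal distribution.
random.seed(0)

TARGET_CLOCK_GHZ = 2.0   # design target from the example above
TARGET_POWER_W   = 50.0

def fab_lot(n_dies=10_000, clock_mean=2.1, clock_sigma=0.08,
            power_mean=46.0, power_sigma=3.0):
    """Simulate one lot of dies with normally distributed clock and power."""
    return [(random.gauss(clock_mean, clock_sigma),
             random.gauss(power_mean, power_sigma)) for _ in range(n_dies)]

lot = fab_lot()
good = [d for d in lot if d[0] >= TARGET_CLOCK_GHZ and d[1] <= TARGET_POWER_W]
print(f"{len(good) / len(lot):.1%} of dies meet the 2GHz/50W design target")
```

Shift the means up, by spending die area or voltage, and the passing fraction grows; the question is always whether the extra area costs more than the scrap it saves.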

Any chips that fall below those design points are scrap, so the trade-off in design is figuring out how much area you can add to the die to move that curve up before the added area costs more than the scrap it saves. Ideally, you would want 100 percent of the parts above the line, but that never happens.

For some chips, for example game console chips, there is only one speed that it needs to run at. An XBox360 CPU runs at 3.2GHz. If the chips coming out of the fab run at 6GHz, it doesn’t matter, they will spend their lives running at 3.2GHz no matter what their potential is. There isn’t a market for faster XBox360 chips. On the other hand, if the chips can’t run faster than 3.1GHz, they are scrap. There is a hard line, so you want the bell curve to be as high as you can get it.

Computer CPUs, on the other hand, sell at a range of speeds, for example 2.0, 2.2, 2.4, 2.6, 2.8 and 3.0GHz. If you aim for everything above 3GHz, that is great, but it is usually a waste of money. When CPUs come out of the fab, they are tested for speed. If they make 3GHz, they are sold as 3GHz parts. If not, they are checked at 2.8GHz, then 2.6GHz, and all the way down to 2.0GHz. Missing a single cutoff in the CPU world does not mean a chip is scrap.
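As a sketch, and nothing more, that sorting logic looks something like this. The grades and sample clocks are invented for illustration, not pulled from any price list.

```python
# A hypothetical speed-binning pass over tested dies.
SPEED_GRADES_GHZ = [3.0, 2.8, 2.6, 2.4, 2.2, 2.0]  # tested highest first

def bin_cpu(max_stable_clock_ghz):
    """Return the highest speed grade a die qualifies for, or None (scrap)."""
    for grade in SPEED_GRADES_GHZ:
        if max_stable_clock_ghz >= grade:
            return grade
    return None  # missed even the 2.0GHz bin

print(bin_cpu(2.93))  # 2.8 -- sold as a 2.8GHz part, not scrap
print(bin_cpu(1.9))   # None -- below the lowest bin, scrap
```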

You can bin on multiple metrics, like how many chips work at 2.8GHz while consuming less than 75W. The graphs on binning are multidimensional and get astoundingly complex very quickly. Since a chip is not uniform across even its own die, especially with larger chips, you can selectively disable parts of a chip if they don’t meet the bins that you require. A good example of this would be AMD’s X3 line of 3-core chips.
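Add a second metric and the ladder of bins becomes a grid. Again, the thresholds and sample dies below are ours, purely to show the shape of the problem.

```python
# Sketch of two-dimensional binning: each die is sorted by both the clock it
# reaches and the power it draws there. All numbers are invented.
from collections import Counter

def bin_2d(max_clock_ghz, power_w):
    """Return a (speed grade, power class) bin for one die, or None (scrap)."""
    speed = next((g for g in (3.0, 2.8, 2.6, 2.4, 2.2, 2.0)
                  if max_clock_ghz >= g), None)
    power = "65W" if power_w <= 65 else "95W" if power_w <= 95 else "125W"
    return (speed, power) if speed is not None else None

dies = [(2.85, 71.0), (2.65, 58.0), (3.05, 101.0), (1.95, 55.0)]
print(Counter(bin_2d(clk, pwr) for clk, pwr in dies))
```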

You can also add redundant components, like an extra core, or an extra shader per cluster, but that adds area. More area costs more money per chip, adds power use, and can actually lower yield in some cases. Once again, it is a tradeoff.

GPUs have been doing this forever, and end up with very good overall yields on large and complex chips because of it. If you have a GPU with 10 shader groups and one of them has a defect, the chip does not yield as a full 10-group part, but it is very likely to yield as the next smaller part in the lineup, a hypothetical 8-group GPU. If you couple that with binning, and set things loosely enough, you will end up with a good number of formerly ‘scrap’ chips leading productive lives. An extra shader per group ups the yield by a lot as well.
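Here is a toy model of why that salvage works so well. It assumes, purely for illustration, that each shader group independently has some chance of catching a killer defect; the defect rate and the redundancy scheme are our own assumptions, not anything measured on a real part.

```python
# Toy cluster-salvage model: defects land independently on each shader group.
from math import comb

P_GROUP_BAD = 0.08   # assumed chance any one shader group has a killer defect
GROUPS      = 10

def p_at_least_good(good_needed, groups=GROUPS, p_bad=P_GROUP_BAD):
    """Probability that at least `good_needed` of `groups` groups are clean."""
    p_good = 1.0 - p_bad
    return sum(comb(groups, k) * p_good**k * p_bad**(groups - k)
               for k in range(good_needed, groups + 1))

print(f"full 10-group die:     {p_at_least_good(10):.1%}")   # ~43%
print(f"salvageable 8-group:   {p_at_least_good(8):.1%}")    # ~96%
print(f"10-of-11 with a spare: {p_at_least_good(10, groups=11):.1%}")  # ~78%
```

The same chips that are scrap as full parts are nearly all sellable as cut-down parts, and a little built-in redundancy pushes the full-spec number way up, at the cost of area.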

You can see this in almost every graphics chip on the market. The GTX280 has a little brother, the GTX260, and the HD5870 has its HD5850. Later in the life of a GPU family, you often see parts popping up, usually in odd markets or OEM only, with specs cut down even further than the smaller parts in the family. This generally happens when there is a pile of parts that don’t make the lowest bin, and that pile is big enough to sell.

Ideally, you set targets to make sure there is almost no need for the proverbial ‘pile in the back room’, but there are always outliers. If you can’t get the majority of the bell curve above the cutoff points for your lowest bins, you have a problem, a big and expensive problem. At that point, you either need to respin the chip to make it faster, cooler, or whatever, or lower your expectations and set the bins down a notch or five.

Nvidia is in the unenviable position of having to set the bins down, way, way down. The company is trapped, however. The chip is 60 percent larger than its rival, ATI’s 5870, at over 530mm^2, barely outperforms it, and is over six months late.

This means Nvidia can’t take the time to respin the chip. ATI will have another generation out before Nvidia could get the required full silicon (B1) respin back, so there is no point in trying. The die size can’t be reduced much, if at all, without losing performance, so the part will cost at least 2.5 times what ATI’s parts cost for equivalent performance. Nvidia has to launch with what it has.

What it has isn’t pretty. A1 silicon was painfully low yielding, sub-two-percent for the hot lots, and not much better for later silicon. A1 was running at around 500MHz, far, far short of the planned 750MHz. A2 came back a bit faster, but not much. When Nvidia got A3 back just before Christmas 2009, insiders at Nvidia described it to SemiAccurate as “a mess”. Shader clocks for the high tip of the curve were 1400MHz, and the highest bin they felt was real ended up being about 1200MHz.

This is all binning though. Yields were still far below 20 percent for both the full 512 shader version and the 448 shader part combined. For comparison, Nvidia’s G200 chip, which became the GTX280 family of GPU parts, had a yield of 62.5 percent, give or take a little, and that yield was considered so low that it was almost not worth launching. Given a sub-20 percent yield, to call the Fermi or GF100 or GTX4x0 line of GPU chips unmanufacturable is being overly kind.

So, what do you do if you are Nvidia, are half a year late and slipping, and the best chip you can make can barely get out of its own way while costing more than five times as much as your rival’s chip to manufacture? You change bins to play with numbers.

Nvidia made the rather idiotic mistake of announcing the GTX470 and GTX480 names in January, and then had to fill them with real silicon. The company bought 9,000 risk wafers last year, and couldn’t even make the promised 5,000 to 8,000 512 shader GTX480s from them, a required yield of less than 1 percent. See what we mean by unmanufacturable? Even a PR stunt level launch needs at least a few thousand cards, and there simply were not that many 512 shader chips.
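The arithmetic behind that ‘less than 1 percent’ is easy enough to sanity check. The one number we have to assume is gross die per wafer; a die over 530mm^2 on a 300mm wafer works out to roughly 100 candidates, give or take, and that estimate is ours, not Nvidia’s or TSMC’s.

```python
# Back-of-the-envelope check of the "less than 1 percent" required yield.
RISK_WAFERS          = 9_000
GROSS_DIES_PER_WAFER = 100        # assumed, order-of-magnitude estimate
PROMISED_CARDS       = 8_000      # top of the promised 5,000 to 8,000 range

candidates = RISK_WAFERS * GROSS_DIES_PER_WAFER
required_yield = PROMISED_CARDS / candidates
print(f"required 512-shader yield: {required_yield:.2%}")   # ~0.89%
```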

What is plan B? According to Dear Leader, there is no plan B, but that is okay. At this point, the GTX480 is on plan R or so. You have to suck down your ego and lower the bins. If you can’t make 512 shaders, the next step down is 480 shaders. This moved the cutoff line down far enough to make the meager few thousand 480 shader parts necessary to launch.

The GTX480 is slow, barely faster than an ATI HD5870. If Nvidia loses 32 of its 512 shaders, it also loses 1/16th of the performance, or 6.25 percent. With clocks in the low 600MHz range, that would leave it a bit slower than the 5870. Still not good for a triumphant launch, but at this point a launch is better than the alternative, no launch. Out the door is the first step.

Remember when we said that one problem was ‘weak’ clusters that would not work at the required voltage? Well, if you want to push the yield curve up, you can loosen the power limit to compensate, and Nvidia did just that by upping the TDP to 250W. This is the classic overclocking trick of bumping the voltage to get transistors to switch faster.

While we don’t for a second believe that the official 250W TDP number is real, initial tests show about a 130W difference in system power between a 188W TDP HD5870 and a ‘250W’ GTX480. Nvidia lost a 32 shader cluster and still couldn’t make 10,000 units. It had to bump the voltage and disable a cluster to get there. Unmanufacturable.

If you are still with us, we did mention that the 480 shader part was faster. How? With the slowest cluster fused off, the speed curve moves up by a fair bit, because the worst part of the tail is now gone. Bumping the voltage moves the speed curve up further, and the end result was that Nvidia got 700MHz out of a 480 shader part. That 700/1400 number sounds quite familiar, doesn’t it?

On CPUs with a handful of cores, multiplying core count by MHz is not a realistic thing to do. Most workloads that CPUs handle are not parallel in nature, so the result is bogus. GPUs, on the other hand, have embarrassingly parallel workloads, so number-of-cores X MHz is a fair calculation. If you look at our initial specs, the early GTX4x0 cards SemiAccurate had access to were 512 shader parts running at 600MHz and 625MHz, and a 448 shader part running at 625MHz or so.

With the last minute spec change, voltage bump and fused-off shaders, Nvidia was able to move the bins down enough to get a 700MHz part with 480 shaders at a ’25W’ penalty. The shipping specs are 448 cores at 1215MHz for the GTX470, and 480 cores at 1401MHz for the GTX480. If you look at cores X MHz, you will see how it ends up a little faster.

Shader Speeds: shaders X clocks, and speed versus a 600MHz 512 shader part

What did Nvidia get by losing a cluster, adding tens of watts, and upping the clock? It looks like nine percent over the spec tested by SemiAccurate last month, and five percent over the proposed 512 shader 625/1250MHz launch part. According to partners, Nvidia was playing with the numbers until the very last minute, and that playing seems to have paid off in a net faster card.
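You can redo the shaders X clocks arithmetic yourself. The only assumptions here are that the earlier 600MHz and 625MHz cards ran their shaders at twice the core clock, which is how we read those specs, and that the baseline is the 600MHz 512 shader card from the table above.

```python
# Shaders x shader clock for each configuration mentioned in the article.
configs = {
    "512sp, 1200MHz shader (600MHz early card)": 512 * 1200,
    "512sp, 1250MHz shader (625MHz proposal)":   512 * 1250,
    "448sp, 1215MHz shader (GTX470)":            448 * 1215,
    "480sp, 1401MHz shader (GTX480)":            480 * 1401,
}
baseline = configs["512sp, 1200MHz shader (600MHz early card)"]
for name, throughput in configs.items():
    print(f"{name}: {throughput:>7} shader-MHz, "
          f"{throughput / baseline - 1:+.1%} vs the 600MHz card")

gtx480  = configs["480sp, 1401MHz shader (GTX480)"]
planned = configs["512sp, 1250MHz shader (625MHz proposal)"]
print(f"GTX480 vs the planned 625/1250MHz part: {gtx480 / planned - 1:+.1%}")
```

Run the numbers and the shipping GTX480 lands roughly nine percent above the 600MHz 512 shader card and roughly five percent above the planned 625/1250MHz part, which is where those figures come from.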

PR gaffes of not being able to make a single working part aside, this five to nine percent bump allowed Nvidia to dodge the bullet of ATI’s Catalyst 10.3a drivers, and add a bit to the advantage it had over a single HD5870. Most importantly, it allowed Nvidia to get yields to the point where it could make enough for a PR stunt launch.

The 480 shader, 1400MHz cards are barely manufacturable. If you don’t already have one ordered by now, it is likely too late to get one since quantities are going to be absurdly limited. As things stand, the 9,000 risk wafers seem to have produced less than 10,000 GTX480s and about twice that many GTX470s if the rumored release numbers are to be believed. That would put yields of the new lower spec GTX480 at well under the two percent Nvidia saw last fall.
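Using the same rough estimate of around 100 gross die per wafer as before, which is our guess and not a fab number, the rumored shipment figures translate into yields like this:

```python
# Implied yields from the rumored shipment numbers, given an assumed die count.
GROSS_DIES_PER_WAFER = 100            # assumed, same estimate as above
candidates = 9_000 * GROSS_DIES_PER_WAFER
gtx480s    = 10_000                   # "less than 10,000" rumored
gtx470s    = 20_000                   # "about twice that many"

print(f"GTX480 (480sp) yield:     {gtx480s / candidates:.1%}")              # ~1.1%
print(f"GTX480 + GTX470 combined: {(gtx480s + gtx470s) / candidates:.1%}")  # ~3.3%
```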

It is hard to say anything good about these kinds of yields, much less call them a win. A one-point-something percent yield, however, is a number greater than zero. S|A

Charlie Demerjian
