Nvidia’s Fermi GTX480 is broken and unfixable

Hot, slow, late and unmanufacturable

WITH ANOTHER LAUNCH of the Nvidia GT300 Fermi GF100 GTX480 upon us, it is time for an update on the status of that wayward part. Production parts have been coming back from TSMC for several weeks now, and the outlook for them is grim.

We first got word that production A3 GF100s were back in Santa Clara near the end of January. When enough of them rolled off the line to characterize the silicon, we hear there were no parties in Santa Clara. For those not interested in the ‘why’ side of things, the short answer is that the top bin as it stands now is about 600MHz for the half hot clock, and 1200MHz for the hot clock, and the initial top part will have 448 shaders. On top of that, the fab wafer yields are still in single digit percentages.

That said, the situation is far more nuanced than those three numbers suggest, and the atrocious yields are even after the chip has been downclocked and defective units fused off. To make matters even worse, the problems that caused these low yields are likely unfixable without a complete re-layout. Let’s look at these problems one at a time.

Number one on Nvidia’s hit list is yields. If you recall, we said that the yield on the first hot lot of Fermis that came back from TSMC was 7 good chips out of a total of 416 candidates, or a yield of less than 2 percent.

The problem that Nvidia faces can be traced to how it is trying to fix these issues. The three steppings of GF100 are all what are known as metal layer spins, something that is cheaper and faster than a full base layer respin, taking about two months to see results. A full base layer respin takes well over a quarter, likely more than six months, and costs more than $1 million just for the masks. Metal layer spins are denoted by upping the number, A1 to A2 for example, while base layer respins up the letter, A3 to B1. Nvidia counts first silicon from A1, so the current A3 is the third metal spin.

Metal layer spins tend to solve logic problems like 1 + 1 = 3, not power or yield issues. Most yield problems are related to the process that the chips are made on, and modified by factors like how fast you try to run the transistors, how much you bend the design rules, and other related issues. While this is a very simplified version, metal layer spins don’t tend to do much for power consumption or yield problems.

When Nvidia got its first Fermi GF100 silicon back in early September (you can read about the dates and steppings on the chips in this story), clock speeds were hovering around 500MHz and yields were in single digit percentages. These numbers were shockingly low, but problems are quite common on the first batch of silicon, which is rushed through the fab, hence the name hot lot.

The second spin, A2, did up the clock speeds a bit, but yields remained stubbornly low. It was a month or so overdue, so you can be pretty sure the problems that were being worked on were fairly difficult. This is not a process that you tolerate any unnecessary delays on.

SemiAccurate has heard that the A3 silicon hot lots that came back just before Christmas didn’t improve clock speeds at all, or not enough to be considered a fix. That isn’t a surprise because Nvidia was using the wrong tool, metal layer changes, to fix a clock speed and power problem. Yields on A3 hot lots also were in the toilet. As we have been saying since we first heard about the design last March, it is unmanufacturable, and the changes that might possibly fix things will take a full re-layout.

Why are things this bad? The simple answer is that Nvidia didn’t do its homework. Just like the failures that led to the bad bumps, Nvidia simply didn’t test; instead it tried to brute force things that require nuance and methodical forethought. ATI did test. ATI put out a full production run of HD4770 (RV740 silicon) GPUs and used that as a test of TSMC’s 40nm process. TSMC looks to have failed, badly, but crucially, ATI learned why and how the parts failed to yield. That learning was fed back into the Evergreen 5000 series GPUs, and those come off the line at perfectly acceptable yield rates now.

Nvidia in the meantime had four 40nm GPUs in the pipeline for Q1 of 2009, the G212, G214, G216 and G218, shrinks and updates of the 55nm G200b, G92b, G94 and G96 respectively. G212 was so bad that it was knifed before it ever saw silicon, and the second largest one, the G214 had to go on a massive diet, dropping from 128 shaders to 96. This was renamed G215, and finally came out in November 2009. It is sold as the GT240, G216 is sold as the GT220, and the G218 is on the market as the G210. All have had innumerable renamings and are currently being parlayed as 300-series chips for no apparent reason.

The problem here is that the chips are approximately 139mm^2, 100mm^2 and 57mm^2 for the G215, G216 and G218 respectively. ATI’s RV740 is 137mm^2. These are all very small, while the higher end 55nm G200b was over 480mm^2, and the earlier 65nm G200 was over 575mm^2.

ATI was making salable quantities of a 137mm^2 chip in April 2009. Nvidia had severe problems with the 40nm process and only got the G216 and G218 out in August 2009 as OEM-only GPUs. It took months for the yield to reach a point where Nvidia could release retail cards, and the G215 lagged the first two by several months.

A really rough measure of yield is that for similar products, the yield goes down by the square of the die size. A 200mm^2 chip can be expected to have 1/4 the yield of a similar 100mm^2 chip, and a 50mm^2 chip will have about 4 times the yield of the 100mm^2 part. Chip makers put lots of redundant structures into every design in order to repair some kinds of fabrication errors, but there are limits.

Each redundancy adds to the area of the design, so the base cost for the chip is higher. Semiconductor manufacturing is a series of complex tradeoffs, and the cost for redundant area versus yield is one of the simpler ones. If you plan right, you can make very high yielding chips with only a little extra die area.

If things go well, the cost of the redundant area is less than you would lose by not having it there at all. If things go badly, you get large chips that you can’t make at anything close to a viable cost. The AMD K6-III CPU was rumored to be an example of this kind of failure.

Last spring and summer, ATI was not shy about telling people that the lessons learned from the RV740 were fed back into the Evergreen 5000 series chips, and it was a very productive learning experience. One of the deep, dark secrets was that there were problems with the vias (the interconnects between the metal layers on the chip). The other was that the TSMC 40nm transistors were quite variable in construction, specifically in the channel length.

Since Anand talked about both problems in his excellent Evergreen history article, any promises to keep this secret are now a moot point. What ATI did with Evergreen was to put two vias in instead of one. It also changed transistor designs and layout to mitigate the variances. Both of these cost a lot of area, and likely burn a more than negligible amount of energy, but they are necessary.

Nvidia, on the other hand, did not do its homework at all. SemiAccurate was told several times that, in its usual ‘bull in a china shop’ way, the officially blessed Nvidia solution to the problem was engineering by screaming at people. Needless to say, while cathartic, it does not change chip design or the laws of physics. It doesn’t make you friends either.

By the time Nvidia found out about the problems, it was far too late to implement the fixes in Fermi GF100. Unless TSMC pulled off a miracle, the design was basically doomed.

Why? GF100 is about 550mm^2 in size, slightly larger than we reported after tapeout. Nvidia ran into severe yield problems with a 100mm^2 chip, a 3 month delay with a 139mm^2 chip, and had to scrap any larger designs due to a complete inability to manufacture them. Without doing the homework ATI did, it is now trying to make a 550mm^2 part.

Basic math says that the GF100 is a hair under 4 times as large as the G215, and they are somewhat similar chips, so you can expect GF100 yields to be around 1/16th that of the smaller part. G215 is not yielding well, but even if it was at a 99 percent yield, you could expect Fermi GF100 to have single digit percentage yields. Last time we heard hard numbers, the G215 was not yielding that high.
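For those who want to check the arithmetic, the square rule can be sketched in a few lines of Python. This is only the rough rule of thumb used above, not a real fab yield model, and the function name `scaled_yield` is made up for illustration. The die sizes and the 99 percent figure are the ones quoted in the text.

```python
def scaled_yield(reference_yield, reference_area_mm2, target_area_mm2):
    """Rule of thumb from the text: yield falls with the square of the area ratio."""
    area_ratio = target_area_mm2 / reference_area_mm2
    return reference_yield / area_ratio ** 2


G215_AREA_MM2 = 139.0   # die size quoted in the text
GF100_AREA_MM2 = 550.0  # die size quoted in the text

area_ratio = GF100_AREA_MM2 / G215_AREA_MM2

# Hypothetical best case from the text: G215 yielding 99 percent.
gf100_best_case = scaled_yield(0.99, G215_AREA_MM2, GF100_AREA_MM2)

print(f"GF100 is {area_ratio:.2f}x the area of G215")
print(f"Expected GF100 yield at a 99% G215 yield: {gf100_best_case:.1%}")
```

Under those assumptions the result lands in the single digits, and any realistic G215 yield only pushes it lower.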

Fixing these problems requires Nvidia to do what ATI did for Evergreen, that is, double up on the vias and also change the circuits in a non-trivial way. This process requires a lot of engineering time, a base layer respin, and probably at least one metal spin on top of that. If everything goes perfectly, it will still be more than six months before it can bring a fix to market.

While this is bad for Nvidia, and likely terminal for Fermi GF100 as an economically viable chip, it does actually get worse. The chip is big and hot. Insiders have told SemiAccurate that the chips shown at CES consumed 280W. Nvidia knew that the GPU would consume a lot of power long before the chip ever taped out, but it probably thought it would be around the 225W mark claimed for the compute boards.

To combat this, Nvidia engineers tell SemiAccurate that the decision was made to run the chip at a very low voltage, 1.05V versus 1.15V for ATI’s Cypress. Since ATI draws less power for Cypress, 188W TDP vs 225W TDP for the Fermi GF100, every time Nvidia needs to raise the voltage of its card, each 0.01V added to the core costs it roughly 50 percent more amperage than the same bump would cost ATI. While this is an oversimplification, the take-home message is that Nvidia made choices that result in more power added than ATI if the voltages need to be upped.

Remember the part about TSMC having variable transistors? This means some are ‘leaky’, meaning they consume more power than their less leaky brethren, and others run slower. The traditional fix for the slow transistors is to up the voltage, and that often works to make a ‘weak’ transistor perform correctly. It also makes leaky transistors leak more, and the more they leak, the hotter they get.

Hotter transistors also leak more than cooler ones, so you get into a cycle where leakage leads to higher temperatures, which make things leak more, and so on. One way to combat this is to put a much more aggressive heatsink and fan on the card, but that costs a lot of money, and tends to make a lot of noise. If you are having a flashback to the Nvidia NV30 GeForce 5800 ‘dustbuster’, that is exactly it.

TSMC’s variability means that there are lots of weak transistors scattered throughout the die, and lots of leaky transistors. If Nvidia ups the voltage, they start sucking up power at a huge rate. If it doesn’t, the weak transistors basically do not work, and are effectively ‘broken’ or ‘defective’. The two goals are mutually antagonistic, and the low voltage, high amperage choices made by Nvidia only multiply the problems.

If that wasn’t bad enough, sources tell SemiAccurate that the TSMC 40nm process is very heat sensitive. Leakage goes way up with heat, much more so than with previous processes. If you go past a certain critical temperature, leakage goes up shockingly fast. The jail that Fermi GF100 is in now has three sides closing in on the chip.

The alternative is to fuse off shaders with too many ‘weak’ transistors, and leave the voltage alone. Unfortunately, another bad architectural choice makes this very unpalatable. Fermi GF100 is arranged into 16 clusters of 32 shaders for a total of 512 shaders on the chip. By all accounts, if you are going to fuse off one, you are forced to fuse off a full set of 32. Since the weak transistors are scattered evenly throughout the GPU, two bad shaders will usually sit in different clusters, so fusing them off means that you lose not two but 64 shaders. This level of granularity is bad, and you have to question why that choice was made in light of the known huge die size.

On the current A3 silicon, sources tell us that Nvidia is having to use both ‘fixes’, fusing off at least two clusters and upping the voltage. This results in a GPU that consumes more power while having at least 12.5 percent less performance than intended. If you were going to use one in your PC, this may be manageable, but hundreds or thousands of them in a big supercomputer is a non-starter.
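The cost of that cluster-level granularity is easy to model. The sketch below is illustrative only: it assumes shaders are numbered 0 to 511 and assigned to clusters in consecutive blocks of 32, which is a hypothetical numbering scheme and not a description of the real chip's layout. The function name `usable_shaders` is likewise made up.

```python
SHADERS_PER_CLUSTER = 32  # from the text
CLUSTERS = 16             # from the text
TOTAL_SHADERS = SHADERS_PER_CLUSTER * CLUSTERS


def usable_shaders(defective_shader_ids):
    """Shaders left after fusing off every cluster that holds a bad shader."""
    # Hypothetical numbering: shader N lives in cluster N // 32.
    bad_clusters = {sid // SHADERS_PER_CLUSTER for sid in defective_shader_ids}
    return (CLUSTERS - len(bad_clusters)) * SHADERS_PER_CLUSTER


# Two bad shaders in different clusters cost two whole clusters.
scattered = usable_shaders([5, 300])
# Two bad shaders that happen to share a cluster cost only one.
clustered = usable_shaders([5, 6])

lost_fraction = (TOTAL_SHADERS - scattered) / TOTAL_SHADERS
print(f"scattered defects: {scattered} shaders left ({lost_fraction:.1%} lost)")
print(f"clustered defects: {clustered} shaders left")
```

With defects scattered across the die, the first case is the likely one, which is where the 448 shader part and the 12.5 percent figure come from.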

For reasons tied to the power consumption and weak transistors, Fermi GF100 simply will not run at high clocks. Last March, sources told SemiAccurate that the intended clock frequencies were 750MHz for the ‘low’ clock and 1500MHz for the high clock. Since you can only pull off so many miracles with a voltage bump, we hear the A3 production silicon has a top bin of 600/1200MHz, and that is after an average of two shader clusters are turned off.

Nvidia was claiming 60 percent more performance than Cypress last fall. That quickly dropped to 40 percent better, and at CES, Nvidia could only show very select snippets of games and benchmarks that were picked to show off its architectural strengths. Those maxed out at about 60 percent better than Cypress, so consider them a best case.

If that 60 percent was from a fully working 512 shader, 750/1500MHz Fermi GF100, likely the case at 280W power draw, then a 448 shader 600/1200MHz GPU would have 87.5 percent of the shaders and 80 percent of the clock. 160 * 0.875 * 0.8 = 112 percent of the performance of ATI’s Cypress. This is well within range of a mildly updated and refreshed Cypress chip. Don’t forget that ATI has a dual Cypress board on the market, something that Fermi GF100 can’t hope to touch for performance.
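The estimate above can be reproduced with a short Python sketch. It assumes, as the paragraph does, that performance scales linearly with both shader count and clock speed, which is a simplification. The function name `relative_performance` is hypothetical; all the numbers come from the text.

```python
def relative_performance(best_case_vs_cypress, shaders, full_shaders,
                         clock_mhz, target_clock_mhz):
    """Scale a best-case result linearly by shader count and clock speed."""
    shader_fraction = shaders / full_shaders
    clock_fraction = clock_mhz / target_clock_mhz
    return best_case_vs_cypress * shader_fraction * clock_fraction


# Best case shown at CES: 60 percent better than Cypress, i.e. 1.60x.
estimate = relative_performance(
    best_case_vs_cypress=1.60,
    shaders=448, full_shaders=512,          # two clusters fused off
    clock_mhz=1200, target_clock_mhz=1500,  # hot clock, actual vs intended
)

print(f"Estimated performance: {estimate:.0%} of Cypress")
```

Swapping in a different shader count or top bin shows how quickly the claimed lead erodes with each further cut.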

Fermi GF100 is about 60 percent larger than Cypress, meaning that it costs Nvidia at least 60 percent more to make, realistically closer to three times. Nvidia needs to have a commanding performance lead over ATI in order to set prices at the point where it can make money on the chip even if yields are not taken into account. ATI has set the upper pricing bound with its dual Cypress board called Hemlock HD5970.

Rumors abound that Nvidia will only have 5,000 to 8,000 Fermi GF100s, branded GTX480, in the first run of cards. The number SemiAccurate has heard directly is a less specific ‘under 10,000’. There will have been about two months of production by the time those launch in late March, and Nvidia bought 9,000 risk wafers late last year. Presumably those will be used for the first run. With 104 die candidates per wafer, 9,000 wafers means 936K chips.

Even if Nvidia beats the initial production targets by ten times, its yields are still in the single digit range. At $5,000 per wafer, 10 good dies per wafer, with good being a very relative term, that puts cost at around $500 per chip, over ten times ATI’s cost. The BoM cost for a GTX480 is more than the retail price of an ATI HD5970, a card that will slap it silly in the benchmarks. At these prices, even the workstation and compute cards start to have their margins squeezed.
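The wafer math is simple enough to sketch in Python. The figures are the ones quoted in the text; the sketch assumes all 9,000 risk wafers feed the first run, which the text only presumes, and it uses 8,000 cards, the top of the rumored range, as the starting point. The helper names are made up for illustration.

```python
DIE_CANDIDATES_PER_WAFER = 104  # from the text
RISK_WAFERS = 9_000             # from the text
WAFER_COST_USD = 5_000          # from the text

total_candidates = DIE_CANDIDATES_PER_WAFER * RISK_WAFERS


def implied_yield(good_chips, candidates=total_candidates):
    """Fraction of die candidates that end up as sellable chips."""
    return good_chips / candidates


def cost_per_good_die(good_dies_per_wafer, wafer_cost=WAFER_COST_USD):
    """Silicon cost of one good die, ignoring packaging, test and board costs."""
    return wafer_cost / good_dies_per_wafer


rumored_run = 8_000  # assumption: top of the rumored 5,000 to 8,000 range

print(f"die candidates: {total_candidates:,}")
print(f"yield implied by {rumored_run:,} cards: {implied_yield(rumored_run):.2%}")
print(f"yield at ten times that: {implied_yield(10 * rumored_run):.1%}")
print(f"yield at 10 good dies per wafer: {10 / DIE_CANDIDATES_PER_WAFER:.1%}")
print(f"silicon cost per good die: ${cost_per_good_die(10):,.0f}")
```

Note that the per-die figure covers the silicon alone; memory, board, cooler and test costs all come on top of it.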

The two real fixes, doubling the vias and redesigning circuits to minimize the impact of transistor variances, both require a full re-layout. Both also cost time and die area. The time is at least six months from tapeout, if you recall. Fermi taped out in late July and was only slated for small numbers at retail in late November, a very unrealistic goal. A B1 spin of the chip would be at least Q3 of 2010 if it tapes out today, and it won’t have a useful life before it is supplanted by the next generation of 28nm chips.

Should Nvidia make the necessary changes, that brings up two more problems. Nvidia is at two limits of chip engineering, a die size wall and a power wall. The power wall is simple: a PCIe card has a hard limit of 300W, and anything more will not get PCIe certified. No certification means legal liability problems, and OEMs won’t put it in their PCs. This is death for any mass market card. The power can only be turned up so far, and at 280W, Nvidia already has the dial on 9.5.

The die size wall is similar: you can only fit a mask of a certain size in the TSMC tools. The G200 pushed that limit, and the changes to Fermi/GF100 would likely push the chip to a size that would simply not fit in the tools needed to make it. At that point, you have to look at removing units, a task which adds even more time to the stepping. The only way to make it work is a shrink to 28nm, but the first 28nm process that is suitable is not going to have wafers running until the last few days of 2010, best case.

Fermi GF100 is six months late and slipping, can’t be made profitably if it can be made at all, and initial production is going to be limited to essentially a PR stunt. Every card made will be sold at a huge loss, meaning that there will probably not be any more wafers run after the initial 9,000 risk wafers are used up. Those were paid for in advance, so there is little harm to finishing them for a stopgap PR stunt.

The chip is unworkable, unmanufacturable, and unfixable. If Nvidia management has any semblance of engineering sense, it will declare its usual victory with Fermi and focus resources on Fermi II, something it is still promising for 2010. The changes needed to fix the first Fermi are basically not doable until the chip is shrunk to 28nm; the die is just too big.

This puts any hope for Nvidia out until 2011 for anything but PR stunts. The derivative parts for Fermi exist only on paper, they haven’t taped out yet. If Nvidia does tape them out, they will have the same size, yield and power problems as Fermi GF100. ATI will price them into the red, so there is no way Nvidia can make money on the derivatives, and there is no way it can fix the problems in time to matter either. Nvidia will not have any economically viable DX11 parts until the last days of 2010 if everything goes well from here.

As we have been saying since last May, Fermi GF100 is the wrong chip, made the wrong way, for the wrong reasons. Dear Leader has opened the proverbial can of Whoop-Ass on the competition, and on top of that criticized Intel’s Larrabee for everything that ended up sinking Fermi GF100. Intel had the common sense to refocus Larrabee and take a PR hit rather than pouring tens of millions more dollars down the drain for no good reason. It doesn’t look like Nvidia has the management skills to make that call. The company not only designed a ‘Laughabee’, it went against all sense and built that too. S|A


Edit: There was a typo; 5980 is in fact 5970.

Charlie Demerjian

Roving engine of chaos and snide remarks at SemiAccurate
Charlie Demerjian is the founder of Stone Arch Networking Services and SemiAccurate.com. SemiAccurate.com is a technology news site addressing hardware design, software selection, customization, securing and maintenance, with over one million views per month. He is a technologist and analyst specializing in semiconductors, system and network architecture. As head writer of SemiAccurate.com, he regularly advises writers, analysts, and industry executives on technical matters and long lead industry trends. Charlie is also available through Guidepoint and Mosaic.