AFTER WEEKS OF hunting down obscure sources harboring even more obscure technical knowledge, we can say that Nvidia massively screwed up the GF100 based cards: they have reverted to using some of the bad Bumpgate materials again. Nvidia is once again unwilling to talk about it, and their customers will probably be kept in the dark too.
The story of Bumpgate is long and technical, but let us recap a bit. In July of 2008, Nvidia announced that they had a problem with chips dying in the field in very high numbers. Nvidia refused to say what chips were affected, what OEMs sold the defective parts, or what affected customers could do. It’s not that they didn’t know. If you read the lawsuit by NUFI, their insurance company, Nvidia was paying out claims to notebook makers left and right, and had known about the problems for over a year.
Nvidia knew EXACTLY what the problem was, and exactly what chips were affected. Either that or they started changing chips in the middle of runs for no particular reason. The G86, G92 and many chipsets were all changed without explanation at the same time. Curious.
Instead of doing the right thing, the company deflected any questions, and claimed either ignorance or OEM contract clauses, depending on which one was more expedient for the questioner at hand, to wriggle out of answering questions. Nvidia even went as far as blaming their customers instead of their own lack of testing.
In Nvidia’s defense, the problem was extremely complex and technical in nature. There wasn’t a single problem, it was a cascading series of failures, but one that could have easily been caught if Nvidia actually did the thermal testing required for any IC this complex. Intel, AMD and ATI all tested, and they didn’t have similar field failures.
Before you give the company a pass for either not knowing or not understanding the problem, know that this simply isn’t the case. After months of digging, it has come to our attention that the whole Bumpgate problem was first noticed by HP in January of 2007. At the time, it was just a blip on the warranty claims system, but HP being HP, an investigation was quickly started. Lone blips became more and more common, problematic and expensive, and soon HP had a full blown problem on their hands.
It all culminated in a report that SemiAccurate has never seen, but has good reason to believe exists. We were told that by the fall of 2007, HP had root caused the problem to cracking bumps in Nvidia chips, and gone into great detail as to why it was happening. We are also led to believe that Dell and several other OEMs had similar investigations and reports. We are told they say much the same thing as the author’s three part series on the problem.
So, what is going wrong? In a three part story originally published by the author on another site (updated and republished here, here and here) the failure chain was detailed. The overarching idea is simple: Nvidia picked the wrong material and did not test the end result thoroughly, if they tested at all. Simple thermal cycle testing would have found this problem, so it is unlikely that Nvidia did even that.
The materials at the heart of the problem were the underfill and the bumps, hence the term Bumpgate. Bumps are the small, approximately 100 microns in diameter, balls of solder that connect the chip to the green PCB package, and carry all the power and data to and from the silicon itself. Underfill is the epoxy-like substance that is put around the bumps to protect them from moisture and contamination.
For the purposes of this story, underfill has one critical property, Tg, or temperature of glassification (formally, the glass transition temperature). Glassification is a fancy word for something that is like melting, but instead of going from a solid to a liquid, underfill goes from a solid to essentially a gelatin. Once you hit Tg, the underfill rapidly loses almost all its mechanical strength, sometimes ending up at less than 10% of its original stiffness.
In addition to protecting the bumps from moisture and contamination, underfill also provides mechanical support for the silicon slab that is the chip itself. When a chip heats up, it expands, and hot spots expand more than cool ones, creating mechanical strain on the bumps. Underfill supports the bumps, and relieves some of the stress on them. The stiffer the underfill you use, the more stress it absorbs.
If you use too soft an underfill, it doesn’t absorb much stress, and the bumps break. If you make it too stiff, the strain is transferred to the chip itself, especially the fragile passivation layers on top, killing the chip in very short order, usually upon first power up. The engineering magic is to play Goldilocks and find an underfill that is not too hard and not too soft, but just right, basically Baby Bear’s underfill.
Just when you thought it was easy, there is one factor that makes things horribly complex: all underfills that SemiAccurate is aware of have one annoying property, Tg is related to stiffness. If the material is soft, it has a low Tg, and the stiffer it gets, the higher the Tg. Basically if you want underfill that does not glassify at low temperatures, you MUST use a very stiff one. This complicates chip design immensely.
What happened to Nvidia is that they used a low Tg underfill, so it glassified at normal operating temperatures. Those repeated heating and cooling cycles glassified the underfill, and the bumps were repeatedly stressed, relieved, stressed, relieved, and so on until they simply snapped. If you take a spoon and bend it back and forth, after a while it will suddenly break. That is exactly what is happening to the Nvidia bumps.
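To make the mechanism a bit more concrete, here is a toy Python sketch of the idea. It is not a real materials model: the step change in stiffness, the temperature trace, and both Tg values are invented for illustration and do not come from any Nvidia, Namics, or Hitachi data. The function names are ours.

```python
def relative_stiffness(temp_c, tg_c, rubbery_fraction=0.10):
    """Crude step model: full stiffness below Tg, a small fraction at or above it.

    The 0.10 default mirrors the "less than 10% of original stiffness"
    figure from the text; real materials soften over a range, not a step.
    """
    return 1.0 if temp_c < tg_c else rubbery_fraction


def count_tg_crossings(trace_c, tg_c):
    """Count how many times a temperature trace rises from below Tg to Tg or above."""
    crossings = 0
    for prev, cur in zip(trace_c, trace_c[1:]):
        if prev < tg_c <= cur:
            crossings += 1
    return crossings


# Hypothetical trace: idle at 40 C, load at 90 C, repeated three times.
trace = [40, 90, 40, 90, 40, 90, 40]

# Hypothetical Tg values, chosen only to fall on either side of the load temperature.
LOW_TG_C = 70
HIGH_TG_C = 120

low_crossings = count_tg_crossings(trace, LOW_TG_C)
high_crossings = count_tg_crossings(trace, HIGH_TG_C)

print("low Tg underfill: ", low_crossings, "softening events,",
      "stiffness under load", relative_stiffness(90, LOW_TG_C))
print("high Tg underfill:", high_crossings, "softening events,",
      "stiffness under load", relative_stiffness(90, HIGH_TG_C))
```

In this made-up trace the low Tg material goes soft on every load cycle, leaving the bumps to take the strain each time, while the high Tg material never crosses its Tg at all. That is the spoon bending back and forth, in miniature.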
Fixes for this type of problem are not easy, and in some cases are impossible. Intel never had a hint of these problems; they likely tested in an obsessive manner for years, and that seems to have paid off. AMD has some patented technologies to protect the fragile passivation layers from the problems brought on by lead free bumps as well, something that required a lot of hard work to develop.
It is quite unlikely that either Intel or AMD is willing to share their solutions with Nvidia, even if they pout and threaten to open another can of whoop-ass. We are told Nvidia knew about the problem in fall of 2007, yet it took until summer 2008 for the first round of ‘fixed’ parts to come off the line, so even the ‘simple’ problems take the better part of a year to fix.
By October of 2008, Nvidia was still shipping the bad bump materials in Apple MacBooks, so even known fixes can take months to implement. Once ‘fixed’, those parts can have other potential problems brought on by unintended side effects.
What did Nvidia change with their solution? The first change was from a low Tg underfill, Namics 8439-1, to a higher Tg underfill, Hitachi 3730. This was followed by a change from high lead bumps to eutectic (lead/tin alloy) bumps. Complicating things was the fact that at that time, Nvidia did not use a stress relieving polyimide layer in their chips, severely limiting the Tg of the materials they could use.
That short recap of the problem brings us to the current 40nm chips. While some suggest Nvidia is incapable of learning, in this case they do seem to have done the right thing. As far as SemiAccurate is aware, later 55nm Nvidia chips and all of their current 40nm parts based on the G215, G216 and G218 ASICs use a higher Tg underfill. Sources tell us the underfill is Hitachi 3730 and no high lead bumps are used. Yay, progress. These chips are clean, no potential Bumpgate material related issues.
How about the GTX470 and GTX480? Sadly the story there is not so rosy. Nvidia has gone back to the Bumpgate underfill, Namics 8439-1, for these two parts. All evidence points to this being unplanned, a decision that may have some very nasty consequences later on.
Nvidia did not move to the Hitachi underfill on a whim. It provided the mechanical support that the bumps needed to live a long and happy life. The chip was engineered with that level of bump strain relief, it is a part of the physical design. If it were not necessary, in 2008 Nvidia would simply have changed the bump material from high lead to eutectic and left it at that. For some reason, they didn’t.
The GF100 die is large and hot, 529mm^2 or so and drawing well over the official 250W TDP. That is very hard on the bumps, so a lot of protection from the underfill is a necessity. It also has large caches that, sitting next to the hot high density shader blocks, almost assure uneven heating. Basically, the GF100 looks to be a case study for uneven heating and hot spots. Hot spots lead to uneven expansion, and that means lots of strain on some but not all bumps. If there is a worst case scenario for bump cracking, it probably looks a lot like the GF100 die.
Problems caused by uneven heating like this tend to show up first in the corners of a chip, something that intuitively makes a lot of sense. Guess what? Sources have told SemiAccurate that during testing, the higher Tg underfills that were used on the previous Nvidia 40nm GPUs delaminated the corners of the GF100 die. That is the fancy way of saying that if you use the same underfill that works on the G215, G216, and G218, it rips the corners off the GF100. Literally. In case you don’t get the point, this kills chips dead. Quickly.
Nvidia had to backpedal very fast, and go back to a softer underfill, the Namics 8439-1. If you look at the time it took to go from first silicon to when production chips started coming off the line, there doesn’t seem to be enough time to do a full thermal cycling test program. The problem with thermal cycle testing is that you simply cannot rush it. You have to heat things up as they would heat in the field, let them cool, heat, cool, and so on for a set number of cycles. Experts tell SemiAccurate that this takes months to do.
If Nvidia sacrificed some of the second hot lot of A1 silicon for thermal cycle testing, there was still less than four months to test, engineer a fix, test the new materials and implement them in production. At best it is a laughably short schedule, hardly enough for any real testing. Realistically speaking, there was far less time than that.
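For a feel of why the calendar is so tight, the schedule can be sketched as back-of-the-envelope arithmetic in Python. Every input below is a hypothetical placeholder: the cycle count, the minutes per cycle, and the time to engineer a fix are not figures from Nvidia, our sources, or any test standard. Only the "less than four months" window comes from the story, rounded up to 120 days.

```python
def cycling_days(cycles, minutes_per_cycle):
    """Wall-clock days for one chamber to run `cycles` heat/cool cycles back to back."""
    return cycles * minutes_per_cycle / (60 * 24)


def schedule_days(cycles, minutes_per_cycle, passes, fix_days):
    """Days for `passes` sequential test passes plus the time to engineer a fix."""
    return passes * cycling_days(cycles, minutes_per_cycle) + fix_days


# Hypothetical inputs, for illustration only.
CYCLES = 1000            # assumed number of thermal cycles per test pass
MINUTES_PER_CYCLE = 60   # assumed time to heat up and cool down once
FIX_DAYS = 30            # assumed time to pick and qualify new materials
WINDOW_DAYS = 120        # "less than four months", rounded up

one_pass = cycling_days(CYCLES, MINUTES_PER_CYCLE)
total = schedule_days(CYCLES, MINUTES_PER_CYCLE, passes=2, fix_days=FIX_DAYS)

print(f"one test pass: {one_pass:.1f} days")
print(f"test, fix, retest: {total:.1f} days of a {WINDOW_DAYS} day window")
```

With these made-up inputs, a test pass, a fix, and a retest eat nearly the whole window before the change is ever implemented in production. Different inputs would change the numbers, but the cycles themselves cannot be sped up, which is the point.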
Nvidia backpedaled on the materials used, and they did it in a hurry. Worse yet, they do not appear to have done so in a scientific manner, more of a blind panic to get the chips out the door to make good on earlier executive promises. The underfill they reverted to is the same one that was used in the defective Bumpgate chips, but the bumps are not high lead.
One thing is for sure, the underfill is not the same one they used on every other 40nm chip they made. To compound the problem, one of the public excuses that the company flung against the wall was that BIOS updates kept the Bumpgate chips cool enough to stay under the Tg and not be thermally cycled as often. One cursory look at the GTX470 and GTX480 running temperatures shows that this is impossible on the current GF100 based GPUs.
SemiAccurate asked Nvidia about the materials used in the original Bumpgate. On October 15, 2008, Nvidia’s Mike Hara told us, “The 9300/6400[sic] and 9600 discrete all use the new material set.” On April 12, 2010, when asked about the materials used in the GTX470 and GTX480, Nvidia’s Robert Sherbin said, “Charlie, Sorry, we don’t disclose this information publicly. Thus, I can’t provide it to you. b.”. Suddenly, they are not talking, any guesses as to why?
The GTX470 and GTX480 unquestionably use Namics 8439-1 instead of Hitachi 3730, and that is a regression. Insiders inform SemiAccurate that the regression was caused by the originally chosen materials literally killing chips, and something had to be done if Nvidia was going to ship the new GPUs.
With Nvidia not commenting on the situation, SemiAccurate recommends that potential buyers of GF100 based cards hold off on their purchases until Nvidia explicitly addresses this potential problem. By addressing the issue, we do not mean pawning off blame years later like the last time, but actually providing data as to why these chips are safe to purchase. Don’t hold your breath on this happening, the company has yet to come clean on Bumpgate. S|A