Nvidia plays the meltdown blame game

Bumpgate: Official story doesn’t mesh with reality

Editor’s Note: From time to time, SemiAccurate will be republishing some older articles by its authors, some with additional commentary, updates and information. We are mainly reprinting some of the oft-referenced articles that originally appeared on the Inquirer. Some will have added content, but all will be re-edited from the originals as per contractual obligations. You may see some slight differences between the two versions.

This article has had some of the original links removed, and was published on Monday, July 7, 2008 at 5:32PM. 

NVIDIA’S STOCK TOOK a long overdue beating the other day, more because Wall Street is collectively horrified that they have been lied to than any fundamentals that are public. That said, the 8K keeps up their longstanding tradition of corporate honesty and integrity.

The root of the problem is, so far, HP notebooks, but likely others will surface with time. You can see the HP page here, and at least one lawsuit about the same thing here. [Editor’s Note: Original link broken/removed] No mention of this in the Nvidia statement though. They are claiming it is limited to HP if you ask sweetly. Our sources say otherwise.

Why would they? If you look at what Nvidia says, it isn’t their fault, it is those damn suppliers. The official line is “While we have not been able to determine a root cause for these failures, testing suggests a weak material set of die/package combination, system thermal management designs, and customer use patterns are contributing factors”. Parsing that, you see that they are blaming fabs and packaging suppliers first, OEMs second, and those damn users third, but they themselves have no fault here, NV can do no wrong.

This is really dangerous for three reasons, they are pissing off suppliers, pissing off OEMs, and pissing off users. Last I checked, they need all three to remain in business.

The weak die/packaging excuse doesn’t wash at all. Nvidia is blaming TSMC behind the scenes, trashing them pretty hard through ‘unofficial’ channels to deflect blame. They are likely to be doing the same to packaging suppliers as well, and others. The reason this doesn’t wash is that there are only a handful of suppliers in each of these fields.

If the packaging suppliers had a problem in their material set, there would be problems surfacing with other companies’ products. ATI, Altera and dozens of others would have chips crapping out left and right. This is especially true for designs that are meant to cycle up and down 24/7, like embedded parts. You would see an industry rife with failures and warnings like the bad caps problem of a few years ago.

You simply aren’t seeing that. Period. No warnings from others, no recalls, no TSMC warnings, no nothing. This is a sham to deflect blame from Nvidia, they don’t want to dent their shiny image, much less slow down the ‘can of whoop-ass’ opening. I am calling bullshit on the blame directed at their suppliers.

Suppliers are a problem for Nvidia though, at least they are now. Trashing your suppliers like this is a dangerous thing to do, Nvidia needs them more than they need Nvidia. Can you imagine the scene at the next TSMC planning meeting where they are discussing who gets what allocation on the next tight process, and how much they pay?

TSMC Planner 1: How many wafers do we allocate for Nvidia a month?
TSMC Planner 2: The 40nm process is looking tight at first, do you agree?
TSMC Planner 1: Yeah, really tight.
TSMC Planner 2: Remember that time when NV was calling us [male rooster euphemism][oral suction euphemism]s to anyone who would listen? Wasn’t that a fun time.
TSMC Planner 1: So 4 then?
TSMC Planner 2: 4K? That seems high.
TSMC Planner 1: No, 4.

Publicly blaming your suppliers is bad. When it isn’t their fault, it is worse. Doing so in the sleazy backhanded ways that Nvidia knows so well is tantamount to corporate suicide. Suppliers will find a way to make you pay, and they will get the knife in somehow. Nvidia being bossy and arrogant will only make the situation more enjoyable for them. Look for this PR blunder to have massive long term effects that manifest themselves in dropped margins, critical parts shortages, and missed deadlines. Bad move #1.

Bad move #2 is blaming the OEMs, this is done with the subtle phrase “system thermal management designs” in the 8K. This is engineering code for “we didn’t do anything wrong, those nitwits at HP did”. It works like this, Nvidia makes a part and it has a variety of constraints it is meant to be used within. Things like power draw, minimum and maximum temperature, and other related variables.

NV specs these things, and HP makes a notebook to the specs that NV gives them, a process that happens long before the chips come out of the fabs in any decent volume. If the chips are within the promised specs, things go well. If they are not, there are some tweaks you can pull, but if they are too far out of the promised spec, you are basically screwed.

Now this assumes both sides are honest, and people are trying to solve problems, not deflect blame. Nvidia is really good at the latter, bad at the former. They also can’t make a chip that isn’t a blast furnace. Most of their recent woes, including the massively delayed current round of MCPs, are down to out-of-control thermals, just like the last round.

How do you fix a systemic design problem in silicon on a time scale that doesn’t sink an entire season’s notebook sales? Easy, you fudge the spec sheet. If you have a TDP of 20W for a part, and it is coming in at 25W from the fab, you can lower the speed or change what TDP means. If you promised HP a chipset that has an 800FSB and it can only hit 667 in your wattage constraints, well, that is problematic. If you give them a chipset with a 20W TDP, and if the definition of TDP changed between the last generation and this one, well, “that is how we do it now”.

If it is HP incompetence as Nvidia is stating, then it would simply be a case of a line or two of notebooks that went bad. HP system engineering is one of the very best in the industry, period, subject to management whims. This is not to say they can’t screw up, they most definitely can, but it is pretty rare on anything major, HP does seem to have QC process engineering down well.

Does this mean they are perfect? No, not even close. Have they screwed up on a notebook? Sure, probably several here and there over the past few years. If you look at the HP page, once again here, you will see there are 24 models affected. I can believe there are one, two, maybe four screwups, but 24 model lines all with the same problem? All with cooling related failures? All with cooling related video failures? All with cooling related video failures only on Nvidia parts?

What NV is doing is smearing the good name of HP and its engineers. There is no way in hell that HP totally botched every Nvidia based notebook for a generation in the same way. Not a chance. This is once again a smear job, and it will once again come back to bite Nvidia in the bottom line, give it time. Companies like this have long memories. The only thing you can say from this is that it is not HP’s fault.

Well, you can say more. If HP spec’d cooling for a theoretical 20W, and the Nvidia chip puts out more than 20W, what happens is you get more heat in the system than you can get rid of. This means temperatures will slowly start to climb. They will either keep climbing, or level off, but likely it will go out of the thermal bounds set by Nvidia. The system will get really hot or simply crash.
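To put rough numbers on that argument, here is a minimal sketch of a lumped thermal model, where heat flows out in proportion to how far the chip sits above ambient. Every figure in it is a made-up assumption for illustration (35C air inside the chassis, a 95C package limit, a cooler sized so a 20W part settles at 90C, and the heat capacity used in the simulation). None of them come from Nvidia or HP spec sheets.

```python
# Hypothetical lumped thermal model: one heat source, one thermal resistance
# to ambient, one heat capacity. All numbers are illustrative assumptions.

AMBIENT_C = 35.0        # assumed air temperature inside the chassis
LIMIT_C = 95.0          # assumed maximum package temperature
R_TH = 2.75             # assumed thermal resistance, degrees C per watt
C_TH = 20.0             # assumed heat capacity, joules per degree C


def steady_state_temp(power_w, ambient_c=AMBIENT_C, r_th=R_TH):
    """Temperature at which heat removed equals heat generated."""
    return ambient_c + power_w * r_th


def simulate(power_w, seconds, dt=1.0, ambient_c=AMBIENT_C, r_th=R_TH, c_th=C_TH):
    """Step the temperature forward with simple Euler integration."""
    temp = ambient_c
    for _ in range(int(seconds / dt)):
        heat_out_w = (temp - ambient_c) / r_th
        temp += dt * (power_w - heat_out_w) / c_th
    return temp


promised = steady_state_temp(20.0)   # 90.0, inside the assumed 95C limit
delivered = steady_state_temp(25.0)  # 103.75, outside the assumed limit

print(f"20W part settles at {promised:.2f}C, limit {LIMIT_C:.0f}C")
print(f"25W part settles at {delivered:.2f}C, limit {LIMIT_C:.0f}C")
print(f"25W part after ten minutes: {simulate(25.0, 600):.2f}C")
```

With these assumed values, the same cooler that keeps a 20W part in bounds lets a 25W part level off well past the limit. That is the "climb, then level off out of bounds" case described above. A real notebook has fans, throttling and several heat paths, so treat this as a cartoon of the mechanism, not a model of any actual system.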

The problem there? This puts them outside of the thermal tolerances for the packaging. That is OK for short periods, but repeatedly staying above the limits causes the packaging material to degrade prematurely. Worse yet, the repeated heating and cooling caused by the laptop overheating and then crashing, followed by being left off for a bit to cool and ‘work again’, is horrible for the packaging. This is how solder joints and bumps crack, and substrate warps. Couple this with weakened materials from overheating, and you have dead GPUs.

This is hugely unlikely to be a HP problem, or a substrate problem. It is most likely a bad engineering design decision that Nvidia tried to sweep under the rug. Sometimes it works, other times it doesn’t. This time is an ‘other’, and companies like TSMC and HP don’t like being publicly crucified for Nvidia’s screwups. They really don’t like it.

The third bad move is ‘customer use patterns’, so it isn’t our fault, it is those crazy kids! A Scooby Doo villain couldn’t have said it better after a failed whoop-ass attempt. From the look of things, customers doing things like turning on and off laptops was completely unanticipated by Nvidia product planners. I mean who does that?

Blaming customers would be bad move number three, but I doubt most of them will realize it is Nvidia’s fault, they will blame HP or the host of other OEMs that haven’t been named yet. Either way, if you take bad move #2 into account, if I were an OEM, I would tell everyone calling in for warranty support unequivocally that it is Nvidia’s fault for supplying bum chips. In this case, it wouldn’t be deflecting blame, it would be honesty.

In any case, the ‘crazy kids’ blame game is pointless and will only hurt Nvidia if people hear it. They likely won’t, but there is no upside unless they think analysts are several steps dumber than a slow sheep. If you know how Nvidia treats analysts, this will sound very familiar.

In the end, this whole thing can be summed up by bad engineering, covering your ass, and hoping it blows over. Nvidia corporate messaging is pretty much incompetent, more driven by the fact that they are pawns of people higher up the food chain than anything else, it is kind of sad. The main problem is that they only have one tool, a hammer.

When something goes wrong, they don’t know how to solve those problems, only hit things. Rather than deal with a crushing loss like adults, the situation was dealt with by surprising Wall Street with a collective kick in the hedge funds. There was no explanation, no softening of the blow, and no word to the press, just a ‘Surprise, we are tanking’ governmental form, followed by stonewalling and finger pointing at blameless people.

Botched doesn’t begin to describe this response, but it is a good start, they utterly flunked Crisis Management 101. Given the last sentence of the 8K, “There can be no assurance that we will not discover defects in other MCP or GPU products.”, this is far from over. In fact, we know it isn’t over, there are many more lines and products affected.

With failed Nvidia parts leading to a massive loss, plummeting stock, and management fast-talking up a storm, what everyone wants to know is where the buck stops. That is not a simple question, but several industry insiders have told us the same story, it all depends on who got burned, and how big they are.

The one we know about is HP, here and here, but they are far from the only ones. Nvidia is chiming in now because it is very likely they are footing the bill for the class action settlement, or at least a very large chunk of it. When they gave the prescient advice of “There can be no assurance that we will not discover defects in other MCP or GPU products”, they weren’t joking. This problem hasn’t cropped up in desktop parts yet, but it most assuredly will. We are getting reports of other affected items, but it is premature to name them.

So, basically, Nvidia totally screwed up, and are blaming everyone but the one company they should, themselves. The OEMs know it, consumers know it, suppliers know it, and since the “OMFG, our hair is on fire” performance of last week, just about the entire world knows about it. Everyone who has one of these parts will be seeking restitution, just watch the bills mount now that word has spread.

But that brings up the costs and payments. Nvidia took a $150-200M hit initially over this, but what does that cover? Looking at Dell’s web site, going from an integrated GPU to an external Nvidia GPU is either a $50 or $130 upgrade, maybe more on a lower volume gaming part. That is what Dell sells the module for, plus profit and overhead. The chips that Nvidia sells, minus GDDR memory, construction etc, are probably in the $10-40 range.

If you look at that type of GPU, there could be three million or so parts affected, and those can likely be fixed by swapping out a PCIe card. With chipsets, well things get interesting, they are soldered to the mobo, as are many CPUs, especially in thinner notebooks. In this case, the replacement means a new mobo minimum, possibly a CPU thrown in for good measure.

Then there is the cost of fielding the support call, not a trivial matter for a dead notebook. Add to that shipping the part back to the depot, labor to replace the mobo, and shipping it back as well. Added staffing to handle the returns of large portions of 24 notebook lines adds to the total as well.
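For a sense of scale, here is a back-of-the-envelope sketch using only the figures quoted above: the $150-200M charge, roughly three million affected GPUs, and a guessed $10-40 per chip. Spreading the whole charge across just those three million parts is an assumption made for illustration. Nvidia has not said what the charge covers, and chipsets are not counted here at all.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Assumption: the entire charge is spread over the ~3M discrete GPUs only.

AFFECTED_UNITS = 3_000_000            # "three million or so parts"
CHIP_PRICE_RANGE = (10, 40)           # estimated dollars per chip
CHARGE_RANGE = (150_000_000, 200_000_000)  # the initial $150-200M hit


def chip_cost_range(units, price_range):
    """Cost of replacing the bare chips alone, low and high estimates."""
    low, high = price_range
    return units * low, units * high


def budget_per_unit(charge_range, units):
    """Dollars available per affected unit if the charge covers only them."""
    low, high = charge_range
    return low / units, high / units


chips_low, chips_high = chip_cost_range(AFFECTED_UNITS, CHIP_PRICE_RANGE)
per_low, per_high = budget_per_unit(CHARGE_RANGE, AFFECTED_UNITS)

print(f"Bare chips alone: ${chips_low / 1e6:.0f}M to ${chips_high / 1e6:.0f}M")
print(f"Charge per affected unit: ${per_low:.2f} to ${per_high:.2f}")
```

On those assumptions the bare silicon comes to $30M to $120M, and the charge works out to roughly $50 to $67 per affected part. That per-unit figure has to stretch over the chip, the support call, two shipping legs and depot labor, before a single soldered-down chipset or replacement mobo is counted.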

That leads to intangibles like customer ill will, lost productivity, and the odd executive who gets a bum laptop for their kids. You can’t put a dollar value on these, but they do have an effect, much of Dell’s current woes are due to treating customers like dirt 3-5 years ago. So, once again, who pays for all of these costs? That is an unequivocal “it depends”, depends on how contracts are written, how much leverage the OEMs have, and how much good will Nvidia has built up.

On one side, you have Dell, one time masters of the supply chain, and squeezers of every penny they can get. Industry insiders tell us that Dell will be billing Nvidia for everything, bad GPUs, mobos, replacement costs, help desk, lawyers, and every truck roll needed to fix something in the field. If Nvidia wriggles out of paying for something, they will pay for it in other ways, guaranteed.

HP is a little more flexible, but since Nvidia has been effectively badmouthing HP engineering for this problem, I can see how they would lean a bit more toward the “right royal bastard” side of things. They are close to Dell in what they will charge, but may let some minor things slide.

As you move down the food chain to smaller people such as mobo makers, Tier 2 computer makers, and even little shops, NV will disclaim more and more. Asus and Gigabyte will likely not get everything covered, not even close. Smaller board makers might get credit for the cost of MCPs and GPUs.

Unhappiness will abound. They will all get their pound of flesh, it may just take a bit of time. Lawsuits seem to have forced disclosure, and NV is still trying to spin, minimize the downside, and point fingers. This, however, is far from over. Look for desktops to be affected as well as discrete GPUs before this is all done with, most of them use the same ICs as the mobile parts.

There seem to be two products currently affected, the low end and the mid range parts of the last generation. Depending upon the failure rate, Nvidia could be looking to eat the majority of a generation’s products plus the cost of things they were soldered to, and the tech school dropout used to screw in the new parts.

This will be very ugly before it is done, very very ugly. Finger pointing early on and the blame game will only harden resolve on the other side, and add to costs. I guess there goes their cash reserves, it couldn’t come at a worse time. Then again, doing everything wrong does have a cost. S|A

Author’s Note: It still astounds me that Nvidia has never, not once, given out a list of affected OEMs, affected chips, and what models they went into. Nvidia has that info, they just refuse to release it. Depending on who Nvidia representatives are talking to, the story changes about why, but none of the ‘explanations’ stand up to any real scrutiny. To me, this is the textbook example of a company that cares not one whit about its customers, be it OEM customers or the end users. S|A

Charlie Demerjian

Roving engine of chaos and snide remarks at SemiAccurate
Charlie Demerjian is the founder of Stone Arch Networking Services and SemiAccurate.com. SemiAccurate.com is a technology news site; addressing hardware design, software selection, customization, securing and maintenance, with over one million views per month. He is a technologist and analyst specializing in semiconductors, system and network architecture. As head writer of SemiAccurate.com, he regularly advises writers, analysts, and industry executives on technical matters and long lead industry trends. Charlie is also a council member with Gerson Lehman Group.