HATS OFF TO Intel for pulling one of the most devious PR spin and coverups of the year with their chipset problems explanation. While it may be a desperate and masterful obfuscation job, if you poke a little under the surface, you can see how badly they screwed up, and how little of the truth actually came out.
Updated: Tuesday, February 1, 2011 Bottom of the article
To say I don’t buy the explanation that Intel provided is understating the issue immensely; only the level of the coverup is still somewhat vague. The reason isn’t that it doesn’t make sense, but once you start looking at questions about how and why, things fall apart. One complicating factor might be explainable; a dozen, not a chance. The list is far closer to a dozen than one.
Let’s start with the problem. Lars wrote it up, but a few more things have come to light since. Short story: the Cougar Point chipset has a problem with slow failures on the SATA-3 (3Gbps) ports, but not on the SATA-6 (6Gbps) ports. This means ports 0-1 are just fine, but ports 2-5 are quite possibly hosed. The overwhelming majority of PCs will never use more than 2 SATA ports, so even if you have an affected board, chances are that you will never notice.
More encouraging for Intel, the number of laptops out there that use more than two SATA ports, either internally or externally, is just about zero. If a laptop maker is going to use two SATA ports, they will obviously use the 6Gbps ones. It is questionable if those ports are even available on some laptop SKUs, so it would be surprising if laptops were affected by this in large numbers.
That is where the good stuff tends to end, and the fishy explanations, or lack thereof, begin. Let’s start with the problem itself. The SATA-3 ports will slowly start to degrade over time, and the error rate will grow until the link dies, and there is no recovery. This is supposed to hit 5-15% of the chipsets, and it is pretty random as to which ones will die.
The obvious first question is, why was this not caught in validation? It seems like an obvious one to find, and if it even crops up in 1% of the population during the initial testing period, then Intel testing should have caught this one. Intel’s testing is unusually thorough including their accelerated aging process. It does not seem obscure or even uncommon enough for Intel to miss. Something went badly wrong here.
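To put a rough number on why a 1% failure rate should be hard to miss, here is a minimal sketch of the odds of seeing at least one failure in a test population. It assumes failures are independent and that every failing unit actually shows the fault during the test window; the sample sizes are hypothetical, not Intel's real validation numbers.

```python
def detection_probability(failure_rate: float, sample_size: int) -> float:
    """Chance that at least one unit fails in a sample of independent units."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    # P(at least one failure) = 1 - P(no unit fails)
    return 1.0 - (1.0 - failure_rate) ** sample_size


if __name__ == "__main__":
    # Hypothetical test population sizes, assumed for illustration only.
    for units in (50, 100, 500, 1000):
        chance = detection_probability(0.01, units)
        print(f"{units:>5} units tested: {chance:.1%} chance of at least one failure")
```

Under those assumptions, even a few hundred parts on the test bench make a miss very unlikely, which is the point of the question above.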
Second is the question of what exactly is going wrong. Intel knows exactly what is wrong, and have implemented a fix, so one would assume that the tough questions have been answered. When specifically asked, Intel was not able to provide an exact explanation of the specific circuit errors. To me, this says that they don’t have a good explanation as to why they missed it in testing, so it is better to not explain lest people ask those questions.
Ugly as this may sound, it gets worse when you look at the manufacturing side. Cougar Point is made on a 65nm process, likely the same as that of the predecessor Ibex Peak/P55, but Intel has made their web site utterly useless for finding actual technical information, so we can’t be sure this is correct.
In any case, yet another problem is in the design/IP side of things. When you design something like a SATA controller, you do it once and re-use the code to make successive chips, or you just do what many others do and buy it outright from an IP supplier. Intel likely designed their own, and they have probably revised it several times over the years since the SATA-II/3Gbps spec was finalized years ago. Unless there was a radical change in the interface specs, or a radical change in the chipsets, once a block like this is feature complete, validated and proven, you would be quite frankly stupid to change it.
There have been no changes to the spec, and there doesn’t seem to be any reason that Intel would have changed the chipset on their own, so neither case should affect this chipset. Cougar Point and Ibex Peak look to be done on the same process, so most of the chip was probably just re-used. The SATA-6 ports are Intel’s first attempt at SATA-6, and they are obviously a different block than the SATA-3 ports. If the error is a design problem, how it did not crop up in older chipsets is a good question. Again, a question that Intel could not provide an answer to when asked.
If Intel did not re-use the IP, you really have to question their fiscal sanity. Why would you waste the money to re-do work that is done and dusted like that? What good does it do? There are only downsides and no upsides to it, as Cougar Point shows. The design side really leads to more questions than it answers.
Then there is the problem of design vs manufacturing problems. Intel insists that this error is a circuit design flaw, not a manufacturing or packaging problem. It will be fixed with a metal layer change, and a high up metal layer change at that, so it could very well be a design problem. The closest to an explanation we were given was that it is a circuit that was overloaded, and degrades the harder it is used.
Not having enough of an EE background to envision a mechanism where this would happen, we remain dubious that it is a design issue. The much simpler explanation is that it was a manufacturing problem that was correctable by a design change, i.e., you use thicker wires at a given location to correct for more variability in that layer than was forecast.
Another similar explanation could involve packaging, where the pins around the SATA ports don’t attach perfectly, and the higher voltages of the SATA port make them degrade more. A metal change could allow for larger bumps or better attachment. Both of these, and several other manufacturing related problems seem to be much more likely culprits. Intel’s whole story seems to be shaved very closely with Occam’s Razor.
Then comes the timing, and that is where things go really, really bad. The first problem is that there was no explanation to the OEMs before this whole problem went public. Intel says this is due to something called RegFD, basically that if a company discloses something material to one person, they have to do it to everyone. The logic that Intel seems to be following is that this will cost the company about $1 Billion: $300 Million in lost/slowed sales and $700 Million to recall and reimburse customer losses. That is most definitely material, no quibbles here.
If you look at the two problems together though, they don’t parse. For Intel to forewarn customers and OEMs, they would do so under very strict NDA agreements already in place. Intel would not have to mention the magnitude of the problem, the costs they would incur, or anything else financial, just technical problems. The OEMs would have had days or weeks of lead time to stop manufacturing, stop shipments, or possibly make a workaround.
The thing you may have noticed is that this is all NDA, and does not talk about any investors, just customers. Intel routinely puts out far more material disclosures to OEMs than this, things like long term roadmaps and shipping numbers, on a regular basis. The explanation of “Our lawyers made us do it under RegFD” is farcical.
Instead, Intel dropped a bomb on their biggest and best customers in the worst way. “Good morning guys, guess what, you are screwed, and 80+% of your business is stopped. Expect a box of cookies from us with a sincerely purchased Hallmark card, shipping sometime in late Q2.” This is going to go over like a lead balloon, and there will be repercussions.
Actually, it is worse than that. Most OEMs likely found out through the press: it is Chinese New Year, and most of Taiwan and China is literally shut down this week. All the relevant executives, decision makers and managers are on vacation right now, but the factories are still churning away 24/7. The lights are on, but there is no one home to make the decision to turn them off.
Best case, the relevant managers will be dragged in from their vacations and forced to work through the holidays to fix this, soothe their customers, and generally fume at the ones responsible. Worst case, they won’t know, and the factories will churn out another week of product before someone tells them to stop. Either way, the problem is massively compounded by this idiotic timing, and every single customer will be fuming.
Why did they do things this way? The only semi-sane reason is to control the message in order to keep their stock price up. Intel took the very unusual step of basically explaining this in terms of finance, not tech, and playing up how Q1 was actually going to be better for them than previously disclosed, even considering the charges.
All is happy: as of the end of the day, and as of this writing, Intel stock is essentially flat from where it opened. Mission accomplished; now all that remains is to deal with those pesky customers, a much easier task than herding the analyst cats. That said, it is a bit too manipulative for my liking. If I owned stock, I would go as far as to officially make an SEC complaint over this type of behavior. It just stinks.
To make matters worse, the chips have been on sale since January 9, 2011, a date we exclusively disclosed last September. That means they were on sale for a mere three weeks. If Intel had given their customers a warning when they knew, it could have prevented at least a third of those parts from being sold.
Intel says there were a bit less than 8M Cougar Point chips sold at this time, and an unknown but far lower number in the hands of consumers. By disclosing things in accordance with RegFD, but in a slightly less packaged, spun and palatable way for the analysts, they could have prevented literally millions of those parts from being sold.
Once again, the money saved by a prompt NDA disclosure to the customers, even a warning to them, not a full recall, would have been huge. Intel has made this type of disclosure and stop ship notice many times in the past, to not do it now seems very curious. The motherboards not made would be worth tens or hundreds of millions of dollars, all money that will now end up coming out of Intel’s pockets.
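The "at least a third" and "literally millions" figures above are simple arithmetic, sketched below. The inputs come from the numbers quoted in this article (a bit less than 8M chips, three weeks on sale, 5-15% expected to degrade); the assumption that sales were spread evenly across those three weeks is mine, and the one week of warning is a hypothetical.

```python
CHIPS_SOLD = 8_000_000              # "a bit less than 8M", so treat this as an upper bound
WEEKS_ON_SALE = 3                   # on sale since January 9, 2011
FAILURE_RATE_RANGE = (0.05, 0.15)   # share of chipsets expected to degrade


def affected_units(chips: int, failure_rate: float) -> int:
    """Chips expected to degrade at a given failure rate."""
    return round(chips * failure_rate)


def preventable_units(chips: int, weeks_on_sale: int, weeks_of_warning: int) -> int:
    """Chips that would not have shipped with an earlier warning.

    Assumes sales were spread evenly over the weeks on sale (an
    illustrative assumption, not a figure from Intel).
    """
    weeks = min(weeks_of_warning, weeks_on_sale)
    return round(chips * weeks / weeks_on_sale)


if __name__ == "__main__":
    low, high = (affected_units(CHIPS_SOLD, rate) for rate in FAILURE_RATE_RANGE)
    print(f"Expected to degrade: {low:,} to {high:,} chips")
    # Hypothetical: OEMs warned one week before the public announcement.
    print(f"Preventable with one week of warning: {preventable_units(CHIPS_SOLD, WEEKS_ON_SALE, 1):,}")
```

Keep in mind that the 8M figure is chips sold, with a far lower number actually in consumers' hands, so these are ceilings rather than estimates of failures in the field.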
I can’t think of a scenario where this is anything but grossly financially irresponsible. I also can’t think of a reason why one would do this other than to try and protect the stock price. Either way, it is mismanagement and money wasted by Intel.
Then there is the problem of when they knew about the problem. The official explanation seems very unlikely given the timing. Officially, the revised chipsets are in production now, and will be shipping by the end of February; let’s call it four weeks. If Intel only discovered the issue a few days ago, the timelines don’t seem to make sense.
The one part that does make sense is that the production of the chipsets will be accomplished in four weeks. Normally it is more of a six week minimum to push a production sized lot of chips through a fab, with packaging and testing adding yet more time. You can do hot lots and test lots much quicker, but in this case, you need full production runs, not a dozen wafers, and that means going through all the steps one by one.
Luckily for Intel, the fix was in the top metal layer, or possibly the top few layers. This means, if all went well, that the old chips could be halted in process, and new layers put on wafers that already are most of the way through that 6+ week process. This makes the four week fabrication and packaging timeline plausible, and Intel extremely lucky.
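For a feel of how far along those in-process wafers would need to be, here is a back-of-the-envelope sketch. The six week fab time and four week target are from the text above; treating fab time as spread evenly across the process steps, and the time allowed for packaging and test (`backend_weeks`), are assumptions made purely for illustration.

```python
FULL_FAB_WEEKS = 6.0   # roughly six weeks for a production lot through the fab
TARGET_WEEKS = 4.0     # revised chipsets shipping in about four weeks


def min_fraction_already_done(full_fab_weeks: float,
                              target_weeks: float,
                              backend_weeks: float) -> float:
    """Smallest share of the fab flow a wafer must have finished already.

    Assumes fab time is spread evenly over the flow (a simplification)
    and that backend_weeks of packaging and test follow the fab steps.
    """
    fab_budget = target_weeks - backend_weeks
    if fab_budget <= 0:
        raise ValueError("no time left for the remaining fab steps")
    remaining_fraction = min(fab_budget / full_fab_weeks, 1.0)
    return 1.0 - remaining_fraction


if __name__ == "__main__":
    # Hypothetical packaging and test allowances, in weeks.
    for backend in (0.0, 1.0, 2.0):
        done = min_fraction_already_done(FULL_FAB_WEEKS, TARGET_WEEKS, backend)
        print(f"{backend:.0f} week(s) of packaging/test: wafers at least {done:.0%} complete")
```

A fix confined to the top metal layers fits this picture, since wafers waiting just below those layers would already be most of the way through the flow.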
If the chip had to be redone from the ground up to fix the problem, a much more likely scenario, at least as far as the raw numerical odds go, then Intel would have had to have known before Sandy Bridge launched, and before their earnings call on January 13 too. This brings up all sorts of nasty regulatory implications, so nasty that I doubt anyone would risk them. In this case, I think Intel just got lucky.
That said, the timelines for the fix still seem implausible, terminally so. Why? None of the steps in the process of getting to the point where you can make the revised chip seem like they could have happened in the stated time frames. Let’s look at them.
The first is that you need to identify the problem. This takes time, and Intel claims that there have been no reports of failures from end users; all the problem reports are coming from OEMs or internal tests. Since they probably didn’t all happen on the same day last week, Intel had to have known about the potential of a problem for weeks.
These types of issues are murky and the numbers are barely above statistical noise, so no blame for not sounding the alarm earlier. Hindsight is 20/20, and making a $1 Billion recall decision takes a little more thoughtful analysis. That said, there had to have been signs for a while, much longer than implied by Intel.
Second is the decision that a problem does exist and, more importantly, is worth fixing. Once that decision is made, you need to figure out what exactly the problem is, not always an easy thing to do. If it were easy, it would not have slipped by design, various validation steps, both automated and manual, and testing. This takes a few days, and that is assuming the engineers involved got very lucky in figuring things out.
If there was sufficient information to pinpoint the problem easily and quickly, this time could have been cut down by significant chunks. If there was that much specific information though, it would have had to have come from somewhere, and that makes you wonder when Intel knew about this whole mess, and the magnitude of the problem.
The scenario where Intel got 492 specific, repeatable, and very emblematic problem reports on the same day late last week that allowed the engineers to pinpoint the error in record time seems impossibly silly. Intel has some of the best silicon engineers in the world, bar none, backed by the best tools in the world, but this is too much of a stretch. The timelines proposed with discovery and identification are far too tight for my belief.
From problem identification, the next step is to design a fix. This probably was relatively quick, maybe doable in a few hours. Again, if all of the stars aligned, and the fix was containable to a specific metal layer or two, this is entirely plausible. Intel seems to have gotten very lucky here too. Again.
Then comes the next step of validation, something that is far more problematic than most laymen can imagine. You not only have to validate the changes you make, but the problems that it could cause upstream and downstream of that circuit. It is nowhere near as simple as proofreading your changes and signing off on them. Intel could have skipped this step, but you can’t really cut it short.
The problem here is that Intel appears to have either put the chipset into mass production without testing at all, or done only the most cursory of checks on it. Either scenario is troubling, but for entirely different reasons. I would be very surprised if Intel skipped these validation and checking steps, so that pushes the ‘when they knew’ date back a few days.
Once you have the problem identified, root cause identified, and a fix designed and tested, you have to make the masks. Going on the assumption that the problem is confined to the last few metal layers, let’s assume you only need to make a partial mask set. This is very likely, but still takes a few days, and obviously can’t be done in parallel with the previous steps.
Overall, the steps involved, even if things go improbably well for Intel, will take more than a week. It is borderline inconceivable that it took less than a week, barring the ‘492 email’ scenario above. Even then, the scenarios where it took only a week carry very long odds. If I had to put money on the timing, I would bet the trigger was pulled at least two weeks ago on this fix, but it is likely longer.
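The steps above can be added up in a quick sketch. The step names follow the article, but the day counts are my own loose reading of "a few hours" and "a few days", not figures from Intel, and the steps are treated as strictly sequential.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    min_days: float
    max_days: float


# Day ranges are illustrative assumptions, not reported numbers.
STEPS = [
    Step("spot a pattern in the failure reports", 2, 14),
    Step("find the root cause", 2, 5),
    Step("design the fix", 0.25, 1),
    Step("validate the fix", 2, 5),
    Step("make a partial mask set", 2, 5),
]


def total_range(steps):
    """Best and worst case totals when every step runs one after another."""
    return (sum(s.min_days for s in steps), sum(s.max_days for s in steps))


if __name__ == "__main__":
    best, worst = total_range(STEPS)
    print(f"Best case: {best} days, worst case: {worst} days")
```

Even with every step at its assumed minimum, the total comes out above seven days, which is the "more than a week" floor argued for here.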
All of this makes the swirling questions of why things were disclosed so badly all the more pertinent. Intel shot their own feet off by screwing their customers. For me, the RegFD explanation doesn’t wash, and even if it does, the costs of waiting are monumental. It simply wasn’t worth it to not tell the OEMs, or even warn them that there might be a potential problem.
Intel is very good at messaging this kind of thing, and the OEMs do appreciate it, even if it is, at times, bad news. The timelines given to the discovery and fix are again borderline ludicrous, with no really good explanation as to how things happened in almost magical time frames. This is the long way of saying the technical side doesn’t add up either.
So why would Intel do this? The only explanation that is not only remotely plausible, but actually very likely, is to keep the stock price from tanking. Given how it was messaged, that mission was very nicely accomplished, with the stock closing down .07 from its opening Monday morning.
The costs of doing things this way, however, are huge, literally hundreds of millions of dollars. How any company can manage its finances this irresponsibly is beyond me. Such short term thinking may please Wall Street, but it only hurts more long term. Intel did the worst possible thing it could have today. I hope whoever had an upcoming stock sale feels it was worth the long term cost for the company. S|A
Updated: Tuesday, February 1, 2011
Editor’s note: Intel has informed us that their spokesperson misspoke yesterday in saying that Reg. FD requirements were the motivating factor for the timing and disclosure of the problem. They were not. The proper rules that apply, forcing their disclosure and the timing, are located here: http://taft.law.uc.edu/CCL/34ActRls/rule10b5-1.html
By Charlie Demerjian