Why did TSMC stop 28nm production?

No clear answer, but lots of speculation

There has been a lot of misunderstanding about the TSMC shutdown news from last week, so let's look at it a little more closely. At this point, information is really scarce, and very few people are talking.

What is the current state of things? Over a week ago, SemiAccurate heard that TSMC had halted all 28nm shipments without so much as an explanation. Our best information at the time was that customers were not forewarned; they were only informed after the shutdown had occurred. It took over a week of digging before we got any hard confirmation of the situation, and then we wrote it up. At that point, the best information anyone had was that TSMC was going to have production running at full capacity, and chips coming out of the fab, by the end of March.

Since then, there has been a lot of FUD, rumors, and explicit non-denials. The best one of these is from Bit-Tech, including some direct statements by TSMC PR. Unfortunately, those statements do not deny that there is a stoppage, and dance around the issue. About the most direct part of the statement is near the end, where TSMC states, “I want to inform you that our 28nm production is normal, and all our 28nm customers are fully aware of our production status”. Note that it says that customers are informed about the situation, but does not say that customers were informed about it before it happened, or that it didn’t happen at all.

The best information that SemiAccurate has been able to gather says this is indeed the case, and customers had no idea that this was going to happen until it was already done. While we do not dispute the TSMC statement, it is clearly meant as a face-saving stunt rather than to inform customers or deny our report.

To make matters far worse, if you know a bit about the industry, TSMC does not do explicit PR in a pro-active way. Instead, they have proxies that they pay in the background who do not disclose their relationship with TSMC or disclose that their ‘opinions’ and ‘articles’ are nothing more than TSMC’s damage control when things go wrong. Why they continue to use such a Ninny when it just makes them look worse is beyond me, but TSMC desperately needs to change its tactics before anyone takes them seriously. If anyone ever finds out about the lists of tapeouts they hand him to leak, heads will roll. [Editor’s note: We don’t think TSMC is aware of how much this damages their reputation in the industry.]

Since we broke the original story, SemiAccurate has gotten additional confirmations about the situation, or parts of it. Unfortunately, the confirmations all describe the effects, not the cause, of the problem. As of today, the best information we have is that all 28nm production at TSMC, on both high and low power processes, is not running, and has not been for at least three weeks. No one is reporting that production has restarted, nor has any date been given for such an event, if one exists. No root cause is public yet, either.

So let's analyze and hopefully narrow the possibilities. This is based on what we know about how the processes and technology work, not based on leaks. We will assume that production stopped in mid-February, and that all three variants, including both high and low power processes, High-K and non-High-K, are down. Even if the low power variant of 28nm is still running, it does not affect the analysis below.

There are several possibilities that could lead to a shutdown: external influences such as power failure, machine failure, machine settings, fab contamination, material contamination, post-production problems, and the generic “other problem”. Let's look at each one.

One thing that is pretty easy to rule out is an external influence such as an earthquake, power failure, or water supply shortage. All of these are possible and have affected the Taiwanese fabs in the fairly recent past, but we can find no instance of them happening in the mid-February time frame. Also, none of these problems would be specific to the 28nm process; they would affect other geometries running alongside it in the same fab, not to mention other nearby fabs. This has not happened.

Fab contamination is pretty easy to rule out as well. TSMC’s 40nm production was a proverbial mess for months because of a purported leak in a gas line. Even if this was not the actual case on 40nm, there was a persistent contamination problem that took months to fix. Once again, since this issue is specific to 28nm and no problems have been reported on other lines in the same building, this is extremely unlikely.

If you know how fabs work, a semiconductor line is not the same thing as a traditional assembly line; there are clusters of tools to do a job. If a chip needs one pass in tool A and three passes in tool B, or one takes disproportionately longer than the other, you will have more copies of tool B than tool A. In any case, you are very likely to have multiple copies of all the tools, and one breaking is unlikely to stop the entire process, just slow it down. The same is true for setting the dial to 11, as you can see in this TSMC internal training video.

[Editor’s note: Yes, that was a joke, something we rarely do around here, but since you get the same info you get from white papers for a lot less at SemiAccurate, we step out of line every now and again to keep readers from becoming suicidally bored. Carry on.]

If any single tool was misconfigured, or even if an entire tool type was set wrong, that would not necessitate a complete line stoppage, just a resetting and maybe a minor period of re-calibration. Breakage and bad settings are similarly unlikely to be the cause of a three-week-and-counting stoppage.
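To put rough numbers on the tool cluster argument, here is a minimal sketch. The tool names, counts, and timings are made up purely for illustration and have nothing to do with TSMC's actual lines; the only assumption is that throughput is capped by the most constrained tool cluster.

```python
from dataclasses import dataclass


@dataclass
class ToolType:
    name: str
    count: int             # working copies of this tool (hypothetical)
    passes_per_wafer: int  # how many times each wafer visits this tool
    hours_per_pass: float  # processing time for a single pass


def wafers_per_hour(tools):
    """Line throughput is capped by the most constrained tool cluster."""
    return min(
        t.count / (t.passes_per_wafer * t.hours_per_pass) for t in tools
    )


# Illustrative line: tool B is visited three times per wafer,
# so the fab carries three times as many copies of it as tool A.
line = [
    ToolType("A", count=2, passes_per_wafer=1, hours_per_pass=1.0),
    ToolType("B", count=6, passes_per_wafer=3, hours_per_pass=1.0),
]

healthy = wafers_per_hour(line)

# One copy of tool B breaks: the line slows down but keeps running.
line[1].count -= 1
degraded = wafers_per_hour(line)

print(f"healthy:  {healthy:.2f} wafers/hour")
print(f"degraded: {degraded:.2f} wafers/hour")
```

With these invented figures, losing one of six B tools costs about a sixth of the output. Throughput only drops to zero if every copy of some tool type is down at once, which is why a single broken or misconfigured machine is a poor explanation for a full stop.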

Material contamination seems to be the most likely candidate. TSMC could have gotten a bad batch of gas or other material, and not known about it until the finished products came out weeks later. This could necessitate a pause until new materials could be brought in, or worse yet that pause and a decontamination period to basically scrub the unwanted bits from the lines and tools. The worst case is that some tools were poisoned and those have to be replaced, something that takes months. Either way, materials problems could easily be common across all three 28nm processes, and could affect things suddenly.

Next up is the worst of all worlds, a post-production problem. This could be either something like an Nvidia Bumpgate or something more process related. The most likely candidate is a lifespan issue, where extended testing has shown the parts fail in large numbers after a year or two of hard use. This is possible, but we can’t see how it would not have been caught months ago in pre-release testing.

Most companies, with one notable exception, tend to test for problems like this very carefully, and we can’t see something like this getting out and affecting everything 28nm, both high and low power. It also would be very unlikely to lead to such a sudden stoppage; the chipmakers would have been brought into the loop early on, not post-stop. Lastly, if the problem was bad enough to halt production, it would be bad enough to necessitate a recall, and that has not happened.

For the generic “other problem”, there are things like packaging problems, underfill, solder bumps, and TSMC partner problems. These tend to be varied, especially given the number of packaging houses and technologies used for different intended purposes. Cell phone chips are not packaged like 300W GPUs, and industrial lifespan FPGAs are similarly different from disposable widget controllers. All of this would lead to specific product stoppages, not line stoppages.

There could be other issues that we have not considered in our analysis here. Also, some or all of the effects may be only part of the story. To top it off, TSMC could have realized that there would be no short-term restart of the process, and just promised one to placate their customers. Given the number of sources SemiAccurate has who are reporting either the same information or subsets thereof, we very much doubt the end result will fail to line up with our initial reporting and analysis of this situation. The reasons behind the shutdown remain an open question, but hopefully you now have a much better idea of the possibilities. If you do know what is going on, feel free to write us, preferably with documentation shorter than a whitepaper. S|A