Tegra 3 missed performance goals by wide margins

Late, problematic, and far too costly

Nvidia’s ‘quad-ish’ core Tegra 3/Kal-el is a bad idea with a worse implementation, and that is before the financials. Once you get into that arena, things get positively ugly.

The short story is this: the CPU is under-performing badly, missed its internal clock targets, and costs too much to be financially viable.

Let’s start with what Tegra 3 is: a quad-core A9 built on a TSMC 40nm process. We went over it a while ago, some of the details are here. The biggest thing we missed is the so-called fifth core, and that is where many of the problems lie, not to mention the costs. ARM supplies the A9 core in two main versions, low power or high performance, both relatively speaking. Companies with an ARM core license can choose one or the other, but cannot modify the core itself.

Power is one of the biggest problems with the current Tegra 2, and there are only two paths available to save it. The first is to make the system really efficient and use the core as little as possible, putting it to sleep as often as possible and waking it only when necessary. This is what Intel and AMD do, and it goes by the acronym HUGS, Hurry Up and Go to Sleep. It works very well, and is what Nvidia was touting as a key win for Tegra 2 on the power savings front, real world testing notwithstanding.

The other method is to put the hard work in and tweak what you can with the core, and build it on a low power process, then claw as much clock speed back out as you can. If it is not obvious, this is the harder path, and one that takes much longer to implement. Several ARM vendors have done this, and the results are quite a bit better than Tegra 2’s power numbers, a whole lot better in fact. Intel and AMD do this in addition to HUGS, and the power levels they achieve are quite notable.

Which way did Nvidia go for Tegra 3? Well, neither. They built a Frankenstein chip that trades die size, leakage, performance, yield, and cost for idle power savings. The trade-off is time to market. A risky bet.

Back to the technology, Nvidia added a fifth A9 core to the CPU and built it on TSMC’s new 40LPG process. This process allows you to build a 40G (40nm General purpose) chip with 40LP (40nm Low Power) ‘islands’, a best of both worlds scenario, right? If you look at the ARM A9 web page, specifically the performance tab, you can see the differences between the two A9 cores, albeit both on 40G. 40LP should shave some power off of the total along with some clock speed. The idea is to have the LP core running when loads are light, then fire up the other four big cores when demand is there, moving the work to them. Again, a win/win, right?

In both cases, the answer is, “sort of, but…”. The first “sort of” is the performance: you will take a big hit on the transfer from the LP core to the G cores, but we would assume that Nvidia engineers have figured out how to minimize this impact. Either way, you will take a latency hit as well as a power hit in moving all that data, and the best you can do is minimize how often you have to switch.
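One generic way to keep migrations rare is hysteresis: only move the work after the load has stayed past a threshold for several samples in a row. The sketch below is a toy illustration of that idea, not a description of how Nvidia actually does it. The class name, both thresholds, and the hold count are all hypothetical values picked for the example.

```python
from dataclasses import dataclass


@dataclass
class ClusterGovernor:
    """Toy policy for choosing between one low-power core and the fast cores.

    Every threshold here is a made-up illustrative value, not vendor data.
    """
    up_threshold: float = 0.80    # load above this asks for the fast cores
    down_threshold: float = 0.30  # load below this asks for the low-power core
    hold_samples: int = 3         # consecutive samples required before migrating
    on_fast: bool = False         # start on the low-power core
    migrations: int = 0           # how many times work has been moved
    _streak: int = 0              # consecutive samples asking for a switch

    def update(self, load: float) -> bool:
        """Feed one load sample (0.0 to 1.0); return True if on the fast cores."""
        if self.on_fast:
            wants_switch = load < self.down_threshold
        else:
            wants_switch = load > self.up_threshold
        # A single sample that disagrees resets the streak.
        self._streak = self._streak + 1 if wants_switch else 0
        if self._streak >= self.hold_samples:
            self.on_fast = not self.on_fast
            self.migrations += 1
            self._streak = 0
        return self.on_fast


def count_migrations(loads, **kwargs) -> int:
    """Run a load trace through a fresh governor and count the migrations."""
    gov = ClusterGovernor(**kwargs)
    for load in loads:
        gov.update(load)
    return gov.migrations


if __name__ == "__main__":
    bursty = [0.9, 0.1] * 10  # load that flips every sample
    print("no hold:", count_migrations(bursty, hold_samples=1))
    print("hold 3: ", count_migrations(bursty, hold_samples=3))
```

With a load that flips on every sample, a governor with no hold time migrates on every sample, while one that waits for three agreeing samples never migrates at all. The price is that a genuine burst waits those extra samples before the fast cores wake up, which is the latency trade described above.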

Even if Nvidia did a theoretically optimal job here, it would still lose to an efficient A9 implementation of the kind many other ARM vendors produce. That in turn would be annihilated by a custom core, à la Qualcomm’s Snapdragon and the upcoming Krait line. Is the fifth core a performance benefit? Yes. Is it better than the competition? Not even close. According to numbers shown to SemiAccurate, basically every upcoming competitor beats Tegra 3, most by wide margins.

Next up is the 40LPG process itself, a ‘sort of’ win/win again. This is not to say there is anything wrong with the process, there most definitely is not, and it does deliver exactly what it promises. The problem is cost. Some estimates tossed around have said that the cost adder for the 40LPG process is between 5 and 10%, but once again, SemiAccurate’s research says that is both low and purposefully misleading, since it only takes one set of costs into account.

The first big problem, the big skeleton in the 40LPG closet, is masks: you need two sets. In order to do a 40LPG chip, you need a set of 40G masks and a partial set of 40LP masks, and you have to swap them out for each step of the process that is applicable. The 5-10% number may be true for what TSMC adds to the bill to make up for increased tool time and their added costs, but what about the rest?

Conveniently left out of the official Nvidia ‘not leaked from us’ numbers are the mask costs, likely adding $1M or more to the tab, plus lowered yield, clock speed headaches, and many other niggling production issues. Our sources say that the cost adders are likely 20% or more when all is said and done. 40LPG is certainly viable and certainly delivers what it promises, but it isn’t cheap, and you certainly have to have a process and production team that is on the ball. If you have followed Nvidia, you probably know that this last one is… not compatible with the company of late.

On top of that, you have to add another core to the mix; 4+1 cores are exactly that, 4+1. According to ARM, two cores for the performance optimised A9 is about 6.7mm^2 vs 4.6mm^2 for the low power variant. Add in the support circuitry for the core migration and other related housekeeping, and you are looking at less than 10% of added chip. Given the attendant yield losses that come with area, plus effectively doubling process steps on many layers, each of which takes a toll, this added core will probably bump silicon costs by about 10%. In short, this one is going to be much, much more expensive to produce than Nvidia is even hinting at.
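For a rough feel of the area numbers, here is a back-of-envelope sketch using only the two ARM figures quoted above. It assumes one core is simply half of the quoted two-core area and ignores the migration and housekeeping circuitry, both of which are simplifications of ours rather than anything ARM or Nvidia has stated. The function names are made up for the example.

```python
# Two-core Cortex-A9 areas as quoted in the text (mm^2, per ARM).
PERF_PAIR_MM2 = 6.7        # performance optimised variant
LOW_POWER_PAIR_MM2 = 4.6   # low power variant


def per_core(pair_mm2: float) -> float:
    """Assumption: a single core takes half of the quoted two-core area."""
    return pair_mm2 / 2


def min_base_die_mm2(added_mm2: float, max_fraction: float = 0.10) -> float:
    """Smallest existing die area for which added_mm2 stays under max_fraction."""
    return added_mm2 / max_fraction


quad_perf = 4 * per_core(PERF_PAIR_MM2)    # four performance cores
fifth_lp = per_core(LOW_POWER_PAIR_MM2)    # fifth core, low power variant
fifth_perf = per_core(PERF_PAIR_MM2)       # fifth core, performance variant

if __name__ == "__main__":
    print(f"four performance cores:  {quad_perf:.1f} mm^2")
    print(f"fifth core (low power):  {fifth_lp:.2f} mm^2, "
          f"{fifth_lp / quad_perf:.0%} of the quad")
    print(f"fifth core (performance): {fifth_perf:.2f} mm^2, "
          f"{fifth_perf / quad_perf:.0%} of the quad")
    print(f"die needed for <10% adder: {min_base_die_mm2(fifth_lp):.1f} to "
          f"{min_base_die_mm2(fifth_perf):.1f} mm^2")
```

On those assumptions the fifth core is somewhere between 2.3 and 3.35mm^2, a sizable fraction of the CPU cluster but under 10% of the whole chip as long as the rest of the die is bigger than roughly 23 to 34mm^2. The support circuitry pushes that threshold up by however much area it takes.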

Then there is one last problem, actually making it. If you noticed, today was the big launch day for Tegra 3, specifically in the Asus Transformer 2/Prime netbook. When today rolled around, the Transformer was mysteriously delayed, once again to wait for that laggard Google. It isn’t Nvidia’s fault. Again. Just like Tegra 2. Again. Nothing to see here, Nvidia is making something that doesn’t stand up to scrutiny, so blame Google. Again. You would think that when the Tegra 3 drum was banged heavily a bit more than a month ago, they might have realized that the OS would not be ready in time, what is the lead time for laptop assembly and shipping from Taiwan again? That said, why would any company purposely mislead investors about material product launches? It makes no sense.

Getting back to the delay, there is another big whoopsie that no one is talking about which may be the actual reason for it: production problems. If you recall, Nvidia has been promising 1.5GHz base frequency Tegra 3s for a long time, and that number was privately pushed hard as recently as three weeks ago. The TSMC 40nm process is over two years old and well understood, so estimates of performance are closer to fact than divining. Production silicon should be bang-on estimates, right?

Well, with today’s “Google induced” delay, the numbers actually came out, and they are 1.4GHz. Actually, it is 1.4GHz max, 1.3GHz if more than one core is working, aka marketing spin for <1.3GHz with a bit of optimistic ‘turbo’. That, ladies and gentlemen, means the promised clock was at least 15% higher than what shipped, a shortfall of at least 13% against the 1.5GHz target, on a well understood process. It has to be Google’s fault though. Any other explanation would mean that Nvidia can’t make what they have promised, and are struggling to supply a single relatively low volume device, much less the volumes they promised the financial community.
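The size of the miss depends on which clock you use as the baseline, so it is worth running the arithmetic both ways. The sketch below uses only the three clocks quoted above; the function names are ours.

```python
def shortfall_vs_target(target_ghz: float, actual_ghz: float) -> float:
    """Fraction of the promised clock that went missing."""
    return (target_ghz - actual_ghz) / target_ghz


def target_over_actual(target_ghz: float, actual_ghz: float) -> float:
    """How far the promised clock sits above the clock that shipped."""
    return (target_ghz - actual_ghz) / actual_ghz


PROMISED_GHZ = 1.5
SHIPPED_GHZ = {"single core max": 1.4, "more than one core": 1.3}

if __name__ == "__main__":
    for label, actual in SHIPPED_GHZ.items():
        print(f"{label}: {actual}GHz, "
              f"{shortfall_vs_target(PROMISED_GHZ, actual):.1%} below target, "
              f"target {target_over_actual(PROMISED_GHZ, actual):.1%} above shipped")
```

Against the 1.3GHz multi-core figure, the shipped part is about 13.3% below the promise, or equivalently the promise was about 15.4% above what shipped. Against the 1.4GHz single-core ‘turbo’ figure the gap is about 7% either way.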

(Note: We do know that HTC is promising a 1.5GHz part, but there are two problems. First, Nvidia PR is saying 1.4GHz to the press. Second, if Asus can’t support 1.5GHz ‘turbo’, not base, in a netbook/tablet form factor, with tablet thermal dissipation and tablet sized batteries, can 1.5GHz base be done in a phone? Were the laws of physics repealed in California during last night’s election?)

That darn Google, always screwing up Nvidia’s faultless engineering. Then again, the last round of ‘Google induced’ delays for Tegra 2 meant that the dozens of IR promised design wins shrank to almost zero. This time, Nvidia might want to rethink that finger pointing lest Wall Street catch on to this little similarity, much less the meaning behind it. Luckily, no financial people read SemiAccurate, much less hire us as technical consultants. Bullet dodged there.

Tegra 3, even at its reduced volumes, delayed introduction, and massive added costs, is a performance monster, right? Well, no, it isn’t, it is quite frankly a dog. The main problem is that Nvidia doubled the core count, 2 to 4, upped shaders from 8 to 12, and didn’t add any memory controllers. They still have one controller for all of it, but the maximum speed is upped from 600 to 1066MHz LPDDR2. This would be significant if Tegra 2 wasn’t already a laggard in this area, and if the speed increase actually made up for the increases in horsepower. Additionally, L2 cache remains the same size as Tegra 2’s, compounding the memory pain.
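To put rough numbers on the memory squeeze, the sketch below divides the quoted memory speed by the number of units fighting over it. It assumes a single controller of unchanged width on both chips, so that bandwidth scales with the quoted speed, and it ignores CPU clocks and the unchanged L2. Those are our simplifications, and the class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Soc:
    name: str
    cpu_cores: int
    shaders: int
    mem_mhz: int  # maximum LPDDR2 speed quoted in the text


TEGRA2 = Soc("Tegra 2", cpu_cores=2, shaders=8, mem_mhz=600)
TEGRA3 = Soc("Tegra 3", cpu_cores=4, shaders=12, mem_mhz=1066)


def mem_per_core(soc: Soc) -> float:
    """Memory speed available per CPU core (relative units)."""
    return soc.mem_mhz / soc.cpu_cores


def mem_per_shader(soc: Soc) -> float:
    """Memory speed available per shader (relative units)."""
    return soc.mem_mhz / soc.shaders


def change(new: float, old: float) -> float:
    """Relative change from old to new, e.g. -0.11 for an 11% drop."""
    return new / old - 1


if __name__ == "__main__":
    print(f"total memory speed: {change(TEGRA3.mem_mhz, TEGRA2.mem_mhz):+.1%}")
    print(f"per CPU core:       "
          f"{change(mem_per_core(TEGRA3), mem_per_core(TEGRA2)):+.1%}")
    print(f"per shader:         "
          f"{change(mem_per_shader(TEGRA3), mem_per_shader(TEGRA2)):+.1%}")
```

On these assumptions total memory speed rises by about 78%, but each CPU core ends up with about 11% less than it had on Tegra 2. Shaders come out roughly 18% ahead if you count them alone, but the CPU and GPU share that one controller, so the two cannot both be fed at those rates at the same time.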

Numbers seen by SemiAccurate show a dual core Qualcomm Krait SoC absolutely destroying a 1.5GHz Tegra 3, and these come from real tested silicon, not estimates. Not that Nvidia can make a 1.5GHz Tegra 3, but that is nitpicking. T33, aka Tegra 3.3, was meant to close the gap with Qualcomm by raising clocks from 1.5GHz to 1.7 or 1.8GHz, but it looks like a lost cause now. T33 seems to be more of a holding action to get things to what Nvidia promised T30/Tegra 3 would be when it was released in volume. Last September. If you are wondering why the Tegra roadmap was delayed and some desperate stopgaps put in when we exclusively broke that story last month, now you know. Does anyone else find it curious that Nvidia isn’t releasing clock speeds, only max ‘turbo’ frequencies?

In the end, what you have is the exact same Nvidia tactic that failed so miserably the last two times. Promise the moon, use the tame press as a sounding board, and fall flat on execution. Don’t admit anything is wrong, and blame everything on external sources because nothing is wrong at Nvidia. 28nm is on track and will fix everything anyway, the rumors of problems there have been curiously denied through backchannels, but the reasons that Nvidia dumped most if not all of their early 28nm capacity are not worth Nvidia addressing. Costs of Tegra 3 production are not a valid topic of discussion either, all is well, just believe the men with vested interests and titles. As far as Nvidia is concerned, Tegra 3 is just what they expected. S|A


Charlie Demerjian

Roving engine of chaos and snide remarks at SemiAccurate
Charlie Demerjian is the founder of Stone Arch Networking Services and SemiAccurate.com. SemiAccurate.com is a technology news site addressing hardware design, software selection, customization, securing and maintenance, with over one million views per month. He is a technologist and analyst specializing in semiconductors, system and network architecture. As head writer of SemiAccurate.com, he regularly advises writers, analysts, and industry executives on technical matters and long lead industry trends. Charlie is also a council member with Gerson Lehrman Group.