NVIDIA’S GF100 ARCHITECTURE is falling into the same trap that G200 did, shooting for the moon at the cost of the parts that pay the bills. Let’s take a look at the architecture and how it stacks up in the market once again.
If you recall, last May, I said a few things about the chip then known as GT300, now called Fermi or GF100. The executive summary at the time was that GF100 was too big, too hot, and the wrong product design in almost all areas. Nvidia was shooting for a world-beating GPGPU chip, and it might have achieved that. Unfortunately for it, there isn’t a sustainable market for such a beast. The costs of that GPGPU performance were raw GPU performance and manufacturability. While GPUs have a large and sustainable market for the time being and perhaps the foreseeable future, the GPGPU market is another story altogether. It is a risky management bet at best.
What is Nvidia going to deliver? As we said earlier at tapeout, the GF100 Fermi is a 23.x × 23.x mm chip; we hear it is within a hair of 550mm^2. This compares quite unfavorably to its main competitor, ATI’s Cypress HD5870, at 334mm^2. ATI gets over 160 chip candidates from a wafer, but Nvidia gets only 104. To make matters worse, defective chips go up roughly by the square of the chip area, meaning Nvidia loses almost three times as many dies to defects as ATI because of the chip size differential.
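The die counts above can be sanity-checked with the common gross-die-per-wafer approximation, DPW ≈ π(d/2)²/A − πd/√(2A). This is a minimal sketch that assumes a 300mm wafer and ignores scribe lines and edge exclusion, so it only lands in the neighborhood of the 104 and 160-plus figures quoted above. The last step applies the article's area-squared rule of thumb for defect losses.

```python
import math

WAFER_DIAMETER_MM = 300.0  # assumption: standard 300mm wafer

def gross_dies(die_area_mm2: float, d: float = WAFER_DIAMETER_MM) -> int:
    """Approximate candidate dies per wafer (no scribe lines or edge exclusion)."""
    wafer_area = math.pi * (d / 2) ** 2
    edge_loss = math.pi * d / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

GF100_MM2 = 550.0    # "within a hair of 550mm^2", per the article
CYPRESS_MM2 = 334.0  # ATI Cypress (HD5870), per the article

print(gross_dies(GF100_MM2), gross_dies(CYPRESS_MM2))  # 100 175

# The article's rule of thumb: defect losses scale roughly with area squared.
defect_ratio = (GF100_MM2 / CYPRESS_MM2) ** 2
print(round(defect_ratio, 2))  # 2.71, i.e. "almost three times"
```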
The raw manufacturing cost of each GF100 to Nvidia is more than double that of ATI’s Cypress. If the target product with 512 shaders is real, the recently reported 40 percent yield rates don’t seem to be obtainable. It won’t hit half of that based on Nvidia’s current 40nm product yields, and likely far, far less.
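The "more than double" claim follows from cost per good die = wafer cost / (candidates × yield). A minimal sketch, assuming both chips come off wafers of the same price; the yields passed in are inputs to experiment with, not reported figures. With the die counts above, a good GF100 starts about 1.54× as expensive at equal yields and crosses 2× once Cypress yields exceed GF100's by a factor of 1.3.

```python
# Normalized wafer cost: only ratios matter here, so the absolute price drops out.
# Assumption: both chips come off TSMC 40nm wafers of the same cost.
WAFER_COST = 1.0
GF100_CANDIDATES = 104    # candidates per wafer, per the article
CYPRESS_CANDIDATES = 160  # "over 160", per the article; 160 used as a floor

def cost_per_good_die(wafer_cost: float, candidates: int, yield_rate: float) -> float:
    """Spread the wafer cost over the dies that actually work."""
    return wafer_cost / (candidates * yield_rate)

def gf100_to_cypress_cost(gf100_yield: float, cypress_yield: float) -> float:
    """How many times more a good GF100 costs than a good Cypress."""
    return (cost_per_good_die(WAFER_COST, GF100_CANDIDATES, gf100_yield)
            / cost_per_good_die(WAFER_COST, CYPRESS_CANDIDATES, cypress_yield))

# At equal yields the gap comes from die count alone: 160 / 104.
print(round(gf100_to_cypress_cost(0.4, 0.4), 2))  # 1.54

# Cypress-to-GF100 yield ratio at which a good GF100 costs exactly twice as much.
print(round(2 * GF100_CANDIDATES / CYPRESS_CANDIDATES, 2))  # 1.3
```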
Cost aside, the next problem is power. The demo cards at CES were pulling 280W for a single GPU, which is perilously close to the 300W maximum for PCIe cards. Nvidia can choose to break that cap, but then it would not be able to call the cards PCIe, and OEMs really frown on such things. Knowingly selling out-of-spec parts puts a huge liability burden on their shoulders, and OEMs avoid that at all costs.
280W and 550mm^2 mean Nvidia is maxed out on both power use and reticle area for any product from TSMC. There is precious little room to grow on either constraint. The competition, on the other hand, can grow its part by 60 percent in die area and over 50 percent in power draw while staying below what Nvidia is offering. That puts an upper bound on Nvidia’s pricing in a fairly immutable way, barring a massive performance win. If you don’t feel like reading to the end, the short story is that it didn’t get that win.
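Both ceilings in numbers, using only figures from the article: the CES cards leave GF100 roughly 7 percent of power headroom under the PCIe limit, while a Cypress grown by 60 percent in area would still be smaller than GF100.

```python
PCIE_MAX_W = 300      # PCIe card power ceiling cited in the article
GF100_DEMO_W = 280    # CES demo cards, per the article
GF100_MM2 = 550.0
CYPRESS_MM2 = 334.0

# Fraction by which GF100 could still grow its power draw before hitting the cap.
power_headroom = (PCIE_MAX_W - GF100_DEMO_W) / GF100_DEMO_W
print(f"GF100 power headroom: {power_headroom:.1%}")  # 7.1%

# A Cypress-class part grown 60 percent in die area.
grown_cypress = CYPRESS_MM2 * 1.6
print(f"Cypress grown 60%: {grown_cypress:.1f} mm^2, "
      f"still smaller than GF100: {grown_cypress < GF100_MM2}")
```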
Getting back to the architecture itself, Jen-Hsun was mocking Intel’s Larrabee as “Laughabee” while making the exact same thing himself. As we stated last May, GF100 has almost no fixed function units, not even the tessellator. Most of the units that were fixed in G200 are now distributed, something that is both good and bad.
How did this come about? Sources in Santa Clara tell SemiAccurate that GF100 was never meant to be a graphics chip; it started life as a GPGPU compute chip and was then abandoned. When the successor to the G200 didn’t pan out, the GF100 architecture was pulled off the shelf and tarted up to be a GPU. This is very similar to what happened to ATI’s R500 Xenos, but that one seems to have worked out nicely in the end.
How do you go from compute to GPU? Add a bit of logic to each Streaming Multiprocessor (SM – a 32-shader unit; the GF100 has 16 of them) to handle some DX11-specific GPU tasks. The upside is that this can work fairly well, an advantage of a more general-purpose GPGPU chip. The downside is that a chip you wanted to sell for $2,500 isn’t economically feasible to sell at $500. Given that competition in the GPU segment is fierce, Nvidia is unable to raise its pricing to cover costs.
Moving on to tessellation, we said last May that Nvidia does not have dedicated hardware tessellators. Nvidia said the GF100 has hardware tessellation, even though our sources were adamant that it did not. You can say that Nvidia is either lying or splitting hairs, but there is no tessellator.
Instead, the GF100 has what Nvidia calls a ‘polymorph engine’, and there is one per SM. Basically, Nvidia added a few features to a subset of the shaders, and that is what it is now calling a tessellator, but it is not. ATI has a single, fairly small dedicated tessellator that remains unchanged up and down the Evergreen HD5xxx stack. On ATI 5xxx cards, tessellation takes almost no shader time other than dealing with its output, just like any other triangle. On GF100, there is no real dedicated hardware, so the more you crank up the tessellation, the more shaders you co-opt.
Nvidia is going to tout the ‘scaling’ of its tessellation capabilities, but since it is kicking off the GF100 line at the top, scaling only goes down from there. ATI’s 5xxx parts don’t lose anything when going down, nor do they need to spend more die area on tessellation when going up.
When showing off the tessellation capabilities last week, Nvidia was very careful to show the Unigine Heaven benchmark with the dragon statue, claiming it beat the ATI 5870 by 60 percent. While technically true, this number is purposely misleading on two counts. The first is that this is a best case for showing off tessellation and only tessellation. If Nvidia were to show any of the other Heaven tests, the margins would narrow significantly.
Secondly, the test is very light on other parts of the system, so any use of system resources by the tessellator would likely be masked. It is unlikely that this performance would carry over to a real-world scenario, which is exactly why it was chosen. Synthetic benchmarks are usually picked to show a best-case scenario, and this one is no exception.
The last bit is that our sources were reporting that Nvidia was comparing GF100 to a Radeon HD5870, a much cheaper card. If Nvidia had compared it to a comparably priced ATI HD5970 Hemlock card, which coincidentally uses almost the exact same power, the results would not have been pretty. Hemlock doubles the tessellation power of 5870, so it is pretty obvious why that comparison was not made.
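A rough normalization of the Heaven result against Hemlock, assuming (as the article does) that Hemlock doubles the 5870's tessellation throughput and that the dragon-statue scene is tessellation-bound. Real dual-GPU scaling is rarely perfect, so this is a best case for Hemlock rather than a measured result.

```python
GF100_VS_5870 = 1.60     # Nvidia's claimed Heaven (dragon statue) lead over HD5870
HEMLOCK_VS_5870 = 2.00   # assumption: perfect 2x scaling across Hemlock's two GPUs

# GF100 tessellation throughput expressed relative to Hemlock.
gf100_vs_hemlock = GF100_VS_5870 / HEMLOCK_VS_5870
print(f"GF100 relative to Hemlock: {gf100_vs_hemlock:.0%}")  # 80%
```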
Moving along, another interesting bit is that the GF100 has increased the number of ROPs (Rendering Output units) from 32 on G200/GTX280/GTX285 to 48. While this looks good on the surface, the raw count of ROPs does not take into account any gains in efficiency between the two architectures. The ratio of ROPs to shaders has gone down dramatically, which speaks volumes about the intended target market of the GF100 line.
On a slightly less positive note, the texture unit count has gone from 80 on the G200 line to 64 on the GF100 parts. Again, without numbers on the efficiency of the units, they are not necessarily comparable, but the smart money among insiders was on 128 texture units last spring. So instead of an improvement, this would seem to be a decrease in texturing performance from the current line to the upcoming release.
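The ROP and texture unit changes as ratios, using the article's unit counts plus the published 240-shader count for GTX280/GTX285, which the article itself does not state. These are raw counts only; as noted above, they say nothing about per-unit efficiency.

```python
# Unit counts for G200 (GTX280/285) and GF100.
# ROP and texture counts are from the article; G200's 240 shaders is the published GTX280 figure.
chips = {
    "G200":  {"shaders": 240, "rops": 32, "tmus": 80},
    "GF100": {"shaders": 512, "rops": 48, "tmus": 64},
}

for name, c in chips.items():
    print(f"{name}: {c['rops'] / c['shaders']:.3f} ROPs/shader, "
          f"{c['tmus'] / c['shaders']:.3f} texture units/shader")

rop_change = chips["GF100"]["rops"] / chips["G200"]["rops"] - 1  # +50% more ROPs
tmu_change = chips["GF100"]["tmus"] / chips["G200"]["tmus"] - 1  # 20% fewer texture units
print(f"ROPs {rop_change:+.0%}, texture units {tmu_change:+.0%}")
```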
Both cases point in the direction of Nvidia doing the bare minimum to keep GF100 in the graphics game. The GPU functions appear to be bolted on to a GPGPU oriented ‘generalist’ chip, something strongly supported by the die size to performance ratio.
Nvidia’s GF100 starts out with its back against the wall, and there is little room to grow from there. Nvidia cannot scale it up in die area, cannot scale it up in power use, and as it stands, the part is basically unmanufacturable from a balance sheet perspective. This would be acceptable if the competition were vastly slower, allowing Nvidia to charge a ‘sucker premium’, or if there were a ready market for them at $2,500 and up, but neither appears to be the case anymore.
The performance numbers so far, both in DX10 games and DX11 demos, show that at best it has a 60 percent lead on some very specific benchmarks over a much cheaper and more efficient HD5870. Given that GF100 is more than 60 percent larger than that part, 60 percent should be the minimum performance lead, not the maximum. The more direct comparison to an ATI 5970 was curiously neglected at CES, mainly because GF100 is about on par with that part, best case for Nvidia.
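The die-size argument as arithmetic: dividing the best-case 60 percent lead by the area ratio gives performance per mm² relative to Cypress. A value below 1.0 means GF100 delivers less performance per unit of silicon even in its best case. This sketch uses only the article's figures and ignores clocks, yields, and power.

```python
GF100_MM2 = 550.0
CYPRESS_MM2 = 334.0
BEST_CASE_LEAD = 1.60  # GF100 vs HD5870, best case per the article

area_ratio = GF100_MM2 / CYPRESS_MM2        # how much more silicon GF100 uses
perf_per_area = BEST_CASE_LEAD / area_ratio  # < 1.0 means worse perf per mm^2
print(f"area ratio {area_ratio:.2f}x, perf/mm^2 vs Cypress {perf_per_area:.2f}x")
```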
This caps the price Nvidia can charge for GF100 cards at the price of Hemlock, or less. Since GF100 silicon at 40 percent yields will cost about $125 for the raw unpackaged die, versus a consensus figure of less than $50 for ATI’s Cypress, ATI can likely make a dual card for less than Nvidia can make a single GF100. The 280W power draw means it is almost impossible for Nvidia to put out a dual card with all units active, leaving the slippery slope of fused-off shaders and downclocked parts to hold the fort.
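The article does not state a wafer price, but its own numbers imply one. This sketch backs it out from the $125 GF100 figure at 40 percent yield on 104 candidates, then asks what Cypress yield on a wafer of the same price would put a good Cypress at $50. Because the consensus is *less* than $50, that yield is a floor. The shared wafer price is an assumption.

```python
GF100_CANDIDATES = 104
CYPRESS_CANDIDATES = 160
GF100_YIELD = 0.40       # the reported yield the article doubts
GF100_DIE_COST = 125.0   # article's estimate for raw unpackaged silicon
CYPRESS_DIE_COST = 50.0  # article's "less than $50" consensus, used as an upper bound

# Wafer price implied by the GF100 figure: cost per good die * good dies per wafer.
implied_wafer_cost = GF100_DIE_COST * GF100_CANDIDATES * GF100_YIELD
print(f"implied wafer cost: ${implied_wafer_cost:,.0f}")  # $5,200

# Assumption: Cypress wafers cost the same. Yield needed to reach $50 per good die.
cypress_yield_floor = implied_wafer_cost / (CYPRESS_CANDIDATES * CYPRESS_DIE_COST)
print(f"Cypress yield of at least {cypress_yield_floor:.0%}")  # 65%

# Silicon for a dual-Cypress card vs a single GF100.
print(2 * CYPRESS_DIE_COST < GF100_DIE_COST)  # True
```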
Nvidia has been telling its AIBs (Add In Board makers) that the initial GF100 chips they receive are going to be massively cut down and downclocked, likely to the same 448 shaders and roughly the same clocks as the Fermi compute board. There will be a handful of 512-shader chips at ‘full clocks’ distributed to the press and for PR stunts, but yields will not support this as a real product.
To make matters worse, SemiAccurate’s sources in the Far East are saying that the current A3 silicon is ‘a mess’. Last spring, we were told that the chip was targeted for a top clock of 1500-1600MHz. The current silicon coming out of TSMC, defects aside, is not binning past 1400MHz in anything resembling quantity, with 1200MHz being the ‘volume’ bin. Even at 75 percent of intended clocks, the numbers of chips produced are not economically viable.
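The bins quoted above, expressed as a fraction of the 1500-1600MHz target. This is plain arithmetic on the sources' numbers; the "75 percent" figure is the 1200MHz volume bin measured against the top of the target range.

```python
TARGET_LOW, TARGET_HIGH = 1500, 1600  # MHz, spring target range per the article's sources
VOLUME_BIN, TOP_BIN = 1200, 1400      # MHz, current A3 bins per the article's sources

# Each bin as a share of the target, from the top of the range to the bottom.
for name, mhz in (("volume", VOLUME_BIN), ("top", TOP_BIN)):
    print(f"{name} bin: {mhz / TARGET_HIGH:.0%} to {mhz / TARGET_LOW:.0%} of target")
```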
One of the main causes here is that clocks are, according to one source, ‘all over the place’, and there is no single problem to fix on the chip. Intra-die variation is huge, no surprise given TSMC’s 40nm process and the roughly 550mm^2 that GF100 takes up. Each individual part tends to have different problems, and Nvidia’s unfinished homework did not help.
To compound this, the chip was designed to run at a low voltage with a relatively high amperage. Raising the voltage increases its power draw disproportionately compared to other chips, HD5870 included. If Nvidia increases the voltage, it will blow through the TDP limit and burn traces. If it doesn’t, the few SMs that have a ‘weak’ transistor or two won’t run at the lower voltages.
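A first-order illustration using the textbook dynamic-power relation P ≈ C·V²·f. It assumes the whole 280W scales like dynamic power at fixed clocks; real chips also leak, and leakage climbs steeply with voltage, so this likely understates the increase. The 5 and 10 percent voltage bumps are hypothetical.

```python
BASE_POWER_W = 280.0  # CES demo card draw, per the article
PCIE_MAX_W = 300.0

def scaled_power(base_w: float, v_scale: float, f_scale: float = 1.0) -> float:
    """Dynamic power ~ C * V^2 * f, with switched capacitance C unchanged."""
    return base_w * v_scale ** 2 * f_scale

for bump in (1.05, 1.10):  # hypothetical voltage increases at unchanged clocks
    p = scaled_power(BASE_POWER_W, bump)
    print(f"+{bump - 1:.0%} V -> {p:.0f} W, over PCIe cap: {p > PCIE_MAX_W}")
```

Even the smaller hypothetical bump pushes the estimate past 300W, which is the squeeze the paragraph above describes.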
Unfortunately for Nvidia, the architecture is badly designed. The finest granularity for fusing off units is a whole SM, which loses 32 shaders at a time. In its Fermi guise, the GF100 sells for $2,500-$4,000 as a Tesla board, and that is a downclocked 448-shader chip. It is quite telling that Nvidia is unable to cherry-pick enough fully working parts to supply even the meager volumes the Tesla line requires, especially in light of the margins on those parts.
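With 16 SMs of 32 shaders each, every fused-off SM costs 32 shaders, so the only configurations are 512, 480, 448, and so on. This sketch treats each SM as independently defect-free with probability `p` (a hypothetical parameter, not a reported figure) and computes the fraction of dies that could ship as a full 512-shader part versus a 448-shader part with up to two SMs disabled. It ignores defects outside the SMs and the clock and voltage binning the article describes, so it is an upper bound on usable dies.

```python
from math import comb

SMS = 16            # SMs on GF100, per the article
SHADERS_PER_SM = 32

def frac_with_at_most_bad(p_good: float, max_bad: int, n: int = SMS) -> float:
    """Binomial P(at most max_bad of n SMs are defective), SMs failing independently."""
    q = 1.0 - p_good
    return sum(comb(n, k) * q**k * p_good**(n - k) for k in range(max_bad + 1))

p = 0.95  # hypothetical per-SM probability of being defect-free
full_512 = frac_with_at_most_bad(p, 0)     # all 16 SMs good
salvage_448 = frac_with_at_most_bad(p, 2)  # up to 2 SMs fused off
print(f"512-shader dies: {full_512:.0%}, 448-shader-capable dies: {salvage_448:.0%}")
print((SMS - 2) * SHADERS_PER_SM)  # 448
```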
Nvidia has promised AIBs chips in late February, so a March release seems feasible. The AIBs were cautioned at CES that they would receive only low quantities of the fused-off parts, and few, if any, of the ‘full’ 512-shader parts. If you are waiting in line for the chips Nvidia showed off or the chips that the press will be given, it will be a very long wait, but you will have lots of company.
GF100 as it stands is exactly what we were told last spring. It is too hot, too big, too compute-focused rather than graphics-focused, and economically unmanufacturable. Nvidia is more than capable of winning cherry-picked benchmarks against a much cheaper card, but it is eerily quiet against the comparably priced HD5970. Quite the green ‘Laughabee’.
Look for a lot of FUD coming out of Santa Clara over the next few weeks. Nvidia has nothing to sell but is desperate to keep ATI from booming in the GPU market. ATI, on the other hand, has multiple lines of DX11 parts on the market in all but the lowest price tier, and parts for that tier will follow in very short order.
Until it can ‘launch’ parts in quantities barely above PR-stunt levels, all Nvidia can do is spin. In the meantime, ATI is fast approaching the six-month window traditionally needed to launch derivative parts, and will likely have its next full generation finished before any GF100 derivatives launch, if they ever make financial sense at all. When you don’t have product, you spin, and Nvidia is putting most ice skaters to shame with its current hot-shoe dance. S|A
By Charlie Demerjian