If you look at the HPC and related accelerator market, it has been dominated by GPUs, with about 90% of unit sales going to Nvidia and 10% to AMD. The other players died long ago, and with barely two players in the market, they can jack prices up to abusive levels. Strangely enough, they do. The same GPU you can get off the shelf for $300 suddenly costs $3000 once it is in ‘professional’ form. The difference is a few blown fuses and drivers that are made to not run on the cheaper version. Actually there is one other difference: the consumer cards are usually significantly faster than their more expensive brethren.
If this sounds like a market ripe for a third player, you would be quite right, but there are some significant barriers to entry. The software stack associated with the card is a significant part of the product; the drivers, compilers, and general surrounding ecosystem are probably the most important piece of the puzzle. GPUs are unquestionably hard to program for, not just because of the new languages that are required, but more importantly because of the completely new paradigm involved in how you architect code for them.
This means one thing: the talent pool that can write decent GPGPU code is both small and shallow, and the people who can do it well are few and far between. Worse yet, they are expensive, very very expensive, and they know it. The normal GPGPU coding process involves hiring people for way more than you want to spend, then paying them copious amounts to climb a steep and always changing learning curve. The end result is usually mediocre code that more often than not misses the vendor’s promises by an order of magnitude. More often than not, the projects don’t come close to their goals if they don’t outright fail.
This is why the software stack is of deadly importance. If a top GPGPU coder makes $150K or so, let’s just round up to $200K with benefits, toys, and incentives, that is a lot of money. At that rate, developing a code base with a team of ten coders over a year costs about $2 million, usually dwarfing the cost of the hardware. The safe route is to use vanilla x86 servers, coders that are familiar with them, and proven tools to do the job. That would be why GPGPUs sell mainly to a very small subset of the market that needs the performance at all costs, and is willing to spend whatever it takes to get the job done.
Even for this crowd, some common sense does apply, and they do actually evaluate products before sending over dump trucks full of cash. If it works as promised, the trucks roll out, but rest assured, they know what they are doing and more importantly why. If your total solution does not work, the underlying hardware does not matter much. Imagine a safe full of valuables that you don’t have the combination for and you are on the right track.
This is why software is the most important piece of the puzzle. If your hardware is 10x faster than the competition, but you can only get code to work well enough to extract a quarter of that speed, you are still ahead of the game. If extracting that 25% costs you more in coders and time than the speedup is worth, your solution quickly becomes worthless in spite of the superior performance on paper.
For some odd reason, a few vendors seem oblivious to the quality of their professional drivers and software ecosystem, AMD being a great example of this. To make matters worse, some of those that grasp the concept use software as a lock in tool or worse yet a weapon. Please note that the weapon tends to be pointed at the customer more than the competition, and the lock in gets worse and worse the more you use the solutions offered. For some reason, customers are not fond of this situation, but many don’t have a choice.
From there, the hardware performance comes into play, but it is secondary to a metric we will call usable performance. The idea is summed up best by (Peak performance) * (what you can extract), and that is what you end up getting as a result of your efforts. As you can tell, tools play a key role here even before the cost of coding, TCO, and related bits are accounted for. More is more, but more is not any better than less if you can’t code for the chip without a team of PhDs and a few years of research.
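As a rough illustration of the usable performance idea, here is a minimal sketch. The function name and every number in it are made up for the example; none of them are measurements of any real part.

```python
def usable_performance(peak, extractable_fraction):
    """Scale peak performance by the fraction real code manages to extract."""
    if not 0.0 <= extractable_fraction <= 1.0:
        raise ValueError("extractable_fraction must be between 0 and 1")
    return peak * extractable_fraction


# Hypothetical figures, in arbitrary units relative to a baseline of 1.0.
# A part that is 10x faster on paper, with code extracting a quarter of it.
fast_but_hard = usable_performance(peak=10.0, extractable_fraction=0.25)

# A baseline part with mature tools, assumed to reach 90% of its peak.
slow_but_easy = usable_performance(peak=1.0, extractable_fraction=0.9)

print(fast_but_hard, slow_but_easy)
```

On these invented numbers the faster part still comes out ahead, which is the point about a quarter of 10x. The picture only flips once the cost of getting to that quarter is counted.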
Just to make matters more painful for the customer, each GPU generation radically changes the underlying hardware. Advances in features are generally welcome, but that means the older code becomes non-optimized for the new toys or it simply doesn’t work at all. Nvidia claims that their Cuda VM will account for the changes and optimize code on the fly, but SemiAccurate’s contacts say that it only works out well on paper; the real world is a bit different. A rework of the code and a recompile is necessary for optimal use of new hardware, but that is easier than direct coding on the metal. That said, even this half step is still something you would not wish on an enemy, much less a team you are paying 6+ digits a year per head.
In the end, GPGPUs are a work in progress. That is the smiley happy term for the English term, “Bloody mess”. The landscape is littered with broken tools, grand promises, failed projects, changing paradigms, and the occasional success story. Lots of money has been spent, but most of that is not real. For every card that is sold, more often than not there is a backchannel gift that more than makes up for the monetary outlay, especially in the big supercomputer space and halo projects. One large national lab denizen once described ‘buying’ one such specialized supercomputer for millions of dollars as a, “Cash flow positive event” for his organization.
The net result is a lot of smoke and mirrors for the money in and the meagre results out. There are few examples of things working as expected, and even fewer examples of unbiased results, but they do exist. Smart customers evaluate everything that is pitched to them, and test until they are sure the products match their expectations. This takes time, money, and discipline that few companies can afford. The net result is very low GPGPU uptake for the mainstream corporate customer. Some vendors will claim large profits from their products, but the accounting for costs seems a bit specious.
Getting back to the topic at hand, we have Intel’s new Xeon Phi accelerator. It was originally a project called Larrabee that Intel intended to be a GPU based on x86. That goal, umm, didn’t work out exactly as intended, and the project was reoriented to be a compute accelerator. The delays meant the first generation was DOA, mainly used as a software development tool to work the bugs out. The bugs were more Intel’s than the customers’, and the third generation part showed a marked improvement in most areas. That part is what was released yesterday as Knights Corner/MIC/Xeon Phi.
The biggest drawback to its use as a GPU was the layout, basically a shared nothing cluster of x86 cores with a massively wide vector engine bolted to each one. It sort of makes sense if you think about it, but the work involved with chaining Seymour Cray’s proverbial 1024 chickens to a plow comes to mind in making Phi do GPU work. The architecture is unquestionably more flexible than a GPU, but it also takes more effort to focus on that task than purpose built hardware does. That same drawback, however, makes it a much better compute accelerator than a GPU. There are times when the chickens win.
Intel wisely made Phi look like a dirt standard rack of x86 servers to the software. They can talk TCP/IP over PCIe, so the communications protocols are what you would expect. In general, the chip is purpose built to look like an MPI cluster to existing code. In case you don’t follow parallel programming APIs closely, MPI is by far the most widely known and used way to program parallel code. Everyone knows it, and everyone who has done parallel programming is at least familiar with the paradigm.
Phi also runs Linux, a mildly tweaked but vanilla kernel that Intel claims will be merged into the main kernel line in short order. You can SSH into it, run code on it, and do everything you want just like a standard HPC box running Linux. Since Microsoft has about zero marketshare in the HPC and supercomputer space, this is a welcome and familiar way to do things for everyone involved.
Tools are another win for Phi, everything supports x86 CPUs, and that code simply runs on Phi without so much as a line of code changed. You literally don’t need to do anything to get the code to work, and if it is thread aware it will perform fairly well out of the box. You can optimize quite a bit from there, but basic functionality and extracting a large percentage of the performance from a card is a matter of hours or days, not months of pain. The usable performance of Phi is laughably higher than any GPU out there, it isn’t even a close race. No, it isn’t even a race, it is a clean kill.
So Intel has solved the biggest problem facing the entire HPC accelerator and GPGPU space with one part, and the code just works. The tools that you use are the ones you have now, everything supports x86. The coding paradigms are also what you use now, if you were targeting a rack of x86 machines, you have already targeted a Phi card. This isn’t to say the optimization process is non-existent, it is just much easier. You start out at a better point, have unquestionably superior tools, a greater variety of them, and a large pool of coders to hire from.
The difference between Phi and GPGPU is astounding. The hardware is a bit light on raw performance, barely over a TeraFLOP DP while the competition is notably higher. SP FP is a far more lopsided win for the GPU set, they all will crunch multiple times what a Phi can. That said, for a given amount of programmer hours, it would be surprising if you didn’t get a better result from a cluster of servers with the Intel cards plugged in than any competing GPU based solution.
Performance per watt is a similar loss for Intel on paper. Phi is a 225W part, similar to the Nvidia K20 and AMD FirePro S9000, and loses to both on raw performance. If you substitute usable performance for raw peak performance, Intel wins at DP but loses on SP due to the sheer speed advantage of the GPUs.
Intel also pulled a very interesting play for cost: the big 5110P card only costs $2649, hundreds of dollars cheaper than the GPUs it competes against. The workstation oriented 3100 line that launches in 1H/2013 is slated to cost “under $2000”, or vastly less than its GPU based competitors. If you look at TCO for a given code base, let’s call it cost to program plus cost of hardware, Phi is a clean kill for Intel. GPGPUs don’t stand a chance. Unless you need the GPU functionality of the devices, there is no reason to buy one of the more expensive models.
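To put the cost to program plus cost of hardware definition in concrete terms, here is a back-of-the-envelope sketch. The $200K per coder, the ten coder team, and the $2649 card price come from the figures above; the 100 card deployment and the function name are assumptions picked purely for illustration.

```python
def total_cost(coders, years, cost_per_coder_year, cards, price_per_card):
    """TCO as used here: cost to program plus cost of hardware."""
    programming = coders * years * cost_per_coder_year
    hardware = cards * price_per_card
    return programming, hardware, programming + hardware


# $200K per coder-year and a ten coder, one year project are the article's
# round numbers. The 100 card count is a hypothetical deployment size.
programming, hardware, tco = total_cost(
    coders=10,
    years=1,
    cost_per_coder_year=200_000,
    cards=100,
    price_per_card=2_649,
)

programming_share = programming / tco
print(f"programming ${programming:,}, hardware ${hardware:,}, total ${tco:,}")
print(f"programming is {programming_share:.0%} of the total")
```

Even with a hundred cards in this made-up deployment, the coders are the bulk of the bill, so anything that cuts programmer hours moves the total far more than a few hundred dollars off the card price.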
In fact, if you buy the same GPU in non-professional guise, you can get the majority of the visualization functionality for 10% of the cost. Even with the cost of a Phi added in, you still save money with the Intel solution. This part simply obsoletes the whole GPGPU paradigm and relegates it to generating pretty pictures, not to serious compute work. Yes a GPU can post higher peak performance numbers, but good luck getting it to work like that on your code. Don’t take my word for it, try it for yourself if you can, but don’t blame me for the resulting frustration and hair loss.
That brings us to the market for GPGPUs, specifically Nvidia’s GPGPUs. The company makes a large percentage of their profits, said to be about 30%, from a disproportionately small number of GPGPU unit sales, said to be about 5%. This market is rife with not-quite success stories, lots of money spent, and many burnt fingers attached to the hands of early adopters. The small numerical adoption rate is a testament to the barriers to entry in the mainstream. The GPGPU market has not taken off, and is only shown to be self-sustaining when using some very questionable math.
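To see how lopsided those shares are, here is a quick sketch that turns them into profit per unit. It takes the "said to be" 30% and 5% figures at face value, and the function name is made up for the example.

```python
def relative_profit_per_unit(profit_share_pct, unit_share_pct):
    """Profit per unit in a segment, relative to profit per unit elsewhere."""
    if not (0 < profit_share_pct < 100 and 0 < unit_share_pct < 100):
        raise ValueError("shares must be strictly between 0 and 100")
    segment = profit_share_pct / unit_share_pct
    rest = (100 - profit_share_pct) / (100 - unit_share_pct)
    return segment / rest


# About 30% of profits from about 5% of units, per the figures quoted above.
ratio = relative_profit_per_unit(profit_share_pct=30, unit_share_pct=5)
print(f"each GPGPU unit carries about {ratio:.1f}x the profit of other units")
```

If those shares are anywhere near right, each GPGPU unit carries roughly eight times the profit of the rest of the line, which is why losing even a small number of them hurts.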
So in comes Intel with Phi. The barriers to entry there are not zero for a potential Phi adopter, but they are unquestionably far lower than anything Nvidia can realistically claim. Coding costs are cheaper, hardware is cheaper, tools selection is wider and deeper, as is the talent familiar with them. To be blunt, Intel is going to wipe the floor with Nvidia in every aspect related to code and coding costs.
With Phi on the market, expect new projects that choose Cuda and Nvidia hardware to wither very quickly. Projects that are already heavily invested in the Nvidia solutions will be unlikely to drop them cold, but if they have maintained an x86 code base, Phi’s overwhelmingly lower cost of code maintenance, updates, and optimization for the next generation may very well win the day. The difference is really that extreme when you run the numbers with any realistic programmer costs factored in.
The Intel Xeon Phi 5110P card will absolutely devastate the market for Tesla products, and the upcoming 3100 series will do the same for much of the Quadro line. There are some markets that need the GPU’s graphics functionality, but the majority of the rest will disappear with frightening rapidity once the 3100s arrive. With them goes a disproportionate percentage of Nvidia’s profits for their GPUs, both the professional versions as well as the total line. Given the small number of unit sales in this market, Intel doesn’t have to sell many to cause some fairly acute pain to Nvidia’s bottom line.
With the cost of GPGPU coding coming down at a snail’s pace, and lock-in being prioritized over customer benefits, the viability of the whole market is now in question. From SemiAccurate’s point of view it is already dead, but may have a long tail as those locked in struggle to break free. Anyone trying to push a proprietary solution at this point is in full panic mode right now; several backchannel signs removed any doubt earlier this week.
Phi is having immediate effects on the market that are quite visible to any onlooker too. The high end Nvidia K20 cards announced today have not been priced officially, which in itself is strange, but word on the street is that they are priced 20% less than expected. If that isn’t a red flag that the market is in trouble and the players know it, I don’t know what is. The end times for GPGPU are here. Pity its purveyors, Intel doesn’t take prisoners.