NVIDIA JUST HAD one of their most sleazy marketing tactics exposed, that PhysX is faster on a GPU than a CPU. As David Kanter at Real World Tech proves, the only reason that PhysX is faster on a GPU is because Nvidia purposely hobbles it on the CPU. If they didn’t, PhysX would run faster on a modern CPU.
The article itself can be found here, and be forewarned, it is highly technical. In it, Kanter watched the execution of two PhysX enabled programs, a game/tech demo called Cryostasis, and an Nvidia program called PhysX Soft Body Demo. Both use PhysX, and are heavily promoted by Nvidia to ‘prove’ how much better their GPUs are.
The rationale behind using PhysX in this way is that Nvidia artificially blocks any other GPU from using PhysX, going so far as to disable the functionality on their own GPUs if an ATI GPU is simply present in the system but completely unused. The only way to compare is to use PhysX on the CPU, and compare it to the Nvidia GPU version.
If you can imagine the coincidence, it runs really well on Nvidia cards, but chokes if there is an ATI card in the system. Frame rates tend to go from more than 50 to the single digits even when you have an overclocked i7 and an ATI HD5970. Since this setup is vastly faster than an i7 and a GTX480 in almost every objective test, you might suspect foul play if the inclusion of PhysX drops performance by an order of magnitude. As Real World Tech proved, those suspicions would be absolutely correct.
How do they do it? It is easy, a combination of optimization for the GPU and de-optimization for the CPU. Nvidia has long claimed a 2-4x advantage for GPU physics, using their own PhysX APIs, over anything a CPU can do, no matter what it is or how many there are. And they can back it up with hard benchmarks, but only ones where the Nvidia API is used. For the sake of argument, lets assume that the PhysX implementations are indeed 4x faster on an Nvidia GPU than on the fastest quad core Intel iSomethingMeaningless.
If you look at Page 3 of the article, you will see the code traces of two PhysX using programs. There is one thing you should pay attention to, PhysX uses x87 for FP math almost exclusively, not SSE. For those not versed in processor arcana, Intel introduced SSE with the Pentium 3, a 450MHz CPU that debuted in February of 1999. Every Intel processor since has had SSE. The Pentium 4 which debuted in November of 2000 had SSE2, and the later variants had SSE3. How many gamers use a CPU slower than 450MHz?
Of the SSE variants, the one that matters here is SSE, but SSE2 could also be quite relevant. In any case, Intel hasn’t introduced a CPU without SSE or SSE2 in almost a decade, 9 years and a few days short of 8 months to be precise. For the sake of brevity, we will lump SSE, SSE2, and later revisions in to one basket called SSE.
AMD had a similar API called 3DNow!, but the mainstream K8/Athlon64/Opteron lines had full SSE and SSE2 support since May of 2004. Some variants of the K7 had SSE with a different name, 3DNow! Professional, for years prior to that.
Basically, anything that runs at 1GHz or faster has SSE, even the Atom variants aimed at phones and widgets supports full SSE/SSE2. Nothing on the market, and nothing that was on the market for years prior to the founding of Ageia, the originator of PhysX that Nvidia later bought, lacked SSE.
To make matters worse, x87, the ‘old way’ of doing FP math, has been deprecated by Intel, AMD and Microsoft. x64 extensions write it off the spec, although you still can make it work if you are determined to, but it won’t necessarily be there in the future. If you don’t have a damn good reason to use it, you really should avoid it.
What’s more, x87 is vastly slower than using SSE. x87 is stack based, meaning that to do an operation with x87, you need to push things to the stack, use instructions like FXCH to manipulate it, and push a lot to memory needlessly. Simply using the equivalent SSE instruction instead of x87 will net you about 20% more speed. You can design a pathalogical case where SSE is slower than x87, but you would have to go out of your way to make it happen. I am pretty sure Nvidia will demo this kind of ‘valid benchmark’ in the near future, a purposefully designed pathological case that proves their point. In the real world, multiple game developers, assembly experts, and chip designers, spoken to for this article can’t think of a situation where SSE is slower.
As Real World Tech pointed out, the Ageia PhysX chip used 32 bit math, and the now Nvidia PhysX programs likely do as well. The code runs on G80 based GPUs, and they did not have DP FP capabilities. This means that you can pack 4 of those data points into a 128 bit number.
Why is this important? SSE has scalar and vector variants for instructions. Scalar basically means one piece of data per instruction, and that is where you get the ~20% speedup over x87. Vector allows you to do the math on 1 128 bit instruction, 2 64 bit instructions, or 4 32 bit instructions simultaneously. Since PhysX uses 32 bit numbers, you can do 4 of them in one SSE instruction, so four per clock. Plus 20%. Lets be nice to Nvidia and assume only a 4x speedup with the use of vector SSE.
What does this mean? Well, to not use SSE on any modern compiler, you have to explicitly tell the compiler to avoid it. The fact that it has been in every Intel chip released for a decade means it is assumed everywhere. Nvidia had to go out of their way to make it x87 only, and that wasn’t by accident, it could not have been.
If they didn’t go out of their way, the 2-4x speed increase by using a GPU for PhysX would be somewhere between half as fast and about equal to a modern CPU. Even if you use numbers that are generous to Nvidia, PhysX would be SLOWER ON THE GPU IN EVERY CASE. To top it off, there is NO technical reason for Nvidia to use x87, SSE is faster in every case we could find.
But it gets worse. Far worse. The two programs that Real World Tech looked at, and others looked at by SemiAccurate, are single theaded. That means the CPU can only run PhysX on one core at a time. Multi-threading an API like PhysX is hard work, but Nvidia has already done that.
GPUs have lots of ‘cores’, the GTX285 for example has 240 of them, ATI’s HD5870 has 1600, and Northern Islands has…nah, that would be telling. Without quibbling over the definition of ‘core’, we will just take Nvidia’s statement at face value and assume 240 cores. If they can get 4x the performance of a single Intel core with 240 of their cores, each Nvidia core is worth 1/60th of an Intel core, or less. If the PhysX code didn’t thread well, and it does, GPU physics would run slower than a dishwasher controller on heavy painkillers.
So the PhysX code is threaded when run on the GPU, but not on the CPU. On consoles, the XBox360 and PS3 specifically, the code is threaded just fine. (Note: The Wii only has a single core without any kind of SMT, so threading won’t help that playform) None of the consoles have a CUDA capable GPU, something that Nvidia claims is necessary for GPU physics. The PS3 uses a variant of the G70 or G71 for it’s GPU, the first Nvidia product that supported PhysX is the G80.
All the consoles run PhysX just fine, and the frame rate doesn’t suffer the same order of magnitude performance decrease as a CPU would. Why? Because Nvidia allowed the code running on console CPUs to use multiple threads to do the work, and ported it to AltiVec, the PowerPC vector instruction set. With that, a console that barely has the power of a low end P4, will run PhysX, the game, and everything else just fine. Gosh, what might that infer?
Most modern games fully use less than 2 CPU cores, and most gaming PCs now have 4 cores, the newest ones have 6. Nvidia will not allow PhysX to run on the other 2-4 cores that are basically idle when gaming. If they only allowed a second thread to run PhysX, you would double the speed at a minimum.
Since everything is in one thread for the programs that Real World Tech looked at, PhysX isn’t even fully utilizing a single core, so adding more threads would almost assuredly mean far more than a 2x speedup. On a 4 core CPU, you could easily get a 4x speedup from even basic threading, far less than Nvidia has done for consoles.
The problem is that the 4x speedup from threading would once again erase the 2-4x ‘advantage’ from running PhysX on the GPU. Threading would once again relegate GPU PhysX to somewhere between half as fast and barely equal to a modern Intel CPU. See the problem? To ‘fix’ this ‘problem’, Nvidia won’t thread PhysX on the PC CPU, something that they do on every other platform the API is available for.
We are told that Nvidia claims that threading Physx on the CPU is not their problem, it is up to the game developers to implement. Only on the PC though, for the rest they are happy to make the effort. Like the “no one wants SSE, game developers clamor for x87 code” line of bull they spew, this is nothing more than plausible deniability for the technically unaware. Then again, after years of hype, the number of games released that use PhysX on the GPU for anything more than trivial eye candy can be counted on one hand. Make of that what you will.
Imagine if instead of purposefully de-optimizing PhysX for the CPU, Nvidia instead just did what they do for every other platform, IE not restrict the instruction set use for PR purposes and thread the code. On a modern 4 core CPU, you would get 4x speed increase from SSE, and a 4x increase from threading. Math says that would get you a 16x increase in speed, more than the decrease that you see going from GPU PhysX to CPU PhysX today.
The 2-4x advantage that Nvidia claims for the GPU is only when they hobble the CPU. If they didn’t, the CPU would have a 4-8x performance advantage on Nvidia’s own API. Havok and Bullet physics APIs seem to do just fine, better than PhysX actually, when running on the CPU. For some unknown reason, it is only the physics API by the GPU-only vendor that has problems on modern CPUs. Anyone have a clue why this is the case?
To take this a step farther, if you de-optimized the GPU version of PhysX in the same way that Nvidia does to the CPU version, imagine what would happen? To start with, on a GTX285, executing one instruction per clock would mean going from a ’2-4x advantage’ over the CPU to a 60-120x disadvantage over de-optimized CPU code. With the simple threading and SSE optimizations above, the CPU would run it 960-1920x faster than single threaded GPU code. Even a lowly Atom CPU would probably be 100x faster than single threaded GPU PhysX code. If you take away vectorization as well, the GPU performance drops yet farther.
In the end, there is one thing that is unquestionably clear, if you remove the de-optimizations that Nvidia inflicts only on the PC CPU version of PhysX, the GPU version would unquestionably be slower than a modern CPU. If you de-optimize the GPU version in the same way that Nvidia hobbles the CPU code, it would likely be 1000x slower than the CPU. If Nvidia didn’t cripple CPU PhysX, they would lose to the CPU every time.
One thing you can be sure about, Nvidia will react to the Real World Tech article with FUD and tame attack sites. The official drums about no developer wanting SSE and how threading is up to game developers will have a few more technically devoid talking points added, and Nvidia innocence will be proclaimed. It doesn’t matter, the GPU is the wrong thing to run physics on, it is slower than the CPU at that task. Period. This won’t stop Nvidia from saying the exact opposite though, facts don’t seem to get in the way of their company’s PR statements.
If Nvidia wants to prove that PhysX is actually faster on the GPU, I will offer them a fair test. Give me the code tree for PhysX and the related DLL, and I will have them re-compiled for GPUs and then optimized the CPU version with some minor threading and vectorized SSE. Then I will run the released PHysX supporting games on both DLLs as a benchmark. How about it guys? If your PR claims are anything close to true, what do you have to lose?S|A
Latest posts by Charlie Demerjian (see all)
- ST brings Java, widgets, and code generation to the STM32 line - Dec 11, 2013
- LSI shows off 28nm 28Gbps SerDes for 32Gb FC and 100GbE - Dec 10, 2013
- LSI extends Syncro to four nodes and beyond - Dec 10, 2013
- Tablet OSes killed Windows 8 and Microsoft with it - Dec 9, 2013
- ARM makes a learning remote that will never need batteries - Dec 5, 2013