INTEL IS FINALLY starting to talk about the upcoming Poulson chip, the first real update to the Itanium architecture in years. While the chip may be huge and technically complex, it is hard to see anyone other than die-hard HP big iron shops actually caring.
For those of you not following the Itanium market closely, Tukwila, the last of the old generation, finally shipped in Q1 of 2010. It brought one big advance to the table, CSI/QPI, meaning a mostly common platform with Xeons. While the platform might be common, that does not mean compatible, and certainly not interchangeable. It will save a lot on parts sourcing and system design though.
Itanium is meant for big iron, mainly mainframe and mission critical computing capabilities, think big and very expensive boxes. It has taken that market by storm, and by that market, I mean replaced HP’s in-house chips that were discontinued for the customers not smart enough to flee lock-in. The rest of the world seems to have ignored the line, with the few vendors that have tried to use it backpedaling in a hurry.
The newest chip, Poulson, seems to carry on several Itanium themes while shedding one big one. Like its predecessors, Poulson is huge, 588mm^2 including 8 of the new cores. Unlike pretty much every other Itanium though, this one is not on a trailing edge process. It is on the current state of the art 32nm node, two up from the older 65nm Tukwila/9300.
Update: Intel says that the actual die size is 544mm^2. On the call, it was said to be 588mm^2 as others have also reported.
Yet another Itanium theme that Poulson uses to its advantage is the familiar huge caches. Poulson has a rather astounding 54MB of on-die caches, spread over many different types. This allows the chip to support tons of threads without stalling the in-order architecture. If you want to succeed in the markets that Poulson is aimed at, you unquestionably need massive caches like it has.
Intel was very non-specific about what the caches were, but a few bits came out. First is that there are two large directory caches, quite likely one big logical cache, that are each nearly the size of a core. These are very likely multiple megabytes in size, and necessary for the core counts Poulson systems will scale to. It was also mentioned, very confusingly, that there is a 256K directory and 512K data cache, but it was not clarified whether that is per core or per chip. It is almost assuredly per core, but that is not guaranteed. The L3 cache was unspecified but is undoubtedly tens of megabytes.
A single Poulson Core
Poulson consists of 3.1 Billion transistors, has new pipelines, caches and everything else you would expect from a new architecture. Intel went out of their way to point out that the Int and FP pipes are all new, something you would expect from a change like this.
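For a sense of scale, the transistor count and the two die sizes quoted for Poulson give a rough average density. The Python below is only back-of-the-envelope arithmetic on those published numbers. The names are ours, and the result is a whole-die average that says nothing about how dense the logic is compared to the caches.

```python
# Rough average transistor density from the figures quoted in this article.
TRANSISTORS = 3.1e9

# Two die sizes were reported: Intel's corrected figure and the one from the call.
DIE_AREAS_MM2 = {
    "Intel's stated figure": 544.0,
    "figure given on the call": 588.0,
}


def density_millions_per_mm2(transistors, area_mm2):
    """Average density in millions of transistors per square millimeter."""
    if area_mm2 <= 0:
        raise ValueError("area_mm2 must be positive")
    return transistors / area_mm2 / 1e6


densities = {
    label: density_millions_per_mm2(TRANSISTORS, area)
    for label, area in DIE_AREAS_MM2.items()
}

for label, value in densities.items():
    print(f"{label}: {value:.2f}M transistors/mm^2")
```

Either way you slice it, the average lands a bit above 5 million transistors per square millimeter.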
The most dramatic difference was Intel changing the instruction bundle width from 6 to 12 wide. If you remember, the promise of Itanium was that it would be a relatively simple, don’t say stupid or the three fanbois will get petulant, chip that pawned off work to the compiler. The idea was that the compiler smarts would not have to be replicated in silicon leading to simpler and more efficient hardware. To say it failed miserably would be an understatement, but a few diehards keep the embers of hope smoking if not glowing.
One big problem is that all Itanium chips to date expect bundles of 6 instructions in a group, and they are executed concurrently. If there are not six instructions available that can be run in parallel, that slot goes to waste. Intel is trading Out of Order complexity for width in the hope that software will catch up someday. This leaves the software very dependent on being compiled to exactly match the hardware, a constant that till now was fairly universal.
When Poulson expanded the instruction bundle width from 6 to 12, they basically broke all the existing software optimizations. If you don’t recompile, you are wasting half of your capabilities. The other possibility is that the type of SMT in Poulson will take advantage of this and allow for two threads to be bundled in hardware, leaving the older six instruction bundles workable. We think this is the much more likely scenario, especially since the Intel presenters steadfastly refused to discuss any SMT details. Upping the thread count on SMT in Poulson seems very likely.
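To put rough numbers on the recompile problem, here is a minimal Python sketch. It is a toy model, not a description of how Poulson actually issues instructions. It assumes that code scheduled for 6-wide hardware fills at most 6 slots per cycle, and that a hypothetical pairing scheme packs two threads side by side. The function names and the pairing model are our assumptions, since Intel disclosed nothing about SMT.

```python
def slot_utilization(issue_width, filled_per_cycle):
    """Fraction of issue slots doing useful work in a cycle."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    # A core cannot issue more instructions than it has slots.
    used = min(filled_per_cycle, issue_width)
    return used / issue_width


def paired_utilization(issue_width, per_thread_fill, threads):
    """Assumed model: hardware packs each thread's group next to the others."""
    return slot_utilization(issue_width, per_thread_fill * threads)


# Old binaries scheduled for 6-wide hardware, run unchanged on a 12-wide core.
legacy = slot_utilization(12, 6)

# The same binaries with two threads packed together (hypothetical pairing).
paired = paired_utilization(12, 6, 2)

print(f"legacy code on a 12-wide core: {legacy:.0%}")
print(f"two threads paired:            {paired:.0%}")
```

Under these best-case assumptions the unrecompiled code uses half the slots and the paired case uses all of them. Real code rarely fills every slot even on the hardware it was compiled for, so both numbers are upper bounds.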
The full die, caches, cores and more
While Intel would not give any details on specifics, the arrangement of the chip is so close to Beckton that it is eerie. This means that Poulson has a ring bus just like Beckton, Westmere-EX, Larrabee/Knights-term-of-the-moment, and Sandy Bridge. This isn’t much of a shock, but it will be interesting to see how wide it is. A ring also implies a segmented L3 cache, although how well that added latency plays with an in-order architecture is yet to be disclosed.
Shrinks vs Redesign power use
Intel moved Poulson from a non-HKMG 65nm process to a 32nm 2nd gen HKMG process, and claims huge power benefits. On top of that, they did a ton of work to cut power use even more, bringing the chip up to modern standards, if not beyond.
Both leakage and dynamic power were addressed. With that many transistors, there isn’t any real choice but to be aggressive on both sides. DIMM clock gating was also mentioned prominently, something that will make a big difference since Itanium systems use memory in quantities most people tend to think of hard drives in.
Last, Intel claims “accurate power monitoring”, which while probably, err, accurate, kind of makes you wonder what the previous Itaniums had. Perhaps this is why most companies deploying Itanium machines tend to have 24/7 soothsayers on staff. Contrary to popular, err, belief, this is not mandatory for all HP products, just a damn good idea.
At the risk of sounding like a broken record, Intel would not talk about power or TDP of the new chips. SemiAccurate has long heard that Intel was going to ignore their TDP bins and go after IBM’s Power with an ‘unlimited’ power Itanium, 180+W was mentioned. It would not surprise us to see Poulson redefining what top TDP is for Intel, but that would probably not fly with existing board architectures.
Speaking of which, Intel was rather coy about the new chip’s I/O. Since it is socket compatible with the older Tukwila/9300 parts, that gives you most of the answer right there. Intel claims a 33% increase in bandwidth for QPI in the slides, but gave conflicting numbers, 4GT/s and a rise from 4.8 to 7GT/s, in their call last week. Whatever the number ends up being, the platform was specced for this before Tukwila launched, so it should be a smooth upgrade.
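The QPI numbers are easy to check against each other. The short Python sketch below only does the arithmetic on the figures quoted above, and the variable names are ours. It assumes the 33% claim applies to the 4.8 transfer rate mentioned on the call, which Intel did not confirm.

```python
def percent_increase(old, new):
    """Percentage change going from old to new."""
    if old <= 0:
        raise ValueError("old must be positive")
    return (new - old) / old * 100.0


BASE_RATE = 4.8       # starting QPI rate quoted on the call
CLAIMED_GAIN = 33.0   # percent increase claimed in the slides
QUOTED_RATE = 7.0     # upper figure quoted on the call

# Rate implied by applying the slide's 33% to the 4.8 starting point.
implied_rate = BASE_RATE * (1 + CLAIMED_GAIN / 100.0)

# Gain implied by the 4.8 to 7 figures given on the call.
quoted_gain = percent_increase(BASE_RATE, QUOTED_RATE)

print(f"33% over {BASE_RATE} gives about {implied_rate:.2f}")
print(f"{BASE_RATE} to {QUOTED_RATE} is about {quoted_gain:.1f}% higher")
```

A 33% bump over 4.8 works out to roughly 6.4, while going from 4.8 to 7 would be closer to a 46% increase, so the slides and the call do not agree with each other.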
Why should you care about this chip, or the Itanium line in general? The short answer is that you shouldn’t unless you fall into the one class that needs, or thinks it needs, silly levels of RAS. Intel put lots of time into talking about its wonderful RAS features, and how it can detect, correct, avoid and map out just about anything that humanity is ever likely to encounter.
This includes design choices on the transistor level, different logic chosen with RAS in mind, and lots of parity. Caches and pipelines are covered with ECC or better too, and the firmware has a much longer list of scenarios to deal with. On top of that, Poulson is said to have silly levels of logging available.
On the surface, this sounds all fine and dandy, but there are two really big problems. First is that Intel refused to talk any specifics, just vague generalizations. That usually means that the details aren’t all that impressive when you look closely, but given this chip is meant for environments where high RAS levels are a mandatory item, it rings far less hollow than an average Nvidia PR claim.
More troubling is Poulson’s RAS levels with respect to Xeons. Intel’s biggest competitor for Itanium is Xeon, and if you look at performance, x86 wipes the floor with Itanium. The two things the bigger chips could do to differentiate were RAS and scalability. Scalability became much less of a concern with the Xeon-EX line, and their performance only widened the raw horsepower gap. This left RAS as the only feature, unless you count lock-in, that made Itanium stand out. On everything else, it lost to the cheaper and faster Xeons.
That is why some statements from the Intel spokespeople last week were so troubling. One journalist asked about the differences between Itanium and Xeon RAS levels. The answer was vague, that there was not a substantial gap in mission critical features, just differences. They went to great lengths to say that it wasn’t that one was better than the other, just that Xeon had some things, Itanium had others. The take home message was that RAS was basically a wash between them.
While this may not be true, even if it is close, why would anyone buy an Itanium product? Beckton scales natively to 64 cores, higher with a bit of glue logic. Westmere-EX will up this to 80 cores in a month or two, leaving the market for the triple digit core counts all to Itanium. And Power. And Sparc. The air up there is awfully thin. By the time Poulson ships, Ivy Bridge-EX (Note: there is no Sandy Bridge-EX) will be very close to shipping, if not out, and raise the RAS bar for x86.
Overall, Poulson has some neat features, but you have to wonder why Intel is bothering. According to what we were told, it will likely be slower, hotter, more expensive, and painfully limited in OS choices compared to x86. On top of that, software headaches, a persistent bane of Itanium’s existence, only get worse. Other than a few specific niches for shops that are bound to HP proprietary OSes, we can’t see any reason to buy Poulson. We can however see a lot of reasons to jump to Westmere-EX or Ivy Bridge-EX. S|A