As you might have noticed in our earlier articles, one of the main reasons to consider a Xeon E7 v2 is RAS. Let's take a look at the RAS features Intel added to the new E7 v2 line that make it more reliable than a lesser Xeon.
All told there are 27 RAS features that Intel lists for the new Ivy Bridge-EX line that we affectionately know as Xeon E7 v2 x8xx. Of that list, 10 are considered new, or looked at the other way, the 17 prior technologies are just the foundation. In any case Intel marketing has once again run amok and for some odd reason labeled this group of 10 Intel Run Sure Technology. Taking them at their word, the old Westmere-EX line and every other Intel CPU may be a bit flaky, but that is marketing for you.
These 10 new technologies are split into two groups: six are memory related and four are system oriented. Some are hardware, some are a combination of hardware and software, and some are enhancements to older technologies. What they all have in common is the ability to detect, correct, flag, and sometimes even partition off errors. Let's take a look at them in a bit more detail.
First up is something SemiAccurate touched upon yesterday, memory protection in the form of DDDC or Double Device Data Correction. This is an advanced form of ECC that can detect errors on two different DRAMs spread across two DIMMs and correct them. It can even deal with the complete failure of a memory die and reconstruct its contents on other chips on the fly. The failed dies can then be permanently mapped out. The Double part means this can happen twice without taking the system down.
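The reconstruction idea can be sketched with a toy model: treat each DRAM device as a column of data and keep parity across them, so a dead die's contents can be rebuilt from the survivors. Real DDDC uses symbol-based ECC over a lockstep channel rather than simple XOR parity, so take this only as an illustration of the principle:

```python
# Toy model of device-level correction: keep XOR parity across DRAM
# devices so a completely failed die can be rebuilt from the survivors.
# Real DDDC uses symbol-based ECC across a lockstep channel, not plain
# parity; this only illustrates the reconstruction principle.

def make_parity(devices):
    """Compute a parity 'device' over equal-sized data devices."""
    parity = bytearray(len(devices[0]))
    for dev in devices:
        for i, b in enumerate(dev):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(devices, parity):
    """Rebuild the one failed device (marked None) from the rest."""
    missing = devices.index(None)
    rebuilt = bytearray(parity)
    for j, dev in enumerate(devices):
        if j != missing:
            for i, b in enumerate(dev):
                rebuilt[i] ^= b
    out = list(devices)
    out[missing] = bytes(rebuilt)
    return out

chips = [b"\x12\x34", b"\xab\xcd", b"\x00\xff"]   # three DRAM dies
parity = make_parity(chips)
failed = [chips[0], None, chips[2]]               # die 1 dies completely
assert reconstruct(failed, parity)[1] == b"\xab\xcd"
```

Mapping the dead die out permanently, as the hardware does, amounts to writing the rebuilt data somewhere healthy and never using that device again.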
SDDC is similar but works in Performance Mode, or when x8 DIMMs are used, both of which preclude the Lockstep Mode that DDDC requires. The S stands for Single and that is what you pay for with Performance Mode: a DIMM can only tolerate a single device failure before there are serious problems at the system level. As we stated earlier, anyone buying a big E7 v2 system is likely to care about uptime and RAS more than someone buying a tablet, so this is a sensible tradeoff for them.
But wait, there’s more! DDDC is now Enhanced DDDC, also known as DDDC+1. What does this feature bring an uptime-conscious buyer? In short, it allows traditional ECC, the ability to detect and correct single-bit errors, to keep working even after two complete DRAM die failures. In case you are wondering, SDDC has been updated to SDDC+1 in v2, so ECC still works for bit errors after one die is completely mapped out.
At the risk of sounding like a game show host, DDDC+1 will let two DRAMs completely fail on a Lockstep channel, rebuild the memory contents on other chips, and still fix single-bit errors on the fly. If one DRAM fails, most systems will flag this and replace the part ASAP, so the odds of a second chip dying before the first can be replaced are pretty slim. Since the E7 line supports Memory Sparing, Dynamic Memory Migration, Failed DIMM Identification, and hot plugging of DIMMs, fixing a dead part on the fly should be pretty simple. More importantly, it should require zero downtime as long as the OS and the specific system you have support those technologies.
Two of those technologies, Dynamic Memory Migration and Memory Hot Plug, plus Memory On-Lining, are new with v2 and supplement what was in v-not-2 and before. As you might guess, moving memory around between hardware pieces is not a simple task and may require OS support. Most modern server OSes should be aware of this feature and work well with it but your mileage may vary. Until you are sure that pulling out a bank of DIMMs won’t cause rather severe problems with screen coloration, you might want to test it a few times during scheduled outage periods…
Hot Plug has a companion technology in Memory On-Lining. In essence, when you map out a DIMM on the fly, pull it out of the system, and plug a new one in, that is all fine and dandy as far as the hardware is concerned. The system and OS, however, need to know that there is new memory in place, mark it as good, and then start using it. That is what On-Lining does.
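On Linux the on-lining step is exposed through sysfs: each hot-pluggable block shows up under /sys/devices/system/memory, and writing "online" to its state file brings it into use. A rough sketch of what a management script might do; the helper name is ours, and real use needs root plus kernel hotplug support for the platform:

```python
# Sketch of the on-lining step through the Linux sysfs interface. The
# /sys/devices/system/memory layout is real on Linux; the helper name
# is ours, and actually doing this needs root plus kernel hotplug
# support for the platform.
from pathlib import Path

def online_memory_block(block, sysfs="/sys/devices/system/memory"):
    """Write 'online' to the block's state file; True if it changed."""
    state = Path(sysfs) / block / "state"
    if state.read_text().strip() == "online":
        return False                  # block is already in use
    state.write_text("online\n")      # kernel validates and maps it in
    return True
```

Something like `online_memory_block("memory42")` would be the last step after the hardware and firmware have done their part of the swap.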
More interesting is a variant of Memory Mirroring, a feature that has been around for a while in the E7 line. The new version is called Fine Grained Memory Mirroring and it does what it says. Without the Fine Grained prefix you could have two DIMMs with copies of the same data, so if one failed, nothing catastrophic happened. With Fine Grained you can select portions of the memory to be mirrored and do so at whatever granularity you choose.
Like the older version, when a mirrored space has a problem it fails over to the other copy, transparently to the OS or software. You don’t need to reboot, you don’t need to change anything, and you can replace the DIMM at a time of your choosing. Fine Grained Memory Mirroring lets you trade off cost for security at the level amenable to your budget and uptime needs. You can see how mirroring critical OS and VMM portions is a good idea, while database segments already mirrored to disk may be a waste. It is a very useful bit of control for those who wish to dig in to that level, and most users of large E7 v2 systems will.
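The failover behavior can be modeled in a few lines: every write lands on both copies, and a read whose primary copy fails verification is silently served from the mirror. The class and the CRC check are invented for illustration; the real mechanism lives in the memory controller, below anything software sees:

```python
# Toy failover model for a mirrored range: every write lands on both
# copies, and a read whose primary copy fails verification is served
# from the mirror without the caller noticing. The class and CRC check
# are invented; real mirroring happens in the memory controller.
import zlib

class MirroredRange:
    def __init__(self, size):
        self.copies = [bytearray(size), bytearray(size)]

    def write(self, off, data):
        for copy in self.copies:      # every store goes to both DIMMs
            copy[off:off + len(data)] = data

    def read(self, off, n, checksum):
        for copy in self.copies:      # primary first, then the mirror
            chunk = bytes(copy[off:off + n])
            if zlib.crc32(chunk) == checksum:
                return chunk
        raise RuntimeError("both copies corrupt")

m = MirroredRange(8)
m.write(0, b"critical")
good = zlib.crc32(b"critical")
m.copies[0][0] ^= 0xFF                # corrupt the primary copy
assert m.read(0, 8, good) == b"critical"
```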
If everything above works out, the system will never see an error; it will just be detected, corrected, and if necessary the offending bits mapped out. When things are not correctable, that is where the other part of the new RAS features comes into play, collectively called MCA or Machine Check Architecture. MCA requires software awareness at all levels, or at least the more levels of the stack that are MCA aware, the better off it is. When MCA is in the house, things get interesting.
In the pre-MCA days, if an error was not corrected, the system hopefully crashed. We say hopefully because silent data corruption is a really bad thing. Think about what would happen if, for example, the company holding your mortgage had a little problem like that… If the error is detected, the system can be flagged to say there is a problem, at which point it will alert the operators and/or stop. Stopping or crashing is bad for uptime statistics, something you might have noticed is a high priority for E7 v2 customers.
MCA has four new parts: MCA Recovery Execution Path, MCA Recovery Non-Execution Path, MCA I/O, and Enhanced MCA Gen 1. Rather than go over each in detail, let's look at how it works. The idea is to notify the software at all levels that there was an error that could not be corrected and was passed along. That data, be it a PCIe packet or another stream, is tagged with the fact that it is bad. An interrupt is thrown to alert whatever is at the lowest level, the VMM or OS, that there is a data problem.
From this point the VMM can take action. If nothing is using a bad memory space, it can be mapped out. If a VMM is using that space and it can pass that information to an MCA aware OS or software stack, the VM can be gracefully taken down, the erring hardware remapped, and things restarted. No system crashing, no corruption, and a contained error that will hopefully never happen again. An operator can take whatever action is appropriate at this point.
If the OS running in that VM is MCA aware, and most should be, then it can take action in a more granular fashion than the VMM. The OS can see what program is using the offending memory space or I/O device and notify it, should it also be MCA aware. If the program is not MCA aware, the OS can simply save as much data as possible, shut it down, and notify the hardware to map out the bad bits before restarting the program. If you are running Apache, for example, the OS may be able to simply kill that single instance and keep everything else going without a hiccup. One user interrupted instead of potentially thousands if an 8S system crashes completely is a good trade for most companies.
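The containment decision the OS makes can be sketched roughly like this, with an invented page-owner table and process names standing in for the real machine-check plumbing:

```python
# Rough sketch of the containment decision: poison the bad page, find
# its owner, and either notify an MCA-aware program or kill and restart
# an unaware one. The page-owner table and process names are invented;
# on real hardware this runs through the kernel's machine-check handler.

def handle_uncorrected_error(addr, page_owners, poisoned):
    """Return a description of the action taken for the bad address."""
    page = addr >> 12                 # 4 KiB page containing the error
    poisoned.add(page)                # never hand this frame out again
    proc = page_owners.get(page)
    if proc is None:
        return "page unused: mapped out, nobody notified"
    if proc["mca_aware"]:
        return "notified %s: it can recover on its own" % proc["name"]
    return "killed %s: restart it on a clean page" % proc["name"]

owners = {0x42: {"name": "apache-worker-7", "mca_aware": False}}
poisoned = set()
print(handle_uncorrected_error(0x42123, owners, poisoned))
```

The unaware Apache worker dies and gets restarted on a clean page; the rest of the box never notices.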
Intel was using SAP and its HANA in-memory database to show off the new MCA features. If you have a large database running on a fully populated 8S E7 v2 machine with the full 1.5TB per socket, or 12TB of memory for the system, a crash can have serious effects. Worse yet, if you think about how many users would be on such a machine and the types of workloads it runs, downtime has a massive cost too. As you might guess, HANA is MCA aware at the program level.
If the hardware flags the VMM and the VMM flags the OS, then HANA, if you are running it, will also be flagged and take action. HANA copies all of the database in memory to SSDs so there is a full backup. On an uncorrected error, HANA can flag the area that is bad, reload the data from SSD to a different memory space, de-allocate the bad block, and keep right on running. From there DDDC+1, Memory Mirroring, Hot Plug, and the rest take over and the offending bits can be swapped out if needed. The database never went down, no programs were restarted, no OSes crashed, and no VMMs had problems. The worst that happened is a little slowdown for a query, which can be problematic but far less so than any of the other scenarios.
The work that was done to contain and partition off the error was minimal, likely just a page or two; in the worst case it may have been a full DRAM die’s worth of data, single-digit Gb worth. If you have a fully MCA aware stack, it should allow for some serious uptime and granular fault tolerance. The more effort a software vendor puts in to MCA support, the more reliability and granular fault partitioning they will get out.
In theory MCA can protect against almost anything, but in reality you will still have problems. If that memory error is in a critical portion of the VMM, MCA won’t do very much. Sure it will flag the VMM properly that it is toast, but that doesn’t do much good. If you look at the size of the critical areas of an OS or a VMM versus the size of the E7 v2 memory pool, this risk is very low but it is still there. Then again, this is exactly what Fine Grained Memory Mirroring is for, so in theory this should never happen.
The rest of MCA deals with logging and flagging errors, but also predictive failure analysis; MCA Gen 1 is Enhanced mainly in this area. A system that can flag an error all the way down to the program level does no good if that error keeps happening and the person at the console is blissfully unaware of it. And if the predictive failure analysis alerts someone before things ever get to the point where MCA is needed, all the better.
That is the idea behind the new MCA enhancements: detect things earlier, hopefully before they are uncorrectable, keep the problems as contained as possible, and most importantly extend that capability to all areas of the system. As of now the main parts of the system are covered but there is a lot more to do. When the next EX line comes out some time in 2016 or so, there will undoubtedly be advances on this front.
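The predictive side boils down to rate-watching: corrected errors are harmless one at a time, but a DIMM that racks them up faster than some threshold gets flagged for replacement before an uncorrectable error ever shows up. The threshold and window below are made-up numbers, not anything Intel publishes:

```python
# Sliding-window count of corrected errors per DIMM: individually
# harmless, but a rate above threshold flags the DIMM for replacement
# before an uncorrectable error ever shows up. Threshold and window
# values are made up for illustration.
from collections import deque

class CorrectedErrorMonitor:
    def __init__(self, threshold=24, window=3600.0):
        self.threshold = threshold    # errors per window before alarm
        self.window = window          # seconds of history to keep
        self.events = deque()

    def record(self, now):
        """Log one corrected error; True means 'replace this DIMM'."""
        self.events.append(now)
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()     # age out old, forgiven errors
        return len(self.events) >= self.threshold
```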
On top of these new features there are enhancements to some of the older ones too. CPUID remapping has been updated so a system can boot even if CPU0 has been mapped out. This can be done on the fly too, again with OS support, and a new CPU hot plugged and brought back up. In theory there should be little reason to ever bring down a Xeon E7 v2 box, but you know how well theory correlates to reality in cases like this. With luck though, you should never see an E7 box go down for hardware-related reasons. S|A