Linux EDAC modules on Server Systems

Error Detection and Correction (EDAC) is a set of Linux kernel modules for handling hardware-related errors. Its major focus has been ECC memory error handling, but it also detects and reports PCI bus parity errors. This post on the ION Server Blog looks at how that fits on modern servers.

It may make sense to explain a bit about memory errors here and the rationale for ECC (Error Correcting Code) memory. Wikipedia has a fairly complete article on the subject if you want to learn more. In summary, though, background electrical or magnetic interference, chiefly from neutrons dislodged by cosmic radiation, can hit the storage location of a bit of DRAM with enough energy to change the state of that bit. That is not a failure of that bit, but rather a change of state of the capacitive cell that holds that bit’s data. In more common terms, a bit gets flipped. These are often called “soft errors” because the data has been corrupted, but not due to an actual failure of the memory device.

Originally, parity was used to detect a single changed bit in a storage unit. Parity fails, however, in cases where two bits in a byte or word are flipped, because the two changes cancel out in the parity check. ECC goes a step further: it can correct a single-bit error and report a multi-bit error (usually via an NMI, a non-maskable interrupt). On today’s servers, where a bit of corrupted data or program code could have significant impact, the ability to correct fairly common soft errors makes a lot of sense.
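To make the difference concrete, here is a minimal sketch in Python. It uses the textbook Hamming(7,4) code as a stand-in for ECC; real server memory uses wider codes, so the 4-bit word size and the function names here are illustrative assumptions, not what any particular memory controller implements.

```python
def parity_bit(bits):
    """Even parity: 1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # checks codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position).

    error_position is the 1-based position of the single flipped bit that
    was repaired, or 0 if the codeword looked clean.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the bad position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored_parity = parity_bit(word)

one_flip = [0, 0, 1, 1]   # first bit flipped
two_flips = [0, 1, 1, 1]  # first and second bits flipped

print("parity notices one flip: ", parity_bit(one_flip) != stored_parity)
print("parity notices two flips:", parity_bit(two_flips) != stored_parity)

damaged = hamming74_encode(word)
damaged[4] ^= 1  # simulate a soft error in codeword position 5
print("ECC recovers the word:   ", hamming74_correct(damaged)[0] == word)
```

Plain Hamming(7,4) only corrects single-bit errors; telling a double-bit error apart from a single-bit one takes an extra overall parity bit, which this sketch leaves out.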

Once a server can correct single-bit errors, it becomes important to monitor that activity. Occasional correction of bit errors is expected. Repeated corrections, though, can be a sign of failure or impending failure of the memory module, which is why it is important to track such events. Oracle, for example, suggests that more than 24 correctable errors from a single DIMM, when other DIMMs are not showing errors, is probably a reasonable point at which to replace that DIMM.
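That rule of thumb is easy to express in code. The sketch below is one possible reading of it, not Oracle's own tooling: the function name, the DIMM labels, and the strict interpretation of "other DIMMs are not showing errors" as "every other DIMM has a count of zero" are all assumptions made for illustration.

```python
def dimm_to_replace(ce_counts, threshold=24):
    """Apply the 'more than 24 CEs on one DIMM, none elsewhere' rule of thumb.

    ce_counts maps a DIMM label to its correctable-error count.
    Returns the label of the single DIMM that should be considered for
    replacement, or None if the rule does not apply.
    """
    noisy = [label for label, count in ce_counts.items() if count > 0]
    # Assumption: the rule only applies when exactly one DIMM shows errors.
    if len(noisy) == 1 and ce_counts[noisy[0]] > threshold:
        return noisy[0]
    return None


# Hypothetical counts, as they might be collected from the SEL or from EDAC.
print(dimm_to_replace({"DIMM_A1": 31, "DIMM_A2": 0, "DIMM_B1": 0}))
```

When several DIMMs show errors at once, the function deliberately returns nothing, since that pattern points away from a single bad module and deserves a closer look by a person.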

Now we can return to our original topic of understanding how ECC “CEs”, correctable errors, are logged.

In most servers built since 2000 or so, a separate baseboard management controller (BMC) monitors all or most of the available on-board instrumentation: things like fan speeds, temperatures, power consumption and more. That usually includes detection of both correctable and uncorrectable memory errors. These are recorded in the “System Event Log” (SEL), which can be accessed via various open-source IPMI tools as well as manufacturer-specific tools and interfaces. This monitoring and logging happens on the serverboard without the involvement of any running operating system.

Modern Linux, since around the time of the 2.6.16 kernel, also includes a mechanism to log correctable and uncorrectable errors. Known as EDAC (Error Detection and Correction), it logs that information in the syslog and under /sys/devices/system/edac/….
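For a quick look at what EDAC has counted so far, something along these lines may help. It is a sketch that assumes the per-controller layout described in the kernel's EDAC documentation (an mc/mcN directory for each memory controller, containing ce_count and ue_count files); that layout is not spelled out in this post, and the function name is made up for the example.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac"):
    """Collect correctable/uncorrectable error counts per memory controller.

    Returns a dict such as {"mc0": {"ce_count": 3, "ue_count": 0}}.
    Controllers are discovered under <edac_root>/mc/mcN (assumed layout).
    """
    counts = {}
    for mc_dir in sorted(Path(edac_root).glob("mc/mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts
```

Calling read_edac_counts() with no arguments reads the live system; passing a different root makes it easy to try against a copied directory tree.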

Here is the problem: both Linux EDAC and the BMC know of correctable errors by polling a register – a read-only register that is erased automatically when it is read. Usually EDAC gets to it first, but sometimes the BMC does, which means that errors can be logged in two separate places, neither of which is complete.
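A toy model shows why neither log ends up complete. Nothing here reflects a real chipset register; the class and the polling schedule are hypothetical, and the only behavior being modeled is "reading the counter also clears it".

```python
class ClearOnReadRegister:
    """Toy stand-in for a hardware error counter that resets when read."""

    def __init__(self):
        self._pending = 0

    def record_error(self):
        self._pending += 1

    def read(self):
        # Return the pending count and clear it, as the hardware would.
        value, self._pending = self._pending, 0
        return value


def simulate(poll_order):
    """One correctable error occurs before each poll in poll_order.

    poll_order is a sequence of "edac" or "bmc" naming who polls next.
    Returns how many errors each reader ended up logging.
    """
    register = ClearOnReadRegister()
    logged = {"edac": 0, "bmc": 0}
    for reader in poll_order:
        register.record_error()
        logged[reader] += register.read()
    return logged


# EDAC usually wins the race, but the BMC gets there once.
print(simulate(["edac", "edac", "bmc", "edac"]))
```

Four errors occurred, yet each log holds only the ones its reader happened to collect; the full picture exists only if the two logs are added together.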

Most server manufacturers now recommend that on systems with a BMC which can track CEs, Linux EDAC for CEs should be disabled, so that the BMC can log all of these events. There are two steps to doing this on a running Linux server:

1. mce=ignore_ce should be added as a kernel parameter.
2. Search for and disable running EDAC modules:
   a. lsmod | grep edac
   b. for each EDAC module found, disable it via
      - alias edac_xxx off in /etc/modprobe.conf
      - or,
      - blacklist edac_xxx in /etc/modprobe.d/blacklist.conf
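On a fleet of servers it can be handy to check both steps with a script. The helpers below are a sketch only: the function names are invented, they work on text you pass in (the kernel command line and the output of lsmod) rather than touching the system themselves, and they match on the module name column only, which is slightly stricter than a plain grep.

```python
def has_ignore_ce(cmdline):
    """True if mce=ignore_ce appears as a whole token on the kernel command line."""
    return "mce=ignore_ce" in cmdline.split()


def loaded_edac_modules(lsmod_output):
    """Return names of loaded modules whose name contains 'edac'.

    lsmod_output is the text printed by `lsmod`; only the first column
    (the module name) is inspected, so the header line is ignored naturally.
    """
    names = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


def blacklist_lines(modules):
    """Build the lines to place in /etc/modprobe.d/blacklist.conf."""
    return ["blacklist " + name for name in modules]
```

To use these on a real machine, feed has_ignore_ce the contents of /proc/cmdline and feed loaded_edac_modules the captured output of lsmod; review the generated lines before adding them to the blacklist file.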

Any recorded CE events can then be found in the System Event Log. Many operations teams, however, do not monitor that log as closely as they do the syslog. For those administrators, ipmiutil getevt may be an ideal tool. Run with the options “-b -t 0”, it runs in the background indefinitely and copies all new events from the SEL to the syslog (and /var/log/ipmiutil_evt.log). This way both locations have all CE events logged, along with anything else the BMC can log, like a power anomaly or a fan failure.

Correctable memory errors are an important event to track in a server, but it is important to make sure that their meaning is understood and that all such events are logged in the same way.
