Memory errors in mcelog

Memory (RAM) errors are among the most common errors in typical server systems. Their number also scales with the amount of memory: the more memory, the more errors. In addition, large clusters with tens, hundreds, or sometimes thousands of active machines increase the total error rate of the system.

On systems with ECC memory, memory errors can be detected and usually corrected.

A corrected error (also called a soft error) is not necessarily a problem. The hardware platform uses error correcting codes and redundancy to handle such errors, which is why they are called corrected errors. Unlike an uncorrected (hard) error, which means data corruption, a soft error does not directly require a reaction from software. Every system has an expected soft error rate, so systems with long uptime will report some soft errors. A small number of soft errors in a given time frame is generally not a problem.
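The idea that a few corrected errors per time frame are acceptable can be sketched as a sliding-window counter. This is only an illustration of the concept, not mcelog's actual implementation; the class name `SoftErrorWindow`, the threshold, and the window length are hypothetical values chosen for the example.

```python
from collections import deque


class SoftErrorWindow:
    """Count corrected errors inside a sliding time window.

    Hypothetical helper: the threshold and window are assumptions,
    not values taken from mcelog.
    """

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> bool:
        """Record one corrected error at time `now` (in seconds).

        Returns True when more than `threshold` errors fall inside
        the window, i.e. when the rate looks higher than expected.
        """
        self._timestamps.append(now)
        # Drop errors that have aged out of the window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


# Five errors spread over a long period: never more than three per hour.
window = SoftErrorWindow(threshold=3, window_seconds=3600.0)
alerts = [window.record(t) for t in (0.0, 100.0, 200.0, 5000.0, 5010.0)]
```

With these example values no call reports a problem, because the older errors have left the window before the later ones arrive. A burst of errors within one window would make `record` return True.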

For memory errors, it is important to identify which component caused the problem. The mcelog daemon tracks memory errors in different buckets:

per DIMM (if available)
per channel
per memory controller
per socket (= physical CPU package)
per page (used to automatically offline bad pages)
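The bucket scheme above can be sketched as a set of counters, where every reported error is charged to each level it belongs to. This is a simplified model for illustration only: the names `ErrorLocation` and `ErrorBuckets`, the field layout, and the way buckets are keyed are assumptions, not mcelog's internal data structures, and page offlining itself is not modeled.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorLocation:
    """Where a corrected memory error was reported (illustrative fields)."""
    socket: int
    controller: int
    channel: int
    dimm: Optional[int]  # None when the platform cannot identify the DIMM
    page: int            # physical page frame number


class ErrorBuckets:
    """Keep one error counter per socket, controller, channel, DIMM and page."""

    def __init__(self) -> None:
        self.counts = Counter()

    def account(self, loc: ErrorLocation) -> None:
        """Charge one error to every bucket that contains `loc`."""
        keys = [
            ("socket", loc.socket),
            ("controller", loc.socket, loc.controller),
            ("channel", loc.socket, loc.controller, loc.channel),
            ("page", loc.page),
        ]
        if loc.dimm is not None:
            keys.append(
                ("dimm", loc.socket, loc.controller, loc.channel, loc.dimm)
            )
        for key in keys:
            self.counts[key] += 1

    def worst(self, level: str):
        """Return (bucket key, count) with the most errors at `level`, or None."""
        candidates = [(k, v) for k, v in self.counts.items() if k[0] == level]
        if not candidates:
            return None
        return max(candidates, key=lambda kv: kv[1])


buckets = ErrorBuckets()
buckets.account(ErrorLocation(socket=0, controller=0, channel=1, dimm=2, page=0x1234))
buckets.account(ErrorLocation(socket=0, controller=0, channel=1, dimm=2, page=0x1234))
buckets.account(ErrorLocation(socket=1, controller=0, channel=0, dimm=None, page=0x9999))
```

Counting at several levels at once is what makes it possible to tell a single bad page apart from a failing DIMM or a problem that affects a whole socket.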

The state of the running daemon can be queried using:

mcelog --client
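For scripted monitoring, the same query can be issued from a program. The following is a minimal sketch that assumes the `mcelog` binary is on the PATH and the daemon is running; the function name `query_mcelog_daemon` and the injectable `runner` parameter are hypothetical, and the sketch makes no assumption about the format of the returned text.

```python
import subprocess


def query_mcelog_daemon(runner=subprocess.run) -> str:
    """Run `mcelog --client` and return its output as text.

    `runner` is injectable (an assumption of this sketch) so the
    function can be exercised without a running daemon.
    """
    result = runner(
        ["mcelog", "--client"],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `print(query_mcelog_daemon())` on a machine with a running daemon prints the same report as the command above.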

For more details, see the LinuxKongress 2010 mcelog paper and the other references.

For users:

Background:

For developers: