On systems with ECC memory, memory errors can be detected and usually corrected.
A corrected error, also called a soft error, is not necessarily a problem. The hardware platform uses error-correcting codes and redundancy to handle such errors, which is why they are called corrected errors. Unlike an uncorrected (hard) error, which means data corruption, a soft error does not directly require a software reaction. Every system has an expected soft error rate, so on systems with long uptime some soft errors will be reported. A small number of soft errors in a given time frame is generally not a problem.
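To make "a small number of soft errors in a given time frame" concrete, the following is a minimal Python sketch of a sliding-window counter. It is an illustration only, not mcelog's actual accounting; the class name, the threshold, and the window length are all hypothetical.

```python
from collections import deque


class CorrectedErrorWindow:
    """Counts corrected errors seen within a sliding time window (illustrative)."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # hypothetical limit per window
        self.window_seconds = window_seconds  # hypothetical window length
        self.timestamps = deque()

    def record(self, now):
        """Record one corrected error at time `now` (in seconds).

        Returns True when the number of errors inside the window
        exceeds the threshold, False otherwise.
        """
        self.timestamps.append(now)
        # Drop events that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold
```

With a policy like this, isolated corrected errors never trigger a reaction; only a burst that exceeds the expected rate does.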
For memory errors, it is important to be able to identify which component caused the problem. The mcelog daemon tracks memory errors in different buckets:
- per DIMM (if available)
- per channel
- per memory controller
- per socket (= physical CPU package)
- per page (used to automatically offline bad pages)
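The bucket idea above can be sketched in a few lines of Python: a single error is accounted against every level of the hierarchy it belongs to. This is an assumption-laden illustration, not the daemon's real data structure; the function names and the key layout are hypothetical.

```python
from collections import Counter


def bucket_keys(socket, controller, channel, dimm=None, page=None):
    """Return the bucket keys one memory error is accounted against."""
    keys = [
        ("socket", socket),
        ("controller", socket, controller),
        ("channel", socket, controller, channel),
    ]
    # DIMM information is only present if the platform reports it.
    if dimm is not None:
        keys.append(("dimm", socket, controller, channel, dimm))
    # The page bucket is keyed by the physical page, independent of topology.
    if page is not None:
        keys.append(("page", page))
    return keys


def account_error(counts, **location):
    """Increment every bucket that the error at `location` falls into."""
    for key in bucket_keys(**location):
        counts[key] += 1
```

Keeping one counter per level makes it possible to tell a single failing DIMM apart from a problem that affects a whole channel or socket.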
The state of the running daemon can be queried with:
mcelog --client
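For scripted monitoring, the same query can be wrapped in a small Python helper. This is a sketch that assumes the daemon is running and that the caller is allowed to access its client socket. The function name and the `run` parameter are hypothetical, and the output is returned as plain text without parsing, because its format is not specified here.

```python
import subprocess


def query_mcelog_client(run=subprocess.run):
    """Run `mcelog --client` and return its text output unmodified.

    `run` is injectable so the helper can be exercised without mcelog
    installed; by default it is subprocess.run.
    """
    result = run(
        ["mcelog", "--client"],
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return result.stdout
```
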
For more details, please see the LinuxKongress 2010 mcelog paper and the other references.