Advanced hardware error handling for x86 Linux
Further reading
Papers and presentations
Studies about memory errors
- A good study on memory errors from the University of Rochester:
  "A Realistic Evaluation of Memory Hardware Errors and Software System
  Susceptibility", Li, Huang, Shen, Chu, USENIX Annual Technical Conference, 2010.
- The famous Google memory error study:
  "DRAM Errors in the Wild: A Large-Scale Field Study", Schroeder,
  Pinheiro, Weber, SIGMETRICS, 2009.
  Note: there are various indications that the error rates Google reports
  are significantly higher than those seen in typical servers. Using them for planning purposes is not recommended.
- A classic study on the benefits of automatic bad page offlining:
  "Assessment of the Effect of Memory Page Retirement on Systems RAS against
  Hardware Faults", Tang, Carruthers, Totari, Shapiro,
  Proceedings of the 2006 International Conference on Dependable Systems and
  Networks.
- A newer study that reaches the same conclusion, namely that automatic page offlining is a good idea:
  "Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design", Hwang, Stefanovici, Schroeder, ASPLOS 2012 (non-paywalled PDF).
Intel RAS
- Intel whitepaper on RAS in recent server processors.
- The Intel Software Developer's Manual describes the low-level register
  interface of the x86 machine check architecture.
  Machine checks are described in Volume 3A: System Programming Guide.
Related software
- The mce-inject injector tool and the mce-test test suite can be used to test machine check handling.
  This is in addition to the mcelog test suite included with the source
  (make test).
- Linux EDAC project on SourceForge. EDAC is an alternative approach to reporting memory errors. See also the
  EDAC discussion in the FAQ.
Other