Glossary
Machine Check
A hardware error detected by hardware and reported to software.
Machine Check Architecture (MCA)
The x86 machine check architecture is a hardware programming
interface that allows software to report and handle both corrected and uncorrected hardware errors. It is an architectural interface with some
abstraction, which allows operating systems to remain forward and backward compatible.
Details are described in the Intel Architecture
Software Developer Manual Volume 3 chapter 15.
Machine Check Exception (MCE)
The x86 CPU raises an int 18 exception to signify
an uncorrected hardware error. The operating system has a special
handler to process the information contained in the
MCA registers.
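As a rough illustration of what such a handler reads, the sketch below decodes the flag bits of one per-bank status register (IA32_MCi_STATUS). The bit positions follow the Intel manual; the function names and the sample value are made up for this example, and this is not how mcelog or any kernel handler is actually written.

```python
# Flag bits of IA32_MCi_STATUS as documented in the Intel SDM.
VAL   = 1 << 63  # register contains valid error information
OVER  = 1 << 62  # an earlier error was overwritten before being read
UC    = 1 << 61  # the error was not corrected by hardware
EN    = 1 << 60  # reporting of this error was enabled
MISCV = 1 << 59  # the MISC register holds additional information
ADDRV = 1 << 58  # the ADDR register holds a valid address
PCC   = 1 << 57  # processor context may be corrupted


def decode_status(status):
    """Split a raw 64-bit status value into named fields (hypothetical helper)."""
    return {
        "valid": bool(status & VAL),
        "overflow": bool(status & OVER),
        "uncorrected": bool(status & UC),
        "enabled": bool(status & EN),
        "misc_valid": bool(status & MISCV),
        "address_valid": bool(status & ADDRV),
        "context_corrupt": bool(status & PCC),
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def classify(status):
    """Return 'no error', 'corrected' or 'uncorrected' for a status value."""
    if not status & VAL:
        return "no error"
    return "uncorrected" if status & UC else "corrected"


# Made-up sample value: valid, enabled, address valid, error code 0x009F.
sample = VAL | EN | ADDRV | 0x009F
print(classify(sample), decode_status(sample))
```

The UC flag is what separates the corrected and uncorrected errors defined further down in this glossary.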
ECC
Error Correcting Code. A code that can detect and correct
errors. Typical ECC schemes can detect two-bit errors and correct
single-bit errors (some advanced encodings can handle more errors).
See Wikipedia's
ECC
entry. On servers the memory subsystem generally supports ECC.
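To make "detect two, correct one" concrete, here is a minimal sketch of an extended Hamming(8,4) code, a small textbook example of this class of codes. Real memory ECC uses much wider codewords and is implemented in hardware; the function names below are illustrative only.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8                       # index 0 holds the overall parity bit
    # Data bits go to the positions that are not powers of two.
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); nibble is None for an uncorrectable error."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # 0 for an intact Hamming codeword
    parity_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: assume one. The syndrome is its position
        # (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrected", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                             # inject a single-bit error
print(decode(word))
```

Flipping a second bit in `word` before decoding makes `decode` report the error as uncorrected instead of returning wrong data.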
Corrected error
A hardware error that was corrected by the hardware
(e.g. a single-bit data corruption that was correctable using
ECC). These errors do not require
immediate software action, but are still reported for accounting
and
predictive failure analysis.
Uncorrected error
A hardware error that the hardware detected but could not correct. Data corruption
has occurred. These errors require a reaction from software.
Predictive Failure Analysis (PFA)
Using trends in (primarily) corrected errors to predict future failure of
hardware components and automatically taking steps to avoid outages.
mcelog implements automatic offlining for
memory and
CPU caches. Additional user-specified
actions can also be configured.
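The idea of acting on error trends can be sketched with a simple windowed counter: count corrected errors for one component and signal once too many arrive within a given period. This is only an illustration under assumed names (`ErrorBucket`, `threshold`, `period_s`); it is not mcelog's actual implementation or configuration syntax.

```python
class ErrorBucket:
    """Counts corrected errors and starts over every `period_s` seconds."""

    def __init__(self, threshold, period_s):
        self.threshold = threshold    # errors per period that trigger action
        self.period_s = period_s      # length of one counting window
        self.count = 0
        self.window_start = None

    def record(self, now_s):
        """Register one error at time now_s; True once the threshold is hit."""
        expired = (self.window_start is None
                   or now_s - self.window_start >= self.period_s)
        if expired:
            # Begin a fresh window so old errors no longer count.
            self.window_start = now_s
            self.count = 0
        self.count += 1
        return self.count >= self.threshold


# Hypothetical policy: three corrected errors within one hour.
bucket = ErrorBucket(threshold=3, period_s=3600)
for t in (0, 100, 200):
    if bucket.record(t):
        print("threshold reached at t =", t)
```

A caller would react to a `True` result with whatever action the policy prescribes, for example taking the affected memory page offline.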
IO-MCA
Used for reporting uncorrected errors on PCI Express links on newer Xeon systems.
Supported by mcelog, see
IO errors.
PCI AER
PCI Express Advanced Error Reporting. Used for error reporting on PCI Express links.
Not supported by mcelog, but logged to the normal kernel log. For more details
on the implementation see the
OLS paper.
See also
IO-MCA.
RAS
Reliability, Availability, Serviceability.
DIMM
Dual In-line Memory Module. A memory module.
DMI (or SMBIOS)
This is a standardized way for a BIOS to report the current hardware
configuration to the operating system. The DMI information can be
dumped with the
dmidecode program. mcelog uses this information
when available to map DIMM numbers to silk screen labels.
APEI
ACPI Platform Error Interface. An interface defined in the
ACPI 4 standard
that allows a BIOS to report errors to an operating system. Formerly
known as
WHEA.
EDAC
An alternative memory error reporting framework. See the
FAQ entry.