Advanced hardware error handling for x86 Linux

mcelog installation

Quickstart

The standard way is to use the mcelog that comes with your distribution. If that mcelog is very old or not correctly configured, you can set up a newer one as follows.

If your distribution has an old crontab-based mcelog, disable it to avoid conflicts. The easiest way is to delete the mcelog cron job file in /etc/cron.*
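Before deleting anything, it can help to see which cron files actually mention mcelog. The following is a minimal sketch, assuming the cron entries live under /etc/cron.* or in /etc/crontab. The function name list_mcelog_cron and its optional root-prefix argument are illustrative only and are not part of mcelog.

```shell
#!/bin/sh
# List cron files whose contents mention mcelog, so they can be reviewed
# before removal. The optional first argument is a root prefix (default:
# empty, i.e. the real /etc); it is an assumption added only to allow dry
# runs against a copy of /etc.
list_mcelog_cron() {
    root="${1:-}"
    # -r: recurse into directories, -l: print matching file names only,
    # -s: stay quiet about paths that do not exist
    grep -rls mcelog "$root"/etc/cron.* "$root"/etc/crontab 2>/dev/null
}

# Example: list_mcelog_cron    (inspect the output, then delete as root)
```

The function only reports files; removing them is left as a deliberate manual step.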

Compile the git version or a tarball. The git version is currently recommended and more featureful.

Get the git version:

git clone git://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git

(or cd mcelog ; git pull -u if you already have an earlier checkout)

Compile and install

cd mcelog
make
... become root ... make install

Installation on Linux systems based on init scripts

Set up the init script mcelog.init and make sure it is executed at boot. For example, on an openSUSE system this can be done with:

cp mcelog.init /etc/init.d/mcelog
chkconfig mcelog on

On other distributions you may need to find the distribution equivalent of chkconfig. mcelog.init also has some configuration settings that you could change, although the defaults are reasonable.

If you don't reboot, start mcelog explicitly with /etc/init.d/mcelog start. This is not needed after later reboots if the init script is set up correctly.

Installation on systems based on systemd

mcelog includes a standard systemd unit file: mcelog.service. To install the service file:
cp mcelog.service /usr/lib/systemd/system
systemctl enable mcelog.service

Installation on systems based on upstart

Use the init script for now.

Other installation notes

You may need to add /var/log/mcelog to your log rotation setup (like logrotate) if you didn't configure mcelog to log to syslog or to journald. This can often be done by creating a file in /etc/logrotate.d (or reusing the one that is already there from the distribution).
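As an illustration of what such a logrotate file might contain, here is a small sketch that prints a candidate stanza. The rotation frequency and retention (weekly, rotate 4) are placeholder assumptions, and copytruncate is chosen on the assumption that the daemon keeps its log file open; check what your distribution already ships before using it. The function name print_mcelog_logrotate is hypothetical.

```shell
#!/bin/sh
# Print a candidate logrotate stanza for /var/log/mcelog.
# Frequency and retention below are placeholder assumptions, not
# values recommended by mcelog.
print_mcelog_logrotate() {
    cat <<'EOF'
/var/log/mcelog {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Example (as root): print_mcelog_logrotate > /etc/logrotate.d/mcelog
```

Printing to standard output rather than writing the file directly makes it easy to review the stanza first.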

You can verify that the daemon is running correctly by executing

mcelog --client
This should query the information in the running daemon. If it prints nothing, that is fine (no errors have been logged yet).

Dependencies

Verify that /dev/mcelog exists. If not, create it with mknod /dev/mcelog c 10 227. On udev-based systems you can also add a udev rule file in /usr/lib/udev/rules.d, such as:
ACTION=="add", KERNEL=="mcelog", SUBSYSTEM=="misc", TAG+="systemd", ENV{SYSTEMD_WANTS}+="mcelog.service"
This is needed if /dev is not persistent, as in many newer distributions. Typically the distribution package for mcelog takes care of that.
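The device check can be sketched as a small shell function. It only tests whether the path is a character device and, if not, prints the mknod command given above. The function name check_mcelog_dev and the optional path argument are illustrative assumptions.

```shell
#!/bin/sh
# Report whether the mcelog character device exists.
# The path defaults to /dev/mcelog; it is a parameter only for illustration.
check_mcelog_dev() {
    dev="${1:-/dev/mcelog}"
    if [ -c "$dev" ]; then
        echo "$dev: present"
    else
        # Major 10 / minor 227 are the numbers given in the text above.
        echo "$dev: missing (as root: mknod $dev c 10 227)"
        return 1
    fi
}

# Example: check_mcelog_dev
```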

For bad page offlining you will need a 2.6.33+ kernel, or a 2.6.32 kernel with the soft offlining capability backported (like RHEL6 or SLES11-SP1).

The kernel has to have CONFIG_X86_MCE enabled. For 32-bit kernels you need at least a 2.6.30 kernel.
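These kernel requirements can be checked roughly from the shell. The sketch below assumes GNU sort (for its -V version sort) and a kernel config file readable at a path such as /boot/config-$(uname -r), which varies by distribution. The function names are hypothetical. A plain version comparison cannot detect backported features, so a 2.6.32 vendor kernel may still support soft offlining.

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeed if version HAVE >= WANT.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# has_x86_mce FILE: succeed if the kernel config FILE enables CONFIG_X86_MCE.
has_x86_mce() {
    grep -q '^CONFIG_X86_MCE=y' "$1"
}

# Example use (the config path is an assumption; it differs between distributions):
#   version_at_least "$(uname -r | cut -d- -f1)" 2.6.33 && echo "kernel is 2.6.33 or newer"
#   has_x86_mce "/boot/config-$(uname -r)" && echo "CONFIG_X86_MCE enabled"
```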

mcelog modes

mcelog can run in several modes: cronjob, trigger, daemon

The recommended mode is daemon, because several new functions (like page error predictive failure analysis) require a continuously running daemon. If you just want daemon mode you can stop reading here.

In daemon mode mcelog runs continuously as a daemon in the background and waits for errors. It is enabled by running mcelog --daemon & from an init script. This is the fastest and most featureful mode.

cronjob is the old method. mcelog runs every 5 minutes from cron and checks for errors. The disadvantage is that it can delay error reporting significantly (up to 10 minutes) and does not allow mcelog to keep extended state.

trigger is a newer method where the kernel runs mcelog on an error. This is configured with echo /usr/sbin/mcelog > /sys/devices/system/machinecheck/machinecheck0/trigger. This is faster, but still doesn't allow mcelog to keep state, and has relatively high overhead for each error because a program has to be initialized from scratch.
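For trigger mode, the echo command above can be wrapped in a small function that first checks whether the trigger file is writable. This is only a sketch: the function name set_mce_trigger and its optional arguments are illustrative, and the defaults are the paths given in the text.

```shell
#!/bin/sh
# Write the program the kernel should run on a machine check into the
# sysfs trigger file. Both arguments are optional and default to the
# paths used in the text above.
set_mce_trigger() {
    prog="${1:-/usr/sbin/mcelog}"
    trigger="${2:-/sys/devices/system/machinecheck/machinecheck0/trigger}"
    if [ ! -w "$trigger" ]; then
        echo "cannot write $trigger (not root, or no machine check support?)" >&2
        return 1
    fi
    echo "$prog" > "$trigger"
}

# Example (as root): set_mce_trigger
```

Since daemon mode is the recommended setup, this is mainly useful on systems that cannot keep a daemon running.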