This is the mail archive of the
overseers@sourceware.org
mailing list for the Sourceware project.
ongoing sourceware.org recovery from disk corruption
- From: "Frank Ch. Eigler" <fche at redhat dot com>
- To: Sourceware Overseers <overseers at sourceware dot org>
- Date: Tue, 15 Aug 2017 09:35:16 -0400
- Subject: ongoing sourceware.org recovery from disk corruption
Hi -
As you probably know, we had a planned shutdown last night for
installation of a PCI SSD card into sourceware.org=gcc.gnu.org,
expecting to benefit from much greater read speeds. Its installation
went fine, and the machine came back up fine. The planned LVM2
operations were begun to make the PCI SSD card an LV raid1 mirror, and
the HDD LV mirror half was made 'writemostly'.
This is when things started going wrong. I believe a kernel bug
(rhel6 2.6.32+) has caused the mostly-null SSD LV mirror half to start
answering -some- reads, even from regions that had not yet finished
their initial mirror sync. This messed up ext4's brain, which started
corrupting metadata and some file content on the HDD half. Within a
few minutes, it was clear something was wrong, so the SSD mirroring was
shut down and the machine was rebooted. That stopped any further
corruption.
Unfortunately, within those few minutes, a large number of files were
corrupted. While we have backups on /sourceware2 (now frozen) from
late the previous night (Aug. 13), the new work makes us loath to just
switch back to the backup and ditch the 24+ hours of un-backed-up work
before the corruption, and the new bits of work committed since then.
So we're proceeding to restore bits, file by file, when/as corruption
is found. It's silly laborious, and we'll appreciate your patience
and help identifying affected files. The version control repositories
appear fine now, and /ftp is getting mass-restored (since it's approx. all
old), so the most important stuff seems OK. There are reports of some
mailing list archives and wiki pages being broken; will look at those
next. Please come hang out on #overseers on irc.freenode.net to chat.
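For anyone who wants to help hunt for affected files, here is a minimal
sketch of one way to flag candidates. It is not the procedure actually used
on sourceware.org. It assumes a live tree and a read-only backup tree with
the same layout, plus a known backup timestamp; the function names and
arguments are hypothetical. A file is flagged when its content differs from
the backup even though its mtime says it was not touched after the backup.
Since metadata was also corrupted, mtimes are not fully trustworthy, so
treat the output as a list of suspects rather than a verdict.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_files(live_root, backup_root, backup_time):
    """Yield paths (relative to live_root) whose content differs from the
    backup copy although the live mtime is not newer than backup_time.

    live_root, backup_root and backup_time (seconds since the epoch) are
    assumptions of this sketch, not real sourceware.org settings.
    """
    for dirpath, _dirnames, filenames in os.walk(live_root):
        for name in filenames:
            live_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(live_path, live_root)
            backup_path = os.path.join(backup_root, rel_path)
            if not os.path.isfile(backup_path):
                continue  # created since the backup; nothing to compare
            if os.path.getmtime(live_path) > backup_time:
                continue  # changed after the backup; needs manual review
            if sha256_of(live_path) != sha256_of(backup_path):
                yield rel_path
```

Files skipped by the mtime check are the new work done since the backup, so
they still need a human look; the sketch only narrows down the old files
that should have been identical to their backup copies.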
Sorry about this inconvenience. We (I) did not anticipate kernel bugs
messing up a Perfect Plan for speeding up our treasured little box.
In a little while, we'll try again, but with paranoid staging, some
manual fresh backups onto our backup server, and LVM snapshotting.
- FChE