
Gentoo md (Software RAID) RAID5 disk crash

Tuesday, May 13th, 2008

It’s spring, and the weather outside is outstanding. The days are longer, the birds are singing, and people are enjoying nature waking up for a new summer of growth.
With my exams coming up, I’ve accepted that I won’t get to enjoy much of the spring for another few weeks. But at least I’ve set aside the time I need to prepare for them…
…or so I thought. Then a mail arrived in my inbox from Charlie Root on my Gentoo server. Hardly an invitation for a bicycle ride in the sunshine:
WARNING: Some disks in your RAID arrays seem to have failed!

Damn, for the third year in a row(!) something goes wrong with my server in the middle of my exam preparations. Why can’t it happen during fall, when the weather’s bad and I have the time for such challenges?
Well, I suppose it’s Murphy’s law. Anyway, on to a more precise description of the problem…

The problem

I have 6 SATA disks in a RAID5 software array (using the Linux md driver on Gentoo), on which an XFS filesystem is mounted.

The actual error is that two(!) of my disks became faulty overnight. Diagnostics follow.

/proc/mdstat:
md0 : active raid5 sdf1[5] sde1[6](F) sdd1[3] sdc1[2] sdb1[1] sda1[7](F)
(some number) blocks level 5, 64k chunk, algorithm 2 [6/4] [_UUU_U]

I really got nervous here; 2 faulty disks in a RAID5 array means trouble. But why on earth did two disks fail at one time?? I’m not sure yet.
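For reference, the failed members can be picked out of that mdstat line mechanically. The following is only a small sketch: the function name failed_members is made up for illustration, and it assumes the usual "name[slot](F)" notation shown in the excerpt above.

```shell
# List the member devices that md has flagged as faulty, i.e. the ones
# followed by "(F)" in an array line from /proc/mdstat.
# failed_members is a hypothetical helper name, not part of mdadm.
failed_members() {
    # $1: one "mdX : active ..." line
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^\([a-z0-9]*\)\[[0-9]*](F)$/\1/p'
}

# The md0 line quoted in this post:
line='md0 : active raid5 sdf1[5] sde1[6](F) sdd1[3] sdc1[2] sdb1[1] sda1[7](F)'
failed_members "$line"
```

On a live system the line would come from something like grep '^md0' /proc/mdstat instead of a hard-coded string.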

dmesg excerpt:
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdf1 ...
md: adding sdf1 ...
md: adding sde1 ...
md: adding sdd1 ...
md: adding sdc1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md0
md: bind
md: bind
md: bind
md: bind
md: bind
md: bind
md: running:
md: kicking non-fresh sde1 from array!
md: unbind
md: export_rdev(sde1)
md: kicking non-fresh sda1 from array!
md: unbind
md: export_rdev(sda1)
raid5: device sdf1 operational as raid disk 5
raid5: device sdd1 operational as raid disk 3
raid5: device sdc1 operational as raid disk 2
raid5: device sdb1 operational as raid disk 1
raid5: not enough operational devices for md0 (2/6 failed)
RAID5 conf printout:
--- rd:6 wd:4
disk 1, o:1, dev:sdb1
disk 2, o:1, dev:sdc1
disk 3, o:1, dev:sdd1
disk 5, o:1, dev:sdf1
raid5: failed to run raid set md0
md: pers->run() failed …
md: do_md_run() returned -5
md: md0 stopped.
md: unbind
md: export_rdev(sdf1)
md: unbind
md: export_rdev(sdd1)
md: unbind
md: export_rdev(sdc1)
md: unbind
md: export_rdev(sdb1)

It seems like sda is the only faulty one (mdadm --examine says it doesn’t have a superblock at all), but its crash must have messed up the superblock of sde. However, the superblock of sde could be recreated, and with it the RAID5 array as a whole: RAID5 tolerates the loss of one disk (in this case, sda), but then you have no safety net left until you replace the faulty one.
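Why exactly one disk? RAID5 stores one parity block per stripe, computed as the XOR of the data blocks, so any single missing block can be recomputed from the rest. Here is a toy illustration with three one-byte "blocks" (the values are arbitrary and have nothing to do with my actual array):

```shell
# Toy RAID5 stripe: three data "blocks" (single bytes) plus XOR parity.
d0=$((0x3c)); d1=$((0xa5)); d2=$((0x0f))
parity=$(( d0 ^ d1 ^ d2 ))

# Pretend the disk holding d1 died: XOR the survivors with the parity
# to get the lost block back.
rebuilt=$(( d0 ^ d2 ^ parity ))

printf 'parity=0x%02x rebuilt=0x%02x\n' "$parity" "$rebuilt"
# prints: parity=0x96 rebuilt=0xa5
```

With two blocks of the same stripe gone there is one equation and two unknowns, which is why a second failed disk is fatal.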

The (temporary) solution

The superblock information for each disk in the RAID array was extracted with
mdadm --examine /dev/sda1
and so on. I saved (that is, copy-pasted) the exact output for all disks, to use (at least) as input for the options to the mdadm --create command in the next paragraph.
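Instead of copy-pasting, the output could be saved to files with a small loop. This is only a sketch: save_superblocks, the MDADM variable and the output directory are names I made up for illustration, and the files should of course live somewhere other than the broken array.

```shell
# Save the "mdadm --examine" output of every member partition to its own file.
# MDADM can be overridden (e.g. for testing); it defaults to the real mdadm.
MDADM=${MDADM:-mdadm}

save_superblocks() {
    outdir=$1; shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        # /dev/sda1 -> <outdir>/sda1.examine; stderr is kept as well, since a
        # disk without a superblock only produces an error message.
        "$MDADM" --examine "$dev" > "$outdir/$(basename "$dev").examine" 2>&1
    done
    return 0
}
```

As root it would be called as, for example, save_superblocks /root/md0-superblocks /dev/sd[a-f]1.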

I ran the following (Warning: do this at your own risk! It’s incredibly important to set the right options here, so be sure you’ve read and understood man mdadm first! Failing to do so will result in loss of data!):
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=6 missing /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
This put the array back up in degraded mode, with the faulty disk sda set to missing. This was sufficient to be able to mount the array, but not without problems.
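The command above relies on mdadm’s defaults matching the geometry of the original array, and those defaults can differ between mdadm versions. A more defensive variant spells the geometry out. The sketch below only prints such a command, it does not run it. The 64k chunk and algorithm 2 (left-symmetric) are read off the /proc/mdstat excerpt above; metadata version 0.90 is my assumption, based on the kernel autodetecting the array at boot. build_create_cmd is a made-up helper name, and the real values must come from your own mdadm --examine output.

```shell
# Print (do NOT execute) an mdadm --create command with explicit geometry.
# Assumed values: metadata 0.90, 64 KiB chunk, left-symmetric layout.
build_create_cmd() {
    md=$1; shift
    # "$#" is now the number of member slots, including any "missing" ones.
    printf 'mdadm --create --verbose %s --level=5 --raid-devices=%d' "$md" "$#"
    printf ' --metadata=0.90 --chunk=64 --layout=left-symmetric'
    # Members must be listed in their original slot order.
    printf ' %s' "$@"
    printf '\n'
}

build_create_cmd /dev/md0 missing /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
```

Printing the command first leaves a chance to compare every option against the saved superblock information before anything is written to the disks.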

Mounting the degraded array

A first attempt at “mount /dev/md0 /mnt” gave the error “mount: Structure needs cleaning”. This is the XFS filesystem saying that it’s not entirely consistent. I could possibly run xfs_repair (I did, in no-modify mode with the -n option), but I’m not willing to risk my data on this yet. Instead, I got the device mounted with this command:
mount -r -o norecovery /dev/md0 /mnt

In this way, I can now access my data and back it up. Some data is probably corrupted as long as the XFS file system is unrepaired, but hopefully most of it is recoverable…
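Put together, the check, mount and copy sequence might look roughly like the sketch below. It is a dry run by default and only echoes the commands. rescue_copy, the RUN variable, the use of rsync and the destination path are all my own assumptions for illustration; any copy tool and any target with enough free space would do.

```shell
# Dry run by default: every command is only echoed.
# Set RUN to an empty string to really execute them (as root).
RUN=${RUN-echo}

rescue_copy() {
    dev=$1; mnt=$2; dest=$3
    # No-modify check: reports inconsistencies without touching the device.
    $RUN xfs_repair -n "$dev"
    # Read-only mount that skips XFS log recovery.
    $RUN mount -r -o norecovery "$dev" "$mnt" || return 1
    # Copy everything off the degraded array.
    $RUN rsync -a "$mnt"/ "$dest"/
}
```

A call such as rescue_copy /dev/md0 /mnt /path/to/backup (the destination being hypothetical) then prints the three commands in order.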

Some of the pages I visited in my frustration