Linux RAID-5 Crash Recovery at CFHT

I. Single Disk Failures, Software RAID

  1. Warn the appropriate users that you are about to put a lot of stress on their RAID disk and give them the chance to manually copy off critical data before starting reconstruction.

  2. Once the user no longer needs the machine, go into "reconfig" and select "CFHT Maintenance" below the profile, if available. This disables most logins and nfsd, and moves SSH to port 2267.

  3. Check the failed drive's SMART status to see if it really is going bad:
    # smartctl -cv /dev/hdx
    
    Look for "more than a few" raw counts (last column) of "Reallocated Sectors", and/or "Reallocated Events", or anything else unusual that would indicate the drive really is failing. If the drive looks OK, or you are able to determine that the RAID system ended up removing the drive because of a controller problem or something other than bad media, then skip to step 6.

  4. If the drive is SCSI, or an IDE slave, disconnect the power connector from the drive, reboot, and make sure /dev/hdb, d, f, h, j, l, n, or p (i.e., whichever slave it is) no longer shows up (i.e., the command "ls /dev/hd?" shows all disks except the one you're trying to replace).

    If the drive is an IDE master, disconnect its power connector, reboot, and make sure that both the master itself (/dev/hda, c, e, g, i, k, m, or o, i.e., whichever hdx you're trying to remove) AND the disk after it (e.g., if hdx=hde, then hdf also) no longer show up.

    This procedure must be done carefully, to ensure that we are replacing the right disk. If other connectors have to be unplugged to reach a dead disk, be sure they go back exactly as they came.

  5. Replace the disk.

  6. Make sure the disk is partitioned correctly and then run:
    # raidhotadd /dev/md0 /dev/hdx
    

  7. As reconstruction begins, it is very possible that the array will fail again. If it does, be careful to check which disk has failed. When an additional failure occurs during reconstruction, /proc/mdstat can look just as it does for a single disk failure, even though you are really dealing with a multiple disk failure: the first failed disk is only partially reconstructed, and the newly failed disk is the only one marked with '_'.

    Bottom line is, during reconstruction, be very careful to read messages about which disk failed. If a different disk fails, DO NOT REPLACE that second disk. Instead, follow the procedure below for multiple disk failures.
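
    As an aid for step 7, here is a minimal sketch of a helper that reads /proc/mdstat-style text and reports which slots are marked '_'. The function name md_down_slots is made up for illustration, and it assumes the status field looks like "[U_UU]", with position N (counting from 0) corresponding to raid-disk N in /etc/raidtab. Verify that against your kernel's output before relying on it.

```shell
#!/bin/sh
# md_down_slots (hypothetical helper): read /proc/mdstat-style text on stdin
# and print, for each md array, the 0-based positions shown as '_' in the
# [UU_U]-style status field.
# ASSUMPTION: position N in that field corresponds to raid-disk N.
md_down_slots() {
    awk '
        # Remember the array name from lines such as "md0 : active raid5 ..."
        /^md[0-9]+ :/ { dev = $1 }
        # The status field contains only U and _ between square brackets,
        # so counters like [4/3] and the recovery progress bar do not match.
        dev != "" && match($0, /\[[U_]+\]/) {
            flags = substr($0, RSTART + 1, RLENGTH - 2)
            down = ""
            for (i = 1; i <= length(flags); i++)
                if (substr(flags, i, 1) == "_")
                    down = down (down == "" ? "" : ",") (i - 1)
            print dev ": " (down == "" ? "all slots up" : "down slots " down)
            dev = ""
        }
    '
}
```

    Running "md_down_slots < /proc/mdstat" once before raidhotadd and again after any error during reconstruction makes it easier to notice that the slot marked '_' has changed. The kernel messages still have the final word on which disk failed.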

II. Multiple Disk Failures

The software RAID system may sometimes remove multiple disks from a RAID-5. In this case, the RAID is no longer readable until the array is re-created with mkraid. There is a way, however, to do this without erasing the data on the disks. NO DISKS should be replaced if you need to attempt this procedure:

  1. Do NOT take apart the machine.

  2. Identify which of the failed disks was the first to fail. This should be clear from the syslog. IF YOU GUESS WRONG, ALL DATA IS LOST.

  3. In /etc/raidtab, on the line that follows the "device" entry for the disk identified in step 2, change "raid-disk" to "failed-disk". Do not change the number on that line. If other disks also failed, note which ones they were, but do not change their entries. (Remember to change /etc/raidtab through "reconfig", not by editing it directly.)

  4. Edit fstab (or rc.local if that's where the RAID is being mounted) and add the "ro" flag so that XFS mounts the RAID read-only. Or, to be safe, just comment out all lines which would mount /dev/md0 and mount it manually.

  5. Remake the raid (YIKES!) with a command like this:
    # mkraid --force --dangerous-no-resync /dev/md0
    

  6. Either reboot, or manually get the RAID back with these commands:
    # raidstart /dev/md0
    # cat /proc/mdstat  # (you should see only the one failed disk down.)
    # mount -o ro /dev/md0 /local/data
    # exportfs -ar
    

Every time the user accesses a file that occupies some of the bad blocks, the array may go down again, and it may become necessary to repeat steps 5 and 6. Let the user know that they should avoid accessing the files which caused the array to go back down (those files are lost).
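
For step 2 of the multiple-failure procedure, the sketch below pulls the earliest failure report out of a syslog stream. The function names are made up for illustration, and the pattern assumes the kernel logs failures as "raid5: Disk failure on <dev>, disabling device". Check a real failure message on your system and adjust the pattern if it differs. It also assumes the input is in chronological order, so rotated log files must be fed in oldest first.

```shell
#!/bin/sh
# first_raid_failure (hypothetical helper): print the earliest line on stdin
# that reports a RAID-5 member being disabled.
# ASSUMPTION: failures are logged as
#   "raid5: Disk failure on <dev>, disabling device"
first_raid_failure() {
    grep 'raid5: Disk failure on' | head -n 1
}

# failed_device (hypothetical helper): extract just the device name
# from such a line.
failed_device() {
    sed -n 's/.*raid5: Disk failure on \([^, ]*\).*/\1/p'
}
```

For example, "first_raid_failure < /var/log/messages | failed_device" would print the device named in the first matching line (the log path is an assumption, so use wherever your syslog writes kernel messages). Read the surrounding log lines yourself before editing /etc/raidtab, since a wrong answer here loses all data.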