# smartctl -cv /dev/hdx
Look for "more than a few" raw counts (last column) of "Reallocated Sectors" and/or "Reallocated Events", or anything else unusual that would indicate the drive really is failing. If the drive looks OK, or you can determine that the RAID system removed the drive because of a controller problem or something other than bad media, skip to step 6.
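The "more than a few" check can be scripted. The sketch below scans a saved copy of the smartctl attribute table; the sample lines, raw counts, and the threshold of 5 are all assumptions for illustration, not values from this document:

```shell
# Hypothetical smartctl attribute lines (illustrative values only):
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       12'

# Take the largest raw count (last column) of the Reallocated attributes.
worst=$(printf '%s\n' "$sample" | awk '/Reallocated/ {print $NF}' | sort -n | tail -1)

# "More than a few" is fuzzy; 5 is an arbitrary threshold for this sketch.
if [ "$worst" -gt 5 ]; then
    echo "drive likely failing"
else
    echo "drive looks OK"
fi
```

On a live system you would pipe real smartctl output through the same awk filter instead of using a saved sample.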
If the drive is an IDE master, disconnect its power connector and verify that both the master device (/dev/hda, c, e, g, i, k, m, or o — whichever hdx master you are removing) AND the disk behind it (e.g., if hdx=hde, then hdf as well) no longer show up.
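A quick way to confirm the kernel no longer sees the pulled drive is /proc/partitions. The sketch below checks a hypothetical saved copy, using hde/hdf from the example above; on a live system you would grep /proc/partitions directly:

```shell
# Hypothetical /proc/partitions contents after unplugging hde (sketch):
partitions='major minor  #blocks  name
   3     0   40031712 hda
   3     1   40029232 hda1'

# Both the master (hde) and the disk behind it (hdf) should be absent.
for d in hde hdf; do
    if printf '%s\n' "$partitions" | grep -qw "$d"; then
        echo "$d still present"
    else
        echo "$d gone"
    fi
done
```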
This procedure must be done carefully, to ensure that we are replacing the right disk. If other connectors have to be unplugged to reach a dead disk, be sure they go back exactly as they came.
# raidhotadd /dev/md0 /dev/hdx
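After raidhotadd, reconstruction progress appears in /proc/mdstat. The snippet below pulls the percentage out of a saved status line; the device names and numbers are invented for illustration:

```shell
# Hypothetical /proc/mdstat excerpt during reconstruction (sketch):
mdstat='md0 : active raid5 hde1[3] hdg1[1] hdi1[2]
      80064 blocks level 5, 32k chunk, algorithm 2 [3/2] [_UU]
      [==>..................]  recovery = 12.6% (10112/80064) finish=3.2min'

# Extract the recovery percentage; on a live system, read /proc/mdstat
# directly instead of this saved sample.
printf '%s\n' "$mdstat" | grep -o 'recovery = [0-9.]*%'
```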
The bottom line: during reconstruction, read the messages carefully to see which disk failed. If a different disk fails, DO NOT REPLACE that second disk. Instead, follow the procedure below for multiple disk failures.
The software RAID system may sometimes remove multiple disks from a RAID-5 array. In this case the RAID is no longer readable until the array is re-created. There is a way, however, to do this without erasing the data on the disks. NO DISKS should be replaced if you need to attempt this procedure:
# mkraid --force --dangerous-no-resync /dev/md0
# raidstart /dev/md0
# cat /proc/mdstat
(You should see only the one failed disk down.)
# mount -o ro /dev/md0 /local/data
# exportfs -ar
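The "only the one failed disk down" check against /proc/mdstat can be made explicit by counting underscores in the member-status brackets ([_UU] means one member down). A sketch against a hypothetical status line:

```shell
# Hypothetical mdstat status line: [_UU] means member 0 is down (sketch).
line='      80064 blocks level 5, 32k chunk, algorithm 2 [3/2] [_UU]'

# Count the '_' characters inside the [...] member-status brackets.
down=$(printf '%s' "$line" | grep -o '\[[_U]*\]' | tr -cd '_' | wc -c)
echo "members down: $down"
```

If the count is anything other than 1, stop and reconsider before remounting or exporting the array.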
Every time a user accesses a file occupying some of the bad blocks, it may become necessary to repeat steps 5 and 6. Let the users know that they should avoid accessing the files which caused the array to go down (those files are lost).