Notes on configuring the 11 disk RAIDs

Hardware:

The hardware used for mounting four extra drives internally consists essentially of a drive crate, identical to the one in the front of a 420, mounted to the cover for the upper rear portion of the chassis. The cover is modified to allow access to the side and rear for ventilation. An external fan is mounted to improve cooling, though heat generation is not a significant issue when using 5400 rpm drives. Power from the 420's power supply is also ample for 11 drives of this speed. Wiley Knight designed the modified drive bracket and has a jig for producing them.

On a newer 420 I had to make a change in the BIOS to get the machine to boot unattended with the Promise cards. Under "Integrated Devices" I set "3Com ethernet" to "ON w/ MBA", which cleared the problem (something related to PCI bus mastering).

Kernel:

Once the drives are installed the machine must be booted with a Linux kernel that supports a large number of IDE devices and with version 0.9 of the RAID tools, which is needed for more than eight elements. The patches used are:

ide.2.2.16.all.20000805.patch
raid-2.2.16-A0
linux-2.2.16-nfsv3-0.21.3.dif
dhiggen-over-0.21.3

Additionally, Promise PDC20246/PDC20262 support and the associated Special UDMA feature must be enabled before compilation to make the Promise 66 or 100 IDE interfaces work.
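
If it helps when configuring the kernel, the relevant options probably look something like the following in .config; the exact symbol names can vary with the IDE and RAID patch versions, so treat this as a sketch rather than a definitive list:

CONFIG_BLK_DEV_PDC202XX=y     # Promise PDC20246/PDC20262 support
CONFIG_PDC202XX_BURST=y       # the "Special UDMA" feature
CONFIG_BLK_DEV_MD=y           # multiple devices (md) driver
CONFIG_MD_RAID0=y             # striping, used for the /tmp array
CONFIG_MD_RAID5=y             # the big array
CONFIG_AUTODETECT_RAID=y      # auto-start arrays from partitions of type fd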

UPDATE: Using the new IDE patch and the 2.4.16 kernel introduces some changes in the RAID setup for large arrays. Fortunately, software level 5 arrays are much faster. Use of the Promise Ultra100TX2 and Ultra133TX2 (PDC20268 and PDC20269 respectively) involves some caveats. Most notably, there cannot be more than 2 TX2-series controllers in a host; accessing drives on the third or fourth controller hangs the system. One workaround for this is to mix controller types. For example, a system can have two Ultra100TX2 cards and one Ultra100 (PDC20262) card to support twelve devices, or two Ultra133TX2 cards and one Ultra100TX2 card (for support of disks over 137Gb the Ultra100TX2 needs to be updated with firmware that supports 48-bit LBA). There appears to be no limit to the number of non-TX2 series cards that can be installed in a system.
Disks larger than 137Gb can only be utilized on systems running the newest Linux IDE patch (currently kernel 2.4.16-cfht includes this patch) and with a controller that has 48-bit LBA support, which is native on the Promise Ultra133TX2 controllers and available via a firmware upgrade on Ultra100TX2 and 3Ware controllers.
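
To confirm that a given kernel/controller combination really sees the full capacity of a large drive, a quick check is something like the following (sizes in /proc/partitions are 1Kb blocks; /dev/hde is just an example device):

cat /proc/partitions          # kernel's view of each drive's size
fdisk -l /dev/hde             # should report the full geometry/size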

Tools:

You will also need a more recent version of hdparm than our current distribution includes in order to address the large number of IDE devices. It's usually /sbin/hdparm and you can copy it from one of the existing 11 disk RAID servers.
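
Once the newer binary is in place, a quick way to confirm it can address the devices beyond the first eight (assuming a drive actually exists at /dev/hdi) is something like:

/sbin/hdparm -i /dev/hdi          # identify the drive
/sbin/hdparm -d1 -c1 /dev/hdi     # enable DMA and 32-bit I/O if they aren't already on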

Preparing the disks:

Once the machine is booted the following commands initialize the disks. They presume that you have skipped /dev/hdd, the second device on the second internal controller.

Partitioning the disks can be a little tricky. First of all for the Maxtor disks it's important that the slave devices are jumpered correctly. Maxtor's documentation describes two ways of configuring a slave device, but for the 80gig and 40gig drives only one configuration will present a geometry to linux that will work. For these two drives the correct setting is a jumper across J47 and J49.

For the 80g disks I created one 100M partition at the beginning of the disk and put the rest of the disk in the second partition. The first partition should be set to "linux swap" on the first five disks and to "linux" on the remainder. Each of the second partitions should be set to "linux raid autodetect".

IMPORTANT!: the Maxtor 80gig disks shipped in at least three different geometries, and forcing the drive geometry (and in some cases using the default geometry) can be very dangerous. The only solution that I have found is to use a very conservative partition table like the one below. In an eleven disk array you will lose 20 gigabytes or so from the maximum theoretical capacity of the drives, but forcing a partition table meant for a larger disk onto units that are slightly smaller will cause data corruption, either immediately or later on once the disks start to fill up with data.

for i in a b c e f g h i j k l;do echo -e "x\nc\n9726\nh\n255\ns\n63\nr\nn\np\n1\n1\n10\nn\np\n2\n11\n9726\nt\n2\nfd\nv\nw\n" | fdisk /dev/hd$i ; done
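
For reference, the keystroke sequence piped to fdisk above does the following (same commands, annotated):

x              (enter expert mode)
c 9726         (force the cylinder count to 9726)
h 255          (force 255 heads)
s 63           (force 63 sectors per track)
r              (return to the main menu)
n p 1 1 10     (new primary partition 1, cylinders 1 through 10 -- the small partition)
n p 2 11 9726  (new primary partition 2, cylinders 11 through 9726 -- the RAID element)
t 2 fd         (set partition 2's type to fd, linux raid autodetect)
v              (verify the partition table)
w              (write the table to disk and exit)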

This is the command used to partition the 100Gb disks with one small 100Mb partition and one large 100Gb partition:
for i in a e f g h i j k l o p; do echo -e "x\nc\n12182\nh\n255\ns\n63\nr\nn\np\n1\n\n+100M\nn\np\n2\n\n\nt\n2\nfd\nw\n" | fdisk /dev/hd$i ; done

For the 160Gb disks in pelmo:
for i in e f g h i j k l m n o ; do echo -e "n\np\n1\n\n+200M\nn\np\n2\n\n\nw\n" | fdisk /dev/hd$i; done

It is a very, very good idea to reboot the machine at this point.

Creating the arrays:

The first five disks will use their small partitions as system swap space. The small partitions on the last six disks will be striped together as another RAID for use as a local /tmp filesystem.

With the 0.9 RAID tools the arrays are defined in /etc/raidtab, which should look something like this:
raiddev /dev/md0
        raid-level              5
        nr-raid-disks           11
        nr-spare-disks          0
        chunk-size              128
        persistent-superblock   1
        device                  /dev/hda2
        raid-disk               0
        device                  /dev/hdb2
        raid-disk               1
        device                  /dev/hdc2
        raid-disk               2
        device                  /dev/hde2
        raid-disk               3
        device                  /dev/hdf2
        raid-disk               4
        device                  /dev/hdg2
        raid-disk               5
        device                  /dev/hdh2
        raid-disk               6
        device                  /dev/hdi2
        raid-disk               7
        device                  /dev/hdj2
        raid-disk               8
        device                  /dev/hdk2
        raid-disk               9
        device                  /dev/hdl2
        raid-disk               10

raiddev /dev/md1
        raid-level              0
        nr-raid-disks           6
        nr-spare-disks          0
        chunk-size              16
        persistent-superblock   1
        device                  /dev/hdg1
        raid-disk               0
        device                  /dev/hdh1
        raid-disk               1
        device                  /dev/hdi1
        raid-disk               2
        device                  /dev/hdj1
        raid-disk               3
        device                  /dev/hdk1
        raid-disk               4
        device                  /dev/hdl1
        raid-disk               5

This defines two RAID arrays: md0 is the bulk of the disks, and md1 can be mounted as /tmp. Note that md1 is level 0, so there is no redundancy, and it uses an interleaving factor (chunk-size) of 16, which is more effective for the smaller files common in /tmp. We use a large interleaving factor (128k) for the big array because we will most likely be putting large files there, and a wider interleave should increase performance.

Now we create the arrays one at a time, starting with the big one:
mkraid /dev/md0
To verify that things went well try:
cat /proc/mdstat
Md0 should still be cranking and have a considerable time left (like 2,600 minutes!); a level 5 array needs a full parity resync, unlike the small level 0 array, which comes up almost instantly when created. The [UUUUUUUUUUU] in the entry for md0 represents the elements; if one drive fails, a "U" will be replaced by an underscore. /proc/mdstat is the primary place for information regarding the RAID arrays. It's O.K. to use a level 5 device while it's resyncing, though performance will be poor. It's also O.K. to shut the machine down; the resync will resume where it left off when the machine comes back up. You can wait to proceed until the array is complete or continue at a slower pace.

Now create the second array:
mkraid /dev/md1
This should come up almost instantly as there is no resync involved in a level 0 RAID.

Filesystems:

The following command creates the ext2 filesystem for the big array. Notice that there is no root allocation (-m 0) and the stride matches our interleaving factor:
mke2fs -b 4096 -i 65536 -m 0 -R stride=32 /dev/md0
Here "stride" should be equal to the interleaving factor specified in the raidtab file divided by the block size of the filesystem. Block size is specified by the -b flag here as 4Kb so in this example with a 128Kb interleave we should use a stride of 128/4=32.

This would be a very inefficient filesystem for the small files we would find in /tmp, so just use the default block size and a smaller interleave and stride for this array:

mke2fs /dev/md1 -m 0 -R stride=4
This command formats the swap partitions:
for i in a b c e f ; do mkswap /dev/hd${i}1 ; done
There need to be entries in /etc/fstab for the swap areas and the filesystems:

/dev/hda1    swap    swap    defaults,pri=1   0 0
/dev/hdb1    swap    swap    defaults,pri=1   0 0
/dev/hdc1    swap    swap    defaults,pri=1   0 0
/dev/hde1    swap    swap    defaults,pri=1   0 0
/dev/hdf1    swap    swap    defaults,pri=1   0 0
/dev/md0    /local/data    ext2    defaults    0 0
/dev/md1    /tmp           ext2    defaults    0 0
 
You can reboot at this point or just get started:

mount -a

swapon -a
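
A few generic sanity checks after mounting, just to confirm everything came up:

cat /proc/mdstat               # both arrays listed; md0 may still be resyncing
cat /proc/swaps                # all five swap partitions active
df -h /local/data /tmp         # filesystems mounted with the expected sizes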

Questions and problems should go to Kanoa.