Notes on configuring the 11 disk RAIDs
Hardware:
The hardware used for mounting four extra drives internally consists essentially
of a drive crate identical to the one in the front of a 420 mounted to the
cover for the upper rear portion of the chassis. The cover is modified to
allow access to the side and rear for ventilation. An external fan is mounted
to improve cooling though heat generation is not a significant issue when
using 5400 rpm drives. Power from the 420's power supply is also ample for
11 drives of this speed. Wiley Knight designed the modified drive bracket
and has a jig for producing them.
On a newer 420 I had to make a change in the BIOS to get the machine to boot
unattended with the Promise cards. Under "Integrated Devices" I set "3Com
ethernet" to "ON w/ MBA" which cleared the problem, apparently something
related to PCI busmastering.
Kernel:
Once the drives are installed the machine must be booted with a linux kernel
that supports a large number of IDE devices, along with version 0.9 of the
RAID tools, which is needed for arrays of more than eight elements. The
patches used are:
ide.2.2.16.all.20000805.patch
raid-2.2.16-A0
linux-2.2.16-nfsv3-0.21.3.dif
dhiggen-over-0.21.3
Additionally, Promise PDC20246/PDC20262 support and the associated Special
UDMA feature must be enabled before compilation to make the Promise 66 or
100 IDE interfaces work.
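For reference, in a 2.4-era tree the relevant options live under the IDE
chipset support menu. The symbol names below are from that tree and may differ
in the patched 2.2 tree, so treat this fragment as a hint for what to look for
in make menuconfig rather than an exact recipe:

```
# Kernel .config fragment (2.4-era symbol names; verify in your tree)
CONFIG_BLK_DEV_PDC202XX=y
CONFIG_PDC202XX_BURST=y      # the "Special UDMA Feature" prompt
```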
UPDATE: Using the new IDE patch and the 2.4.16 kernel introduces some changes
in the RAID setup for large arrays. Fortunately, software level 5 arrays are much
faster. Use of the Promise Ultra100TX2 and Ultra133TX2 (PDC20268 and PDC20269
respectively) involves some caveats. Most notably there cannot be more than 2
TX2 series controllers in a host. Accessing drives on the third or fourth controller
hangs the system. One workaround for this is to mix controller types. For example
a system can have two Ultra100TX2 cards and one Ultra100 (PDC20262) card to support
twelve devices or two Ultra133TX2 cards and one Ultra100TX2 card (for support of
disks over 137GB the Ultra100TX2 needs to be updated with firmware that supports
48-bit LBA). There appears to be no limit to the number of non-TX2 series cards
that can be installed in a system.
Disks larger than 137GB can only be used on systems running the newest linux
IDE patch (currently kernel 2.4.16-cfht includes this patch) and with a controller
that has 48-bit LBA support, native on the Promise Ultra133TX2 controllers and
available with a firmware upgrade on Ultra100TX2 and 3Ware controllers.
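The 137GB figure comes from the 28-bit LBA addressing limit: 2^28 addressable
sectors of 512 bytes each. The arithmetic can be checked in the shell:

```shell
# 28-bit LBA addressing limit in bytes: 2^28 sectors * 512 bytes/sector
echo $(( (1 << 28) * 512 ))
# -> 137438953472 (about 137GB)
```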
Tools:
You will also need a more recent version of hdparm than our current distribution
includes in order to address the large number of IDE devices; it is usually
installed as /sbin/hdparm and can be copied from one of the existing 11 disk
RAID servers.
Preparing the disks:
Once the machine is booted the following commands initialize the disks. They
presume that you have skipped /dev/hdd, the second device on the second internal
controller.
Partitioning the disks can be a little tricky. First of all, for the Maxtor
disks it's important that the slave devices are jumpered correctly. Maxtor's
documentation describes two ways of configuring a slave device, but for the
80gig and 40gig drives only one configuration will present a geometry to linux
that will work. For these two drives the correct setting is a jumper across
J47 and J49.
For the 80GB disks I created one 100M partition at the beginning of the disk
and put the rest of the disk in the second partition. The first partition
should be set to type "linux swap" on the first five disks and "linux" on the
remainder. The second partition on every disk should be type
"linux raid autodetect".
IMPORTANT!: the Maxtor 80GB disks shipped in at least three different
geometries, and forcing the drive geometry (and in some cases using the default
geometry) can be very dangerous. The only solution that I have found is to use
a very conservative partition table like the one below. In an eleven disk array
you will lose 20 gigabytes or so from the maximum theoretical capacity of the
drives, but forcing units that are slightly smaller to use a partition table
meant for a larger disk will cause data corruption, either immediately or after
the disks start to fill up with data.
for i in a b c e f g h i j k l;do echo -e "x\nc\n9726\nh\n255\ns\n63\nr\nn\np\n1\n1\n10\nn\np\n2\n11\n9726\nt\n2\nfd\nv\nw\n" | fdisk /dev/hd$i ; done
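As a sanity check on the keystroke script above: under the forced 9726/255/63
geometry a cylinder is 255 x 63 x 512 bytes, so the 10-cylinder first
partition works out to about 78MiB, in the same ballpark as the nominal 100M:

```shell
# Size in bytes of the 10-cylinder first partition under the forced
# 9726/255/63 geometry: cylinders * heads * sectors * 512.
cylinders=10; heads=255; sectors=63
echo $(( cylinders * heads * sectors * 512 ))
# -> 82252800 (about 78MiB)
```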
This is the command used to partition the 100GB disks with one small 100MB partition and one large partition covering the rest of the disk:
for i in a e f g h i j k l o p; do echo -e "x\nc\n12182\nh\n255\ns\n63\nr\nn\np\n1\n\n+100M\nn\np\n2\n\n\nt\n2\nfd\nw\n" | fdisk /dev/hd$i ; done
For the 160GB disks in pelmo:
for i in e f g h i j k l m n o ; do echo -e "n\np\n1\n\n+200M\nn\np\n2\n\n\nw\n" | fdisk /dev/hd$i; done
It is a very, very good idea to reboot the machine at this point.
Creating the arrays:
The first five disks will use their small partitions as system swap space.
The last six we will stripe together as another RAID for use as a local /tmp
filesystem.
With the 0.9 RAID tools the arrays are defined in /etc/raidtab, which should
look something like this:
raiddev /dev/md0
raid-level 5
nr-raid-disks 11
nr-spare-disks 0
chunk-size 128
persistent-superblock 1
device /dev/hda2
raid-disk 0
device /dev/hdb2
raid-disk 1
device /dev/hdc2
raid-disk 2
device /dev/hde2
raid-disk 3
device /dev/hdf2
raid-disk 4
device /dev/hdg2
raid-disk 5
device /dev/hdh2
raid-disk 6
device /dev/hdi2
raid-disk 7
device /dev/hdj2
raid-disk 8
device /dev/hdk2
raid-disk 9
device /dev/hdl2
raid-disk 10
raiddev /dev/md1
raid-level 0
nr-raid-disks 6
nr-spare-disks 0
chunk-size 16
persistent-superblock 1
device /dev/hdg1
raid-disk 0
device /dev/hdh1
raid-disk 1
device /dev/hdi1
raid-disk 2
device /dev/hdj1
raid-disk 3
device /dev/hdk1
raid-disk 4
device /dev/hdl1
raid-disk 5
This will define two RAID arrays: md0 is the bulk of the disks, and md1 can
be mounted as /tmp. Note that we are making md1 level 0 so there is no redundancy
and we are using an interleaving factor (chunk-size) of 16 which will be more
effective for smaller files which are more common in /tmp. We use a large
interleaving factor (128k) for the big array because we will most likely be
putting large files there and a wider interleave should increase performance.
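To make those numbers concrete (assuming the 80GB drives; a level 5 array
gives up one disk's worth of space to parity):

```shell
# md0: 11 elements at level 5 leaves 10 data chunks per full stripe.
disks=11; chunk_kb=128; disk_gb=80
echo "$(( (disks - 1) * chunk_kb ))kB per full data stripe"
echo "$(( (disks - 1) * disk_gb ))GB nominal usable capacity"
```

The real capacity will come in somewhat lower because of the conservative
partition table discussed above.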
Now we create the arrays one at a time, starting with the big array:
mkraid /dev/md0
To verify that things went well try:
cat /proc/mdstat
You should see md0 listed and still cranking away on its initial resync, with
considerable time left (like 2,600 minutes!), since a level 5 array requires
a full parity calculation. The [UUUUUUUUUUU]
in the entry for md0 represents the elements; if one drive fails, a "U" will
be replaced by an underscore. /proc/mdstat is the primary place for information
regarding the RAID arrays. It's O.K. to use a level 5 device while it's
resyncing, though performance will be poor. It's also O.K. to shut the machine
down; the resync will resume where it left off when the machine comes back up.
You can wait to proceed until the array is complete or continue at a slower
pace.
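One way to script a quick health check is to look for an underscore in that
status field. The sample line below is illustrative, not captured from a real
host; on a live machine read /proc/mdstat directly:

```shell
# Illustrative status field for a healthy 11-element array; a failed
# element would show up as e.g. [UUUU_UUUUUU].
status='[11/11] [UUUUUUUUUUU]'
case "$status" in
  *_*) echo "array degraded" ;;
  *)   echo "all elements up" ;;
esac
# prints "all elements up"
```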
Now create the second array:
mkraid /dev/md1
This should come up almost instantly as there is no resync involved in a
level 0 RAID.
Filesystems:
The following command creates the ext2 filesystem for the big array. Notice
that there is no root block reservation (-m 0), and the stride matches our
interleaving factor:
mke2fs -b 4096 -i 65536 -m 0 -R stride=32 /dev/md0
Here "stride" should be equal to the interleaving factor specified in the raidtab file
divided by the block size of the filesystem. Block size is specified by the -b flag here
as 4Kb so in this example with a 128Kb interleave we should use a stride of 128/4=32.
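That division is trivial but worth re-checking in the shell whenever the chunk
size or block size changes:

```shell
# stride = RAID chunk size / filesystem block size (both in kB)
chunk_kb=128; block_kb=4
echo $(( chunk_kb / block_kb ))
# -> 32
```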
A filesystem like that would be very inefficient for the small files typically
found in /tmp, so just use the default block size and a smaller interleave and
stride for that array:
mke2fs /dev/md1 -m 0 -R stride=4
This command formats the swap partitions:
for i in a b c e f ; do mkswap /dev/hd${i}1 ; done
There need to be entries in /etc/fstab for the swap areas and the filesystems:
/dev/hda1 swap swap defaults,pri=1 0 0
/dev/hdb1 swap swap defaults,pri=1 0 0
/dev/hdc1 swap swap defaults,pri=1 0 0
/dev/hde1 swap swap defaults,pri=1 0 0
/dev/hdf1 swap swap defaults,pri=1 0 0
/dev/md0 /local/data ext2 defaults 0 0
/dev/md1 /tmp ext2 defaults 0 0
You can reboot at this point or just get started:
mount -a
swapon -a
Questions and problems should go to Kanoa.