[lug] RAID redundancy with boot sectors?

David L. Anselmi anselmi at anselmi.us
Sun Nov 26 17:42:16 MST 2006

Nate Duehr wrote:
> "Real" RAID's whole purpose in life was to add uptime.  A disk fails, 
> it's removed and replaced seamlessly.  There's almost no way to do this 
> with the "Simple" software RAID.  Either you're swapping cables, or 
> pulling a disk out and rebooting (possibly with an fsck pass if the box 
> went down in a particularly nasty way), etc.  So you don't gain the 
> "uptime" factor.

In my world this isn't true, especially if you're using the second drive 
for backups.

Good backups, no RAID:
- disk fails, machine is down
- get to the console to see what the problem is (45 minutes)
- verify the drive is bad and needs replacing (30 minutes, at least)
- buy a new drive (at least an hour)
- install the new drive (10 minutes)
- boot a rescue CD and restore data (30 minutes, at least)
- reboot, machine is up

Good backups, RAID:
- disk fails, machine is down
- get to the console to see what the problem is (45 minutes)
- pull the bad drive (10 minutes)
- reboot, machine is up
- now I can spend some time figuring out the best replacement strategy, 
verifying the bad drive is really bad, and getting ready for the 
replacement (days)
- after hours, take down machine and fix

So for me RAID shortens the critical path by at least half, and takes 
away the pressure during the troubleshooting and fixing parts.  To me 
that's worth it (like you said, playing with software RAID is fun anyway).
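
To put rough numbers on that, here is a small sketch that just adds up 
the estimates from the two lists above.  The step names are my own 
labels, the "at least" figures are treated as lower bounds, and the 
reboot steps are left out because no time was given for them.

```python
# Downtime estimates in minutes, copied from the two lists above.
# "At least" figures are used as lower bounds; reboot time is omitted
# because no duration was given for it.
no_raid_steps = {
    "get to the console": 45,
    "verify the drive is bad": 30,
    "buy a new drive": 60,
    "install the new drive": 10,
    "restore data from a rescue CD": 30,
}

raid_steps = {
    "get to the console": 45,
    "pull the bad drive": 10,
}


def critical_path_minutes(steps):
    """Sum the step durations that happen while the machine is down."""
    return sum(steps.values())


no_raid_total = critical_path_minutes(no_raid_steps)
raid_total = critical_path_minutes(raid_steps)

print(f"no RAID: {no_raid_total} minutes down")
print(f"RAID:    {raid_total} minutes down")
print(f"ratio:   {raid_total / no_raid_total:.2f}")
```

With these figures the no-RAID path is 175 minutes and the RAID path is 
55, so the downtime drops to under a third.  Your own estimates will 
differ, but swapping them into the two dictionaries makes the 
comparison easy to redo.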

> So for non-critical machines where "Real" RAID can't be purchased, 
> you end up no worse off to buy a pair of cheap bigger drives, reload,
> and restore backups... and now you have three drives (one smaller
> than the other two) and you usually didn't take that much more
> downtime than messing around rebuilding the software RAID.

The time difference might be smaller if the second drive is sitting on 
the shelf waiting to be installed.  But you don't really want to have to 
buy a new drive while the machine is down, if people aren't able to work.

I don't necessarily think real RAID is better:

In one office, motherboard failures beat drive failures 4 to 1.  Of 
course that's an outlier.

In another office the RAID controller was a single point of failure, and 
it failed before any of the drives.  The complexity of the recovery 
resulted in losing the on-disk data, requiring a restore from tape.

At that same office, kicking the plug out of a RAID resulted in losing 
all the on-disk data.

Granted, my experience is somewhat peculiar.  But RAID assumes the drives 
will fail first.  After you account for single points of failure and the 
complexity of a RAID setup, you might not have gained any reliability. 
It's still a useful tool in the right hands, but I've seen it used badly 
too often.
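
A back-of-the-envelope sketch of the single-point-of-failure argument, 
in case it helps.  The probabilities are made up purely for 
illustration (they are not measurements from any of the offices above), 
and the model assumes failures are independent, which real hardware 
does not guarantee.

```python
# Hypothetical model: a two-drive mirror behind one controller.
# Data is lost if the controller fails OR both drives fail.
# All probabilities are illustrative assumptions, not measured values,
# and failures are assumed to be independent.

def single_drive_loss(p_drive):
    """Chance of losing data with one plain drive and no RAID."""
    return p_drive


def mirror_behind_controller_loss(p_drive, p_controller):
    """Chance of losing data with a 2-drive mirror on one controller."""
    both_drives_fail = p_drive ** 2
    survives = (1 - p_controller) * (1 - both_drives_fail)
    return 1 - survives


p_drive = 0.05       # assumed chance a drive fails in some period
p_controller = 0.05  # assumed chance the controller fails in that period

plain = single_drive_loss(p_drive)
mirrored = mirror_behind_controller_loss(p_drive, p_controller)

print(f"single drive:            {plain:.4f}")
print(f"mirror + one controller: {mirrored:.4f}")
```

Under these assumptions, a controller that is as failure-prone as a 
drive cancels out the benefit of the mirror.  Set p_controller to 0 and 
the mirror wins easily, which is the case RAID is designed around.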

