A major disadvantage of RAID 5 is that it can tolerate only one drive failure per RAID set. If a second drive fails before the failed drive has been replaced and rebuilt from the parity data, data is lost. The window of exposure to this single point of failure, and therefore to data loss from a second failure, should be kept as short as possible.
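Conceptually, a RAID 5 rebuild reconstructs each stripe of the failed drive by XOR-ing the corresponding blocks from every surviving drive. The minimal Python sketch below illustrates the idea with byte strings standing in for disk blocks; the 3 + 1 layout and the function names are illustrative assumptions, not any controller's actual implementation.

```python
# Minimal sketch of RAID 5 parity reconstruction (illustrative only).
# Each "drive" is a list of equal-sized byte blocks.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def rebuild_failed_drive(surviving_drives):
    """Reconstruct every block of the failed drive from the survivors.

    For each stripe, the missing block (data or parity) is simply the
    XOR of the corresponding blocks on all surviving drives."""
    stripes = zip(*surviving_drives)
    return [xor_blocks(list(stripe)) for stripe in stripes]

# Hypothetical 3 + 1 example: three data drives plus one parity drive.
d0 = [b"AAAA", b"EEEE"]
d1 = [b"BBBB", b"FFFF"]
d2 = [b"CCCC", b"GGGG"]
parity = [xor_blocks([a, b, c]) for a, b, c in zip(d0, d1, d2)]

# If d1 fails, it can be rebuilt from d0, d2, and the parity drive.
rebuilt = rebuild_failed_drive([d0, d2, parity])
assert rebuilt == d1
```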
The time a RAID 5 set spends rebuilding should be as short as possible to mitigate this risk. However, be aware of the following designs, where RAID 5 is especially vulnerable because of longer rebuild periods:
* Very large RAID groups, such as 9 + 1 and larger, which require many more reads to reconstruct the failed drive.
* Very large drives, such as 1.5 TB drives or 500 GB Fibre Channel drives, which hold more data that must be rebuilt (a rough rebuild-time estimate follows this list).
* Slower drives, which struggle when they must supply the data to rebuild the replaced drive while simultaneously serving Production read and write I/O. This is typical of SATA drives, which tend to be slower under the random I/O that characterizes a RAID rebuild. A rebuild is one of the most stressful and intensive periods of a disk's life: not only must the disk service the Production I/O workload, it must also deliver the data needed for the rebuild. Statistically, disk drives are more likely to fail during a rebuild than during normal operation.
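To get a feel for why large drives lengthen the exposure window, the rough calculation below estimates a best-case rebuild time as drive capacity divided by an assumed sustained rebuild rate. The capacities and rates are illustrative assumptions only; real rebuilds that compete with Production I/O are usually much slower still.

```python
# Rough, best-case rebuild-time estimate: capacity / sustained rebuild rate.
# The capacities and rates below are assumptions for illustration; real
# rebuild throughput drops sharply when the array also serves Production I/O.

def rebuild_hours(capacity_gb, rebuild_rate_mb_per_s):
    return (capacity_gb * 1024) / rebuild_rate_mb_per_s / 3600

scenarios = [
    ("500 GB FC drive, lightly loaded array", 500, 80),
    ("1.5 TB SATA drive, lightly loaded array", 1500, 50),
    ("1.5 TB SATA drive, busy array", 1500, 10),
]

for label, capacity_gb, rate in scenarios:
    print(f"{label}: ~{rebuild_hours(capacity_gb, rate):.1f} hours")

# Example output (with these assumed rates):
#   500 GB FC drive, lightly loaded array: ~1.8 hours
#   1.5 TB SATA drive, lightly loaded array: ~8.5 hours
#   1.5 TB SATA drive, busy array: ~42.7 hours
```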
Several techniques can reduce or eliminate the risk of a dual drive failure, and most arrays implement many of them:
* Using proactive hot spares, which shorten the rebuild window significantly by engaging the hot spare before the drive actually fails. A disk rarely fails without warning: failure is usually preceded by recoverable read errors (detected and corrected using on-disk parity information) or write errors, neither of which is catastrophic by itself. When a preset threshold of these errors is reached before the disk finally fails, the failing drive is copied to a hot spare within the array. This is much quicker than a rebuild after a hard failure, because most of the imminently failing drive can simply be copied, and only the portions that are already unreadable must be reconstructed from parity on the other disks (a simple sketch of this policy follows this list).
* Using smaller RAID 5 sets for faster rebuilds and striping the data across them with a higher-level construct.
* Using a second parity calculation and storing it on another disk, as RAID 6 does.
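The proactive hot-spare behavior described above amounts to a simple threshold policy: count recoverable read and write errors per drive, and once a drive crosses a preset threshold, copy it to a spare before it fails outright. The sketch below only illustrates that policy; the error threshold, class names, and copy routine are assumptions, not the interface of any particular array.

```python
# Illustrative sketch of a proactive hot-spare policy (not a real array's API).
# A drive that accumulates too many recoverable errors is copied to a hot
# spare before it fails outright; only blocks that can no longer be read
# have to be reconstructed from parity on the other drives.

ERROR_THRESHOLD = 50  # assumed count of recoverable errors that triggers sparing

class MonitoredDrive:
    def __init__(self, drive_id, blocks):
        self.drive_id = drive_id
        self.blocks = blocks            # block data; None marks an unreadable block
        self.recoverable_errors = 0

    def record_recoverable_error(self):
        self.recoverable_errors += 1

    def should_spare_out(self):
        return self.recoverable_errors >= ERROR_THRESHOLD

def proactive_copy(failing, spare_blocks, rebuild_from_parity):
    """Copy a failing drive to a spare, block by block.

    Most blocks come straight off the failing drive; only the unreadable
    ones are reconstructed from parity, which is why this is far quicker
    than rebuilding the entire drive after a hard failure."""
    for block_no, data in enumerate(failing.blocks):
        if data is None:                 # unreadable region: fall back to parity
            data = rebuild_from_parity(block_no)
        spare_blocks.append(data)

# Hypothetical usage: one bad block, error count already past the threshold.
drive = MonitoredDrive("disk-07", [b"AAAA", None, b"CCCC"])
drive.recoverable_errors = ERROR_THRESHOLD
if drive.should_spare_out():
    spare = []
    proactive_copy(drive, spare, rebuild_from_parity=lambda n: b"BBBB")
    print(spare)  # [b'AAAA', b'BBBB', b'CCCC']
```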