Thursday, September 19, 2024

Understanding RAID

What does it all mean?

RAID 0, Duplexing, Mirroring, Striping with parity, ECC – the terminology bog seems to go on forever. Do you need to know what all of these mean? If you’re considering a RAID array for your network, the first step is understanding the different terms before you even think about making a purchase.
So, off we go…

RAID – To start talking about RAID (Redundant Array of Independent Disks, or Redundant Array of Inexpensive Disks; depending on who you ask, both are accepted as “correct”), you need to know that there are two different types: software and hardware RAID. In a nutshell, hardware RAID is any type of RAID that is controlled at the hardware level, be it by a controller card or another device, independent of the operating system. Before any operating system is installed on the system (or started up, in the event an operating system is already present), this level of RAID is already enabled. In theory, losing the operating system to errors or configuration issues in this type of RAID configuration allows for easier recovery of your data, as the disk configuration is held on the hardware controller. Losing the hardware device, however, makes data retrieval much more difficult, as the configuration of the hard drives is unknown to the operating system (and to any standard data recovery tools), which sees the combined space of all the drives as one logical structure.

[NOTES FROM THE FIELD] – Certain server-class hardware RAID solutions write the disk configuration not only to the memory of the hardware controller but also to certain reserved sectors of the hard drives. This means that if the controller fails and is replaced with the same make, model and BIOS revision of controller, the chances of recovering the data are greatly improved, as the disk configuration is written “back” from the disks to the memory of the replacement controller. Even if the operating system faulted, the data would most likely remain intact for standard disk access and recovery at that point.

In a software-based RAID solution (software based here meaning “provided by the operating system”), it is the operating system itself that creates and stores the logical structure of the drives in the array. Real-mode access (such as booting from a floppy disk or an NTFS boot disk in the NT world) in most cases is not going to allow any access to the data, as the operating system has been bypassed and was never allowed to initialize and access the logical drive array it created.

[NOTES FROM THE FIELD] – There is no RAID-related hardware failure (such as a controller card) to be concerned with in a software-based RAID scenario; however, if you lose the operating system to the point where the repair function cannot “fix” the issue, all the data in the operating-system-created RAID solution is usually lost.


The Different Flavors of RAID

For the most part, the RAID levels themselves have the same characteristics and properties whether they are software or hardware based. Only a few levels, RAID 0, 1 and 5, can be software based (that I am aware of); the remainder are, for the most part, hardware based only.

RAID 0 – RAID 0 is a misnomer of sorts, as it is not really RAID at all; there is nothing redundant about it. It offers the best read/write performance but no fault tolerance.

[Graphic: RAID 0 striping across multiple drives]

All of the drives are written to sequentially (one after the other, from A to B to C, etc.), as the data is broken down into blocks anywhere from 512 bytes up to 64KB in size. The block size depends on the hardware controller and/or the software configuration of that controller or the operating system, as RAID 0 is supported in both hardware and software solutions. In the above image, each column represents a physical drive, with the letters representing the data blocks.
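
To make the striping pattern concrete, here is a minimal Python sketch (not from the original article) that spreads data across drives in fixed-size stripe units; the 64KB unit size and the in-memory lists standing in for drives are illustrative assumptions only.

STRIPE_UNIT = 64 * 1024  # 64KB stripe unit (controller/OS dependent)

def raid0_write(data: bytes, num_drives: int):
    """Split data into stripe units and write them round-robin: A, B, C, ..."""
    drives = [[] for _ in range(num_drives)]
    units = [data[i:i + STRIPE_UNIT] for i in range(0, len(data), STRIPE_UNIT)]
    for index, unit in enumerate(units):
        drives[index % num_drives].append(unit)  # sequential, drive after drive
    return drives

# 1MB of data over four drives: four units land on each drive, and no unit
# is duplicated anywhere, which is why losing one drive loses the array.
striped = raid0_write(b"\x00" * (1024 * 1024), num_drives=4)
print([len(units) for units in striped])  # [4, 4, 4, 4]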

RAID 0 is often referred to as striping, or striping without parity, and requires a minimum of two physical hard drives, or at least two drives in the same system with the same amount of space dedicated to RAID. For example, you could have four 20GB hard drives in a system (80GB of total space) and dedicate 10GB partitions from two of the physical drives, for a total of 20GB, to the RAID 0 array.

RAID 0 allows for the use of the entire dedicated space committed to the array. If you have dedicated ten 20GB hard drives to your RAID 0 array, you will have 200GB of space available for use, as the overhead for the array is negligible.

The failure of a single drive in a RAID 0 array causes the entire array to fail and will result in the loss of any data that is not committed to backup.

RAID 1 – RAID 1 is often referred to as disk mirroring when the two hard drives are attached to the same IDE, SCSI or RAID controller, and disk duplexing when the two hard drives are attached to two different IDE, SCSI or RAID controllers. (Normally the controllers are paired by type, IDE to IDE, SCSI to SCSI, RAID to RAID; while it is possible to mix them, it is not recommended.)

There is no striping of any kind in this configuration; instead, all of the data written to one disk is duplicated to the other, and this is where the fault tolerance of this configuration exists. (This duplicate copy is the redundant information that will be used to keep the system running in the event of a drive failure.) This has the effect of allowing up to twice the read data rate of a single drive, as either disk can be accessed at a given time. (There is no write performance increase, and the system takes a performance hit due to the need to duplicate every write to both drives.) This type of RAID implementation has one of the highest overhead requirements, in terms of system or hardware resources, of any of the RAID types, and following best practices it is almost always deployed as a HARDWARE implementation. (It can be done via the OS, but this places all of the processing load of the RAID overhead on the system processor, which is inefficient regardless of the type and speed of the processor in use.)
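
As a rough illustration of the mirrored write and dual-source read described above, here is a small Python sketch; the in-memory dictionaries standing in for the two drives are assumptions made purely for illustration.

import random

drive_a: dict[int, bytes] = {}
drive_b: dict[int, bytes] = {}

def mirrored_write(block_number: int, data: bytes) -> None:
    # The write "penalty": the same block is committed to both members.
    drive_a[block_number] = data
    drive_b[block_number] = data

def mirrored_read(block_number: int) -> bytes:
    # Reads can be serviced by either member, which is where the roughly
    # doubled read rate of a mirror comes from.
    return random.choice((drive_a, drive_b))[block_number]

mirrored_write(0, b"payroll records")
print(mirrored_read(0))  # either copy returns the same data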

[Graphic: RAID 1 mirroring across two drives]

(Data is read from or written to A and A, or B and B, or C and C, etc., depending on where the data is held or headed.) In the above image, each column represents a physical drive, with the letters representing the data blocks.

RAID 1 requires a minimum of two physical hard drives of equal size per array, and the net amount of space is half of the total committed. Effectively, you lose 50% of the disk space committed, as the data is a mirror of itself. Two 40GB drives (totaling 80GB of space) will yield approximately 40GB of usable space in the RAID 1 array.

A standard RAID 1 implementation of two drives can survive the loss of a single disk and retain its data structure.

Some hardware RAID configurations have the ability to utilize both hot-swappable (sometimes called hot-pluggable) hard drives and online spares (sometimes called failover drives) to improve the chances of preserving data in the event of a failure.

Hot-swappable drives are usually connected to a server system via an open-faced cage in the front of the system, which allows for the quick removal of a dead drive while the system is powered on. Sometimes these drives are enclosed within the case, but this is the exception, not the rule. The use of hot-swappable drives means that while the system is up and running, you can pull the failed hard drive and replace it, and the data will be “rebuilt” to the replacement drive from the stored redundant information. In the above example of RAID 1, the mirror would be re-established on the new drive by taking all of the data on the remaining disk and duplicating it to the new drive. In a true parity situation (such as RAID 5, which will be discussed later), the parity information that is spread out over all of the hard drives is what allows the data to be “added” to the replacement drive.

In the case of an online spare, an additional drive is committed to the RAID array, allowing for a secondary failure recovery point to exist. This additional drive’s (or drives’) sole purpose is to sit idle UNTIL a failure occurs. At the point of a single drive failure, the online spare would initialize and immediately begin to work with the controller to restore the RAID array to its fault-tolerant configuration.

For example, if a RAID 1 configuration were deployed with a single online spare, you would need three hard drives of the same size. This would limit your usable logical space to just 33% of the total amount dedicated: three 20GB hard drives total 60GB of space, but because the array is set up as a mirror (data that is totally duplicated from one drive to its mirror partner) AND the third drive has the singular task of “waiting” for a failure, the total amount of usable space is 20GB.

This configuration allows for two drive failures, provided the online spare is totally “rebuilt” before the second failure occurs.

Let’s say that on Friday evening around 9:00PM one of the drives in your RAID 1 configuration fails. The RAID controller takes the failed drive off-line and brings the spare online. Your online spare becomes active and all of the data is restored to the newly initialized spare. (The actual rebuild time will vary based on how much data needs to be copied and the rebuild priority given; in this example, let’s say it’s ten hours.)

If another failure occurs, either to the other existing RAID 1 drive or to the newly initialized spare, AFTER the rebuild period (in this example, ten hours), the system WILL stay up and running. If hot-swappable drives are being used, BOTH failed drives could be pulled from the system and replaced with new drives; the controller will take ONE of the two replacement drives and rebuild the RAID 1 data to it, and it will place the other replacement drive in an off-line status for use in the next drive failure scenario.

If the second failure were to take place on either drive BEFORE the rebuild period (in this example, ten hours) has an opportunity to complete, the system WILL NOT stay up and it WILL FAIL.
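
The whole scenario boils down to a single question: did the second failure land before or after the rebuild window closed? A toy Python sketch of that decision, using the ten-hour figure from the example above, might look like this:

REBUILD_HOURS = 10  # the example rebuild window from above

def array_survives_second_failure(hours_between_failures: float) -> bool:
    """RAID 1 plus online spare: a second failure is survivable only if the
    rebuild to the spare finished before that failure arrived."""
    return hours_between_failures >= REBUILD_HOURS

print(array_survives_second_failure(14))  # True: the spare was fully rebuilt
print(array_survives_second_failure(3))   # False: the failure hit mid-rebuild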

RAID 5 – RAID 5 is often referred to as striping with parity and has some similarities to RAID 0, the main difference being that RAID 5 offers fault tolerance while RAID 0 does not. RAID 5 data is broken up into chunks anywhere from 512 bytes up to 64KB in size (this depends on the hardware controller and/or the software configuration of that controller or the operating system, as RAID 5 is supported in both hardware and software solutions) and distributed across all of the disks in the array, with the parity information spread across the drives as well.

[Graphic: RAID 5 striping with distributed parity across five drives]

In the above image, each column represents a physical drive, with the letters representing the data blocks. It shows that data block “0” is written first to drive A, then to B, then to C, then to D, with its parity information written to the fifth drive. (If there were more data in the “0” data block, it would “wrap” back to the A drive and start the process over.) The next batch of data (the “1” data block) starts on the A drive (although it could have started anywhere, based on the last write or the controller algorithm) and continues in the same fashion. You’ll notice that this time, however, the parity information is kept on a different drive (drive D). It is this “spreading out” of the parity (rebuild) information that gives this configuration its fault tolerance.
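
A small Python sketch may help illustrate the layout described above: each stripe’s parity is simply the XOR of its data units, and the drive holding the parity rotates from stripe to stripe. The five-drive count and the tiny two-byte units are assumptions chosen to match the example image.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together (the parity calculation)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def raid5_stripe(data_units, stripe_number, num_drives=5):
    """Lay out one stripe: num_drives - 1 data units plus one rotating parity unit."""
    assert len(data_units) == num_drives - 1
    parity = xor_blocks(data_units)
    # Rotate which drive holds the parity so no single disk carries all of it.
    parity_drive = (num_drives - 1 - stripe_number) % num_drives
    layout = list(data_units)
    layout.insert(parity_drive, parity)
    return layout  # one unit per physical drive for this stripe

print(raid5_stripe([b"A0", b"B0", b"C0", b"D0"], stripe_number=0))  # parity lands on drive E
print(raid5_stripe([b"A1", b"B1", b"C1", b"D1"], stripe_number=1))  # parity lands on drive D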

RAID 5 needs to be deployed using a minimum of three disks in its standard configuration. (The example above shows five drives, which is the “best practice” minimum suggestion.)

A RAID 5 configuration will allow for use of the total combined space of all of the drives, minus a single drive. That is, if five 20GB hard drives totaling 100GB of space are committed to a RAID 5 array, the total usable space is 80GB. The “lost” space is allotted for the use of the parity storage.
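
The usable-space arithmetic for the three levels covered here can be summarized in a few lines of Python; the drive counts and the 20GB size below simply echo the figures used in the examples.

def usable_gb(raid_level: int, num_drives: int, drive_gb: int) -> int:
    if raid_level == 0:
        return num_drives * drive_gb         # no redundancy overhead at all
    if raid_level == 1:
        return (num_drives * drive_gb) // 2  # everything is mirrored
    if raid_level == 5:
        return (num_drives - 1) * drive_gb   # one drive's worth of parity
    raise ValueError("only RAID 0, 1 and 5 are covered here")

print(usable_gb(0, 10, 20))  # 200GB usable from ten 20GB drives
print(usable_gb(1, 2, 20))   # 20GB usable from two 20GB drives
print(usable_gb(5, 5, 20))   # 80GB usable from five 20GB drives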

One key point to remember with RAID 5 is that the parity information is not committed to a single drive. It is spread out among all of the drives, and while it does TOTAL the space of a single drive, it is not stored ON a single drive.

RAID 5 can sustain the loss of a single drive (or more, in the case of online spare usage, which I’ll cover in a minute).

In the above image there are five drives. Let’s assume that drive D fails and no online spare is in use. The remaining four drives can continue to allow the system to operate, as the “lost” data (D0, D2, D3 and D4) is “restored” for “direct” access from the 0 parity, 2 parity, 3 parity and 4 parity information spread out on the other hard drives. The 1 parity information is lost with the drive failure, which is partly why this RAID array cannot sustain a second failure: with all of the other parity information being used to re-create the data lost with drive D, and 1 parity gone, ANOTHER drive failure would leave nothing to rebuild from, and the array would fail.
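
The “direct” access described above works because the parity is just an XOR across the stripe: XORing the surviving data units with the parity unit yields the missing one. Here is a minimal Python sketch of that reconstruction, using made-up two-byte units.

from functools import reduce

def xor_pair(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

stripe = [b"A0", b"B0", b"C0", b"D0"]  # the "0" data blocks on drives A through D
parity = reduce(xor_pair, stripe)      # the "0 parity" unit on drive E

# Drive D fails: rebuild D0 on the fly from the survivors plus the parity.
survivors = [b"A0", b"B0", b"C0"]
recovered = reduce(xor_pair, survivors + [parity])
print(recovered)  # b'D0', identical to the unit lost with drive D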

The case of a RAID 5 array with an online spare is a little different.

In the above example, assuming an online spare were present, drive D would fail, the controller would mark D off-line and activate the online spare, and the online spare would be populated (rebuilt) with all of the data that was originally on drive D.

The D0, D2, D3 and D4 data would be restored to the newly initialized drive from the 0 parity, 2 parity, 3 parity and 4 parity information on drives A, B, C and E, and the 1 parity would be rebuilt to the newly initialized drive using A1, B1, C1 and E1. (The actual rebuild time will vary based on how much data needs to be copied and the rebuild priority given; in this example, let’s say it’s ten hours.)

If another drive failure occurs AFTER ten hours has passed, the system will continue to remain operational, as it would function “directly” from the parity information, as shown in the “no online spare” example. If the second drive failure occurs BEFORE the ten hour threshold, the system will halt, as the rebuild of all of the information would not have completed.

(The use of an online spare in RAID 5 works the same way it did in the RAID 1 section: an additional drive is committed to the RAID array, allowing for a secondary failure recovery point to exist. This additional drive’s sole purpose is to sit idle UNTIL a failure occurs, at which point it initializes and immediately begins to work with the controller to restore the RAID configuration to its fault-tolerant state.)

Jason Zandri has worked as a consultant, systems engineer and technical trainer for a variety of corporate clients in Connecticut over the past five years and currently holds the position of Technical Account Manager for Microsoft Corporation.

He has also written a number of COMPTIA and MICROSOFT prep tests for Boson Software and holds a number of certifications from both companies. Currently, he writes part time for a number of freelance projects, including numerous “HOW TO” and best practices articles for 2000Trainers.com and MCMCSE.com.
