Over the years, one of the most consistent problems in RAID recovery has been the rebuild. I would estimate that nearly 40 percent of the RAIDs we cannot recover fail solely because a technician executed a rebuild before verifying the following three items.
1. Hardware
The RAID went down for some reason. Often it is because the hardware housing the array has issues. There may be cabling problems, heat problems, backplane problems, or a hundred and one other hardware faults that can cause the RAID to degrade.
2. Hard drives
A simple surface scan of all drives in the array can give you an indication of the state of the drives. A report outlining any anomalies found for each drive is always critical when diagnosing the array.
3. RAID Consistency
A RAID 5 bases its integrity on a simple XOR algorithm: parity is stored on a block-by-block basis within each array stripe. The firmware of a RAID 5 uses this parity to ensure that the data stored on the RAID is consistent. It also ensures that if a single drive goes down and the array becomes degraded, the technician has ample time to do a quick backup of critical data, get all users off in a timely manner, and cleanly shut down any database handlers that may be residing and open on the array. In other words, don't have a dirty shutdown of your Exchange store.
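The XOR parity idea above can be shown in a few lines. This is a minimal sketch, not anything resembling real controller firmware; the block contents are made up for illustration.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data blocks in one stripe; the parity block is their XOR.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# If one data block is lost, XORing the survivors with parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Because XOR is its own inverse, the same operation that computes parity also regenerates any single missing block, which is exactly what a rebuild does.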
A degraded RAID 5 should NEVER BE RUN IN PRODUCTION! But this is not normally the case, which is why RAID recovery is a multi-million-dollar business. A degraded RAID 5 that is run in production for longer than twenty-four hours now contains data on the offending drive that is considered stale. If a second drive goes down, the entire array goes down, as RAID 5 cannot run with two drives out.
When I get a call from a technician saying their RAID is down because the array lost two drives, I immediately assume that one of the drives is stale and quickly advise the technician not to do a rebuild. In all the time I have been recovering RAIDs, I can count on one hand the clients who have actually lost two drives simultaneously.
Although items 1 and 2 are not my bread and butter, I am familiar with the techniques used to perform their respective checks. Item 3, however, I am very familiar with, and I can help you ascertain whether there is in fact a stale drive within your array. The following is a set of steps, along with a free piece of software, that you can use before any rebuild is initiated.
Step 1: Pull all the drives that are in the array. Get the drives that are configured as part of the array away from the hardware. This does not include any hot-spare drives, only those drives configured in the array and working at the time of the degrade.
Step 2: Make images of all the drives in the array. This serves several purposes. First, during an imaging session you may find bad sectors on the drives. Second, you never want to work on the live data, as the drives may be on their last legs, and any recovery, rebuild, or diagnostic run on live data may kill a drive. Lastly, if something happens to the drives, you will have the images as a way to recreate the original data set.
Step 3: Download the RAID Diagnostic Toolkit from our website and install it on a Windows NT type machine. The software is simple to use and self-explanatory. Some options in the software are not currently active because I will be introducing them in later posts; for those, a small pop-up window lets you know it is a future software enhancement, or the function is grayed out.
Currently the software defaults to a 64 KB, or 128-sector, stripe size. Although the stripe size has no bearing on this particular test, 64 KB is nevertheless used on 95 percent of the RAID 5s that I work on and gives us a more real-world type of map.
The software will run the consistency check on your set of images and give you a report on whether any stripes are corrupt. It will not tell you which drive is the stale drive if a stripe is corrupt, only that a rebuild using this set of drives would not be advisable.