Recovering Drive Configuration on vanilla PERC H700 (LSI 9260-8i)

AndrewTBense · Jul 10, 2016

I have a PERC H700 with a 7x3TB RAID5 array where one of the disks had recently failed. I went to replace the failed drive and now it appears that I have lost everything.

I am currently panicing because I fear that I have lost *EVERYTHING*. I am seriously up the creek if I have lost everything.

OS: Windows Server 2012 R2 Datacenter
Adapter: Dell PERC H700 - (I'll check the firmware, it's oem dell)

Here's what I have:
Controller0: PERC H700
Drive Group: 0, RAID 0
Virtual Drive(s):
Virtual Drive: 1 Ksys, 118.625 GB, Optimal
Drives
Slot: 4, SSD (SATA), 119.243 GB Online
Drive Group: 1, RAID 5
Virtual Drive(s):
Virtual Drive: 0, bigboss, 16.371 TB, Degraded
Drives
Slot: 0, SATA, 2.729, MISSING
Slot: 1, SATA, 2.729, Online, (512 B)
Slot: 2, SATA, 2.729, Online, (512 B)
Slot: 3, SATA, 2.729, Online, (512 B)
Slot: 5, SATA, 2.729, Online, (512 B)
Slot: 6, SATA, 2.729, Online, (512 B)
Slot: 7, SATA, 2.729, Online, (512 B)

#######################################################

So I purchased a 4TB SATA drive to replace the dead drive. I booted up into Windows, loaded MegaRAID Storage Manager.

Right clicked on the unconfigured 4TB drive and selected "Replace Missing Drive"

The storage manager then associated this new 4TB SATA drive in Slot 0 with Drive Group: 1, RAID 5.

I then right clicked on the drive and selected "force online" -- I've been able to do this before and all was fine.

Walked away from the machine, for ~5 minutes. Told windows to shut down, waited for windows to do it's thing, then when the machine powered off, I moved it backed to my closet connected the network cables, plugged it back in, and powered it back on.

A few minutes later I RDP'd into the machine. The file systems on that Drive group were no longer there.

I loaded up "Computer Management", the partition tool in Windows, and it appeared as this

Disk 0 - 16.374 TB
[ 3.xx TB Healthy Partition ] [ 12.xx TB Unallocated ]

I paniced. I opened up the MegaRAID Storage Manager and aborted the operation.

Normally I see a rebuild??????

I then tried to clear and re-import the configuration. Since I boot from the 128 GB SSD that's connected to Slot 4 on the controller, it wouldn't let me clear the configuration.

So I rebooted, loaded up the PERC BIOS utility, went to clear the configuration (thinking that I'd surely be able to re-import it)

And now everything is gone!??!!?!?!?

I'm freaking out and I don't know what to do. If any of you have any suggestions please, please advise.

For what it's worth, I've got an extra PERC H310 here that I can possibly cross flash to official LSI firmware if needed.

Looks like it's time for me to make a pot of espresso, and start my coffee percolator. Cause it's going to be a long night of researching to see if there's anything that can be done.

Terry Kennedy · Jul 10, 2016

AndrewTBense said:
So I rebooted, loaded up the PERC BIOS utility, went to clear the configuration (thinking that I'd surely be able to re-import it)

And now everything is gone!??!!?!?!?

I'm freaking out and I don't know what to do. If any of you have any suggestions please, please advise.

In the PERC BIOS (not from Windows, not from Mega-whatever), uparrow so the controller is highlighted (not any physical or logical drives). On the bottom you'll see a "More options" (I think it is F2). Press that key. On the top menu bar you should see "Foreign config" or similar. Selecting it will allow you to import a foreign configuration (found on the drives). That should restore your logical volume, assuming the config data still exists on the drives (if you clear the config with the drives connected, it can also clear the on-disk config).

If you need me to, I can shut down an R710 here that has a H700 in it and see what the exact menu steps are, but the above should be close enough to get you there.

AndrewTBense · Jul 10, 2016

Thank you for your reply.

Unfortunately, that option is not available.

vanfawx · Jul 16, 2016

"Clear Config" clears the existing configuration on the controller. Foreign config only comes into play when you're hooking up drives that were on a different LSI controller. Unfortunately I think you need to head to your backups to recover from this.

Terry Kennedy · Jul 16, 2016

vanfawx said:
Foreign config only comes into play when you're hooking up drives that were on a different LSI controller.

I'm not sure what you mean by "different LSI controller" - a different model, or a different unit of the same model. I've removed all the drives from a system with an H700, installed larger-capacity drives and created new volumes, and needed to go back to the old drives for various reasons. Take the new ones out, put the old ones back in (same controller) and they get reported as a foreign config. Import it and the old volumes are back.

vanfawx · Jul 16, 2016

The foreign comes into play if there's an existing config on the controller, and you add drives that don't match the existing config. I find with the same model controller (H310 -> H310) if there's no existing config, the array will not show as foreign but just import.

pricklypunter · Jul 16, 2016

If you don't have backups to go to for a restore, or even if you do, before you do anything else that may damage your data, bit image every drive with dd or similar. I suggest you do this in any case, that way you can work with just the image files, even in a virtual environment, and play with it until you get it working, without risking your data further. It's also a quick way to reset and do over if you don't get your config right first few times

Terry Kennedy · Jul 16, 2016

pricklypunter said:
If you don't have backups to go to for a restore, or even if you do, before you do anything else that may damage your data, bit image every drive with dd or similar.

Note that with RAID controllers, a dd of the "whole" drive will usually not include the RAID config metadata, as those controllers normally put that data at either the end or beginning of the drive and then reduce the reported capacity to hide that data from the operating system.

Connecting the drive to a "dumb" controller to perform the imaging operation may or may not help capture that data, depending on how the RAID controller does things.

pricklypunter · Jul 16, 2016

Utils like ReclaiMe and Raid Reconstructor and possibly others, might well be able to re-create or recover the array config from the first/ last 512bytes of the disks, however, if the actual data on the array is toasted by accident while trying to fix things, all is lost. Hence my suggesting taking a bit image of it before that happens. One of those utils, I can't remember which, claims to be able to re-create the config even with a failed disk in a raid5 array

AndrewTBense · Jul 18, 2016

pricklypunter said:
If you don't have backups to go to for a restore, or even if you do, before you do anything else that may damage your data, bit image every drive with dd or similar. I suggest you do this in any case, that way you can work with just the image files, even in a virtual environment, and play with it until you get it working, without risking your data further. It's also a quick way to reset and do over if you don't get your config right first few times

pricklypunter said:
Utils like ReclaiMe and Raid Reconstructor and possibly others, might well be able to re-create or recover the array config from the first/ last 512bytes of the disks, however, if the actual data on the array is toasted by accident while trying to fix things, all is lost. Hence my suggesting taking a bit image of it before that happens. One of those utils, I can't remember which, claims to be able to re-create the config even with a failed disk in a raid5 array

Thank you for your response. This machine has been untouched and unplugged since I made this first post. I am not going to touch anything until I have my hands on at least 7-8 drives in which I can 'dd clone' as you say.

Wish I could just erase (no pun intended) that night when I messed this up.

pricklypunter · Jul 21, 2016

Hey it happens to us all, at some point or other. Doing stuff when tired, brain not making sense of what you read, before hitting the enter key of doom and gloom etc. As I'm sure you are aware...if not before, certainly now, backups, backups, backups, raid will not save you from yourself (among other things)

BLinux · Jul 22, 2016

I've been in your situation several times, which is more than I'd like to admit to... but such is life with data these days. First thing first, calm down so you can think rationally. The many times where I've been in similar situations, I've been able to recover from most of them except twice (once was a failed RAID-0, so my expectations were total data loss if a drive failed, the 2nd time was software RAID-6 and 3 drives dropped out of the array); so just a point of info that might bring you some optimism.

If the data is important to you enough that it is worth buying a set of 8 spare drives, then definitely do that. I find the most convenient thing to have in situations like this is a USB3-SATA dock or adapter (with power supply, some adapters are made for SSD use and the power from USB won't be enough to run a spinning HDD). Get one of these as it should come in handy. Those should be able to allow you to dd image the entire drive, including meta data used by the RAID controller. (I've done this with Dell PERC5 and PERC6 cards, but not the H700).

The other thing you might want, other than the 7 drives to get a dd bit image of the drives in the array, are some large USB3 or some drives to backup your data should you be able to access them again. I got some 8TB USB3 drives recently for $215, a couple of these might be enough although I don't know how much data you had on the RAID-5.

The other thing you want to do now, before your memory is fuzzy... write down somewhere safe the Virtual drive volume configuration, not just that it was RAID-5, but what was the stripe size, etc. If you have accidentally cleared the RAID configuration data on the drives, usually that doesn't touch the data itself. So, if you can re-create the same RAID configuration, and choose NOT to initialize the drive, you can recover the data back.

Also, I think you might have the wrong procedure for repairing a degraded array. At least in my experience with the older PERC5/6 cards, you should add the new drive as a "hotspare", and the degraded array will automatically use the new "hotspare" to bring the array back to optimal state. Using the "force online" option is not really meant for the type of procedure you were hoping for.

That said, the "force online" option is useful for something that might help you here. With RAID-5, and such large drives, you sort of have to walk on eggshells due to the infamous uncorrectable read problems (URE). I ran into these every so often even with 8x 500GB drives, with 3TB drives, you might expect to run into them even more frequently. The problem here is that when you are rebuilding a RAID-5 array, if you encounter a URE on any one of the remaining drives, these RAID controllers have a tendency to want to mark the whole array as gone (you're already degraded, and the URE makes the controller mark one more drive out, now RAID-5 is broken). This may be what happened to your array, if indeed it did start a rebuild. Ok, so what do you do here if this happens to you? This is when you remove the new replacement drive, so that the RAID controller only sees the remaining drives that were working in degraded mode before the rebuild started. At this point, you can use the "force online" option to force the RAID controller to make the degraded array available again. The reason you want to remove the new replacement drive is so as not to confuse the RAID controller so it doesn't have to try to figure out which of the 2 drives marked "offline" it needs to use to bring the array back online again; so that step is really important before you "force online". If you don't do that, and it somehow brings the "partially rebuilt" drive into the array, you'll get data corruption. I once had a pretty stubborn RAID-5 array in a Dell server that kept doing this, and I had to repeated pull the new drive out, 'force online', and try the rebuild again... the 1st few times it was really nerve wracking since you keep getting this feeling your data is totally lost. I'm sure some people might even develop PTSD from this, but if you survive it, you sort of get comfortable with the idea and it doesn't bother you to see your RAID array disappear LOL.

So, to outline your plan, this is what I would recommend:

0) get some hardware: USB3-SATA dock or adapter, 6x 3TB hard drives to copy bit image of your drives, 2x large USB3 or other hard drives to backup your data should they become available at some point in the below steps.
1) take a dd bit image of every drive that was in the RAID-5 array when it was degraded (6 of them). this way you can rinse and repeat the following other steps should you screw up. at this point, I would also suggest using the "copy" instead of the original - you can keep the original drives in their original state for a hail mary pass when you've exhausted other options.
2) take the new replacement drive out of the array physically (slot 0), put in the 6 "copies" into their respective slots. then boot into PERC BIOS.
3) perform everything within the PERC BIOS just to avoid booting into the OS and avoid any OS activity on the RAID array. this just reduces noise you don't need when you're on the edge of seeing your data disappear.
4) boot into the PERC BIOS, see if the controller is now seeing the RAID-5 array. if not, but it sees a 'foreign config', try to import that into the controller. sometimes when the controller is confused, it won't see this, but by removing the new replacement drive, it just might see it again. if you succeed here, and the RAID-5 is visiable, albeit in degraded mode again, this is the time to copy off all your data to those backup drives. (see step 6)
5) if #4 is all fail, then try to recreate the RAID-5 array with the same exact settings but DO NOT initialize the drive at all. And if it tries to start an initialize, stop it immediately. In this case, you may need to have the slot 0 drive inserted so the RAID-5 is created over the 7 drives and not 6. But then pull it out after it is created and allow the RAID-5 to go back to degraded mode. You don't want the new drive in slot 0 as you are not initializing and it will have inconsistent/bad data.
6) Recover your data now, you will need to boot into the OS, mount the RAID-5 array drive read-only if you can, and connect your backup drives and start copying your data off. During this data copy, it is possible that the reading will cause another URE and the array will go offline again. If that's the case, just go back into the PERC BIOS, and force online the offline drive so that the array is back in degraded mode and go back to copying your data again. I've had to repeat this step many times in some situations.
7) once you have a copy of your data backed up, you can try to bring the RAID-5 back to normal operations. At this point, you can insert the new replacement slot 0 drive, assign it as a hotspare (this is all in PERC BIOS), and see if the controller automatically starts rebuilding the RAID-5 array. If during the build, the RAID-5 array goes offline, it is likely you ran into a URE and a 2nd drive was marked offline. In this case, pull the slot 0 replacement drive, force the RAID-5 online in degraded mode, then wipe the data off the slot 0 drive (use dd if=/dev/zero of=/dev/sdX .. this is again where your USB3-SATA dock comes in handy again). This is so that the controller doesn't confuse the partially rebuilt data on the new replacement drive as being valid in any way. Now, try this step 7 again (insert drive, add as hotspare, attempt rebuild).

You may have failures along the way... but hopefully you have the chance to "redo" since you have the bit image from step 1.

In my own experience, I usually succeed at step 4, then copy my data off, and then recover the array in step 7. If step 7 goes horribly bad, but you successfully backed up your data in step 6, you can just wipe all disks and create fresh clean array and copy your data back.

Good luck! Nothing is hopeless while there is still hope...

BLinux · Jul 22, 2016

By the way, forgot to mention...

I think your controller being on Dell firmware vs LSI firmware probably isn't going to make a difference nor related to the cause of the fault. I would leave that alone and not touch that right now so as to reduce the number of variables you need to deal with while recovering your data.

Also, now is the time to really start looking at backup solutions. Along those lines, a friend once gave me some great advice on this topic. He said choose a backup solution that is easy, because the more barriers there are to you doing your backups, the more likely it is you will not do your backups or you will not do them frequently enough to be effective. There are some complex backup tools out there, if you have to read a book and study online documentation and do a lot of things to get your backups to work, it's likely you won't do it right, or do it at all. Such tools may be fine for a business environment where experts are paid to do that job and a entire business's livelihood is at stake. For me, I ended up writing my own tool, which is simply a script in Linux that turns on a set of external USB drives and copies the ZFS data sets to the external drives and them shuts them down. When it is done, I am emailed a report of what was done so I have confirmation the job was completed successfully, and if not, I will know about it and can address the problem. I decided to write my own backup script because that was easier for me than researching a dozen different tools and understanding their pros/cons and how to work around those; i wrote it in a day and every now and then I've added enhancements like the ability to use snapshots for consistent data sets (my VM disks are constantly changing and large), etc. I'm soon going to also add a 'time machine' like ability so that i can have multiple copies of the data as it has changed. And I'm also looking at cloud backup as a 3rd "offsite" solution, but my data set is rather large so I don't know if it makes sense since the last time I calculated it would take about 6 months to upload my entire data set. Anyway, this was what was easiest for "me" and might not be for you, but as you look for backup solutions, start with that priority of finding something "easy", because you'll be more likely to actually implement it. Once you have something that works (maybe not perfect), then you can look at alternatives that might have a desired feature you're lacking. Also, remember to test your recovery process every now and then; set something in your calender to remind you every few months to attempt to recover some data from your backup.

Search

Recovering Drive Configuration on vanilla PERC H700 (LSI 9260-8i)

AndrewTBense

New Member

Terry Kennedy

Well-Known Member

AndrewTBense

New Member

vanfawx

Active Member

Terry Kennedy

Well-Known Member

vanfawx

Active Member

pricklypunter

Well-Known Member

Terry Kennedy

Well-Known Member

pricklypunter

Well-Known Member

AndrewTBense

New Member

pricklypunter

Well-Known Member

BLinux

cat lover server enthusiast

BLinux

cat lover server enthusiast