Just lost 72TB of data after an HP C7000 / Dell MD1200 FW update


Eliav

Member
Feb 7, 2016
59
19
8
Los Angeles, CA
Well today was a fun day...

Does anyone else here have an HP C7000 and Dell MD1200 storage connected via SAS?

We did a firmware update on our C7000 and while everything was fine for a few days, the raid controller on our c7000 just wiped its entire storage configuration. Spent many hours on the phone with HP, who basically said we were SOL. The raid card saw 32 unconfigured drives when there were 3x 12-bay MD1200 chassis, each with a volume across the entire chassis.
They claimed a possible incompatibility between HP and Dell. This was a system left over from a previous setup (we usually don't mix storage brands), but it's been working fine for years.

Has anyone experienced this? Do you mix your storage systems?

Luckily we have backups, it's just time consuming and annoying.
 

Evan

Well-Known Member
Jan 6, 2016
3,346
598
113
What Smart Array controllers are in the blade node?

Not that I can probably be much help; my c7000's are all connected to SAN, though I do have one c7000 with a few nodes connected to some HP D6000s.

I can't really imagine what could go wrong, and if it did, I assume it's just an array controller fault that could just as easily happen in a standalone server as in a blade in a c7000.
 
Apr 13, 2016
56
7
8
54
Texas
Ouch! Definitely a painful experience, and kudos for having backups... I know for sure that every tier 1 vendor will respond in the same manner, and honestly it is for good reason. The qualification matrix for them is large, and there are always OPEX constraints on having a proper test bed/regression farm for their own hardware/software/firmware listings - let alone adding other vendors into the mix. This extends even to HDD/SSD models and firmware revisions.

That being said, what you ran into would be odd if it only occurs with a mixed hardware configuration. (Hard to imagine that a JBOD topology would cause an array to remove all metadata on all attached drives.) What firmware revision were you running prior to the update, and what version did you update to?
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,516
5,809
113
This stinks! Glad to hear you at least have backups. Hopefully you also have a nice fat pipe between the backup machine(s) and whatever you are planning to use going forward.
 

Eliav

Member
Feb 7, 2016
59
19
8
Los Angeles, CA
What Smart Array controllers are in the blade node?
They're HP SmartArray P721m controllers.

HP is sending another controller for our peace of mind, but they don't think that'll get the data back.

I still can't understand how the controller would just wipe the entire raid config. There isn't even the ability to re-import it from the drives.

Backup restoration will be slow. We're still deciding what we want to switch to - maybe all iSCSI.
 

PigLover

Moderator
Jan 26, 2011
3,186
1,545
113
Ouch. I feel for you. A few years ago I lost a 6x2TB array when the controller faulted and scribbled all over all 6 drives. Hard lesson on the concept that "raid is not backup". Luckily most of it was ephemeral copies used during processing and the valuable data was recoverable from other sources.

Glad you had a backup.

A reminder to others: Raid is about performance and resiliency - it improves MTBF and maybe MTTR - but it does not ever replace disciplined backups. Raid controllers fail in odd ways that can be disastrous. Even raid-like systems (ZFS) suffer devastating loss if the metadata is damaged (lots of blogs out there documenting this).
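
As a very rough back-of-the-envelope illustration of the MTBF point (the drive count, failure rate and rebuild time below are made-up example numbers, not anything from the OP's setup):

[CODE]
# Rough estimate of annual data-loss probability for a single-parity array.
# Ignores unrecoverable read errors, correlated failures and controller bugs.
# All numbers below are illustrative assumptions, not measured values.

disks = 12           # drives in the array (e.g. one 12-bay shelf)
afr = 0.04           # assumed annual failure rate per drive (4%)
rebuild_hours = 24   # assumed time to rebuild onto a spare

hours_per_year = 24 * 365

# Expected number of first drive failures per year across the whole array
first_failures_per_year = disks * afr

# Chance that one of the remaining drives also fails during the rebuild window
second_failure_during_rebuild = (disks - 1) * afr * (rebuild_hours / hours_per_year)

annual_loss_probability = first_failures_per_year * second_failure_during_rebuild
print(f"~{annual_loss_probability:.4%} chance per year of losing the whole array")

# The point: the number is small but never zero, and it says nothing about the
# controller itself eating the config - which is exactly what backups are for.
[/CODE]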

Hope your recovery is not too time consuming.
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
And THIS is why I love/preach/am a HEAVY proponent for SW raid :-D but then again ALWAYS have backups/replicated and VALIDATE/TEST your backup data.

Sorry bud, as others have said, count your lucky stars you have backups. (tips hat for that good sir)

EDIT; JUST abt 10 years of heavy ZFS use and never encountered metadata fubar issue...now I gotta go research...SCARY

~goes to knock on wood as well.
 

Yarik Dot

Active Member
Apr 13, 2015
220
110
43
47
And THIS is why I love/preach/am a HEAVY proponent for SW raid :-D but then again ALWAYS have backups/replicated and VALIDATE/TEST your backup data.
Yep, I agree. SW raid on servers. HW raid in vmware hosts (unfortunately there is no other option - but it is fine, I don't have so many of them).
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
I'm not even interested in HW raid for VMware as I run all my VM's off NFS storage over a 10GbE network fabric, NFS server uses ZFS/SW raid, performance through the roof.

Now if you were doing HW raid volumes/datastores for local performance and did not need shared/clustered stg then maybe but hell even vSAN recommends HBA's to just 'pass the disk along to them'
 
Apr 13, 2016
56
7
8
54
Texas
I'm not even interested in HW raid for VMware as I run all my VM's off NFS storage over a 10GbE network fabric, NFS server uses ZFS/SW raid, performance through the roof.

Now if you were doing HW raid volumes/datastores for local performance and did not need shared/clustered stg then maybe but hell even vSAN recommends HBA's to just 'pass the disk along to them'
:) I hear you - though not all software-based solutions are ones I'm ready to trust highly. ZFS I trust... vSAN, not so much. Totally agree with you on local DAS: for performance, go HW RAID; if clustering, then the solutions start to narrow down - and you've got to keenly focus on what was qualified with that solution, and not change the recipe, or you'll end up in the "non-supported zone."
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
1,142
594
113
New York City
www.glaver.org
And THIS is why I love/preach/am a HEAVY proponent for SW raid :-D but then again ALWAYS have backups/replicated and VALIDATE/TEST your backup data.
Agreed - I am continually amazed at the number of people who go "I don't need backups for ZFS", "I use ZFS snapshots for backup", or "I use ZFS send/recv and that's my entire backup strategy". I wrote about some of my experiences here.
EDIT; JUST abt 10 years of heavy ZFS use and never encountered metadata fubar issue...now I gotta go research...SCARY
The only time this happened to me, I was using an older ZFS version (pre-zpool 19) and one of the flash DIMMs on my PCIe SSD (OCZ Z-Drive R2 P84) developed socket problems*. You couldn't remove a ZIL device before zpool 19, and any attempt to write to the pool panic'd the system. I ended up copying the pool to tape, nuking it, and restoring.

* I have no idea why OCZ had unreliable socket connections. They were using SODIMM sockets which have a proven track record in everything from notebooks to (at the time) high-end Cisco routers. To make things worse, if you tried moving all of the flash modules to another Z-Drive main board, the LSI firmware would complain that you were trying to create more RAID volumes than the license allowed (and since these were fixed-config, there was no user interface to deal with this).
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
Round about 2006 IIRC we had the same problem with a bunch of LSI/IBM RAID cards completely forgetting the RAID config of the discs contained therein. Thankfully we were saved because we had doco on the exact disc geometry the arrays had been created with, and recreating the same geometry with the same discs gave us a recoverable array. If that hadn't worked we'd have been looking at about three days downtime for our production mail servers whilst we restored from tape.
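
For what it's worth, that kind of doco can be automated. A minimal sketch for Linux software RAID, assuming mdadm is installed and the script runs as root; the output path is just an example and should really land somewhere that gets backed up off-box:

[CODE]
#!/usr/bin/env python3
"""Snapshot mdadm array definitions so a wiped config can be recreated later."""
import datetime
import pathlib
import subprocess

# Example output location - adjust to somewhere that gets backed up elsewhere.
OUTDIR = pathlib.Path("/var/backups/raid-geometry")

def run(cmd):
    """Run a command and return its stdout, raising if it fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def main():
    OUTDIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

    # One-line ARRAY definitions (level, device count, UUIDs) for mdadm.conf
    scan = run(["mdadm", "--detail", "--scan"])

    # Full per-array detail: chunk size, member order, layout, etc.
    details = "".join(
        run(["mdadm", "--detail", line.split()[1]])
        for line in scan.splitlines() if line.startswith("ARRAY")
    )

    (OUTDIR / f"raid-geometry-{stamp}.txt").write_text(scan + "\n" + details)

if __name__ == "__main__":
    main()
[/CODE]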

And THIS is why I love/preach/am a HEAVY proponent for SW raid :-D but then again ALWAYS have backups/replicated and VALIDATE/TEST your backup data.
Had the same sort of shenanigans at the beginning of my career: a catastrophic failure of a RAID controller (at that time due to a firmware bug that not only caused the controller to forget the RAID config and somehow rendered it unable to read its own HDD signature, but meant other identical controllers couldn't read it either). Since then I've been a big proponent of RAIRAID - Redundant Arrays of Independent Redundant Arrays of Independent Discs. If something can't be done purely with software RAID (and thus recoverable on any machine capable of accessing the discs), then you'd better damned well make sure it's at least backed up to a system that can be recovered on any machine capable of accessing the discs. RAID isn't backup, and backups that you haven't tested restores from under multiple failure scenarios also aren't backups* (see the sketch at the end of this post).

If in doubt, always back up your backups to a backup backup lest your inability to restore a backup gets your bosses' backs up.


* One memorable digression; company I did consultancy for had a DR plan that involved taking the tapes they had in Iron Mountain and manually restoring them to their DR site. On inspecting the DR site, I found that it didn't have a tape library or even a tape drive. Their official response was "if the main site went down we'd just take the tape library from there and move it to the DR site to restore the tapes". Their DR planning had not even considered the entire site being lost due to flooding or fire or zombies or anything.
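
And since I keep banging on about testing restores, here's roughly the sort of dumb spot-check I mean, as a minimal sketch - it assumes you can restore a backup into a scratch directory and compare it against the live tree; the paths and sample size are placeholders:

[CODE]
#!/usr/bin/env python3
"""Spot-check a restored backup against the source tree by hashing files."""
import hashlib
import pathlib
import random

SOURCE = pathlib.Path("/srv/data")            # placeholder: the live data
RESTORE = pathlib.Path("/mnt/restore-test")   # placeholder: a test restore
SAMPLE_SIZE = 200                             # how many files to spot-check

def digest(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

files = [p for p in SOURCE.rglob("*") if p.is_file()]
sample = random.sample(files, min(SAMPLE_SIZE, len(files)))
mismatches = 0
for src in sample:
    restored = RESTORE / src.relative_to(SOURCE)
    if not restored.is_file() or digest(src) != digest(restored):
        mismatches += 1
        print(f"MISMATCH: {src}")

print(f"checked {len(sample)} files, {mismatches} mismatches")
[/CODE]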
 

Patriot

Moderator
Apr 18, 2011
1,451
792
113
They're HP SmartArray P721m controllers.

HP is sending another controller for our peace of mind, but they don't think that'll get the data back.

I still can't understand how the controller will just wipe the entire raid config. There isn't even the ability to re-import it from the drives.

Backup restoration will be slow. We're still deciding what we want to switch to - maybe all iscsi.
It can't? The raid config is on the drives, not the controller... I'd bet on the JBOD being the culprit more than the controller... You should be able to see the config from another blade plugged into the same slot.
 

cheezehead

Active Member
Sep 23, 2012
730
176
43
Midwest, US
I'd see if I could get the card loaded with the original firmware... then connect the drives and see what can be seen. As mentioned earlier, SmartArray controller configs are stored on the drives and not the actual controller.
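
It also doesn't hurt to dump the controller's view of the config before and after any firmware work so there's something to diff later. A rough sketch, assuming HPE's ssacli utility is on the box (older installs ship it as hpssacli or hpacucli); the output path is just an example:

[CODE]
#!/usr/bin/env python3
"""Dump Smart Array controller/logical drive config for later comparison."""
import datetime
import pathlib
import shutil
import subprocess

OUTDIR = pathlib.Path("/var/backups/smartarray")  # example location

def find_tool():
    # Newer HPE installs ship ssacli; older ones hpssacli or hpacucli.
    for name in ("ssacli", "hpssacli", "hpacucli"):
        if shutil.which(name):
            return name
    raise SystemExit("no Smart Array CLI tool found")

def main():
    tool = find_tool()
    OUTDIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    # Full controller, array and logical drive configuration in one report.
    out = subprocess.run([tool, "ctrl", "all", "show", "config", "detail"],
                         check=True, capture_output=True, text=True).stdout
    (OUTDIR / f"smartarray-{stamp}.txt").write_text(out)

if __name__ == "__main__":
    main()
[/CODE]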
 

Eliav

Member
Feb 7, 2016
59
19
8
Los Angeles, CA
It can't? The raid config is on the drives, not the controller... I'd bet on the JBOD being the culprit more than the controller... You should be able to see the config from another blade plugged into the same slot.
I'd see if I could get the card loaded with the original firmware... then connect the drives and see what can be seen. As mentioned earlier, SmartArray controller configs are stored on the drives and not the actual controller.
We did load the previous firmware without luck.
The firmware on the JBOD wasn't touched, the only thing that changed was the controller FW.
I thought the same thing - HP said they've never seen this happen (a firmware update causing a config wipe).

As soon as we can schedule a downtime window we're going to swap to the replacement card - we've shuffled VMs around and are all back up and running on other gear now, at least.

Looks like we're just going to migrate everything over to iSCSI. I don't want to touch that storage with a 10' pole.