VM Corrupts RAID array on physical host - Hyper-V


Sielbear

Member
Aug 27, 2013
Gang-

I have an odd problem that neither Microsoft nor Dell have been able to resolve yet. I know many of you have lots of experience and knowledge on both the hardware and the software side of the equation. This is more of a post of desperation / check to see if anyone has experienced anything like what we are seeing.

In late November, we installed 2 new, identical Dell R630 servers with the H730 RAID controller. The RAID array is a single RAID 10. The OS is Server 2012 R2 with only the Hyper-V role installed - nothing else on the physical hosts. All updates are installed / applied. We've tested with and without AV; no difference.

The servers ran for a couple of months with no issue at all. To be clear, we installed them and had a very methodical migration plan for this client. We spun up a new dedicated DC for them first. This ran for a couple of weeks. We then set up Exchange and migrated their Exchange data from the old server to the new, clean install of Exchange 2010. The migration was uneventful, and in early January we shut down the Exchange services / uninstalled Exchange from the old server.

The week of Jan 18th, we got an event log message on physical host #1 - Event ID 153 from the Disk source, a logical block timeout. We contacted Dell, and they informed us (after running the usual checks) that the hardware was perfectly fine and that we could disregard the message. It wasn't filling the event log or anything, so they determined it was a fluke.

Tuesday, January 26th (2-4 weeks after finalizing the migration), we received a call from the client that Exchange wasn't working - Outlook was showing disconnected. We connected to the physical host and saw the VM was not running. Two other VMs (prepared for data, but with no data moved yet) were running in addition to the DC on this box. We attempted to start the Exchange VM and received a message about the VHD being on storage that did not support clustered sharing - odd, since this is direct-attached storage that has never been in a cluster. After troubleshooting for 20 - 30 minutes, we got Microsoft involved. Eventually they recommended we reboot the physical host.

Upon reboot, we received the generic message that no operating system was found. Luckily we had set up replication, so we failed over to Server #2 and did a repair from the Windows boot media. The D: volume was seen as RAW - so pretty useless.

Dell and Microsoft asked for time to research the issue and we allowed them to work on it. Two days later, on the 28th, we received a Disk 153 error on Server #2. Shortly thereafter, we received reports that Exchange was acting up again. Same story as above - an eventual request to reboot, and we saw the message that no OS was detected. This time we had to do a full restore from backup. By the time it was completely rebuilt / up and running, it was an 8 - 12 hour impact to the client.

Thursday, during the rebuild process, Dell upgraded the firmware on both RAID cards - they determined the firmware had crashed multiple times on Server #1. During the day Friday, Dell determined that Server #1 did in fact have a bad RAID controller. We replaced the RAID controller on Server #1 and verified all was well - replication was re-enabled between Server #1 and Server #2. We also installed an old server from our office as an emergency spare (our old C6100, in fact).

Monday morning at 6:30 AM we received reports that some mail was acting up. Around the same time, our monitoring agent alerted us to a Disk 153 error on Server #1 (hosting Exchange). We failed over to Server #2 and got them back up. We then set up replication to our emergency server, as we were tired of rebuilding these things.

On Friday of that week, Feb 4, we received a Disk 153 error on Server #2, along with reports of mail acting up. We failed them over to our emergency hardware, and all has been running fine for the past week.

I have 2 servers that seem to completely flake out ONLY when the Exchange VM server is running on that physical host. We will log a Disk 153 error, and it's all downhill from there. Microsoft is blaming Dell and Dell is blaming Microsoft.

After lots of searching, I found the following post that is... similar enough to give me pause:
https://social.technet.microsoft.com/Forums/windowsserver/en-US/cec8c172-4390-45f1-83a9-578dc0655627/hyperv-host-and-all-guests-become-unresponsive-event-id-153-fills-system-log?forum=winserverhyperv

Prior to this episode, I'd have bet big money that this simply could not happen. My understanding was that just about anything could go wrong inside a VM without impacting the physical host. I am now completely lost as to how to proceed. I have 2 more server migrations lined up (one in my office), but I have no confidence in the hardware + software we've spec'd at this point. There's clearly something very wrong with this combination of VM load, Hyper-V, RAID controller, etc.

I'm at my wits' end, and I'm contemplating moving to VMware + Veeam for replication and backup. I really don't know what else to do.

ANY thoughts, suggestions, tangential experiences are welcome. Thank you for reading this super long post.

Alan
 

cesmith9999

Well-Known Member
Mar 26, 2013
1,434
483
83
This sounds like a hardware issue. Have you re-applied all HW firmware updates? And you haven't mentioned whether your hosts are connected to shared storage (clustered).

And have you stopped the service listed in the link you provided to see if that helps?

Chris
 

Sielbear

Member
Aug 27, 2013
One other note - since Feb 6th, we've built up a VM inside Hyper-V on Server #1. We set up IOMeter and have been throttling that box. No errors as of yet. I'm starting to think that we may have to do a restore of the DC and the Exchange box in a sandboxed environment to get to the bottom of this. Also - the backup is provided by Replibit - it does basic VSS calls, so nothing really special about it.
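
For anyone wanting to throw a similar load at their own box without the IOMeter GUI, Microsoft's diskspd tool can do roughly the same thing from an elevated PowerShell prompt. A sketch only - the target path, file size, and IO mix below are placeholders you'd tune for your own hardware:

# hypothetical example: 5-minute 64K random test, 40% writes, 8 threads,
# 32 outstanding IOs per thread, caching disabled, 50 GB test file on D:
diskspd.exe -c50G -d300 -r -b64K -w40 -t8 -o32 -Sh -L D:\iostress.dat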
 

Sielbear

Member
Aug 27, 2013
All firmware has been reinstalled on both servers. The hosts are NOT using any shared storage. These are standalone, shared-nothing hosts with simple Hyper-V replication set up between them.

Which service are you referencing in the link? The storage accelerator driver seems to only be a factor on 32-bit 2003 / XP boxes; these are 2012 R2 64-bit VMs. We also don't seem to have an option to disable "data protection" in the RAID controller like the poster in that thread did.
 

j_h_o

Active Member
Apr 21, 2015
California, US
I have a few Server 2012 R2 boxes with Exchange 2013/2016 VMs running fine, on Supermicro systems with 3ware 9750-8i hardware RAID5. The VMs are using Hyper-V replication without issue.
 

Sielbear

Member
Aug 27, 2013
For sure - this became our standard install over the past couple of years: Dell chassis, LSI RAID card, Server 2012 R2. Until now, it's been absolutely rock solid.

What I can't do is continue down the path without knowing a root cause from either Dell or Microsoft.

Hoping someone here has experienced a similar issue.
 

Sielbear

Member
Aug 27, 2013
Dell chassis, LSI RAID card, Server 2012 R2.

Hoping someone here has experienced a similar issue.
To clarify, I mean the higher-end Dell-branded RAID cards, which are manufactured by LSI... I realized I could be confusing the issue here.
 

Sielbear

Member
Aug 27, 2013
To clarify, we are seeing Event ID 153 from the source "Disk". This is purely an abstracted, logical warning in Windows, and this error was not previously being logged. From what I can tell, it is only triggered when the miniport (hardware) driver underneath the storport driver encounters an error. Up until 2008 R2 (or maybe 2012 - I've done a lot of reading on this error, but quite frankly lost some of the details), Windows did not capture these errors; only if the error was significant enough would the storport driver trigger an alert.
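
If anyone wants to check their own hosts for these, something like this should pull the most recent Disk 153 entries out of the System log (run in an elevated PowerShell prompt on the physical host; adjust MaxEvents to taste):

# list the 20 most recent Event ID 153 entries from the System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 153 } -MaxEvents 20 |
    Format-List TimeCreated, ProviderName, Message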

I'd agree with you on the possible physical disk issue, however, 1.) we're seeing the same symptoms on 2 identically configured servers and 2.) we've run multiple drive tests without detecting anything.

The event ID 153 will ONLY trigger (so far) when this Exchange VM is running on the associated physical host. There is some combination of VM, host OS, and disk subsystem that is consistently not playing nicely together.
 

Jeggs101

Well-Known Member
Dec 29, 2010
Yeah, my first thought was that you had a drive or backplane failing, but if it's happening on two hosts, that's crazy. I mean, Dell sells like a zillion Windows servers for Exchange with RAID cards, so it doesn't sound like you're doing anything unusual.

What kind of drives are you using? The R630 can take HDDs or SSDs. I've seen more than a few strange errors caused by poor SSD firmware.

With all that, if you can get a different server, that'd be my next move. When a specific config is showing failures across identical machines, testing a different config is worth a lot - especially if the R630s are dying on a weekly basis and causing client downtime. I'd get a third, maybe even non-Dell, server (HP, Lenovo, Supermicro) and see if you get the 153 error on the non-R630 as well. If you don't, you know it's a hardware issue Dell needs to fix. If you do see the same issue, then it's a Microsoft issue.

It might cost a lot to get a new server if you can't put the R630s to use, but if a client's e-mail is down for 8 hours they're going to be really unhappy. The cost of your team rebuilding and of the client going down is probably not too far off from a new box, depending on your licensing.

Another odd thought, but are you running full Windows Server as the base OS for licensing reasons? We've used Hyper-V Server for virtualization hosts just to keep the base OS and hypervisor footprint smaller.
 

Sielbear

Member
Aug 27, 2013
Thanks for all the replies...

We are using Dell SAS 10k drives - off the top of my head, I think they are the 600 GB drives. I want to say 10 in a RAID 10 config, so like 2.7 TB usable.
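
For what it's worth, if it really is 10 of the 600 GB drives, the math lines up:

  10 drives in RAID 10 -> 5 x 600 GB usable = 3,000 GB
  3,000 GB (decimal)   -> roughly 2.73 TB as Windows reports it (binary TB)

so "like 2.7 TB usable" checks out.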

We are running on the Dell C6100 currently (just a RAID 1 on SSDs) without incident. Now, it's only been 1 week, but that's the longest stretch we've seen without catastrophic data loss. No Disk 153 errors yet, no calls from the client. If we go another week, that pretty much confirms the problem is the combination of the R630 hardware + the Exchange VM.

We are running Server 2012 R2 - full install - with the Hyper-V role installed. That's always been our standard build, as the full install is easy to manage and all of our traditional tools run on the physical hosts. Plus, the combination of features Microsoft has packed into Hyper-V is just astounding.

That being said, this issue is causing some concerns.

I see the thoughts about new servers, etc.; however, my initial thought was to try ESXi and see what happens. The biggest problem is that we seem to be able to replicate the Disk 153 error only when the problematic VM is running on the box. So we either test with live client data - and that's not gonna happen, they are frustrated enough - set up a sandbox and restore the data / run the Microsoft load test app, or just give up and move to VMware...
 

Diavuno

Active Member
I too have been deploying more Hyper-V setups, mostly for my small clients who need an all-in-one box... but it hasn't been without pain.

It almost sounds like you found a hardware bug; from what I've read so far, you have done everything correctly.

I also agree with Jeggs, though - it might be time for another box. It's a hard chunk of change to swallow, but as an MSP I know sometimes it's worth it for client satisfaction. I had to eat a $2200 workstation a few months ago, and I've since recouped the cost on my monthlies... but ouch. On the other hand, they are already talking about my renewal and have referred me out since then.
 

TuxDude

Well-Known Member
Sep 17, 2011
In theory, I agree with this quote from the OP:

Prior to this episode, I'd have bet big money that this simply could not happen. My understanding was that just about anything could go wrong inside a VM without impacting the physical host.
And at least from a security perspective, I can't think of any attacks that have made the news by attacking hosts from within guests. And if attackers can't get at the host on purpose, it seems unlikely that you would be doing it by accident.

It sounds to me like you've found a bug, though there are still a few places it could be hiding - the kind of bug that is only triggered by a very specific sequence of events occurring in your exact configuration/workload. The workload is coming from the VM, but the bug is at a lower level and is thus affecting the entire host. My guess is that the issue is most likely in the RAID firmware or drivers. If that is the case, it isn't going to be easy for Dell or MS to reproduce the problem to debug it - good luck.
 

Sielbear

Member
Aug 27, 2013
It's been a while since I updated this thread, so I figured I'd close this out.

No root cause was ever determined. After 5 failovers and 2 recoveries in 2 weeks, we were out of options. We've been running the client on an emergency C6100 and have converted the 2 servers over to ESXi boxes. We will be performing a migration of the production load in the next couple of weeks (slowly). We will be implementing Veeam to provide replication capability between boxes as well as data backup.

I greatly appreciate all of the feedback here. This was one of the more frustrating experiences I've ever had in IT, and I'll always wonder in the back of my mind, "what was the root cause of the failure???"
 

Sielbear

Member
Aug 27, 2013
Updating this thread...

This saga has been continuing for months. The summary:

We moved to VMware - the problem continued, now in the form of PSODs.
Dell replaced the entire server - problems continued.
Since we are in IT, we had other clients who had ordered the same hardware... this problem continued to grow. We have 3 projects that have been half-installed on loaner hardware from us - older C2100 systems with older H700 controllers - and not a single glitch out of those boxes.

We've updated firmware / drivers / BIOS more times than I care to tell you.
We've had multiple cases open with Dell with no meaningful response. On a second server that was struggling, Dell swapped the motherboard and RAID card; the PSOD came back shortly thereafter. To say the team's belief in this platform was utterly shattered would be an understatement.

I finally reached out to a resolution manager and was at the point of demanding Dell buy all this crap back - it was patently unstable. I was also starting to think maybe we were doing something wrong on our end; I could not find another person facing what we were facing.

Now... if you look back at the top of my post, you'll see a link to a guy with an IBM System x server with oddly similar behavior. Different platform, but similar enough. I looked through my options in the BIOS, but I didn't see anything called exactly what that guy disabled. Makes sense, as it's a different OEM version of the LSI card... I started thinking maybe this option was either permanently enabled or not accessible.

Fast forward to this week. Dell resolution manager emails us the following:

I had a hardware escalation contact review the report Bryan and I just pulled and he recommended the steps below. This PSOD can be seen when a particular setting is enabled on the PERC and Windows guests are running:


1. Download the PERCCLI Utility here: Dell Product support

2. Extract and upload the vib to an ESXi datastore.

3. Run command on hypervisor: esxcli software vib install -v=<pathtovibfile>vmware-esx-perccli.vib --no-sig-check

a. This .vib install requires no reboot – will be confirmed with the False value to the Reboot Required output after the installation completes.

4. Once it is installed, navigate to /opt/lsi/perccli on hypervisor.

5. Run command: ./perccli /c0 show all

a. Identify the Virtual Disk number that relates to the VD.

b. Example from my lab that indicates my intended VD is number 1:

------------------------------------------------------------
DG/VD TYPE   State Access Consist Cache sCC Size       Name
------------------------------------------------------------
0/0   RAID1  Optl  RW     Yes     RWBD  -   558.375 GB
1/1   RAID10 Optl  RW     No      RWBD  -   5.455 TB
------------------------------------------------------------

6. Run command: ./perccli /c0 /vX set pi=off

a. Replace the X on /vX with the number of the Virtual Disk ID’d on step 5.

7. To validate current runtime settings, run command: ./perccli /c0 /vX show all


I do a search and can't quite make out what the command does / is supposed to do.

He follows up this morning with:
Please let us know how things went with the system and if you were able to disable the T10 errors? If you need assistance with this please let us know.


Wait a minute... T10?? The SAME terminology used a year ago on this IBM server? Shut the front door.

We are making this change now. Will report back as to stability.

To clarify, T10 is some advanced data protection scheme that is supposed to keep the system from experiencing silent data corruption. From my experience, it seems to lead to very loud data corruption (as opposed to silent).

The manual states this feature is not supported on 512 byte or 512e byte disks. I've not gone through all of our configs to try and determine the disks, but if I were a betting man, I'd guess we had some incompatibility somewhere.
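
If I get a chance, the same perccli utility should be able to show what the drives were actually formatted with - if I'm remembering the storcli-style output right, the physical drive listing has PI and SeSz (sector size) columns:

# list every physical drive behind controller 0; check the PI and SeSz columns
./perccli /c0 /eall /sall show

# more detail for a single drive (the enclosure/slot numbers here are just an example)
./perccli /c0 /e32 /s0 show all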
 

TuxDude

Well-Known Member
Sep 17, 2011
To clarify, T10 is some advanced data protection scheme that is supposed to keep the system from experiencing silent data corruption. From my experience, it seems to lead to very loud data corruption (as opposed to silent).
T10 is a technical committee that handles lots of the standards around SCSI, SAS, etc. - they do a LOT more than just data protection. Introduction to T10

The manual states this feature is not supported on 512 byte or 512e byte disks. I've not gone through all of our configs to try and determine the disks, but if I were a betting man, I'd guess we had some incompatibility somewhere.
That sounds like DIF, which requires 520-byte blocks on disk - the regular 512 + 8 bytes of protection data (think checksums). Just one of many standards that have come out of T10. Data Integrity Field - Wikipedia, the free encyclopedia
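
If you ever want to confirm whether a particular disk is actually formatted with protection information, sg3_utils can report it from a Linux box that sees the drive directly (it won't work through the PERC's virtual disks) - roughly:

# READ CAPACITY(16) reports whether PI is enabled and which type;
# replace /dev/sdX with the actual device
sg_readcap --long /dev/sdX
# look for "Protection: prot_en=..., p_type=..." in the output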