Gang-
I have an odd problem that neither Microsoft nor Dell has been able to resolve yet. I know many of you have a lot of experience and knowledge on both the hardware and the software side of the equation. This is partly a post of desperation, partly a check to see if anyone has experienced anything like what we are seeing.
In late November, we installed 2 new, identical Dell R630 servers with the H730 RAID controller. The RAID array is a single RAID 10. The OS is Server 2012 R2 with only the Hyper-V role installed - nothing else on the physical hosts. All updates are installed / applied. We've tested with AV and without AV - no difference.
The servers ran for a couple of months with no issues at all. To be clear, we installed them and followed a very methodical migration plan for this client. We spun up a new dedicated DC for them first and let it run for a couple of weeks. We then set up Exchange and migrated their Exchange data from the old server to the new, clean install of Exchange 2010. The migration was uneventful, and in early January we shut down the Exchange services and uninstalled Exchange from the old server.
The week of Jan 18th, we got an event log message on physical host #1 - Event ID 153 from the disk source, a logical block timeout. We contacted Dell, and they informed us (after running the usual checks) that the hardware was perfectly fine and that we could disregard the message. It wasn't filling the event log or anything, so they determined it was a fluke.
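(For anyone who wants to compare notes on their own hosts: below is a rough sketch of one way to pull recent Event ID 153 entries off a host, shelling out to the built-in wevtutil tool with an XPath filter. This isn't anything Dell or Microsoft gave us - the lookback window and event count are arbitrary placeholders, adjust to taste.)

# Rough sketch: list recent Event ID 153 entries ("disk" source) from the
# System log on a Hyper-V host. Uses only the built-in wevtutil tool.
import subprocess

LOOKBACK_MS = 7 * 24 * 60 * 60 * 1000  # last 7 days - arbitrary window

# XPath filter: Event ID 153 from the "disk" provider within the window.
query = (
    "*[System[Provider[@Name='disk'] and (EventID=153) and "
    f"TimeCreated[timediff(@SystemTime) <= {LOOKBACK_MS}]]]"
)

result = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{query}", "/f:text", "/rd:true", "/c:20"],
    capture_output=True, text=True, check=True,
)

print(result.stdout or "No Event ID 153 entries in the lookback window.")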
Tuesday, January 26th (2-4 weeks after finalizing the migration), we received a call from the client that Exchange wasn't working - Outlook was showing disconnected. We connected to the physical host and saw the VM was not running. Two other servers (prepared for data, but no data moved yet) were running in addition to the DC on this box. We attempted to start the Exchange VM and received a message about the VHD being on storage that did not support clustered sharing - odd, since this is direct attached storage that has never been in a cluster. After troubleshooting for 20-30 minutes, we got Microsoft involved. Eventually they recommended we reboot the physical host.
Upon reboot, we received the generic message that no operating system was found. Luckily we had set up replication, so we failed over to Server #2 and attempted a repair from the Windows boot media. The D: volume was seen as RAW - so pretty useless.
Dell and Microsoft asked for time to research the issue, and we let them work on it. Two days later, on the 28th, we received a Disk 153 error on Server #2. Shortly thereafter, we received reports that Exchange was acting up again. Same story as above - an eventual request to reboot, and we saw the message that no OS was detected. This time we had to do a full restore from backup. By the time it was completely rebuilt and up and running, it was an 8-12 hour impact to the client.
Thursday, during the rebuild process, Dell upgraded the firmware on both RAID cards - they determined the firmware had crashed multiple times on Server #1. During the day Friday, Dell determined that Server #1 did in fact have a bad RAID controller. We replaced the RAID controller in Server #1 and verified all was well - replication re-enabled from Server #1 to Server #2. We also installed an old server from our office as an emergency box (our old C6100, in fact).
Monday morning at 6:30 AM we received reports that mail was acting up again. Around the same time, our monitoring agent alerted us to a Disk 153 error on Server #1 (hosting Exchange). We failed over to Server #2 and got them back up. We then set up replication to our emergency server, since we were tired of rebuilding these things.
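(Side note, in case it saves someone a scramble: for anyone unfamiliar with Hyper-V Replica, the unplanned failover boils down to roughly the sequence below, here wrapped in a small Python helper around the Hyper-V cmdlets. The VM name 'EXCH01' is a placeholder, and this runs on the Replica server only after the primary is already down - a sketch, not a polished script.)

# Rough sketch of an unplanned Hyper-V Replica failover, run on the Replica
# server after the primary host has gone down. VM name is a placeholder.
import subprocess

VM_NAME = "EXCH01"  # placeholder - substitute the real Exchange VM name

def ps(command: str) -> None:
    """Run a PowerShell command and fail loudly if it errors."""
    subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        check=True,
    )

# Fail over to the replica copy (unplanned failover; only valid while the
# primary is unavailable).
ps(f"Start-VMFailover -VMName '{VM_NAME}' -Confirm:$false")

# Start the recovered VM on this host.
ps(f"Start-VM -Name '{VM_NAME}'")

# Once the recovery point looks good, commit the failover.
ps(f"Complete-VMFailover -VMName '{VM_NAME}' -Confirm:$false")

Reversing replication later (Set-VMReplication -Reverse) is a separate step once the original host is healthy again.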
On Friday of that week, Feb 4, we received a Disk 153 error on Server #2, along with reports of mail acting up. We failed them over to our emergency hardware, and all has been running fine for the past week.
So I have 2 servers that seem to completely flake out ONLY when the Exchange VM is running on that physical host. Each time, we log a Disk 153 error and it's all downhill from there. Microsoft is blaming Dell, and Dell is blaming Microsoft.
After lots of searching, I found the following post that is... similar enough to give me pause:
https://social.technet.microsoft.com/Forums/windowsserver/en-US/cec8c172-4390-45f1-83a9-578dc0655627/hyperv-host-and-all-guests-become-unresponsive-event-id-153-fills-system-log?forum=winserverhyperv
Prior to this episode, I'd have bet big money that this simply could not happen. My understanding was that just about anything could go wrong inside a VM without impacting the physical host. I am now completely lost as to how to proceed. I have 2 more server migrations lined up (one in my own office), but I have no confidence in the hardware + software we've spec'd at this point. There's clearly something very wrong with this combination of VM load, Hyper-V, RAID controller, etc.
I'm at my wit's end, and I'm contemplating moving to VMware + Veeam for replication and backup. I really don't know what else to do.
ANY thoughts, suggestions, tangential experiences are welcome. Thank you for reading this super long post.
Alan