My ESXi server completely locks up at times

ncarty97

New Member
Apr 4, 2016
Hello everyone! I came across your forum while trying to find a solution to a problem I'm having with my server, and I thought I'd make a post to see if anyone here has any ideas.

Here is my setup:
Motherboard - ASRock 970 Extreme4
CPU - AMD FX-8320
Memory - 32GB G.Skill Ripjaws X Series (4 x 8GB DDR3-1600 PC3-12800, F3-1600C9Q-32GXM)
Drive Controller - IBM M1015 flashed to LSI SAS2008
NIC - Intel PRO/1000 Dual Port Server Adapter (disabled the onboard Realtek NIC)

On the M1015, I'm running three 4TB Seagate NAS ST4000VN000 drives, two 3TB Seagate drives (not sure of the model), and two 2TB Western Digital Green drives. I have a couple of 256GB SSDs and a 750GB WD drive plugged into the SATA controllers on the motherboard.

Here's how the VMs break out:
WHS2011 - 8GB RAM, 4 cores (1 socket), passed through the M1015 (with its 8 disks), plus two virtual disks (one on an SSD for the OS, one on the 750GB for torrents). I am running StableBit DrivePool. I was running FlexRAID (tried SnapRAID too), but I'll get to that later.
Windows 7 VM - hosts an Emby server and I use this for playing around - 8GB RAM, 4 cores (1 socket), 1 virtual disk on an SSD
Windows 7 VM 2 - hosts a Plex server - 8GB RAM, 4 cores (1 socket), 1 virtual disk on an SSD
Windows XP VM - 1GB RAM, 1 core, 1 virtual disk on an SSD (not usually running)

All three main VMs use the VMXNET 3 network adapter type.

So here is what happens. Occasionally, the server completely locks up. I cannot see it on my network (neither the host nor any of the VMs). If I plug a monitor into the box, it seems like the host is still working (I can plug in a keyboard and do the few things you can do that way), but I have no way of telling whether the VMs are still running and just disconnected from the network, or whether some or all of them are locked up. The only way to get back up and running is a hard reboot of the whole box.

I initially thought this was a networking issue, so I tried a couple of things. I tried different types of virtual network adapters. I then went and bought the Intel NIC I'm using now (I had been using the onboard Realtek despite warnings that ESXi doesn't play nice with them). It didn't change anything.

I was using FlexRAID for parity redundancy. I noticed the server was locking up often when I made big changes (like ripping a new Blu-ray to the server), so I figured it might be something with FlexRAID. I ditched it for SnapRAID for a time, but the problem remained. I finally turned off both and just run FlexRAID manually once in a while to make sure I'm covered. It's hit and miss whether it locks up.

I did go to the VMware forums for help, but I never really got anywhere. I posted my logs, but no one could really point me to what I should be looking for. The only thing I took from it was that it was an I/O issue relating to the drives, not a problem with the NICs (real and/or virtual). I have noticed since that it does seem to happen more when there are lots of reads and writes (like if I'm doing a bunch of torrents at the same time).

Beyond that, I'm stumped. I've upgraded to ESXi 6.0 (from 5.5) hoping that might make a difference, but it hasn't. Outside of this problem, I've really liked ESXi, but my inability to solve it is making me think about ditching it. I'm by no means a power user. I looked at Windows Server Hyper-V, but frankly I found it too confusing to get straight. Tonight I was looking at setting up Windows 10 as the host with Hyper-V enabled and just doing that, but I'm not sure that would be a good solution.

So, any ideas on how I can fix my ESXi setup? Any help would be greatly appreciated!
 

pricklypunter

Well-Known Member
Nov 10, 2015
Things to check would be CPU temp, RAM, power supply, and the configuration of your entire LAN to rule out IP conflicts, etc. :)
 

Patrick

Administrator
Staff member
Dec 21, 2010
I had a similar strange lockup issue on an older version of ESXi (<5 I believe) and it was due to a dying hard drive.
 

Danic

Member
Feb 6, 2015
I have noticed since that it does seem to happen more when there are lots of reads and writes (like if I'm doing a bunch of torrents at the same time).
I would check the 750GB drive to see if it's healthy. Would a simple disk benchmark cause the system to come to a halt?
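If you have SSH enabled on the host, you can also pull SMART data for locally attached drives straight from ESXi. A minimal sketch, assuming ESXi 5.1 or later (the device identifier below is just a placeholder; get the real ones from the list command):

    # List attached devices to find their identifiers (naa.xxxx / t10.xxxx).
    # Drives behind the passed-through M1015 won't show up here, but the
    # 750GB on the motherboard SATA controller should.
    esxcli storage core device list | grep -i "Display Name"
    # Dump SMART attributes for a given device (placeholder ID shown)
    esxcli storage core device smart get -d naa.50014ee0aaaaaaaa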
What is ESXi installed on?
I had a Gigabyte 990FX-UD3 as my ESXi server for a while, and my SAS2008 card only worked in the main PCIe 16x slot. I used a PCI video card in that system.

Also remember that an FX-8320 is basically a 4-core with hyper-threading (I know that's not exactly right, but that's how I treat them), so change all your VMs to 1 core/1 socket. I bet someone can explain it better, but if a VM guest is provisioned 4 cores and the guest OS wants to use all 4 threads, the guest OS has to wait for ESXi to say, 'Go ahead, I stopped all other workloads so you can run on all 4 threads.' This makes the system unresponsive, and other guest VMs wait until the work is complete.
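You can also watch for this from the host. Not a fix, just a way to confirm the theory: esxtop's CPU view has a %CSTP (co-stop) column, and a VM sitting at a consistently high value there is waiting for its vCPUs to be scheduled together:

    # Run from an SSH session on the host, then press 'c' for the CPU view.
    # Watch the %CSTP column: values consistently above a few percent for a
    # VM suggest its vCPUs are stuck waiting to be co-scheduled.
    esxtop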
 

CreoleLakerFan

Active Member
Oct 29, 2013
So, any ideas on how I can fix my ESXi setup? Any help would be greatly appreciated!
You mentioned you upgraded from 5.5 to 6.0, but you did not specify whether you had applied any patches. I had some random lockups, including the occasional PSOD, when I first deployed my ESXi 6.0 AIO server about six weeks ago. I used ESXi 6.0 Update 1 for my install. I applied four patches: 10/15, 11/15, 01/16, and 02/16, and the stability issues were resolved. FWIW, there is a March 2016 patch as well that came out on 3/15/2016 - might as well install that one too.

The notes for one of the patches (I don't remember which one) specifically mentioned the ESXi host becoming unresponsive. You will have to install the patches manually unless you have vCenter. A manual install involves transferring the patch bundle to the ESXi host, opening an SSH session, and executing an esxcli command (esxcli reads the zip bundle directly). Relatively simple stuff - it sounds like you have enough technical acumen to pull it off.
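The whole process is roughly the following; the datastore path and bundle name are examples only, so substitute the patch zip you actually downloaded:

    # Shut down the VMs and put the host in maintenance mode first
    esxcli system maintenanceMode set --enable true
    # Copy the patch bundle to a datastore (WinSCP/scp), then apply it;
    # path and filename here are examples only
    esxcli software vib update -d "/vmfs/volumes/datastore1/ESXi600-201603001.zip"
    # Reboot the host once the update reports success
    reboot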

Here is a nice summary page of patches/updates:

VMware ESXi 6.0.0 Patch History
 

ncarty97

New Member
Apr 4, 2016
Thanks for all the quick responses!

Things to check would be CPU temp, RAM, power supply, and the configuration of your entire LAN to rule out IP conflicts, etc. :)
Well, I think I can rule out the RAM. I upgraded it a while back, going from 16GB to 32GB (I pulled the original 16GB and put it in my desktop), and I was having the problems both before and after. I think I can rule out the LAN too, as I moved last summer and switched out my router. I've checked for IP conflicts and found nothing. CPU temp is about the only thing I haven't checked.

I would check the 750GB drive to see if it's healthy. Would a simple disk benchmark cause the system to come to a halt?
I'll give this a try. The 750GB isn't passed through directly; it holds a virtual disk taking up about 500GB of it, so I can't get SMART data inside the VM. That said, I set it up this way because torrents and FlexRAID didn't go well together (and I was short on space in my main DrivePool). That's not a problem anymore, so I may just move the torrent files to the DrivePool and remove the 750GB drive altogether for now.

All the drives on my M1015 show as fine in the SMART data.

What is ESXi installed on?
USB flash drive

I had a Gigabyte 990FX-UD3 as my ESXi server for a while, and my SAS2008 card only worked in the main PCIe 16x slot. I used a PCI video card in that system.
I'll check which slot it's in; I can't remember at this point.

Also remember that an FX-8320 is basically a 4-core with hyper-threading (I know that's not exactly right, but that's how I treat them), so change all your VMs to 1 core/1 socket. I bet someone can explain it better, but if a VM guest is provisioned 4 cores and the guest OS wants to use all 4 threads, the guest OS has to wait for ESXi to say, 'Go ahead, I stopped all other workloads so you can run on all 4 threads.' This makes the system unresponsive, and other guest VMs wait until the work is complete.
My understanding was that the host managed how this worked, and that the whole concept of virtualization was designed around allowing a setup like mine. I tried the Plex VM with 1 core and it flat out didn't work. If I have to go down to one core on each VM, it seems like I'd be better off just ditching virtualization altogether.

You mentioned you upgraded from 5.5 to 6.0, but you did not specify whether you had applied any patches. I had some random lockups, including the occasional PSOD, when I first deployed my ESXi 6.0 AIO server about six weeks ago. I used ESXi 6.0 Update 1 for my install. I applied four patches: 10/15, 11/15, 01/16, and 02/16, and the stability issues were resolved. FWIW, there is a March 2016 patch as well that came out on 3/15/2016 - might as well install that one too.
I believe when I did the upgrade, it included the February patch. I'll have to check. I know I applied some patches, but I'm not sure exactly which ones.
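(For what it's worth, I found that you can check exactly what's installed from an SSH session; these are the commands I came across, so correct me if there's a better way:)

    # Show the ESXi version and build number
    vmware -vl
    # Show the installed image profile, which reflects the patch level
    esxcli software profile get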

The notes for one of the patches (I don't remember which one) specifically mentioned the ESXi host becoming unresponsive. You will have to install the patches manually unless you have vCenter. A manual install involves transferring the patch bundle to the ESXi host, opening an SSH session, and executing an esxcli command (esxcli reads the zip bundle directly). Relatively simple stuff - it sounds like you have enough technical acumen to pull it off.
I believe that's how I did it last time. I'm still pretty unfamiliar with doing things over SSH, but I can follow directions if they're laid out well!

Thanks again for the ideas! I'll give them all a try.
 

ncarty97

New Member
Apr 4, 2016
Well, I have narrowed it down to a hardware issue. The 750GB passes all its checks, so that drive is OK. One of my 3TB drives seems to have a dead sector, but as I said, this problem predated those drives.

I ran a test: I pulled out my ESXi flash drive, put in a 120GB SSD, and installed Windows 10 on it. I had some other spare drives, so I was able to convert the ESXi virtual disks to Hyper-V disks and essentially migrate everything over, passing the individual disks on the SAS2008 through to the WHS2011 VM, since you can't do PCIe passthrough on Hyper-V.
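For anyone wanting to do the same, a generic disk-image tool like qemu-img can convert a VMDK to a Hyper-V VHDX. This is just one way to do it (file names below are examples), not necessarily the exact tool I used:

    # Convert an ESXi virtual disk (VMDK) to Hyper-V's VHDX format;
    # -p shows progress, file names are examples only
    qemu-img convert -p -f vmdk -O vhdx WHS2011-os.vmdk WHS2011-os.vhdx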

I ran it for a week, and on Sunday night, total lockup! I was running a StableBit drive scan during that period, so it was fairly disk-intensive. Maybe I was wrong before about being able to plug in a keyboard and have the host respond. This time it was completely frozen mid-screensaver. I haven't seen a PC do that since my old 486SX used to do it in some games!

So, going through my hardware, the only things that have not been replaced (outside of the drives I was able to test) are the CPU, the motherboard, and the SAS2008.

The good news is that my desktop PC has the exact same motherboard and CPU! (Long story, but I thought I had fried my CPU and board right after buying them, so I bought replacements, then realized I had just made a dumb error.) So, this weekend I'm going to swap the CPU and motherboard and see if that changes anything. Basically, I'll just leave it there until it locks up. If it doesn't, problem fixed! If it does, I guess I'll swap out the SAS2008 for some other card(s) and see if that fixes it.
 

ncarty97

New Member
Apr 4, 2016
Do you have a fan on (or blowing on) your M1015? There's nothing in the slots around mine, and the case seems to have pretty good ventilation (an Antec 900).
 

pricklypunter

Well-Known Member
Nov 10, 2015
I don't have a fan directly attached to mine, but it sits in a direct, unobstructed line of sight from one of the fans on the fan wall :)
 

ncarty97

New Member
Apr 4, 2016
Well, I believe the issue is fixed. After my last post, the problem got worse and worse. I had added a couple of SSDs so each VM could be on its own SSD, all plugged into the SATA ports on the motherboard. One particular VM seemed to trigger the lockup, and I realized that since my initial test I had moved it from the 750GB drive to one of the SSDs. I did a little testing and moved it around: it was fine on the 750GB (mostly), but on the SSD, total lockup pretty much as soon as the VM booted. So, I swapped the ports the SATA cables were connected to on the motherboard, and the situation reversed: now the SSD was stable (mostly) and the 750GB was an instant crash.

So, I swapped in the motherboard and CPU from my desktop. I also upgraded my case (I ran out of room for drives in the old one), and it's a lot roomier inside, so if the M1015 was having any heat issues, it should be fine now. It's now running again, with all three of my main VMs on the SSDs: 12 hours with everything running and no issues. Assuming it's still looking good tonight, I'm going to run my FlexRAID parity update and see how that does. That was guaranteed to lock it up before. If that passes, I'll consider the issue solved!