Slow power on of VMs...


JimPhreak

Active Member
Oct 10, 2013
I just built a new ESXi 5.5 server and I'm noticing a weird issue when I manually power on a guest VM.

Every time I manually power on a VM, the process takes exactly 45 seconds to complete. I'm referring to the time the VMware task itself takes to complete, not how long it takes the guest OS to start up. After that 45 seconds I then see the OS booting normally in the console.

However, when I power down a VM the task completes immediately. Since it takes exactly 45 seconds for the task to complete every time, I'm assuming it's a setting somewhere, but I've never seen this before on my previous ESXi server builds.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
How much memory does the guest have and what sort of storage are you powering up on? Are the discs busy when you hit the power button? I suspect you're seeing a delay as the hypervisor initialises the .vswp files...
 

JimPhreak

Active Member
Oct 10, 2013
How much memory does the guest have and what sort of storage are you powering up on? Are the discs busy when you hit the power button? I suspect you're seeing a delay as the hypervisor initialises the .vswp files...
The guests have 8GB, 4GB, and 4GB of RAM dedicated to them. The VMs are stored on a Samsung SM951 M.2 512GB SSD.

I just find it strange that the delay is EXACTLY 45 seconds every time.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
If you look in the vmware log file it should give you some indication of what it's doing or waiting on when you hit the power button...

If you power on the VM from the command line do you still have the same delay?
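
If you've not poked at it before, something along these lines from the ESXi shell (SSH or local console) should do it - the VM ID of 5 below is just a placeholder you'd get from the first command:

```
vim-cmd vmsvc/getallvms            # list VM IDs alongside their names and .vmx paths
vim-cmd vmsvc/power.getstate 5     # confirm the current power state of VM ID 5 (placeholder)
vim-cmd vmsvc/power.on 5           # power it on and see whether the same ~45s pause shows up
```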
 

JimPhreak

Active Member
Oct 10, 2013
553
55
28
If you look in the vmware log file it should give you some indication of what it's doing or waiting on when you hit the power button...

If you power on the VM from the command line do you still have the same delay?
I'm not familiar with VMware's command line. Here is what the log file shows.

VMware log.txt - Google Drive

This is from the moment I started the power-on task, and you'll see I left two spaces at the moment the task completed.
 

JimPhreak

Active Member
Oct 10, 2013
553
55
28
Tried moving a VM from the SM951 to a datastore on one of my Intel 730s to see if it was something to do with the M.2 drive. Unfortunately, it does the exact same thing regardless of which drive the guest VM is on.
 

Entz

Active Member
Apr 25, 2013
Are you doing any replication, or is the VM pointing to a host CD drive that may have a non-bootable disc in it, etc.?
 

JimPhreak

Active Member
Oct 10, 2013
553
55
28
Are you doing any replication, or is the VM pointing to a host CD drive that may have a non-bootable disc in it, etc.?
No replication. No host CD drives. They are all set to client device under CD/DVD drive.
 

whitey

Moderator
Jun 30, 2014
The guests have 8GB, 4GB, and 4GB of RAM dedicated to them. The VMs are stored on a Samsung SM951 M.2 512GB SSD.

I just find it strange that the delay is EXACTLY 45 seconds every time.
Something is definitely wrong; my VMs take 5-7 seconds to fully boot on a ZFS hybrid pool via 10G NFS. I've never had issues like this. It sounds disk I/O related, though with 3 VMs on a decent SSD you wouldn't think so...odd.

Check esxtop and look at your 'd' (adapters) page (vmhba## if HBA/RAID card/FC/iSCSI/FCoE) and your 'u' (disks) page/output.

In the 'u' (devices) page, check naa.obscureWWNnumber for direct-attached/DAS disks ({NFS}datastore-name for NFS) and check the DAVG/KAVG/GAVG datastore latencies (NFS will only yield the guest average/GAVG). If you see spikes/peaks above 15 ms you've got issues. Also check the logs as stated (specifically vmkernel.log).

Something is amiss.
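
If it's a pain to watch live, you can also grab a batch capture from the host shell and dig through it afterwards - rough sketch, and the output path is just an example:

```
# -b batch mode, -a all counters, -d 2-second sample interval, -n 30 samples (~1 minute)
esxtop -b -a -d 2 -n 30 > /tmp/esxtop-poweron.csv
# Kick off the VM power-on while this is running, then copy the CSV off the host
# and eyeball the DAVG/KAVG/GAVG columns for the datastore's device.
```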
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
For log files, there's the vmware.log file stored within the VMFS datastore that contains data pertinent to that sole VM, plus you'll also want to check the multitude of logs in /var/log/ (a bevvy of terminal windows and tail -f is your friend); just make sure that your system time is in sync with your workstation's via NTP or somesuch first so that the log entries are easier to correlate with the events you're seeing. Can't see files on Google Drive et al. whilst I'm at work, sadly.

If your host is resource constrained you'll often see a delay booting VMs since the hypervisor has to evacuate guest memory from running VMs into the .vswp files, then initialise the .vswp files for the VM you're booting. When you say they have XGB "dedicated" to them, do you mean that they all have a reservation set in the resources tab? Or do they have the unlimited option button ticked? Does this host have enough RAM to give them all 100% of their allocated memory plus overheads and the service console?

But it's strange that it's always exactly 45s, which, as Entz suggests, sounds like the VMs are waiting on an I/O request to time out before completing. Are these VMs you've brought over from another host or were they clean builds on this tin?
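
For what it's worth, roughly what I mean from an SSH session on the host (one tail per terminal window; the <datastore>/<vm> bits are placeholders):

```
date -u                                             # sanity-check the host clock before correlating entries
tail -f /var/log/vmkernel.log                       # kernel-level events (storage, devices)
tail -f /var/log/hostd.log                          # host agent log; power-on tasks show up in here
tail -f /vmfs/volumes/<datastore>/<vm>/vmware.log   # the per-VM log for the guest you're powering on
```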
 

JimPhreak

Active Member
Oct 10, 2013
553
55
28
For log files, there's the vmware.log file stored within the VMFS datastore that contains data pertinent to that sole VM, plus you'll also want to check the multitude of logs in /var/log/ (a bevvy of terminal windows and tail -f is your friend); just make sure that your system time is in sync with your workstation's via NTP or somesuch first so that the log entries are easier to correlate with the events you're seeing. Can't see files on Google Drive et al. whilst I'm at work, sadly.

If your host is resource constrained you'll often see a delay booting VMs since the hypervisor has to evacuate guest memory from running VMs into the .vswp files, then initialise the .vswp files for the VM you're booting. When you say they have XGB "dedicated" to them, do you mean that they all have a reservation set in the resources tab? Or do they have the unlimited option button ticked? Does this host have enough RAM to give them all 100% of their allocated memory plus overheads and the service console?

But it's strange that it's always exactly 45s, which, as Entz suggests, sounds like the VMs are waiting on an I/O request to time out before completing. Are these VMs you've brought over from another host or were they clean builds on this tin?
Hi Effra,

Thanks for the reply. To answer some of your questions:

1) In regard to how much RAM is dedicated to my VMs, they are all set as such:

[screenshot of the VM's resource allocation settings]
So yes, the Unlimited box is checked.

2) I currently have 4 active VMs that have the following RAM allocations (4GB, 4GB, 8GB, 4GB). I also have 2 inactive VMs that have 4GB allocated to them but whether these are on or off, nothing changes with regard to the slow power on of any VM. My host currently has 32GB of RAM in it.

3) 2 of the 4 active VMs were brought over from another host (restored through Veeam backup) but the other 2 are fresh installs on this host.


Is there anything in particular I should be looking for in these logs that might stand out? I've never really had to review VMware logs before so I'm not even really sure what I'm looking at/for.

Also, for some strange reason the time in my logs has been off ever since I set up NTP. My host shows the correct time when I go to Time Configuration, and it matches my workstation's, but in the logs the time shows as being 4 hours ahead.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
1) Think you've actually included the screenie from the CPU tab there rather than memory :) But if you haven't been tweaking settings, out-of-the-box VMware will leave those set at "unlimited" with no reservation. The reservation size affects the amount of space given to the .vswp files when you're running in a resource-constrained scenario.

Sounds like you shouldn't be running into any resource problems or SCSI stuff dangling off into nowhere especially if both the new and old VMs have the same problem.

Looking at vmware logs is a chore at the best of times (I swear they design them to be as "you should buy a support contract"-readable as possible), hence why I'd recommend getting on the shell of the host via SSH, clearing out the logs so that they're all as empty as possible, and seeing if you can spot what it's doing. whitey's comment about esxtop is valid as well - it'd be handy to see if there's a period of disc activity or a CPU bottleneck or similar before the power-on event happens.
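
Something like this from the host's shell is the sort of thing I have in mind (the <datastore>/<vm> bits are placeholders):

```
# The .vswp should be roughly (configured RAM - reservation) in size,
# and it only exists while the VM is powered on.
ls -lh /vmfs/volumes/<datastore>/<vm>/*.vswp

# Hit the power button, then follow the freshly-created per-VM log for that boot
# (ESXi rotates the old vmware.log to vmware-N.log at power-on).
tail -f /vmfs/volumes/<datastore>/<vm>/vmware.log
```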
 

JimPhreak

Active Member
Oct 10, 2013
553
55
28
1) Think you've actually included the screenie from the CPU tab there rather than memory :) But if you haven't been tweaking settings, out-of-the-box VMware will leave those set at "unlimited" with no reservation. The reservation size affects the amount of space given to the .vswp files when you're running in a resource-constrained scenario.

Sounds like you shouldn't be running into any resource problems or SCSI stuff dangling off into nowhere especially if both the new and old VMs have the same problem.

Looking at vmware logs is a chore at the best of times (I swear they design them to be as "you should buy a support contract"-readable as possible), hence why I'd recommend getting on the shell of the host via SSH, clearing out the logs so that they're all as empty as possible, and seeing if you can spot what it's doing. whitey's comment about esxtop is valid as well - it'd be handy to see if there's a period of disc activity or a CPU bottleneck or similar before the power-on event happens.
You're right about the screenshot; I just fixed that in my previous post.

As for the logs, I'll go ahead and clear them and see what I can find.
 

JimPhreak

Active Member
Oct 10, 2013
553
55
28
Ok, I finally have some time to troubleshoot this issue. I've been busy getting my storage VM going, since preclearing 8TB drives takes a LONG time and I wanted to get that started first.

With regard to esxtop, what should I be looking for? I've shut down all my non-essential VMs so the only ones running are pfSense, Media Server, and Storage Server. When I click to start up a VM, are there some indicators I should be looking for specifically in esxtop?

I've never really had to troubleshoot ESXi as I'm sure you can tell.
 

TuxDude

Well-Known Member
Sep 17, 2011
I usually look mostly at the latency numbers when using esxtop to dig into IO issues. I like to look at storage from the disk device perspective (hit 'u' while in esxtop), and make sure the latency fields are enabled (hit 'f' to get the field list, and 'i' to enable the * next to 'Overall Latency Stats', or 'j' or 'k' if you specifically want read or write latencies). I also usually turn off any fields I'm not interested in, otherwise the display quickly gets too wide and I can't see the right-most fields, and hit 's', '2', '<enter>' to increase the refresh rate to 2 seconds (as fast as it can go). You will get 4 fields for latency, DAVG, KAVG, GAVG, and QAVG, which correspond to Device latency, Kernel, Guest, and Queue. Of those, Device is how much latency there was going over the wire to the storage and coming back (whether that is local SATA/SAS, or remote iSCSI, FC, etc.), Kernel is the time spent in the hypervisor kernel, and Queue is time spent in the IO queue - add all those together and you get the Guest latency, which is what the guest OS sees for total latency.

I usually like to keep all my latencies under 10 ms, occasional short spikes into the 20s are usually ok (often associated with a large batch of IOs from somewhere that gets the queues going), but if it stays above 20 for more than a second or two or is doing it often then the chance of users complaining about slowness is pretty high - with latency under 10 we never hear a complaint.

If you need to figure out which VM is causing the high IO, then switch over to the 'disk VM' perspective ('v' key).
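
Condensed into a quick cheat sheet for the host shell (same keys as described above):

```
esxtop           # run on the host (SSH or local shell)
# u              -> disk device view
# f              -> field selector; toggle 'i' for overall latency stats
#                   ('j'/'k' instead if you only want read or write latency)
# s 2 <enter>    -> 2-second refresh, the fastest esxtop goes
# v              -> disk VM view, to see which VM is generating the IO
# DAVG (device) + KAVG (kernel) + QAVG (queue) add up to GAVG (what the guest sees)
```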
 

JimPhreak

Active Member
Oct 10, 2013
553
55
28
I usually look mostly at the latency numbers when using esxtop to dig into IO issues. I like to look at storage from the disk device perspective (hit 'u' while in esxtop), and make sure the latency fields are enabled (hit 'f' to get the field list, and 'i' to enable the * next to 'Overall Latency Stats', or 'j' or 'k' if you specifically want read or write latencies). I also usually turn off any fields I'm not interested in, otherwise the display quickly gets too wide and I can't see the right-most fields, and hit 's', '2', '<enter>' to increase the refresh rate to 2 seconds (as fast as it can go). You will get 4 fields for latency, DAVG, KAVG, GAVG, and QAVG, which correspond to Device latency, Kernel, Guest, and Queue. Of those, Device is how much latency there was going over the wire to the storage and coming back (whether that is local SATA/SAS, or remote iSCSI, FC, etc.), Kernel is the time spent in the hypervisor kernel, and Queue is time spent in the IO queue - add all those together and you get the Guest latency, which is what the guest OS sees for total latency.

I usually like to keep all my latencies under 10 ms, occasional short spikes into the 20s are usually ok (often associated with a large batch of IOs from somewhere that gets the queues going), but if it stays above 20 for more than a second or two or is doing it often then the chance of users complaining about slowness is pretty high - with latency under 10 we never hear a complaint.

If you need to figure out which VM is causing the high IO, then switch over to the 'disk VM' perspective ('v' key).
I'm pretty convinced at this point that it's not a disk I/O issue. I did exactly as you asked, and during the 45 seconds between when I try to turn on a VM and when the task completes (tried this 5 times), the highest value I got on the Samsung SM951 (VM datastore) was 1.94 for DAVG, which was probably from one of my active VMs, and that was only for a split second. It almost seems like the disk isn't even being accessed for those 45 seconds while the power-on task is stuck at "In Progress."

I think I'm going to have to start diving into the logs. FUN!
 

TuxDude

Well-Known Member
Sep 17, 2011
I'm pretty convinced at this point that it's not a disk I/O issue. I did exactly as you asked, and during the 45 seconds between when I try to turn on a VM and when the task completes (tried this 5 times), the highest value I got on the Samsung SM951 (VM datastore) was 1.94 for DAVG, which was probably from one of my active VMs, and that was only for a split second. It almost seems like the disk isn't even being accessed for those 45 seconds while the power-on task is stuck at "In Progress."

I think I'm going to have to start diving into the logs. FUN!
Logs sound like the next logical place to check (I'd check CPU usage while still in esxtop, but I really doubt it's the problem). I've found for issues like this the best log to look in is also the one that's hardest to get at (or maybe there's an easier way that I haven't found yet) - from a shell session on the host (local or SSH) cd into /vmfs/volumes/<datastore-name>/<vm-name>/ and there you will find the log files for that specific VM; the current one (or most recent, if the VM is off) is named vmware.log. Note that <vm-name> in that path doesn't get renamed if you have renamed the VM (until you storage-vMotion it), so it's possible it won't match. Also, if you are using shared storage with multiple hosts, the current log file for a running VM will be locked by the host running it - your shell session must be on the same host if you want to read that file.
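
If the directory name doesn't match a renamed VM, something like this from the host shell will show you where the .vmx (and therefore the vmware.log) actually lives - the paths below are placeholders:

```
vim-cmd vmsvc/getallvms                   # each VM's ID, display name, and .vmx path on its datastore
cd /vmfs/volumes/<datastore>/<vm-dir>/    # use the directory shown in the getallvms output
ls -l vmware*.log                         # vmware.log is the current boot; vmware-0.log etc. are older ones
```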
 

JimPhreak

Active Member
Oct 10, 2013
Logs sound like the next logical place to check (I'd check CPU usage while still in esxtop, but I really doubt it's the problem). I've found for issues like this the best log to look in is also the one that's hardest to get at (or maybe there's an easier way that I haven't found yet) - from a shell session on the host (local or SSH) cd into /vmfs/volumes/<datastore-name>/<vm-name>/ and there you will find the log files for that specific VM; the current one (or most recent, if the VM is off) is named vmware.log. Note that <vm-name> in that path doesn't get renamed if you have renamed the VM (until you storage-vMotion it), so it's possible it won't match. Also, if you are using shared storage with multiple hosts, the current log file for a running VM will be locked by the host running it - your shell session must be on the same host if you want to read that file.
Can't I just access those logs directly through the datastore browser in vSphere Client?
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Technically you can, but the datastore browser doesn't let you follow a log file live the way a shell does. Nothing to stop you downloading it after the fact and reading it at your leisure, but I generally find it helps to view these things in realtime.

SSH into the host and open the log file with something like `tail -f /vmfs/volumes/somevol/somevm/vmware.log` and you'll get an almost-realtime log output in your terminal window (I think it refreshes every 1s by default); this can be a big help in spotting issues as they occur. VMware logs are hideous enough that examining them after the fact is a great way to convince yourself that you need to drown that pack of ibuprofen in cheap gin.
 

whitey

Moderator
Jun 30, 2014
I can't tell you how many times this doc has methodically and logically resolved issues for me.

http://communities.vmware.com/servl...952/vsphere41-performance-troubleshooting.pdf

Hope that link works; if not, just google 'vSphere Performance Troubleshooting Guide' and focus in on the flow chart (starting on page 15) for whatever resource you feel, or find evidence of, being constrained. I know the guide is a bit old (it focuses on vSphere 4.1) but I could not find an updated one, and the troubleshooting methodology is still spot on.

These may be useful as well...kiss your day goodbye. Good stuff w/in, I GUARANTEE it!

http://pubs.vmware.com/vsphere-51/t...er-server-51-monitoring-performance-guide.pdf

https://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.1.pdf

VMware KB: Troubleshooting ESX/ESXi virtual machine performance issues
 