Needle in a Hay Stack - NFS R/W Latency ESXi


TechIsCool

So I have been trying to work out what has been causing the latency in my system. I have run into this bug before.

<Before>

I foolishly decided to upgrade from ESXi 5.1U2 to ESXi 6.0 before Veeam supported it. I saw no issues while running on 6.0 and it seemed stable, but since I needed backups I rolled back to 5.5 latest and figured I was good to go. NFS immediately started having issues where the latency was through the roof. We are talking 10-15 second spikes, and sometimes the NFS datastore just falls off the ESXi server. After looking at what I thought caused the problem I rolled back to ESXi 5.1U2 and everything seemed to stabilize back to normal.

</Before>

Three weeks ago I rolled up to 6.0, since it was fully supported by Veeam and it seemed like the thing to do, given it would jump me past the problem child, ESXi 5.5 latest. Everything was beautiful for about two and a half weeks, and then things came right back.

I am now back on ESXi 5.1U2 and still seeing a slowdown, though a smaller one than on ESXi 6.0.

I believe it all relates to my underlying ZFS Storage.
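For reference, this is roughly how I've been watching the latency from the ESXi side (the IP is just an example, substitute your storage VM's address):

```shell
# On the ESXi host (SSH enabled): confirm the NFS datastore is still mounted
esxcli storage nfs list

# Interactive latency view: run esxtop, press 'u' for the disk device view.
# GAVG/cmd is the latency the guests actually see per device/datastore.
esxtop

# Basic network sanity check from the vmkernel to the storage VM
vmkping -c 5 192.168.1.50
```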

System Specs:
SuperMicro X9DRH-7TF
80 GB of memory
2x E5-2630

The storage VM is running:
SunOS Prometheus 5.11 omnios-b281e50 i86pc i386 i86pc
napp-it 0.9f5

24 GB of memory and 4 cores given to it.
Passthrough of an LSI 2008 HBA and a 6-port Clarkdale onboard controller.


  pool: data
 state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: scrub repaired 0 in 9h37m with 0 errors on Thu May 21 19:19:19 2015
config:

        NAME                       STATE     READ WRITE CKSUM   CAP       Product /napp-it  IOstat mess
        data                       ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c4t5000CCA228C0C183d0  ONLINE       0     0     0   3 TB      Hitachi HDS5C303  S:0 H:0 T:0
            c4t5000CCA228C0D2E4d0  ONLINE       0     0     0   3 TB      Hitachi HDS5C303  S:0 H:0 T:0
            c4t5000CCA228C120FEd0  ONLINE       0     0     0   3 TB      Hitachi HDS5C303  S:0 H:0 T:0
            c4t5000CCA228C12DE3d0  ONLINE       0     0     0   3 TB      Hitachi HDS5C303  S:0 H:0 T:0
        logs
          c3t1d0                   ONLINE       0     0     0   256.1 GB  Samsung SSD 840   S:0 H:0 T:0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun May 24 13:26:20 2015
config:

        NAME        STATE     READ WRITE CKSUM   CAP       Product /napp-it  IOstat mess
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0   17.2 GB   Virtual disk      S:0 H:0 T:0

errors: No known data errors

  pool: vmware
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 5h29m with 0 errors on Thu May 21 03:43:57 2015
config:

        NAME                       STATE     READ WRITE CKSUM   CAP       Product /napp-it  IOstat mess
        vmware                     ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c4t50014EE25EFAD895d0  ONLINE       0     0     0   1 TB      WDC WD10EFRX-68P  S:0 H:0 T:0
            c5t1d0                 ONLINE       0     0     0   750.2 GB  ST3750640AS       S:0 H:0 T:0
          mirror-1                 ONLINE       0     0     0
            c6t2d0                 ONLINE       0     0     0   750.2 GB  ST3750640AS       S:0 H:0 T:0
            c7t3d0                 ONLINE       0     0     0   750.2 GB  ST3750640AS       S:0 H:0 T:0

errors: No known data errors

These are also in the system. I did try for a while to use them as a ZIL and L2ARC, but that seemed to just cause more problems, most likely due to memory limitations.

c3t0d0 (!parted) via dd ok 256.1 GB S:0 H:0 T:0 ATA Samsung SSD 840 S1ATNEAD525980K
c3t1d0 (!parted) via dd ok 256.1 GB S:0 H:0 T:0 ATA Samsung SSD 840 S1ATNEAD525981V
c3t4d0 (!parted) via dd ok 256.1 GB S:0 H:0 T:0 ATA Samsung SSD 840 S1ATNEAD525982H
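This is roughly what I did when experimenting with them (the zpool commands are a sketch using the device names above; the header size and record size in the arithmetic are rough assumptions, not measured values):

```shell
# Sketch of the ZIL/L2ARC experiment (commented out; don't run blindly):
# zpool add data log mirror c3t0d0 c3t1d0   # mirrored SLOG
# zpool add data cache c3t4d0               # L2ARC device
# zpool remove data c3t4d0                  # backing the cache device out again

# Back-of-envelope for why L2ARC can hurt with limited RAM: every record
# cached on the SSD needs an in-ARC header. Assuming ~70 bytes per header
# and 8K records, a full 256 GB L2ARC costs roughly:
l2arc_bytes=$((256 * 1024 * 1024 * 1024))   # one 256 GB SSD, rounded to GiB
recordsize=$((8 * 1024))
header_bytes=70
echo $(( l2arc_bytes / recordsize * header_bytes / 1024 / 1024 ))  # MiB of ARC consumed
```

With only 24 GB given to the storage VM, a couple of GB of ARC lost to L2ARC headers is a real dent, which fits what I saw.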

This is what I would classify as laggy for 5.5 latest:

[screenshot: latency on 5.5 latest]

This is what I am now seeing on 5.1U2 as well as 6.0:

[screenshot: latency on 5.1U2 and 6.0]
I wish I had a single screenshot of what things actually looked like when they were stable, before I started messing around with it.


What I can't figure out is what changed. I know I added a VM, but looking over the whole system I don't think that is the cause of the problem, since turning it off changes nothing.
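For what it's worth, this is how I'm checking whether that VM matters, watching from the storage side while toggling it on and off (a sketch; interval syntax per the Solaris-style tools on OmniOS):

```shell
# Per-vdev latency and throughput on the vmware pool, sampled every 5 seconds
zpool iostat -v vmware 5

# NFS server-side op counts every 5 seconds (Solaris nfsstat takes an interval)
nfsstat -s 5

# Sync writes are a common NFS-on-ESXi latency culprit; check the current setting
zfs get sync vmware
```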

I am open to any suggestions or thoughts.