Haystack Sorted Multiple Needles Found - NFS R/W Latency ESXi

TechIsCool

Active Member
Feb 8, 2012
Clinton, WA
techiscool.com
This originally started at this thread.
https://forums.servethehome.com/index.php?threads/needle-in-a-hay-stack-nfs-r-w-latency-esxi.5847/

So I have been working on resolving latency issues on our ESXi host, which is being affected by our ZFS host. This is a weird issue where I think multiple things were contributing, not just one. A little history is required before I actually describe the problem, because I think it's relevant.


Note: These are the best images I have, since I did not think to take a screenshot beforehand. They do not show the highest latency, which was upwards of 100+ seconds per request.

March 12, 2015 ESXi 6.0 is officially released. I jumped the gun and upgraded before Veeam supported it. Since I had already upgraded, I figured I would wait a few hours and see if any issues appeared. During this time it seemed stable, but because I needed Veeam I rolled back to 5.5 Express Patch 6 and figured I was good to go. Boy, was I wrong. NFS immediately started having issues where latency went through the roof: 15 seconds per request, sometimes spiking to 100+ seconds. At a certain point ESXi just could not wait anymore and flushed the requests; normally this occurred after ESXi showed the datastore as disconnected. During this experience the only variable was that I had upgraded to 5.5 EP6 and it was breaking NFS, and I resolved the issue by downgrading to ESXi 5.1 U2.


April 29, 2015 I decided to upgrade ESXi, since Veeam now supported 6.0. I patched Veeam to 8 Patch 2 and then upgraded ESXi. Everything went smoothly and I only needed a reboot. There were no changes in latency, I/O, or throughput, so everything seemed to be functioning correctly.


May 6, 2015 I decided to add two more VMs to the host. Everything ran stable; I added the VMs to the backups and the full images completed successfully.


May 25, 2015 On Sunday night I started experiencing something I would normally consider a latency storm, or amplified latency. Veeam seemed to bring this to the forefront because it hits ZFS harder than my normal workload. The problem: when Veeam ran a backup job, it spiked the read and write latency on the ESXi host upwards of 20 seconds per I/O request. Only a day before, this was not occurring and Veeam ran fine.




System Specifications for those who are curious


Super Micro 2U (SC933T-R760B)
Super Micro X9DRH-7TF
2GB Thumb drive for ESXi Boot Drive
M4 64GB SSD for ESXi Local datastore
80GB of Memory
6x HYNIX 8GB DDR3-1600 (HMT31GR7CFR4C-PB)
2x HYNIX 16GB DDR3-1600 (HMT42GR7MFR4C-PB)
2x E5-2630
1x M1015 (IT)
RAID-Z (data)
┌ Hitachi 3TB 5.4k (Hitachi HDS5C303)
├ Hitachi 3TB 5.4k (Hitachi HDS5C303)
├ Hitachi 3TB 5.4k (Hitachi HDS5C303)
└ Hitachi 3TB 5.4k (Hitachi HDS5C303)
Mirrored (vmware)
┌ Seagate 750GB 7.2k (ST3750640AS)
└ Seagate 750GB 7.2k (ST3750640AS)
┌ Seagate 750GB 7.2k (ST3750640AS)
└ WD Red 1TB 7.2k (WD10EFRX)
SLOG Samsung 840 Pro 256GB (MZ-7KE256BW)
Mirrored (ssd)
┌ Samsung 840 Pro 256GB (MZ-7KE256BW)
└ Samsung 840 Pro 256GB (MZ-7KE256BW)

The Virtual Machine Hosting ZFS

OmniOS r151006 with napp-it as the front end.
Allocated 24GB Memory
Allocated CPU 1 Processor with 4 Threads

May 26-28, 2015 The system was laggy but usable. Things seemed to drag, but other real-world problems got in my way before I could fix it.

May 28, 2015 I figured it was something I had changed causing this issue, so starting with that assumption I methodically rolled the system backwards, hoping to find a resolution to the problem. Below is the list of steps I tried before finding the cause.


+ Turn off some VMs
Latency stayed the same with a single VM or all VMs on

+ ESXi 6.0 -> ESXi 6.0 Latest
Suggested by Veeam due to the Changed Block Tracking bug

+ Reset Changed Block Tracking on all VMs
Using the VMware PowerCLI Script
Sign in to Veeam

+ ESXi 6.0 -> ESXi 5.1U2
Had to re-import all virtual machine VMX files
Read/write latency dropped by 10-20%, but was still 5+ seconds

+ Veeam: remove and recreate all jobs
Just trying to get good backups before changing ZFS.
Recreated mostly because the VMs had their IDs changed
Throughput was a pitiful 300Kb/s using Hot Add
Latency stayed the same, 10+ seconds

+ Move Veeam to the original host and roll back to the Veeam 8 launch build
This host is limited to 1GbE, but in the end it still caused the same high latency.
Throughput was a pitiful 125Kb/s using Network mode

+ Rebuild the ZFS OmniOS host from the ground up
Checking whether a kernel bug or something else existed
Set the IP addresses the same and installed napp-it
Imported the disks and served them to ESXi
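For reference, re-importing the pool and re-exporting it over NFS boils down to a couple of commands on OmniOS. This is a sketch printed dry-run style rather than executed; the pool name "vmware" is from this post, but the client subnet is a placeholder assumption:

```shell
# Dry-run: print the import/re-share sequence instead of executing it.
# Swap `echo "$@"` for `eval "$@"` to run the commands for real.
run() { echo "$@"; }

run zpool import -f vmware                 # force-import the pool on the rebuilt host
# 10.0.0.0/24 is a placeholder for the ESXi storage network
run zfs set sharenfs='rw=@10.0.0.0/24,root=@10.0.0.0/24' vmware
run zfs get -H -o value sharenfs vmware    # verify the share property stuck
```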

Edit: After finding the problems I placed the pool back on the original host; it had the same issues as before, but after doing an OS upgrade the problems were resolved.

After watching esxtop for a while you could see a pattern where a VM would try to write and then become a noisy neighbor. I removed the SLOG and tried "zfs set sync=disabled vmware"; writes started resolving in 3-10ms response times. But when I ran a Veeam job it spiked back up to 5+ seconds of read latency, so I knew the problem was not just a write issue. I turned sync=standard back on and it settled back to around 50ms, with spikes into the 1s range over a 15-minute period.
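A quick way to reproduce that kind of sync-write comparison from the client side is to time small synchronous writes with GNU dd, toggling `zfs set sync=disabled|standard vmware` between runs. This is a rough sketch, not the exact test I ran; the target path is a placeholder (point it at a file on the NFS-mounted datastore to exercise ZFS):

```shell
# Time 256 x 4KiB synchronous writes and report the average per-write latency.
# Requires GNU dd and date (oflag=dsync, %N nanoseconds).
target=/tmp/synctest.bin    # placeholder; use a file on the NFS datastore for real
count=256
start=$(date +%s%N)
dd if=/dev/zero of="$target" bs=4k count=$count oflag=dsync 2>/dev/null
end=$(date +%s%N)
avg_us=$(( (end - start) / 1000 / count ))
echo "average sync write latency: ${avg_us} us"
rm -f "$target"
```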

At this point I shut down all VMs and rebooted everything. After the ESXi host and ZFS were back up I could run a Veeam backup and was seeing about 18MB/s from disk and around 250ms of write latency with minimal spikes. I grabbed a copy of all VMs so that I had a point in time, since it had been 3 days since I actually had a good backup.

At this point I was going on the assumption that the software issue was fixed and there still had to be a hardware issue. I placed all three 840 Pros in a RAID-Z and experienced horrible latency and 350Kb/s of throughput. Something was not right; I should have been seeing at least 50MB/s.

+ Check SMART data on all drives
Returned no warning signs of disk failure
All SSDs have yet to experience a single bad block

Devices:
SSD1: used as a portable device
SSD2: L2ARC for a while, before I read about the memory requirements (removed around the end of 2014)
SSD3: SLOG drive for the vmware pool; the SLOG was provisioned at the full size of the SSD, 256GB

As you can see, each seems to have different properties that you normally would not see. After talking with some people in the freenode #zfs channel, someone asked what firmware they were running. After checking, I found I was not running the latest firmware.

+ Firmware Flash 840 Pro from DXM05B0Q -> DXM06B0Q
I saw no difference between firmware versions.
All disks reacted the same as in the two pictures above

+ Secure Erased all 840 Pros
All drives returned to the performance of ssd1 graph

Note: Because the BIOS would not allow the drives to be un-frozen, I had to flash them in a desktop. To unfreeze an SSD, the drive has to be power cycled after the OS has booted. This applies both to firmware flashing and to configuring an HPA. (The reason for this is so that a rootkit/virus cannot brick your device.) But some computers unfreeze drives with an S3 sleep cycle, which makes the freeze basically useless.

This is where it gets interesting, because all three drives came back to being responsive, which I find really interesting. You can see that over-provisioning a drive is required for basically all use cases. I knew this, but thought that since the SLOG uses about 5GB max and L2ARC might use 50GB it would be fine. A bad assumption on my part.
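For context on why ~5GB is plenty for a SLOG: it only has to hold the last couple of transaction groups of synchronous writes (a txg commits every few seconds by default), so the required size scales with ingest rate, not pool size. A back-of-the-envelope calculation under assumed numbers (1GbE line rate, 5-second txg interval, two outstanding txgs; all illustrative, not measurements from this system):

```shell
# Back-of-envelope SLOG sizing: max ingest rate x seconds of outstanding txgs.
# All numbers are illustrative assumptions.
INGEST_MBPS=125         # ~1GbE line rate in MB/s
TXG_SECONDS=5           # typical txg commit interval on illumos-era ZFS
TXGS_HELD=2             # size for two outstanding transaction groups
SLOG_MB=$(( INGEST_MBPS * TXG_SECONDS * TXGS_HELD ))
echo "worst-case SLOG usage: ${SLOG_MB} MB (~$(( SLOG_MB / 1024 )) GB)"
```

Around 1.2GB at 1GbE, which is why the ~5GB figure above is already generous.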

+ Configure HPA on the SLOG Drive (ssd1)
SSD Over-provisioning using hdparm - Thomas-Krenn-Wiki
Set to 30GB after unfreezing the drive
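The HPA step above reduces to computing a visible-sector count and handing it to hdparm -N, roughly as the Thomas-Krenn wiki describes. A sketch of the arithmetic for the 30GB figure; the device name is a placeholder, and the destructive command is printed rather than executed (the drive must be secure-erased and unfrozen first):

```shell
# Compute the HPA setting that leaves a 30 GB visible area on a
# 512-byte-sector SSD. /dev/sdX is a placeholder device name.
VISIBLE_GB=30
SECTOR_BYTES=512
SECTORS=$(( VISIBLE_GB * 1024 * 1024 * 1024 / SECTOR_BYTES ))
echo "visible sectors: $SECTORS"
# -N pNNN makes the new max sector count persistent across power cycles
echo "hdparm -N p${SECTORS} /dev/sdX"
```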

Putting ssd1 back in as SLOG, everything starts running normally at 10-30ms latency, though VM I/O still shows some read latency. I assumed this was because the VMs had not fallen back into ARC yet, since the array just can't handle the load before the cache kicks in. Keeping all non-essential SSDs off, this kept the system stable even with Veeam running.

+ Migrate the write-intensive workload off the HDDs to SSDs
Completely resolved the issue.
As you can see from the graphs everything has returned to acceptable levels.

May 29, 2015 It’s all back up and stable.


 

TechIsCool

Conclusion: Multiple contributing factors caused this failure, all relating to I/O. If I had started by looking at this from a storage perspective instead of as a bug, I think I could have solved it quicker. The key points: I added the two VMs mentioned before, both with a higher write workload against the ZFS vmware datastore; Veeam taking snapshots created even more load in the environment; and OmniOS had a bug that caused read issues, which led me to assume reads and writes were both being impacted when in reality it was only writes. The root cause was the SLOG SSD not having enough TRIMed space, causing a high busy/wait time. Garbage collection ran on the disk but could not keep up with the inbound writes.

I would also like to give a shout out to the people at https://forums.servethehome.com/ and https://forums.freenas.org/ for having some great information on their sites about when and how to use ZFS in a smaller environment.

TLDR: A write-intensive workload was added to ESXi, and when Veeam took snapshots it increased the workload further. OmniOS had a bug causing reads to be slow. The SLOG SSD was not over-provisioned and started having issues with garbage collection: writes were filling the drive faster than the disk could clean up space, so it started telling ZFS to wait while garbage collection found space, grinding the whole array into the ground.


Tools used to diagnose this issue.
napp-it
esxtop in the vm view showing Read and Write Latency
vSphere Performance Tab
zpool status
iostat -x
iostat -Exn
zpool iostat vmware
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
CA
Truly an amazingly useful post. I must thank you for this as performance tuning is one of the things I enjoy doing, and this really opens up my eyes to a lot more options :)

Again. THANK YOU for spending the time to write this all.
 

T_Minus

"SLOG Samsung 840 Pro 256GB (MZ-7KE256BW)"

Any chance you tried another SSD for a SLOG before all this?

It seems to me that what you mentioned with the 840P and cleanup/trim/garbage collection is the one huge negative thing I've heard about the Samsung 840P drives (they fall flat on their face when the write cache is exhausted).

Interesting that a secure erase and OP solved it, but it does make sense: you essentially mitigate any firmware garbage-collection latency by giving the drive a ton more spare space to work with, allowing it to catch up during idle time. At least that's my take on the drive.

thoughts?
 

neo

Well-Known Member
Mar 18, 2015
Man, those 840s were huge sellers and offered great performance, but over time I see more and more people experiencing headaches with them. The one thing Intel seems to have done well compared to the competition is their SSD firmware.
 

T_Minus

Man, those 840s were huge sellers and offered great performance, but over time I see more and more people experiencing headaches with them. The one thing Intel seems to have done well compared to the competition is their SSD firmware.
Basically any heavy usage that exhausts the write cache for periods of time causes issues. At least this is what I've seen from my research, and some graphs (I forget which site did them).

The Samsung 830 was used successfully in servers too, with fewer issues IIRC. It's been a while; I still have a couple kicking around that I use for testing, like some older Intels :D
 

TechIsCool

With time I will know whether the over-provisioning actually worked correctly, but I think it will. I also thought about over-provisioning my mirrored pool, just so that I don't break those drives too. When the 840 Pros were purchased, they were bought new at a really good deal, which I think was $0.43 per GB at that point in time. Knowing what I know now, I would have bought used drives.

Thanks for the feedback I enjoyed writing it up.
 

wildchild

Active Member
Feb 4, 2014
There is a reason why people all over the net say to over-provision those Samsungs (830s and 840 Pros).
Awesome as L2ARC devices, but even then OP them to 70%.
I ran into the same issue myself; even after OPing them, I had to reboot the machine every 3 months or so.
I replaced the 840s with 3700s for SLOG and haven't had an issue since.
 

TechIsCool

There is a reason why people all over the net say to over-provision those Samsungs (830s and 840 Pros).
Awesome as L2ARC devices, but even then OP them to 70%.
I ran into the same issue myself; even after OPing them, I had to reboot the machine every 3 months or so.
I replaced the 840s with 3700s for SLOG and haven't had an issue since.
Now the real question is: how did you over-provision your drives?
HPA
Partitions
Refreservation
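The three options above reserve spare space in different places; a dry-run sketch for comparison (commands printed rather than executed, with placeholder device/dataset names):

```shell
# Dry-run: the three over-provisioning approaches from the question above.
# Device and dataset names are placeholders.
run() { echo "$@"; }   # swap for eval "$@" to execute for real

# 1) HPA: the drive itself hides sectors; invisible to the OS and ZFS
run hdparm -N p62914560 /dev/sdX        # expose ~30 GB of a 256 GB SSD

# 2) Partition: slice the disk smaller than its full size and hand only
#    that partition to ZFS (e.g. via format(1M) on illumos)

# 3) refreservation: reserve space inside the pool so it can never fill up
run zfs set refreservation=50G tank/reserved
```

One caveat worth noting: only the HPA-after-secure-erase route guarantees the controller knows the hidden space is free; a partition or refreservation only helps if the untouched space was never written (or was TRIMed).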
 

wildchild

Please make sure you perform a secure erase before creating the host protected area.
 

TechIsCool

So, to bump this thread and tell everyone what has been going on: I was still having issues with the other two 840 Pros in my SSD pool. They had been getting slower and were actually causing latency in my non-SSD array, which I found interesting. This has been a weird issue since it only seemed to happen on one VM when Veeam backed it up, some days but not others; on the days it happened, the data rate fell to about 1.8MBps. After digging a bit I started using some interesting commands to track it down.

I have been using this script to watch NFS latency, since it is at least as accurate as the VMware datastore view.

D-Trace IOPS
dtrace I/O monitoring (io,zfs,nfs) - Oracle Monitor

IOStat - Figure out if a single device is causing issues
Code:
# all devices, extended stats, refreshed every second
iostat -xn 1
# a single device (names as shown by iostat -En; c1t0d0 is an example)
iostat -xn c1t0d0 1
# per-pool statistics
zpool iostat -v vmware 1
IO Pattern - Random or Sequential Writes for data
Disk IO:iopattern - DTraceBook

ZFSTOP - Basic Overview of Read Writes
zettatools/zfstop at master · jkjuopperi/zettatools · GitHub

So after using these I pinned down a 60-70% busy/wait time on the SSDs, even with the VMs off and a simple test script writing only a few KB a second.
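That busy figure comes from the %b column of iostat -xn. A small awk sketch against a canned sample (illumos column layout assumed; on a live box you would pipe `iostat -xn 1` in instead) shows how to flag saturated devices:

```shell
# Flag devices whose %b (percent-busy) column exceeds a threshold.
# The sample mimics illumos `iostat -xn` output; device names are examples.
sample='    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.3   12.1    1.2  431.0  0.0  0.2    0.0    1.8   0   5 c1t0d0
    0.5  120.3    2.1 5432.1  0.0  8.9    0.0   70.2   0  68 c1t1d0'
printf '%s\n' "$sample" |
  awk 'NR > 1 && $10 > 50 { print $11 " is " $10 "% busy" }'
```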

What I did was zfs send the data to my trusty spinning-rust drives, then pulled the SSDs, securely wiped them, and set up an HPA just like on the SLOG drive. After pushing the data back onto them I ran Veeam and have been consistently getting 280+MBps, so I know this was the issue.

Only time will tell whether this becomes an issue again and whether the same assumption holds. Garbage collection is still causing issues, and my next SSD purchase for ZFS will not be consumer grade.
 