This originally started in this thread:
https://forums.servethehome.com/index.php?threads/needle-in-a-hay-stack-nfs-r-w-latency-esxi.5847/
So I have been working on resolving latency issues on our ESXi host, which is being affected by our ZFS host. This is a weird issue where I think multiple things were contributing to it, not just one. A little history is required before I actually describe the problem, because I think it's relevant.
Note: These are the best images I have, since I did not think to take a picture earlier. They do not show the highest latency, which was upwards of 100+ seconds.
March 12, 2015 ESXi 6.0 is officially released. I jumped the gun and upgraded before Veeam supported it. Since I had already upgraded, I figured I would wait a few hours and see if any issues showed up. During this time it seemed stable, but because I needed Veeam I rolled back to 5.5 Express Patch 6 and figured I was good to go. Boy, was I wrong. NFS immediately started having issues where latency went through the roof: 15 seconds per request, sometimes spiking to 100+ seconds. At a certain point ESXi just could not wait anymore and flushed the requests; normally this happened after ESXi showed the datastore as disconnected. At that point the only variable was the upgrade to 5.5 EP6 breaking NFS, and I resolved it by downgrading to ESXi 5.1 U2.
April 29, 2015 I decided to upgrade ESXi since Veeam now supported 6.0. I patched Veeam to 8 Patch 2 and then upgraded ESXi. Everything went smoothly and I only needed a reboot. There were no changes in latency, I/O, or throughput, so it seemed like everything was functioning correctly.
May 6, 2015 I decided to add two more VMs to the host. Everything ran stable; I added the new VMs to the backups and the full images completed successfully.
May 25, 2015 On Sunday night I started experiencing something I would normally consider a latency storm or amplified latency. Veeam seemed to bring this to the forefront because it hits ZFS harder than my normal workload. The problem: when Veeam ran a backup job, it spiked the read and write latency on the ESXi host upwards of 20 seconds per I/O request. Only a day before this was not occurring and Veeam ran fine.
System Specifications for those who are curious
Super Micro 2U (SC933T-R760B)
Super Micro X9DRH-7TF
2GB Thumb drive for ESXi Boot Drive
M4 64GB SSD for ESXi Local datastore
80GB of Memory
6x HYNIX 8GB DDR3-1600 (HMT31GR7CFR4C-PB)
2x HYNIX 16GB DDR3-1600 (HMT42GR7MFR4C-PB)
2x E5-2630
1x M1015 (IT)
RAID-Z (data)
┌ Hitachi 3TB 5.4k (Hitachi HDS5C303)
├ Hitachi 3TB 5.4k (Hitachi HDS5C303)
├ Hitachi 3TB 5.4k (Hitachi HDS5C303)
└ Hitachi 3TB 5.4k (Hitachi HDS5C303)
Mirrored (vmware)
┌ Seagate 750GB 7.2k (ST3750640AS)
└ Seagate 750GB 7.2k (ST3750640AS)
┌ Seagate 750GB 7.2k (ST3750640AS)
└ WD Red 1TB 7.2k (WD10EFRX)
SLOG Samsung 840 Pro 256GB (MZ-7KE256BW)
Mirrored (ssd)
┌ Samsung 840 Pro 256GB (MZ-7KE256BW)
└ Samsung 840 Pro 256GB (MZ-7KE256BW)
The Virtual Machine Hosting ZFS
OmniOS r151006 with napp-it as the front end.
Allocated 24GB Memory
Allocated CPU 1 Processor with 4 Threads
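For reference, recreating a pool layout like the one above would look roughly like this (the device names are placeholders, not the actual ones on this system):

    # RAID-Z pool across the four 3TB Hitachi drives
    zpool create data raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
    # Two mirrored vdevs for VM storage, plus an 840 Pro as a separate log (SLOG) device
    zpool create vmware mirror c1t4d0 c1t5d0 mirror c1t6d0 c1t7d0 log c1t8d0
    # Mirrored SSD pool
    zpool create ssd mirror c1t9d0 c1t10d0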
May 26-28, 2015 The system was laggy but usable. Things seemed to drag, but other real-world problems got in my way before I could fix it.
May 28, 2015 I figured something I had changed was causing this issue, so starting from that assumption I methodically rolled the system backwards, hoping to find a resolution. Below is the list of steps I tried before finding the cause.
+ Turn off some VMs
Latency stayed the same with a single VM or all VMs on
+ ESXi 6.0 -> ESXi 6.0 Latest
Suggested by Veeam due to the Change Block Tracking Bug
+ Reset Change Block Tracking on All VMs
Using the VMware PowerCLI Script
+ ESXi 6.0 -> ESXi 5.1U2
Had to re-import all virtual machine vmx files
Read/write latency dropped by 10-20% but was still 5+ seconds
+ Veeam Remove and Recreate all Jobs
Just trying to get good backups before changing ZFS.
Recreated mostly because the VMs had their IDs changed
Throughput was a pitiful 300Kb/s using Hot Add
Latency stayed the same at 10+ seconds
+ Move Veeam to the original host and roll back to the Veeam 8 launch code
This host is limited to 1GbE, but in the end it still showed the same high latency.
Throughput was a pitiful 125Kb/s using Network mode
+ Rebuild the ZFS OmniOS host from the ground up
Checking whether a kernel bug or something else existed
Set the IP addresses the same and installed napp-it
Imported the disks and served them to ESXi
Edit: After finding the problems I placed the pool back on the original host and it had the same issues as before, but after doing an OS upgrade the problems were resolved.
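The rebuild itself is mostly a matter of exporting the pools from the old OmniOS VM and importing them on the fresh one; a rough sketch (the dataset name below is an example, not necessarily the exact one used):

    # On the old OmniOS VM
    zpool export vmware
    # On the rebuilt OmniOS VM (same IP, napp-it installed)
    zpool import vmware
    # Re-share the VM filesystem to ESXi over NFS
    zfs set sharenfs=on vmware/nfs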
After watching esxtop for a while you could see a pattern where a VM would try to write and then become a noisy neighbor. I removed the SLOG and tried "zfs set sync=disabled vmware", and writes started resolving in 3-10ms response times. But when I ran a Veeam job it spiked back up to 5+ seconds of read latency, so I knew it was not just a write issue. I turned sync=standard back on and it went back to holding around 50ms, with spikes into the 1-second range over a 15-minute period.
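For anyone following along, the sync toggling above is just the standard ZFS property, and log devices can be removed live (the log device name is a placeholder):

    # Remove the SLOG from the pool
    zpool remove vmware c1t8d0
    # Check the current sync behaviour on the vmware pool
    zfs get sync vmware
    # Disable sync writes for testing only -- risks data loss on power failure
    zfs set sync=disabled vmware
    # Put it back to the default afterwards
    zfs set sync=standard vmware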
At this point I shut down all VMs and rebooted everything. After the ESXi host and ZFS were back up I could run a Veeam backup and was seeing about 18MB/s from disk and around 250ms of write latency with minimal spikes. I grabbed a copy of all VMs so that I had a point-in-time backup, since it had been 3 days since I actually had a good one.
At this point I was going on the assumption that the software issue was fixed and there still had to be a hardware issue. I placed all three 840 Pros in a RAID-Z and experienced horrible latency and 350Kb/s of throughput on it. Something was not right; I should have been seeing at least 50MB/s.
+ Check Smart Data on All Drives
Returned no warning signs of Disk Failure
All SSDs have yet to experience a single bad block
Devices:
SSD1: used as a portable device
SSD2: L2ARC for a while before I read about the memory requirements (removed around the end of 2014)
SSD3: SLOG drive for the vmware pool; the SLOG was provisioned at the full size of the SSD, 256GB
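For anyone repeating this, the SMART check comes down to something like smartctl from smartmontools (the device name is a placeholder):

    # Overall health self-assessment
    smartctl -H /dev/sda
    # Full attribute dump -- watch Reallocated_Sector_Ct and Wear_Leveling_Count on the 840 Pros
    smartctl -a /dev/sda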
As you can see, each seems to have different properties that you would not normally see. After talking with some people in the freenode #zfs channel, someone asked what firmware the drives were running. After checking, I found I was not running the latest firmware.
+ Firmware Flash 840 Pro from DXM05B0Q -> DXM06B0Q
I saw no difference between firmware versions.
All disks reacted the same as in the two pictures above
+ Secure Erased all 840 Pros
All drives returned to the performance of ssd1 graph
Note: Because the BIOS would not allow the drives to be unfrozen, I had to flash them in a desktop. To unfreeze an SSD, the drive has to be power cycled after the OS is booted. This applies to both firmware flashing and configuring the HPA. (The reason for this is so that a rootkit/virus cannot brick your device.) But some computers unfreeze drives with an S3 sleep cycle, which makes the protection basically useless.
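One way to do the secure erase is with the ATA security commands via hdparm; a rough sketch, with a placeholder password and device name:

    # Confirm the drive is no longer frozen after the power cycle
    hdparm -I /dev/sdb | grep -i frozen
    # Set a temporary security password (required before an erase)
    hdparm --user-master u --security-set-pass pwd /dev/sdb
    # Issue the ATA Secure Erase, which resets all the flash cells
    hdparm --user-master u --security-erase pwd /dev/sdb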
This is where it gets interesting, because all three drives came back to being responsive. You can see that leaving spare area on a drive (over-provisioning) is required for basically every use case. I knew this, but thought that since a SLOG uses about 5GB max and L2ARC might use 50GB it would be fine. A bad assumption on my part.
+ Configure HPA on the SLOG Drive (ssd1)
SSD Over-provisioning using hdparm - Thomas-Krenn-Wiki
Set the visible capacity to 30GB after unfreezing the drive
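Following that article, setting the HPA comes down to hdparm -N; a sketch of what roughly 30GB visible looks like on the 256GB drive (the sector count is approximate and the device name is a placeholder):

    # Show the current and native max sector counts
    hdparm -N /dev/sdb
    # Permanently (the 'p' prefix) limit the visible size to ~30GiB of 512-byte sectors
    hdparm -Np62914560 --yes-i-know-what-i-am-doing /dev/sdb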
Putting ssd1 back in as the SLOG, everything starts running normally at 10-30ms latency. VM I/O still shows some read latency, but I assumed this was because the working set had not fallen back into ARC yet and the array just can't handle the load before the cache kicks in. With all non-essential SSDs kept out of service, the system stayed stable even with Veeam running.
+ Migrate the write-intensive workload off the HDDs to SSDs
Completely resolved the issue.
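At the ZFS level that migration can be as simple as a snapshot plus send/receive into the ssd pool (the dataset names are examples); in practice, a Storage vMotion between the NFS datastores accomplishes the same thing for individual VMs.

    # Snapshot the busy dataset on the HDD-backed pool
    zfs snapshot vmware/sqlvm@migrate
    # Copy it to the SSD pool, then point the consumers at the new location
    zfs send vmware/sqlvm@migrate | zfs receive ssd/sqlvm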
As you can see from the graphs everything has returned to acceptable levels.
May 29, 2015 It’s all back up and stable.