NFS -> VMware latency data (please)


james23

Active Member
Nov 18, 2014
Please post some of your VMware -> SAN latency graphs (ideally with related IOPS / disk load; Proxmox data will work too). In my case I'm using VMware -> TrueNAS, so I'm looking for any sync-related loads that also measure latency. I'm posting some of mine below. (Honestly, I'd appreciate any kind of TrueNAS-related pool latency graphs/data.) NB: my TrueNAS is not virtualized, it's physical (specs below in the spoiler).

Why:
I've been trying to troubleshoot TrueNAS (or ZFS) cross-pool latency issues, mostly around NFS: on VMware -> TrueNAS NFS datastores, if an HDD-backed pool is getting hit hard, it causes my other NVMe or SSD pools to spike to unreasonably high latency from the standpoint of the VMware hosts. (All pools are Optane P900 SLOG backed.)

For example, every night I have 2x NVR (video) VMs that move about 80GB of video from SSD pools to HDD pools (via NFS-backed VMware disks). The latency spikes have forced me to move some of the VMs' OS boot disks to host direct-attached storage, as they were causing some services to fail or some OSes to reboot due to I/O timeouts.
(pool info at bottom)

The 2x VM hosts have 25GbE networking; TrueNAS (physical) has 100GbE networking (I'm aware 10GbE is more than enough). I do have a decent/high amount of IOPS load from VMware -> TrueNAS, so perhaps what I'm seeing is to be expected; I just need some comparisons or points of reference from the community. Below are my own data points, and what I'm hoping others will post so I can compare (and I'm happy to answer any questions, even unrelated ones).

I have done a crazy amount of troubleshooting and testing, and this has been an issue across two entirely different sets of TrueNAS hardware over the years, so:
There is a decent chance that I'm just stressing this SAN/ZFS system and what I'm seeing is normal/expected with ZFS (and is why businesses/enterprises pay so much in recurring costs for SANs or something like TrueNAS Enterprise). But this is why I'm hoping to see latency data from others.

I've covered all the low-hanging fruit (example commands below), like:
- watching gstat to be sure it's not disk / HBA saturation (same with system load average)
- networking: separate VLANs for NFS / vMotion, jumbo frames (verified end to end with don't-fragment pings), iperf showing 10Gbit+ in both directions
- ZFS: a good SLOG, SAS3 SSDs in mirrors, a 4-drive NVMe mirror
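For context, these are roughly the checks I mean (just a sketch; the 192.168.10.20 address stands in for my NFS VLAN address, not my real addressing):

# per-disk busy %, queue depth and latency on the TrueNAS box (FreeBSD)
gstat -p

# jumbo frames verified end to end with the don't-fragment bit set
# (8972 bytes = 9000 MTU minus IP/ICMP headers)
ping -D -s 8972 192.168.10.20          # from TrueNAS / FreeBSD
vmkping -d -s 8972 192.168.10.20       # from an ESXi host

# throughput in both directions
iperf3 -c 192.168.10.20
iperf3 -c 192.168.10.20 -R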

thank you for your time

Pool / TrueNAS system info (screenshots showing the load / latency spikes at the bottom):
OS: FreeNAS 13.0-U6 (stable); boot volume is a mirror of 60GB Intel 520 SSDs

MB: Supermicro H11SSL-NC (AMD)
CPU: AMD EPYC 7262 (8 cores)
RAM: 8x 32GB DDR4 ECC (256GB total)

HBAs: 2x LSI SAS3008 SAS3 HBAs to Supermicro SAS3 expander backplanes
NIC: 100GbE to a Netgear M4500 switch
SLOG: 280GB Optane P900

DISKS / POOLS:
16x 16TB HGST SAS 7200 RPM HDDs (2x RAIDZ2 vdevs, 8 disks each)
10x 1.6TB HGST SAS3 SSDs (5-way mirror)
4x 1.6TB Intel P3605 NVMe (2-way mirror)

(6 screenshots: pool load and latency graphs)
 

pimposh

hardware pimp
Nov 19, 2022
Aren't you facing an NFS + ZFS noisy-neighbour issue here? Start with increasing the nfsd threads on TrueNAS, to see if the spikes get lower.
 

james23

Active Member
Nov 18, 2014
Aren't you facing an NFS + ZFS noisy-neighbour issue here? Start with increasing the nfsd threads on TrueNAS, to see if the spikes get lower.
Thank you! And yes, I think you're more on the mark than my original post; I'm facing an NFS + ZFS noisy-neighbor issue more than anything.

I've actually done a good bit of testing around the number of nfsd "servers" (as it's called in TrueNAS -> NFS service settings), and 16-24 (24 especially) has been optimal versus 4 (despite reading that this should be tied to the number of CPU cores you have; in my case that's 8 cores / 16 threads on the EPYC 7262).
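If anyone wants to check the same thing from the shell, the thread limits are also visible via sysctl on TrueNAS CORE / FreeBSD 13 (a sketch; the supported way to actually change the count is still the "Number of servers" box in the NFS service settings):

sysctl vfs.nfsd.minthreads    # minimum nfsd threads
sysctl vfs.nfsd.maxthreads    # maximum nfsd threads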

I now actually plan on setting up an iSCSI zvol / share to the VMware hosts to see how that performs (only for a few higher-IOPS VMs; I will keep NFS as well).

It's hard to see, but the 7-day graph below is how I've been comparing changes to the number of nfsd servers: I make the change, then let it run for 24 hours. The worst stretch (around 8/19/2025) was when I tried 4; all the others were either 8, 16, or 24 nfsd servers. I do compare / control for the number of write IOPS.

I'm surprised I still can't find anything on the web or forums about what kind of load / iowait / latency people see with a constant 4-5k write IOPS over NFS on ZFS (to VM hosts, sync writes). (I actually can't find this data at any write IOPS level.)
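If it helps anyone gather comparable numbers, esxtop in batch mode on an ESXi host will capture the DAVG/KAVG/GAVG device latency counters to a CSV (a sketch; the interval, count and datastore path are just examples):

# on an ESXi host: 10-second samples for 1 hour, dumped to CSV
esxtop -b -d 10 -n 360 > /vmfs/volumes/datastore1/esxtop-latency.csv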
thanks
(7-day latency graph screenshot)

EDIT: I should add that I've also started logging the output of the great `nfsstat -dW` built-in tool. Only over the past 2 days so far, but it gives great data:

(nfsstat -dW output screenshot)
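For the logging I'm just appending the output to a file; a timestamped snapshot loop of the plain server counters also works (a sketch; the log path is only an example):

# timestamped NFS server counters every 60s (example path)
while true; do
  date "+%Y-%m-%d %H:%M:%S" >> /var/log/nfsstat.log
  nfsstat -e -s >> /var/log/nfsstat.log
  sleep 60
done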
 

pimposh

hardware pimp
Nov 19, 2022
ZFS is designed for data integrity first, and it does a lot of expensive checksumming, journaling, and transactional operations.

Synchronous Writes & Global TXG Commit
Even though your pools are separate, ZFS has global subsystems like the transaction group (TXG) thread scheduler.
When a synchronous-heavy pool (e.g., HDD pool with video writes) is busy flushing TXGs or handling high write latency, it can block or delay the global commit thread that even SSD/NVMe pools must wait for.

ARC Lock Contention
ZFS's Adaptive Replacement Cache (ARC) is global, not per-pool. A big workload like the video writes from the NVRs can cause heavy ARC eviction and lock contention, starving other workloads.

Thread Pool Bottlenecks
Some of ZFS’s thread pools (e.g., spa_sync, TXG thread, ZIL writers) are shared or limited in concurrency, leading to bottlenecks during concurrent heavy writes—even across pools.

SLOG (even Optane) Won’t Save You Entirely
Optane as SLOG is great, but remember: it only helps the latency of synchronous writes; the flush to the main pool still has to happen eventually during TXG sync.
Also, metadata I/O (ZFS does a lot of it) still hits the main pool, and your HDD pool likely pays a high seek-time / queue-depth penalty.

Tune ZFS Threading and TXG
Increase zfs_txg_timeout to reduce how often TXG flushes globally.
Tune zfs_vdev_scheduler and other sysctls—though this is a rabbit hole.

Move NVR Workload Off to SMB/CIFS or Separate System
As long as you do not need the same POSIX semantics or synchronous integrity that NFS offers, use a separate SMB share for the video archive transfer; it bypasses NFS's fsync() hell.

Enable per-dataset sync=disabled just for the video ingest target, temporarily, to isolate how much extra load the ZIL is adding (quick example below).
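Roughly like this (a sketch; the dataset name is a placeholder, and on TrueNAS CORE the TXG interval is the FreeBSD sysctl rather than the Linux zfs_txg_timeout module parameter):

# stretch the TXG commit interval (default is 5 seconds)
sysctl vfs.zfs.txg.timeout=10

# take the ZIL out of the picture for the video ingest dataset only
zfs set sync=disabled tank_hdd/nvr_ingest
# and revert after testing
zfs set sync=standard tank_hdd/nvr_ingest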
 

Captain Lukey

Member
Jun 16, 2024
This should be so much faster, and the ZIL is not 100% overcommitted. On TrueNAS I assume (simple 101) you have atime off, no dedupe, and compression on..



(screenshot)
 

zachj

Active Member
Apr 17, 2019
Watching this… I've had the same problem for years but never troubleshot it because it isn't breaking anything in my scenario. A 16x 1TB SATA SSD array should be an order of magnitude faster than what I'm seeing.
 

james23

Active Member
Nov 18, 2014
Sorry for the late update. I got a bit frustrated with the NFS performance issues and have started experimenting with iSCSI (but I am still using NFS as well).
I will update this in a week or so with those results.

So far, the performance using iSCSI for the same 4 or so high-IOPS VMs has been a LOT better (with sync=always on the iSCSI zvol), but there are definite downsides, such as poorer space usage (NFS deals with highly compressible vDisks much better, and is easier to work with, since each VM gets a file you can see).
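For reference, the zvol side of it is nothing fancy (a sketch; the pool/zvol names, size and volblocksize are just examples of what I'd use - the extent and target themselves get set up in the TrueNAS iSCSI UI):

# sparse zvol to back a VMware iSCSI extent (names/size are examples)
zfs create -s -V 2T -o volblocksize=16K ssd_pool/vmware_iscsi

# keep the same write-safety semantics I had with sync NFS
zfs set sync=always ssd_pool/vmware_iscsi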

But so far it looks like the "solution" is to use both (and to increase storage capacity :(
I'm also going to upgrade my SLOG from the Optane P900 to a P5801X, so that may help too (but that's a few weeks away).
It would still be great if others could post their vSphere disk latency graphs for VM disks backed by NFS (ideally on ZFS).
thanks