TrueNAS Scale: lackluster performance over NFS/ESXi


oh2ftu

New Member
Aug 2, 2022
Hi,
I'm trying to set up a new shared storage server to replace our old QNAP TS-463-RP, which runs 16GB of RAM and two Samsung MZ7KH3T8HALS-00005 SSDs in RAID1.
The core switch is a Ubiquiti ES-16XG, and everything is connected over copper (quality cable, short run in the same rack).
I find the performance lacking for our ESXi host with a dozen or so VMs, mostly Linux, some Windows.

I got a DL360 Gen9 for free, installed the latest TrueNAS Scale, and upgraded it to the following specs:
- 1x E5-2640 v4 CPU (10c/20t)
- 8x 32GB RAM for a total of 256GB
- Emulex OneConnect OCe14102B-NT 10GBase-T NIC, again quality cable, short run in the same rack, 1x 10G uplink
- Storage:
-- P440ar in HBA mode
-- OS: 2x cheap SSDs
-- sas-pool to test: 2x ST900MM0006 10k rpm SAS drives (mirror)
-- ssd-pool to test: 1x Samsung 850 EVO 500GB

Tests were run with dd over NFS. The VM is running on the ESXi host, which has dual 10G uplinks to the core switch.

time dd if=/dev/zero of=/mnt/tn-stor-sas/testfile bs=16k count=256k
time dd if=/mnt/tn-stor-ssd/testfile of=/dev/null bs=16k

Both pools give about 300MB/s on reads. On writes, the QNAP is a bit slower (250MB/s) versus 370-380MB/s for either TrueNAS target, SSD or SAS.

I find the reads (at least for the SAS mirror) lacking. Shouldn't they be higher? Is this a TrueNAS thing, or should I be looking for something in the network?
iperf3 gives about 8.8Gbps between the VM and TrueNAS.

There's also a second TrueNAS box (a Dell T3620, IIRC) with a much lower-spec CPU and less RAM, running the same Samsung MZ7KH drives in a mirror; it gives the same performance as well.

Should I be getting some NVMe instead? I was kind of expecting to saturate the SATA SSD bandwidth at somewhere around 500MB/s.
 

dragonme

Active Member
Apr 12, 2016
I am not an expert, but here are some observations...

First: NFS forces sync writes, so each of those 16k-block writes you are issuing has to be sent and acknowledged as committed to stable storage before the next one goes out. ZFS lands sync writes in the ZIL first ("fast"), then writes them again to their final place in the pool, and that double handling slows things down considerably, provided that is the bottleneck, which, given the RAM, CPU and network specs, sounds likely.

To test, you can temporarily turn off sync on that dataset. Even in an all-in-one setup, where TrueNAS/ZFS runs on the same ESXi host and everything goes over virtual networking inside the same box, I leave sync on, as any VM crash etc. could result in corrupt writes.
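
A rough sketch of that test from the TrueNAS shell (the dataset name here is just a placeholder for whatever backs your NFS export):

# see what the dataset is currently set to (standard / always / disabled)
zfs get sync tank/nfs-vmware

# temporarily disable sync writes, for testing only
zfs set sync=disabled tank/nfs-vmware

# put it back when you're done benchmarking
zfs set sync=standard tank/nfs-vmware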

If turning sync off (i.e. sync=disabled) speeds things up, which is highly likely, then your options are: get faster storage with more IOPS, optimize your block sizes to make better use of the network, or add a SLOG device, which should be an enterprise power-loss-protected NVMe (or a mirror of such) to give the NFS sync writes a small, fast place to land. On a 10G interlink, I would think 20GB should be plenty.

Over-provision this device to give it some longevity, as EVERY sync write to the pool you attach the SLOG to will be written there first...
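
For reference, attaching a SLOG is a one-liner; a sketch with placeholder pool and device names:

# mirrored SLOG (safer: a lone SLOG that dies right when the box crashes can lose in-flight sync writes)
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# or a single device
zpool add tank log /dev/nvme0n1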

Also: NFSv3 or NFSv4? Depending on your workload, you could see performance differences between the two.
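
If you're not sure which version the ESXi host actually mounted, the stock esxcli queries should tell you (datastore names will obviously differ on your setup):

# NFSv3 datastores
esxcli storage nfs list

# NFS 4.1 datastores
esxcli storage nfs41 list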
 

nabsltd

Well-Known Member
Jan 26, 2022
or add a SLOG device, which should be an enterprise power-loss-protected NVMe (or a mirror of such) to give the NFS sync writes a small, fast place to land.
Remember that SLOG is not a cache. It can help if the average ingest through the network is slower than the speed of the underlying pool, but if the pool is slower than the network, and you are writing for long enough, eventually the speed over the network will be slowed to match the speed of the pool.
 

dragonme

Active Member
Apr 12, 2016
Remember that SLOG is not a cache. It can help if the average ingest through the network is slower than the speed of the underlying pool, but if the pool is slower than the network, and you are writing for long enough, eventually the speed over the network will be slowed to match the speed of the pool.
It IS a cache in the sense that it is a much faster place for the sync write to land and get ACKed, so the next sync write can be sent over NFS. An NVMe SLOG will be way more performant than the pool he has mentioned, or the 10G Ethernet. That is, wait for it, why Sun implemented the SLOG in the first place.
 

nabsltd

Well-Known Member
Jan 26, 2022
It IS a cache in the sense that it is a much faster place for the sync write to land and get ACKed, so the next sync write can be sent over NFS.
Except that it isn't a cache, in that data is (almost) never read from the SLOG (only on pool import after a crash). Data is written from RAM to the actual pool, and once data has been in RAM for 5 seconds (the default), it must be written to the pool. So, with a 10Gbps network, a SLOG only helps for the first 5-6GB of data. After that, sync writes are not returned as complete until the data waiting in RAM is written, and this is limited by the speed of the underlying pool.

If the SLOG were a true cache, a very large, very fast device would allow you to write at full network wire speed regardless of the speed of the underlying pool. Sure, even that would have its limit, but a 1.92TB NVMe would allow for 30-40 minutes of full-speed transfer before it finally slowed down (assuming a 10Gbps NIC and a pool that can write at 200MB/sec).
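
Rough back-of-envelope for those figures, assuming ~1.15GB/s of usable 10GbE payload, the 5-second default commit interval, and a pool that writes at 200MB/s:

# data buffered before pool speed takes over: ~1.15 GB/s x 5 s
echo "1.15 * 5" | bc                    # ~5.75 GB

# minutes to fill a 1.92TB device at the net fill rate (1.15 - 0.2 GB/s)
echo "1920 / (1.15 - 0.2) / 60" | bc    # ~33 minutes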
 

dragonme

Active Member
Apr 12, 2016
Except that it isn't a cache, in that data is (almost) never read from the SLOG (only on pool import after a crash). Data is written from RAM to the actual pool, and once data has been in RAM for 5 seconds (the default), it must be written to the pool. So, with a 10Gbps network, a SLOG only helps for the first 5-6GB of data. After that, sync writes are not returned as complete until the data waiting in RAM is written, and this is limited by the speed of the underlying pool.

If the SLOG were a true cache, a very large, very fast device would allow you to write at full network wire speed regardless of the speed of the underlying pool. Sure, even that would have its limit, but a 1.92TB NVMe would allow for 30-40 minutes of full-speed transfer before it finally slowed down (assuming a 10Gbps NIC and a pool that can write at 200MB/sec).
Now you are really showing you don't know of what you speak.

A SLOG is not a read cache and has nothing to do with read operations.

It replaces the on-disk ZIL as a separate write-intent log device that can acknowledge sync-write requests far faster than the typical write to the on-pool ZIL before committing the full write.

I am done here
 

oh2ftu

New Member
Aug 2, 2022
Thanks for the replies.
As I see it, I'd need a suitable PCIe NVMe drive (PM1725, P3600 or similar), or two, to get more performance.
Granted, more disks (I'm currently testing against a single SSD) would yield more performance, but not by much.

Also, looking into it, NFS on ZFS does not seem to support VAAI, which is a bummer. iSCSI should support it, but I have to test it.
This kinda sucks.
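
For what it's worth, once an iSCSI target is up you can check which VAAI primitives ESXi sees per device from the ESXi shell; NFS datastores only get the VAAI-NAS primitives if a vendor plugin (VIB) is installed:

# per-device VAAI primitive support (block/iSCSI devices)
esxcli storage core device vaai status get

# see whether any VAAI-NAS plugin is installed
esxcli software vib list | grep -i vaai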
 

nabsltd

Well-Known Member
Jan 26, 2022
A SLOG is not a read cache and has nothing to do with read operations.
Which is exactly what I said. The only time ZFS reads from the SLOG is when the pool is imported after an unclean shutdown. It does this to recover the ZIL that was sitting in RAM and was lost.

it replaces the on-disk ZIL as a separate write-intent log device
Again, exactly what I said.

that can acknowledge sync-write requests far faster than the typical write to the on-pool ZIL before committing the full write.
Yes, it does that, until ZFS requires the data to actually be written to the pool. At that point, ZFS won't respond back to a client until the in-memory copy of the ZIL has no data that is more than 5 seconds old. Once that happens, it doesn't matter how fast the SLOG is; the limiting factor once again becomes the speed of the pool disks.

As long as your overall average write speed is less than the underlying pool speed, a SLOG helps with an individual write returning more quickly than it otherwise would. But, once you are saturating the pool, that's the limit of transfer, and the SLOG doesn't help anymore.

One thing that a user can do to improve this is to increase the timeout that data can stay in RAM before it must be committed. This still just delays the point at which the underlying pool speed becomes the limiting factor, but for many workloads, it's enough to have 20-30 seconds of full-speed network burst instead of the default 5.
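
On TrueNAS Scale (Linux) that timeout is the OpenZFS zfs_txg_timeout module parameter; a sketch of what bumping it might look like, with the caveat that you also need enough RAM/dirty-data headroom to actually hold that many seconds of writes:

# current transaction group timeout in seconds (default is 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# let dirty data sit in RAM for up to 30 seconds before it must hit the pool
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout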
 

fohdeesha

Kaini Industries
Nov 20, 2016
Now you are really showing you don't know of what you speak.

A SLOG is not a read cache and has nothing to do with read operations.

It replaces the on-disk ZIL as a separate write-intent log device that can acknowledge sync-write requests far faster than the typical write to the on-pool ZIL before committing the full write.

I am done here
literally what he said lmao