TrueNAS Scale: lackluster performance over NFS/ESXi


oh2ftu

New Member
Aug 2, 2022
Hi,
I'm trying to set up a new shared storage server to replace our old QNAP TS-463-RP, which runs 16GB of RAM and two Samsung MZ7KH3T8HALS-00005 SSDs in RAID1.
The core switch is a Ubiquiti ES-16XG, and everything is connected over copper (quality cable, short run in the same rack).
I find the performance lacking for our ESXi host with a dozen or so VMs, mostly Linux, some Windows.

I got a DL360 Gen9 for free, installed the latest TrueNAS Scale, and upgraded it to the following specs:
- 1x E5-2640 v4 CPU (10c/20t)
- 8x 32GB RAM for a total of 256GB
- Emulex OneConnect OCe14102B-NT 10GBase-T NIC, again quality cable, short run in the same rack, 1x 10G uplink
- Storage:
-- P440ar in HBA mode
-- OS: 2x cheap SSDs
-- sas-pool to test: 2x ST900MM0006 10k rpm SAS drives (mirror)
-- ssd-pool to test: 1x Samsung 850 EVO 500GB

Tests were run with dd over NFS. The VM is running on the ESXi host, which has dual 10G uplinks to the core switch.

time dd if=/dev/zero of=/mnt/tn-stor-sas/testfile bs=16k count=256k
time dd if=/mnt/tn-stor-ssd/testfile of=/dev/null bs=16k

Both pools give about 300MB/s on reads. On writes, the QNAP is a bit slower (250MB/s) versus 370-380MB/s for either TrueNAS target, SSD or SAS.

I find the reads (at least for the SAS mirror) lacking. Shouldn't they be higher? Is this a TrueNAS thing, or should I be looking for something in the network?
iperf3 gives about 8.8Gbps between the VM and TrueNAS.

There's also a second TrueNAS box (a Dell T3620, IIRC) with a much lower-spec CPU and less RAM, running the same Samsung MZ7KH drives in a mirror; it gives the same performance as well.

Should I be getting some NVMe instead? I was kind of expecting to saturate the SATA SSD bandwidth at somewhere around 500MB/s.
 

dragonme

Active Member
Apr 12, 2016
I am not an expert, but here are some observations...

First: NFS forces sync writes, so each of those 16k-block writes you are issuing has to be sent and acknowledged as committed to stable storage before the next one goes out. ZFS lands sync writes in the ZIL first ("fast"), then writes them again to their final place in the pool, and that double handling slows things down considerably, provided that is the bottleneck, which, given the RAM, CPU and network specs, sounds likely.

To test, you can temporarily turn off sync on that dataset. Even in an all-in-one setup, where TrueNAS/ZFS runs on the same ESXi host and everything goes over virtual networking inside the same box, I leave sync on, as any VM crash etc. could result in corrupt writes.
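
A rough sketch of that test from the TrueNAS shell (the dataset name here is just a placeholder for whatever backs your NFS export):

# see what the dataset is currently set to (standard / always / disabled)
zfs get sync tank/nfs-vmware

# temporarily disable sync writes, for testing only
zfs set sync=disabled tank/nfs-vmware

# put it back when you're done benchmarking
zfs set sync=standard tank/nfs-vmware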

If turning sync off (i.e. sync=disabled) speeds things up, which is highly likely, then your options are: get faster storage with more IOPS, optimize your block sizes to make better use of the network, or add a SLOG device, which should be an enterprise power-loss-protected NVMe (or a mirror of such) to give the NFS sync writes a small, fast place to land. On a 10G interlink, I would think 20GB should be plenty.

Over-provision this device to give it some longevity, as EVERY sync write to the pool you attach the SLOG to will be written there first...
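
For reference, attaching a SLOG is a one-liner; a sketch with placeholder pool and device names:

# mirrored SLOG (safer: a lone SLOG that dies right when the box crashes can lose in-flight sync writes)
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# or a single device
zpool add tank log /dev/nvme0n1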

Also: NFSv3 or NFSv4? Depending on your workload, you could see performance differences between the two.
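
If you're not sure which version the ESXi host actually mounted, the stock esxcli queries should tell you (datastore names will obviously differ on your setup):

# NFSv3 datastores
esxcli storage nfs list

# NFS 4.1 datastores
esxcli storage nfs41 list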
 

nabsltd

Well-Known Member
Jan 26, 2022
or add a SLOG device, which should be an enterprise power-loss-protected NVMe (or a mirror of such) to give the NFS sync writes a small, fast place to land.
Remember that SLOG is not a cache. It can help if the average ingest through the network is slower than the speed of the underlying pool, but if the pool is slower than the network, and you are writing for long enough, eventually the speed over the network will be slowed to match the speed of the pool.
 

dragonme

Active Member
Apr 12, 2016
Remember that SLOG is not a cache. It can help if the average ingest through the network is slower than the speed of the underlying pool, but if the pool is slower than the network, and you are writing for long enough, eventually the speed over the network will be slowed to match the speed of the pool.
It IS a cache in the sense that it is a much faster place for the sync write to land and get ACKed, so the next sync write can be sent over NFS. An NVMe SLOG will be way more performant than the pool he has mentioned, or the 10G Ethernet. That is, wait for it, why Sun implemented the SLOG in the first place.
 

nabsltd

Well-Known Member
Jan 26, 2022
It IS a cache in the sense that it is a much faster place for the sync write to land and get ACKed, so the next sync write can be sent over NFS.
Except that it isn't a cache, in that data is (almost) never read from the SLOG (only on pool import after a crash). Data is written from RAM to the actual pool, and once data has been in RAM for 5 seconds (the default), it must be written to the pool. So, with a 10Gbps network, a SLOG only helps for the first 5-6GB of data. After that, sync writes are not returned as complete until the data waiting in RAM is written, and this is limited by the speed of the underlying pool.

If the SLOG were a true cache, a very large, very fast device would allow you to write at full network wire speed regardless of the speed of the underlying pool. Sure, even that would have its limit, but a 1.92TB NVMe would allow for 30-40 minutes of full-speed transfer before it finally slowed down (assuming a 10Gbps NIC and a pool that can write at 200MB/sec).
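
Rough back-of-envelope for those figures, assuming ~1.15GB/s of usable 10GbE payload, the 5-second default commit interval, and a pool that writes at 200MB/s:

# data buffered before pool speed takes over: ~1.15 GB/s x 5 s
echo "1.15 * 5" | bc                    # ~5.75 GB

# minutes to fill a 1.92TB device at the net fill rate (1.15 - 0.2 GB/s)
echo "1920 / (1.15 - 0.2) / 60" | bc    # ~33 minutes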
 

dragonme

Active Member
Apr 12, 2016
Except that it isn't a cache, in that data is (almost) never read from the SLOG (only on pool import after a crash). Data is written from RAM to the actual pool, and once data has been in RAM for 5 seconds (the default), it must be written to the pool. So, with a 10Gbps network, a SLOG only helps for the first 5-6GB of data. After that, sync writes are not returned as complete until the data waiting in RAM is written, and this is limited by the speed of the underlying pool.

If the SLOG were a true cache, a very large, very fast device would allow you to write at full network wire speed regardless of the speed of the underlying pool. Sure, even that would have its limit, but a 1.92TB NVMe would allow for 30-40 minutes of full-speed transfer before it finally slowed down (assuming a 10Gbps NIC and a pool that can write at 200MB/sec).
Now you are really showing you don't know of what you speak.

A SLOG is not a read cache and has nothing to do with read operations.

It replaces the on-disk ZIL as a separate write-intent log device that can acknowledge sync-write requests far faster than the typical write to the on-pool ZIL before committing the full write.

I am done here
 

oh2ftu

New Member
Aug 2, 2022
Thanks for the replies.
As I see it, I'd need a suitable PCIe NVMe drive (PM1725, P3600 or similar), or two, to get more performance.
Granted, more disks (I'm currently testing against a single SSD) would yield more performance, but not by much.

Also, looking into it, NFS on ZFS does not seem to support VAAI, which is a bummer. iSCSI should support it, but I have to test it.
This kinda sucks.
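
For what it's worth, once an iSCSI target is up you can check which VAAI primitives ESXi sees per device from the ESXi shell; NFS datastores only get the VAAI-NAS primitives if a vendor plugin (VIB) is installed:

# per-device VAAI primitive support (block/iSCSI devices)
esxcli storage core device vaai status get

# see whether any VAAI-NAS plugin is installed
esxcli software vib list | grep -i vaai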
 

nabsltd

Well-Known Member
Jan 26, 2022
A SLOG is not a read cache and has nothing to do with read operations.
Which is exactly what I said. The only time ZFS reads from the SLOG is when the pool is imported after an unclean shutdown. It does this to recover the ZIL that was sitting in RAM and was lost.

it replaces the on-disk ZIL as a separate write-intent log device
Again, exactly what I said.

that can acknowledge sync-write requests far faster than the typical write to the on-pool ZIL before committing the full write.
Yes, it does that, until ZFS requires the data to actually be written to the pool. At that point, ZFS won't respond back to a client until the in-memory copy of the ZIL has no data that is more than 5 seconds old. Once that happens, it doesn't matter how fast the SLOG is; the limiting factor once again becomes the speed of the pool disks.

As long as your overall average write speed is less than the underlying pool speed, a SLOG helps with an individual write returning more quickly than it otherwise would. But, once you are saturating the pool, that's the limit of transfer, and the SLOG doesn't help anymore.

One thing that a user can do to improve this is to increase the timeout that data can stay in RAM before it must be committed. This still just delays the point at which the underlying pool speed becomes the limiting factor, but for many workloads, it's enough to have 20-30 seconds of full-speed network burst instead of the default 5.
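
On TrueNAS Scale (Linux) that timeout is the OpenZFS zfs_txg_timeout module parameter; a sketch of what bumping it might look like, with the caveat that you also need enough RAM/dirty-data headroom to actually hold that many seconds of writes:

# current transaction group timeout in seconds (default is 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# let dirty data sit in RAM for up to 30 seconds before it must hit the pool
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout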
 

fohdeesha

Kaini Industries
Nov 20, 2016
Now you are really showing you don't know of what you speak.

A SLOG is not a read cache and has nothing to do with read operations.

It replaces the on-disk ZIL as a separate write-intent log device that can acknowledge sync-write requests far faster than the typical write to the on-pool ZIL before committing the full write.

I am done here
literally what he said lmao