Very slow ZFS RaidZ2 Performance on TrueNAS 12

frawst

New Member
Mar 2, 2021
To whoever might be able to lend a hand here: I have a relatively new ZFS setup that I've recently moved into a new server, and I'm seeing far lower performance than expected.

Admittedly I'm new to ZFS and know very little about it. My hardware setup is as follows:

Proxmox 6.4 -> TrueNAS VM (12.0-U3.1)
Supermicro H11DSi-NT
Dual EPYC 7601's
128GB DDR4-2666 ECC (4x 32GB)
12x 8TB disks in RaidZ2 (mixture of Seagate and WD, mostly 5400 rpm or unlabeled) <-- mostly shucked drives.

The disks are passed through via PCIe passthrough on LSI 9211-8i cards in IT mode.
Previously the disks and cards were used in a much slower server on XFS and could easily saturate the 10GbE network. Now I'm lucky if I get ~200MB/s read or write; most of the time I'm seeing less than 100MB/s.

I currently have 2 SATA SSDs acting as a read cache, and I've tried using an NVMe drive for writes (all 256GB). However, the write cache was carved out of free space on the Proxmox OS disk (a PCIe card for more M.2 NVMe drives is standing by).



Summary / TL;DR:
TrueNAS VM on Proxmox using a 12-disk RaidZ2 vdev is running like hot garbage. Halp!

And thanks for any help you all can provide.
 

sboesch

Active Member
Aug 3, 2012
Columbus, OH
ZFS performance can be improved by tweaking some settings. I always set these when using Proxmox, TrueNAS Core, or any other ZFS setup. You should stripe mirrors for the best IO; RAIDZ2 is not exactly fast, and I personally don't use it for pools larger than 10 disks.
  • Make sure that when you created your pool you used ashift=12, which sets a 4K block size on your disks.
  • zfs set xattr=sa (pool): store Linux extended attributes in the inodes instead of as tiny hidden files, which cuts extra writes.
  • zfs set sync=disabled (pool): disable sync writes. This may seem dangerous, but do it anyway! You will get a huge performance gain.
  • zfs set compression=lz4 (pool/dataset): set the default compression here; LZ4 is currently the best all-around algorithm.
  • zfs set atime=off (pool): stop updating the access-time attribute on every file that is read; this can double IOPS.
  • zfs set recordsize=(value) (pool/dataset): the right recordsize depends on the data on the filesystem: 16K for VM images and databases (or an exact match to the workload), 1M for collections of 5-9MB JPGs, GB+ movies, etc. If you are unsure, the default 128K is good enough for an all-around mix of file sizes.
  • You will see no gains from a read cache or an SLOG with your configuration; don't bother.
ZFS loves RAM. You can also tweak the ARC to a larger size for some performance gains.
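Taken together, and assuming a pool named `tank` with a `media` dataset (substitute your own pool and dataset names; this is my own consolidation of the list above, not a copy-paste fix), the settings would look like this from a shell:

```shell
# Verify the pool was created with 4K sectors (ashift=12).
# On OpenZFS 2.x ashift is readable as a pool property.
zpool get ashift tank

# Store extended attributes in inodes instead of tiny hidden files (Linux)
zfs set xattr=sa tank

# Disable sync writes: big throughput gain, but writes in flight can be
# lost on power failure or a crash -- only do this if you accept that risk
zfs set sync=disabled tank

# Enable LZ4 compression and stop access-time updates
zfs set compression=lz4 tank
zfs set atime=off tank

# Match recordsize to the workload: 16K for VMs/databases, 1M for big media
zfs set recordsize=1M tank/media
```

These are admin commands against a live pool, so run them one at a time and check `zfs get all tank` afterward to confirm each took effect.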
 

frawst

New Member
Mar 2, 2021
First of all. THANK YOU!

The pool was already set to ashift=12, which is LUCKY because I had no idea and it already has a lot of data in place. Phew.
Recordsize is staying at the default 128K; I use the pool for too many different things to go bigger or smaller, so here we sit.

I tried all the settings in combinations to see what would happen, and I also removed all caching vdevs to get real, raw results. I figured this was a good opportunity to "science" the options. All this did was leave me confused: I'm not sure what did the trick, but I can no longer recreate the crazy bottleneck I had. I should've tested and taken screenshots beforehand!


(all tests were done over 10Gbe links)

With xattr=on, sync set to standard, and LZ4 enabled. // This was my starting configuration, which previously also included those cache vdevs

Read from NAS to Proxmox Host over NFS
1622213662768.png

Write from Windows Desktop to NAS
1622214377132.png




With xattr=sa, sync set to standard, and LZ4 enabled.

Read from NAS to Proxmox Host over NFS
1622213406875.png

Write to NAS from Windows Desktop:
1622214446157.png




With xattr=sa, sync disabled, and LZ4 enabled.

Read from NAS to Proxmox Host over NFS
1622213173987.png

Write to NAS from Windows Desktop:
1622214650196.png


Just to verify the network isn't throttling to 1GbE, I also did a read test from the NAS to the Windows desktop:
1622214776992.png



So NFS seems to do great, but Samba is terrible. Any thoughts on this one?

Thanks again for the super informative post! It's been very helpful in this venture.

-Tyson
 


sboesch

Active Member
Aug 3, 2012
Columbus, OH
On your Proxmox host, open a shell and install fio: apt install fio -y
Then run a benchmark with fio:
cd /mnt/pve/Vault/

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1


This will run a benchmark on your pool and give you an idea about your speed.

Ars has a good article on fio here.
 

sboesch

Active Member
Aug 3, 2012
Columbus, OH
Running fio on my TrueNAS core host with 8x 5900rpm 4TB Iron Wolf drives in RAIDZ2 I see these numbers.
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

corrin.PNG
 

frawst

New Member
Mar 2, 2021
Done and done! It seems we're consistently hitting ~200MB/s write with this method. Sync is off, xattr=sa, LZ4 on.

As you can see, I didn't give it all the available horsepower (the host has other duties as well, of course), but I can bump it up if needed.

1622218003370.png

1622218168210.png
 

frawst

New Member
Mar 2, 2021
Running fio on my TrueNAS core host with 8x 5900rpm 4TB Iron Wolf drives in RAIDZ2 I see these numbers.
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

View attachment 18805
You're certainly performing better than I! That at least gives me confidence that it's possible.
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
For comparison, here is my RAIDZ2 on 8x 14TB drives (shucked white-label HGST HC530 7200rpm), connected partly to SATA 6Gbps mainboard ports and partly to a JMB585 controller:

WRITE: bw=602MiB/s (631MB/s), 602MiB/s-602MiB/s (631MB/s-631MB/s), io=36.0GiB (38.7GB), run=61251-61251msec

I have an Intel P3700 400GB (HPE OEM MO0400KEFHN) that provides an 8GB SLOG, but it is unused during the fio run (async writes instead of sync).

So something is holding you back; an otherwise beefy system like yours is only delivering 1/3 of the expected performance.

My hunch is the LSI controller. Does it have a dedicated fan, or is this a server case with forced airflow? This card needs good cooling; see if an 8-14cm fan at 1000rpm+, temporarily installed to blow right at the heatsink, improves things.

Another item to check is the LSI firmware revision. Which one are you running? 20.00.07.00 should be OK, but you could try a 19.xx or even a 16.xx and retest; later LSI firmwares are not always better. Also, do not skip flashing the LSI UEFI BIOS image, even in IT mode. Sometimes the controller will want to complain about something, and without the BIOS you will never find out.

hdparm -I /dev/sdX
Is the write cache enabled?
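A quick loop to check (and, if needed, enable) the write cache on every member disk; `sda` through `sdl` here are placeholders for the twelve data disks, so adjust to match your system:

```shell
# Report the write-cache state of each member disk
for d in /dev/sd{a..l}; do
    echo "== $d =="
    hdparm -I "$d" | grep -i 'write cache'
done

# Enable the volatile write cache on a disk if it is off
# (generally safe under ZFS, which issues cache flushes at
# transaction-group commit)
hdparm -W1 /dev/sdX
```
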

Let us know.
 

frawst

New Member
Mar 2, 2021
For comparison, here is my RAIDZ2 on 8x 14TB drives (shucked white-label HGST HC530 7200rpm), connected partly to SATA 6Gbps mainboard ports and partly to a JMB585 controller:

WRITE: bw=602MiB/s (631MB/s), 602MiB/s-602MiB/s (631MB/s-631MB/s), io=36.0GiB (38.7GB), run=61251-61251msec

I have an Intel P3700 400GB (HPE OEM MO0400KEFHN) that provides an 8GB SLOG, but it is unused during the fio run (async writes instead of sync).

So something is holding you back; an otherwise beefy system like yours is only delivering 1/3 of the expected performance.

My hunch is the LSI controller. Does it have a dedicated fan, or is this a server case with forced airflow? This card needs good cooling; see if an 8-14cm fan at 1000rpm+, temporarily installed to blow right at the heatsink, improves things.

Another item to check is the LSI firmware revision. Which one are you running? 20.00.07.00 should be OK, but you could try a 19.xx or even a 16.xx and retest; later LSI firmwares are not always better. Also, do not skip flashing the LSI UEFI BIOS image, even in IT mode. Sometimes the controller will want to complain about something, and without the BIOS you will never find out.

hdparm -I /dev/sdX
Is the write cache enabled?

Let us know.

Thanks for the input! The cooling concern was the main thing that stood out to me; the cards did move from a 2U blade with very fast fans to a 4U with slower ones. I swapped out a row of fans for some higher-CFM ones, and also rested a 120mm fan on top of the cards while testing. It doesn't seem to have made a difference. These HBAs recently left another server running XFS with 8 total disks and were reaching 2Gb/s. I've not changed anything with the cards otherwise, so firmware aside, they're certainly capable.

I've cooled the system down and still seem to be getting very poor performance, mostly floating in the 100-200MB/s range for both read and write, even locally within the VM. I'm not sure where to go from here.

-Tyson
 

frawst

New Member
Mar 2, 2021
21
3
3
To add to this, I created a RAIDZ of NVMe SSDs on the Proxmox host and proceeded to move an LVM volume onto it. It started out nice and fast at about 700MB/s, then quickly trickled down to 60MB/s. Totally different disks with completely different connections; something's gotta be off here. I kinda wonder if there is some kind of BIOS stuff at play?

1622774937733.png
1622775006945.png

1622775020924.png
 

okrasit

Member
Jun 28, 2019
To add to this, I created a RAIDZ of NVMe SSDs on the Proxmox host and proceeded to move an LVM volume onto it. It started out nice and fast at about 700MB/s, then quickly trickled down to 60MB/s. Totally different disks with completely different connections; something's gotta be off here. I kinda wonder if there is some kind of BIOS stuff at play?

View attachment 18896
View attachment 18897

View attachment 18898
The cause is right in front of you. The actual problem is OpenZFS code quality. It uses the Solaris Porting Layer (SPL), which is basically a not-so-great wrapper around the Linux kernel threading API.

- It doesn't know about NUMA at all
- It doesn't understand cache locality or hierarchy
- It does idiotic thread placement
- It doesn't respect core isolation
The list goes on forever :'pain':

To add to the :'pain':, most of the worker thread counts that ZFS spawns are tied directly to the system's CPU core (thread) count. So the more cores and sockets your system has, the lower the performance you can expect. Can you even imagine how this affects VMs running on your system, when any disk activity anywhere triggers that kind of CPU-time hogging and cache contention?

That's the reason for the shitty performance! It was basically "designed" to work on a low core count single cpu scenario. And I say "designed" because, from the looks of it, it wasn't designed at all.

Excuse me for my rant, I spent a whole ******* week to get it to work decently.
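One way to see whether NUMA placement is part of the problem on a dual-socket box like the OP's (my own illustration, not a fix for the SPL itself) is to compare fio numbers with the benchmark confined to a single node. Note this only pins the fio process, not the ZFS kernel threads, but a large delta between runs is still suggestive:

```shell
# Show NUMA topology: nodes, their CPUs, and their memory
numactl --hardware

# Re-run the earlier benchmark with CPU and memory bound to node 0,
# taking cross-socket memory traffic for the benchmark itself
# out of the picture
numactl --cpunodebind=0 --membind=0 \
    fio --name=random-write --ioengine=posixaio --rw=randwrite \
        --bs=1m --size=16g --numjobs=1 --iodepth=1 \
        --runtime=60 --time_based --end_fsync=1
```
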
 

pr1malr8ge

Member
Nov 27, 2017
Sheesh... I've been scrounging around trying to find some things and stumbled across this thread.
My setup at the moment is 6x 10TB HGST SAS in RAIDZ2 plus 6x 3TB HGST SATA in RAIDZ2, two vdevs in a single pool.
My results are horrid compared to what you guys are pulling.

ESXi VM, TrueNAS Core, 64GB of dedicated RAM, a SAS3 SM 3008L-L8e passed through on a SAS3 expander, and 12 vCPUs [physical CPU is 10c/20t].
Pool: the only change was atime=off
ashift=12
The rest is defaults from TrueNAS.

Main system is an SM x9-dri-lnf4, E5-2848L v2 10c/20t
128GB (4x 32GB) LR DIMMs @ 1800

These are my results:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=1
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [w(1)][4.3%][eta 25m:07s]
random-write: (groupid=0, jobs=1): err= 0: pid=67940: Mon Apr 18 13:37:32 2022
  write: IOPS=10, BW=10.4MiB/s (10.9MB/s)(698MiB/66929msec); 0 zone resets
    slat (nsec): min=18084, max=77672, avg=33330.49, stdev=9293.91
    clat (usec): min=287, max=35700k, avg=95815.69, stdev=1787140.61
     lat (usec): min=305, max=35700k, avg=95849.02, stdev=1787140.67
    clat percentiles (usec):
     |  1.00th=[     289],  5.00th=[     293], 10.00th=[     297],
     | 20.00th=[     306], 30.00th=[     310], 40.00th=[     314],
     | 50.00th=[     318], 60.00th=[     326], 70.00th=[     330],
     | 80.00th=[     338], 90.00th=[     351], 95.00th=[     367],
     | 99.00th=[     603], 99.50th=[     603], 99.90th=[17112761],
     | 99.95th=[17112761], 99.99th=[17112761]
   bw (  KiB/s): min=688725, max=721869, per=100.00%, avg=705297.00, stdev=23436.35, samples=2
   iops        : min=  672, max=  704, avg=688.00, stdev=22.63, samples=2
  lat (usec)   : 500=98.14%, 750=1.58%
  lat (msec)   : >=2000=0.29%
  cpu          : usr=0.03%, sys=0.02%, ctx=699, majf=0, minf=1
  IO depths    : 1=100.1%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,698,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=10.4MiB/s (10.9MB/s), 10.4MiB/s-10.4MiB/s (10.9MB/s-10.9MB/s), io=698MiB (732MB), run=66929-66929msec
 

BackupProphet

Well-Known Member
Jul 2, 2014
Stavanger, Norway
kingmakers.no
Keep in mind, ZFS has dynamic record sizes, and in most cases you can just set the default to 1MB. Your database or VM can still write 4KB; it's just that the maximum block size is 1MB.

ashift=12 is great for large hard drives, but SSDs can still benefit from ashift=9 / 512-byte sectors, especially with compression, since some 4KB writes can be compressed down to 1-2KB. You save a lot of space and improve latency.
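As a back-of-the-envelope illustration of that ashift trade-off (my own sketch, not from the post): a 4 KiB block that LZ4 compresses to 1.5 KiB still occupies a full 4 KiB allocation at ashift=12, but only three 512-byte sectors at ashift=9.

```python
import math

def allocated_bytes(compressed_size: int, ashift: int) -> int:
    """Space a compressed block occupies on disk, rounded up to the sector size."""
    sector = 2 ** ashift
    return math.ceil(compressed_size / sector) * sector

# A 4 KiB logical block that LZ4 compresses to 1.5 KiB:
print(allocated_bytes(1536, 12))  # ashift=12: rounds up to 4096 bytes, no savings
print(allocated_bytes(1536, 9))   # ashift=9: 3 x 512 B sectors = 1536 bytes
```

Stephan's counterpoint below is the other side of the same arithmetic: more, smaller sectors per logical block means more flash-page write operations.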
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
@BackupProphet If you think about it, ZFS with recommended settings will already LZ4-compress data, so an SSD with e.g. a SandForce controller will not accomplish any further compression. In that case, linear writes at ashift=9 cause 8x as many SSD writes to a flash page, e.g. 512-512-512-512-512-512-512-512 instead of a single 4096-byte write. I would only ever use ashift=9 on a 4K device if I knew very well in advance that I would reclaim a lot of slack space because I have literally a myriad of small files of a certain (misaligned) size.
 

Joel

Active Member
Jan 30, 2015
FWIW, I came to this thread because my writes were very slow... turns out I had 2 bad drives.
 

Joel

Active Member
Jan 30, 2015
FWIW, I came to this thread because my writes were very slow... turns out I had 2 bad drives.
By very slow, I mean I was getting ~30MB/s as reported by rsync --info=progress2, transferring from a single 14TB drive via USB3. Mea culpa for not running badblocks on some new-to-me drives. All good drives are in now, and the transfer rate is up to ~160MB/s.