M.2 NVMe's on Proxmox: pool vs array vs insanity?!?

pipe2null

New Member
May 25, 2021
I've been wrestling with insanity on this one.

I'm trying to find the optimum configuration for raw performance for 3x M.2 NVMe SSDs on my Proxmox server to be used for VM images and containers. For the life of me, I cannot understand the performance numbers I'm getting for various zfs and mdadm configs I have tried.

Perf testing:
I'm using fio for testing, with the individual tests configured to be somewhat like CrystalDisk default tests. In retrospect I wish I had added 64k r/w, but I'm not restarting my testing from scratch. I have also used CrystalDisk from inside a windows VM in a few instances. All perf tests were done on the host other than a couple purposely done in-VM.

Issue 1 ZFS)
Putting all 3 NVMe's in a striped pool (raid0, i.e. 3 single-drive vdevs), or putting all 3 in a raidz1 pool, or all in a 3-way mirror pool, yields perf numbers at or below the single-disk performance of the NVMe drives. I tested with many zfs recordsize values, and pool performance is always on par with or less than a single drive, with the exception of the 4k random read/write tests (which I don't really care about) and the magical 512k r/w test, which for some reason tends to perform better than the other tests but is still nowhere near the perf I would expect. In every variation, perf tests of the NVMe pool are on par with or worse than my 4-drive SATA raidz1 ZFS pool running on the same machine.
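For what it's worth, here is how I'd check the usual suspects people will probably ask about (pool/dataset names here are examples, not my real ones):

```shell
# ashift: consumer NVMe often advertises 512B logical sectors, and ashift=9
# on flash with 4K/8K pages can tank ZFS write performance; 12 or 13 is safer.
zpool get ashift nvmepool

# What the drive itself reports (nvme-cli package):
nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'

# Other knobs worth ruling out while benchmarking:
zfs get atime,compression,recordsize,sync nvmepool
```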

WTF?


Issue 2 Mdadm)
Putting all 3 NVMe's in a mdadm raid0 array gets the perf numbers I would expect, roughly 2.5x single-drive perf. A raid10 array also yields the perf I would expect for that configuration. But when I move a VM's qcow2 onto the md array, running CrystalDisk from within the VM gives perf numbers not very different from running the VM on my spinny disk array. Boot is faster and apps start faster, so IO latency seems much better, but I'm not getting the overall bandwidth I'm expecting: the perf of the NVMe array as seen by the VM is not much better than spinny-disk IO throughput. Using PCIe passthrough and running CrystalDisk inside the VM against a "native" NVMe gets perf numbers roughly on par with the single-disk numbers obtained on the host system with fio.


I'm mostly concerned with "Issue 1 ZFS". I want to run my NVMe array without redundancy at max perf, but do regular backups of it to my spinny raidz1, so I'd like to use ZFS for block-level snapshot/backup, just not with the bizarre performance I'm seeing right now.

For "Issue 2 Mdadm", I'm scratching my head over the massive performance drop seen inside the VM compared to host perf tests. Obviously it will be less, much less, but on the host I get about 5 GB/s and only about 500 MB/s inside the VM, even when that VM is the only thing touching the disk array and its qcow2 is literally the only image file on the drive.


Can anyone shed some light on what is going on with these numbers, or should I see if Amazon can overnight a straitjacket? Thanks!

[EDIT]
Solved Issue #2 ("poor perf seen from inside VM") by making sure the guest had VirtIO drivers installed and changing the HDD emulation setting in Proxmox to virtio. Now seeing inside-VM IO perf on par with host-side perf: ~8 GB/s seq read with the virtual drive on the host mdadm raid0 NVMe array.
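For anyone else hitting this, the change was roughly the following (VM ID and storage name below are examples, substitute your own):

```shell
# Sketch, assuming VM ID 100 and a directory storage named "md0nvme".
# 1. Switch the SCSI controller to VirtIO SCSI (Hardware tab, or CLI):
qm set 100 --scsihw virtio-scsi-pci

# 2. Reattach the disk on a virtio bus instead of emulated IDE/SATA:
qm set 100 --virtio0 md0nvme:100/vm-100-disk-0.qcow2

# 3. A Windows guest needs the virtio-win driver ISO installed before it
#    can see the disk on the new bus.
```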
 
Last edited:

zer0sum

Well-Known Member
Mar 8, 2013
Did you happen to try passing the nvme drives directly to a VM guest to see how the numbers stack up doing it that way?
 

pipe2null

New Member
May 25, 2021
I did a PCIe passthrough to a windows VM and ran CrystalDisk against a single NVMe, and those numbers roughly matched the fio perf numbers obtained on the host, and were somewhat close to what people have posted in amazon reviews of that specific NVMe drive. In that configuration I did not see any significant performance degradation from within the VM compared to the host, in regards to issue #2.


I'm just wondering if there is some problem with my server config. A ZFS pool of NVMe drives should have better perf than a ZFS pool of spinny disks, and in no sane world should NVMe throughput be on par with or worse than SATA. And when attempting to use mdadm/ext4 instead of zfs, a 90% decrease in IO throughput from within the VM compared to the host seems excessive to me.

Clearly either I am doing something completely wrong or there is some config issue with my Proxmox server. I've tried everything I can think of, so I'm looking for advice on a new rathole to disappear into while trying to figure this out and get reasonable performance out of the hardware. Much appreciated.
 

amalurk

Active Member
Dec 16, 2016
Disable sync?
 

pipe2null

New Member
May 25, 2021
The fio perf tests I was running are "supposed" to be running async io (fio --ioengine=libaio)... But I'll rerun a short test with sync disabled on the pool and see if the numbers are any different. Thanks for the suggestion!
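For the record, disabling sync is per dataset and trivial to toggle (dataset name below is an example):

```shell
# Example dataset name; substitute your own.
zfs set sync=disabled nvmepool/vmdata
# ...rerun the fio tests...
zfs set sync=standard nvmepool/vmdata   # put it back afterwards
```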
 

pipe2null

New Member
May 25, 2021
No difference when I reran some of the perf tests with sync=disabled on zfs datasets.

I just tried another approach with a zpool with only a single NVME drive. Creating a zpool with exactly one of my NVMe drives has roughly the same perf as a zpool with all 3 NVMe drives?!?
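A rough sketch of the pool layouts I've been comparing, for anyone who wants to sanity-check them (pool names and device paths here are illustrative, not my exact commands):

```shell
# 3x single-drive vdevs (raid0-like stripe):
zpool create nvstripe /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

# 3-drive raidz1:
zpool create nvrz1 raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

# 3-way mirror:
zpool create nvmir mirror /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

# recordsize was varied per test run, e.g.:
zfs set recordsize=64k nvstripe
```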

So, here is the dump of info I've collected so far... Perhaps someone can see what I'm doing wrong?


My performance test: Loops = 5, writezero=0, size=1024m (1 GB), qsize=32m
fio --loops=$LOOPS --size=$SIZE --filename="$TARGET/.fiomark.tmp" --stonewall --ioengine=libaio --direct=1 --zero_buffers=$WRITEZERO \
--name=Bufread --loops=1 --bs=$SIZE --iodepth=1 --numjobs=1 --rw=readwrite \
--name=Seqread --bs=$SIZE --iodepth=1 --numjobs=1 --rw=read \
--name=Seqwrite --bs=$SIZE --iodepth=1 --numjobs=1 --rw=write \
--name=512kread --bs=512k --iodepth=1 --numjobs=1 --rw=read \
--name=512kwrite --bs=512k --iodepth=1 --numjobs=1 --rw=write \
--name=SeqQ32T1read --bs=$QSIZE --iodepth=32 --numjobs=1 --rw=read \
--name=SeqQ32T1write --bs=$QSIZE --iodepth=32 --numjobs=1 --rw=write \
--name=4kread --bs=4k --iodepth=1 --numjobs=1 --rw=randread \
--name=4kwrite --bs=4k --iodepth=1 --numjobs=1 --rw=randwrite \
--name=4kQ32T1read --bs=4k --iodepth=32 --numjobs=1 --rw=randread \
--name=4kQ32T1write --bs=4k --iodepth=32 --numjobs=1 --rw=randwrite \
--name=4kQ8T8read --bs=4k --iodepth=8 --numjobs=8 --rw=randread \
--name=4kQ8T8write --bs=4k --iodepth=8 --numjobs=8 --rw=randwrite > "$TARGET/fiomark.txt"
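For completeness, the variables referenced above, with the values from my description (TARGET is just an example mountpoint, set it to whatever is under test):

```shell
LOOPS=5          # --loops for most test groups
SIZE=1024m       # 1 GiB test file; also the block size for the Seq tests
QSIZE=32m        # block size for the Q32 sequential tests
WRITEZERO=0      # 0 = random buffers (defeats compression), 1 = zeroed buffers
TARGET=/mnt/testpool
```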

Results in (MiB/s); fio groups 0-12 correspond to the test columns left to right:

| Drive/Pool/Array | ZfsRecordSize/Details | Bufread | Seqread | Seqwrite | 512kread | 512kwrite | SeqQ32T1read | SeqQ32T1write | 4kread | 4kwrite | 4kQ32T1read | 4kQ32T1write | 4kQ8T8read | 4kQ8T8write |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Samsung 512GB | Single ext4 drive | 2943 | 2810 | 2395 | 2425 | 2226 | 2914 | 2721 | 73 | 216 | 827 | 640 | 1194 | 729 |
| WD Blue 500GB | Single ext4 drive | 2901 | 2636 | 2099 | 2008 | 2032 | 2819 | 2328 | 67 | 233 | 770 | 891 | 955 | 1616 |
| WD Black 500GB | Single ext4 drive | 2653 | 2543 | 2223 | 1920 | 2053 | 2639 | 2476 | 55 | 257 | 764 | 558 | 1505 | 2132 |
| Individual Nvme Baseline | (Weakest per group) | 2653 | 2543 | 2099 | 1920 | 2032 | 2639 | 2328 | 55 | 216 | 764 | 558 | 955 | 729 |
| Zpool Samsung512GB only | 64k | 2142 | 2032 | 2227 | 5339 | 2583 | 3382 | 2358 | 679 | 339 | 843 | 325 | 2358 | 1344 |
| Zpool 3 single drives | 4k | 719 | 691 | 482 | 943 | 488 | 789 | 474 | 494 | 347 | 484 | 344 | 2318 | 887 |
| | 16k | 1508 | 1411 | 1335 | 1999 | 1422 | 1950 | 1285 | 721 | 347 | 423 | 338 | 2256 | 1128 |
| | 64k | 2096 | 1913 | 2158 | 3124 | 2705 | 2466 | 1438 | 522 | 276 | 269 | 252 | 1420 | 892 |
| | 256k | 2343 | 2121 | 2421 | 3310 | 3626 | 2159 | 2414 | 121 | 140 | 331 | 195 | 493 | 1164 |
| | 1M | 2560 | 2053 | 1616 | 3274 | 2257 | 3257 | 2107 | 31 | 62 | 142 | 185 | 119 | 940 |
| | 64k,nocomp,nosync | 3261 | 3259 | 2386 | 5872 | 3234 | 3366 | 2467 | 850 | 372 | 826 | 366 | 2928 | 1585 |
| | 64k,nocomp,stdsync | 2212 | 2075 | 2301 | 6416 | 3128 | 3368 | 2300 | 867 | 344 | 827 | 340 | 2515 | 1205 |
| | 64k,lz4,nosync | 4339 | 3382 | 2394 | 6606 | 3122 | 3471 | 2402 | 825 | 408 | 807 | 343 | 2439 | 1240 |
| | 64k,lz4,stdsync | 2207 | 2065 | 2321 | 6649 | 2954 | 3411 | 2299 | 869 | 389 | 821 | 319 | 3188 | 1056 |
| | Ext4 on zvol | 7474 | 5453 | 3272 | 2736 | 1357 | 5812 | 2334 | 301 | 224 | 833 | 721 | 1620 | 912 |
| Zpool 3 drive RaidZ1 | 4k | 775 | 726 | 489 | 2378 | 489 | 1869 | 471 | 611 | 350 | 622 | 356 | 2563 | 1038 |
| | 16k | 1465 | 1388 | 1351 | 2086 | 1435 | 2883 | 1005 | 709 | 356 | 648 | 355 | 2568 | 1253 |
| | 64k | 2004 | 1850 | 1496 | 2706 | 2370 | 3114 | 1977 | 661 | 332 | 705 | 321 | 2168 | 1348 |
| | 256k | 2432 | 2228 | 2568 | 6985 | 3015 | 3450 | 2487 | 284 | 188 | 537 | 267 | 1020 | 1371 |
| | 1M | 2498 | 2235 | 2399 | 6252 | 2643 | 3411 | 2349 | 468 | 183 | 306 | 295 | 202 | 1263 |
| Mdadm 3 drive raid0 | chunk=4k | 524 | 517 | 233 | 470 | 240 | 521 | 241 | 38 | 3 | 45 | 3 | 45 | 3 |
| | chunk=16k | 6059 | 5988 | 5084 | 4119 | 3870 | 5812 | 6693 | 61 | 244 | 746 | 586 | 2670 | 3572 |
| | chunk=64k | 7161 | 5241 | 5333 | 4576 | 4256 | 5505 | 6719 | 60 | 236 | 716 | 638 | 2661 | 1855 |
| | chunk=256k | 7314 | 5306 | 5025 | 3462 | 3210 | 5541 | 6995 | 60 | 225 | 765 | 598 | 2613 | 3491 |
| | chunk=1M | 8678 | 6252 | 5300 | 1721 | 2222 | 6400 | 6975 | 60 | 242 | 783 | 623 | 2605 | 3452 |
| Mdadm 3 drive raid10 | chunk=4k | 1101 | 1066 | 1212 | 916 | 981 | 1165 | 970 | 51 | 176 | 659 | 374 | 2353 | 909 |
| | chunk=16k | 4267 | 3336 | 2288 | 1708 | 1681 | 3673 | 1477 | 55 | 188 | 622 | 390 | 2430 | 990 |
| | chunk=64k | 7420 | 5025 | 2611 | 2353 | 1942 | 3935 | 1577 | 54 | 200 | 693 | 383 | 2390 | 587 |
| | chunk=256k | 8752 | 5262 | 2649 | 2332 | 1959 | 5494 | 1484 | 53 | 189 | 649 | 366 | 2407 | 1041 |
| | chunk=1M | 8605 | 5412 | 2760 | 1596 | 1823 | 5447 | 1488 | 53 | 205 | 680 | 392 | 2396 | 1092 |
| Hybrid: zfs on raid10 | c=64k, rs=64k | 2271 | 1958 | 2251 | 6313 | 2560 | 2775 | 2267 | 773 | 411 | 668 | 399 | 2415 | 1690 |
| Hybrid: zfs on raid0 | c=64k, rs=64k | 2265 | 2013 | 2326 | 6564 | 2984 | 3386 | 2279 | 811 | 339 | 805 | 353 | 2262 | 1386 |
| Hybrid ext4 on zvol on raid0 | c=64k | 4592 | 3751 | 2992 | 3049 | 1287 | 6059 | 2301 | 291 | 212 | 786 | 685 | 1068 | 895 |
| Hybrid ext4 on zvol on raid0 | c=64k zvol=thin | 4214 | 3373 | 1688 | 2984 | 1325 | 6139 | 2568 | 273 | 202 | 735 | 633 | 1110 | 910 |


The perf numbers for the mdadm tests are what I would expect from those NVMe drives in those soft-raid configurations, so I'm thinking my test methodology is more or less adequate, but I could be wrong. Have I done something head-smackingly stupid? Or... ?

Thoughts?
 

pipe2null

New Member
May 25, 2021
Supermicro X10DRU-i+
NVMe's on an Asus 4-slot M.2 NVMe adapter board. No bridge chip, but the mobo supports bifurcating a PCIe x16 slot into 4x PCIe x4.
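Since bifurcation boards occasionally train a drive at a narrower link than expected, here's how I'd sanity-check the negotiated link (the bus address below is an example):

```shell
# Find the NVMe controllers:
lspci -nn | grep -i 'Non-Volatile memory'

# Then confirm each one negotiated its full link; 01:00.0 is an example
# address, use the ones printed above:
lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'
# LnkSta should match LnkCap, e.g. "Speed 8GT/s, Width x4" for a Gen3 x4 drive.
```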
 

pipe2null

New Member
May 25, 2021
NOTE: I solved Issue #2 "poor perf from inside VM". Now VM io perf is on par with host side perf. CrystalDisk running inside VM is getting 8.6GB/s read and 6.8GB/s write when qcow2 drive file is hosted on mdadm raid0 array of 3x NVMe's. Updated OP.

Poor perf with ZFS on the NVMe's is still an issue.
 

XeonLab

Member
Aug 14, 2016
My two cents:
1. What NVMe drives are you running? QLC drives with limited cache might cause issues with ZFS
2. Can you test the drives with other ZFS version, older OpenZFS or even with Solaris ZFS?
 

pipe2null

New Member
May 25, 2021
1. NVMe's, plain ol' ok-ish consumer grade M.2 SSDs:
Western Digital Blue: SN570 500GB
Western Digital Black: SN750 500GB
Samsung NVMe... Came with my Lenovo laptop, not positive exact model, but looks like a PM981a? inxi model number "MZVLB512HBJQ-000L7"

2. Testing in other systems is not really an option at the moment. My windows VM is my only workstation for non-linux apps which I need at the moment, so doing anything as significant as swapping proxmox for something else is probably not a good idea. I have a couple more servers sitting on the bench waiting to get retrofitted and added to the stack so I can play with my own cluster, but those are projects for another day.

Generally speaking, this whole thing was supposed to be an afternoon of playing with and perf-testing different NVMe drive/pool/array configurations so I could make an informed decision on a long-term direction for a primarily qcow2/LXC IO workload. Unfortunately, I've hit this ZFS-on-NVMe perf issue, which to me seems to indicate some underlying problem with my system or configuration, so I'm stuck on making a final decision. For the time being, I'm just going to use mdadm raid0 and up my backup frequency (but not too much, since I'm limited to filesystem-level backup instead of block-level).

I really want to use zfs on my nvmes for many reasons, especially the block level backup and the flexibility of tuned datasets alongside zvols, etc. But it seems like I have some type of underlying problem that needs to get addressed.


I guess the better questions I should have asked to begin with:
Comparing my perf numbers for "Ext4 single drive" (raw physical disk, just formatted), "Zpool Samsung512GB only" (zfs on a single-NVMe, single-vdev pool), and "Mdadm 3 drive raid0" (a rough estimate of max IO throughput, more or less, depending on chunking/workload/etc.) against the various 3-NVMe ZFS pool configurations:
1) Does it look like I have a problem with my zfs installation/configuration/etc?
2) Or am I chasing ghosts of unrealistic perf expectations?
 
Last edited: