TrueNAS Scale NVME Performance Testing


nickf1227

Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards to NVMe performance. In my previous testing I was leveraging cast-off enterprise servers built on Westmere, Sandy Bridge, and Ivy Bridge platforms. There were some interesting performance variations between those platforms, and I was determined to see what a more modern platform would do. Most of my testing indicated that ZFS was being bottlenecked by the platform it was running on, with high CPU usage present during testing.

I recently picked up an AMD EPYC 7282, a Supermicro H12SSL-I, and 256GB of DDR4-2133 RAM. While the RAM is certainly not the fastest, I now have a lot of PCIe lanes to play with, and I don't have to worry as much about which slot goes to which CPU.

For today's adventure, I tested 4 and 8 Samsung 9A1 512GB SSDs (PCIe Gen4) in two PLX PEX8747-based (PCIe Gen3) Linkreal Quad M.2 adapters, as well as two bifurcation-based Linkreal Quad M.2 adapters, on the new platform. My goal was to determine the performance differences between relying on motherboard bifurcation and relying on a PLX switch chip. I also wanted to test the performance impact of compression and deduplication on NVMe drives in both configurations. Testing was done using fio with a mixed random read/write workload:
fio --bs=128k --direct=1 --directory=/mnt/newprod/ --gtod_reduce=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=randrw --numjobs=16 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
I hope this helps some folks. :)
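For anyone who wants to rerun this, the same workload can also be written as a fio job file. This is just the command line above restated with comments, nothing extra (double-check it against the one-liner before relying on it):

# randrw.fio -- same workload as the command line above
[global]
directory=/mnt/newprod/   # dataset under test
bs=128k                   # 128 KiB blocks
rw=randrw                 # mixed random read/write (50/50 split by default)
ioengine=posixaio
iodepth=32
numjobs=16                # 16 parallel jobs
size=256M                 # 256 MiB file per job
direct=1
gtod_reduce=1
ramp_time=10              # ignore the first 10 seconds of the run
runtime=60
time_based
group_reporting

[randrw]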

The first set of tests was done on a single card, with (4) 9A1s in a 2-vdev mirrored configuration, repeated on each of the different cards.

Test Setup | Read BW (MiB/s) min / max / avg / std dev | Read IOPS min / max / avg / std dev | Write BW (MiB/s) min / max / avg / std dev | Write IOPS min / max / avg / std dev
Bifurcation 4x9A1 2xMirrors with Dedupe and Compression | 142 / 7679 / 1163 / 103.88 | 1140 / 61437 / 9305 / 830.99 | 176 / 7642 / 1160 / 103.45 | 1414 / 61139 / 9282 / 827.56
PLX 4x9A1 2xMirrors with Dedupe and Compression | 87 / 10065 / 1110 / 116.32 | 698 / 80264 / 8885 / 930.54 | 116 / 10064 / 1109 / 116.32 | 928 / 80264 / 8874 / 930.65
Bifurcation 4x9A1 2xMirrors with Compression No Dedupe | 952 / 1967 / 1334 / 13.67 | 621 / 15739 / 10655 / 95.55 | 1043 / 1931 / 1332 / 11.95 | 8346 / 15452 / 10655 / 95.55
PLX 4x9A1 2xMirrors with Compression No Dedupe | 693 / 2031 / 1114 / 14.38 | 5548 / 16252 / 8918 / 115.05 | 777 / 2033 / 1112 / 13.29 | 6216 / 16264 / 8898 / 106.33
Bifurcation 4x9A1 2xMirrors No Compression No Dedupe | 835 / 2471 / 1578 / 21.13 | 6686 / 19770 / ? / 168.97 | 857 / 2387 / 1579 / 20.02 | 6856 / 19098 / 12632 / 160.15
PLX 4x9A1 2xMirrors No Compression No Dedupe | 692 / 1654 / 1091 / 13.01 | 5542 / 13232 / 8734 / 104.04 | 764 / 1574 / 1089 / 11.58 | 6114 / 12598 / 8716 / 92.66
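For context, the three configurations compared in each table differ only in two dataset properties. Assuming they were toggled on the pool root that fio writes to (newprod, per the --directory above; exactly where they were set is my assumption), the commands look roughly like this:

# Dedupe and compression on (zstd in this case)
zfs set compression=zstd dedup=on newprod

# Compression only
zfs set compression=zstd dedup=off newprod

# Neither
zfs set compression=off dedup=off newprod

# Confirm what is active; note that property changes only apply to newly written data
zfs get compression,dedup newprod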


The second set of tests was done with two matching cards, with (8) 9A1s in a 4-vdev mirrored configuration. The mirrors span between the cards, so if one entire card were to fail, the pool would remain intact.
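To make that layout concrete, here is a rough sketch of how such a pool could be created, reusing the newprod name from the fio command and assuming hypothetical device names where nvme0 through nvme3 sit on the first card and nvme4 through nvme7 on the second (the actual device names weren't posted):

# Each mirror vdev pairs one drive from card 1 with one drive from card 2,
# so losing an entire card degrades every vdev but the pool stays online with no data loss.
zpool create newprod \
  mirror /dev/nvme0n1 /dev/nvme4n1 \
  mirror /dev/nvme1n1 /dev/nvme5n1 \
  mirror /dev/nvme2n1 /dev/nvme6n1 \
  mirror /dev/nvme3n1 /dev/nvme7n1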

Test Setup | Read BW (MiB/s) min / max / avg / std dev | Read IOPS min / max / avg / std dev | Write BW (MiB/s) min / max / avg / std dev | Write IOPS min / max / avg / std dev
Bifurcation 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 131 / 7641 / 1207 / 106.91 | 1055 / 61131 / 9658 / 855.27 | 171 / 7661 / 1204 / 106.68 | 1372 / 61294 / 9636 / 853.4
PLX 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 285 / 6266 / 1273 / 89.59 | 2910 / 50290 / 10169 / 713.47 | 363 / 6286 / 1271 / 89.19 | 2910 / 50290 / 10169 / 89.19
Bifurcation 8x9A1 4xMirrors Cross Cards with Compression no Dedupe | 1063 / 2092 / 1496 / 14.98 | 506 / 16743 / 11968 / 108.81 | 1187 / 1979 / 1494 / 13.69 | 500 / 15834 / 11959 / 108.81
PLX 8x9A1 4xMirrors Cross Cards with Compression no Dedupe | 1074 / 2152 / 1519 / 14.88 | 594 / 17217 / 12155 / 118.4 | 1241 / 2009 / 1518 / 13.29 | 9930 / 16075 / 12147 / 106.31
Bifurcation 8x9A1 4xMirrors Cross Cards no Compression no Dedupe | 1664 / 3476 / 2412 / 22.98 | 13316 / 27809 / 19298 / 183.85 | 1741 / 3384 / 2415 / 172.6 | 13926 / 27077 / 19323 / 172.6
PLX 8x9A1 4xMirrors Cross Cards no Compression no Dedupe | 2010 / 3718 / 2811 / 23.32 | 16082 / 29747 / 22490 / 186.51 | 2073 / 3594 / 2815 / 21.73 | 16588 / 28758 / 22524 / 173.81

Some bar graphs: [four image attachments charting the results above]
Some interesting conclusions to be drawn :) The narrower 4-disk pools seem to perform better with the bifurcation-based solution, which is likely because these are PCIe Gen 4 drives and the PEX8747 is only PCIe Gen 3. However, as the pool gets wider, the overhead of relying on the mainboard to do the switching seems to grow, and the PLX chip solution seems to deliver better performance.
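If anyone wants to verify the link-speed side of that theory on their own setup, the negotiated link of each SSD can be checked with standard Linux tools. A rough sketch (the 41:00.0 address is just a placeholder for one of the SSDs):

# Show the PCIe topology; SSDs on the PLX cards appear behind an extra switch level
lspci -tv

# Compare advertised vs negotiated link for one SSD (replace 41:00.0 with a real address).
# LnkSta at 8GT/s = Gen3 (behind the PEX8747), 16GT/s = Gen4 (direct bifurcation).
sudo lspci -vv -s 41:00.0 | grep -E 'LnkCap:|LnkSta:'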
 

Bjorn Smith

Interesting findings - what I find strange though is that your read bandwidth is higher when you turn off compression - usually it's much faster to decompress if you have a reasonable CPU - what kind of compression algo did you use?
 

nickf1227

Bjorn Smith said: Interesting findings - what I find strange though is that your read bandwidth is higher when you turn off compression - usually it's much faster to decompress if you have a reasonable CPU - what kind of compression algo did you use?
ZSTD in its "default" setting (5?)
 

Bjorn Smith

nickf1227 said: ZSTD in its "default" setting (5?)
okay, you should try lz4, since that's the default that most would run with - and it should be faster than zstd - only highly compressible data benefits from zstd, and it takes a much higher toll on the CPU than lz4, so my guess is that with lz4 you should see increased read/write speeds.

Obviously if you had an archive of mostly text files that is a real archive (mostly written), zstd would probably be the way to go for compression, but for a "real" dataset of data that is written and read a lot - I am certain lz4 is the right compression algo.

But again, that's the beauty of ZFS, you can set it per dataset :)
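For anyone following along at home, the per-dataset part looks roughly like this; the dataset names are made-up examples, not from the system above:

# Compression can differ per dataset under the same pool
zfs set compression=lz4 newprod/vms        # busy data that is read and written a lot
zfs set compression=zstd newprod/archive   # mostly-written, highly compressible data

# After writing some data, compare the space savings
zfs get compression,compressratio newprod/vms newprod/archive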