TrueNAS Scale NVME Performance Testing


nickf1227

Over the past year or so I have been obsessively exploring various aspects of ZFS performance, from large SATA arrays with multiple HBA cards to NVMe performance. In my previous testing I was leveraging cast-off enterprise servers built on Westmere, Sandy Bridge, and Ivy Bridge platforms. There were some interesting performance variations between those platforms, and I was determined to see what a more modern platform would do. Most of my testing indicated that ZFS was being bottlenecked by the platform it was running on, with high CPU usage present during testing.

I recently picked up an AMD EPYC 7282, a Supermicro H12SSL-I, and 256GB of DDR4-2133 RAM. While the RAM is certainly not the fastest, I now have a lot of PCIe lanes to play with, and I don't have to worry as much about which slot goes to which CPU.

For today's adventure, I tested 4 and 8 Samsung 9A1 512GB SSDs (PCIe Gen4) in two PLX PEX8747-based (PCIe Gen3) Linkreal Quad M.2 adapters, as well as two bifurcation-based Linkreal Quad M.2 adapters, on the new platform. My goal was to determine the performance differences between relying on motherboard bifurcation and relying on a PLX switch chip. I also wanted to test the performance impact of compression and deduplication on NVMe drives in both configurations. Testing was done using fio with a mixed random read/write workload:
fio --bs=128k --direct=1 --directory=/mnt/newprod/ --gtod_reduce=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=randrw --numjobs=16 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
I hope this helps some folks. :)
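For anyone who wants to rerun this, the same workload can also be written as a fio job file. This is just the command line above restated with comments, nothing extra (double-check it against the one-liner before relying on it):

# randrw.fio -- same workload as the command line above
[global]
directory=/mnt/newprod/   # dataset under test
bs=128k                   # 128 KiB blocks
rw=randrw                 # mixed random read/write (50/50 split by default)
ioengine=posixaio
iodepth=32
numjobs=16                # 16 parallel jobs
size=256M                 # 256 MiB file per job
direct=1
gtod_reduce=1
ramp_time=10              # ignore the first 10 seconds of the run
runtime=60
time_based
group_reporting

[randrw]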

The first set of tests was done on a single card, with (4) 9A1s in a 2-vdev mirrored configuration, repeated on each of the different cards.

Test Setup | Read BW (MiB/s) min / max / avg / std dev | Read IOPS min / max / avg / std dev | Write BW (MiB/s) min / max / avg / std dev | Write IOPS min / max / avg / std dev
Bifurcation 4x9A1 2xMirrors with Dedupe and Compression | 142 / 7679 / 1163 / 103.88 | 1140 / 61437 / 9305 / 830.99 | 176 / 7642 / 1160 / 103.45 | 1414 / 61139 / 9282 / 827.56
PLX 4x9A1 2xMirrors with Dedupe and Compression | 87 / 10065 / 1110 / 116.32 | 698 / 80264 / 8885 / 930.54 | 116 / 10064 / 1109 / 116.32 | 928 / 80264 / 8874 / 930.65
Bifurcation 4x9A1 2xMirrors with Compression No Dedupe | 952 / 1967 / 1334 / 13.67 | 621 / 15739 / 10655 / 95.55 | 1043 / 1931 / 1332 / 11.95 | 8346 / 15452 / 10655 / 95.55
PLX 4x9A1 2xMirrors with Compression No Dedupe | 693 / 2031 / 1114 / 14.38 | 5548 / 16252 / 8918 / 115.05 | 777 / 2033 / 1112 / 13.29 | 6216 / 16264 / 8898 / 106.33
Bifurcation 4x9A1 2xMirrors No Compression No Dedupe | 835 / 2471 / 1578 / 21.13 | 6686 / 19770 / ? / 168.97 | 857 / 2387 / 1579 / 20.02 | 6856 / 19098 / 12632 / 160.15
PLX 4x9A1 2xMirrors No Compression No Dedupe | 692 / 1654 / 1091 / 13.01 | 5542 / 13232 / 8734 / 104.04 | 764 / 1574 / 1089 / 11.58 | 6114 / 12598 / 8716 / 92.66
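For context, the three configurations compared in each table differ only in two dataset properties. Assuming they were toggled on the pool root that fio writes to (newprod, per the --directory above; exactly where they were set is my assumption), the commands look roughly like this:

# Dedupe and compression on (zstd in this case)
zfs set compression=zstd dedup=on newprod

# Compression only
zfs set compression=zstd dedup=off newprod

# Neither
zfs set compression=off dedup=off newprod

# Confirm what is active; note that property changes only apply to newly written data
zfs get compression,dedup newprod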


The second set of tests was done with two matching cards, with (8) 9A1s in a 4-vdev mirrored configuration. The mirrors span between the cards, so if one entire card were to fail, the pool would remain intact.
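To make that layout concrete, here is a rough sketch of how such a pool could be created, reusing the newprod name from the fio command and assuming hypothetical device names where nvme0 through nvme3 sit on the first card and nvme4 through nvme7 on the second (the actual device names weren't posted):

# Each mirror vdev pairs one drive from card 1 with one drive from card 2,
# so losing an entire card degrades every vdev but the pool stays online with no data loss.
zpool create newprod \
  mirror /dev/nvme0n1 /dev/nvme4n1 \
  mirror /dev/nvme1n1 /dev/nvme5n1 \
  mirror /dev/nvme2n1 /dev/nvme6n1 \
  mirror /dev/nvme3n1 /dev/nvme7n1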

Test Setup | Read BW (MiB/s) min / max / avg / std dev | Read IOPS min / max / avg / std dev | Write BW (MiB/s) min / max / avg / std dev | Write IOPS min / max / avg / std dev
Bifurcation 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 131 / 7641 / 1207 / 106.91 | 1055 / 61131 / 9658 / 855.27 | 171 / 7661 / 1204 / 106.68 | 1372 / 61294 / 9636 / 853.4
PLX 8x9A1 4xMirrors Cross Cards with Dedupe and Compression | 285 / 6266 / 1273 / 89.59 | 2910 / 50290 / 10169 / 713.47 | 363 / 6286 / 1271 / 89.19 | 2910 / 50290 / 10169 / 89.19
Bifurcation 8x9A1 4xMirrors Cross Cards with Compression no Dedupe | 1063 / 2092 / 1496 / 14.98 | 506 / 16743 / 11968 / 108.81 | 1187 / 1979 / 1494 / 13.69 | 500 / 15834 / 11959 / 108.81
PLX 8x9A1 4xMirrors Cross Cards with Compression no Dedupe | 1074 / 2152 / 1519 / 14.88 | 594 / 17217 / 12155 / 118.4 | 1241 / 2009 / 1518 / 13.29 | 9930 / 16075 / 12147 / 106.31
Bifurcation 8x9A1 4xMirrors Cross Cards no Compression no Dedupe | 1664 / 3476 / 2412 / 22.98 | 13316 / 27809 / 19298 / 183.85 | 1741 / 3384 / 2415 / 172.6 | 13926 / 27077 / 19323 / 172.6
PLX 8x9A1 4xMirrors Cross Cards no Compression no Dedupe | 2010 / 3718 / 2811 / 23.32 | 16082 / 29747 / 22490 / 186.51 | 2073 / 3594 / 2815 / 21.73 | 16588 / 28758 / 22524 / 173.81

Some bar graphs: [four image attachments charting the results above]
Some interesting conclusions to be drawn :) The narrower 4-disk pools seem to perform better with the bifurcation-based solution, which is likely because these are PCIe Gen 4 drives and the PEX8747 is only PCIe Gen 3. However, as the pool gets wider, the overhead of relying on the mainboard to do the switching seems to grow, and the PLX chip solution seems to deliver better performance.
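If anyone wants to verify the link-speed side of that theory on their own setup, the negotiated link of each SSD can be checked with standard Linux tools. A rough sketch (the 41:00.0 address is just a placeholder for one of the SSDs):

# Show the PCIe topology; SSDs on the PLX cards appear behind an extra switch level
lspci -tv

# Compare advertised vs negotiated link for one SSD (replace 41:00.0 with a real address).
# LnkSta at 8GT/s = Gen3 (behind the PEX8747), 16GT/s = Gen4 (direct bifurcation).
sudo lspci -vv -s 41:00.0 | grep -E 'LnkCap:|LnkSta:'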
 

Bjorn Smith

Interesting findings - what I find strange though is that your read bandwidth is higher when you turn off compression - usually it's much faster to decompress if you have a reasonable CPU - what kind of compression algo did you use?
 

nickf1227

Bjorn Smith said: Interesting findings - what I find strange though is that your read bandwidth is higher when you turn off compression - usually it's much faster to decompress if you have a reasonable CPU - what kind of compression algo did you use?
ZSTD in its "default" setting (5?)
 

Bjorn Smith

nickf1227 said: ZSTD in its "default" setting (5?)
okay, you should try lz4, since that's the default that most would run with - and it should be faster than zstd - only highly compressible data benefits from zstd, and it takes a much higher toll on the CPU than lz4, so my guess is that with lz4 you should see increased read/write speeds.

Obviously if you had an archive of mostly text files that is a real archive (mostly written), zstd would probably be the way to go for compression, but for a "real" dataset of data that is written and read a lot - I am certain lz4 is the right compression algo.

But again, that's the beauty of ZFS, you can set it per dataset :)
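For anyone following along at home, the per-dataset part looks roughly like this; the dataset names are made-up examples, not from the system above:

# Compression can differ per dataset under the same pool
zfs set compression=lz4 newprod/vms        # busy data that is read and written a lot
zfs set compression=zstd newprod/archive   # mostly-written, highly compressible data

# After writing some data, compare the space savings
zfs get compression,compressratio newprod/vms newprod/archive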