ZFS NVMe performance questions


RonanR

Member
Jul 27, 2018
Hi,

I'm currently trying to get the best performance I can on FreeBSD with NVMe drives, in order to know how many servers my workflow is going to require (reading a lot of 60MB dpx files per second).
For testing purposes, I'm using 4x 1TB Samsung 990 Pro drives in a system with a Threadripper 3970X and 128GB of DDR4-3200.

All tests are done using iozone, with the block size matched to the recordsize set on the ZFS dataset, and with 5 files of 60GB each in order to exceed the ARC cache.
Here is an example of the command used, with 1M blocks:
iozone -R -l 5 -u 5 -r 1M -s 60g -F /nvme1/tmp1 /nvme1/tmp2 /nvme1/tmp3 /nvme1/tmp4 /nvme1/tmp5
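For reference, the dataset is prepared roughly like this before each run (a minimal sketch; the pool name nvme1 matches the paths above, the exact dataset layout may differ):

# match the ZFS recordsize to the iozone block size used for the run (1M here)
zfs set recordsize=1M nvme1
# confirm the property took effect before starting iozone
zfs get recordsize nvme1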

I've stumbled on a bottleneck and can't figure out the cause: the system can't go higher than 13GB/s in reads.

When tested separately, creating 4 pools of 1 drive each, every NVMe shows the same speed: around 6GB/s in reads.
When running a simultaneous test on two pools, performance doubles: I get 12GB/s in reads.
If I run the same simultaneous test on 3 pools I get almost the same result, around 13GB/s, as if I had hit a limit somewhere.
That seems to be confirmed: when running the test on all 4 pools, I still get 13GB/s.
If I look at CPU usage while running the 4 tests simultaneously, it sits around 40 percent, with peaks at 60 percent.

I also tested with 2 mirror pools and got the same results: tested separately, each reaches 12GB/s in reads; tested simultaneously, the cumulative bandwidth is limited to 13GB/s.
Same limit with a single RAID-Z1 pool of all 4 drives: it's stuck at 13GB/s.
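To watch where the ceiling appears, I keep an eye on per-pool throughput and CPU load while the tests run, along these lines (the pool name nvme1 is from my tests, adjust as needed):

# aggregate and per-vdev bandwidth, refreshed every second
zpool iostat 1
zpool iostat -v nvme1 1
# per-CPU usage and system processes on FreeBSD
top -SP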

Does anyone have any idea what's happening, or can you enlighten me on how to find what's limiting my system?
 

i386

Well-Known Member
Mar 18, 2016
Germany
How are these SSDs connected to the host? An ASUS Hyper M.2 card?
What mainboard and which PCIe slot?

~13 GByte/s sounds like an x8 PCIe 4.0 slot.
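Rough math, and a quick way to check the negotiated link on FreeBSD (the device name is just an example):

# pcie 4.0 is ~1.97 GB/s per lane after 128b/130b encoding
#   x8 gen4  = 8 x ~1.97 = ~15.8 GB/s raw, roughly 13-14 GB/s usable after protocol overhead
#   x16 gen3 lands in the same range
# check what each nvme controller actually negotiated:
pciconf -lc nvme0
# expect something like "link x4(x4) speed 16.0(16.0)" in the PCI-Express capability line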
 

RonanR

Member
Jul 27, 2018
How are these SSDs connected to the host? An ASUS Hyper M.2 card?
What mainboard and which PCIe slot?

~13 GByte/s sounds like an x8 PCIe 4.0 slot.
These SSDs are connected using the motherboard M.2 slots (ASRock TRX40 Creator) and PCIe Gen4 x16-to-single-NVMe adapters.
I tried 3 SSDs on the motherboard (it has 3 M.2 slots) and one on a PCIe adapter, or 2 on the motherboard and 2 on two PCIe adapters: same result.


btw. ZFS is NOT a performance-oriented file system.
I'm well aware that ZFS is not a performance-oriented FS, no worries. I'm simply probing the limits of the system and trying to understand them.
 

gea

Well-Known Member
Dec 31, 2010
DE
btw. ZFS is NOT a performance-oriented file system.
ZFS cannot be, for several reasons:
- checksums (more data to process)
- Copy on Write (more fragmentation, more data to process; a change of "house" to "mouse"
is not a single-byte replacement but at least a whole ZFS datablock write, min 4k)
- data is spread quite evenly over the pool to achieve constant IO performance, not the best sequential values

The most relevant tuning options are around recsize (depends on use case; low values reduce IO but affect ZFS efficiency negatively), pool layout, RAM size with ARC or write-cache settings, and ashift values. Active trim is important on flash other than Optane. The most important aspect remains raw server performance; it is not trivial to achieve several GB/s of throughput.
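As a rough illustration of those knobs (pool/dataset names and disk devices are placeholders, not a recommendation for this specific workload):

# ashift and autotrim are set on the pool
zpool create -o ashift=12 -o autotrim=on tank mirror nvd0 nvd1 mirror nvd2 nvd3
# recordsize (and atime) per dataset; a large recsize suits big sequential media files
zfs set recordsize=1M tank/media
zfs set atime=off tank/media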
 

ano

Well-Known Member
Nov 7, 2022
Do you need more than 20GiB/s of 128k writes? If so, you need something more than ZFS, although multiple volumes help.
 

Oliver Mack

New Member
Sep 25, 2014
I doubt your setup will give any usable results, or will you be using 990 Pro NVMes for production later as well?
Also, dpx files are a lossless digital format, while IOzone by default creates test data that is compressible;
not sure how well your IOzone data can mimic that...
 

RonanR

Member
Jul 27, 2018
I doubt your setup will give any usable results, or will you be using 990 Pro NVMes for production later as well?
The 990 Pro NVMes won't be used in production on this server; their only purpose here is testing without any additional hardware.
In fact I also have 12 Samsung PM1643a SAS drives, but I discovered that one of my SAS cards, a Broadcom 9500-8i, was defective, and I'm waiting for its replacement.
As I'm curious, I wanted to know how far I could push my system, and I stumbled on this strange bottleneck: for some reason, even when using multiple pools, my system is capped at 13GB/s.
FYI, I tried creating one pool with my 4 990 Pro drives and another one with my 12 Samsung PM1643a connected to an old Broadcom 9300-16i card,
and I got the same results:
Tested separately, I get 13GB/s on the NVMe pool and 6GB/s on the SAS SSD one (limited by the PCIe 3 9300-16i card).
If I launch two tests at the same time, one on each pool, the cumulative bandwidth is also capped at 13GB/s. My NVMe pool fluctuates between 6 and 9GB/s and my SAS SSD one between 4 and 6GB/s.

Also, dpx files are a lossless digital format, while IOzone by default creates test data that is compressible;
not sure how well your IOzone data can mimic that...
In my experience from previous servers, iozone bandwidth is quite accurate for estimating how many dpx streams I will be able to achieve.
FYI, all my tests are done without compression, as it's not efficient at all with dpx files and puts a huge load on the CPU for almost nothing, so it's always disabled on my production servers.
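For completeness, this is how I double-check that compression really is off on the test dataset (same illustrative nvme1 name as in my earlier example):

zfs get compression,compressratio nvme1
# expect compression = off and compressratio staying at 1.00x with dpx-like data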
 

RonanR

Member
Jul 27, 2018
Do you need more than 20GiB/s of 128k writes? If so, you need something more than ZFS, although multiple volumes help.
Have you already achieved this kind of performance? If so, can you please share your config and tuning?
 

ano

Well-Known Member
Nov 7, 2022
A Genoa 9374 CPU, all 12 RAM channels populated, 10-ish fast enterprise Gen4 or Gen5 NVMe drives, and you're there @RonanR

with just the usual stuff like lz4, correct ashift, etc.

Of course those drives, with fio directly on them, are what... 60GiB/s or so ;)
 

RonanR

Member
Jul 27, 2018
A Genoa 9374 CPU, all 12 RAM channels populated, 10-ish fast enterprise Gen4 or Gen5 NVMe drives, and you're there @RonanR

with just the usual stuff like lz4, correct ashift, etc.

Of course those drives, with fio directly on them, are what... 60GiB/s or so ;)
OK, great to know!
This means something is actually wrong with my system; I have to find out what.
Did you get that on FreeBSD?
I'm wondering whether a Threadripper Pro 5975WX could be a viable alternative to the EPYC 9374, what do you think?
 

ano

Well-Known Member
Nov 7, 2022
Linux; the correct kernel helps.

But even a 7402 with 2133 RAM will do 12GiB/s with those drives with ZFS.
 

gb00s

Well-Known Member
Jul 25, 2018
Poland
Sorry for my language, but I suspect the sh...y 990s are throttling due to temperature restrictions on the controller. It's likely because you always run the same test with the same results. Same data, same workload... so you hit the same barrier at the same time. Maybe I'm wrong.

Edit: Samsung 990 Pros are retail drives, for gamers at most. Short read/write workloads, nothing else; otherwise they throttle quickly.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
Sorry for my language, but I suspect the sh...y 990s are throttling due to temperature restrictions on the controller. It's likely because you always run the same test with the same results. Same data, same workload... so you hit the same barrier at the same time. Maybe I'm wrong.

Edit: Samsung 990 Pros are retail drives, for gamers at most. Short read/write workloads, nothing else; otherwise they throttle quickly.
Yep, the first lesson I learned using "fast" Intel NVMe with ZFS was that the consumer drives I had fell flat on their faces after x minutes, because they were consumer drives and that was their job... burst fast, then slow down. On a desktop you would rarely hit that, but on servers with VMs, once the load builds up, unless the consumer drive has aggressive garbage collection/cleanup it slows way, way, way down. I gave up trying to figure out which good consumer drives had firmware that allowed more aggressive garbage collection, and just started buying used NVMe on eBay :D
 

RonanR

Member
Jul 27, 2018
Sorry for my language, but I suspect the sh...y 990s are throttling due to temperature restrictions on the controller. It's likely because you always run the same test with the same results. Same data, same workload... so you hit the same barrier at the same time. Maybe I'm wrong.

Edit: Samsung 990 Pros are retail drives, for gamers at most. Short read/write workloads, nothing else; otherwise they throttle quickly.
FYI, my 990s didn't throttle due to temperature.
As stated, running the test on a single pool gives me 13GB/s. I can run the test on two pools, then on one, and I always get 13GB/s, so the ceiling is the same even though the tests differ. Even with my SAS SSD pool I see this bottleneck behavior, so it's not coming from the 990 Pro drives.
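For anyone who wants to verify, this is roughly how I watch drive temperature and thermal events during a run on FreeBSD (the controller name nvme0 is an example):

# SMART / health log page, includes composite temperature and thermal throttle counters
nvmecontrol logpage -p 2 nvme0
# alternatively, with smartmontools installed
smartctl -a /dev/nvme0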
 

RonanR

Member
Jul 27, 2018
Yep, the first lesson I learned using "fast" Intel NVMe with ZFS was that the consumer drives I had fell flat on their faces after x minutes, because they were consumer drives and that was their job... burst fast, then slow down. On a desktop you would rarely hit that, but on servers with VMs, once the load builds up, unless the consumer drive has aggressive garbage collection/cleanup it slows way, way, way down. I gave up trying to figure out which good consumer drives had firmware that allowed more aggressive garbage collection, and just started buying used NVMe on eBay :D
I don't know which consumer drives you tested, but I can assure you, after hours of testing, that the 990 Pros don't fall flat at all. I ran a test of more than 1 hour with TBs of data and never saw a single drop in performance on these drives.
If you're referring to write performance, I already know I get around 1.4GB/s once outside their internal cache, so that's no surprise.
I tested each drive alone and got a steady ~6GB/s read and ~1.4GB/s write across the whole drive.
In a mirror pool, I got ~12GB/s read and ~2.3GB/s write.
In case you're wondering, these results are with compression off.
 

RonanR

Member
Jul 27, 2018
Linux; the correct kernel helps.

But even a 7402 with 2133 RAM will do 12GiB/s with those drives with ZFS.
Thanks a lot! Which Linux distribution do you recommend? I'm mainly using OmniOS and FreeBSD for ZFS, and I don't have any experience with ZFS on Linux.
 

gb00s

Well-Known Member
Jul 25, 2018
Poland
So if you have 4x 990 Pro you should reach ~27-29GB/s. Being limited to 13GB/s and, as you stated, with the NVMes not throttling at all, it sounds like you're only getting 50% of the PCIe Gen4 bandwidth you were expecting. So I suspect your hardware is hiding some secrets (e.g. your Hyper M.2 slots on the board), which can switch from Gen4 to Gen3 under some circumstances. And 50% of the bandwidth you expect from Gen4 looks pretty much like Gen3.
These SSDs are connected using the motherboard M.2 slots (ASRock TRX40 Creator) and PCIe Gen4 x16-to-single-NVMe adapters.
I tried 3 SSDs on the motherboard (it has 3 M.2 slots) and one on a PCIe adapter, or 2 on the motherboard and 2 on two PCIe adapters: same result.
Did you try without involving your Hyper M.2 ports at all? The board manual does not show any meaningful PCIe lane diagram, but it does show an important LED diagram.
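Back-of-the-envelope numbers behind that suspicion (rated figures, rough only):

# gen4 x4 per drive: ~7 GB/s read ceiling   -> 4 drives = ~27-29 GB/s expected
# gen3 x4 per drive: ~3.5 GB/s read ceiling -> 4 drives = ~14 GB/s
# an observed cap of ~13 GB/s sits right at the gen3 figure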
 