ZFS NVMe performances questions

RonanR · Jul 21, 2023

Hi,

I'm currently trying to get the best performances I can on FreeBSD with NVMe drives, in order to know how many servers my workflow is going to require (reading a lot of 60MB dpx files per seconds).
For testing purpose, I'm using 4x 1Tb Samsung 990 Pro on a system equipped with a threadripper 3970x and 128GB of DDR4 3200.

All tests are done using iozone, with various block size set according to the recordsize set on the zfs volume, and with 5 files of 60GB each, in order to fill the arc cache.
Here is an example of the command used, with 1M blocks :
iozone -R -l 5 -u 5 -r 1M -s 60g -F /nvme1/tmp1 /nvme1/tmp2 /nvme1/tmp3 /nvme1/tmp4 /nvme1/tmp5

I've stumbled on a bottleneck, and I can't figure out what's the cause. Indeed, the system can't go higher than 13GB/s in read.

When tested separately, creating 4 pools of 1 drive, each NVMe has the same speed : around 6GB/s in read.
When doing a simultaneous test on two pools, performances are doubled, I've got 12GB/s in read.
If I do the same simultaneous test on 3 pools I've got almost the same results, around 13GB/s, as if I reached a limit somewhere.
It seems confirmed as when doing the test on my 4 pools, I still got 13GB/s.
If I look at my CPU loading when doing the 4 tests simultaneously, it's used around 40 percent, with peak at 60 percent.

I also tested with 2 mirror pools and got the same results: tested separately, each got 12GB/s in read. Tested simultaneously, the cumulative bandwidth is limited to 13GB/s.
Same limit when doing a single Z1 pool with my 4 drives, it's stuck at 13GB/s.

Does anyone have any idea what's happening, or can enlighten me on how I can find what's limiting my system ?

i386 · Jul 21, 2023

how are these ssds connected to the host? asus hyper m.2 card?
what mainboard and which pcie slot?

~13 GByte/s sounds like an x8 pcie 4.0 slot

dandanio · Jul 21, 2023

btw. ZFS is NOT a performance oriented file system.

RonanR · Jul 24, 2023

i386 said:
how are these ssds connected to the host? asus hyper m.2 card?
what mainboard and which pcie slot?

~13 GByte/s sounds like an x8 pcie 4.0 slot

These SSDs are connected using the motherboard M2 slots (Asrock TRX40 Creator) and PCIe gen4 x16 to single nvme adapters.
I tried 3 SSDs on the motherboards, as it has 3 slots, and one on PCIe, or 2 on the motherboard and 2 on 2 PCIe slots, same thing.

dandanio said:
btw. ZFS is NOT a performance oriented file system.

About the fact ZFS is not a performance oriented FS, I'm well aware, no worries. I'm simply searching the limits of the system and try to understand them.

gea · Jul 25, 2023

dandanio said:
btw. ZFS is NOT a performance oriented file system.

ZFS cannot be, reasons
- checksums (more data to process)
- Copy on Write (more fragmentation, more data to process, a change of "house" to "mouse"
is not a single byte replacement but at least a whole ZFS datablock write, min 4k)
- data spread quite even over a pool to achieve constant io performce not best sequential values

Most relevant tuning options are around recsize (depend on use case, low values reduce io but affect ZFS efficiency negatively), pool layout, ram size with arc or writecache settings or ashift values. Active trim is important on Flash beside Optane. Most important aspect remains raw server performance. It is not trivial to achieve several GB/s throughput.

ano · Jul 25, 2023

do you need more than 20GiBs of 128k writes? if so you need something more than ZFS, allthough multiple volumes helps

Oliver Mack · Jul 25, 2023

I doubt your setup will give any usable results, or will you be using 990Pro NVMEs for production later as well?
And dpx files are a lossless digital format and IOzone creates per default test data that is compressible,
not sure how well your IOzone data can mimic that...

ano · Jul 25, 2023

oh repled in wrong thread, 990s are consumer

RonanR · Aug 3, 2023

Oliver Mack said:
I doubt your setup will give any usable results, or will you be using 990Pro NVMEs for production later as well?

990Pro Nvme won't be uses in production on this server, they're only purpose here is for testing without any additional hardware.
In fact I also got 12 Samsung PM1643a in SAS, but I discovered one of my SAS card, a Broadcom 9500-8i was defective and I'm waiting for its replacement.
As I'm curious, I wanted to know how far I can push my system, and I stumbled on this strange bottleneck : for some reason, even when using multiple pools, my system is blocked at 13GB/s.
FYI I tried to create one pool with my 4 990Pro drives and another one with my 12 Samsung PM1643a connected to an old Broadcom 9300-16i card
and I got the same results :
Tested separately, I got 13GB/s on the NVMe pool and 6GB/s on the SSD SAS one (limited by the PCIe 3 9300-16i card).
If I launch two test at the same time, one on each pool, the cumulative bandwidth is also capped at 13GB/s. My NVMe pool fluctuate between 6 and 9 GB/S and my SSD one between 4 and 6 GB/s

Oliver Mack said:
And dpx files are a lossless digital format and IOzone creates per default test data that is compressible,
not sure how well your IOzone data can mimic that...

In my experience from previous servers, iozone bandwidth is quite accurate for calculating how many dpx streams I will be able to achieve.
FYI, all my tests are done without compression, as it's not efficient at all with dpx files and puts a huge load on the CPU for almost nothing, so it's always disabled on my production servers.

RonanR · Aug 3, 2023

ano said:
do you need more than 20GiBs of 128k writes? if so you need something more than ZFS, allthough multiple volumes helps

Did you already achieve this type of performances ? If yes, can you please share your config and tuning ?

ano · Aug 3, 2023

genoa 9374 cpu, all 12 ram channels populated, 10'ish fast enterprise gen4 or gen5 nvme, and your there @RonanR

with just usual stuff like lz4, correct ashift etc

of course thoose drives, with fio directly on them are what.. 60GiBs or so

RonanR · Aug 3, 2023

ano said:
genoa 9374 cpu, all 12 ram channels populated, 10'ish fast enterprise gen4 or gen5 nvme, and your there @RonanR

with just usual stuff like lz4, correct ashift etc

of course thoose drives, with fio directly on them are what.. 60GiBs or so

Ok, great to know !
This means something is actually wrong on my system, I have to find out what.
You got that on FreeBSD ?
I'm wondering if using a Threadripper pro 5975WX can be a viable alternative to the Epyc 9374, what do you think ?

ano · Aug 3, 2023

linux, correct kernel helps

but even a 7402 with 2133 ram will do 12GiBs with thoose drives with zfs

gb00s · Aug 4, 2023

Sorry for my language but I suspect the sh...y 990s are throttling due to temp restrictions of the controller. It's likely because you always do the same test with the same results. Same data, same workload ... So same time you reach the same barrier. Maybe I'm wrong.

Edit: Samsung 990 Pro's are retail drives and for gamers at max. Short read/write workloads, nothing else. Otherwise they throttle quickly.

T_Minus · Aug 4, 2023

gb00s said:
Sorry for my language but I suspect the sh...y 990s are throttling due to temp restrictions of the controller. It's likely because you always do the same test with the same results. Same data, same workload ... So same time you reach the same barrier. Maybe I'm wrong.

Edit: Samsung 990 Pro's are retail drives and for gamers at max. Short read/write workloads, nothing else. Otherwise they throttle quickly.

Yep, first lesson I learned using "fast" intel NVME in ZFS was the consumer drives I had fell flat on their face after xMinutes because they were consumer, and that was their job... burst fast, then slow. On desktop you would rarely hit that, but on servers with VMs once it builds up, unless the consumer drive had aggressive garbage\cleanup it would slow slow slow wayy wayy wayy down. I gave up trying to figure out which good consumer drives had firmware that allowed more aggressive garbage\cleanups, and just started buying used NVME on ebay

RonanR · Aug 7, 2023

gb00s said:
Sorry for my language but I suspect the sh...y 990s are throttling due to temp restrictions of the controller. It's likely because you always do the same test with the same results. Same data, same workload ... So same time you reach the same barrier. Maybe I'm wrong.

Edit: Samsung 990 Pro's are retail drives and for gamers at max. Short read/write workloads, nothing else. Otherwise they throttle quickly.

FYI, my 990 didn't throttle due to temp.
As stated, doing test on a single pool gives me 13GB/s. I can do the test on two pools then on one, and I always got 13GB/s. So I got different behavior when doing do the same test. Even with my SAS SSDs pool I got this bottleneck behavior, so it's not coming from the 990 Pro drives.

RonanR · Aug 7, 2023

T_Minus said:
Yep, first lesson I learned using "fast" intel NVME in ZFS was the consumer drives I had fell flat on their face after xMinutes because they were consumer, and that was their job... burst fast, then slow. On desktop you would rarely hit that, but on servers with VMs once it builds up, unless the consumer drive had aggressive garbage\cleanup it would slow slow slow wayy wayy wayy down. I gave up trying to figure out which good consumer drives had firmware that allowed more aggressive garbage\cleanups, and just started buying used NVME on ebay

Don't know with which consumer drives you tested, but I can assure you after hours of testing, the 990 pro don't fell flat at all. I did a more than 1 hour test with TBs of data and never got a single drop of performance on these drives.
If you're referring to write performances, I already know that I got around 1,4GB/s outside their internal cache, so it's not a surprise.
I tested each drive alone, and got a steady ~6GB/s ready and ~1,4GB/s write on the whole drive.
In a mirror pool, I got ~12GB/s read and ~2,3GB/s write.
In case you're wondering, I got these results with compression off.

RonanR · Aug 7, 2023

ano said:
linux, correct kernel helps

but even a 7402 with 2133 ram will do 12GiBs with thoose drives with zfs

Thanks a lot ! Which Linux distribution do you recommend ? I'm mainly using OmniOS and FreeBSD for ZFS, and don't have any experience with Linux kernel for ZFS ?

gb00s · Aug 7, 2023

So if you have 4x 990 Pro you should reach ~27/29GB/s. Being limited by 13GB/s and, as you stated the NVMes do not throttle at all, then this sounds like you get 50% of the PCIe Gen4 bandwidth you were expecting only. So I suspect your hardware is hiding some secrets (e.g. your Hyper M.2 slots on the board) which tend to switch from Gen4 to Gen3 under some circumstances. And 50% bandwidth of what you expect from Gen4 pretty much looks like Gen3.

RonanR said:
These SSDs are connected using the motherboard M2 slots (Asrock TRX40 Creator) and PCIe gen4 x16 to single nvme adapters.
I tried 3 SSDs on the motherboards, as it has 3 slots, and one on PCIe, or 2 on the motherboard and 2 on 2 PCIe slots, same thing.

Did you try without involving your Hyper M.2 ports at all? The board manual does not show any important PCIe diagram, but an important LED diagram.

ano · Aug 7, 2023

some zfs limits'ish are around that number

ZFS NVMe performances questions

Member

Well-Known Member

Active Member

Member

Well-Known Member

Well-Known Member

New Member

Well-Known Member

Member

Member

Well-Known Member

Member

Well-Known Member

Well-Known Member

Build. Break. Fix. Repeat

Member

Member

Member

Well-Known Member

Well-Known Member