What kind of R/W perf can I hope for with 4x NVMe drives over SFP28 using TrueNAS..?

TrumanHW · Feb 2, 2021

I was considering using a Dell T630 for an all NVMe array of ...
...4x U.2 4TB (PM983) drives I already have (initially) and picking up another ~4x U.2 NVMe drives to make it an 8x NVMe array.

Is RAIDz1 acceptable with a relatively small NVMe array..?
(given that the limit with spinning drives is the time to rebuild which is substantially shortened with NVMe)

I'm assuming I could schedule backups of the differential changes via rsync (as TN is journaled) to a spinning array that's RAIDz2
... which would make the risk nominal for my purposes...

For simplicity... what kind of R/W performance should I expect for a single transfer of files that are all
• 100MB or larger..?
• 4KB (iops)..?

Would an Optane or NVDIMM still be of benefit...?
Would this likely saturate an SFP28 connection in most cases..?
Could it likely saturate a larger pipe..? (or would 8 NVMe drives) or is this an issue of ZFS + TrueNAS limitations for now..?

Thanks

Rand__ · Feb 2, 2021

It depends ...

Basically it totally depends on the amount of parallel processes you are throwing at it.
With a single thread (J1/QD1) performance is basically limited to a single disk device; in that case even a raidz array is okish.
I have a lengthy discussion in the TNC forum you might have seen.

Now on the other hand if you can scale up then o/c the limit will be the number of cores/speed of your cpu; basically you can assume 1 core per write device.

Can you benefit from optane - of course
Can you benefit from NVDimm (-N) - hell yeah. Although maybe not on a T630, not sure if that runs NVDimm-N. But you could with the correct HW

[Note if that was not clear - I am a NVDimm-N user and enthusiast :) ]

Can it saturate 25G - Not unless you have tons of parallel reads and enough CPU
Can it saturate 100G - Potentially with the right amount of jobs and the correct HW...

I managed to get to 50G in a very special test using lots of SAS3 SSDs, high Jobs/QD and Chelsio 100G nics with a windows client (->iWarp). You might be able to tweak that a bit, but it was a pretty beefy box. Maybe more/better drives could push it.

TrumanHW · Feb 4, 2021

I do not have many users ...

I'm getting an R630
2x E5-2603v3
32GB DDR4 ECC
HBA 330 (IT mode)

4x PM983 3.84TB NVMe U.2 Drives
Chelsio SFP28 ....

Rand__ · Feb 4, 2021

Well these are certainly odd choices to be honest...

In the end you will not fully utilize 4 nvme drives let alone 8 and certainly will not fill up a 25G NIC.

TrumanHW · Feb 8, 2021

Rand__ said:
Well these are certainly odd choices to be honest...
In the end you will not fully utilize 4 nvme drives let alone 8 and certainly will not fill up a 25G NIC.

I'm not sure if the basis of your assessment is that I didn't mention the HBA (which I actually hadn't yet updated with the advice you provided in another thread I had posted) but does it make a difference that I now intend to add the SuperMicro ReTimer HBA...?

If not the HBA, would you please provide more info as to what's limiting the saturation of SFP28 if more NVMe's aren't the answer?

My earlier statement about an R630 was also wrong, and I've thus listed some systems I'm leaning towards.
However, it's less important what [I think I want] ... than what you or another expert would suggest... (within the parameters of a budget I can afford.)
(If you'll allow me to lean on your knowledge)

My current system + CPU interests aren't rigid (except the approx. budgets)

• T630/R730/R730xd 2p E5-2600 v3 or v4
• T640 2p Bronze or Silver
-- Configuration --
• 4x - 8x PM983 3.84TB U.2 NVMe Drives
• 32GB - 64GB DDR4 ECC
• Chelsio SFP28
• SuperMicro AOC-SLG3 4E4T (ReTimer)

Rand__ · Feb 8, 2021

Let me put it this way,
I am running a 5122 Gold with 4 PM1725a's (not the fastest out there, but not too bad) (2x2 Mirror); I've been running the same CPU with 12 SAS4 HGST SS300s (6x2 Mirror);
Now the CPU is limiting my scale up factor o/c (4 cores/8T), so I cant throw 20 NVME in there (even if I wanted, not enough CPU; if I could it wouldnt help my use case (max performance for very few VMs))
I am running this on a ESXi -> NFS4 share with 2 network connects (56G) with an NVDimm-N backed slog on a 512G memory box and the best I can do (with a bunch of VMs being moved over at the same time) is around 20 to 25G.

Now with what you had presented previously (12 c/24t, low frequency) you will not be able to fully saturate a NVME per core which means you need to scale up parallel processes to multiply total speed which (unless I missed it) is not part of your planned use case.
Also you are planning to run those on Raid Z which gives you write performance of a single device (o/c alleviated due to the fact that a single process will not fully utilize the NVME device's capabilities, so you got headroom), which means that scaling with multiple processes will not work as well as it would with multiple drives in a mirror configuration.

At this point I would suggest to take a step back, explain what you want to do (as precisely as you can as in use cases, amount of users, read/write amount, file types, databases or vms or smb/nfs/afp shares, whatever).
Then state your budget and secondary criteria (minimum amount of space, minimum, desired performance, existing hardware, previous recommendations) and we try to help you find the answers you're looking for.

TrumanHW · May 22, 2023

Rand__ said:
Let me put it this way,
I am running a 5122 Gold with 4 PM1725a's (not the fastest out there, but not too bad) (2x2 Mirror); I've been running the same CPU with 12 SAS4 HGST SS300s (6x2 Mirror); Now the CPU is limiting my scale up factor o/c (4 cores/8T), so I cant throw 20 NVME in there (even if I wanted, not enough CPU; if I could it wouldn't help my use case (max performance for very few VMs))

You presented 12c/24t, low frequency, will be unable to saturate an NVMe per core which means you need to scale up parallel processes to multiply total speed which (unless I missed it) is not part of your planned use case.

Took me a minute to follow (might have given me more credit than deserved).
Is the core performance what determines the max throughput of any single transfer ...
(ignoring storage performance limits, as we're not close to those)
... and is aggregated CPU performance (what each core can serve up in a transfer) what determines max server performance..?

I.e., picture a theoretical server with unlimited storage performance, so we look only at the server's ability to prep the data for SMB...

At least I sort of get it. I have an R7415, and testing with 8 NVMe drives over SFP28, I couldn't even hit 1GB/s on a single transfer.

Whereas my T320 (E5-2460 v2), which can (briefly) hit ~1,200MB/s, has single thread performance of ~1600.
But weirdly, the Epyc 7351p (R7415 w NVMe drives), which hasn't exceeded ~850MB/s, has single thread performance of ~1800.

I also came across someone (with a slightly slower (same gen) Epyc) using SMB MultiChannel who got up to 35GbE throughput.

I had assumed TrueNAS didn't have it ... but it actually does.
Am I at least close to a more realistic understanding of the limits now..? And do you think MultiChannel could help..?

If so, why aren't you using it..? Are there drawbacks to it? Reasons not to?
I'm surprised that there's no synthetic benchmark to check a CPUs SMB performance per core and aggregate.

Again, thank you for the info.

ano · May 22, 2023

zfs 2.1.6 and up = fast

what matters is cpu frequency, memory speed, number of cores. a single "old" 7402 with 2133 mhz ram... will outperform a dual 6354 system with 3200mhz for zfs on nvme for some reason.... which is annoying.. guess who has a lot of gen3 xeons.... EATING power. talking 100% more for the basic chassis without the jbods.

I can do 19.1GB/s 128k 100% random writes with a single 7763/3200mhz ram, or 19.7GB/s with dual 75F3 with 3200mhz ram (on 8x drives, it doesnt matter if you add 10, or even 12 or 14 etc)

you're mostly limited by how it handles queueing, or the lack of it with nvme; it pretty much handles them like SAS/sata

to some extent QD=1 comes in here as well on drives

Rand__ · May 23, 2023

TrumanHW said:
If so, why aren't you using it..? Are there drawbacks to it? Reasons not to?
I'm surprised that there's no synthetic benchmark to check a CPUs SMB performance per core and aggregate.

Not entirely sure what you're asking except "Why is it so slow", but I dont use smb multichannel since it was not available when i was looking at it iirc, and I dont need it. I dont run Hyper-V to use it, i run ESXi and that uses iSCSI or nfs. Therefore I run dual connected nfsv4.

Now I have no experience with Epycs since I dont need more powerhogs, the Cascade Lake are bad enough (ie total overkill for my use case, but minimum required for NVDimm-N).

But to answer your question - running multichannel NFS significantly helped my aggregated performance so I would assume multichannel SMB would help too *if* networking (latency not bandwidth) is your bottleneck. If its client disk it wont help (unless u have multiple clients o/c :) ) etc.

TrumanHW · May 30, 2023

Rand__ said:
Not entirely sure what you're asking except "Why is it so slow"

You said a CPUs single thread performance is a performance-bottleneck for samba.
Shouldn't that mean that a CPU-type can be benchmarked for samba performance..?

It seems like an 8 dev RAIDz2 array gets about 1x - 2x the performance of a single device.
That was with the 7300 Pro. Hopefully, upgrading to 8x 9300 Pro will yield closer to 3GB/s performance...
The 9300 Pro get about 2.8GB/s vs ~1.5GB/s in the 7300 Pro; both of which have power-loss protection.
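
A quick back-of-envelope for the numbers above, as a sketch only. It assumes the 1x - 2x scaling seen on the 7300 Pro carries over unchanged to the 9300 Pro, and that SFP28 delivers its raw 25 Gbit/s with no protocol overhead; neither is guaranteed. The function names are made up for illustration.

```python
# Back-of-envelope: single-transfer estimate vs. the SFP28 ceiling.
# Assumptions (not measured): the 1x-2x scaling factor carries over between
# drive models, and the link delivers its raw 25 Gbit/s with zero overhead.


def link_ceiling_gb_per_s(gbit_per_s: float) -> float:
    """Raw line rate converted from Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


def estimate_range(per_device_gb_per_s: float, low: float = 1.0, high: float = 2.0):
    """Array throughput range if it scales low..high times one device."""
    return per_device_gb_per_s * low, per_device_gb_per_s * high


sfp28 = link_ceiling_gb_per_s(25)  # 3.125 GB/s

for name, per_device in (("7300 Pro", 1.5), ("9300 Pro", 2.8)):
    lo, hi = estimate_range(per_device)
    # Whatever the array can do, the client never sees more than the link.
    print(f"{name}: array {lo:.1f}-{hi:.1f} GB/s, "
          f"over SFP28 at most {min(hi, sfp28):.3f} GB/s")
```

On those assumptions, "closer to 3GB/s" is roughly where a single SFP28 link tops out anyway.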

My friend (previous owner of the R7415) said his single-transfer speeds were ≥2GB/s.
He used 20x DC P3500 (not super fast, at ~2GBs R / W) in (3) x 7 vdevs per RAIDz1.
(BUT, I'm unsure & just asked clarification if they're mirrors, stripes, or just independent vdevs, & if so, it's just a 7 SSD vdev)

He said he got 3.2GB/s until NAND filled up, and performance never dropped below 2GB/s sustained.
Of course, the configuration makes a huge difference so I'll update this once he replies, as there's a world of difference between the configs.

Hopefully, I'll have my 9300 Pro array soon and can just report the performance I get.

ano · May 30, 2023

I get much more out of a z2, you just need fast devices, and cpu and fast mem

Rand__ · May 31, 2023

TrumanHW said:
You said a CPUs single thread performance is a performance-bottleneck for samba.
Shouldn't that mean that a CPU-type can be benchmarked for samba performance..?

O/c you can, you can run fio against it for example.
The crux is being able to define your application's profile as exactly as possible so that the measurement is meaningful, i.e. actually depicts your use case.
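
To make "define your profile" a bit more concrete, here is a rough sketch that only builds fio command lines for three profiles discussed in this thread (single large stream, 4k QD1, scaled-up jobs). The target path, sizes, runtimes and the `fio_command` helper are assumptions for illustration, and `posixaio` is just one engine choice; adjust to whatever matches the real workload.

```python
import shlex


def fio_command(directory, rw="write", bs="128k", numjobs=1, iodepth=1,
                size="4g", runtime_s=60, ioengine="posixaio"):
    """Build a fio argv list for one workload profile (does not run it)."""
    return [
        "fio",
        "--name=profile",
        f"--directory={directory}",   # where fio creates its test files
        f"--ioengine={ioengine}",
        f"--rw={rw}",                 # write, read, randwrite, randread, ...
        f"--bs={bs}",                 # block size per I/O
        f"--numjobs={numjobs}",       # parallel processes ("J")
        f"--iodepth={iodepth}",       # outstanding I/Os per job ("QD")
        f"--size={size}",             # file size per job
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",          # one summary across all jobs
    ]


# Hypothetical mount point; replace with wherever the share or pool is mounted.
TARGET = "/mnt/tank/fio-test"

single_stream = fio_command(TARGET, rw="write", bs="1m", numjobs=1, iodepth=1)
small_random = fio_command(TARGET, rw="randwrite", bs="4k", numjobs=1, iodepth=1)
scaled_up = fio_command(TARGET, rw="randwrite", bs="128k", numjobs=16, iodepth=8)

for cmd in (single_stream, small_random, scaled_up):
    print(shlex.join(cmd))
```

Run on the server itself this exercises the pool and CPU only; pointed at the share as mounted on a client, the network and the file-sharing protocol are part of the measurement too.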

TrumanHW said:
It seems like an 8 dev RAIDz2 array gets about 1x - 2x the performance of a single device-performance.

For single process writes? Not bad
Excluding cache (memory) or including?
Streaming, random etc?

ano · May 31, 2023

I can find our benchmarks and post, but on genoa we managed 15-16GB/s random writes 128k with 8x nvme in z2

Rand__ · May 31, 2023

with 1t qd1? That would be miraculous

ano · May 31, 2023

I can dig up QD=1 numbers on different blocksizes, its very very important that that is good. then everything else is good.

we ditched mirrored vdevs for only using z2 with lz4 on all new systems/san systems

with the cpus we have available, its impressive, but even on something basic like a 7402, z2 + lz4 = FAST

Rand__ · May 31, 2023

I totally agree with that, at least on nvme the penalty of using zX over mirrors is negligible since if you have the means to utilize the mirrors (multiple processes) then you also can utilize multiple threads per nvme.
O/c if you got a really busy array (where your disks are at max) then mirrors over zX is an option.

TrumanHW · May 31, 2023

Great info.
I picked up 8x Micron 9300 Pro today (might grab a few more if benchmarks seem to scale // increase as SSD are added).

I plan on testing it with several variations between 3 - 8 SSD (z1) 4x, 6x, 8x with both z1 and z2 ... hopefully there's something to be learned.

I also have 2 pairs of Optanes:
- 2x P5800
- 2x 905P

With 256GB of RAM I doubt I'll need an L2arc.
Or is there even much of a benefit to a ZIL on an NVMe array ...?
My recollection is that to 'size' a ZIL (with sync enabled), 5 sec of max write speed is adequate (mirrored)...?
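
Taking that "5 seconds of max write speed" recollection at face value, the arithmetic is small enough to sketch. This assumes the max write speed is bounded by the network link running at raw line rate, which is an assumption and not something measured here; `slog_gb_needed` is a made-up helper name.

```python
def slog_gb_needed(max_ingest_gbit_per_s: float, seconds: float = 5.0) -> float:
    """GB of sync writes that can arrive in `seconds` at the given rate.

    Assumes ingest is capped by the network link at raw line rate.
    """
    gb_per_s = max_ingest_gbit_per_s / 8  # Gbit/s -> GB/s
    return gb_per_s * seconds


for link in (25, 100):
    print(f"{link} Gbit/s for 5 s -> {slog_gb_needed(link):.3f} GB")
```

Under that rule of thumb the requirement lands in the tens of GB, so capacity is unlikely to be what decides whether a log device is worth it.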

TrumanHW · May 31, 2023

ano said:
I can find our benchmarks and post, but on genoa we managed 15-16GBs random writes 128k with 8x nvme in z2

we ditched mirrored vdevs for only using z2 with lz4 on all new systems/san systems

Yes please. I'd def. like to know your benchmarks.

TrumanHW · May 31, 2023

I've only recovered regular RAID volumes (not ZFS). So I don't know how ZFS distributes data.

But often it's much faster to just look through the headers, find a picture ... and experiment with the order + stripe size of data ... and through this process I'm somewhat familiar with how regular RAID distributes data and XOR parity.

When available it's common for a tech to use an image to determine drive order & stripe size.
If the stripes are out of order we can infer whether it's L or R sync, and which drives go where.
We can infer stripe size based on whether a picture has area reserved for image data but lacks it or if the picture terminates early.

So, for a 1MB picture on a hypothetical 9-HD array with 128KB block size, a picture will be "striped" across the width of the array in 128KB chunks ... meaning a 9-HD array would have 128KB of data on each of 8 drives plus parity data on the 9th. And for that hypothetical 1MB file, more than 9 devices cannot yield increased performance with 128KB blocks. Only larger files could benefit from being striped across more devices.

But, within those limits, a wider array yields greater performance.

Though simplified, is this basically the way ZFS works also..? The wider the array the faster ..?
(ignoring ZFS specific features like checksum, etc)
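
The 1MB example above works out like this. The sketch models only the conventional single-parity RAID layout described in this post (fixed chunk size, one parity chunk per stripe); it is not a model of how ZFS places data, and `chunks_and_drives` is a made-up helper.

```python
import math

KIB = 1024


def chunks_and_drives(file_bytes, chunk_bytes, total_drives, parity_drives=1):
    """Return (number of data chunks, number of data drives that hold one).

    Conventional RAID model: the file is cut into fixed-size chunks that are
    laid out across the data drives; parity drives hold no file data.
    """
    data_drives = total_drives - parity_drives
    chunks = math.ceil(file_bytes / chunk_bytes)
    # A file cannot keep more drives busy than it has chunks.
    data_drives_used = min(chunks, data_drives)
    return chunks, data_drives_used


for width in (5, 9, 12):
    chunks, used = chunks_and_drives(1024 * KIB, 128 * KIB, width)
    print(f"{width} drives: {chunks} chunks, {used} data drives busy")
```

In this model the 1MB file is 8 chunks, so a 9-drive array keeps all 8 data drives busy and a 12-drive array still only touches 8.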

ano · May 31, 2023

TrumanHW said:
Great info.
I picked up 8x Micron 9300 Pro today (might grab a few more if benchmarks seem to scale // increase as SSD are added).

I plan on testing it with several variations between 3 - 8 SSD (z1) 4x, 6x, 8x with both z1 and z2 ... hopefully there's something to be learned.

I also have 2 pairs of Optanes:
- 2x P5800
- 2x 905P

With 256GB of RAM I doubt I'll need an L2arc.
Or is there even much of a benefit to a ZIL on an NVMe array ...?
My recollection to 'size' a ZIL (with sync enabled) is 5-sec max write speed is adequate (mirrored)...?

remember 2.1.6 and higher zfs, use lz4

dont do l2arc

zil.. weeeeeeeeeeeeeeel, maybe, p5800 are fast, I do doubt it, wasnt worth it with p4800

for 8x of those you need cpu, fast cpu, and fast memory also helps then.

with enough cpu, you can almost max out 2x100gig with those drives
