What kind of R/W perf can I hope for with 4x NVMe drives over SFP28 using TrueNAS..?


TrumanHW

Active Member
Sep 16, 2018
don't do L2ARC
Cool, thanks

ZIL ... weeeeell, maybe with a P5800; they're fast.
It wasn't worth it with a P4800.
I'll test that out.

With 8x 9300 Pro, a fast CPU & fast memory, you can almost max out 2x 100GbE.
CPU: Epyc 7351p: 16C/32T at 2.4GHz | 2.9GHz (turbo)

1st Gen Epyc aren't offered with a significantly faster clock speed
I thought the 2.9GHz Turbo speed would be adequate; do you think otherwise..?

RAM: 256GB of PC4-2400 ECC
Isn't 2400MHz RAM fast enough?
The only bummer re: Epyc is it can't use Optane DIMMs.


Thanks!
 

ano

Well-Known Member
Nov 7, 2022
CPU: Epyc 7351p: 16C/32T at 2.4GHz | 2.9GHz (turbo) ... I thought the 2.9GHz Turbo speed would be adequate; do you think otherwise..?
RAM: 256GB of PC4-2400 ECC ... Isn't 2400MHz RAM fast enough?
You're in luck; we tested 2133, 2400, 2666 and 3200 on 2nd/3rd gen (also tested 4th gen).

On the 7351 you will see zero gain with faster memory.

You need something like a 7443 to use 3200, but even then it's not much; to really use 3200 you need a 75F3 or 7763 with ZFS.

It's more important to have 8 RAM channels than fast memory, so you should be good for memory.

You will not be able to fully utilize those drives with your CPU, though. But I'm guessing it will still be quite nice. Also remember that testing ZFS is very hard; because of ARC, most do it wrong.

I'm guessing you will get 6 to 10GB/s on 128K random write with psync fio. So ... what is your networking, and what is your workload?
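
If you want to sanity check that all eight channels are actually populated, something like this from a root shell (SCALE or any Linux) will list the DIMM slots; treat it as a rough sketch, since the locator names vary by board:

Code:
# list populated DIMM slots, sizes and speeds; slot names differ per vendor
dmidecode -t memory | grep -E "Locator:|Size:|Speed:"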
 

TrumanHW

Active Member
Sep 16, 2018
You're in luck; we tested 2133, 2400, 2666 and 3200 on 2nd/3rd gen (also tested 4th gen).
On the 7351 you will see zero gain with faster memory.
Good info. I considered an R7515 (instead of the R7415; it was only ~$700 extra) to get PCIe 4.0, but ...
PCIe 4.0 is only valuable if I also got PCIe 4.0 (enterprise) NVMe (also with Power Loss Protection) ...
But buying those would've been much more expensive than the 7.68TB 9300 Pro (I got 8x @ ~$450 ea).


You'd need something like a 7443 to use 3200MHz RAM. And even then, it's not much faster.
To really use 3200MHz you'd need 75F3 or 7763 with ZFS.
8 RAM channels is more important than fast memory, so you should be good for memory
8 Channel RAM: check.
An Epyc 75F3 literally costs double what my average R7415 cost was ($1,450)
(I snagged a 2nd unit for $700!! As soon as I saw someone was auctioning!! a server, I thought, "that was a mistake.") :)
I was still looking for an R7515 at a steal, bc I thought I could use a 7F32 (Base: 3.7GHz | Turbo: 3.9GHz)

But it sounds like only an R7615 (Gen 3) would be much different CPU-wise (I got really lucky, if so).

But even the 8c and 16c Milan are pricey!

Both Gen2 and Gen3 EPYC have faster clocks ... what is it about Gen-3 (Milan) that makes the big difference..?





you can't fully utilize those drives [bandwidth] with a 7351p, but I'd guess it will still be quite nice
maybe ... 6 to 10GB/s on 128K random write with psync fio. So ... what is your networking, and what is your workload?
How many 9300 Pros (in Z2) would it take to get that 6-10GB/s (preferably 10GB/s) writing 128K random..?

And is that benchmark guess based on a local test via fio..?
Or do you mean I could get that (via 100GbE, obviously) over SMB..?


Remember: testing ZFS is very hard; because of ARC, most do it wrong.
What about testing real-world data transfers ..? As in ...
Copying data and timing how long the transfer takes over SMB..?

What tests do you usually run..? Do you use iPerf..? FIO ..? DD ..?

If it's not too difficult ... I'd love to see the actual command + arguments when doing these sort of synthetic tests ...
 

TrumanHW

Active Member
Sep 16, 2018
253
34
28
I can find our benchmarks and post them, but on Genoa we managed 15-16GB/s random writes at 128K with 8x NVMe in Z2
Still hoping you'll dig up those Benchmarks for us. :)
Please include the type of SSD and which Epyc CPU you got those numbers with ...
And, since you mentioned the difficulties of getting valid benchmarks, please include how you tested.
 

ano

Well-Known Member
Nov 7, 2022
We use fio for local tests: random write, and large files (larger than RAM). We monitor actual drive IOPS in addition to the fio results.

Both 2nd and 3rd gen will be very fast, and even 1st gen.

A 7402, as an example, is a very cheap, very fast option.
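
For reference, a rough sketch of that kind of run; the dataset path and sizes are placeholders (total written = numjobs x size, and it should comfortably exceed RAM so ARC can't hide the drives), and it isn't our exact job file:

Code:
# larger-than-RAM random write against the pool (path and sizes are placeholders)
fio --directory=/mnt/tank/fiotest --name=randwrite --rw=randwrite --bs=128k \
    --size=64G --numjobs=8 --ioengine=psync --group_reporting
# in a second shell, watch what the drives themselves are actually doing
zpool iostat -v 1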
 

Rand__

Well-Known Member
Mar 6, 2014
If it's not too difficult ... I'd love to see the actual command + arguments when doing these sort of synthetic tests ...
If you happen to speak Perl, I have a script that I created for FreeNAS 11 at some point. It fully automates vdev testing (fio and dd, IIRC), i.e. create vdevs and a dataset, run tests, destroy everything, and repeat.
It may or may not run on 13 (I think I ran it on 12 too), and I never implemented all the ideas I had, but maybe it could help. I don't do much vdev design these days, so I haven't touched it in a while and never finished it (result of a winter-holiday fun project).

If you don't speak Perl it might be a bit more difficult if something doesn't work as expected, so I'm not sure it would be worth the effort.
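
To give an idea of what it automates (this is not the script itself, just a hedged shell sketch; pool name, mountpoint, layouts and device names are all placeholders):

Code:
#!/bin/sh
# create pool -> run fio -> destroy, once per layout; everything here is a placeholder
DISKS="da0 da1 da2 da3"
for layout in raidz1 raidz2 mirror; do
    zpool create -f -m /mnt/bench bench $layout $DISKS
    zfs create bench/test
    fio --directory=/mnt/bench/test --name=seq --rw=write --bs=1M --size=16G \
        --numjobs=4 --ioengine=psync --group_reporting
    zpool destroy bench
done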
 

TrumanHW

Active Member
Sep 16, 2018
253
34
28
We use fio for local tests, random writes of data that exceeds the RAM (ARC), and monitor actual drive IOPS as well as the fio results
Smart; that disambiguates the drive IOPS.


Both 2nd and 3rd Gen (and even 1st Gen) are very fast. 7402 as an example, is a very cheap, very fast option
Yeah, I'll keep checking for deals on either a complete R75*5 / R76*5 or the main board for a R65*5 / R66*5 / R75*5 / R76*5 ...
I think all of these units use the same motherboards if I'm not mistaken. (Dell's naming convention makes this easy).
I can even use a DP unit with 1 CPU (if for some reason someone sold one inexpensively)
 

TrumanHW

Active Member
Sep 16, 2018
253
34
28
If you happen to speak Perl, I have a script that I created for FreeNAS 11 at some point. It fully automates vdev testing ...
Hey, that'd be very cool. I can just install whichever version of FN it'll work with (nothing set up yet, no data installed).
Hopefully I get lucky and it just works right out of the box ... and if not..? No loss.
(Who knows, maybe it won't be so difficult and I'll actually learn something trying to debug it.)

Too bad I can't install homebrew on TrueNAS ... or install other CLI commands.
 

Rand__

Well-Known Member
Mar 6, 2014
You're in luck insofar as I'm considering setting up a low(ish)-power TNC system at the moment, so I might have a system to play with.

(Off-topic: it's damn hard to go low power. I always throw in some parts like a CX3 or Optane or another pair of SSDs, or wonder if I really can get away with an E3/16G or whether I shouldn't use a Xeon D or E5 instead. In the end it's going to be about space (i.e. # of drives) and functionality: do I really want to use the "low power NAS" to run Veeam backups to? And if so, is it better to run low power 24/7 and not mind a slow backup, or go faster and turn it off when not used [or at least at night/automated]?)
 

TrumanHW

Active Member
Sep 16, 2018
I'm utterly dejected now. Received the 8x Micron 9300 Pro...
Not only are twice as many drives that are twice as fast literally the exact same speed ... at all of ~650MB/s (wtf..?)
But now I have to hope the IPMI trick that lets me manually reduce the fan speed still works on Gen 14 PowerEdge...
(they're doing that asshole thing Dell does with drives not offered by them ... making the fans infuriatingly loud at 10,000rpm)

Obviously, I suspected something was wrong even with 4 NVMe, as it should've maxed out the R/W at 1250MB/s (at least).

4 of the Micron 7200 ... which I tested to get ~1.4GB/s each ... and I was getting ~650MB/s. Now ...
8 of the Micron 9300 ... which I tested to get ~2.4GB/s each ... and I'm STILL getting ~650MB/s ..?

It's not the switch: this little Mikrotik gets >1GB/s of throughput from my spinning array.
I'd also tested the first 4 Micron 7200 (NVMe) using all SFP28 gear + switch and got the same ~650MB/s.

Can SFP+ DAC cables be "degraded" if they've had a bend in them?
The Dell R7415 recognizes the Micron 7200 NVMe drives ... so it's not that.
The CPU is hardly utilized ... it didn't use more than a few percent ... so it's not that.

Basically, the performance stats in TrueNAS were "unremarkable" except the piss-poor drive IO.
Aside from the drive IO reaching that whopping peak of 87MB/s, hardly any other resources were being used.
It used about 30GB of the 256GB of RAM.
The CPU was basically idle.
Temps were all low.



I tried a few variations of FIO ... but as I suspected, I'm inadequately familiar with the variables to execute a useful command.

I.E., I don't know which [ioengine] I should be using: posixaio or libaio

And I probably made other mistakes ... here's the last command I tried:

Code:
fio --filename=/fiotest --rw=write --bs=1Mi --rwmixread=100 --iodepth=8 --numjobs=16 --runtime=60s --name=1024ktest --size=4G
Thanks (if you're able to provide any suggestions or for even reading).
 

ano

Well-Known Member
Nov 7, 2022
set LZ4
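
(Assuming the test dataset is something like tank/bench; that name is a placeholder:)

Code:
# enable LZ4 compression on the test dataset (dataset name is a placeholder)
zfs set compression=lz4 tank/bench
zfs get compression tank/bench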

what does this give you?

fio --filename=random.fio --direct=1 --rw=randrw --bs=128k --size=100G --numjobs=30 --runtime=120 --group_reporting --name=yeeetiops --rwmixread=5 --thread --ioengine=psync
 

Rand__

Well-Known Member
Mar 6, 2014
I tried a few variations of FIO ... but as I suspected, I'm inadequately familiar with the variables to execute a useful command.
You could run the script I tagged you with, should only take 20 minutes to set all the variables and get it churning away for the night;)

But anyhow, regardless of what you use, a simple fio run with sync=off should top your 650MB/s with these drives, so I think something is definitely wrong.
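
Something along these lines, if you want to be sure sync is out of the picture; the dataset name/path is a placeholder, and remember to put sync back afterwards:

Code:
# take sync writes out of the equation on the test dataset (placeholder name)
zfs set sync=disabled tank/bench
fio --directory=/mnt/tank/bench --name=seqwrite --rw=write --bs=1M --size=16G \
    --numjobs=4 --ioengine=psync --group_reporting
zfs set sync=standard tank/bench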

@ano's command is not really well suited for my HW though, with 30 threads ;)

E3-1220v6 (4 cores, no HT) with 5 Pm863a's in Z1
Run status group 0 (all jobs):
READ: bw=56.4MiB/s (59.2MB/s), 56.4MiB/s-56.4MiB/s (59.2MB/s-59.2MB/s), io=6771MiB (7100MB), run=120009-120009msec
WRITE: bw=1070MiB/s (1122MB/s), 1070MiB/s-1070MiB/s (1122MB/s-1122MB/s), io=125GiB (135GB), run=120009-120009msec

Pm1725a's, 2x2 mirror, Xeon 5122 (8 cores inc HT)

READ: bw=73.2MiB/s (76.8MB/s), 73.2MiB/s-73.2MiB/s (76.8MB/s-76.8MB/s), io=8790MiB (9217MB), run=120008-120008msec
WRITE: bw=1393MiB/s (1461MB/s), 1393MiB/s-1393MiB/s (1461MB/s-1461MB/s), io=163GiB (175GB), run=120008-120008msec

Shockingly low on the 1725a's, I think, but maybe due to the 30 threads overloading my CPU...
 

TrumanHW

Active Member
Sep 16, 2018
...a simple fio run with sync=off should top your 650MB/s with these drives, so I think something is definitely wrong.
These results WERE with sync disabled!! lol.
(I didn't mention it bc again, at <5% of the drive's actual performance, there's obviously a bigger problem.)

You could run the script I tagged you with, should only take 20 minutes to set all the variables and get it churning away for the night;)
Can you please give it to me again..?
(I trawled this thread and couldn't find it or didn't see what I was looking for.)

Also ... "20 minutes to set the variables" ..?
If it takes you 20 min, figure it'll take me 20 days.

And ... why am I running any exhaustive test as if I don't already know something's a problem..?

Let's start with tests that take a few minutes until I'm at least getting results that are somewhat reasonable.

In fact, I know I asked for a better FIO command to run...but on further thought, what for?
(As in, that was a dumb idea of mine).

Understand that I pretty much think EVERY user who's replied to me is not only more knowledgeable than I am, but smarter.
(so I definitely don't mean anything as a condescension).


As in, we're going to exhaustively test an array that's literally getting 5% of the drive's IO bandwidth...
Don't I already know there's a problem here..? How do more significant figures help (it's not diagnostic).

Further benchmarking to even more exactly know how badly it's performing neither will fix nor diagnose the problem...

To rule out whether it's some issue between the hardware and TrueNAS ...

I'll install another OS to get a local benchmark of the drives individually.

For whatever reason (Windows didn't give one), my previous attempt to install Windows failed.

I'll install Ubuntu & try running a benchmark (of those drives) in a different OS.
If each drive benchmarks at ≥2GB/s ...
I'll share the drive over the network and see how it performs then.
Maybe I can set up a RAID of some sort in Ubuntu as well as testing it with Unraid ... etc.
(I've benchmarked in TrueNAS SCALE & CORE (same results), though I'm having trouble installing TN CORE now.)
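
For the per-drive check in Ubuntu, I'm thinking something like this read-only fio run on each drive (the device path is a placeholder; I'll double-check it before running):

Code:
# read-only sequential test of one NVMe drive; /dev/nvme0n1 is a placeholder
sudo fio --name=singledrive --filename=/dev/nvme0n1 --readonly --rw=read \
    --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --runtime=30 --time_based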

I'll also try throwing a bunch of SATA SSDs in the R7415 and see if they hit the same ~650-700MB/s limit.
(I hope it's an HBA330 and not an H330 that I installed).

If all drives give me the same dogshit performance, the unit's under warranty through 2026.

Not until I can get local results, per drive and collectively, that exceed ~100MB/s per SSD do network results matter...

Once I get or confirm good local performance ...
I can test network performance with another OS as a simple share.
- If good, perhaps it's a problem with TrueNAS network stack.
- If bad, perhaps I have a problem with a cable or something stupid ... who knows.

IF ANYONE has suggestions of plausible culprits causing SSD groups to be limited to an aggregate of 650MB/s ... I'm all ears.

Barring that ... further benchmarking won't help make progress except to see that something causes a change in performance.

Whenever I do get past that limit, I'll definitely need better FIO commands ...
But I'd like to start with quick FIO benchmarks.

If at some point I'm hitting some other limitation that I need to identify...that's when more exhaustive benchmarks will have utility.

I'm sure I'm wrong about something here ... or maybe have an inefficient process of elimination.
Please tell me the ways I'm wrong ...

I'd imagine there are better (more efficient) ways I could test some things, but be aware that I lack proficiency and may be using sub-optimal tactics because of that lack of proficiency.

Again, I truly thank everyone for all of their help and time, especially ...

And again ... it's entirely my fault that I requested a red herring (fio) ...
when benchmarking had already revealed that there is something wrong.
 

nexox

Active Member
May 3, 2023
Can SFP+ DAC cables be "degraded" if they've had a bend in them?
I haven't seen anyone address this yet, but yes, twinax cables (such as those used for SFP DACs, external SAS, InfiniBand, and other high-speed connections) can be damaged by bending. The minimum bend radius is larger than for most other cables: a couple of inches at least, and even as high as 6" isn't uncommon. So unless you know the specs of the particular cable, avoid bending it into a circle less than a foot or so across.

You can of course just do a network throughput test with iperf3 or similar to see if the cable is an issue.
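
Roughly like this, with the server side on the TrueNAS box and the client on the other end of the link (the IP is a placeholder; -P runs parallel streams and -R reverses the direction):

Code:
# on the TrueNAS box
iperf3 -s
# on the client machine (substitute the NAS IP)
iperf3 -c 192.168.1.50 -P 4 -t 30
iperf3 -c 192.168.1.50 -P 4 -t 30 -R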
 

Rand__

Well-Known Member
Mar 6, 2014
I agree that it might be time to take a step back.

1. For later -> https://forums.servethehome.com/ind...tomated-testing-of-drives-in-zfs-pools.40314/
2. For now:

The usual elimination process says
a. different SW (You tried TNC, TNS, next in line is regular Linux or Windows).
b. different HW (You changed drives; next step is a different HBA, or a different box altogether if possible [ideally a different architecture/config, so not a second identical server. That also has its benefits for some tests, but not for the initial elimination process])

If you are at a point where you suspect a fundamental problem, then don't worry about testing arrays of drives any more. You start small, with as big numbers as you can get. That means fio or CrystalDiskMark or whatever tool, set to 4K blocks, (# of CPU cores minus 1 or 2) threads, QD16+, 4G test size.
Start with a single drive, then do a software RAID (Windows or mdadm) as a stripe (not a mirror) with 2+ drives.
You should at least get 50% extra performance initially; further extra disks will have diminishing returns (unless you add more threads). Note that you can overload drives, i.e. running 128(-2) threads on a single SATA SSD might be too much ;) while it would be ok-ish on NVMe (especially on newer ones).
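
As a rough sketch of that progression (device names are placeholders, the writes are destructive to whatever is on those drives, and the 14 threads assume a 16-core CPU):

Code:
# single drive first -- WARNING: this writes to the raw device and destroys its contents
fio --name=single --filename=/dev/nvme0n1 --rw=randwrite --bs=4k --iodepth=16 \
    --numjobs=14 --size=4G --direct=1 --ioengine=libaio --group_reporting
# then a 2-drive mdadm stripe (RAID 0) and the same fio run against /dev/md0
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1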

Depending on what kind of drives you have/test (SATA, SAS, NVMe), run basic-connectivity vs intended-configuration tests (onboard SATA/SAS/NVMe vs dedicated card vs backplane). Try different HBAs if you can. Try direct-attaching NVMe drives if you can (even a simple M.2 should top your 650MB/s, so it's an acceptable test candidate at this point).

Try swapping PCIe slots for things if possible (make sure you're in x8 slots or better for HBAs, and make sure they are set to PCIe 3.0+).

If connectivity is not the issue (i.e. all tests are slow over all cards, over all drive types, over all operating systems), then call support.
Basically any hardware sold as a server today must be able to do 1.5x the base performance of a drive in a stripe (from raw processing power; even some Atom can do that if you're looking at one or two drives only).

During tests, have a look at CPU utilization (per core) to see if there is a single maxed-out core (per thread) or if you have headroom (i.e. the CPU is not the limit). Fewer threads means higher CPU clocks due to turbo.
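
(If mpstat from the sysstat package is available, the per-core check is easy; htop works too:)

Code:
# per-core utilization, refreshed every second
mpstat -P ALL 1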

Now, 650MB/s is not a 'well known' barrier; it's above single-lane SAS2/SATA3 speeds. It's in the range you used to get with Optane 900p SLOGs (on sync), but you ruled that out. If a core is maxed then that could be it, but that's unlikely for recent servers, as I mentioned.
 

TrumanHW

Active Member
Sep 16, 2018
You can of course just do a network throughput test with iperf3 or similar to see if the cable is an issue.
Of course I can. I totally thought of that. :)

Thank you very much for that info ... Let's hope I'm not too retarded to remember how to run that test tomorrow ...

Did you find any of my other ideas for eliminating layers totally retarded..? I know they aren't the most efficient methods; I just have less experience with other approaches, and I'm looking for certainty as I begin ruling out "layers" to figure out WTF the problem is ... you know..?

Do you have any ideas what the heck this is if not the dang cable..?
Because sadly ... (though I'm going to test the cable(s) to be sure) ... the fio test I ran yesterday got <100MB/s, too.

I mean ... wtf ... right..?? What could possibly cause this..?
All the SSDs work (tested on other machines). They're actually reading & writing data accurately
... just slow as molasses (on a unit that's under warranty)
 

Rand__

Well-Known Member
Mar 6, 2014
Whereas my T320 (E5-2460 v2), which can (briefly) hit ~1,200MB/s, has a single-thread score of ~1600.
But weirdly, the Epyc 7351p (R7415 w/ NVMe drives) hasn't exceeded ~850MB/s, and its single-thread score is ~1800.
If this is still the same basic issue you described earlier (the Epyc not going beyond 650/850), then it's clearly an issue with this build
-> you need to swap/single out individual components (HBAs, backplane) to identify the weak point
 

ano

Well-Known Member
Nov 7, 2022
Adjust/play with threads accordingly. A single 7402 with 8x SAS will do 4500MB/s in Z2 there, for comparison; NVMe roughly double that (for fast drives).

The 30 is just to MAX the HW and push it.

SATA likes to have pauses and ... chug along slowly.
 

TrumanHW

Active Member
Sep 16, 2018
I agree that it might be time to take a step back. ... The usual elimination process says: a. different SW ... b. different HW (different HBA, or a different box altogether) ... If connectivity is not the issue, then call support.

Unfortunately (in preparation to move)
... I gave the R730xd to a friend...
... and already shipped the other R7415 I had.

But testing with other SATA SSDs (particularly because they are SATA) may yield some info.
And, being under warranty (til 2026)..? I might as well take advantage of that & let Dell take a stab.
(Of course I'll throw in the Micron 7200s I have to placate any illegitimate concerns about drives that (crisis!) aren't on their HCL.)

No doubt they'll make me try a few dumb troubleshooting steps, but even if nothing else, those often make me think of other ideas that'll get me on the right path. And, realistically, their guys really do know their servers well. (Hopefully they'll indulge my first criterion: that we get IPMI commands to lower the fan speed, to keep me from losing my chet.)

Anyway ... tempted though I am to stay up late, try the suggestions and reply as meaningfully as you (also) generously have ...
I should really get to bed to have a productive day tomorrow. But truly, thank you very, very much.