PCIe Gen 4 bifurcation risers and NVMe SSDs


lunadesign

Active Member
Aug 7, 2013
256
34
28
I picked up the Linkreal LRNV9F14 card that I linked to earlier in this thread plus a pair of Linkreal SFF-8654 8i to 2x SFF-8639 cables and am midway through testing them. I was going to report back when done testing but your post prompted me to chime in.

So far, I've tested with two SSDs (one is PCIe 4, the other is a PCIe 3 Optane drive) and the performance was equivalent to me plugging these SSDs into individual PCIe-to-SFF-8639 cards. The cables are longer than I'd like but so far no issues. I've got two more PCIe 4 SSDs that I'll add to the mix (hopefully this weekend) and report back.
 
  • Like
Reactions: ectoplasmosis

ectoplasmosis

Active Member
Jul 28, 2021
117
53
28
I picked up the Linkreal LRNV9F14 card that I linked to earlier in this thread plus a pair of Linkreal SFF-8654 8i to 2x SFF-8639 cables and am midway through testing them. I was going to report back when done testing but your post prompted me to chime in.

So far, I've tested with two SSDs (one is PCIe 4, the other is a PCIe 3 Optane drive) and the performance was equivalent to me plugging these SSDs into individual PCIe-to-SFF-8639 cards. The cables are longer than I'd like but so far no issues. I've got two more PCIe 4 SSDs that I'll add to the mix (hopefully this weekend) and report back.
Excellent! Thank you for the report.

Would you mind sharing where you bought the card and cables?
 

lunadesign

Active Member
Aug 7, 2013
256
34
28
Excellent! Thank you for the report.

Would you mind sharing where you bought the card and cables?
I bought them direct from Linkreal in China. I originally started with a question to their customer service dept and the customer service rep was incredibly helpful, answering all of my questions in a very responsive manner and then facilitated the purchase. Shipment from China to US was fairly quick and everything arrived in great shape. I've been very impressed by the whole experience so far.

PS - In the process, I learned they have a PCIe 4 retimer card coming out soon. It might be out by now.
 

ectoplasmosis

Active Member
Jul 28, 2021
117
53
28
Linkreal also sells these on Newegg, in case that's easier than buying from them directly:
Ah, thanks! I'm based in the UK however. I'll get in touch with Linkreal direct.
 

lunadesign

Active Member
Aug 7, 2013
256
34
28

ectoplasmosis

Active Member
Jul 28, 2021
117
53
28
I bought them direct from Linkreal in China. I originally started with a question to their customer service dept and the customer service rep was incredibly helpful, answering all of my questions in a very responsive manner and then facilitated the purchase. Shipment from China to US was fairly quick and everything arrived in great shape. I've been very impressed by the whole experience so far.

PS - In the process, I learned they have a PCIe 4 retimer card coming out soon. It might be out by now.
How's the testing with Gen4 drives going?
 

lunadesign

Active Member
Aug 7, 2013
256
34
28
How's the testing with Gen4 drives going?
Sorry....I got delayed by some work projects. Your timing is excellent as I just wrapped up my initial testing today.

Here's my test setup:

Supermicro H12SSL-NT
AMD EPYC 7262 (Rome)
128GB ECC RAM
3 Intel P5510 3.84 TB PCIe 4 SSDs
1 Intel P4801X 100 GB PCIe 3 SSD (Optane)
1 Intel S4610 SATA SSD (boot drive)
Linkreal LRNV9F14 bifurcation riser (with SFF-8654 to 2x SFF-8639 cables)
Windows Server 2019 Standard

For benchmarking, I used CrystalDiskMark 8.0.2 (using "NVMe SSD" settings) and Anvil 1.1.0.337. These are admittedly synthetic tests so they may not be representative of reality but I figured they'd give me a ballpark idea of how the card holds up under load.

Here are the summary results & some observations:
  1. Connecting all 4 SSDs to the bifurcation riser works fine. The motherboard sees all 4 and negotiates Gen4 speed with the Gen4 drives and Gen3 speed with the Gen3 drive. I didn't know if mixing-and-matching was going to work, so that was a pleasant surprise. However, the Gen4 drives seemed to perform slightly better when the Gen3 drive wasn't connected.
  2. When testing one Gen4 drive at a time, the performance via the Linkreal card was equivalent to a PCIe-to-SFF-8639 card.
  3. Then I tested two Gen4 drives simultaneously on the Linkreal card. When comparing this vs one-at-a-time:
    • The CDM average results decreased 7-11% for 4 results, were mostly unchanged for 3 results and actually increased 3% for one result (not sure why).
    • The Anvil average summary scores were basically unchanged.
  4. Then I tested three Gen4 drives simultaneously on the Linkreal card. When comparing this vs one-at-a-time:
    • The CDM average results decreased 31-42% for 2 results, 9-13% for 3 results and were mostly unchanged for 3 results.
    • The Anvil average summary scores were down 2-3%.
  5. Interestingly, the CDM Q32T16 tests that use 100% CPU with a single drive were not the ones to decrease the most when using 2 or 3 drives.
Bottom line -- I'm not sure what to make of all of this. Specifically, it's unclear to me whether the CDM performance drop when going from one drive to multiple drives is due to:
  • The Linkreal card (if I had more PCIe-to-SFF-8639 cards I could test this)
  • The system (forum user 111alan has previously indicated AMD CPUs aren't great for I/O intensive workloads)
  • A limitation of the CDM benchmark when running multiple instances
Any thoughts? This is my first experience with PCIe 4 and NVMe so I'm not exactly sure what to expect.
 

UhClem

just another Bozo on the bus
Jun 26, 2012
435
248
43
NH, USA
... However, the Gen4 drives seemed to perform slightly better when the Gen3 drive wasn't connected.
I find this curious. Details? (Most likely due to some random anomaly (an outlier) that wasn't eliminated because of the hassle of collecting multiple data points.)
... I'm not sure what to make of all of this. Specifically, it's unclear to me whether the CDM performance drop when going from one drive to multiple drives is due to:
1. The Linkreal card (if I had more PCIe-to-SFF-8639 cards I could test this)
2. The system (forum user 111alan has previously indicated AMD CPUs aren't great for I/O intensive workloads)
3. A limitation of the CDM benchmark when running multiple instances
Any thoughts?
Definitely not 1.
2. is highly unlikely
3. almost certainly (combination of CDM inefficiencies and OS [syscall] overhead)
Doing similar testing under Linux, using fio, would differentiate.
======
"If you push something hard enough, it will fall over." -- Fudd's First Law
 

lunadesign

Active Member
Aug 7, 2013
256
34
28
I find this curious. Details? (Most likely due to some random anomaly (an outlier) that wasn't eliminated because of the hassle of collecting multiple data points.)
Sure thing. But first I have to admit that I was hesitant to even mention this since the deltas are pretty small and I know SSD performance can vary minute-to-minute due to background garbage collection algorithms, etc. However, the deltas were consistently in the same direction.

To minimize the variability, I focused on a single test (SEQ1M Q8T1 Read) that's fairly reproducible *and* runs at faster-than-Gen3 speeds. Here are the results:

With 3 Gen4 drives and 1 Gen3 drive plugged in:
  • 3 Gen4 drives tested simultaneously: 4872.77 MB/s avg
  • 2 Gen4 drives tested simultaneously: 6129.70 MB/s avg
  • 1 Gen4 drive tested at a time: 6027.39 MB/s avg
With 3 Gen4 drives plugged in:
  • 3 Gen4 drives tested simultaneously: 5157.19 MB/s avg (5.8% increase)
  • 2 Gen4 drives tested simultaneously: 6246.76 MB/s avg (1.9% increase)
  • 1 Gen4 drive tested at a time: 6248.44 MB/s avg (3.7% increase)
Definitely not 1.
2. is highly unlikely
3. almost certainly (combination of CDM inefficiencies and OS [syscall] overhead)
What makes you say "definitely not" to #1? There was an earlier discussion about cable lengths and the possible need for redrivers/retimers so I thought that *might* be a factor.

With regard to #3, after I wrote this I also started wondering about the OS kernel and drivers. Are there any tweaks out there that I would need to apply?

Doing similar testing under Linux, using fio, would differentiate.
I could definitely try this. I have a decent amount of experience with CentOS 7 (but usually in VMs, not bare metal). Do you have any particular fio workloads that you would recommend? From Googling around, I see some people have attempted to create fio workloads that mimic what CDM does on Windows, but I don't see any that claim parity with CDM version 8. And of course, there are some inherent differences between Windows and Linux filesystems.
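For reference, the closest approximation I've sketched out so far is something like the following -- a rough Python wrapper around fio that mimics CDM's SEQ1M Q8T1 read test (1 MiB sequential reads, queue depth 8, one job per drive). The device paths are placeholders for my drives, and I'm not claiming parity with CDM 8:

#!/usr/bin/env python3
# Rough approximation of CDM's SEQ1M Q8T1 read test using fio -- a sketch only,
# not an exact CDM equivalent. Device paths are placeholders.
import subprocess

DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]  # placeholders

def seq1m_q8t1_read(device: str) -> None:
    """1 MiB sequential reads at queue depth 8 against a single device."""
    subprocess.run([
        "fio",
        "--name=seq1m-q8t1-read",
        f"--filename={device}",
        "--rw=read",            # sequential reads
        "--bs=1M",              # 1 MiB blocks (CDM's "SEQ1M")
        "--iodepth=8",          # queue depth 8 ("Q8")
        "--numjobs=1",          # one thread ("T1")
        "--ioengine=io_uring",  # or libaio on older kernels (e.g. CentOS 7)
        "--direct=1",           # bypass the page cache
        "--time_based",
        "--runtime=30",
        "--group_reporting",
    ], check=True)

if __name__ == "__main__":
    for dev in DEVICES:         # read-only, so non-destructive
        seq1m_q8t1_read(dev)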
 
  • Like
Reactions: vcc3

UhClem

just another Bozo on the bus
Jun 26, 2012
435
248
43
NH, USA
... I was hesitant to even mention this since the deltas are pretty small and I know SSD performance can vary ...
I can understand your apprehension ... but the quantitative trend you "saw" must have been hard to "ignore". Forgive my need to clarify one thing: the two runs (one with the Gen3 connected, and one without it connected) were performed on separate boot-ups? (You didn't perform the without-run, then connect the Gen3 and perform the with-run, all on the same boot-up. An error-generating [but non-fatal] surprise/hot-plug could have a degrading side-effect.)
To minimize the variability, I focused on a single test (SEQ1M Q8T1 Read) ...
Sounds like a reasonable methodology ...
Regardless of the with-Gen3 vs without-Gen3 issue, I'm bothered that 3 concurrent runs of that test incur any total throughput "penalty", let alone ~20% [(60-48)/60].
What makes you say "definitely not" to #1? There was an earlier discussion about cable lengths and the possible need for redrivers/retimers so I thought that *might* be a factor.
The card is completely passive (other than replicating the CLK signal from the x16 to each x4). Since 4x of x4 targets "work", the card works. Regarding the redriver/retimer "boondoggle", see my post [Link] and factor that Gen3 example through the signal budget/deficit #s for Gen3 & Gen4 in the PCISIG slide posted earlier in this thread [Link] and draw your own conclusions. Those redriver/retimer chips are critical, but only in the scenarios that truly require them--highly complex systems that make our (STH) boxes look like pogo-sticks relative to a jetliner. :)

If you want to do further testing, fio is the right tool; even more so in combination with the new io_uring kernel feature. Unfortunately, I can't offer any assistance with fio because I've never used it. (I do extensive performance testing/analysis for fun -- I retired 20+ yrs ago -- but I've been using the only two progs I've written since 2000 [xft (transfer-test) & skt (seek-test)], so no real motivation to dig into fio.) Mea culpa for the "do as I say, not as I do".
 

NateS

Active Member
Apr 19, 2021
159
91
28
Sacramento, CA, US
The card is completely passive (other than replicating the CLK signal from the x16 to each x4). Since 4x of x4 targets "work", the card works.
It is actually possible with PCIe for something to appear to work while, under the hood, it's experiencing enough corrupt packets requiring retransmits that performance suffers, and that could be due to cable length or a not-so-great connector or something. Unfortunately, determining whether that's happening isn't easy on Windows, but I understand Linux has some capabilities to report that. This page has some more info: PCIe: Is your card silently struggling with TLP retransmits?
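As a rough sketch of what I mean on the Linux side (assuming a kernel new enough to expose the per-device AER counters in sysfs; older kernels only log AER events to the kernel log), something like this will print any non-zero correctable-error counters, which is where bad TLPs and replays show up:

#!/usr/bin/env python3
# Sketch: print non-zero PCIe AER correctable-error counters on Linux.
# Assumes the kernel exposes aer_dev_correctable files in sysfs; if it
# doesn't, fall back to watching the kernel log for AER messages.
from pathlib import Path

for counter_file in sorted(Path("/sys/bus/pci/devices").glob("*/aer_dev_correctable")):
    device = counter_file.parent.name  # e.g. "0000:41:00.0"
    lines = counter_file.read_text().splitlines()
    # Entries like "BadTLP 3" or "Rollover 1" indicate link-level retransmits.
    nonzero = [line for line in lines if line and not line.endswith(" 0")]
    if nonzero:
        print(device, nonzero)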
 

lunadesign

Active Member
Aug 7, 2013
256
34
28
I can understand your apprehension ... but the quantitative trend you "saw" must have been hard to "ignore". Forgive my need to clarify one thing: the two runs (one with the Gen3 connected, and one without it connected) were performed on separate boot-ups? (You didn't perform the without-run, then connect the Gen3 and perform the with-run, all on the same boot-up. An error-generating [but non-fatal] surprise/hot-plug could have a degrading side-effect.)
No worries....good question....it's always OK to ask. :)

Yes -- they were definitely on different boot-ups. NVMe drives are expensive enough that I don't feel comfortable hotplugging them without a hotplug-capable enclosure (hoping to acquire one soon).

Sounds like a reasonable methodology ...
Regardless of the with-Gen3 vs without-Gen3 issue, I'm bothered that 3 concurrent runs of that test incur any total throughput "penalty", let alone ~20% [(60-48)/60].
Agreed. I was a bit surprised too.

The card is completely passive (other than replicating the CLK signal from the x16 to each x4). Since 4x of x4 targets "work", the card works. Regarding the redriver/retimer "boondoggle", see my post [Link] and factor that Gen3 example through the signal budget/deficit #s for Gen3 & Gen4 in the PCISIG slide posted earlier in this thread [Link] and draw your own conclusions. Those redriver/retimer chips are critical, but only in the scenarios that truly require them--highly complex systems that make our (STH) boxes look like pogo-sticks relative to a jetliner. :)
Thanks for the helpful info! I hadn't seen your first link before but I'm thinking that experiment might be tougher to pull off with PCIe 4? It's reassuring to hear you say that redrivers/retimers aren't needed in the typical STH cases.

If you want to do further testing, fio is the right tool; even more so in combination with the new io_uring kernel feature. Unfortunately, I can't offer any assistance with fio because I've never used it. (I do extensive performance testing/analysis for fun -- I retired 20+ yrs ago -- but I've been using the only two progs I've written since 2000 [xft (transfer-test) & skt (seek-test)], so no real motivation to dig into fio.) Mea culpa for the "do as I say, not as I do".
No worries whatsoever. Offering a useful lead doesn't obligate you to be an expert in that area. :)
 

lunadesign

Active Member
Aug 7, 2013
256
34
28
It is actually possible with PCIe for something to appear to work while, under the hood, it's experiencing enough corrupt packets requiring retransmits that performance suffers, and that could be due to cable length or a not-so-great connector or something. Unfortunately, determining whether that's happening isn't easy on Windows, but I understand Linux has some capabilities to report that. This page has some more info: PCIe: Is your card silently struggling with TLP retransmits?
Thanks! This is a really interesting article and the sort of thing (low level errors hidden from the user) that I was concerned about. I noticed the article is from 2012 and wonder if there are any tools nowadays that monitor that status register automatically?
 

UhClem

just another Bozo on the bus
Jun 26, 2012
435
248
43
NH, USA
Thanks! This is a really interesting article and the sort of thing (low level errors hidden from the user) that I was concerned about. I noticed the article is from 2012 and wonder if there are any tools nowadays that monitor that status register automatically?
Note the "update" following the lead paragraph of the article:
Update, 19.10.15: The Linux kernel nowadays has a mechanism for turning AER messages into kernel messages. In fact, they can easily flood the log, as discussed in this post of mine.
It is actually possible with PCIe for something to appear to work ...
Sure. And it could be considered cavalier for me to have stated "if it works, it works" regarding the bifurcating card itself -- sloppy traces/connectors etc. are always a possibility.

In my Franken-PCIe test scenario, I had equivalent, and consistent, performance results (long vs short); and no kernel messages. I'm curious to see someone try something similar, but with Gen4 host/target.
 
  • Like
Reactions: NateS

NateS

Active Member
Apr 19, 2021
159
91
28
Sacramento, CA, US
Thanks! This is a really interesting article and the sort of thing (low level errors hidden from the user) that I was concerned about. I noticed the article is from 2012 and wonder if there are any tools nowadays that monitor that status register automatically?
Yeah, like UhClem pointed out, Linux will drop some messages in the system log if this is happening. I don't know of any equivalent functionality on Windows though.

That said, I haven't used this methodology myself, as at work* we would generally use (very expensive) hardware PCIe protocol analyzers, or from the drive end we can instrument the firmware, to debug this sort of thing. But I think the linux debug messages should be enough to give you an idea if this is what's happening or not. My gut feeling is that this probably has more to do with the software stack than any hardware problems, so you probably won't find anything, but it makes sense to rule HW out first when it's not too difficult.

* Standard Disclaimer: I work for Intel on SSDs, but I'm not an official spokesperson, and my statements should not be read as official statements by Intel.
 
  • Like
Reactions: TrumanHW

TrumanHW

Active Member
Sep 16, 2018
253
34
28
Retimer as opposed to redriver, assuming you need one. It's not saying you always need a retimer; rather, if you're driving long cables or traces, you need a more complicated, PCIe-protocol-aware retimer instead of a simple protocol-agnostic redriver, as was common with PCIe 3.0.

That website also gives some helpful info for figuring out whether you're likely to need one or not. The end-to-end connection needs to have no more than 36dB of attenuation, and with PCIe 4.0, you're losing about 2.3dB per inch of trace on a standard FR4 PCB. Connectors add a loss of about 1.5dB. So, for a standard AIC, we have 1 connector for the card slot, and let's say two inches to route the traces from the edge of the card to the ASIC, then that means the slot can be a maximum of 13 inches from the CPU. OTOH, if we want to use a NVMe HBA, then we have three or four connectors in the chain (MB to HBA, HBA to cable, cable to drive or backplane, and backplane to drive if a backplane is used), which eats up 6dB of our budget right off the bat. If we assume cables have a similar dB loss per inch as a PCB (which is probably not a great assumption, but I can't find a good number for cables), then we have 13 inches available for CPU->slot, slot->cable connector, and cable->(backplane)->drive. Anything beyond that would need a retimer, which would reset our budget to the full 36dB at the point in the chain where it's inserted.

For PCIe 5 these numbers are much worse, & I expect we'll start seeing optical PCIe links become more common when PCIe 5 HBAs are.
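To make that budget easy to rerun with other layouts (or with worse Gen5 numbers), here's a quick back-of-the-envelope sketch using the rough Gen4 figures above -- ~36 dB budget, ~2.3 dB per inch, ~1.5 dB per connector; estimates, not measured values:

# Back-of-the-envelope PCIe Gen4 loss-budget check, using the rough figures
# above: ~36 dB total budget, ~2.3 dB per inch of FR4 trace (and, very
# roughly, per inch of cable), ~1.5 dB per connector. Estimates only.
BUDGET_DB = 36.0
DB_PER_INCH = 2.3
DB_PER_CONNECTOR = 1.5

def remaining_inches(connectors: int, fixed_inches: float = 0.0) -> float:
    """Inches of trace/cable left after connectors and fixed routing."""
    left = BUDGET_DB - connectors * DB_PER_CONNECTOR - fixed_inches * DB_PER_INCH
    return left / DB_PER_INCH

# Plain add-in card: 1 slot connector + ~2" of routing on the card itself.
print(f"AIC: ~{remaining_inches(1, 2.0):.0f} in of CPU-to-slot trace")      # ~13 in

# NVMe HBA chain: 4 connectors eat ~6 dB of the budget up front.
print(f"HBA chain: ~{remaining_inches(4):.0f} in of total trace + cable")   # ~13 in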

What a SPECTACULAR FIND to have read your post. THANK YOU. I had no idea WTF ReTimer vs ReDriver meant before ... though I was going to buy the card below (which I realize is PCIe 3.0, and thus germane to the thread):

SuperMicro ReTimer -- AOC-SLG3-4E4T

That said (and I apologize), I've been on a quest for an answer for about a month, posting questions here & on the TrueNAS forums, as to whether I can or cannot connect 10-12 NVMe (x4) SSDs -- and if not, despite the 80 lanes provided (assuming I use cards in slots with enough lanes), what the hell the limiting factor is. The only things I know of to even think about are CPU lanes vs physical slot mapping, and the QPI (which I'd LOVE to somehow be the "bottleneck", given each link is "limited" to 16GB/s at the slowest).

The unit is a Dell PowerEdge R730XD:
• 2P -- E5-2600 v3 or v4
• 10x - 12x NVMe (x4) drives...

Other than the NVMe drives, the SLOG, and a Fusion Pool mirror, the only add-in card is an SFP+ NIC.
(If it makes a difference: the unit I ordered maps lanes to a mezzanine slot but doesn't include said SFP+ card, so I'll just order an SFP+ for it.)

The unit has two candidate PCIe layouts (which I won't know until I receive it, unless the seller sends me the service tag):

Option A -- PCIe 3.0 Slots:
  • (1x) x16
  • (6x) x8
Option B -- PCIe 3.0 Slots:
  • (2x) x16
  • (5x) x8

Even option A (only 1x x16 slot), with (1) x16 card + (3) x8 HBA cards, gives 40 lanes & 10x NVMe drives,
which I thought would still leave 3 x8 slots free -- worst-case scenario.

Of those, I'd use (2) x8 slots for mirrored AIC Optanes as the Fusion Pool mirror (for metadata),
and I'd still have plenty of DIMM slots available for a 16GB - 32GB SLOG using NV-DIMMs ... (would 16GB be adequate?)

Dell didn't sell this with more than 4 NVMe drives -- despite their outrageous pricing ...
Intel also claims you can only use this config with up to 4 drives ...

But I've used a HighPoint x16 card with my i7-8700K, which has a total of 16 PCIe lanes ...
and that was with a mirrored pair of NVMe drives on the motherboard for the boot drive, a data recovery (PC-3000) card, and an SFP+ NIC.

I'm just lost as to why everyone (Dell, Intel, etc) shits on the idea of populating it with more NVMe drives ... with 80 lanes??

THANK YOU!
 

TrumanHW

Active Member
Sep 16, 2018
253
34
28
PS -- I'm not sure anyone here would use a HighPoint NVMe device (as they claim they're 'RAID' cards) ...

But they just came out with an x16 PCIe 4.0 card with 4x NVMe (for 8x U.2 x4 drives) ...

I think it retails for about $1000. As for the SSD 7120 cards, which are marketed as 'RAID cards': when I contacted HighPoint (I own a couple of them from before I learned about the ReTimer card), they said they'd work as HBAs within FreeNAS. I haven't tried it, but I MIGHT, if for no other reason than to check whether I can make 8-12 NVMe drives work on the damned Dell -- though I'm HOPING that it's BS to think it wouldn't, and if it wouldn't work properly (artificially slowed), I'd love to learn why so I can make better decisions.

¯\_(ツ)_/¯

PS -- the OP who found this (QIP4X-PCIE16XB03) card is a super sleuth!
Because they don't really exist from US resellers, either.
And I've BEEN searching for x16 HBA options (which are FEW) -- let alone PCIe 4.0 ones -- so I'd definitely like to see more PCIe 4.0 options.

HighPoint SSD 7580


HighPoint SSD 7580.jpg
 

TrumanHW

Active Member
Sep 16, 2018
253
34
28
At work, we'd generally use (very expensive) hardware PCIe protocol analyzers, or instrument the firmware & debug from the drive end.
But the linux debug messages should be enough to give you an idea if this is what's happening or not. My gut feeling is this probably has more to do with the software stack than any hardware problems, so you probably won't find anything, but ruling HW out first (when not too difficult) makes sense.

I work for Intel on SSDs, but am not an official spokesperson. My statements are my own & aren't intel official statements.
That disclaimer makes sense. I was thinking,
"Who TF is this guy whose work has EXPENSIVE PCIe PROTOCOL ANALYZERS..??"
Because I'd imagine even the "CHEAP ONES" are still a half million -- then I saw 'Intel', and for ALLLL the criticisms the world can levy against the former Intel CEO, the product dev pipelines, and the architectural missteps ("bah, mobile phones!") ... the one thing BEYOND REPROACH is Intel's SSD drives. THANK YOU for WHATEVER ROLE you've played in making Optane exist -- not only that they do, but in making everyone else do all they can to make drives worth comparing to Optanes. (Now if only I could afford ONE.)
 

uldise

Active Member
Jul 2, 2020
209
71
28
SuperMicro ReTimer -- AOC-SLG3-4E4T
I have that card and it works very well on my AMD EPYC system.
Each x4 NVMe disk needs four lanes. If you see cards like the HighPoint SSD 7580, they have something like a PCIe switch chip on them. So you should ask yourself what the goal of all this is -- just to connect all the drives and then suffer from a PCIe bottleneck? With only 40 lanes per CPU, it will be very challenging to connect all these drives if you really want 4 lanes to each drive.
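As a rough sketch of that lane arithmetic (dedicated x4 links only -- a switch card like the SSD 7580 changes the picture by funneling everything through a single x16 uplink):

# Rough lane arithmetic for the 10-12 drive plan: each U.2 NVMe drive wants
# four dedicated lanes, and each E5-2600 v3/v4 CPU provides 40 PCIe 3.0 lanes.
# Whether the R730XD's slots can actually be bifurcated is a separate question.
LANES_PER_DRIVE = 4
LANES_PER_CPU = 40
CPUS = 2

for drives in (10, 12):
    needed = drives * LANES_PER_DRIVE
    print(f"{drives} drives x {LANES_PER_DRIVE} lanes = {needed} lanes "
          f"(vs {LANES_PER_CPU} per CPU, {CPUS * LANES_PER_CPU} lanes total)")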
 
  • Like
Reactions: TrumanHW