Going insane with BPN-SAS3-826EL1-N4


jasonsansone

Member
Sep 15, 2020
97
52
18
I recently upgraded from a BPN-SAS3-826EL1 to a BPN-SAS3-826EL1-N4 for the four NVMe slots. The backplane is connected to a Supermicro X11SPL-F motherboard using these cables and these PCIe cards (2x in x8 slots with bifurcation set to x4x4). All four NVMe drive slots are populated, with 2x Intel P4600 and 2x Intel 905P. All four drives are visible in the UEFI BIOS. However, no matter what I do, NVMe slot one promptly drops out under any load after boot. Here is the error from journalctl:

Code:
May 15 19:45:50 maverick kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
May 15 19:45:50 maverick kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
May 15 19:45:50 maverick kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
May 15 19:45:50 maverick kernel: nvme 0000:b5:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 15 19:45:50 maverick kernel: nvme nvme3: Disabling device after reset failure: -19
It "seems" like the backplane is either ejecting the disk or putting it to sleep. That is just conjecture, but the slot itself works fine for SAS/SATA disks. I realize they are electrically different and doesn't rule out a pin problem for the NVMe portion of the slot. If I have been able to get it to work previously if I didn't populate all four slots, which is another reason I don't believe the slot itself is defective.

Things I have tested:
  • Swapped mini-SAS cables with known good and other brands (including SM official).
  • Tested jumper settings on CPU1 or removed altogether as was suggested elsewhere on this forum.
  • Tested a myriad of Linux boot parameters (e.g., pcie_aspm=off and nvme_core.default_ps_max_latency_us=0; applied via GRUB as sketched after this list).
  • Tested other drives in the same slot.
  • Swapped the PCIe card around into different PCIe slots.
  • Swapped around the PCIe cards.
  • Updated all drives' firmware.
  • Motherboard is not on the newest BIOS, but the release notes for 3.9 don't mention anything relevant.
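
For anyone retracing the boot-parameter test, this is roughly how the flags were applied (a minimal sketch assuming the stock GRUB setup on Proxmox/Debian; adjust for systemd-boot or other bootloaders):

Code:
# /etc/default/grub: append both flags to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

# Then regenerate the GRUB config and reboot
update-grub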


Anyone have any suggestions I may not have thought to try?
 

mattventura

Active Member
Nov 9, 2022
448
217
43
There are a lot of random little issues it could be. As you pointed out, the fact that it works fine with SAS/SATA doesn't mean there isn't an issue with one of the NVMe pins. It could also be an issue of trace length: the first NVMe slot is off on its own, whereas the other three are in the same column, so the signal needing to travel further, combined with the fact that you're using an adapter card without a redriver or retimer, could be the problem. Another possibility is that the backplane isn't secured well enough in the chassis, so it's flexing a little and not making good contact with that drive (that slot being off on its own would also point to this).

I'd try:
1. Remove the backplane from the system and find a way to power it and plug the drives in with the backplane not installed. See if you still have the issue.
2. Make absolutely certain it's tightened all the way and not flexing or moving at all when you put the drives in. Also try being a bit more firm when inserting the drives; you want to make sure each one is really in all the way.
3. Try a redriver, retimer, or switch instead of a dumb passthrough (the AOC-SLG3-2E4 is probably the cheapest option because it uses the same connectors you've already got).
4. Try an SFF-8643 -> U.2 cable to bypass the backplane completely.
5. Try a different motherboard/machine.
 

Stankyjawnz

Member
Aug 2, 2017
50
13
8
35
This happened on Linus Tech Tips; it ended up being a CPU that was not fully seated. Can't say for sure that's what is happening here, but I thought I would mention it.
 

CyklonDX

Well-Known Member
Nov 8, 2022
859
283
63
Can you see the disks from Windows? (Let's eliminate the OS as an issue.)
Does the NVMe/U.2 disk work? How about heat? Check its SMART log.
// Can you connect that NVMe/U.2 drive locally, without the backplane, to exclude a backplane issue?
 

ano

Well-Known Member
Nov 7, 2022
655
273
63
BIOS is a lower level than Windows; you should be able to see it there if it works.

Also, those backplane trays: guessing it's a 3.5" to 2.5" setup? They can be finicky and "not in all the way".
 

jasonsansone

Member
Sep 15, 2020
97
52
18
I appreciate everyone's thoughts and feedback.

Can you see the disks from Windows? (Let's eliminate the OS as an issue.)
System is Proxmox (Debian Bullseye). 0.00% chance I will be testing on a Microsoft product.

I can see all disks using nvme list, and this drive (nvme3) remains listed even after the dropout. The other three drives are visible as well.

Does the NVMe/U.2 disk work? How about heat? Check its SMART log.
I have tested the slot by rearranging disks and testing other known-good disks. The problem is isolated to the slot, not the drive. This happens upon boot, even from a cold boot, so I doubt an Optane drive overheated instantly.
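
For reference, heat is quick to rule out with nvme-cli while the controller is still enumerated (a minimal example; the device name will differ per system):

Code:
# Pull the SMART/health log; "temperature" is the composite sensor reading
nvme smart-log /dev/nvme3 | grep -i temperature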

BIOS is a lower level than Windows; you should be able to see it there if it works.
All drives in all four slots, including the problematic one, are visible in BIOS.

Also, those backplane trays: guessing it's a 3.5" to 2.5" setup? They can be finicky and "not in all the way".
I have reseated and rearranged a few dozen times. However, it doesn't make much sense that the drive can be detected in the BIOS and through the Linux kernel boot if it isn't seated properly. The problem is reproducible and repeatable: shortly after boot, during the ZFS mirror resilver, the drive drops out (see the log in the OP, and the sketch below for catching it live).
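
If anyone wants to see it happen in real time, the dropout shows up while watching the pool during the resilver (assuming a ZFS setup like mine; zpool status is standard ZFS tooling):

Code:
# Refresh pool status every 5 seconds; the failing vdev flips from ONLINE once the drive drops
watch -n 5 zpool status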
 

ano

Well-Known Member
Nov 7, 2022
655
273
63
Do the lights turn on? At the same time? One after another? Instantly?
 

CyklonDX

Well-Known Member
Nov 8, 2022
859
283
63
I have tested the slot by rearranging disks and testing other known-good disks. The problem is isolated to the slot, not the drive. This happens upon boot, even from a cold boot, so I doubt an Optane drive overheated instantly.
How many devices are on that slot? What's their wattage? Potentially the slot doesn't give you the full 75 W.

Have you tried pci=nocrs?
 

jasonsansone

Member
Sep 15, 2020
97
52
18
Have you tried pci=nocrs?
Didn't help. Power comes from the backplane, not the PCIe bus.

Code:
May 16 17:53:41 maverick kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
May 16 17:53:41 maverick kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
May 16 17:53:41 maverick kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
May 16 17:53:41 maverick kernel: nvme 0000:b5:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 16 17:53:41 maverick kernel: nvme nvme3: Disabling device after reset failure: -19
May 16 17:53:41 maverick kernel: nvme3n1: detected capacity change from 1875385008 to 0
 

jasonsansone

Member
Sep 15, 2020
97
52
18
* MAYBE * solved.

The backplane must be wired so that the drives present to the OS in the proper order: NVMe drives 1-4 must appear as drives 1-4, which is moderately tricky when using multiple adapter cards, given that the backplane cabling is ordered 1, 4, 3, 2. After wiring everything to present correctly as 1-4, I haven't had any issues at all. I had already tested every other fix and kernel flag humanly imaginable. I think the problem is VPP and hot-plug handling when the motherboard doesn't properly recognize which drive is which. One way to sanity-check the ordering is sketched below. I'll report back if this is truly stable or just false hope.
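
To verify the ordering, each Linux NVMe controller can be mapped back to its PCIe address and compared against the backplane slot labels (a rough sketch; paths assume Linux sysfs):

Code:
# Print each NVMe controller alongside the PCIe function it sits behind
for n in /sys/class/nvme/nvme*; do
    printf '%s -> %s\n' "$(basename "$n")" "$(basename "$(readlink -f "$n/device")")"
done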
 

Marraz

New Member
Oct 31, 2023
3
2
3
* MAYBE * solved.

The backplane must be wired so that the drives present to the OS in the proper order: NVMe drives 1-4 must appear as drives 1-4, which is moderately tricky when using multiple adapter cards, given that the backplane cabling is ordered 1, 4, 3, 2. After wiring everything to present correctly as 1-4, I haven't had any issues at all. I had already tested every other fix and kernel flag humanly imaginable. I think the problem is VPP and hot-plug handling when the motherboard doesn't properly recognize which drive is which. I'll report back if this is truly stable or just false hope.
I also have a BPN-SAS3-826EL1-N4, connected to an X10DRH-CT mobo, and checked everything as well but couldn't use more than one drive. Honestly, I was ready to give up, but I decided to look a little longer, found this thread, and gave it a shot. Connecting it in order as you said finally made it all work; looks like that's the secret sauce for this backplane. On that note, sometimes even the BIOS doesn't detect the drives; changing the PCIe port deEmphasis from -6.0 dB to -3.5 dB fixed that for me.
 

ano

Well-Known Member
Nov 7, 2022
655
273
63
I also have a BPN-SAS3-826EL1-N4, connected to an X10DRH-CT mobo, and checked everything as well but couldn't use more than one drive. Honestly, I was ready to give up, but I decided to look a little longer, found this thread, and gave it a shot. Connecting it in order as you said finally made it all work; looks like that's the secret sauce for this backplane. On that note, sometimes even the BIOS doesn't detect the drives; changing the PCIe port deEmphasis from -6.0 dB to -3.5 dB fixed that for me.
Do you have a screenshot of where this would be on a Supermicro board?
 

Marraz

New Member
Oct 31, 2023
3
2
3
Do you have a screenshot of where this would be on a Supermicro board?
Sure, it's under Advanced / Chipset Configuration / North Bridge / IIO Configuration / IIO# Configuration / Socket # PCIe #### - Port ##.

[Screenshot: Supermicro BIOS IIO PCIe port settings page showing the PCIe Port deEmphasis option]

As you can see, the link is also stuck at Gen 2 speed, which is probably a bad connection or a bad cable. Now that this is working, I may buy new cables to see if that gets me full Gen 3, but honestly, just having the drives reliably working is good enough for now. (The negotiated link speed can also be checked from the OS, as sketched below.)
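
For checking the negotiated link speed from Linux instead of the BIOS screen (a quick sketch; substitute your drive's PCIe address, e.g. the 0000:b5:00.0 from the logs earlier in this thread):

Code:
# LnkCap = what the device supports; LnkSta = what was actually negotiated
lspci -vv -s b5:00.0 | grep -E 'LnkCap:|LnkSta:'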
 