X9DRi and NVMe drives


azev

Well-Known Member
Jan 18, 2013
[Attached: CVFT54710015800HGNJ_2.JPG]
I am having some weird system behavior when deploying NVMe on a Supermicro X9DRi-LN4F motherboard.
I got a few U.2 P3700 NVMe drives from eBay and for some reason I cannot get them all to work at the same time.
All 4 drives are installed on known-working U.2-to-PCIe adapters that I pulled from an older system that is about to be retired. During today's testing it appeared that one drive would only work in a specific slot. If I moved a drive around among the 6 slots, sometimes it would completely disappear from the system and sometimes it would show up with an exclamation mark ("the device cannot start", Code 10).
Intel SSD Toolbox shows health 0 (red), and the only way to recover a drive is to put it back in the slot where it worked before.
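In case it helps anyone reproduce this, a rough Python sketch like the one below should list every device currently stuck in that Code 10 state; it just wraps the stock wmic query against Win32_PnPEntity (ConfigManagerErrorCode 10 is the "device cannot start" code), and nothing in it is specific to the X9DRi or these drives.

Code:
# Rough sketch: list Windows PnP devices currently reporting Code 10
# ("this device cannot start") via the classic wmic tool.
import subprocess

def devices_with_code_10():
    out = subprocess.run(
        ["wmic", "path", "Win32_PnPEntity",
         "where", "ConfigManagerErrorCode=10",
         "get", "Name,DeviceID"],
        capture_output=True, text=True,
    ).stdout
    # First line is the column header; the rest are the failing devices.
    return [line.strip() for line in out.splitlines()[1:] if line.strip()]

if __name__ == "__main__":
    for dev in devices_with_code_10():
        print(dev)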

Unfortunately I do not have another system to test this on, but I have never seen this weird behavior in the past when working with similar X9 motherboards.

Has anyone experienced a similar problem before?
 

RageBone

Active Member
Jul 11, 2017
How many CPUs are installed?
Only 3 of the 6 PCIe slots are wired to CPU0 and the other 3 to CPU1; it should be labeled on the board.

If you only have one CPU installed, that would explain why only 3 slots work.
The Error 10, and the fact that a drive works in some slots but not others, may come down to BIOS settings; I suspect the OptionROM settings.

Driver-wise, Supermicro and this platform shouldn't support booting from NVMe without a BIOS mod.
 

gea

Well-Known Member
Dec 31, 2010
For more than one M.2/U.2 drive on a single PCIe adapter, bifurcation support is required.

Check your BIOS to see whether, and on which slots, bifurcation can be enabled.
Otherwise you need a PCIe adapter with its own bifurcation support (PLX chip).
 

azev

Well-Known Member
Jan 18, 2013
@RageBone Both CPUs are installed, so all 6 PCIe slots should be connected back to the CPUs. I looked for OptionROM settings in the BIOS, but it does not look like there are any options I can change. The Error 10 issue is resolved by switching to the Intel driver; however, as you can see, the isdct tool shows the status of the drive as xassert. I am not sure what that means, but moving the drives between PCIe slots sometimes fixes it.

@gea Since all 4 drives are connected to individual PCIe slots, I assume that bifurcation is not really needed. The only option in the BIOS regarding the PCIe slots is to select the PCIe generation (Gen3 or Gen2).

After further testing I noticed that 2 of the drives work in most of the slots, while 1 of them is very picky about which slot it will work in.
The last drive I still cannot get to work at all; it either shows up in xassert mode or does not show up at all.
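For reference, the xassert status comes from Intel's command-line Data Center Tool. A minimal sketch along these lines (assuming isdct is installed and on the PATH, and that the drive indexes run 0-3 the way "isdct show -intelssd" lists them) dumps the listing plus per-drive SMART so the state can be compared after each slot swap:

Code:
# Minimal sketch: dump what Intel's Data Center Tool (isdct) reports for each drive.
# Assumes isdct is installed and on the PATH; field names in its output vary a bit
# between isdct versions, so this just prints the raw listing and per-drive SMART.
import subprocess

def run(args):
    return subprocess.run(args, capture_output=True, text=True).stdout

if __name__ == "__main__":
    # "isdct show -intelssd" lists all Intel NVMe drives with their index numbers.
    print(run(["isdct", "show", "-intelssd"]))
    # Per-drive SMART attributes; indexes 0-3 are assumed for the four P3700s here.
    for index in range(4):
        print(run(["isdct", "show", "-smart", "-intelssd", str(index)]))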

I will see if I can test this on a different system later tonight and check whether the behavior is the same.
 

gea

Well-Known Member
Dec 31, 2010
Yes, I assumed a single PCIe adapter hosting several M.2/U.2 NVMe drives.
One PCIe adapter per NVMe drive should simply work.
 

acesea

New Member
Oct 7, 2011
We've also observed strange behavior with some of our Intel P36xx and P3700 drives, both U.2 and PCIe add-in cards, on Supermicro X9 (including X9DRi-LN4F) and X10 dual-socket boards: infrequently on boot they won't show in the BIOS, or won't show in the ESXi hardware list, or will show in the ESXi hardware list but won't be recognized as storage.

All the hardware has the newest firmware and BIOS. When testing system configurations to find a stable config we reboot several times; if the problem persists we switch to another PCIe slot, and so on, and we haven't found a pattern to the strange behavior. The problem is especially troubling when, for example, updates that require a reboot cause ESXi to restart and an NVMe drive is sometimes missing afterwards.

Our sample size is small, only a few dozen boards and several dozen NVMe drives, and we haven't begun tracking and recording problems per serialized NVMe drive. Usually the troubleshooting above resolves the problem, and on most further reboots the NVMe drives come online fine. But some particular motherboards and NVMe drives are more routinely problematic, and that hardware remains in the lab. The behavior is simply something we've noticed across several different boards and drives; once a drive is successfully recognized in ESXi or the OS, there are no further issues.
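If it is useful to anyone chasing the same thing, a rough sketch like the one below, run after each boot, would at least record which controllers the host saw so missing-NVMe boots can be spotted later. It assumes ESXi's bundled Python interpreter and the standard "esxcli nvme device list" / "esxcli storage core adapter list" listings; the log path is only an example.

Code:
# Rough sketch for ESXi: append what the host currently sees to a log on a
# persistent datastore, so boots where an NVMe controller went missing can be
# spotted later. Assumes ESXi's bundled Python; the log path is only an example.
import subprocess, time

LOG = "/vmfs/volumes/datastore1/nvme-boot-log.txt"  # example path, point at a real datastore

def esxcli(*args):
    return subprocess.check_output(["esxcli"] + list(args), universal_newlines=True)

with open(LOG, "a") as f:
    f.write("==== boot check %s ====\n" % time.ctime())
    f.write(esxcli("nvme", "device", "list"))               # NVMe controllers ESXi detected
    f.write(esxcli("storage", "core", "adapter", "list"))   # all storage adapters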
 

azev

Well-Known Member
Jan 18, 2013
@acesea I think I am experiencing similar issues with a pair of the NVMe drives. I have 2 that seem to work in most slots, but the other 2 are problematic. One of them I can get working by trying different slots, but the other stays in xassert mode whenever it is detected, no matter which PCIe slot it is in. I opened a ticket with Intel just to see what they say. When the drive was detected and working, SMART data showed it as healthy with 100% life remaining. It only has around 16K hours of usage and around 100 TB of writes.

This is definitely a very annoying problem :(
 

Shawn Arcus

New Member
Nov 21, 2018
<< Since all 4 drives are connected to individual PCIe slots, I assume that bifurcation is not really needed. The only option in the BIOS regarding the PCIe slots is to select the PCIe generation (Gen3 or Gen2). >>

No, that would require bifurcation, or a card that has a PLX chip on it.
Upgrade to the most recent BIOS for that motherboard (v3.3) and bifurcation becomes available, under CPU settings.
 

azev

Well-Known Member
Jan 18, 2013
@Shawn Arcus I didn't realize Supermicro had released a new BIOS for this platform. I will try updating the BIOS over the weekend and see if that fixes the issue.

I opened a ticket with Intel and they almost immediately offered to RMA the one drive that is undetectable.
 

azev

Well-Known Member
Jan 18, 2013
[Attached: 2018-12-19.png]
So today I decided to add a new HBA to the server and move the 3x P3700 NVMe drives to different PCIe slots.
Before I did this I deleted all the pools just in case, and also upgraded FreeNAS from 11.1-U6 to 11.2.
Initially everything seemed to work well; I recreated the pool and zvol and connected my VMware cluster to the new FreeNAS.
During a Storage vMotion back to the NVMe pool the system crashed and rebooted, and during the reboot it got stuck initializing one of the NVMe drives.
So I pulled all the NVMe drives out, put them in my Windows test machine, and guess what: it looks like another drive has bitten the dust.
Another barely used P3700 died in the same server with a similar error (attached). I have not even received the replacement for the first drive that failed, and another one has decided to take a nosedive as well. I am not sure what is going on here; I am curious whether these cheap U.2-to-PCIe converters are the culprit. The same adapters were used with Samsung/Oracle NVMe drives for over 6 months with no problems before I decided to replace them with the Intels, hoping to squeeze a bit more IOPS out of them.
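If anyone wants to watch the remaining drives from the FreeNAS shell, something like this minimal sketch will dump what the kernel sees plus each controller's health log. It uses FreeBSD's stock nvmecontrol utility; the device names nvme0-nvme2 are just examples, and log page 2 is the NVMe SMART/health page.

Code:
# Minimal sketch for the FreeNAS (FreeBSD) shell: list the NVMe controllers the
# kernel sees, then dump each one's SMART/health log (NVMe log page 2).
# Uses FreeBSD's stock nvmecontrol; the device names below are only examples.
import subprocess

def sh(args):
    return subprocess.check_output(args, universal_newlines=True)

if __name__ == "__main__":
    print(sh(["nvmecontrol", "devlist"]))            # e.g. nvme0, nvme1, ... plus namespaces
    for dev in ("nvme0", "nvme1", "nvme2"):          # adjust to whatever devlist shows
        print(sh(["nvmecontrol", "logpage", "-p", "2", dev]))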

Has anyone experienced something similar before? It seems the Intel P3700 NVMe is not very reliable for some reason.
It makes me nervous to continue using NVMe with this server; maybe I should just use regular SSDs instead.