ESXi 6.5U2 - NVMe disk only shows up IF another disk exists (i.e. SATA or otherwise)


james23

EDIT: Final update - Supermicro support confirmed they are able to reproduce this issue, but will not be fixing it, as the P3700 is "EOL" (it is, but I disagree; for the full story see my last post on page 2). Thanks for all the great replies and help from STH!


So I've come across a weird issue in my new test system (just a test setup, ESXi 6.5U2, host only, no vCenter):

I have a 2.5" P3700 NVMe disk attached to my Supermicro X10-based system (it's direct-attached via an SM riser card, AOC-2UR8N4-i2XT, which has 4x NVMe ports going to an SM NVMe backplane; I don't think this hardware is relevant though).

I had been running the system great for about 2 weeks, with a datastore created directly on the NVMe disk and several VMs running from it. (During most of this time I also had a RAID0 datastore, created but unused, from an LSI 2308 RAID card; at some point I passed this entire card directly through to a VM, so it no longer appears to ESXi.) I had rebooted the ESXi host maybe 3 times and never had any issues.

Yesterday my power flickered, and when it came back my NVMe datastore was gone, and under Storage -> Adapters the NVMe "HBA" wouldn't come up. I tried removing and reinserting the NVMe disk, trying a different NVMe bay, rebooting. Nothing.

However, this entire time I could see the disk via the GUI (Manage -> Hardware -> PCI Devices, see image) and also via lspci over SSH:
0000:08:00.0 Mass storage controller: Intel Corporation DC P3700 SSD [2.5" SFF]

Assuming the flash was fried or something, I booted the system into an Ubuntu live CD, and under the Disks utility, there is the P3700 disk, with its proper 1.6TB VMFS partition intact.

I then updated/patched from 6.5U2 (May 2018) to 6.5U2 (latest patches, ~Nov 1, 2018). Rebooted; still no NVMe.

I attach a random SATA disk (to a motherboard SATA port), format it as a datastore (VMFS), then reboot, and BOOM, the NVMe datastore is back! If I remove the SATA disk and reboot (so that the NVMe is the only disk visible to ESXi), the NVMe again won't appear!

So it seems that as long as I have some kind of other disk attached (and visible to ESXi), my NVMe appears properly.
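(Side note, for anyone following along: the same check can be done from the shell with standard esxcli commands rather than the GUI, e.g.:)

Code:
# standard ESXi shell commands, shown here for reference
esxcli storage core adapter rescan --all   # rescan all storage adapters for devices / VMFS volumes
esxcli storage filesystem list             # mounted VMFS datastores - the NVMe datastore should reappear here when it works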


Any ideas what this is about?

This is how lspci + the hardware tab look now (with the NVMe working, because I have a random SATA disk attached with a datastore on it):
0000:07:00.0 Non-Volatile memory controller: Intel Corporation DC P3700 SSD [2.5" SFF] [vmhba3]

[Attached screenshot: HWtab.JPG]

vs. how lspci + the hardware tab look without the SATA disk attached (and thus the NVMe won't appear). (Note: I added the 08:00 info in manually, as I have this screen up on an offline laptop and can't get the image over to this PC to upload.)

0000:08:00.0 Mass storage controller: Intel Corporation DC P3700 SSD [2.5" SFF]
[Attached screenshot: 08Capture.JPG]
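(For anyone who wants to compare the two states from the shell rather than the GUI, checks like these should show the difference - the device is on the PCI bus either way, but a vmhba bound to the nvme driver only exists in the working state:)

Code:
# standard ESXi shell checks, listed for reference
lspci | grep -i p3700              # PCI device shows up in both the working and broken state
esxcli storage core adapter list   # in the working state a vmhbaX bound to the nvme driver should appear
esxcfg-scsidevs -a                 # older-style equivalent listing of HBAs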


thanks!
 

Rand__

No idea why, but I'd move the VMs off the P3700, wipe it, re-init the datastore, and then see what happens...
 

james23

Thanks. I'm still having this issue, but I have the following new info:
(BTW, my exact Supermicro system is on the VMware HCL for 5.5U3 through 6.7U1 - 6028U-E1CNRT+.)

1. I've fully wiped the drive (both with 'clean all' in 2012 R2, and then afterwards using the Intel SSD Data Center Tool to run this command, with 512-byte sector size):
isdct start -intelssd 0 -NVMeFormat LBAformat=0 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0

2. I've tried fresh installs (i.e. current release ISOs from VMware, no config, just install and boot up) of 6.5U2, 6.5U2 (then patched to the current patches), and 6.7 (no patches).

With either 6.5U2 install, the issue is at its worst: sometimes a reboot or shutdown will get it to show up (but rarely); the only way is by attaching other SATA disks (and even then it's not always immediate, it takes a few reboots / shutdowns).

With 6.7 it's not as bad: if I shut down, then power up, it will be gone; however, if I then reboot it, it will appear. It's only the first boot-up after a shutdown where it's gone. I can reboot 5 times in a row and it will always appear.

(ALWAYS, under either 6.5 or 6.7, it shows up under Manage -> Devices like in my image above.)

I have tried re-flashing / loading defaults of my SM 3.1 BIOS (current, and what I've always been using); no effect.

(The issue never happens with any other OS, i.e. never with Ubuntu, 2012 R2, or Win10; the drive always shows up and is accessible.)

I DID FIND THIS though, WHICH HAS TO BE THE CAUSE / SOURCE (not sure how to address it though):
cat /var/log/*.log | grep 0000:06:00.0
(note: 0000:06:00.0 is the DC P3700 SSD [2.5" SFF] as seen in the Hardware tab of ESXi)

[Attached screenshot: addy.JPG]
Code:
0:00:00:04.597 cpu0:2097152)VMKAcpi: 1098: Handle already exists in hash table for 0000:06:00.0
0:00:00:08.638 cpu0:2097152)PCI: 2161: 0000:06:00.0: Device is disabled by the BIOS, Command register 0x0
0:00:00:08.638 cpu0:2097152)PCI: 478: 0000:06:00.0: PCIe v2 PCI Express Endpoint
0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x1 (Advanced Error Reporting)
0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x2 (Virtual Channel)
0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x4 (Power Budgeting)
0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0xe (Alternative Routing-ID Interpretation)
0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x3 (Device Serial Number)
0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x19 (Secondary PCI Express)
0:00:00:08.638 cpu0:2097152)PCI: 141: Found physical slot 0x7 from ACPI _SUN for 0000:06:00.0
0:00:00:08.638 cpu0:2097152)PCI: 413: 0000:06:00.0: PCIe v2 PCI Express Endpoint
0:00:00:08.638 cpu0:2097152)PCI: 1067: 0000:06:00.0: probing 8086:0953 8086:3703
0:00:00:08.638 cpu0:2097152)PCI: 405: 0000:06:00.0: Adding to resource tracker under parent 0000:00:03.0.
0:00:00:08.638 cpu0:2097152)WARNING: PCI: 453: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) - out of resources on parent: 0000:00:03.0
0:00:00:08.638 cpu0:2097152)WARNING: PCI: 476: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) status: Limit exceeded
0:00:00:08.641 cpu0:2097152)PCI: 1624: 0000:06:00.0 8086:0953 8086:3703 unchanged, done with probe-scan phase already.
0:00:00:08.641 cpu0:2097152)PCI: 1282: 0000:06:00.0: registering 8086:0953 8086:3703
0:00:00:08.641 cpu0:2097152)PCI: 1301: 0000:06:00.0 8086:0953 8086:3703 disabled due to insufficient resources orbecause the device is not supported: Not supported
0:00:00:08.641 cpu0:2097152)WARNING: PCI: 679: 0000:06:00.0: Unable to free BAR[0] (MEM64 f=0x4 0x0-0x4000): Limit exceeded
0:00:00:08.638 cpu0:2097152)WARNING: PCI: 453: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) - out of resources on parent: 0000:00:03.0
0:00:00:08.638 cpu0:2097152)WARNING: PCI: 476: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) status: Limit exceeded
0:00:00:08.641 cpu0:2097152)WARNING: PCI: 679: 0000:06:00.0: Unable to free BAR[0] (MEM64 f=0x4 0x0-0x4000): Limit exceeded
Any ideas? I do have an HGST 2.5" NVMe on the way to test with, but I don't think that will help. I am only running 1x CPU (one socket empty), but none of my other PCIe cards are affected. (I have removed all PCIe cards and 3 of the 4 NVMe-to-riser-card cables, so that it's just this one NVMe I'm testing and nothing else connected.) If I really were out of PCIe lanes, or some other resource, then why does the problem go away after one reboot, and why does it not affect other OSes (bare metal)?
tks
 

Rand__

The error messages very much remind me of earlier attempts with SM risers & NVMe on an X10SDV I had. IIRC I also had intermittent issues, including PSODs and lots of these "BAR" errors. This might very well also have been on 6.5; I'd have to check, but it's been a while.
I had discussed that with SM support back then to no avail - in the end I removed the riser and it worked fine again (of course that made the X10SDV unusable for me, but that's another story).

What riser card is it? Can you attach the drive directly to the board?
 

james23

Crap... that would really be unfortunate if it's this riser / AOC:
AOC-2UR8N4-I2XT

It's what gives me the 2x 10Gbit NICs (so what), but more importantly it provides the 4x NVMe ports that connect to my SM backplane (BPN-SAS3-826EL1-N4, which has 12 bays, 4 of which are NVMe-capable).

This is my system (on the VMware HCL):
Supermicro | Products | SuperServers | 2U | 6028U-E1CNRT+

There has to be something to the fact that it is much more reliable with 6.7 vs. 6.5U2 (and that no other OSes show this issue at all for me).

Would it be worth it for me to try the $300-per-incident VMware support? I'm guessing they are just going to say it's a HW fault and to call SM (which will be a waste, I'm sure).

FWIW, this is the same log as above, but when it works (i.e. when I reboot, vs. shut down and power up).

Code:
cat /var/log/*.log | grep 0000:06:00.0
2019-01-13T03:06:01Z shell[2099499]: [root]: cat /var/log/*.log | grep 0000:06:00.0
0:00:00:04.590 cpu0:2097152)VMKAcpi: 1098: Handle already exists in hash table for 0000:06:00.0
0:00:00:08.631 cpu0:2097152)PCI: 478: 0000:06:00.0: PCIe v2 PCI Express Endpoint
0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x1 (Advanced Error Reporting)
0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x2 (Virtual Channel)
0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x4 (Power Budgeting)
0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0xe (Alternative Routing-ID Interpretation)
0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x3 (Device Serial Number)
0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x19 (Secondary PCI Express)
0:00:00:08.631 cpu0:2097152)PCI: 141: Found physical slot 0x7 from ACPI _SUN for 0000:06:00.0
0:00:00:08.631 cpu0:2097152)PCI: 413: 0000:06:00.0: PCIe v2 PCI Express Endpoint
0:00:00:08.631 cpu0:2097152)PCI: 1067: 0000:06:00.0: probing 8086:0953 8086:3703
0:00:00:08.631 cpu0:2097152)PCI: 405: 0000:06:00.0: Adding to resource tracker under parent 0000:00:03.0.
0:00:00:08.635 cpu0:2097152)PCI: 1624: 0000:06:00.0 8086:0953 8086:3703 unchanged, done with probe-scan phase already.
0:00:00:08.635 cpu0:2097152)PCI: 1282: 0000:06:00.0: registering 8086:0953 8086:3703
2019-01-13T02:56:56.775Z cpu11:2097591)PCI: 1254: 0000:06:00.0 named 'vmhba2' (was '')
2019-01-13T02:56:59.235Z cpu11:2097591)VMK_PCI: 914: device 0000:06:00.0 pciBar 0 bus_addr 0xfb110000 size 0x4000
2019-01-13T02:56:59.235Z cpu11:2097591)VMK_PCI: 764: device 0000:06:00.0 allocated 2 MSIX interrupts
 

Rand__

Since the riser is part of the system and the chassis is on the VMware HCL, you have a fair chance of resolving this with SM, as they would have validated interoperability both with the chassis/board and with ESXi.

This is definitely different from my case (random board + riser); maybe it's pure chance that the issues are similar, but maybe the riser has taken a hit.
It has 2x 10GbE + 4x NVMe + 2x x8 in a x32 slot? Sounds like it might have a PLX chip (or lane sharing), so that might very well be the cause. Probably a timing problem on cold boot; maybe 6.5 is susceptible to that...
 

james23

I just noticed and tested something interesting (and it definitely points to this being an SM issue):

(To answer your question first: I'm not so sure it's using a PLX, as that one AOC is using ALL of CPU1's PCIe lanes (i.e. the 5 other slots / risers elsewhere in the chassis/board are all on CPU2). If you look at the block diagram in the system manual, it does show 40 lanes from CPU1, all going to that one AOC (it does have a crazy long and unique slot it drops into): x16 on the 4x NVMe ports, x8 on an open PCIe slot, x8 on another open PCIe slot, and then x8 on the NIC. I have the 2x 10G version; they have the same card with 4x 10G, but you lose one of the 2 PCIe slots. PLX or not, it may not matter.)



With my plans to run ESXi on here, I've never messed with or looked at the NVMe OpROM settings (they are set to EFI by default).

At some point I noticed that when I went into the BIOS there was now a menu entry with my P3700 info; if I click on it, it shows me some basic info about the drive. This must be for booting from the NVMe, as it's affected by whether you set the NVMe OpROM to Legacy / EFI / Disabled. (See images.)
[Attached screenshots: nvme in SM Capture.JPG, nvmedfdf Capture.JPG]

The great part is that in the BIOS, that entry for the NVMe drive I described above only appears when I reboot the machine! If I shut down the machine, then power it up and go into the BIOS, it's not there. Exactly like the issue in ESXi.

If I get that NVMe entry to show up in the BIOS (and then exit the BIOS "discarding changes", so that it does not cut power), then the drive/datastore will show up in ESXi properly (regardless of 6.5 or 6.7). I just tested it about 6 times (some on 6.5, some on 6.7; so far it's consistent).
 

Rand__

Any other components that seem to be gone (on PCIe slots or NICs)? Does it matter which port the NVMe drive is on?
 

james23

No other issues. The 2x 10G NICs (on that same AOC) are rock solid (they are how I pull up the ESXi web UI each time, so I would notice).
I also had an LSI 2108 RAID card in there, also rock solid (and it was on a PCIe slot on that same AOC).

In all my testing today, I had every other item removed / pulled too (and only 1 of the 4 NVMe cables plugged in).

This only affects the NVMe, and only on a cold boot / cold power-up.

So I'm going to call SM Monday and see what they say. It really seems like a BIOS issue (or something that can be fixed with a BIOS patch/update) - or maybe an older BIOS would help? (It's hard to track down older SM BIOS files sometimes, with no archive on their site.)

Otherwise it looks like I may have to see if there is some way I can script something to reboot ESXi every time it powers up (but only once? rough with a read-only OS).
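(If it comes to that, something along the lines of this rough, untested sketch is what I have in mind: a snippet added to /etc/rc.local.d/local.sh, which ESXi runs at the end of boot, that warm-reboots once if the P3700 never got claimed. The flag file on the bootbank is only there so it can't reboot-loop if the drive is genuinely gone; the grep string comes from the lspci output earlier in the thread.)

Code:
# rough, untested sketch for /etc/rc.local.d/local.sh - not a verified fix
FLAG=/bootbank/nvme_reboot.flag

# when the nvme driver claims the P3700 it shows as "Non-Volatile memory controller";
# when it fails it shows as "Mass storage controller" instead (see lspci output above)
if lspci | grep -q "Non-Volatile memory controller: Intel Corporation DC P3700"; then
    rm -f "$FLAG"      # drive is up - clear the flag so the next cold boot can retry
elif [ ! -e "$FLAG" ]; then
    touch "$FLAG"      # drive missing and no retry done yet - warm reboot once
    reboot
fi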

BTW, in the past 2 hours I have probably rebooted ESXi 30 times, just to confirm my conclusion above, and it holds (6.5 and 6.7).

I did find this (the guy has the same backplane as me, but has 2x 2-port PCIe NVMe input cards, not my AOC setup):
FAQ Entry | Online Support | Support - Super Micro Computer, Inc.
But in my case there is no way to put some of the NVMe plugs on CPU2 (and it shouldn't matter either).

Will update.
 

Rand__

So it does not matter which NVMe port either?
And it shouldn't be a BIOS problem if it occurred after a power issue, unless you reset the BIOS afterwards and potentially configured it differently.
 

james23

No, neither the NVMe port nor the slot/bay matters. It was happening before the power issue; the power issue just caused a power cycle, which caused me to notice it (it had happened before, and I just assumed I had pulled the NVMe while I was moving disks around, but in fact I had never pulled it). I have only had this server for about 40 days now too, BTW.

How could it not be a BIOS issue, if it's not showing up in the BIOS alone (unless it's a warm reboot)? Plus there was that message in the vmkernel log saying the device was disabled by the BIOS (not as important, I think, or maybe the wrong message to focus on).
 

james23

Well, here is some more info:

So I put a PCIe P3700 400GB NVMe into one of the AOC PCIe slots (same/only AOC with the 4x NVMe ports + 2x PCIe slots).
And it DOES show up in the BIOS after a cold startup (while the 2.5" P3700 1.6TB, via the NVMe ports on the same AOC, does NOT show up on a cold startup).
[Attached screenshot: imgCapture.JPG]

Of course, if I reboot (i.e. warm reboot, not a power cycle / cold start), then both will show up in the BIOS.

I do have an HGST 2.5" NVMe on the way this week; I do think it will show the same problem, but it will be interesting to see if it does.

EDIT: You are right, it is wrong of me to assume this is 100% a BIOS issue. It very well could be that the BPN is taking too long to start up / init the 2.5" NVMe on cold start, or something with the AOC on cold start (or it could be the BIOS).

I just now ordered a miniSAS HD to U.2 cable (with power) so I can bypass the backplane and go direct to the AOC.
 

james23

Thanks,

No, I don't currently have a way to run the 2.5" NVMe in a different machine.
I don't think it's the P3700, but it could be. I do have a miniSAS to U.2 cable coming, which will allow me to bypass the backplane.

I also have an HGST 2.5" NVMe on the way (to rule out the P3700).

(and a call in to Supermicro support)

So with the three items above, I'm hoping for a resolution, or at a minimum to ID the exact culprit.

I'm starting to think it's a timing / init timeout thing, like at cold startup the backplane isn't initializing the NVMe quickly enough for the BIOS to catch it (and only ESXi seems to require the drive to be visible to the BIOS/EFI, as opposed to other OSes). When I try a PCIe P3700 (the card, not the 2.5") directly in a PCIe slot, the issue does not exist (which is making me think it's the backplane or something specific in that area).

Keep in mind, this is a pretty unique Supermicro backplane, in that it is a SAS3 expander for all 12 of its 12 bays, but 4 of the 12 bays can be used as either NVMe or SATA/SAS.
(I have of course tried removing all disks and having only the one NVMe attached to the backplane, with no luck.)
 

james23

ANNNNDDDD... the HGST NVMe drive I just got in today, when put in place of the P3700 and cold booted.....



...DOES always show up! So it's unique to the 2.5" P3700! (Or unique to the P3700 + BPN + AOC combo, maybe?)

I say 2.5" above, as my PCIe P3700 works fine (both the 2.5" and PCIe drives have the same, latest Intel firmware; 2.5" = 1.6TB, PCIe = 400GB).
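(For reference, this is roughly how the two drives can be compared with the Intel SSD Data Center Tool; the listing command is standard isdct, though the exact output fields vary a bit between versions:)

Code:
# list all Intel NVMe drives with index, model and firmware (standard isdct command)
isdct show -intelssd

# full property dump for one drive, e.g. index 0 (field names can differ by isdct version)
isdct show -a -intelssd 0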

Good news, I guess...

I did get in a U.2 to SAS HD cable (so I could connect the P3700 directly to one of the SAS HD NVMe "ports" on the AOC, the same 4 ports used for the backplane), but I think something is wrong with the cable or how it powers the drive, because the drive would never show up at all (and it only felt warm in 1 out of 3 tries). It was a cheap $25 cable from Amazon, so I have a better one on the way to test with.

Provided the U.2 cable works, I'll be able to narrow this down a bit more (i.e. I can try powering on the P3700 ~5s before I start up the system, to see if it's all just a timing / startup issue on the P3700's part).

I did find this; maybe related, maybe not. (BTW, there is an entry in the SM BIOS for "delay to wait for DEL key to enter BIOS" - not exactly what is described in the post below, but I did try setting it to 35s; it didn't help the issue.)

Forums

FWIW, SM support is going with "did you check the cables? please reseat the PCIe NVMe disk"...
(i.e. they aren't even reading my problem/ticket) :(
(You will note the new struck-through text in my sig. It pains me to do that, but it's been a rough 8-12 months for me and Supermicro products.)
 

james23

Oh wow, what a coincidence. So if you cold boot, the drive won't show up, but if you warm boot / restart, it does, correct? (Just want to be sure.)

Do you have any other 2.5" NVMes to test against / confirm with?
Also, how are you connected to the 2.5" NVMe? (I see it's via your AOC with 4x NVMe ports, but is the 2.5" drive in a backplane?)

What OS are you using?
(I know my post/thread is long, but I'm pretty sure that with Ubuntu or Windows OSes on bare metal I was NOT having the issue - it was only an issue with ESXi. FYI.)

I'll update when I get that new U.2 cable in, of course, as that should provide some good data, assuming the new cable works.

thanks!
 

sth

I had two attached, which exhibited slightly different behavior... a reset / software reboot would make them more likely to appear than not, but it still wasn't 100% reliable. I messed around with this so much I suspect I might have killed one of them, as one never appears now. I don't have a second system to verify with and ran out of time to debug further.

All my cables, including some spares, were all Intel / Amphenol.

My drives are connected directly to my bifurcated-port HBA card, no backplane in use. It was irrelevant what OS I was using; the BIOS issues were the root cause and are my immediate problem to solve. I don't consider the OS to be relevant to my problems, which are at a more fundamental hardware/BIOS level.

I don't have other NVMe drives for testing, but given this thread I'll order some now.
 

james23

Interesting... in regards to my issue NOT affecting Ubuntu or Windows, I feel almost like you: I have tested this SO much that I may have messed up and was in fact rebooting (and not cold starting / power cycling) into Windows or Ubuntu for (just) that part of my test (the part where I say those OSes are NOT affected by this). I will retest that part again. But I'm pretty sure I have it right, in that those OSes are not affected and my drives always show up there.

I can say this: at one point I did go through every bifurcation option in the BIOS for the port that my AOC plugs into (I say port, as it's a weird, riser-type port on my motherboard; my AOC is different from yours in that it's a riser too, with 2x PCIe slots + the 4x NVMe ports + 2x 10G copper NICs). But on my end, none of the bifurcation options changed anything related to my problem (so I set it back to AUTO, which was the default).

On that P3700 you think might be dead: you may want to try booting ESXi 6.5 and see if it shows up in the Manage -> Devices menu (I have an image posted above). (Mine always shows up there, even when I cold boot.) (However, that page in ESXi may just be the equivalent of lspci or dmidecode output in Linux.)
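(If you boot it into a Linux live CD, a few quick checks like these should tell you whether the "dead" drive still enumerates at all; nvme-cli and smartmontools are assumed to be installed for the last two:)

Code:
# quick checks from a Linux live environment
lspci -nn | grep -i "non-volatile"   # does the controller show up on the PCI bus at all?
ls /dev/nvme*                        # did the kernel create controller / namespace devices?
sudo nvme list                       # model, serial and firmware, if the drive responds (nvme-cli)
sudo smartctl -a /dev/nvme0          # health and error log, if it responds (smartmontools)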

Also (on the "dead" one), maybe see if it feels warm after having the system on for an hour or so. I find that even if these P3700s are idle, they will still get slightly warm to the touch (i.e. as a way to verify it's getting power, or that maybe it really is blown/dead).