Supermicro X10DRD-iNTP - NVMe disabled > Weird power failure messages


gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
Hi,

I started to receive weird power failure messages for Slot(12), which seems to be one of the 2x onboard NVMe ports shown below.

[Attachment: X10DRD-iNTP_(1).png]

and

[Attachment: X10DRD-iNTP_(2).png]

The corresponding error messages from syslog are ...
May 27 01:06:32 proxwncs kernel: [ 424.760894] pcieport 0000:00:01.1: pciehp: Slot(12): Button cancel
May 27 01:06:34 proxwncs kernel: [ 426.780888] pcieport 0000:00:01.1: pciehp: Timeout on hotplug command 0x12e1 (issued 2020 msec ago)
May 27 01:06:34 proxwncs kernel: [ 426.780899] pcieport 0000:00:01.1: pciehp: Slot(12): Action canceled due to button press
May 27 01:06:34 proxwncs kernel: [ 426.780906] pcieport 0000:00:01.1: pciehp: Slot(12): Attention button pressed
May 27 01:06:34 proxwncs kernel: [ 426.780909] pcieport 0000:00:01.1: pciehp: Slot(12): Powering off due to button press
The other issue is that if I boot the machine without either port connected and then plug a drive into one port, I receive the same message for Slot(11), no matter which of the two ports (1) or (2) it is plugged into. Researching on the internet gives few hints and no real solutions. Some suggest that one port is unable to reserve enough PCIe lanes. I believe I calculated that with the hardware installed (2x Intel P4510, Dell H330, SanDisk ioDrive2, Mellanox CX3 and an HP 4-port Ethernet card) I should have enough lanes available with both CPU sockets populated. Also, with all the other PCIe cards removed, I still get the same error, so by my logic that is ruled out.
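If you want to double-check the lane math from the OS side rather than on paper, you can read the link width each device actually negotiated. A minimal sketch, assuming lspci from pciutils is installed; the root-port address 00:01.1 comes from the pciehp messages above, everything else would need adjusting to your topology:
# Negotiated vs. maximum link width/speed for every device (run as root for full output)
lspci -vv | grep -E '^[0-9a-f]{2}:|LnkCap:|LnkSta:'
# Just the root port that reports Slot(12) in the log above
lspci -vv -s 00:01.1 | grep -E 'LnkCap|LnkSta|SltCap|SltSta'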

Others suggest a kernel issue. I have not yet tested with another OS besides my Proxmox installation here.

Can this be a hardware failure of the port or of the board itself? Unfortunately, I have never seen this message before. Any hint would be much appreciated.

Thanks
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,322
1,605
113
What is connected to these cables? It looks to me like the devices themselves are triggering these actions.
Have you seen this?
 
  • Like
Reactions: gb00s

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
What is connected to these cables? It looks to me like the devices themselves are triggering these actions.
Have you seen this?
Yes, I found that thread tonight after I opened this one. I will test and try to find a solution from this direction this evening. On the other hand, if you disable PCIe features then you disable ASPM, but ASPM is needed for SR-IOV, which I wanted to use with another multi-port network card. OK, let's see what can be done. Thanks a lot.
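For what it's worth, before deciding what to disable you can check which ASPM policy and per-link states are currently in effect. A small sketch, assuming a stock Proxmox kernel; the endpoint address 02:00.0 is just an example:
# Global ASPM policy the kernel is currently applying
cat /sys/module/pcie_aspm/parameters/policy
# Per-device ASPM capability and current state
lspci -vv -s 02:00.0 | grep -i aspm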
 
  • Like
Reactions: T_Minus

RolloZ170

Well-Known Member
Apr 24, 2016
5,322
1,605
113
Check the cables, at least the power cables of the NVMe devices.
Update the firmware of the NVMe devices.
Update the BIOS.
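For the firmware check, a minimal nvme-cli sketch; the device paths and the firmware file name are examples, and the slot/action values for the commit depend on the drive, so check the vendor's release notes first:
# Current firmware revision of each controller
nvme list
# Firmware slot information for one drive
nvme fw-log /dev/nvme0
# Download and activate a new image (values below are placeholders)
nvme fw-download /dev/nvme0 --fw=new_firmware.bin
nvme fw-commit /dev/nvme0 --slot=0 --action=1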
 

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
So I have 3 machines with this board to set up, and all have the same issue. I chased the issue by playing with kernel messages etc., with nothing to show for it. All machines have NVMes (Intel P4510s), the same cables, etc. I exchanged everything from machine to machine.

Checking syslog, I found this:
May 27 22:33:44 proxwncs kernel: [ 18.467732] blk_update_request: I/O error, dev nvme0n1, sector 3907028992 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
May 27 22:33:44 proxwncs kernel: [ 18.468556] blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 27 22:33:44 proxwncs kernel: [ 18.468563] blk_update_request: I/O error, dev nvme0n1, sector 3907028992 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 27 22:33:44 proxwncs kernel: [ 18.468570] nvme0n1: detected capacity change from 3907029168 to 0
May 27 22:33:44 proxwncs kernel: [ 18.468709] Buffer I/O error on dev nvme0n1, logical block 488378624, async page read
:
:
:
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/nvme1, NVMe Identify Controller failed
May 27 22:34:45 proxwncs smartd[1478]: Monitoring 2 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 76 to 77
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Micron_M500DC_MTFDDAK240MBB-163413C9E9A1.ata.state
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.Micron_M500DC_MTFDDAK240MBB-163413C9EE74.ata.state
May 27 22:34:45 proxwncs systemd[1]: Started Self Monitoring and Reporting Technology (SMART) Daemon.
May 27 22:34:45 proxwncs kernel: [ 79.729376] nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
May 27 22:34:45 proxwncs kernel: [ 79.730118] nvme nvme1: Removing after probe failure status: -19
May 27 22:34:45 proxwncs sensors[1501]: ERROR: Can't get value of subfeature temp1_alarm: I/O error
May 27 22:34:45 proxwncs sensors[1501]: ERROR: Can't get value of subfeature temp1_min: I/O error
May 27 22:34:45 proxwncs sensors[1501]: ERROR: Can't get value of subfeature temp1_max: I/O error
May 27 22:34:45 proxwncs sensors[1501]: power_meter-acpi-0
May 27 22:34:45 proxwncs sensors[1501]: Adapter: ACPI interface
May 27 22:34:45 proxwncs sensors[1501]: power1: 4.29 MW (interval = 4294967.29 s)
:
:
:
May 27 22:34:45 proxwncs kernel: [ 79.749464] nvme1n1: detected capacity change from 3907029168 to 0
:
:
:
May 27 22:34:51 proxwncs kernel: [ 86.220836] pci 0000:02:00.0: Removing from iommu group 56
which directs me towards checking for kernel issues. The OS here is Proxmox, so let's test with another OS/kernel.
 
  • Like
Reactions: T_Minus

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
I investigated the issue with several Linux distributions (CentOS Stream, Debian, Alpine, Arch) and kernel releases, without going into kernel configs specifically. It seems there are some possible issues with:
  1. Energy 'saving' settings in the kernel causing the power issues and shutting down the controller (see the boot-parameter sketch below this list), or
  2. Race conditions in the kernel (resets) causing I/O issues and shutting down the controller, or
  3. Firmware issues in the drive itself.
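To rule out item 1 without rebuilding the kernel, the usual first test is to switch the relevant power management off via boot parameters. A hedged sketch of a GRUB edit on Debian/Proxmox; these are blunt "everything off" test values, not a recommendation:
# /etc/default/grub -- extend the existing line, then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"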
Without rescanning for new controllers, these NVMe drives will never show up. The workaround to bring up both controllers for the 2 NVMe drives is a simple
echo 1 > /sys/bus/pci/rescan
which brings up both controllers
01:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
02:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
and makes both NVMe drives visible/accessible
[root@fed ~]# nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1 /dev/ng1n1 BTLJ039300222P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV1046X
/dev/nvme0n1 /dev/ng0n1 BTLJ818008J52P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV1046X
Unfortunately this workaround does not survive a reboot. Therefore I just added the command to a script and put it in /etc/crontab:
@reboot root /usr/bin/activate_nvme.sh
Adding this as a systemd service did not work. I'm still trying to find the cause of the whole issue: the power-off and the reset/shutdown of the controller.
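For completeness, a minimal sketch of what /usr/bin/activate_nvme.sh could look like; the script body is an assumption since the original was not posted, it simply waits a moment and forces the rescan:
#!/bin/sh
# activate_nvme.sh - hypothetical helper: force a PCI rescan so the two onboard
# NVMe controllers show up after boot
sleep 10                      # give the root ports time to settle after boot
echo 1 > /sys/bus/pci/rescan  # ask the kernel to re-enumerate the PCI bus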
 
  • Like
Reactions: T_Minus

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
UPDATE:

So, a short update from here. After @Rand__ was very helpful here, I may have found one cause of the power failure issue. I was talking to Supermicro and 'Intel' about it. Supermicro was helpful enough to provide unreleased firmware in an unreleased BIOS to get to the bottom of the issue. No success. Intel denied having a power and/or hot-plug issue with P4510s. The P4510s in 2 TB and 4 TB were the only NVMes I had in U.2 format, so nobody could help me. GRUB modifications for pcie_aspm or others just did not help. No matter what I did, I had a power failure issue. I bought different cables. Nothing. So I thought maybe my P4510s just don't play nice with the Supermicro X10DRD-iNTP due to a firmware issue somewhere and somehow. Supermicro suggested that all my boards might be defective.

I got 2 other drives for testing: a Huawei HSSD-D5223PM5D00 (4 TB) and a brand-new Samsung PM1725e (3.2 TB). I disconnected the Intel NVMes and connected the new drives to a new Supermicro cable I bought directly from a Supermicro reseller here in PL. No power issue anymore. Hmmm ... I connected them to the cables the Intel NVMes had been connected to. Oh, the power failure issues were back again. I connected the new Supermicro cables, but again power failures. The issue is/was that I had a Molex Y-cable feeding the cables the NVMes were connected to. I tested 9 different Molex Y-cables. All have the same issue.

At least /var/log/syslog is not flooded anymore with 50k messages per day :rolleyes: :oops:

My question here:

A Molex connector has a power rating of nearly 60 W. Both NVMes together should use 40 W at most. Why are the Molex Y-cables not working here?

EDIT 1: It's not a specific Y-cable that's broken. I have 3 boards now and all have the same issue. Whenever I add Y-cables and connect the NVMes to them, I have these power issues. Different PSUs and power distribution boards: it doesn't matter. So it's a power issue with Y-cables per se.

EDIT 2: It did not come to my mind to just mix the NVMes. So with or without a Y-cable, there is a power issue until .... you mix the NVMes. I mixed Intel with Huawei and voila, no power issue anymore. Both are connected to a Y-cable. It just works.

So why do 2x of the same NVMe model cause power issues?

But the other issue, that the NVMes are only recognized after an
echo 1 > /sys/bus/pci/rescan
still exists while they are connected to these 2 ports right on the board. No matter whether the Intel P4510s or the Huawei or Samsung NVMes are connected, Linux just doesn't see them without that manual push.
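One way to narrow this down would be to compare what the root ports report before and after the manual rescan. A hedged sketch; 00:01.1 is the root port from the pciehp messages earlier, 00:01.0 is only a guess for the second port, and 01:00.0 / 02:00.0 are the endpoint addresses seen above:
# Before the rescan: are the endpoints enumerated at all?
lspci -s 01:00.0; lspci -s 02:00.0
# What the root ports report about their links and slots
lspci -vv -s 00:01.0 | grep -E 'LnkSta|SltSta'
lspci -vv -s 00:01.1 | grep -E 'LnkSta|SltSta'
# Force the rescan and check again
echo 1 > /sys/bus/pci/rescan
lspci -s 01:00.0; lspci -s 02:00.0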

Does somebody have an idea on that issue?
 
Last edited: