Supermicro X10DRD-iNTP - NVMe disabled > Weird power failure messages


gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
Hi,

I started to receive weird power failure messages for Slot(12), which seems to be one of the 2x onboard NVMe ports shown below.

[Attachment: X10DRD-iNTP_(1).png]

and

[Attachment: X10DRD-iNTP_(2).png]

The corresponding error messages from syslog are ...
May 27 01:06:32 proxwncs kernel: [ 424.760894] pcieport 0000:00:01.1: pciehp: Slot(12): Button cancel
May 27 01:06:34 proxwncs kernel: [ 426.780888] pcieport 0000:00:01.1: pciehp: Timeout on hotplug command 0x12e1 (issued 2020 msec ago)
May 27 01:06:34 proxwncs kernel: [ 426.780899] pcieport 0000:00:01.1: pciehp: Slot(12): Action canceled due to button press
May 27 01:06:34 proxwncs kernel: [ 426.780906] pcieport 0000:00:01.1: pciehp: Slot(12): Attention button pressed
May 27 01:06:34 proxwncs kernel: [ 426.780909] pcieport 0000:00:01.1: pciehp: Slot(12): Powering off due to button press
The other issue is that if I boot the machine without either port connected and then plug a drive into one port, I receive the same message for Slot(11), no matter which of the two ports (1) or (2) it is plugged into. Researching on the internet gives few hints and no real solutions. Some suggest that one port is unable to reserve enough PCIe lanes. I believe I calculated that with the hardware installed (2x Intel P4510, Dell H330, SanDisk ioDrive2, Mellanox CX3 and an HP 4-port Ethernet card) I should have enough lanes available with both CPU sockets populated. Also, with all the other PCIe cards removed, I still get the same error, so by my logic that is ruled out.
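If you want to double-check the lane math from the OS side rather than on paper, you can read the link width each device actually negotiated. A minimal sketch, assuming lspci from pciutils is installed; the root-port address 00:01.1 comes from the pciehp messages above, everything else would need adjusting to your topology:
# Negotiated vs. maximum link width/speed for every device (run as root for full output)
lspci -vv | grep -E '^[0-9a-f]{2}:|LnkCap:|LnkSta:'
# Just the root port that reports Slot(12) in the log above
lspci -vv -s 00:01.1 | grep -E 'LnkCap|LnkSta|SltCap|SltSta'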

Others suggest a kernel issue. I have not yet tested with another OS besides my Proxmox installation here.

Can this be a hardware failure of the port or of the board itself? Unfortunately, I have never seen this message before. Any hint would be much appreciated.

Thanks
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,322
1,605
113
What is connected to these cables? It looks to me like the devices themselves are triggering these actions.
Have you seen this?
 
  • Like
Reactions: gb00s

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
What is connected to these cables? It looks to me like the devices themselves are triggering these actions.
Have you seen this?
Yes, I found that thread tonight after I opened this one. I will test and try to find a solution from this direction this evening. On the other hand, if you disable PCIe features then you disable ASPM, but ASPM is needed for SR-IOV, which I wanted to use with another multi-port network card. OK, let's see what can be done. Thanks a lot.
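For what it's worth, before deciding what to disable you can check which ASPM policy and per-link states are currently in effect. A small sketch, assuming a stock Proxmox kernel; the endpoint address 02:00.0 is just an example:
# Global ASPM policy the kernel is currently applying
cat /sys/module/pcie_aspm/parameters/policy
# Per-device ASPM capability and current state
lspci -vv -s 02:00.0 | grep -i aspm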
 
  • Like
Reactions: T_Minus

RolloZ170

Well-Known Member
Apr 24, 2016
5,322
1,605
113
Check the cables, at least the power cables of the NVMe devices.
Update the firmware of the NVMe devices.
Update the BIOS.
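For the firmware check, a minimal nvme-cli sketch; the device paths and the firmware file name are examples, and the slot/action values for the commit depend on the drive, so check the vendor's release notes first:
# Current firmware revision of each controller
nvme list
# Firmware slot information for one drive
nvme fw-log /dev/nvme0
# Download and activate a new image (values below are placeholders)
nvme fw-download /dev/nvme0 --fw=new_firmware.bin
nvme fw-commit /dev/nvme0 --slot=0 --action=1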
 

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
So I have 3 machines with this board to set up, and all have the same issue. I chased the issue by playing with kernel messages etc., with nothing to show for it. All machines have NVMes (Intel P4510s), the same cables, etc. I exchanged everything from machine to machine.

Checking syslog, I found this:
May 27 22:33:44 proxwncs kernel: [ 18.467732] blk_update_request: I/O error, dev nvme0n1, sector 3907028992 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
May 27 22:33:44 proxwncs kernel: [ 18.468556] blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 27 22:33:44 proxwncs kernel: [ 18.468563] blk_update_request: I/O error, dev nvme0n1, sector 3907028992 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 27 22:33:44 proxwncs kernel: [ 18.468570] nvme0n1: detected capacity change from 3907029168 to 0
May 27 22:33:44 proxwncs kernel: [ 18.468709] Buffer I/O error on dev nvme0n1, logical block 488378624, async page read
:
:
:
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/nvme1, NVMe Identify Controller failed
May 27 22:34:45 proxwncs smartd[1478]: Monitoring 2 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 76 to 77
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Micron_M500DC_MTFDDAK240MBB-163413C9E9A1.ata.state
May 27 22:34:45 proxwncs smartd[1478]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.Micron_M500DC_MTFDDAK240MBB-163413C9EE74.ata.state
May 27 22:34:45 proxwncs systemd[1]: Started Self Monitoring and Reporting Technology (SMART) Daemon.
May 27 22:34:45 proxwncs kernel: [ 79.729376] nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
May 27 22:34:45 proxwncs kernel: [ 79.730118] nvme nvme1: Removing after probe failure status: -19
May 27 22:34:45 proxwncs sensors[1501]: ERROR: Can't get value of subfeature temp1_alarm: I/O error
May 27 22:34:45 proxwncs sensors[1501]: ERROR: Can't get value of subfeature temp1_min: I/O error
May 27 22:34:45 proxwncs sensors[1501]: ERROR: Can't get value of subfeature temp1_max: I/O error
May 27 22:34:45 proxwncs sensors[1501]: power_meter-acpi-0
May 27 22:34:45 proxwncs sensors[1501]: Adapter: ACPI interface
May 27 22:34:45 proxwncs sensors[1501]: power1: 4.29 MW (interval = 4294967.29 s)
:
:
:
May 27 22:34:45 proxwncs kernel: [ 79.749464] nvme1n1: detected capacity change from 3907029168 to 0
:
:
:
May 27 22:34:51 proxwncs kernel: [ 86.220836] pci 0000:02:00.0: Removing from iommu group 56
which directs me towards checking for kernel issues. The OS here is Proxmox, so let's test with another OS/kernel.
 
  • Like
Reactions: T_Minus

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
I investigated the issue with several Linux distributions (CentOS Stream, Debian, Alpine, Arch) and kernel releases, without going into kernel configs specifically. It seems there are some possible issues with:
  1. Energy 'saving' settings in the kernel causing the power issues and shutting down the controller (see the boot-parameter sketch below this list), or
  2. Race conditions in the kernel (resets) causing I/O issues and shutting down the controller, or
  3. Firmware issues in the drive itself.
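To rule out item 1 without rebuilding the kernel, the usual first test is to switch the relevant power management off via boot parameters. A hedged sketch of a GRUB edit on Debian/Proxmox; these are blunt "everything off" test values, not a recommendation:
# /etc/default/grub -- extend the existing line, then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"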
Without rescanning for new controllers, these NVMe drives will never show up. The workaround to bring up both controllers for the 2 NVMe drives is a simple
echo 1 > /sys/bus/pci/rescan
which brings up both controllers
01:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
02:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
and makes both NVMe drives visible/accessible
[root@fed ~]# nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1 /dev/ng1n1 BTLJ039300222P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV1046X
/dev/nvme0n1 /dev/ng0n1 BTLJ818008J52P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV1046X
Unfortunately this workaround does not survive a reboot. Therefore I just added the command to a script and put it in /etc/crontab:
@reboot root /usr/bin/activate_nvme.sh
Adding this as a systemd service did not work. I'm still trying to find the cause of the whole issue: the power-off and the reset/shutdown of the controller.
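For completeness, a minimal sketch of what /usr/bin/activate_nvme.sh could look like; the script body is an assumption since the original was not posted, it simply waits a moment and forces the rescan:
#!/bin/sh
# activate_nvme.sh - hypothetical helper: force a PCI rescan so the two onboard
# NVMe controllers show up after boot
sleep 10                      # give the root ports time to settle after boot
echo 1 > /sys/bus/pci/rescan  # ask the kernel to re-enumerate the PCI bus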
 
  • Like
Reactions: T_Minus

gb00s

Well-Known Member
Jul 25, 2018
1,188
600
113
Poland
UPDATE:

So, a short update from here. After @Rand__ was very helpful here, I may have found one cause of the power failure issue. I was talking to Supermicro and 'Intel' about it. Supermicro was helpful enough to provide unreleased firmware in an unreleased BIOS to get to the bottom of the issue. No success. Intel denied having a power and/or hot-plug issue with P4510s. The P4510s in 2 TB and 4 TB were the only NVMes I had in U.2 format, so nobody could help me. GRUB modifications for pcie_aspm or others just did not help. No matter what I did, I had a power failure issue. I bought different cables. Nothing. So I thought maybe my P4510s just don't play nice with the Supermicro X10DRD-iNTP due to a firmware issue somewhere and somehow. Supermicro suggested that all my boards might be defective.

I got 2 other drives for testing: a Huawei HSSD-D5223PM5D00 (4 TB) and a brand-new Samsung PM1725e (3.2 TB). I disconnected the Intel NVMes and connected the new drives to a new Supermicro cable I bought directly from a Supermicro reseller here in PL. No power issue anymore. Hmmm ... I connected them to the cables the Intel NVMes had been connected to. Oh, the power failure issues were back again. I connected the new Supermicro cables, but again power failures. The issue is/was that I had a Molex Y-cable feeding the cables the NVMes were connected to. I tested 9 different Molex Y-cables. All have the same issue.

At least /var/log/syslog is not flooded anymore with 50k messages per day :rolleyes: :oops:

My question here:

A Molex connector has a power rating of nearly 60 W. Both NVMes together should use 40 W at most. Why are the Molex Y-cables not working here?

EDIT 1: It's not a specific Y-cable that's broken. I have 3 boards now and all have the same issue. Whenever I add Y-cables and connect the NVMes to them, I have these power issues. Different PSUs and power distribution boards: it doesn't matter. So it's a power issue with Y-cables per se.

EDIT 2: It did not come to my mind to just mix the NVMes. So with or without a Y-cable, there is a power issue until .... you mix the NVMes. I mixed Intel with Huawei and voila, no power issue anymore. Both are connected to a Y-cable. It just works.

So why do 2x of the same NVMe model cause power issues?

But the other issue, that the NVMes are only recognized after an
echo 1 > /sys/bus/pci/rescan
still exists while they are connected to these 2 ports right on the board. No matter whether the Intel P4510s or the Huawei or Samsung NVMes are connected, Linux just doesn't see them without that manual push.
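One way to narrow this down would be to compare what the root ports report before and after the manual rescan. A hedged sketch; 00:01.1 is the root port from the pciehp messages earlier, 00:01.0 is only a guess for the second port, and 01:00.0 / 02:00.0 are the endpoint addresses seen above:
# Before the rescan: are the endpoints enumerated at all?
lspci -s 01:00.0; lspci -s 02:00.0
# What the root ports report about their links and slots
lspci -vv -s 00:01.0 | grep -E 'LnkSta|SltSta'
lspci -vv -s 00:01.1 | grep -E 'LnkSta|SltSta'
# Force the rescan and check again
echo 1 > /sys/bus/pci/rescan
lspci -s 01:00.0; lspci -s 02:00.0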

Does somebody have an idea on that issue?
 
Last edited: