Hi everyone,
I have a strange issue with a Supermicro AS-8125GS-TNHR server (HGX H100 platform).
The operating system (Linux) sees all eight Mellanox/NVIDIA ConnectX-7 NICs,
and all PCIe links are up and fully functional. However, the BMC shows only ONE NIC:
the one installed in PCIe Slot 6.
All remaining PCIe NICs have completely disappeared from the BMC (Network / AOC info):
no sensors, no MAC addresses, no temperatures. Only Slot 6 is visible.
What has been tested so far:
– The OS sees ALL NICs, all PCIe links are good.
– The BMC shows only the card in Slot 6.
– An identical server with the same hardware and firmware shows ALL NICs correctly.
– SDR clear → no effect.
– BMC cold reset → no effect.
– Full AC power removal (disconnecting both power cables for 5 minutes) → no effect.
– BIOS version matches the working server.
– BMC firmware matches the working server.
– `mlxconfig` shows that MCTP/SMBus parameters are *missing* for all “invisible” NICs.
– Slot 6 is the ONLY NIC that responds to MCTP (according to mlxconfig), and the only one visible to the BMC.
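
For anyone who wants to reproduce the `mlxconfig` comparison, here is a minimal Python sketch of how the diff between the Slot 6 card and an "invisible" card could be scripted. It assumes the text of `mlxconfig -d <device> query` has already been saved for each card and that parameters appear as indented `NAME   VALUE` lines under a `Configurations:` header. The parameter names in the sample (`EXAMPLE_PARAM_*`) are placeholders, not real mlxconfig parameters.

```python
import re

def parse_mlxconfig_query(text):
    """Return {parameter: value} parsed from saved `mlxconfig ... query` text.

    Assumption: parameters are listed as indented "NAME   VALUE" lines
    after a line that starts with "Configurations:".
    """
    params = {}
    in_config = False
    for line in text.splitlines():
        if line.strip().startswith("Configurations:"):
            in_config = True
            continue
        if not in_config:
            continue  # skip the device header block
        m = re.match(r"^\s+([A-Z0-9_]+)\s+(\S.*?)\s*$", line)
        if m:
            params[m.group(1)] = m.group(2)
    return params

def missing_params(reference, suspect):
    """Names present on the reference card but absent on the suspect card."""
    return sorted(set(reference) - set(suspect))

# Placeholder data: substitute the real saved output of each card.
slot6_text = """
Device type:    ConnectX7
Configurations:                              Next Boot
         EXAMPLE_PARAM_A                     True(1)
         EXAMPLE_PARAM_B                     0x32
         EXAMPLE_PARAM_C                     4
"""
other_text = """
Device type:    ConnectX7
Configurations:                              Next Boot
         EXAMPLE_PARAM_C                     4
"""

print(missing_params(parse_mlxconfig_query(slot6_text),
                     parse_mlxconfig_query(other_text)))
```

Running the same diff against a card from the working server would show whether the parameters are absent from the NIC configuration itself or only from the cards in this chassis.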
This suggests one or more of the following:
– MCTP/SMBus path from PCIe NICs to the BMC works only for Slot 6 (direct I2C connection)
– MCTP/SMBus routing via the PLX/Switchtec PCIe fabric is not working
– the SMBus multiplexer or I2C buffer on the main PCIe switchboard might have failed
– or the I2C link between the main switchboard and the motherboard/BMC is dead
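
To test the multiplexer/buffer hypotheses, one option is to compare `i2cdetect` scans of the same bus on the working and the failing server. The Python sketch below parses the standard `i2cdetect -y <bus>` grid and reports addresses that respond on the good machine but not on the bad one. It assumes the relevant bus is reachable from the host at all, which may not be the case if the sideband bus is wired only to the BMC; the sample scans are made-up placeholders.

```python
import re

def parse_i2cdetect(text):
    """Return the set of 7-bit addresses marked present in an i2cdetect grid.

    Each row is "X0:" followed by up to 16 three-character cells.
    A cell of "--" means no response, blank means not probed, and
    anything else (the address itself or "UU") counts as present.
    """
    found = set()
    for line in text.splitlines():
        m = re.match(r"^([0-7])0:(.*)$", line)
        if not m:
            continue  # header line or unrelated text
        base = int(m.group(1), 16) * 16
        cells = m.group(2)
        for col in range(16):
            cell = cells[3 * col:3 * col + 3].strip()
            if cell and cell != "--":
                found.add(base + col)
    return found

def missing_addresses(good_text, bad_text):
    """Addresses that respond on the good server but not on the bad one."""
    return sorted(parse_i2cdetect(good_text) - parse_i2cdetect(bad_text))

# Placeholder scans: substitute real captures from both servers.
good_scan = "\n".join([
    "     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f",
    "10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --",
    "50: 50 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --",
    "70: -- UU -- -- 74 -- -- --",
])
bad_scan = "\n".join([
    "     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f",
    "10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --",
    "50: 50 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --",
    "70: -- UU -- -- -- -- -- --",
])

print([hex(a) for a in missing_addresses(good_scan, bad_scan)])
```

PCA954x-style multiplexers commonly sit in the 0x70 to 0x77 range, so a device missing there on the failing server would point toward the mux or the cable rather than the NICs themselves.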
**Questions:**
1. Has anyone seen a case where only one PCIe NIC (the direct SMBus slot) shows up in the BMC,
while all NICs behind PLX chips disappear?
2. Is there any OEM way to reset or reinitialize the PLX SMBus/MCTP fabric on HGX systems?
3. Does the AS-8125GS-TNHR have any hidden IPMI raw commands for resetting the MCTP routing table?
(commands like `0x30 66` / `0x30 52` appear unsupported in this firmware)
4. Could this be a failed I2C/SMBus cable between the main switchboard and the motherboard?
(I have left, right, and main switchboards available to test)
5. Should I directly request an RMA for the main switchboard (4×PLX)?
It seems the most likely failure point given the symptoms.
Any advice, debugging steps, or similar experience would be greatly appreciated.
I can provide `i2cdetect`, `lspci`, PLX dumps, MCTP logs, etc.
Thanks in advance!