Enabling SR-IOV on an HP DL380P G8 makes it "red screen"

thetoad

Active Member
Feb 10, 2021
236
97
28
So, I tried enabling SR-IOV on my HP DL380P G8s and they red screen during POST (somewhere after the thermal test) with the illegal opcode message. Anyone have any idea why this might occur?

Some bad PCIe card that causes problems with SR-IOV? Something else? It's driving me nuts.
 

thetoad

I should be on what I think is the latest BIOS (dated 05/24/2019), or at least the latest that I could find. Actually trying right now.
 

thetoad

What logs should I be looking for? This is what happens when it's doing the power and thermal configuration.

The only way I know to fix this is to pull the machine out, flip system maintenance switch 6, boot it up, turn it off, and flip the switch back; then the red screen is gone.
 

Attachments

thetoad

My only idea is that it's some "supposed" incompatibility with a PCIe card (I do have an Oracle F80 and an Oracle CX3 QSFP card, flashed to stock, in them). I think I'm going to experiment by pulling the server out and removing those cards (heck, I might just pull out both risers as a first test).
 

thetoad

"iLO log can give you hints though."
Which logs is my question; I don't see anything obvious.

Edit: the only entry I see in the Integrated Management Log is "POST Error: 1615-Power Supply Input Failure in Bay 1", and that's because it's not plugged in (or, in my case, I even pulled out the unused power supply).
 

thetoad

And with both risers removed, it didn't red screen. :/ Will try it with the HP dual 10G cards I had in riser 1 (the cards I added were on riser 2).

(Though I should note: it also couldn't boot, as I had the OS installed on the Oracle F80.)
 

thetoad

Worked with the first riser and its two dual-port HP 10GbE Intel cards installed.

Now to try it with the second riser and just the Mellanox CX3 QSFP card:

Edit 1: that failed. So swap it for the F80 and see if that can boot.
 

thetoad

This is interesting: I removed the CX3 card as described above and it boots, but gives this warning

(need to figure out which card that is referring to)
 

Attachments

thetoad

so 2 thoughts

1) https://www.hpe.com/psnow/doc/c04123238?jumpid=in_lit-psnow-red

PCIe slot 3 isn't on the processor but on the chipset (see page 5); perhaps that's the issue (wondering if I should try having the CX3 card in without anything from riser 1, which I don't need).

So strangely, it seems both 10GbE Intel cards come up with VFs, but one comes up with 7 (if that's what I specified) and one only comes up with 5 (per lspci) and spits out 2 errors. Perhaps that's related to the BIOS message I pasted above.

2) https://community.mellanox.com/s/ar...ive_content_id_I_Enable_SRIOV_on_the_Firmware

Wondering if SR-IOV needs to be enabled on the CX3 before I enable it in the BIOS.

So now I'm disabling it in the BIOS and sticking the CX3 back in. I'll see what its settings are per the link above (changing them if they're not set right, then changing the BIOS) and see what happens, but that will have to wait until tomorrow or Sunday.
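For the 7-vs-5 VF mismatch above, a quicker cross-check than eyeballing lspci is to read the counts from sysfs. This is only a minimal sketch, assuming a Linux kernel that exposes the standard `sriov_numvfs` and `sriov_totalvfs` PCI attributes; the function name `list_vfs` and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Print "<interface> <active VFs>/<max VFs>" for each SR-IOV capable NIC.
# The optional argument (hypothetical, defaults to /sys/class/net) lets the
# function be pointed at another directory, e.g. for testing.
list_vfs() {
    net_root="${1:-/sys/class/net}"
    for dev in "$net_root"/*; do
        # sriov_totalvfs only exists on physical functions with SR-IOV support
        [ -r "$dev/device/sriov_totalvfs" ] || continue
        printf '%s %s/%s\n' "$(basename "$dev")" \
            "$(cat "$dev/device/sriov_numvfs")" \
            "$(cat "$dev/device/sriov_totalvfs")"
    done
}

# Usage on a real host:
#   list_vfs
```

A port that reports fewer active VFs than requested is the one to chase in dmesg.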
 

thetoad

Fixed.

I wasn't on the most recent firmware (it seems I never updated the cards I stuck into these HP boxes; I know I did the others, but these weren't done).

So I updated the firmware and set the SRIOV_EN variable as described in the link.

I don't know if both were needed, but it definitely works: no red screen of death, and VFs are enumerated as expected (once one changes the default module args, also as described on that page).

This was such a big headache, and it had been annoying me for a very long time.
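For anyone wondering what the module-args change looks like: the sketch below builds a modprobe options line for the mlx4_core driver. It assumes the args in question are `num_vfs` and `probe_vf` (both are mlx4_core module parameters, but the linked page may suggest others); the helper name `mlx4_sriov_options` and the VF count of 4 are placeholders, not the values used here.

```shell
#!/bin/sh
# Build the modprobe options line for mlx4_core.
# $1 = number of VFs to create (placeholder value in the usage line below)
# $2 = number of VFs the host driver should also probe (defaults to 0)
mlx4_sriov_options() {
    num_vfs="$1"
    probe_vf="${2:-0}"
    printf 'options mlx4_core num_vfs=%s probe_vf=%s\n' "$num_vfs" "$probe_vf"
}

# Usage (as root), then reload the driver or reboot:
#   mlx4_sriov_options 4 > /etc/modprobe.d/mlx4_core.conf
```

The file name under /etc/modprobe.d is arbitrary as long as it ends in .conf.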
 

thetoad

And more follow-up: on my other DL380P, it seems SRIOV_EN was already set to true, which points to it being just a CX3 firmware issue.

So I upgraded the firmware again, on this one from

Code:
Current FW version on flash:  2.35.6302
New FW version:               2.42.5000
Rebooted, then modified its settings:

Code:
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2 BOOT_OPTION_ROM_EN_P1=false BOOT_OPTION_ROM_EN_P2=false LEGACY_BOOT_PROTOCOL_P1=0 LEGACY_BOOT_PROTOCOL_P2=0 SRIOV_EN=1
(Yes, I had to enable SRIOV_EN again, as it seems updating the firmware set it to false.)

Rebooted again and enabled SR-IOV in the BIOS, and no red screen/illegal opcode.

TL;DR: CX3 firmware 2.35.6302 is incompatible with SR-IOV on (at least) my DL380P Gen8s.
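To check other cards before flipping the BIOS switch, something like the sketch below can compare a card's firmware version against the one that worked here. It assumes GNU `sort -V` is available and that the version shows up on a line formatted like the "Current FW version on flash:" output above; both helper names are made up for illustration.

```shell
#!/bin/sh
# Succeed (exit 0) if firmware version $1 is strictly older than version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
fw_older_than() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Read tool output on stdin and print the version from the first
# "Current FW version on flash:" line, if any.
current_fw_version() {
    sed -n 's/^Current FW version on flash:[[:space:]]*//p' | head -n 1
}

# Usage, with the burn tool's output saved to a file (hypothetical name):
#   ver=$(current_fw_version < burn_output.txt)
#   fw_older_than "$ver" 2.42.5000 && echo "update the CX3 firmware first"
```

Since the update reset SRIOV_EN on my card, it's worth re-running `mlxconfig -d <device> query` afterwards and checking that value before rebooting.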
 

RolloZ170

Well-Known Member
Apr 24, 2016
2,663
678
113
55
I have read somewhere that rebooting too fast (e.g. every 4 minutes) can cause a red screen of death if the boot device is USB or a card.
Just leaving the whole system unpowered for 30 min should solve that.