InfiniBand PCIe card preventing boot in one server but not another

alltheasimov

Member
Feb 17, 2018
59
12
8
33
Just saw this thread. I have a similar(-ish) problem with a Dell-branded ConnectX-3 (312 dual port). It works in a Supermicro X11SSi-ln4f but not in a Gigabyte G87 UD3H. Machine wouldn't even boot. Crossflashing to the latest Mellanox firmware screwed up the LEDs but alas no improvement in booting on the Supermicro. I wasn't aware of the SMBus pin masking. Maybe I'll give that a shot.
Which motherboard does it not work on? Did you try increasing the BAR space size?
 

arglebargle

H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈
Jul 15, 2018
657
244
43
Just saw this thread. I have a similar(-ish) problem with a Dell-branded ConnectX-3 (312 dual port). It works in a Supermicro X11SSi-ln4f but not in a Gigabyte G87 UD3H. Machine wouldn't even boot. Crossflashing to the latest Mellanox firmware screwed up the LEDs but alas no improvement in booting on the Supermicro. I wasn't aware of the SMBus pin masking. Maybe I'll give that a shot.
Did you ever solve this problem? I think I'm encountering the same thing with an HP branded CX3 (MCX354A-QCBT) in an Asus Z97-Pro motherboard. I've tried enabling "Above 4G Decoding" but I haven't been able to POST once with the CX3 installed.
 

fohdeesha

Kaini Industries
Nov 20, 2016
2,729
3,081
113
33
fohdeesha.com
If you think BAR size is the issue, it is configurable on the card with mlxconfig (the same utility I recommend for setting ports to Ethernet or IB, etc.). The setting is saved in the card's flash, so you can configure it on the working PC and then move the card to the non-working one.

I would start by setting it to 0, like:

Code:
mst start

mlxconfig -d /dev/mst/mt4099_pci_cr0 set LOG_BAR_SIZE=0
Then move it to the other PC and see if it boots. If it does, you could try increasing it by 1 each time until it doesn't. If it doesn't boot even at 0, try disabling SR-IOV if it's enabled, like "mlxconfig -d /dev/mst/mt4099_pci_cr0 set SRIOV_EN=0". The total BAR size is a function of the BAR size setting above times the number of virtual functions, so if you have SR-IOV enabled with a bunch of VFs, that will get big really fast.

If it still absolutely won't boot, try disabling SR-IOV in your motherboard BIOS.
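
To put rough numbers on the "BAR size times number of virtual functions" point, here is a minimal sketch. It assumes LOG_BAR_SIZE is the base-2 log of the per-function BAR size in megabytes (so 0 would be 1 MB and 3 would be 8 MB); that is an assumption on my part, not something stated in this thread, so check it against the MFT documentation for your card. The function name estimate_total_bar_mb is made up for illustration, and nothing here touches the card.

```shell
#!/bin/sh
# Rough BAR estimate, arithmetic only (no mlxconfig calls).
# ASSUMPTION (not from the thread): LOG_BAR_SIZE is log2 of the
# per-function BAR size in MB, so 0 -> 1 MB and 3 -> 8 MB.

# estimate_total_bar_mb LOG_BAR_SIZE NUM_OF_VFS
# Prints per-function size (MB) multiplied by the number of virtual functions.
estimate_total_bar_mb() {
    log_bar_size=$1
    num_vfs=$2
    per_function_mb=$(( 1 << log_bar_size ))
    echo $(( per_function_mb * num_vfs ))
}

# Print the estimate for a few combinations of the two settings.
for log in 0 1 2 3; do
    for vfs in 1 8; do
        echo "LOG_BAR_SIZE=$log NUM_OF_VFS=$vfs -> about $(estimate_total_bar_mb "$log" "$vfs") MB"
    done
done
```

The point is only that the two settings multiply, so lowering either one shrinks the address space the BIOS has to map.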
 

arglebargle

I think I'm going to have to concede defeat on this one, 10 hours of troubleshooting is enough.

Here's what I've tried today:
  • LOG_BAR_SIZE=0..3
  • NUM_OF_VFS=1..8
  • SRIOV_EN=0 and SRIOV_EN=1
  • Downgrading as far back as 2.10.2280 + Flex-3.3.650 (from 2012)
  • Nuking the bootrom on the card entirely
Unless I'm missing something incredibly stupid, I'm pretty sure this is an Asus BIOS problem. This is the second Asus board I've owned that flat out refused to POST with a certain PCIe device installed; I think it's going to be my last.

In case anyone finds this via Google in the future, here's a synopsis:
An Asus Z97-Pro with a Mellanox MCX354A-FCBT ConnectX-3 VPI adapter (HP 544QSFP 649281-B21) refuses to POST, with Q-Code 40 displayed on the board. "Above 4G Decoding" is on in the BIOS; I don't have any options for IOMMU or SR-IOV available to change.
 

fohdeesha

Well damn, that's no good. One of my workstations upstairs is an Asus Z97-based board (Sabertooth something or other), and I've been meaning to throw a ConnectX-3 in it, so I guess I'll see how that goes. Out of curiosity, how did you nuke the bootrom on the card? Last time I tried with the flint brom commands it wouldn't allow it, unless you mean disabling boot options with mlxconfig.
 

arglebargle

Flint will stop you from issuing drom or brom commands if the fw image flashed to the card contained a versioned copy of Flexboot, but the option flag --allow_rom_change will let you override that.

Let me know what happens when you throw a CX3 into that machine, I'm seriously stumped by this. I think I'll contact Asus support tomorrow and see what they say; I was really looking forward to using the card.

edit: I'm going to go out on a limb and try one last thing later tonight. Q-code 40 is something like "System waking up from S4 sleep state" ... so I guess I'll shut off all power management and see if that makes a difference.
 

fohdeesha

Amazing, after all my dicking with MFT not sure why that didn't cross my mind - "flint -d /dev/mst/mt4099_pci_cr0 --allow_rom_change drom" did indeed delete the bootrom and now I can stop seeing the useless flexboot shit at boot. sweeeet
 

Hindsight

Member
Mar 28, 2016
55
14
8
42
One of my notes on crossflashing has this as well.

Code:
# Turn off the bootrom options: disable the option ROM on both ports
mlxconfig -d /dev/mst/mt4099_pci_cr0 set BOOT_OPTION_ROM_EN_P1=false
mlxconfig -d /dev/mst/mt4099_pci_cr0 set BOOT_OPTION_ROM_EN_P2=false
# Set the legacy boot protocol to 0 on both ports (0 should mean "none", i.e. no PXE)
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LEGACY_BOOT_PROTOCOL_P1=0
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LEGACY_BOOT_PROTOCOL_P2=0
 

gzorn

Member
Jan 10, 2017
76
14
8
@arglebargle - Unfortunately, I never did additional testing, since it works (mostly) in one of my servers.

On a different note, did you try masking the SMBus pins on the PCIe connector?
 

arglebargle

Hindsight said:
One of my notes on crossflashing has this as well.

Code:
#turn off bootrom crap
mlxconfig -d /dev/mst/mt4099_pci_cr0 set BOOT_OPTION_ROM_EN_P1=false
mlxconfig -d /dev/mst/mt4099_pci_cr0 set BOOT_OPTION_ROM_EN_P2=false
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LEGACY_BOOT_PROTOCOL_P1=0
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LEGACY_BOOT_PROTOCOL_P2=0

That leaves the bootrom intact and configures the card not to attempt PXE booting; the bootrom still loads at boot time. What we're doing is erasing the bootrom off of the card so it can't load at boot. I was hoping it was the bootrom that was causing issues with my BIOS, but unfortunately it wasn't.

There's a section in the Mellanox firmware tools manual under Flint titled "Managing an Expansion ROM Image" with details. The man page for flint is pretty thorough too, that's where I found the override flag.
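
For anyone following along, here is a hedged sketch of the sequence I'd suggest around the ROM delete. It only prints the commands instead of running them. It assumes flint's ri (read image) and rrom (read ROM) commands are available in your MFT version for taking backups first; the backup file names and the rom_erase_plan function are placeholders made up for this example, and the device path is the one used elsewhere in this thread.

```shell
#!/bin/sh
# Dry-run sketch: prints a backup-then-erase sequence, runs nothing.
# DEV and the backup file names are placeholders; adjust for your card.
DEV="${DEV:-/dev/mst/mt4099_pci_cr0}"

# rom_erase_plan DEVICE
# Echoes the commands in the order they would be run by hand.
rom_erase_plan() {
    dev=$1
    echo "mst start"
    # Save the full firmware image so the card can be restored later.
    echo "flint -d $dev ri fw_backup.bin"
    # Save only the expansion ROM (Flexboot) to its own file.
    echo "flint -d $dev rrom rom_backup.bin"
    # Delete the ROM, using the override flag described above.
    echo "flint -d $dev --allow_rom_change drom"
}

rom_erase_plan "$DEV"
```

Review the printed commands and run them manually once they look right for your device. Keeping the two backups means the ROM can be burned back later if deleting it turns out not to help.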
 

arglebargle

gzorn said:
@arglebargle - Unfortunately, I never did additional testing, since it works (mostly) in one of my servers.

On a different note, did you try masking the SMBus pins on the PCIe connector?

Oh man, I've never run into a situation where I needed to do this so it didn't even cross my mind. I'll try it this afternoon, thanks!
 

fohdeesha

A long shot here, but is the card configured with one port as InfiniBand and one port as Ethernet? I remember that when mixing them like this, I think it showed up as two different PCI devices; if that's the case, maybe that's what the Asus board doesn't like? I dunno. If that's how it's configured, try setting both to eth using mlxconfig in another box.
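
A minimal sketch of what setting both ports to eth could look like. It assumes the ConnectX-3 LINK_TYPE_P1/LINK_TYPE_P2 parameters with 1 = IB, 2 = ETH, 3 = VPI, which is from memory, so confirm with "mlxconfig -d <device> query" before changing anything. The helper names link_type_value and link_type_cmd are made up for this example, and the script only prints the command.

```shell
#!/bin/sh
# Dry-run sketch: prints the mlxconfig command, does not run it.
# ASSUMPTION: LINK_TYPE values are 1 = IB, 2 = ETH, 3 = VPI (auto-sense).
DEV="${DEV:-/dev/mst/mt4099_pci_cr0}"

# link_type_value NAME -> prints the numeric value, fails on unknown names.
link_type_value() {
    case "$1" in
        ib)  echo 1 ;;
        eth) echo 2 ;;
        vpi) echo 3 ;;
        *)   echo "unknown link type: $1" >&2; return 1 ;;
    esac
}

# link_type_cmd DEVICE NAME -> prints one command that sets both ports.
link_type_cmd() {
    dev=$1
    value=$(link_type_value "$2") || return 1
    echo "mlxconfig -d $dev set LINK_TYPE_P1=$value LINK_TYPE_P2=$value"
}

link_type_cmd "$DEV" eth
```

Like the BAR setting, this is stored on the card, so it can be changed in the working box before moving the card over.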
 

arglebargle

Wow, taping off the SMBus pins got the system through POST!

Untitled.png

Alright, time to go test out a few cables and see what these cards can do!

Huge thanks guys, I dumped about 10 hours into trying to diagnose this.
 

fjv

New Member
Jan 30, 2020
1
0
1
Similar issue with a Fujitsu D3644 motherboard and an HP-branded ConnectX-3 Pro.

System boots, but DIMM slot 3 (closest to the CPU) simply does not recognise any RAM. It shows as empty in the BIOS.
The other 3 DIMM slots work perfectly. No warnings on boot, no event log entries in the BIOS.

MCX354A-FCCT VPI PRO FDR/40/56GBE 2 QSFP14 3.0 x8 HPE 764284-B21

Tried everything (even flashing to stock Mellanox firmware, getting rid of the bootrom, etc.).
In the end, taping the 2 SMBus pins solved the issue.

(Sorry for reviving an old thread, but this was the only reference I could find regarding InfiniBand and this type of issue. So in case anyone else searches for this specific card and mobo in relation to weird DIMM issues ...)