Infiniband PCIe card preventing boot in one server but not another

Discussion in 'Networking' started by alltheasimov, May 12, 2018.

  1. alltheasimov

    alltheasimov Member

    Joined:
    Feb 17, 2018
    Messages:
    58
    Likes Received:
    10
    Which motherboard does it not work on? Did you try increasing the BAR space size?
     
    #21
  2. alltheasimov

    alltheasimov Member

    I created a separate thread with the performance tests of the two firmwares. There was no significant difference.
     
    #22
  3. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    294
    Likes Received:
    83
    Did you ever solve this problem? I think I'm encountering the same thing with an HP branded CX3 (MCX354A-QCBT) in an Asus Z97-Pro motherboard. I've tried enabling "Above 4G Decoding" but I haven't been able to POST once with the CX3 installed.
     
    #23
    Last edited: Aug 5, 2018
  4. fohdeesha

    fohdeesha Kaini Industries

    Joined:
    Nov 20, 2016
    Messages:
    765
    Likes Received:
    568
    If you think BAR size is the issue, it is configurable on the card with mlxconfig (the same utility I recommend for setting ports to Ethernet or IB, etc.). The setting is saved in the card's flash, so you can configure it on the working PC, then move the card to the non-working one.

    I would start by setting it to 0, like:

    Code:
    mst start
    
    mlxconfig -d /dev/mst/mt4099_pci_cr0 set LOG_BAR_SIZE=0
    Then move it to the other PC and see if it boots. If it does, you could try increasing the value by 1 each time until it doesn't. If it doesn't boot even at 0, try disabling SR-IOV if it's enabled, like "mlxconfig -d /dev/mst/mt4099_pci_cr0 set SRIOV_EN=0". The total BAR size is a function of the BAR size setting above times the number of virtual functions, so if you have SR-IOV enabled with a bunch of VFs, it will get big really fast.

    If it still absolutely won't boot, try disabling SR-IOV in your motherboard BIOS.
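
    To put rough numbers on how fast that grows, here is a minimal sketch of the arithmetic. It assumes LOG_BAR_SIZE is the base-2 log of the per-function BAR size in MiB, which is my reading of the setting and not something confirmed in this thread. total_bar_mib is a made-up helper name, and the real firmware may round up or add overhead.

```shell
#!/bin/sh
# Rough estimate of the BAR space requested by the card.
# ASSUMPTION: LOG_BAR_SIZE is log2 of the per-function BAR size in MiB,
# and the total scales linearly with the number of virtual functions.
# total_bar_mib is a hypothetical helper, not part of MFT.
total_bar_mib() {
    log_bar_size=$1
    num_vfs=$2
    # 2^LOG_BAR_SIZE MiB per function, times the VF count
    echo $(( (1 << log_bar_size) * num_vfs ))
}

total_bar_mib 0 1   # smallest setting, one function
total_bar_mib 3 8   # LOG_BAR_SIZE=3 with 8 VFs
```

    Under those assumptions the two calls print 1 and 64, so each step up in LOG_BAR_SIZE doubles the request and every extra VF multiplies it again.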
     
    #24
    Last edited: Aug 5, 2018
  5. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    I think I'm going to have to concede defeat on this one; 10 hours of troubleshooting is enough.

    Here's what I've tried today:
    • LOG_BAR_SIZE=0..3
    • NUM_OF_VFS=1..8
    • SRIOV_EN=0 and SRIOV_EN=1
    • Downgrading as far back as 2.10.2280 + Flex-3.3.650 (from 2012)
    • Nuking the bootrom on the card entirely
    Unless I'm missing something incredibly stupid, I'm pretty sure this is an Asus BIOS problem. This is the second Asus board I've owned that flat out refused to POST with a certain PCIe device installed, and I think it's going to be my last.

    In case anyone finds this via Google in the future here's a synopsis:
    Asus Z97-Pro with a Mellanox MCX354A-FCBT ConnectX-3 VPI adapter (HP 544QSFP 649281-B21) refuses to POST, with Q-Code 40 displayed on the board. "Above 4G Decoding" is on in the BIOS; I don't have any options for IOMMU or SR-IOV available to change.
     
    #25
  6. fohdeesha

    fohdeesha Kaini Industries

    Well damn, that's no good. One of my workstations upstairs is an Asus Z97-based board (Sabertooth something or other), and I've been meaning to throw a ConnectX-3 in it, so I guess I'll see how that goes. Out of curiosity, how did you nuke the bootrom on the card? Last time I tried with flint's brom commands it wouldn't allow it, unless you mean disabling boot options with mlxconfig.
     
    #26
  7. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Flint will stop you from issuing drom or brom commands if the firmware image flashed to the card contained a versioned copy of FlexBoot, but the option flag --allow_rom_change will let you override that.

    Let me know what happens when you throw a CX3 into that machine; I'm seriously stumped by this. I think I'll contact Asus support tomorrow and see what they say. I was really looking forward to using the card.

    edit: I'm going to go out on a limb and try one last thing later tonight. Q-code 40 is something like "System waking up from S4 sleep state" ... so I guess I'll shut off all power management and see if that makes a difference.
     
    #27
    Last edited: Aug 6, 2018
  8. fohdeesha

    fohdeesha Kaini Industries

    Amazing. After all my dicking around with MFT, I'm not sure why that didn't cross my mind: "flint -d /dev/mst/mt4099_pci_cr0 --allow_rom_change drom" did indeed delete the bootrom, and now I can stop seeing the useless FlexBoot shit at boot. Sweet.
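
    For anyone repeating this, here is a hedged sketch of how I'd wrap it so there is a backup before the ROM goes away. It uses flint's query, ri (read image), rrom (read ROM) and drom commands. The backup file names, the run helper, and the DRY_RUN switch are all made up for illustration, and by default the script only prints the commands instead of touching the card.

```shell
#!/bin/sh
# Sketch: back up the firmware and expansion ROM, then delete the ROM.
# DEV is the MST device name used earlier in this thread.
# ASSUMPTIONS: backup file names, run(), and DRY_RUN are illustrative only.
DEV=${DEV:-/dev/mst/mt4099_pci_cr0}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, anything else = execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

remove_bootrom() {
    # Each step must succeed before the next runs, so a failed
    # backup stops the script before anything is erased.
    run flint -d "$DEV" query &&
    run flint -d "$DEV" ri fw_backup.bin &&
    run flint -d "$DEV" rrom rom_backup.bin &&
    run flint -d "$DEV" --allow_rom_change drom
}

remove_bootrom
```

    Set DRY_RUN=0 only after checking the printed commands against the flint man page for your MFT version.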
     
    #28
  9. Hindsight

    Hindsight Member

    Joined:
    Mar 28, 2016
    Messages:
    55
    Likes Received:
    9
    One of my notes on crossflashing has this as well.

    Code:
    # Turn off the boot option ROM and the legacy boot protocol on both ports
    mlxconfig -d /dev/mst/mt4099_pci_cr0 set BOOT_OPTION_ROM_EN_P1=false
    mlxconfig -d /dev/mst/mt4099_pci_cr0 set BOOT_OPTION_ROM_EN_P2=false
    mlxconfig -d /dev/mst/mt4099_pci_cr0 set LEGACY_BOOT_PROTOCOL_P1=0
    mlxconfig -d /dev/mst/mt4099_pci_cr0 set LEGACY_BOOT_PROTOCOL_P2=0
    
     
    #29
  10. gzorn

    gzorn Member

    Joined:
    Jan 10, 2017
    Messages:
    67
    Likes Received:
    12
    @arglebargle - Unfortunately, I never did additional testing, since it works (mostly) in one of my servers.

    On a different note, did you try masking the SMBus pins (B5 and B6 on the card edge) on the PCIe connector?
     
    #30
  11. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Those mlxconfig settings leave the bootrom intact and configure the card not to attempt PXE booting; the bootrom still loads at boot time. What we're doing is erasing the bootrom from the card so it can't load at boot. I was hoping it was the bootrom that was causing issues with my BIOS, but unfortunately it wasn't.

    There's a section in the Mellanox firmware tools manual under Flint titled "Managing an Expansion ROM Image" with details. The man page for flint is pretty thorough too; that's where I found the override flag.
     
    #31
  12. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Oh man, I've never run into a situation where I needed to mask the SMBus pins, so it didn't even cross my mind. I'll try it this afternoon, thanks!
     
    #32
  13. gzorn

    gzorn Member

    @arglebargle Just fair warning, I've never done it myself. I think it was mentioned in the OP of this thread.
     
    #33
  14. fohdeesha

    fohdeesha Kaini Industries

    A long shot here, but is the card configured with one port as InfiniBand and one port as Ethernet? I remember that when mixing them like this, I think it showed up as two different PCI devices. If that's the case, maybe that's what the Asus board doesn't like? I dunno. If that's how it's configured, try setting both ports to Ethernet using mlxconfig in another box.
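
    Something like the sketch below is what I mean. It assumes the usual ConnectX-3 LINK_TYPE values (1 = IB, 2 = ETH, 3 = VPI/auto), so check the query output on your own card before trusting them. The run helper and DRY_RUN switch are made up for illustration, and by default it only prints the commands.

```shell
#!/bin/sh
# Sketch: set both ports to Ethernet with mlxconfig.
# ASSUMPTIONS: LINK_TYPE values 1 = IB, 2 = ETH, 3 = VPI/auto;
# run() and DRY_RUN are illustrative helpers, not part of MFT.
DEV=${DEV:-/dev/mst/mt4099_pci_cr0}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, anything else = execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

set_ports_eth() {
    # Show the current settings first, then change both port types.
    run mlxconfig -d "$DEV" query &&
    run mlxconfig -d "$DEV" set LINK_TYPE_P1=2 LINK_TYPE_P2=2
}

set_ports_eth
```

    The new link type should only take effect after a reboot or power cycle, which happens anyway when you move the card back to the other box.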
     
    #34
  15. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Wow, taping off the SMBus pins got the system through POST!

    Alright, time to go test out a few cables and see what these cards can do!

    Huge thanks guys, I dumped about 10 hours into trying to diagnose this.
     
    #35