Mellanox ConnectX-4 or newer & Bluefield, Tips & Tricks


Civiloid

Over the course of the last several years, I've noticed that on the forum, there are similar questions about both ConnectX NICs and Bluefields. Therefore, I've written a huge meta post that collects information about them.
I have had some experience working with ConnectX-4, 5, 6, and 7 NICs, as well as Bluefield-1 and 2. Most of my experience is with pure Linux, but that shouldn't matter for most things, and I'll be glad to accept any changes. To do so, please add them as a suggestion in this Google Doc. If you want to be credited, don't forget to include your username on the forum (or any other way you want to be credited in the Authors section). If you can't, PM me.
As Bluefield is essentially a ConnectX NIC with an ARM CPU (and more), there are many similarities in working with them.

Last update: 13 May 2025

Changelog:
  1. 4 May 2025 Initial document
  2. 5 May 2025 extra note about Dell's CX-4 Lx and LED.
  3. 5 May 2025 mention that there is a 3rd workaround for MBF2M345A-VENOT_ES
  4. 10 May 2025 added MTUSB-1 section 5
  5. 13 May 2025 Clarify the section about cross-flashing vendor cards
1. Authors & Contributors
  • Civiloid
  • pimposh
  • jpmomo
2. Generic
NOTE: All server NICs require some airflow. You might be able to get away with little to no airflow for ConnectX-4 Lx and ConnectX-6 Lx (though this is already questionable), but for all other models, you must provide active airflow of some form; otherwise, even in idle mode, the NIC will overheat and shut down.

To get the full potential out of your Mellanox/nvidia NIC, you need to install MLNX_OFED. Since December 2024, it has been distributed as part of the Host DOCA SDK bundle. As of writing, you can download it from this section of nvidia's website. To download the latest DOCA, go to the website, click on "Host-Server", select "DOCA-Host", choose your OS, and follow the provided instructions.

DOCA 2.10 (latest as of writing) supports all NICs starting with ConnectX-4 and Bluefield-2 and newer (see below). For unsupported cards, you need to use OFED that still supports them. The archive is located here.

To reflash NICs, you should use the open-source mstflint. Most Linux distros ship a prepackaged mstflint, but its version might be too old. The latest version can be downloaded from here.

Unofficial list of Mellanox and Nvidia NICs
3. ConnectX
i. General
Working with ConnectX cards is mostly straightforward. There are a few common tools and commands that people use:

Code:
# query current configuration
mlxconfig -d ${PCI_ID} q
Code:
# Change port 1 to Ethernet (2) and port 2 to InfiniBand (1)
mlxconfig -d ${PCI_ID} set LINK_TYPE_P1=2 LINK_TYPE_P2=1
Code:
# Enable SR-IOV and Change number of virtual functions to 8:
mlxconfig -d ${PCI_ID} set SRIOV_EN=1 NUM_OF_VFS=8
Code:
# Link-aggregation mode, queue affinity
mlxconfig -d ${PCI_ID} s LAG_RESOURCE_ALLOCATION=0
Code:
# Link-aggregation mode, hash mode
mlxconfig -d ${PCI_ID} s LAG_RESOURCE_ALLOCATION=1
Code:
# Enable aggressive CQE Compression - that might help with small packet performance, however, might make performance worse in some cases
mlxconfig -d ${PCI_ID} s CQE_COMPRESSION=1

Some NICs have so-called "White" and "Black" connectors. Those are for the "Socket Direct" adapter and can be used to connect the NIC to PCIe lanes from 2 different sockets. You connect the cable labeled "white" to the "WHITE" connector and the cable labeled "black" to the connector that has "BLACK" written near it (note: the actual color of the cables can be anything - both white, one white and one black, or both black - that is normal). It might help performance when you have a multi-socket system and want to assign queues from the NIC to both CPUs without data passing over a slow cross-socket link (the same applies to sub-NUMA nodes, like NPS on AMD Epyc).
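To check which NUMA node a PCIe function of the NIC hangs off (handy when deciding which socket's cores should serve which queues), Linux exposes the value in sysfs. A minimal sketch: the function name and the SYSFS_ROOT override are hypothetical helpers (the override only exists so the function can be tried against a fake directory tree), and a value of -1 means the platform did not report a node.

```shell
# Print the NUMA node of a PCI function given its full address, e.g. "0000:03:00.0".
# SYSFS_ROOT is a hypothetical override used for testing; it defaults to /sys.
pci_numa_node() {
    cat "${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/numa_node"
}
```

Usage would be something like `pci_numa_node "0000:${PCI_ID}"`, assuming the card sits in PCI domain 0000 (lspci omits the domain in that case, sysfs does not).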

a. Troubleshooting
Some cards, especially those running older firmware (from approximately 2018-2019), have severe compatibility issues with modern systems, to the extent that they won't initialize and are therefore inaccessible from within the OS. The solution is to put the card into flash recovery mode and reflash it, or to find a compatible system (I personally use a Gigabyte MC12-LE0-based system as a reflashing machine).
ii. Updating firmware
There is a fully official automated way to update firmware for supported non-OEM cards:
Code:
mlxfwmanager
It will update the NIC firmware to the version bundled with the DOCA release you have installed. That doesn't work for OEM cards (e.g., Dell, HP, Cisco, …) and won't allow you to cross-flash cards: OEM cards have different PSIDs that start not with MT_ but with company-allocated prefixes (DEL, HP_, etc).
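A quick way to tell whether mlxfwmanager has a chance of working is to look at the PSID prefix. The sketch below only encodes the rule from the paragraph above (MT_ means a stock card, anything else is treated as OEM); the function name is made up for illustration.

```shell
# Classify a PSID by its prefix: stock Mellanox/nvidia PSIDs start with "MT_".
psid_kind() {
    case "$1" in
        MT_*) echo "stock" ;;
        "")   echo "unknown" ;;
        *)    echo "oem" ;;
    esac
}
```

The PSID itself can be taken from the "PSID:" line of `mstflint -d "${PCI_ID}" query`, for example `psid_kind DEL0000000027` prints "oem".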

If you want to flash specific firmware for the same PSID (one of the unique identifiers of a particular card):
Code:
sudo mstflint -d "${PCI_ID}" -i "${NEW_FIRMWARE_BIN}" burn
a. Flashing / cross-flashing / recovering from "flash recovery" state
WARNING: This is a dangerous operation. While ConnectX-4 and newer cards are rather resilient and can be restored most of the time, in some cases recovery is hard, and flashing might cause permanent damage to the NIC or your computer. Do it at your own risk. Your chances of success are higher if the PCBs of the cards mostly match, and you must be sure they use the same chip (it is a bad idea to flash a ConnectX-4 Lx with ConnectX-4 EN firmware). At the very least, make backups.

Newer cards (ConnectX-6, 7, 8, Bluefield-2, Bluefield-3) can run signed firmware (all CX-7 and BF-3 cards do). Some of them can still be cross-flashed, but it is not guaranteed they'll be able to boot; in some cases, overwriting the VPD helps.

The instructions below cover cross-flashing a card to another PSID's firmware, either for a same-vendor card with unsigned firmware or for older ConnectX-4 and ConnectX-5 cards.

Note: Cross-flashing the Dell variant of ConnectX-4 Lx (DPN: 20NJD for the short bracket, MRT0D for the long bracket) with stock Mellanox FW leaves the built-in LEDs (uplink/traffic) disabled, hence it’s advised to stay on the latest Dell FW (14.32.20.04).

To check if your card runs signed firmware:
Code:
mstflint -d ${PCI_ID} query full
If the output contains "Security Attributes: secure-fw", then the firmware is signed and you need to cross-flash it from within recovery mode. Otherwise, the normal way would work:
Code:
sudo mstflint -d "${PCI_ID}" -i "${NEW_FIRMWARE_BIN}" -allow_psid_change burn
In rare cases, some ConnectX-5 cards also need the '--no_fw_ctrl' option.
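The signed-firmware check can be scripted so that you don't have to eyeball the output. This is just a sketch that looks for "secure-fw" on the "Security Attributes:" line of whatever is piped into it; the function name is hypothetical, and it assumes the line looks like the one in a normal "query full" output.

```shell
# Reads "mstflint ... query full" output on stdin.
# Exit status 0 if the Security Attributes line mentions secure-fw, 1 otherwise.
is_secure_fw() {
    awk '/^Security Attributes:/ && /secure-fw/ {found=1} END {exit !found}'
}
```

For example: `sudo mstflint -d "${PCI_ID}" query full | is_secure_fw && echo "signed firmware, use recovery mode"`.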

I personally like to use something like this:
Code:
echo "You are about to flash your Mellanox card to a different firmware without any validation. That can cause irreversible damage to your network card or PC. Please STOP IF YOU ARE NOT SURE YOU KNOW WHAT YOU ARE DOING, AND YOU TAKE FULL RESPONSIBILITY FOR WHAT IS ABOUT TO HAPPEN."
sleep 60
sudo apt install mstflint gawk
NEW_FIRMWARE_BIN="<set this to the path to your new unzipped firmware, bin file>"
# This gets the PCI ID of the first card only; modify it if you need another one
PCI_ID=$(sudo lspci | gawk '($0 ~ /ConnectX/ && $1 ~ /\.0$/){print $1}' | head -n 1)
mkdir -p "mellanox_${PCI_ID}_backup"
sudo mlxconfig -d "${PCI_ID}" q > "mellanox_${PCI_ID}_backup"/mlxconfig.txt
sudo mstflint -d "${PCI_ID}" query full > "mellanox_${PCI_ID}_backup"/query_full.txt
sudo mstflint -d "${PCI_ID}" hw query > "mellanox_${PCI_ID}_backup"/hw_query.txt
sudo mstflint -d "${PCI_ID}" ri "mellanox_${PCI_ID}_backup"/orig_firmware.bin
sudo mstflint -d "${PCI_ID}" dc "mellanox_${PCI_ID}_backup"/orig_firmware.ini
sudo mstflint -d "${PCI_ID}" -i "${NEW_FIRMWARE_BIN}" -allow_psid_change burn
sudo mstfwreset -d "${PCI_ID}" reset
For newer cards, if the card runs signed or encrypted firmware, you won't be able to take a backup from normal mode and won't be able to use the 'allow_psid_change' flag in normal mode. You need to put the card into flash-recovery mode. To do that, turn off your server, remove the card, and short JP1/JP2/J7 (sometimes labeled "FNP") - that is an unpopulated 2-hole connector with just "JPx" written next to it. I personally use just a piece of wire, but be careful not to short anything else around it.
Code:
sudo mstflint -d "${PCI_ID}" -i "${NEW_FIRMWARE_BIN}" -ocr --nofs --allow_psid_change burn
For ConnectX-6/Bluefield-2, you can also try flashing another vendor's firmware by overwriting the VPD. You will need to put the card into flash recovery mode and use flint_oem from the mft-oem package provided by nvidia. You can get one from here:
Note: reflashing the VPD sets your MAC and GUID to 0, and you'll need to restore them from a pre-saved file:
Code:
sudo mstflint -d "${PCI_ID}" -ocr query full > query_full.txt
sudo mstflint -d "${PCI_ID}" -ocr hw query > hw_query.txt
# Make a backup of the firmware image and its config, just in case.
sudo mstflint -d "${PCI_ID}" -ocr ri orig_firmware.bin
sudo mstflint -d "${PCI_ID}" -ocr dc orig_firmware.ini
# Remove flash write protection. Open-source Flint doesn't support hw queries (might need to be compiled specially)
sudo flint_oem -d "${PCI_ID}" -ocr hw set Flash0.WriteProtected=Disabled
sudo flint_oem -d "${PCI_ID}" -i "${NEW_FIRMWARE_BIN}" -ocr --nofs --allow_psid_change --ignore_dev_data --use_image_ps burn

# For me, a reboot was required; until then, the GUID and MAC stayed 0, even though it said otherwise
reboot

# After the reboot, run the rest from the directory that holds query_full.txt
PCI_ID=$(sudo lspci | gawk '($0 ~ /ConnectX/ && $1 ~ /\.0$/){print $1}' | head -n 1)
GUID=$(gawk '($1 == "Base" && $2 == "GUID:"){print $3}' query_full.txt)
MAC=$(gawk '($1 == "Base" && $2 == "MAC:"){print $3}' query_full.txt)
sudo mstflint -d "${PCI_ID}" -guid "${GUID}" -mac "${MAC}" -ocr sg
After that, you can turn off your server and remove the jumper. It should boot normally, and the card should report the new PSID and the rest of the new information.
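Since a wrong or empty GUID/MAC is exactly what you are trying to fix, it is worth sanity-checking the values pulled out of query_full.txt before writing them back. A small sketch, assuming the values are stored as 16 (GUID) and 12 (MAC) plain hex digits without separators; the helper name is hypothetical.

```shell
# Succeeds if $1 is exactly $2 hex digits long and is not all zeros.
valid_hex_id() {
    value=$1
    len=$2
    [ "${#value}" -eq "$len" ] || return 1
    case "$value" in
        *[!0-9a-fA-F]*) return 1 ;;
    esac
    # Strip every "0"; an empty remainder means the value was all zeros.
    [ -n "$(printf '%s' "$value" | tr -d '0')" ]
}
```

You could guard the restore step with `valid_hex_id "${GUID}" 16 && valid_hex_id "${MAC}" 12 || echo "refusing to write GUID/MAC"`.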

TODO: Check if Diolan U2C works with the cards as well, as it seems to use the same chip and USB ID as mtusb-1.

Note on ConnectX-7 cards:
There are some ConnectX-7 Engineering Samples on the market that have pre-production unencrypted firmware (their production date is earlier than April 2022). In the output of "mstflint -d ${PCI_ID} query full" you will also see:
Code:
Life cycle:            PRODUCTION
<...>
Encryption:            Disabled
Nvidia's documentation lists a path to make those cards run Production firmware: you need to obtain a special jump firmware version 28.98.2406, flash it, and after a reboot, flash the production firmware. However, there is no known way to obtain the 28.98.2406 firmware, so that upgrade path was never tested. That said, some cards evidently reached actual users with firmware version 28.98.xxxx flashed, as there are multiple mentions of that on the internet.

If you have a card with firmware from the 28.98 range, please dump it and share for preservation purposes.
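To quickly check whether a card falls into that range, you can test the "FW Version:" line of the query output. A sketch only: the function name is made up, and it assumes the usual "FW Version:   x.y.z" layout of the query output.

```shell
# Reads "mstflint ... query full" output on stdin.
# Succeeds if the FW Version line reports a 28.98.x firmware.
is_cx7_2898_fw() {
    awk '$1 == "FW" && $2 == "Version:" && $3 ~ /^28\.98\./ {found=1} END {exit !found}'
}
```

For example: `sudo mstflint -d "${PCI_ID}" query full | is_cx7_2898_fw && echo "please dump this firmware"`.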

iii. Specialized SKUs
a. Innova
Those are ConnectX-4 Lx (Innova) or ConnectX-5 (Innova-2) NICs with on-board FPGAs for crypto acceleration. There isn't much information available, though, mostly just news from STH.
I haven't tried using Innova NICs, but there seems to be a lot of information available in one of the GitHub repos (have a look at that account, it has way more details about Innova-2) or there is a good longread on GitHub Gist.

b. Branded cards
Mellanox/Nvidia also customizes cards for big customers. One notable example is CX71343DAC-WEBF, a custom SKU for Facebook, which features a single QSFP-DD that can be split into two 200G QSFP56s. That card has a custom PSID that starts with 'FB_' and therefore, firmware can't be easily updated.
iv. MacGyvering firmware
It is possible to MacGyver your own firmware, but currently, the working methods are limited to ConnectX-5 and 6, as they use the FS4 firmware format, which is unencrypted. It is theoretically possible to explore other firmware formats, but no tooling exists, and newer firmwares are also encrypted, so there isn't much you can do with them, as there is no known method to work around the encryption.

There are some reasons why you might want to do that - for example, if you want to change a PSID to your custom one, or have a custom vendor card that works somewhat better (e.g., ATTO-branded ConnectX-5s for Mac that work with their driver) but the vendor stopped updating the firmware and EOLed the product. However, note that if the firmware is secure and signed, the number of changes it accepts is limited.
One example is to add PCIe Gen4 initialization to the ATTO FastFrame N312 (ConnectX-5), which would utilize all available bandwidth in Thunderbolt 5/USB4 v2 enclosures.

The available tooling is limited in terms of functionality.

Main tools are:
  1. GitHub - irisc-research-syndicate/mlx5fw: Tool for manipulating ConnectX-5 firmware (FS4?) - open-ish (unlicensed) tool that can extract ITOC sections of firmware (but not DTOC) and replace them if the size of the section hasn't changed.
  2. Open-source mstflint has a command to verify firmware; it will print most sections along with their offsets.
4. Bluefields
i. General
All documentation assumes you have both mstflint and DOCA-Host installed on your host system.

Reflashing of the ARM part goes via rshim (on Linux, you need to start the rshim service). It creates devices in /dev/rshim${n}/ that give you a serial console (/dev/rshim${n}/console, 115200n8) or allow you to control the card (/dev/rshim${n}/misc).

Some useful commands:
Code:
#reboot ARM CPU
echo "SW_RESET 1" > /dev/rshim${n}/misc
#increase verbosity of the console logs
echo "DISPLAY_LEVEL 2" > /dev/rshim${n}/misc
The OS for the ARM part comes packaged into bfb files (bluefield firmware bundle). To flash them, you should use bfb-install from the rshim package.

a. Working with DOCA SDK
NOTE: The example below assumes that you are flashing DOCA 2.9.1, and all the file names are based on that.

To flash the DOCA SDK to your card, use a command similar to this one:
Code:
bfb-install --rshim rshim0 --bfb ./bf-bundle-2.9.1-30_24.11_ubuntu-22.04_prod.bfb --config ./bf.cfg
That flashes DOCA SDK 2.9.1 (that you've downloaded) to the NIC and applies the provided config file with extra parameters.

In that config file, you can provide some useful parameters:
Code:
ubuntu_PASSWORD=<hash of the password>
grub_admin_PASSWORD='<grub2 pbkdf2 hashed password>'
A password hash can be generated with, for example, 'openssl passwd -5'; you can also try other hash types that the OS of your choice supports. The official documentation suggests using type 1 (MD5-based) for whatever reason.
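Putting the hash into the config file can be sketched like this. The function name is hypothetical, and the single quotes around the value are based on my assumption that bf.cfg gets parsed as shell, where an unquoted "$" in the hash would otherwise be expanded.

```shell
# Write a minimal bf.cfg that contains only the Ubuntu password hash.
# $1: output path, $2: crypt-style hash (e.g. produced by "openssl passwd -5").
write_bf_cfg() {
    # Single quotes keep the "$" separators of the hash literal.
    printf "ubuntu_PASSWORD='%s'\n" "$2" > "$1"
}
```

Something like `write_bf_cfg ./bf.cfg "$(openssl passwd -5)"` will prompt for the password and write the resulting hash into ./bf.cfg.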

The config file supports multiple options; refer to the documentation for a list of available parameters or a detailed process description.

Another interesting part is the default network interface configuration, which manages the settings for all interfaces. Most notably, that way you can change the default IP address: Deploying BlueField Software Using BFB from Host

b. Accessing Bluefield OS
If you load the rshim service, it will create a tmfifo_net# interface (where # is the number of the corresponding /dev/rshim# folder for the card). By default, the Bluefield OS is available at 192.168.100.2 and will try to use 192.168.100.1 as its default gateway and nameserver.
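On the host side that means giving the tmfifo interface the .1 address. The sketch below only prints the commands so you can review them first; the function name is hypothetical and the /30 prefix length is my assumption, adjust it if your setup differs.

```shell
# Print the commands that give the host side of tmfifo_net<n> the address
# the Bluefield OS expects as its gateway. $1 is the rshim index (default 0).
tmfifo_host_cmds() {
    n=${1:-0}
    echo "ip addr add 192.168.100.1/30 dev tmfifo_net${n}"
    echo "ip link set tmfifo_net${n} up"
}
```

After reviewing the output, `tmfifo_host_cmds 0 | sudo sh` applies it, and the card should then answer on 192.168.100.2.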

By default, all of the tmfifo interfaces on all cards would have a static MAC address 00:1A:CA:FF:FF:01 and therefore should be accessible over a link-local IPv6 address: fe80::21a:caff:feff:ff01. That can be changed via one of the parameters in bf.cfg.
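The link-local address follows from the MAC via the usual modified EUI-64 rule: flip the universal/local bit of the first octet and insert ff:fe in the middle. If you change the MAC, a sketch like this recomputes the address (the function name is made up for illustration):

```shell
# Derive the EUI-64 based link-local IPv6 address from a MAC address.
mac_to_link_local() {
    # Split "aa:bb:cc:dd:ee:ff" into $1..$6.
    old_ifs=$IFS
    IFS=':'
    set -- $1
    IFS=$old_ifs
    # Flip the universal/local bit of the first octet.
    first=$(( 0x$1 ^ 0x02 ))
    # Insert ff:fe between the third and fourth octets and print four groups.
    printf 'fe80::%x:%x:%x:%x\n' \
        $(( (first << 8) | 0x$2 )) \
        $(( (0x$3 << 8) | 0xff )) \
        $(( 0xfe00 | 0x$4 )) \
        $(( (0x$5 << 8) | 0x$6 ))
}
```

`mac_to_link_local 00:1A:CA:FF:FF:01` prints fe80::21a:caff:feff:ff01. Remember that link-local addresses need a scope, e.g. `ping -6 "fe80::21a:caff:feff:ff01%tmfifo_net0"`.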

Alternatively, it would try to obtain an IP address over OOB RJ45 via DHCP (v4).
c. Bluefield Applications
Nvidia provides good information about sample applications of Bluefield. I suggest referring to the relevant section of the documentation.

One of the notable use-cases for Bluefield is to emulate an NVMe drive while actually performing NVMe-oF (e.g., NVMe over TCP): DOCA Storage Zero Copy

Another use case is a GPU Packet Processing application - that way you can abstract the GPU over the network: DOCA GPU Packet Processing Application Guide


ii. Bluefield-1
NOTE: Avoid those cards, unless they are extremely cheap or you want one for your collection. They are unsupported by nvidia, the ARM part requires an old DOCA (1.3.0 is the last one that officially supported BF1; DOCA 1.4.0 still contains firmware for those cards and might or might not work), and they have bugs that were never fixed. If you have one, you can still install the latest DOCA on the host to get the newest version of OFED.

One of the common problems with Bluefield-1 is PCIe compatibility. It is the same problem as with ConnectX-5 cards running older firmware. However, no proper fix is available for Bluefields, and even with the newest available firmware the card won't run in some of the newer systems (I personally had no compatibility problems with Sapphire Rapids/Emerald Rapids server systems, but in SP3/SP5 AMD Epyc machines the NIC failed to initialize). In theory, it might be possible to MacGyver a fixed firmware by transplanting the PCIe init section from ConnectX-5. However, that was never tried and might brick the card beyond repair.
a. General

There are a few different SKUs in Bluefield-1 generation available, and it is important to verify which one you got before inserting the card.

All the cards use the same ConnectX-5 EN NIC silicon as standalone ConnectX-5 NICs. That might be important in terms of performance.

Also, all the cards, including 2x25G, are PCIe Gen4 versions. Therefore, it should be possible to reach full speed on a PCIe Gen4 x4 port.

Mostly full list of Bluefield-1 SKUs

There is a bit of logic in how they are named, mainly in the suffix of the card.

You would see the suffixes like AENAT, ASNAT, ASCAT, CSNAT, etc.
The first letter represents the number of ports on a card. A - 2x25, C or E - 2x100.
The second letter represents the number of cores - E is 8 ARM cores, S - 16.
The third letter represents crypto support - N - no crypto, C - crypto enabled.
"T" is just always there, I don't know if it has any special meaning, and it was never used on the older firmware download page.

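The naming rules above are easy to turn into a small decoder. The sketch below only knows the letters listed in this post and prints "unknown" for anything else; the function name is hypothetical.

```shell
# Decode the suffix of a Bluefield-1 part number, e.g. "MBF1M332A-ASCAT".
bf1_decode_suffix() {
    suffix=${1##*-}   # keep only the part after the last "-"
    ports_letter=$(printf '%s\n' "$suffix" | cut -c1)
    cores_letter=$(printf '%s\n' "$suffix" | cut -c2)
    crypto_letter=$(printf '%s\n' "$suffix" | cut -c3)
    case "$ports_letter" in
        A)   ports="2x25G" ;;
        C|E) ports="2x100G" ;;
        *)   ports="unknown ports" ;;
    esac
    case "$cores_letter" in
        E) cores="8 ARM cores" ;;
        S) cores="16 ARM cores" ;;
        *) cores="unknown cores" ;;
    esac
    case "$crypto_letter" in
        N) crypto="no crypto" ;;
        C) crypto="crypto enabled" ;;
        *) crypto="unknown crypto" ;;
    esac
    echo "${ports}, ${cores}, ${crypto}"
}
```

For example, `bf1_decode_suffix MBF1M332A-ASCAT` prints "2x25G, 16 ARM cores, crypto enabled".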
  1. BF1500 (e.g. MBF1L516B-CSNAT) or 2x25G Bluefield (all MBF1M332A) - those are normal NICs with an ARM CPU on board. The ARM CPU is slow, as it is either an 8-core or 16-core Cortex-A72 that runs at just 800 MHz. 2x25G cards have a built-in fan that is rather loud.
  2. BF1600 (e.g. MBF1M606A-CSNAT or MBF1M636A-CSNAT) - those are special SKUs for JBoF systems - instead of being PCIe client cards, they act as the PCIe host and therefore should not be inserted into a normal system - according to the docs, that might damage the motherboard and/or the card. They are also rare and usually extremely overpriced. They have different CPUs, but all of them are 16-core and run at either 1.1 GHz or 1.3 GHz. All but the lowest-end (606) have a 'PCIe Aux connector' that lets you use a second PCIe slot to reach all 32 PCIe lanes that those NICs have. It is not clear if that requires a special card or if a Socket Direct adapter can be used. Some of the BF1600 cards have a SODIMM DDR4 slot for RAM, instead of soldered RAM.


Firmware updates for the cards are no longer available as standalone files, but the latest firmware is included in DOCA 1.3.0. On some cards (e.g., MBF1L515B ones), you can install the SDK only via a mini-USB to USB cable, as they don't expose rshim over PCIe.

Follow the guide for DOCA 1.3.0 to install it on the card, as that is the latest version of DOCA that works.
iii. Bluefield-2
a. General

Even 2x25G SKUs are based on ConnectX-6 Dx; therefore, both their performance and their power consumption should be higher than those of ConnectX-6 Lx, if that matters. As a drawback, none of the cards supports ASPM.

Mostly complete official list of bluefield-2s

As with BF1, there are separate SKUs that are JBoF, and just like with BF1, you cannot use those cards on a normal computer or server. Those SKUs are called BF2500 or BF2 VPI DPU Controllers. Unfortunately, it is unclear if they have different PCBs or share the same one with other Bluefield-2s, as I wasn't able to obtain them, and there are no reports on any successful attempts to flash cards with BF2500 firmware. Firmware for BF2500 can be found inside DOCA 1.3.0 and 1.4.0, but it is old.

Some SKUs are not mentioned in the documentation, but older versions of DOCA have firmware for those SKUs. For example, MBF2H536C-CEUO has Secure Boot, but UEFI and Crypto are disabled.
b. Special versions
Other special versions of Bluefield-2 exist, but I haven't personally seen them. Those are bluefields with integrated GPU - Bluefield-2X, which consists of 3 production models:
  • AX800 - actually Bluefield-3 with A100 class chip
  • A100X - Bluefield-2 with A100
  • A30X - Bluefield-2 with A30

On the internet, you can also find a BF A10X Engineering Sample for sale, but it is not clear if it exists outside of Engineering Samples. It is unclear what kind of silicon the card uses, or whether it actually exists and is not just a typo in the listing (it is hard to verify, as firmware for BF-2X has been distributed only under NDA since approx. DOCA 1.4.0).
c. NC-SI Interface
Bluefield-2 has an NC-SI connector that can be used for debugging/recovery purposes. The connector is different depending on the card. Rule of thumb: if the card is an Engineering Sample, it will have a 30-pin connector for a flat cable. For production cards, a 20-pin Molex 5011892010 connector is used.

They all have UART pins exposed, which would provide UART to the BMC.

There is slightly more information in the manuals:
  1. Bluefield BMC 23.09
  2. Bluefield2 DPU User Guide for pin description
d. [IMPORTANT] Updating the DOCA
It is important not to try updating a Bluefield from within the pre-installed DOCA SDK if the version is not recent enough. That would result in a soft-bricked NIC that you won't be able to even detect on the PCIe bus, and the only way to bring it back to life would be to access BMC (via SSH, if it was enabled, or via NC-SI serial console). See below for the recovery instructions.

If you have a custom SKU for which DOCA doesn't have a firmware, at the end of the firmware update process, you'll get a message:
Code:
INFO[MISC]: NIC firmware update failed
That is normal; it means that DOCA didn't have any firmware for that card.
e. Note on MBF2M345A-VENO Engineering Samples
Those are pre-production SKUs. They usually come with pre-production firmware 24.33.0356 or with an updated firmware version 24.40.1000. It is not possible to just cross-flash those cards to a production SKU like MBF2M345A-HECO or HESO, because there is a physical difference between the cards: VENO uses 2-bank 8 Gbit RAM chips, while HE*O cards use 1-bank 16 Gbit RAM chips. If you just cross-flash the card, the ARM cores will fail to boot, complaining about ddr_init training errors.



There are several workarounds available. A temporary workaround involves modifying the bfb file and replacing the RAM configuration on the flash. You need to replace the config within the BFB file each time you run bfb-install, and you need to persist your changes on the eMMC. You can try to make it persistent - the ini is located in /dev/mmcblk0boot0 and a copy is in /dev/mmcblk0boot1. Please make a full backup of a working boot0 before making any changes (and refer to the section on BFB structure for more information).
The second workaround is to modify the firmware of the NIC. If you want a modified firmware, please send me a PM, as I don't want it to be widely available for now, as that probably would increase the prices of the cards.
The third workaround is to find newer firmware in other places (there are multiple versions available). For detailed instructions on where to find it please send a PM (same reasons as for the second workaround).

The boot partition can be updated with a special bfb file, and the steps are officially described in BlueField BSP documentation.

Those cards don't have a BMC chip and instead run on a BMC simulator. You might soft-brick the card if you attempt to change any BMC parameters (like assigning an IP address). If you do that, you will need to install the latest DOCA that still supports that card (DOCA 1.4.0). You can install it with bfb-install. That will downgrade the EFI firmware and allow you to fix the BMC Simulator parameters.

TLDR: You need to prepare a bfb file that contains the bootloader and all the configuration files you want to change, and then you need to flash it from within Bluefield's OS using the /opt/mellanox/scripts/bfrec script.

Do not try to hexedit files manually; some of them are signed, and all of them have a CRC attached. In the event of a CRC mismatch, the system will not boot.
iv. BFB structure
You can use mlx-mkbfb script to make your own BFB or to extract existing ones.

Code:
mlx-mkbfb -x ./bf-bundle-2.10.0-147_25.01_ubuntu-22.04_prod.bfb
extracts the contents of the bf-bundle-2.10.0-147_25.01_ubuntu-22.04_prod.bfb file.

A script like that can be used to repack partitions into bfb file.

Current firmware has the following partitions that are mostly self-descriptive (if partitions have v1 and v2 - that is for bluefield-2 and bluefield-3, respectively):
  • bl2-cert-v1
  • bl2-cert-v2
  • bl2r-cert-v1
  • bl2r-v1
  • bl2-v0
  • bl2-v1
  • bl2-v2
  • bl31-cert-v1
  • bl31-cert-v2
  • bl31-key-cert-v1
  • bl31-key-cert-v2
  • bl31-v0
  • bl31-v1
  • bl31-v2
  • bl32-cert-v1
  • bl32-cert-v2
  • bl32-key-cert-v1
  • bl32-key-cert-v2
  • bl32-v0
  • bl33-cert-v1
  • bl33-cert-v2
  • bl33-key-cert-v1
  • bl33-key-cert-v2
  • bl33-v0
  • boot-acpi-v0
  • boot-args-v0
  • boot-args-v2
  • boot-desc-v0
  • boot-path-v0
  • capsule-v0
  • ddr_ate_dmem-v1
  • ddr_ate_dmem-v2
  • ddr_ate_imem-v1
  • ddr_ate_imem-v2
  • ddr-cert-v1
  • ddr-cert-v2
  • ddr_ini-v1
  • ddr_ini-v2
  • image-v0 – kernel
  • initramfs-v0
  • psc-app-v2
  • psc-bl-v2
  • psc-certs-v2
  • psc-fw-v2
  • snps_images-v1
  • snps_images-v2
  • trusted-key-cert-v1
  • trusted-key-cert-v2
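When poking around an extracted bundle, it helps to group the partitions by the card generation they target. The sketch below only applies the v1/v2 rule mentioned above and deliberately says nothing about v0; the function name is made up.

```shell
# Map a BFB partition name to the generation implied by its version suffix.
bfb_part_target() {
    case "$1" in
        *-v1) echo "bluefield-2" ;;
        *-v2) echo "bluefield-3" ;;
        *)    echo "unspecified" ;;
    esac
}
```

For example, `bfb_part_target ddr_ini-v1` prints "bluefield-2" and `bfb_part_target psc-fw-v2` prints "bluefield-3".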

All installation is done from within initramfs.

The main script is located in scripts/initrd-install, and it then calls the installation script for the OS.
a. Extracting firmware from the bf-bundle
There are sample scripts that will try to extract firmware updates from bf-bundle: GitHub - Civil/bfb-extract-fw

Scripts are simplistic and might fail if anything changes in the format. They will produce a few directories where firmware would be named by a combination of model and PSID for all cards supported by the bfb file.

That works with older DOCAs, like 1.3.0 as well, but you need to modify the script to point to the correct URLs and files.
v. Troubleshooting
a. Failed setting eswitch to offloads
Full message:
Code:
[  183.852908] mlx5_core 0000:03:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[  191.032911] mlx5_core 0000:03:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[  192.322688] mlx5_core 0000:03:00.1: mlx5_cmd_out_err:833:(pid 1249): CREATE_FLOW_GROUP(0x933) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x201c1c), err(-22)
[  192.340343] mlx5_core 0000:03:00.1: mlx5_rdma_enable_roce_steering:71:(pid 1249): Failed to create RDMA RX flow group err(-22)
[  192.356593] mlx5_core 0000:03:00.1: mlx5_rdma_enable_roce:164:(pid 1249): Failed to enable RoCE steering: -22
Code:
# That helps if you have strange errors in dmesg about ESWITCH
# PF_TOTAL_SF: maximum number of scalable functions you wish to configure for the given PF/ECPF. 252 is the max.
# PF_SF_BAR_SIZE: size of each SF at the BAR2. The size is in powers of 2 in KB. 12 seems to be the default.
# NUM_OF_PF: number of physical ports exposed to the host. NOTE: if you set it here to 0, no ports will be visible on the host and you'll need to log in to ARM to re-enable them
mlxconfig -d ${PCI_ID} s PF_TOTAL_SF=252 PF_SF_BAR_SIZE=12 NUM_OF_PF=${AMOUNT_OF_PORTS}
Source: 1, 2

b. BMC on vendor cards
In some cases, it is disabled because Vendor Field Mode is enabled. It can be reenabled from within the OS on the card: Vendor Field Mode
c. The card is not detected as a PCIe device.
You can attempt to recover the card by logging in to BMC's serial console. For that, you need to have either a working BMC (responding on web interface/API) or, if that is not an option (card in VFM), you can get a physical serial console over an NC-SI Connector. On production cards, those are Molex 5011892010, pico-clasp. Cables from various sources work fine with those cards.

TODO: Brick one of the cards I have and provide instructions on how to recover it, step-by-step.

5. MTUSB-1

This device is used to modify the Mellanox NICs as an option when putting them into flash recovery does not work. It accesses the NIC via the I2C interface.
This only seems to be necessary on some OEM CX5 NICs and the CX6 NICs. It doesn’t seem to be necessary on the CX7 NICs.

The device looks like the following:


Another pic of how the foot bone is connected to the knee bone!

You will need to get a couple of extra parts that don’t come with the kit.
That includes the gender changer shown below:


I also needed to rig the 3-pin connector that will go through the 3 holes on the NIC:


And another shot:

I originally thought that the green wire would go to the G for ground on the NIC... but the white wire goes to the G on the NIC!


Also, note that the order of the 3 holes on a CX5 NIC may be different from that on a CX6 NIC!

Please note that this did not work in VMware ESXi. I did get it to work in Linux and Windows.

What it looks like without the mtusb-1 connected:


With the mtusb-1 connected:



Now you can run any of the normal commands. You just need to specify the mtusb-1 as the device:



Some additional pics of the complete setup on a benchtop:

6. Useful links
Mellanox OFED cheat sheet - MLXOFED Cheatsheet
bfscripts/mlx-mkbfb at master · Mellanox/bfscripts
NVIDIA BlueField-2 Ethernet DPU User Guide
Configuring NVIDIA BlueField2 SmartNIC
39. NVIDIA MLX5 Ethernet Driver — Data Plane Development Kit 25.03.0 documentation
Levente Csikor – Medium - very good series of articles on working with Bluefield, however, it requires an account on Medium.
 

mha

Very nice post and summary.

I've got ConnectX-6 DX NICs in the MT2892 / MCX623106A series (Dell SKU). Looking at the PCBs they all look identical. However, looking at NVidia's resource page we quickly see the following sub models:

MCX623106AC-CDAT
MCX623106AN-CDAT
MCX623106AS-CDAT

I've booted the card into recovery mode and tried flashing MCX623106AC-CDAT to the card. While this works great and the card boots, the crypto feature is simply missing no matter using kernel drivers, DOCA or OFED.

As you seem to have some experience, and components on the cards do look identical down to the resistor, would conducting any cross flashing be able to enable the hw crypto offloading?

Original card data before flash:
Code:
Image type:            FS4
FW Version:            22.38.1002
FW Release Date:       3.8.2023
Part Number:           0F6FXM_08P2T2_Ax
Description:           Mellanox ConnectX-6 Dx Dual Port 100 GbE QSFP56 Network Adapter
Product Version:       22.38.1002
Rom Info:              type=UEFI version=14.31.20 cpu=AMD64,AARCH64
                       type=PXE version=3.7.201 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             XXX        8
Base MAC:              XXX            8
Image VSD:             N/A
Device VSD:            N/A
PSID:                  DEL0000000027
Security Attributes:   secure-fw
Default Update Method: fw_ctrl
Life cycle:            GA SECURED
Secure Boot Capable:   Enabled
EFUSE Security Ver:    0
Image Security Ver:    0
Security Ver Program:  Manually ; Disabled
KTLS status:

Code:
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
tls-hw-record: off [fixed]
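A minimal sketch for reading this kind of output, assuming it comes from `ethtool -k <interface>`. A `[fixed]` marker means the driver does not expose the feature as changeable, so `ethtool -K` cannot switch it on. The function name `tls_offload_report` is made up for this example.

```shell
# Classify kTLS offload features from `ethtool -k` output read on stdin.
# Prints one line per tls-hw-* feature: "<feature> <state> <changeable|fixed>".
tls_offload_report() {
    awk '/^[ \t]*tls-hw-/ {
        name = $1
        sub(/:$/, "", name)                       # drop the trailing colon
        kind = ($3 == "[fixed]") ? "fixed" : "changeable"
        print name, $2, kind
    }'
}

# Typical use on a live system (eth0 is a placeholder interface name):
#   ethtool -k eth0 | tls_offload_report

# Self-contained demo using the three lines quoted above:
printf '%s\n' \
    'tls-hw-tx-offload: off [fixed]' \
    'tls-hw-rx-offload: off [fixed]' \
    'tls-hw-record: off [fixed]' | tls_offload_report
```

If every line comes back as `off fixed`, toggling the feature from the OS will not help; the capability is not being advertised to the driver at all.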
Interestingly, despite cross-flashing the card, lspci is still showing the old card data. What does overwriting the VPD as you mention mean and how is it done? Would this help to completely transition the card to a "pure" mlx?

Code:
                Read-only fields:
                        [PN] Part number: 0F6FXM
                        [EC] Engineering changes: A05
                        [MN] Manufacture ID: 1028
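For reference, a small sketch that pulls the VPD fields out of `lspci -vv` output so they can be compared before and after a flash. It assumes the field lines look like the ones quoted above (`[XX] Label: value`); the helper name `vpd_fields` and the bus address in the usage comment are placeholders.

```shell
# Extract VPD fields from `lspci -vv` output on stdin as KEY=value lines.
vpd_fields() {
    sed -n 's/^[[:space:]]*\[\([A-Z0-9][A-Z0-9]\)\] [^:]*: \(.*\)$/\1=\2/p'
}

# Typical use (needs root to read VPD; 01:00.0 is a placeholder address):
#   lspci -vv -s 01:00.0 | vpd_fields

# Self-contained demo using the lines quoted above:
printf '\t\t[PN] Part number: 0F6FXM\n\t\t[EC] Engineering changes: A05\n\t\t[MN] Manufacture ID: 1028\n' | vpd_fields
```

Saving this output before and after a cross-flash makes it easy to see which fields the flash actually touched.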
 

Civiloid

Active Member
Jan 15, 2024
147
106
43
Switzerland
As you seem to have some experience, and the components on the cards look identical down to the resistor, could any kind of cross-flashing enable the hw crypto offloading?
I've actually never tried crypto offloading on my cards, but I assume that functionality can simply be fused out on non-crypto cards. In theory, flashing normally, flashing via recovery mode, and flashing via MTUSB are three different code paths, and with MTUSB you should be able to overwrite some of the read-only fields. However, I don't have a working MTUSB, and they are usually extremely expensive on eBay (there is one in a box for 350, which I think is unreasonable). In theory it should be possible to use a Diolan U2C (USB to I2C) instead, unless the MTUSB firmware was modified in an incompatible manner, but I haven't checked whether that works.
 

Freebsd1976

Active Member
Feb 23, 2018
418
76
28
Maybe using mlxconfig to compare the settings will give some hints?
 

mha

New Member
Feb 6, 2021
13
4
3
I've actually never tried crypto offloading on my cards, but I assume that functionality can simply be fused out on non-crypto cards. In theory, flashing normally, flashing via recovery mode, and flashing via MTUSB are three different code paths, and with MTUSB you should be able to overwrite some of the read-only fields. However, I don't have a working MTUSB, and they are usually extremely expensive on eBay (there is one in a box for 350, which I think is unreasonable). In theory it should be possible to use a Diolan U2C (USB to I2C) instead, unless the MTUSB firmware was modified in an incompatible manner, but I haven't checked whether that works.
Thank you! I think getting an external flasher would be taking things to extremes, especially as this is only about two cards :) I appreciate your input. Let me know if anything else comes to mind that I can do from the OS and/or by booting the card into recovery mode.



Maybe using mlxconfig to compare the settings will give some hints?
I have been looking through all possible settings there, to no avail. For instance, these two caught my eye, but changing them did not do anything:

TLS_OPTIMIZE False(0)
CRYPTO_POLICY UNRESTRICTED(1)

It would be nice to get a full config dump of a crypto-enabled card to verify, but unfortunately I don't have one at hand. I also did a full config reset after the cross-flash.
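If someone can share such a dump, a sketch like the following could compare it against a local one. It assumes both files were saved from a plain `mlxconfig -d <device> query`, where each parameter line starts with an upper-case name followed by its value; the function names and file names are hypothetical.

```shell
# Reduce an `mlxconfig query` dump to sorted "NAME VALUE" lines.
# Header lines ("Device type:", "Configurations:") are skipped because
# their first word is not an all-caps parameter name.
mlxcfg_normalize() {
    awk '$1 ~ /^[A-Z][A-Z0-9_]*$/ && NF >= 2 { print $1, $2 }' "$1" | sort
}

# Print the parameters whose values differ between two saved dumps.
# Returns 0 if the dumps agree and 1 if they differ (same as diff).
mlxcfg_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    mlxcfg_normalize "$1" > "$a"
    mlxcfg_normalize "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Typical use (file names are placeholders):
#   mlxconfig -d <device> query > dell_card.txt
#   mlxcfg_diff dell_card.txt crypto_card.txt
```

Lines prefixed with `<` come from the first dump and `>` from the second, so only the settings that actually differ between the two cards are shown.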
 

nasbdh9

Active Member
Aug 4, 2019
190
123
43
MBF2M345A-VENOT_ES_Ax
24.45.1016
mlnx-fw-updater_25.04-0.6.0.0_amd64
 

Civiloid

Active Member
Jan 15, 2024
147
106
43
Switzerland
MBF2M345A-VENOT_ES
24.44.1036
mlnx-fw-updater_25.01-0.6.0.0_amd64
Yeah, I haven't checked all of them, and it is not present in the latest LTS releases, but can be found in other files.

Not sure if mentioning that is a good idea in terms of prices on the cards though (right now they can be purchased for less than $150 in China or about $200 on eBay with international delivery, partially because the firmware is semi-working and hard to update).
 

nasbdh9

Active Member
Aug 4, 2019
190
123
43
Yeah, I haven't checked all of them, and it is not present in the latest LTS releases, but can be found in other files.

Not sure if mentioning that is a good idea in terms of prices on the cards though (right now they can be purchased for less than $150 in China or about $200 on eBay with international delivery, partially because the firmware is semi-working and hard to update).
If you really need to buy a BF2 DPU, the better choice at the moment is a Dell OEM card.
These models are fully functional; the only difference is the PSID.
Code:
DEL0000000033    0JNDCM_Dx    NVIDIA Bluefield-2 Dual Port 25 GbE SFP Crypto DPU
DEL0000000034    0PXDVR_Ax    NVIDIA Bluefield-2 Dual Port 100 GbE QSFP Crypto DPU
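To check which PSID and part number a card currently reports, a sketch along these lines can pull single fields out of `flint -d <device> query full` output. It assumes the `Label: value` layout shown in the flint dump earlier in the thread; the helper name `flint_field` is made up.

```shell
# Print the value of one field from `flint ... query full` output on stdin.
# Only the first match is used, because some labels (e.g. "Description")
# appear more than once in the output.
flint_field() {
    grep "^$1:" | head -n 1 | cut -d ':' -f 2- | xargs
}

# Typical use on a live system (<device> is a placeholder):
#   flint -d <device> query full | flint_field PSID

# Self-contained demo using lines from the dump quoted earlier:
printf '%s\n' \
    'Part Number:           0F6FXM_08P2T2_Ax' \
    'PSID:                  DEL0000000027' | flint_field PSID
```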
 
  • Like
Reactions: nexox and Civiloid

NablaSquaredG

Bringing 100G switches to homelabs
Aug 17, 2020
1,827
1,206
113

Code:
    declare -A PRODUCT_FAMILY_MAP=(
        ["EVB"]="MBF2-EVB"
        ["Camelot"]="(MBF2(H|M)5(26A|16A|25A|15A))|03GX455|SN37A98074"
        ["Atlantis"]="(MBF2(H|M)(322A|332A))|03GX454|SN37A98073"
        ["Arcadia OCP 3.0"]="MBF2M(922|912)A"
        ["Asgard"]="MBF2H51(5|6)B"
        ["BlueForce"]="BLUEFORCE_IPN"
        ["BlueSphere"]="(MBS2M512(A|C)|BS2M512A)"
        ["PRIS"]="699210020215"
        ["Camelantis"]="MBF2H5(1|3)2C"
        ["Aztlan"]="(MBF2(H51|M51|H53)6C)|0PXDVR|(30-100(299|300)-01)"
        ["OVH"]="SSN4MELX100200"
        ["Dell-Camelantis"]="0JNDCM"
        ["El-Dorado"]="MBF2M3(4|5)5A"
        ["Roy"]="6992100402(05|06|30|31)"
        ["ZAM"]="699140280000"
    )
    PART_NUMBER=$(bfhcafw flint q full | grep "Part Number:" | cut -f 2 -d ":" | xargs)

elif [ "$bfversion" = $BF3_PLATFORM_ID ]; then

    # map product family names to an appropriate regular expressions
    declare -A PRODUCT_FAMILY_MAP=(
        ["EVB"]="MBF3-(DDR4-EVB|EVB-SKT|EVB)"
        ["Moonraker"]="(900-9D3(B|C|L)|SN37B36732|SN37B75411|8217991|8225672|P66102)"
        ["Goldeneye"]="(900-9D3D|P66584)"
        ["Roy"]="699-21014-0230"
        ["Zhora"]="800-11012-0000-000"
    )
    PART_NUMBER=$(lspci -s "$(bfhcafw bus)" -vv | grep PN)
Nice codenames :D
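For anyone who wants to look up the codename for their own card, here is a rough sketch of the lookup idea. The excerpt does not show how the real script matches the part number against the map, so the unanchored `grep -E` match and first-match-wins order below are assumptions, and the table is only a three-entry subset of the map above. The function name `product_family` is hypothetical.

```shell
# Print the product family whose regex matches the part number in $1.
# Each table line is "family=regex"; the first matching line wins.
product_family() {
    while IFS='=' read -r name regex; do
        if printf '%s\n' "$1" | grep -Eq "$regex"; then
            printf '%s\n' "$name"
            return 0
        fi
    done <<'EOF'
Aztlan=(MBF2(H51|M51|H53)6C)|0PXDVR|(30-100(299|300)-01)
Dell-Camelantis=0JNDCM
El-Dorado=MBF2M3(4|5)5A
EOF
    return 1    # no family matched
}

product_family "MBF2M345A-VENOT_ES"   # prints: El-Dorado
product_family "0JNDCM_Dx"            # prints: Dell-Camelantis
```

So the cheap VENOT_ES cards discussed in this thread fall under "El-Dorado", and the Dell 25 GbE card under "Dell-Camelantis".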
 
  • Like
Reactions: Civiloid

Civiloid

Active Member
Jan 15, 2024
147
106
43
Switzerland
DEL0000000033 0JNDCM_Dx NVIDIA Bluefield-2 Dual Port 25 GbE SFP Crypto DPU
Yeah, I have a few of those. They are indeed rather cheap.


DEL0000000034 0PXDVR_Ax NVIDIA Bluefield-2 Dual Port 100 GbE QSFP Crypto DPU
Those I haven't seen for any reasonable price. I think the cheapest I saw was about $700. Compare that to the VENOT_ES, which costs 150 and does the job just well enough... :) Also, VENOT is mostly functional; the only problem I've noticed is the lack of a BMC.
 

Civiloid

Active Member
Jan 15, 2024
147
106
43
Switzerland
If you really need to buy a BF2 DPU, the better choice at the moment is a Dell OEM card.
My current use case for the BF2 1x200G NICs is to use them as cheaper, glorified ConnectX-6s. I currently use them as packet generators, where the x86-64 host runs TRex and loads a VPP test box. For my use case, it is better to have 1x200G (compared to 2x100G), as some of the NICs that I have are ConnectX-7s. I currently have about 20 of those cards, and so far, they are good enough for the task and, in general, were cheaper than the same quantity of CX6s.

I have some plans to offload part of the process to the NIC, but I doubt that even a P-series CPU would be enough. I plan to dig into that in the future, but that requires way more time & effort than digging into firmware :)
 

Civiloid

Active Member
Jan 15, 2024
147
106
43
Switzerland
Yes, this is strange, even cheaper than ConnectX-5 :oops:
Because of LLMs, ConnectX-5 prices are now at least 30% higher than before.
Yeah. For ConnectX-5 & 6, if you are patient enough, you can find them cheaper. Every couple of months or so, one or two NICs can be grabbed at auction (or buy-it-now) for less than $250 (I think I saw a CX5 2x100 go for as little as 180 and a CX6 2x100 for as low as 250).

But even then, some of the bf2s are cheaper.
 

Freebsd1976

Active Member
Feb 23, 2018
418
76
28
Many GPU clusters use InfiniBand, and the BF2 1x200G only supports ETH; maybe this is the reason.
Yes, this is strange, even cheaper than ConnectX-5 :oops:
Because of LLMs, ConnectX-5 prices are now at least 30% higher than before.
 

aosudh

Member
Jan 25, 2023
66
16
8
Many GPU clusters use InfiniBand, and the BF2 1x200G only supports ETH; maybe this is the reason.
BF2s are actually a ConnectX-6 VPI combined with ARM cores, so they also support InfiniBand. The only reason AI data centers (AIDC) don't use these engineering sample (ES) cards is that their total quantity is limited (only about 1,000 open-box units exist, manufactured by Advantech on behalf of others). Large data centers therefore don't think highly of them, while small ones don't have the technical capability to use them.
 
  • Like
Reactions: nexox

Civiloid

Active Member
Jan 15, 2024
147
106
43
Switzerland
so they also support InfiniBand
In hardware, yes. VENO is limited to ETH-only in firmware. And if you just cross-flash it with HECO/HESO firmware, the ARM cores won't boot anymore, because HECO/HESO cards have different RAM chips. Extra steps are required to make that work, but it's possible.