Debugging Mellanox ConnectX-4 Lx failures


compuserve

New Member
May 30, 2026
This is the second card that has died on me. Both are used Dell OEM cards. The first was basically DOA right out of the box, but it would at least go into recovery mode. I returned that one after a recovery attempt failed to fix it (likely cracked BGA solder). The current failure is a new one: the card had been working for some months, then after a power-off restart it never came back, and it doesn't seem to be able to get into recovery mode at all.

What's the current wisdom for debugging failures? Anyone know what JP1, JP2, and JP3 are? JP1 looks like a serial header, either TTL/RS232 or SDA/SCL. JP2 looks like an actual jumper, on/off (recovery mode?). JP3 is near the PCI slot and I have no idea.

It does not show up on the PCI bus at all. I've combed through every device in lspci and it's not there in normal or recovery mode. I've had other cards die and they would always go into recovery eventually, but not this one. I tried the JP2 jumper with no change, and I checked for 3.3V serial data on JP1 but saw nothing (I tried swapping RX/TX, with and without JP2 fitted, and nothing changed).
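For anyone who wants to repeat the "is it on the bus at all" check without eyeballing lspci output, here is a rough sketch that walks sysfs and lists anything with the Mellanox PCI vendor ID (0x15b3). It assumes a Linux host with sysfs mounted in the usual place; the function name and the sysfs_root parameter are just illustrative. It filters on vendor ID only, so a card sitting in recovery mode under a different device ID should still be listed.

```python
from pathlib import Path

MELLANOX_VENDOR_ID = 0x15B3  # PCI vendor ID assigned to Mellanox Technologies


def find_mellanox_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return (address, vendor_id, device_id) for every Mellanox PCI function.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real Linux host the default is the usual path.
    """
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found  # not Linux, or sysfs not mounted here
    for dev in sorted(root.iterdir()):
        try:
            # sysfs stores these as hex strings such as "0x15b3"
            vendor = int((dev / "vendor").read_text().strip(), 16)
            device = int((dev / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries that cannot be read or parsed
        if vendor == MELLANOX_VENDOR_ID:
            found.append((dev.name, vendor, device))
    return found


if __name__ == "__main__":
    hits = find_mellanox_devices()
    if not hits:
        print("No Mellanox devices enumerated on the PCI bus")
    for address, vendor, device in hits:
        print(f"{address}: vendor={vendor:#06x} device={device:#06x}")
```

An empty result here matches what lspci shows: the card is not enumerating at all, which points away from the OS or driver and toward the card itself or the slot.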

I doubt it's a firmware issue but I'm going to read the contents directly from the flash chip to make sure.
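When reading a flash chip directly, one common sanity check is to read it twice and compare the results, and to make sure the dump is not just all 0xFF or all 0x00 (which usually means a bad clip contact or an erased chip). A minimal sketch of that check follows. The helper names are made up for illustration, and it assumes the two reads have already been saved to files by whatever programmer is in use.

```python
import hashlib
import sys


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a flash dump."""
    return hashlib.sha256(data).hexdigest()


def looks_blank(data: bytes) -> bool:
    """True if the dump is empty, all 0xFF, or all 0x00."""
    size = len(data)
    return size == 0 or data.count(0xFF) == size or data.count(0x00) == size


def reads_agree(first: bytes, second: bytes) -> bool:
    """True if two non-empty reads of the same chip are byte-identical."""
    return len(first) > 0 and sha256_hex(first) == sha256_hex(second)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (file names are whatever the two reads were saved as):
    #   python check_dump.py read1.bin read2.bin
    with open(sys.argv[1], "rb") as f1, open(sys.argv[2], "rb") as f2:
        a, b = f1.read(), f2.read()
    print("sha256 read 1:", sha256_hex(a))
    print("sha256 read 2:", sha256_hex(b))
    print("reads agree:  ", reads_agree(a, b))
    print("looks blank:  ", looks_blank(a))
```

Two matching, non-blank reads only show that the dump is a stable copy of what is on the chip. They say nothing about whether the firmware image itself is valid.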

I feel like this might be a power supply problem. I'm in the process of setting up a test rig so I can check voltages on the card. I know about the BGA cracking these cards suffer from heat cycling. I could try a reflow or a re-ball, but those aren't high on my list of fun things because the amount of heat required is likely to kill the ASIC. Any other ideas?

FWIW, I keep them cool; I know about that failure point. The two cards didn't run in the same system, so an external PSU or similar issue is unlikely. Looking for guidance.

I have another on order but I like rebuilding broken electronics if it's possible.

Attachment: card1.jpg

TRACKER

Active Member
Jan 14, 2019
Did you have proper cooling on these? They run pretty hot without good airflow, around 105-115°C.