Linux gets fatal PCIe error with HBA card


octogonapus

New Member
Jul 8, 2022
3
0
1
I've been troubleshooting a fatal PCIe error on Linux that occurs when booting with an HBA card installed. I have gone through a lot of debugging steps; there's too much to dump here, so here's the cliff notes:

- My goal is to run Proxmox. I've run it on this system before, just without the HBA card.
- I have an R720 server.
- The HBA card is a NetApp 111-00341+F2.
- I'm using the latest Proxmox installer (Linux version 5.15.30).
- Someone else has used this card on Ubuntu 18.04.2 successfully (on a different server).
- There is a thread here linked from here where someone claiming to be a NetApp dev states the cards should work in FreeBSD and Linux.

The PCIe error I get when booting Linux is:
Code:
PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
    device [8086:0e04] error status/mask=00004000/00318000
Unfortunately, I can't find any instances of this error on search engines. Plenty of similar errors, but the exact error severity and error type are critical here.
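For reference when reading the status/mask pair, here is a minimal Python sketch that decodes the two hex values against the AER Uncorrectable Error Status bit layout. The bit names follow the PCIe AER definitions as mirrored in the Linux kernel's pci_regs.h; the function name decode_aer_uncorrectable and the regular expression are hypothetical helpers written for this post, not part of any tool.

```python
import re

# AER Uncorrectable Error Status/Mask bit positions (subset), following the
# PCI_ERR_UNC_* definitions in the Linux kernel's pci_regs.h.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncorrectable(value):
    """Return the names of all bits set in a 32-bit status or mask value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits missing from the table are reported by number, not guessed.
            names.append(UNCORRECTABLE_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


# Second line of the error message quoted above.
LINE = "device [8086:0e04] error status/mask=00004000/00318000"

# Hypothetical pattern: vendor:device ID, then the status and mask words.
match = re.search(
    r"\[([0-9a-f]{4}):([0-9a-f]{4})\] error status/mask=([0-9a-f]{8})/([0-9a-f]{8})",
    LINE,
)
vendor, device = match.group(1), match.group(2)
status = int(match.group(3), 16)
mask = int(match.group(4), 16)

print(f"reporting device {vendor}:{device}")
print("status bits set:", decode_aer_uncorrectable(status))
print("mask bits set:  ", decode_aer_uncorrectable(mask))
```

Under that layout, the status value 00004000 has only bit 14 set (Completion Timeout), and the mask 00318000 covers Completer Abort, Unexpected Completion, Unsupported Request and ACS Violation. Vendor ID 8086 is Intel, so the device reporting the error is on the platform side of the link rather than the HBA itself.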

OSes and configs I've tried:
- Proxmox: boots into the above fatal PCIe error.
- Ubuntu 22.04 Desktop: boots into the above fatal PCIe error.
- Ubuntu 16.04.7 Desktop: installer can't boot; kernel panic and PCI13120 error on the front panel screen.
- Arch installer v20220701: boots into the above fatal PCIe error.
- Arch was the fastest to boot, so I also tried it with these pci= kernel options (none of them booted successfully): conf1, conf2, nommconf, noearly.
- FreeBSD 13.1 installs and boots successfully. All disks connected to the HBA appear. I didn't try much else with FreeBSD but it seems to work.

I think that FreeBSD working demonstrates that the server and card are both working and compatible. Now the question is, what's different between how FreeBSD and Linux handle PCI? What other dials do I have to tweak Linux's PCI behavior? In general, how do I proceed debugging this problem?
 

vl1969

Active Member
Feb 5, 2014
634
76
28
The reason is that the Dell R720 is very picky about which controllers it supports. I had a chance to get 2 R720 and R730 servers for almost free a few years back and passed on them specifically due to the HBA support on Linux.
Even with Windows you may have an issue with some cards.
OP needs to find a Dell PERC card compatible with those machines, but if flashed to IT mode those too may be problematic, maybe not as bad but still...

It's a Dell thing.

PS: try finding a manual for the server and check which slot you should plug the HBA into. Sometimes that also makes a difference.
 

vl1969

According to my research it should work. OP should find a flashed Dell PERC HBA. I have seen a few on eBay, something like a flashed H310 or H710.
 

octogonapus

I did end up giving up and grabbing a Dell PERC H810. Flashed it to LSI FW. It's a shame but probably not worth spending the time to debug why Linux is unhappy when FreeBSD works just fine.