SXM5 (H100) over PCIE Nightmare


MildHotSauce

New Member
Mar 7, 2023
24
8
3
So as some of you know, I previously built an SXM2-over-PCIe board with four V100 GPUs, which @CyklonDX initially theorized and helped a lot with, and it was a success. Now I've gone all in and bought an H100 on the secondary market (eBay) and an SXM5-to-PCIe adapter from a reputable source, Jungle on Fiverr. I can get lspci to show the H100 as a 3D controller, but that is it. I tried several combinations of drivers with no luck. Does anyone have any insight into how NVIDIA handles its H100 drivers? I tried the server-class ones that are available, also with no luck. Do you need NVIDIA Enterprise?
 

MildHotSauce

System Setup:
  • GPU: NVIDIA H100 (GH100) on an SXM5-to-PCIe adapter
  • Motherboard: Pro WS WRX90E-SAGE SE
  • PCIe Slot: Using a PCIe 5.0 x16 slot
  • Power Configuration: Adapter card powered by two 8-pin CPU power connectors (per Jungle's recommendations)
  • Operating System: Ubuntu 24.04 LTS
  • Kernel Version: 6.11.0-17-generic
  • NVIDIA Driver Version: 565.57.01 (server open variant installed)
Steps Taken to Diagnose and Fix Issues:
PCIe and BAR Configuration:
  • Checked PCIe Link Speed and Width:
    • Initially forced PCIe Gen 3, later changed back to PCIe Gen 5 in BIOS.
    • lspci -vv -s e1:00.0 | grep -i "LnkSta" confirmed PCIe Gen 5 (32GT/s, x16).
  • Checked Resizable BAR Support:
    • Resizable BAR enabled in BIOS.
    • BAR2 was detected at 128GB, but BAR1 was stuck at 4GB.
  • Tried to Resize BAR1:
    • Attempted to manually modify BAR1 settings using setpci, but received errors.
    • Attempted to modify /sys/bus/pci/devices/0000:e1:00.0/resource1_resize, but got permission denied.
  • Checked for Above 4G Decoding Option in BIOS:
    • Could not find this option in BIOS.
  • Updated BIOS
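For anyone reproducing the BAR checks above, here is a minimal sketch (not something from the original post) that reads the kernel's per-device `resource` file in sysfs and prints the size of each populated region. It assumes the usual three-column hex layout (start, end, flags) and reuses the e1:00.0 address from this post; `parse_bar_sizes` and `human` are hypothetical helper names. The index is the kernel's region number, which may not line up with how NVIDIA numbers BAR0/BAR1/BAR2.

```python
from pathlib import Path


def parse_bar_sizes(resource_text):
    """Return {region_index: size_in_bytes} for every populated region.

    Assumes each line holds three hex fields: start, end, flags.
    """
    sizes = {}
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # region not assigned by firmware or kernel
        sizes[index] = end - start + 1
    return sizes


def human(size):
    """Format a byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:g} {unit}"
        size /= 1024


if __name__ == "__main__":
    # Address taken from the post; adjust for your own system.
    path = Path("/sys/bus/pci/devices/0000:e1:00.0/resource")
    if path.exists():
        for index, size in parse_bar_sizes(path.read_text()).items():
            print(f"region {index}: {human(size)}")
```

Reading this file does not need root, so it is a quick way to confirm what the firmware actually assigned before trying to resize anything.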
GPU Detection & Driver Issues:
  • Confirmed GPU is Detected by PCIe Bus:
    • lspci -nn | grep -i nvidia shows the H100 is detected as a 3D controller.
  • Checked NVIDIA-SMI:
    • nvidia-smi does not recognize the H100.
    • Later, nvidia-smi shows only the RTX 3090, but not the H100.
  • Checked dmesg Logs:
    • dmesg | grep -i nvidia showed errors in initializing the H100, specifically:
      • nvidia-drm: Failed to allocate NvKmsKapiDevice
      • nvidia-drm: Failed to register device
  • Reinstalled NVIDIA Driver:
    • Installed nvidia-driver-565-server-open (tried both open and proprietary).
    • No change; the H100 still fails to initialize.
  • Checked Kernel Modules:
    • lsmod | grep nvidia confirms the NVIDIA drivers are loaded.
    • dmesg logs still show the H100 is failing to initialize.
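One caveat on the lsmod check: with the RTX 3090 in the same machine, the nvidia modules will be loaded whether or not they bound to the H100. A hedged sketch, assuming the standard `lspci -k` text layout (a device line followed by indented detail lines), that reports which kernel driver is bound to one specific slot. `kernel_driver_in_use` is a hypothetical helper name, and the slot address is the one from this post.

```python
import re
import shutil
import subprocess

_DRIVER_RE = re.compile(r"\s+Kernel driver in use:\s*(\S+)")


def kernel_driver_in_use(lspci_output, slot):
    """Return the driver bound to `slot` (e.g. "e1:00.0"), or None."""
    current = None
    for line in lspci_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line starts a new device; first token is its address.
            current = line.split()[0]
            continue
        if current is None:
            continue
        # Accept addresses printed with or without the PCI domain prefix.
        if current == slot or current.endswith(":" + slot):
            match = _DRIVER_RE.match(line)
            if match:
                return match.group(1)
    return None


if __name__ == "__main__":
    if shutil.which("lspci"):
        output = subprocess.run(
            ["lspci", "-k"], capture_output=True, text=True, check=False
        ).stdout
        driver = kernel_driver_in_use(output, "e1:00.0")
        print(driver or "no kernel driver bound to e1:00.0")
```

If this prints no driver for the H100 while the 3090 shows `nvidia`, the modules loaded but did not attach to the card, which matches the symptoms described here.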
Power and PCIe Errors:
  • Checked Power Status:
    • Used nvidia-smi to check power draw; however, the H100 does not show up.
    • Tried checking cat /sys/bus/pci/devices/0000:e1:00.0/power_state, which reports D0 (fully on), though the card could still be only partially powered?
  • Checked PCIe Error Logs:
    • dmesg | grep -i pcie showed PCIe error containment messages.
    • dmesg also contained DL_ActiveErr and Hardware PCIe Errors.
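To put numbers on the PCIe errors rather than grepping dmesg, the kernel also exposes per-device AER counters in sysfs when the device and platform support Advanced Error Reporting. This is a rough sketch, assuming the `aer_dev_correctable`, `aer_dev_nonfatal` and `aer_dev_fatal` files exist for this device and use the usual "name count" line format; `nonzero_aer_counters` is a hypothetical helper name.

```python
from pathlib import Path


def nonzero_aer_counters(text):
    """Parse 'Name count' lines and keep only counters above zero."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            count = int(parts[1])
            if count:
                counters[parts[0]] = count
    return counters


if __name__ == "__main__":
    # Address taken from the post; the upstream root port is worth checking too.
    device = Path("/sys/bus/pci/devices/0000:e1:00.0")
    for name in ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal"):
        stats = device / name
        if stats.exists():
            print(name, nonzero_aer_counters(stats.read_text()))
        else:
            print(name, "not available")
```

Counters that keep climbing at Gen 5 but stay flat at Gen 3 would point toward signal integrity through the adapter rather than the driver.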
BIOS Settings Checked & Modified:
  • Enabled Resizable BAR
  • Set PCIe link speed manually (tried Gen 3 and Gen 5)
  • Checked various PCIe and NBIO settings (Hot Plug Handling, ARI Support, etc.)
  • Enabled SR-IOV
  • Looked for Above 4G Decoding but couldn't find it
  • Checked BIOS PCI Subsystem Settings
Current Issues:
  1. H100 is detected by PCIe but fails to initialize in nvidia-smi
  2. Driver fails to allocate resources for the H100
  3. BAR1 is stuck at 4GB despite the BIOS supporting Resizable BAR (BAR2 now shows 128GB?)
  4. dmesg shows PCIe error containment logs (possible PCIe instability)
  5. Power status of the H100 is unclear; can't confirm if it's drawing sufficient power
Questions for the Forum:
  1. How can I force BAR1 to resize beyond 4GB?
  2. Is there a way to verify if the H100 is getting enough power?
  3. What other BIOS settings should I check for compatibility (e.g., Above 4G Decoding, MMIO, PCIe settings)?
  4. Has anyone successfully run an H100 on an SXM5-to-PCIe adapter, and if so, what steps did you take?
  5. What could be causing the NVIDIA driver to fail at initializing the H100?
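On question 1: as the kernel documentation describes it, reading `resourceN_resize` returns a hex bitmask of supported sizes (bit 0 = 1MB, bit 1 = 2MB, and so on), and writing a bit position requests that size. Writes need root, and the documentation says all drivers must be unbound from the device first. A small sketch that decodes the mask and computes the value to write, with `supported_bar_sizes` and `bit_for_size` as hypothetical helper names. Whether the resize then succeeds on this board and adapter is untested.

```python
MIB = 1 << 20


def supported_bar_sizes(bitmask_hex):
    """Map each set bit in the resize bitmask to its BAR size in bytes."""
    mask = int(bitmask_hex.strip(), 16)
    return {
        bit: 1 << (bit + 20)
        for bit in range(mask.bit_length())
        if (mask >> bit) & 1
    }


def bit_for_size(size_bytes):
    """Return the bit position to write for a power-of-two size >= 1 MiB."""
    if size_bytes < MIB or size_bytes & (size_bytes - 1):
        raise ValueError("size must be a power of two and at least 1 MiB")
    return size_bytes.bit_length() - 1 - 20


if __name__ == "__main__":
    # Example mask only; read the real one from resource1_resize.
    for bit, size in supported_bar_sizes("00000000000001c0").items():
        print(f"bit {bit}: {size // MIB} MiB")
    print("value to request 128 GiB:", bit_for_size(128 << 30))
```

If the mask read from the H100 does not include a size larger than 4GB, no amount of writing to the file will get past it.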
 

gsrcrxsi

Active Member
Dec 12, 2018
423
144
43
ok, i didn't see that it had been modified/upgraded to allow 12v input. i know on his listings he says that 48v was needed.

but about the drivers, did you try using the proprietary drivers from nvidia and not the drivers from the ubuntu repositories? I always use the runfile installation (not with H100 though).
 

gsrcrxsi

some things to check. just spitballing some general troubleshooting things.

BIOS:
-look again in all the menus for the Above 4G setting. seems Asus TR boards usually have this setting in the PCI Subsystem settings menu. check if there are any other submenus in this menu where it might be hiding. but check around other menus too
-are you on the latest BIOS version for this motherboard? maybe there's a bug in this BIOS that's hiding your above 4g setting from being set. you might ask ASUS support to either point you to the location, or file a bug report with them to fix the bios and enable this setting. (maybe a longshot)
-try toggling the IOMMU settings also
-you said you enabled SR-IOV, what if you disable it?


are you using another GPU to run the display? if so which one and is it compatible/supported by the drivers you're attempting to install?
 

gsrcrxsi

yeah i tried the .run file after a good purge, still no go.
no-go in which sense?

the drivers would not install? or they installed but wouldn't initialize the GPU after a reboot?

have you tried older driver variants? i've had random issues getting certain drivers to load under certain kernel configurations, and i'm not sure how many updates the 565 branch is getting anymore.

try both 550.144 and the latest 570.124
 
Last edited:

MildHotSauce

The PCI Subsystem menu only has Resizable BAR support and SR-IOV support. Turning SR-IOV on and off makes no difference. I've tried poking around, and I just asked ASUS where Above 4G Decoding is; they are escalating my question and will get back to me in 2 days or so. I will look at the IOMMU settings. The BIOS is updated to the latest version.
 

gsrcrxsi

oh actually i think i might have found your exact issue. i was skimming through the release notes of the drivers:


This version of the GPU driver will fail to initialize on systems with Hopper GPUs subrevision = 3 and VBIOS versions older than 96.00.68.00.xx. Please ensure the system is using a VBIOS version 96.00.68.00.xx or newer before upgrading to this version of the driver.
i would try the 550.xx branch drivers, as they don't seem to have this problem listed.
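The release note boils down to a version comparison. A minimal sketch, assuming the dotted VBIOS fields are hexadecimal (the letters that show up in some NVIDIA VBIOS strings suggest so, but that is an assumption) and that only the first four fields matter, since the note writes the last one as xx. `vbios_key` and `meets_minimum` are hypothetical names. Reading the installed version normally needs a driver that already initializes the card, for example `nvidia-smi --query-gpu=vbios_version --format=csv,noheader` under a branch that works.

```python
def vbios_key(version):
    """Turn '96.00.68.00.xx' into a comparable tuple of the first four fields.

    Assumption: each field is a hexadecimal number.
    """
    fields = version.strip().split(".")
    if len(fields) < 4:
        raise ValueError(f"expected at least four dotted fields: {version!r}")
    return tuple(int(field, 16) for field in fields[:4])


def meets_minimum(version, minimum="96.00.68.00"):
    """True if `version` is at or above the minimum from the release note."""
    return vbios_key(version) >= vbios_key(minimum)


if __name__ == "__main__":
    # Hypothetical version strings, for illustration only.
    for candidate in ("96.00.5E.00.01", "96.00.68.00.05", "96.00.74.00.01"):
        status = "ok" if meets_minimum(candidate) else "too old for this driver"
        print(candidate, "->", status)
```

This only tells you which side of the cutoff a card is on; it does not help with obtaining a newer VBIOS.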
 

MildHotSauce

I've tried both open and proprietary. I'm really picking them at random now; it's like whack-a-mole at this point, seeing if one works. I have to use an additional GPU because my Threadripper doesn't have integrated graphics. Looking at the NVIDIA driver descriptions, the driver should support both the 3090 and the H100. The drivers install but do not initialize the H100.
 

MildHotSauce

oooh i'll give that a go
 

gsrcrxsi

you can try the newer ones too *(NVL and then PCIE)
the new drivers have the same Hopper initialization issue.

the only way to use these GPUs with the latest drivers looks to be updating the VBIOS to a newer version. it might be difficult to get the new VBIOS for an H100 SXM5 model. I'm going to guess that's only something you can get through some kind of support contract. (it's not available in the TechPowerUp (TPU) VBIOS database)
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,572
530
113
This could be a BAR size limit on your motherboard.

You can only modify the BAR size at POST. Can you try Windows?
Oh, and try removing the 3090 (there might be conflicts between architectures).

You may also want to try disabling ReBAR.
 

MildHotSauce

I can't go without the 3090; this Threadripper doesn't have onboard graphics. I think the motherboard should be able to handle it, it's a server-class board (WRX90E-SAGE).
 

aosudh

Member
Jan 25, 2023
66
16
8
At what price did you purchase your H100? If you bought the card for far less than the market price, you can assume it has problems of some kind. I once handled more than 500 disassembled A100 SXM4 chips and their matching baseboards. Most of them had been disassembled because they had problems to varying degrees, especially the ones sold at extremely low prices.
 

aosudh

Generally speaking, it is rare for these GPUs to run into software or hardware incompatibility issues. When such issues do come up and remain unresolved for a long time, it is advisable to suspect that the problem lies with the GPU itself.
 

gsrcrxsi

tried both drivers, you can see the H100 up top there from the previous lspci command

why such an old 550 version? 550.54.15 is about a year old at this point. not sure if it matters, but i would stick to the latest version from that branch to avoid the possibility of edge-case problems like kernel incompatibilities.

and was this an open source version or the proprietary version? i would stick to the proprietary one for all attempts.
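After several purges and reinstalls it is easy to lose track of which build is actually loaded. A hedged sketch that reads `/proc/driver/nvidia/version` and reports the version and whether it looks like the open kernel module. It assumes the first line starts with "NVRM version:" and that the open flavor contains the words "Open Kernel Module"; both match what I have seen, but treat them as assumptions. `parse_nvrm_version` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Driver versions look like 550.144.03 or 565.57.01: three digits, then dots.
_VERSION_RE = re.compile(r"\s(\d{3}\.\d+(?:\.\d+)?)(?:\s|$)")


def parse_nvrm_version(text):
    """Return (version, is_open) from the NVRM line, or None if not found."""
    for line in text.splitlines():
        if not line.startswith("NVRM version:"):
            continue
        match = _VERSION_RE.search(line)
        if match is None:
            return None
        return match.group(1), "Open Kernel Module" in line
    return None


if __name__ == "__main__":
    path = Path("/proc/driver/nvidia/version")
    if path.exists():
        print(parse_nvrm_version(path.read_text()))
    else:
        print("nvidia kernel module not loaded")
```

Posting this output alongside the dmesg lines makes it unambiguous which branch and flavor each test was run with.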