Anyone have experience installing ROCm?

Andrew_Carr

Member
Mar 18, 2022
37
11
8
Kinda desperate for help here from someone who knows what they're doing. I'm not great at Linux and this is my first foray into ROCm. I've run into a ton of headaches trying to get my recently purchased (used) MI100 working. I can't even tell through lspci whether it's alive or dead. I start from a fresh bare-metal Ubuntu install each time, but nothing seems to work. I followed the guide for Ubuntu 22.04 here (AMD Documentation - Portal) and a second guide, and everything appears to go fine until I reboot. Then I run into flashing black screens and errors like "__common_interrupt: 63.55 No irq handler for vector", or the "Oh no! Something has gone wrong" error screen on boot with a logout button (logout just refreshes the error). Here are my system specs and info:


I also tried this on an older Supermicro server, with no luck (that server runs MI25 GPUs just fine, though). Both servers had been running reliably for months/years until I tried messing around with this, so I think I can rule out other hardware issues.


So far the only relevant thing I've come across is this: "IOMMU Advisory for AMD Instinct™". I tried disabling IOMMU in my motherboard settings based on something I saw on another page, but it's not fixing anything. I believe I have SR-IOV enabled, but I guess I can try re-enabling IOMMU and updating my grub config. The other strange thing is that the MI100 doesn't show up in lspci or anywhere else that I can see. I figured this was due to my issues with the ROCm installation, but maybe the card is just dead (I just bought it)?
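
For reference, checking the grub/kernel side of the IOMMU setting could look roughly like the sketch below. It assumes the fix in the advisory is the `iommu=pt` kernel parameter, which is worth confirming against the advisory itself. `has_iommu_pt` is a made-up helper name, not something shipped with ROCm or Ubuntu. As far as I know, this setting affects driver behaviour and would not by itself explain a card that is missing from lspci.

```shell
#!/bin/sh
# Sketch: report whether a kernel command line contains iommu=pt.
# has_iommu_pt is a hypothetical helper, named here for illustration only.
has_iommu_pt() {
    # $1: kernel command line text (e.g. the contents of /proc/cmdline)
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel, if /proc/cmdline is readable on this machine.
if [ -r /proc/cmdline ]; then
    if has_iommu_pt "$(cat /proc/cmdline)"; then
        echo "iommu=pt is set on the running kernel"
    else
        echo "iommu=pt is not set on the running kernel"
    fi
fi
```

On Ubuntu the parameter would usually be added to GRUB_CMDLINE_LINUX in /etc/default/grub, followed by `sudo update-grub` and a reboot.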




Software
Ubuntu 22.04.2
ROCm 5.4.3

Hardware
AMD MI100
Supermicro H12SSL-i motherboard
AMD Epyc Milan 7B13 CPU (64 cores, rough equivalent of the 7713)
128GB SSD, installing to it from a USB drive
2400MHz memory
 

CyklonDX

Well-Known Member
Nov 8, 2022
834
272
63
lspci should show you the MI100 regardless; if it doesn't show the MI100/AMD card, then your system doesn't see the card. There's either something wrong with the card (test it somewhere else) or with your BIOS settings.
 

Andrew_Carr

Thanks, I was afraid of that. I tried out

lspci | grep -i display


before going to bed and saw nothing. According to their documentation, I should at least be seeing "Arcturus" or something. Guess the card is DOA. It was packaged pretty poorly, so who knows. I'll mess around a bit more, but I'm not sure what else to do. I could switch PCIe slots, I guess. BIOS settings are basically defaults, but I can reset those too. I found some more info on BIOS settings reading through their documentation. Also, I was finally able to get ROCm installed without corrupting my install: I installed ROCm without DKMS, and installed the kernel-mode drivers from the MI100 page before installing ROCm.
 

CyklonDX

Just run lspci | grep AMD

You should see something like this:
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] (rev 06)
 

Andrew_Carr

Yeah, I did that but then I had like 40 things returned due to the CPU and other unrelated things. Was trying to filter out the junk.
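
One way to filter out the junk is to match on the PCI vendor ID instead of the name. The EPYC's own devices report vendor ID 1022 (AMD), while Radeon/Instinct cards report 1002 (AMD/ATI), so matching on 1002 should drop the CPU entries. A minimal sketch, where `amd_gpu_lines` is a made-up helper name:

```shell
#!/bin/sh
# Sketch: keep only the lines of `lspci -nn` output whose vendor ID is 1002 (AMD/ATI).
# amd_gpu_lines is a hypothetical helper, named here for illustration only.
amd_gpu_lines() {
    # Reads `lspci -nn` text on stdin and matches the [vendor:device] tag.
    grep -i '\[1002:[0-9a-f]\{4\}\]'
}

# Run against the local machine only if lspci is installed.
if command -v lspci >/dev/null 2>&1; then
    # No match means no device with vendor ID 1002 was enumerated on the bus.
    lspci -nn | amd_gpu_lines || echo "no device with vendor ID 1002 found"
fi
```

`lspci -nn -d 1002:` should give the same result without grep. Other functions on the card (bridges, audio) can also carry vendor ID 1002, so a few extra lines next to the GPU itself are expected.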
 

Andrew_Carr

then just do grep Instinct instead.
Yeah, pretty sure there's something wrong with the card. I guess I can try it in a third computer, but I just changed my BIOS settings to all the recommended settings and switched PCI slots to the first slot on the motherboard and still nothing. Full lspci:


00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Milan Data Fabric; Function 7
01:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
01:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
02:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
02:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
02:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Starship USB 3.0 Host Controller
40:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
40:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
40:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
40:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
40:03.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
40:03.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
40:03.5 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
40:03.6 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
40:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
40:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
40:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
40:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
40:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
40:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
40:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
40:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
41:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller
42:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
43:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
44:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller
45:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
45:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
46:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
46:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
47:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
47:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
47:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
47:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Starship USB 3.0 Host Controller
48:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
49:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
80:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
80:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
80:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
80:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
80:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
80:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
80:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
80:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
80:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
80:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
81:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
81:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
82:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
82:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
c0:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
c0:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
c0:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
c0:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
c0:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
c0:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
c0:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
c0:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
c0:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
c0:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
c1:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
c1:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
c2:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
c2:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
 

CyklonDX

It doesn't look like it's there. AMD cards tend to take the 0a: address space.
This is what you will typically see:

08:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a0 (rev 06)
09:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a1
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] (rev 06)
 