Kinda desperate for help here from someone who knows what they're doing. I'm not great at Linux and this is my first foray into ROCm. I've run into a ton of headaches trying to get my recently purchased (used) MI100 working. I can't even tell from lspci whether it's alive or dead. I start from a fresh bare-metal Ubuntu install each time, but nothing seems to work. I followed the guide here for Ubuntu 22.04 (AMD Documentation - Portal) and here ( ), and everything appears to go fine until I reboot. Then I run into flashing black screens and errors like "_common_interrupt:63.55 no irq handler for vector", or the "Oh no! Something has gone wrong" error screen on boot with a logout button (logging out just refreshes the error). Here's my system specs and info:
I also tried this on an older Supermicro server with no luck either (that server runs MI25 GPUs just fine, though). Both servers had been running reliably for months/years until I started messing around with this, so I think I can rule out other hardware issues.
So far the only relevant thing I've come across is this: IOMMU Advisory for AMD Instinct™. I tried disabling IOMMU in my motherboard settings based on something I saw on another page, but it didn't fix anything. I believe I have SR-IOV enabled, and I guess I can try re-enabling IOMMU and updating my GRUB config. The other strange thing is that the MI100 doesn't show up in lspci or anywhere else that I can see. I figured this was down to my issues with the ROCm installation, but maybe the card is just dead (I just bought it)?
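In case it helps anyone reproduce the two checks mentioned above (is the card visible on the PCI bus, and what IOMMU options the kernel actually booted with), here is a rough Python sketch. It assumes `lspci` from pciutils is installed and that `/proc/cmdline` is readable. The function names `find_amd_gpu_devices` and `iommu_settings` are made up for illustration. It filters on PCI vendor ID 1002 (AMD/ATI graphics devices); it does not check for a specific MI100 device ID. Whether a value like `iommu=pt` is the right one for this board is an assumption to verify against the IOMMU advisory, not something this script decides.

```python
import os
import re
import shutil
import subprocess

# PCI vendor ID used by AMD/ATI graphics and accelerator devices.
AMD_GPU_VENDOR_ID = "1002"


def find_amd_gpu_devices(lspci_output):
    """Return the lines of `lspci -nn` output tagged [1002:xxxx]."""
    pattern = re.compile(r"\[" + AMD_GPU_VENDOR_ID + r":[0-9a-fA-F]{4}\]")
    return [line for line in lspci_output.splitlines() if pattern.search(line)]


def iommu_settings(cmdline):
    """Return kernel command-line tokens that mention the IOMMU."""
    return [token for token in cmdline.split() if "iommu" in token]


if __name__ == "__main__":
    # Check 1: does any AMD GPU/accelerator enumerate on the PCI bus?
    if shutil.which("lspci"):
        result = subprocess.run(["lspci", "-nn"], capture_output=True, text=True)
        devices = find_amd_gpu_devices(result.stdout)
        if devices:
            print("AMD GPU-class devices found:")
            for line in devices:
                print("  " + line)
        else:
            print("No device with vendor ID 1002 in lspci output.")
    else:
        print("lspci not found; install pciutils first.")

    # Check 2: which IOMMU options did the running kernel boot with?
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as f:
            settings = iommu_settings(f.read())
        print("IOMMU kernel parameters:", settings or "none set")
```

If the first check prints nothing for vendor ID 1002, the card is not enumerating at the PCI level at all, which would point at the slot, power, BIOS settings, or the card itself rather than at the ROCm install.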
Software
Ubuntu 22.04.2
ROCm 5.4.3
Hardware
AMD MI100
Supermicro H12SSL-i motherboard
AMD EPYC Milan 7B13 CPU (64 cores, roughly equivalent to the 7713)
128GB SSD, installing to it from a USB drive
2400MHz memory