ESXi/vSphere 8 - VM hangs on startup with NVIDIA T4 passthrough


rbdrbls

New Member
Jan 25, 2023
Hi guys,

I've been banging my head against a wall trying to get an NVIDIA Tesla T4 passthrough-enabled VM to boot. I have two ESXi hosts in a vSphere 8.0.0 setup (Enterprise Plus), each with one T4 card. These systems each previously ran a Quadro P620 via passthrough without issues, but moving to the T4 has been nothing but trouble. :(

Either ESXi host boots properly and recognizes the card, and I am able to enable passthrough on it in the vSphere UI as well as add it to a VM configuration. However, when I try to start the VM (on either host), it hangs at 88% and eventually errors out. vmware.log for the VM shows:

2023-01-25T19:27:18.723Z In(05) vmx - MX: init lock: rank(PCIPassLCK_0)=0x3e7 lid=26
2023-01-25T19:30:27.731Z In(05) vmx - AH Failed to find a suitable device for pciPassthru0
2023-01-25T19:30:27.731Z In(05) vmx - Module 'DevicePowerOn' power on failed.
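
For anyone else who hits that "Failed to find a suitable device" message: as I understand it, it means the VMX could not match the requested passthrough device against what the host currently exposes. A couple of host-side sanity checks from the ESXi shell (a sketch; 0000:65:00.0 is my card's address from the output further down, and I believe the pcipassthru namespace exists on 7.0 and later):

esxcli hardware pci pcipassthru list
grep -i "0000:65:00.0" /var/log/vmkernel.log

The first should show the device flagged as enabled for passthrough; the second often carries a more specific reason for the power-on failure than vmware.log does.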


Some more things:
  • The VM is set to boot via EFI and boots up fine without the GPU passthrough device added - stock Ubuntu 22.04 install.
  • I've tried both DirectPath I/O and Dynamic DirectPath I/O to pass the card through; no difference.
  • Embedded virtualization is not enabled in the VM.
  • All VM memory is reserved.
  • I have also tried both enabling and disabling the IOMMU in the VM (under CPU).
  • Tried autodetecting the video card, and manually specifying it.
  • Have rebooted the hosts numerous times after enabling passthrough.
I've also tried the below config parameters in the .vmx in varying combinations, with no success:

pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32" (as the card has 16 GB of memory)
pciPassthru0.msiEnabled = "FALSE"
hypervisor.cpuid.v0 = "FALSE"
svga.guestBackedPrimaryAware = "FALSE" (seems to default to TRUE)
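
For what it's worth, my understanding of the MMIO sizing rule (from VMware's guidance for GPUs with large BARs - treat this as a sketch of the math rather than gospel) is that 64bitMMIOSizeGB should be a power of two at least as large as the total memory of all passed-through GPUs, with some guides suggesting you double it. For a single 16 GB T4 that works out to:

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "32"

So 32 should already be sufficient here; two T4s in one VM would need 64.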


The host systems are each a Supermicro SuperServer 5019D-FN8TP running an up-to-date BIOS (v1.8), and this model is listed as supporting the T4 according to Qualified Platform List for GPUs | Supermicro. I do have the GPU plugged into an x16 riser, which converts it to the x8 PCIe slot on the motherboard, but the T4 spec sheet says it supports PCIe 3.0 x8 and x16, so I didn't think this would be an issue.
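
One thing worth double-checking in the BIOS is that "Above 4G Decoding" (sometimes called "Memory Mapped I/O above 4GB") is enabled, since the large 64-bit BARs on datacenter GPUs generally need to be mapped above 4 GB. A quick way to look for BAR/MMIO mapping complaints from the ESXi shell (a sketch using my card's address):

grep -i "65:00.0" /var/log/vmkernel.log | grep -iE "bar|mmio"

If the host can't map the BARs, power-on will fail even though the card otherwise enumerates fine.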

BIOS settings: [screenshot attached]

The GPU shows up in the vSphere UI as follows: [screenshots attached]

The GPU shows up fine on the host via esxcli hardware pci list -c 0x300 -m 0xff:

0000:65:00.0
Address: 0000:65:00.0
Segment: 0x0000
Bus: 0x65
Slot: 0x00
Function: 0x0
Vendor Name: NVIDIA Corporation
Device Name: TU104GL [Tesla T4]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x1eb8
SubVendor ID: 0x10de
SubDevice ID: 0x12a2
Device Class: 0x0302
Device Class Name: 3D controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0b
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x3001
Module ID: 45
Module Name: pciPassthru
Chassis: 0
Physical Slot: 7
Slot Description: CPU SLOT7 PCI-E 3.0 X8
Device Layer Bus Address: s00000007.00
Passthru Capable: true
Parent Device: PCI 0:100:0:0
Dependent Device: PCI 0:101:0:0
Reset Method: Bridge reset
FPT Sharable: true
NUMA Node: 0
Hardware Label:
Virtual Function:
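
One field in that output I keep staring at is "Reset Method: Bridge reset". I've seen reports that some NVIDIA cards behave better under passthrough with a different reset method, which can be overridden per device in /etc/vmware/passthru.map on the host. A sketch (untested on the T4; 10de/1eb8 are the vendor/device IDs from the listing above, and d3d0 is just one of the methods to try - flr, d3d0, link, bridge):

# NVIDIA Tesla T4 - experiment with D3->D0 power-state reset
# vendor-id  device-id  resetMethod  fptShareable
10de          1eb8       d3d0         false

The host needs a reboot after editing the file, as far as I know.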


Here's the .vmx file for the VM I'm trying to boot:

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "20"
nvram = "oc.nvram"
svga.present = "TRUE"
vmci0.present = "TRUE"
hpet0.present = "TRUE"
floppy0.present = "FALSE"
numvcpus = "2"
memSize = "16384"
firmware = "efi"
powerType.powerOff = "default"
powerType.suspend = "default"
powerType.reset = "default"
tools.upgrade.policy = "manual"
sched.cpu.units = "mhz"
sched.cpu.affinity = "all"
sched.cpu.latencySensitivity = "normal"
vm.createDate = "1674612518956071"
scsi0.virtualDev = "pvscsi"
scsi0.present = "TRUE"
sata0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "oc.vmdk"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
sata0:0.deviceType = "cdrom-image"
sata0:0.fileName = "/vmfs/volumes/9d696458-538d8b1c/iso/ubuntu-22.04-live-server-amd64.iso"
sata0:0.present = "TRUE"
ethernet0.allowGuestConnectionControl = "FALSE"
ethernet0.virtualDev = "vmxnet3"
ethernet0.dvs.switchId = "50 11 bd bf 4b da 72 f0-66 52 ed d6 5f 9a a5 b8"
ethernet0.dvs.portId = "34"
ethernet0.dvs.portgroupId = "dvportgroup-2041"
ethernet0.dvs.connectionId = "1114659673"
ethernet0.shares = "normal"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:91:f3:77"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
displayName = "oc"
guestOS = "ubuntu-64"
chipset.motherboardLayout = "acpi"
toolScripts.afterPowerOn = "TRUE"
toolScripts.afterResume = "TRUE"
toolScripts.beforeSuspend = "TRUE"
toolScripts.beforePowerOff = "TRUE"
uuid.bios = "42 11 41 c2 e2 4f 33 f8-bb e2 cc ae ec de ef e4"
vc.uuid = "50 11 cd 21 85 bf 53 07-6b 03 95 46 2f 0d f0 99"
migrate.hostLog = "oc-22261365.hlog"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "16384"
sched.mem.minSize = "16384"
sched.mem.shares = "normal"
migrate.encryptionMode = "opportunistic"
ftcpt.ftEncryptionMode = "ftEncryptionOpportunistic"
scsi0:0.ctkEnabled = "TRUE"
ctkEnabled = "TRUE"
sched.mem.pin = "TRUE"
numa.autosize.cookie = "40012"
numa.autosize.vcpu.maxPerVirtualNode = "4"
cpuid.coresPerSocket.cookie = "4"
sched.swap.derivedName = "/vmfs/volumes/611ffeaf-b4d4b252-6f7b-ac1f6b7d80aa/oc/oc-1416d0e7.vswp"
pciBridge1.present = "TRUE"
pciBridge1.virtualDev = "pciRootBridge"
pciBridge1.functions = "1"
pciBridge1:0.pxm = "0"
pciBridge0.present = "TRUE"
pciBridge0.virtualDev = "pciRootBridge"
pciBridge0.functions = "1"
pciBridge0.pxm = "-1"
scsi0.pciSlotNumber = "32"
ethernet0.pciSlotNumber = "34"
sata0.pciSlotNumber = "35"
scsi0:0.redo = ""
scsi0.sasWWID = "50 05 05 62 e2 4f 33 f0"
vmotion.checkpointFBSize = "16777216"
vmotion.checkpointSVGAPrimarySize = "16777216"
vmotion.svga.mobMaxSize = "16777216"
vmotion.svga.graphicsMemoryKB = "16384"
vmci0.id = "-320933916"
monitor.phys_bits_used = "45"
cleanShutdown = "TRUE"
softPowerOff = "TRUE"
tools.syncTime = "FALSE"
guestInfo.detailed.data = "architecture='X86' bitness='64' distroName='Ubuntu 22.04 LTS' distroVersion='22.04' familyName='Linux' kernelVersion='5.15.0-58-generic' prettyName='Ubuntu 22.04
toolsInstallManager.updateCounter = "1"
extendedConfigFile = "oc.vmxf"
sata0:0.startConnected = "FALSE"
bios.bootDelay = "5000"
vmx.buildType = "debug"
svga.autodetect = "TRUE"
svga.guestBackedPrimaryAware = "TRUE"
uuid.location = "56 4d f0 8d e1 dc 65 db-8e 50 1a 54 63 4b f8 3e"
svga.vramSize = "16777216"
vvtd.enable = "TRUE"
viv.moid = "f0c3d812-d205-4ee9-a1c6-452994dc9e42:vm-48044:A4Ad6e0tdI/Qwq+qN/eDfKIP6+cMXGD5Y6L6z5MTXBk="
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32"
pciPassthru0.id = "00000:101:00.0"
pciPassthru0.deviceId = "0x1eb8"
pciPassthru0.vendorId = "0x10de"
pciPassthru0.systemId = "5c7944bd-360d-25c6-d570-ac1f6b7d80aa"
pciPassthru0.present = "TRUE"


Items like svga.vramSize, vmotion.*, and svga.present were added automatically by VMware. If I change from DirectPath I/O to Dynamic DirectPath I/O, the pciPassthru0 items become:

pciPassthru0.allowedDevices = "0x10de:0x1eb8"
pciPassthru0.present = "TRUE"
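
In case the addressing trips anyone else up: the .vmx pciPassthru0.id uses decimal bus numbers while esxcli prints hex, so "00000:101:00.0" in the .vmx and 0000:65:00.0 from esxcli are the same device. Quick check from any shell:

printf '%d\n' 0x65    # prints 101

So the VM is at least pointed at the right card.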


Thank you for any help on this matter! Would love to get these cards working over the Quadros.
 

iommu

New Member
Feb 9, 2023
FWIW, I can't get passthrough working with the same parameters on the following configuration:
  1. ESXi 8.0
  2. X570D4U-2L2T
  3. NVIDIA 1650 or 3090 (both tested fine in other systems)
  4. Ubuntu 22.04 w/ latest NVIDIA drivers installed
  5. BIOS:
    1. Onboard GPU as primary
    2. ReBAR disabled
    3. Above 4G enabled
    4. IOMMU enabled
    5. Auto on other relevant options, e.g. AER
When I boot the VM, it just stalls at the Ubuntu logo; I believe it is loading the NVIDIA kernel modules at that point.

Note that PCI passthrough works for another PCIe device containing SSDs, so this is definitely a GPU / Ubuntu / NVIDIA thing.

My next step is to try Windows 11 in a VM to see if that gives a different experience.
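
One other cheap test on my list before that: temporarily blacklist the NVIDIA modules in the guest to confirm the stall really is the driver load (standard Ubuntu tooling; the file name is just my own):

# /etc/modprobe.d/blacklist-nvidia-test.conf
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset

sudo update-initramfs -u

If the VM then boots cleanly, the hang is in driver init, and loading the module by hand (sudo modprobe nvidia) should surface the actual error in dmesg.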
 

Sogndal94

Senior IT Operations Engineer
Nov 7, 2016
Norway
Have you tried passing the card through to another (new) VM?
Also, what mode are the T4s in - WDDM or TCC?
I think the issue could be that the card is in TCC mode and you are trying to use it for display.
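
The WDDM/TCC distinction only exists in the Windows driver, so it won't apply directly to the Ubuntu guest above, but for anyone testing with a Windows VM the mode can be checked and switched with nvidia-smi (flags from memory, so verify against your driver version):

nvidia-smi --query-gpu=driver_model.current --format=csv
nvidia-smi -i 0 -dm 0

-dm 0 selects WDDM and -dm 1 selects TCC; a reboot is needed for the change to take effect.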