Issue - Booting with 32x GPUs / 128GB of video RAM stuck at initializing

RyC

Active Member
Oct 17, 2013
357
89
28
I ran into this issue trying to run just 2 GPUs on the Intel S2600CP. I still haven't found a solution, and Intel support has been no help: Intel S2600CP dual GPU passthrough issue

From what I've researched, the issue with enabling Above 4G Decoding is that ESXi is unable to use GPUs that are mapped into that >4G space, so vGPU (and passthrough) won't work even though the GPUs still show up. This SuperMicro FAQ says to enable it only if you're using more than a couple of K1/K2 cards: Super Micro Computer, Inc. - FAQ Entry (it's also written for XenServer, which I believe has fixed the issue of running with Above 4G Decoding and multiple GPUs).
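To see where a card's memory regions actually ended up, a Linux host exposes each PCI device's BARs in a sysfs `resource` file (one `start end flags` line per region, in hex). Below is a minimal sketch that splits the memory BARs into below-4G and above-4G totals. The `SAMPLE` text is made up for illustration, not taken from any machine in this thread, and this only helps on a Linux host, not inside ESXi.

```python
FOUR_GIB = 1 << 32
IORESOURCE_MEM = 0x200  # Linux resource flag bit for memory (not I/O port) regions

def parse_bars(resource_text):
    """Return (start, size) for every populated memory region in a sysfs `resource` file."""
    bars = []
    for line in resource_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(f, 16) for f in fields)
        if start == 0 and end == 0:
            continue  # unused BAR slot
        if not flags & IORESOURCE_MEM:
            continue  # skip I/O port regions
        bars.append((start, end - start + 1))
    return bars

def split_by_4g(bars):
    """Total bytes mapped below and above the 4 GiB boundary."""
    below = sum(size for start, size in bars if start < FOUR_GIB)
    above = sum(size for start, size in bars if start >= FOUR_GIB)
    return below, above

# Hypothetical contents of /sys/bus/pci/devices/<bus-id>/resource
SAMPLE = """\
0x00000000fa000000 0x00000000faffffff 0x0000000000040200
0x0000038000000000 0x000003800fffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000000e000 0x000000000000e07f 0x0000000000040101
"""

below, above = split_by_4g(parse_bars(SAMPLE))
print(f"below 4G: {below >> 20} MiB, above 4G: {above >> 20} MiB")
```

On a real host you would read the file for each GPU's bus ID (for example the IDs that `nvidia-smi` reports) instead of using `SAMPLE`.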

Otherwise, the solution I've read that may work is to change the MMCFG BASE, as someone stated before (an option that doesn't exist on Intel motherboards). This ESXi-specific SuperMicro FAQ says to leave Above 4G Decoding disabled and just change MMCFG BASE: Super Micro Computer, Inc. - FAQ Entry

VMware seems to be making progress towards being able to use >4G space, but I tried these steps and they had no effect (the VM error logs and Google both point to this KB): https://kb.vmware.com/selfservice/m...nguage=en_US&cmd=displayKC&externalId=2139299
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
@Patriot you were right

Swapped to 512/256 GB

Here is the Supermicro SuperBlade with 2x of these cards:

sthroot@dualgridm40:~$ sudo nvidia-smi | tee nvidia-smi.txt
Mon Apr 11 13:17:30 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID M40            Off  | 0000:05:00.0     Off |                  N/A |
| 38%   26C    P0    16W /  53W |      9MiB /  4092MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID M40            Off  | 0000:06:00.0     Off |                  N/A |
| 38%   27C    P0    16W /  53W |      9MiB /  4092MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID M40            Off  | 0000:07:00.0     Off |                  N/A |
| 38%   28C    P0    16W /  53W |      9MiB /  4092MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID M40            Off  | 0000:08:00.0     Off |                  N/A |
| 38%   28C    P0    16W /  53W |      9MiB /  4092MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GRID M40            Off  | 0000:83:00.0     Off |                  N/A |
| 38%   27C    P0    16W /  53W |      9MiB /  4092MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GRID M40            Off  | 0000:84:00.0     Off |                  N/A |
| 38%   27C    P0    16W /  53W |      9MiB /  4092MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GRID M40            Off  | 0000:85:00.0     Off |                  N/A |
| 38%   28C    P0    16W /  53W |      9MiB /  4092MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GRID M40            Off  | 0000:86:00.0     Off |                  N/A |
|  0%   28C    P0    15W /  53W |      9MiB /  4092MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
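When swapping cards in and out, it can help to check the detected GPU count and total memory from a saved `nvidia-smi.txt` rather than eyeballing the table. This is a rough sketch that scrapes the text format shown above with regular expressions. The function name `summarize` and the two-GPU `SAMPLE` are mine, for illustration only, and the patterns assume the output layout of this driver version.

```python
import re

# "9MiB / 4092MiB" style memory cells; whitespace-tolerant on purpose
MEM_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
# PCI bus IDs such as 0000:05:00.0
BUS_RE = re.compile(r"\b[0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]\b")

def summarize(smi_text):
    """Count GPUs by bus ID and add up the 'total' side of each memory cell."""
    bus_ids = BUS_RE.findall(smi_text)
    totals = [int(total) for _used, total in MEM_RE.findall(smi_text)]
    return {"gpus": len(bus_ids), "bus_ids": bus_ids, "total_mib": sum(totals)}

# Two rows in the same shape as the output above (illustrative subset)
SAMPLE = """\
| 0 GRID M40 Off | 0000:05:00.0 Off | N/A |
| 38% 26C P0 16W / 53W | 9MiB / 4092MiB | 0% Default |
| 1 GRID M40 Off | 0000:06:00.0 Off | N/A |
| 38% 27C P0 16W / 53W | 9MiB / 4092MiB | 0% Default |
"""

info = summarize(SAMPLE)
print(info["gpus"], "GPUs,", info["total_mib"], "MiB total")
```

Run against the full eight-GPU output above, the same arithmetic gives 8 x 4092 MiB = 32736 MiB for the two cards.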

The ASRock Rack unit is stuck on initializing with 4 cards. I then swapped in the other 4 and that did not help. Trying 2 now and it is still stuck at System Initializing...
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
OK, ASRock Rack update: pulled the battery on the unit. With 1 card installed it is installing Ubuntu. Going to try 2x cards next.
 

Rain

Active Member
May 13, 2013
244
88
28
Code:
ERROR: Insufficient PCI Resources Detected!!!
This may just be my favorite error message ever.
 

ideabox

Member
Dec 11, 2016
69
25
18
34
Dual E5-2683 V3's.

Are you still attempting this?

How many did you get??
Try disabling all the SATA and USB ports and any other devices eating into the PCI resources.

I read a while back that someone got 16 GPUs working, but they had to disable all the SATA and USB to have all 16 recognised.
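The idea behind disabling onboard devices can be sketched as simple budget arithmetic: the firmware has a fixed window to hand out, and every device that asks for a slice leaves less for the GPUs. The window size and per-device sizes below are hypothetical round numbers chosen only to show the effect. They are not measurements from any board in this thread, and the thread does not establish which resource actually runs out.

```python
MIB = 1 << 20

def fits(window_bytes, requests):
    """True if the summed requests fit in the window (alignment padding ignored)."""
    return sum(requests.values()) <= window_bytes

# All figures below are hypothetical, for illustration only.
WINDOW = 1024 * MIB                                   # assumed window available for PCI devices
gpus = {f"gpu{i}": 64 * MIB for i in range(16)}       # assumed 64 MiB per GPU
onboard = {"sata": 1 * MIB, "usb": 1 * MIB}           # assumed onboard controller requests

with_onboard = {**gpus, **onboard}
print("with SATA/USB enabled: ", fits(WINDOW, with_onboard))
print("with SATA/USB disabled:", fits(WINDOW, gpus))
```

With these made-up numbers the GPUs alone exactly fill the window, so even small onboard controllers tip it over, which is the kind of situation where disabling them would make the difference.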