Issue - Booting with 32x GPUs / 128GB of video RAM stuck at initializing

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
I just tried putting 32x GPUs into the ASRock Rack GPU compute system, and it is stuck at System Initializing...

Any ideas on how to resolve? I am thinking that maybe 32 is just too many.
 

pc-tecky

Active Member
May 1, 2013
202
25
28
Wow! You said 32 GPUs?!? All from one motherboard? Yikes! I'd love to play with such a system, but I don't envy the power bill :eek: at the end of the month. Joining the Bitcoin mining bandwagon? How is that even physically possible? o_O Is the rack the chassis? :p I'm not a gamer, but that should yield some serious killer frame rates.

@Patrick, consider removing just one card (is that 1, 2, or 4 GPUs per card?)?
 
Last edited:

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
8 cards with 4 GPUs per card. It is in the data center so power is not really an issue.
 
  • Like
Reactions: Chuntzu

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
I just tried putting 32x GPUs into the ASRock Rack GPU compute system, and it is stuck at System Initializing...

Any ideas on how to resolve? I am thinking that maybe 32 is just too many.
SO MUCH FUN can start with the phrase "...32x GPUs...". I'm on the edge of my seat waiting to hear more!

I've had a similar problem when adding tons of HBAs to a server. The answer for me was to disable the BIOS on all but one of the cards. If that's a workable option for the GPUs, give it a try.
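If it helps anyone trying this, here is a quick way to see which cards even expose an expansion (option) ROM before you start toggling them in the BIOS. This is only a sketch: the `lspci -v` output below is a made-up sample, and on a live system you would feed in real output (`lspci -v > lspci.txt`).

```python
# Sketch: find PCI devices that advertise an expansion (option) ROM in
# `lspci -v` output. The sample text here is hypothetical; on a real box,
# read the actual output of `lspci -v` instead.
sample = """\
03:00.0 Serial Attached SCSI controller: LSI SAS2308
\tExpansion ROM at df200000 [disabled] [size=1M]

04:00.0 3D controller: NVIDIA Corporation GK107GL
\tMemory at de000000 (32-bit, non-prefetchable) [size=16M]

05:00.0 Serial Attached SCSI controller: LSI SAS2308
\tExpansion ROM at df300000 [size=1M]
"""

def devices_with_rom(lspci_text):
    """Return the bus addresses of devices whose stanza mentions an expansion ROM."""
    found = []
    for stanza in lspci_text.strip().split("\n\n"):
        lines = stanza.splitlines()
        if any("Expansion ROM" in line for line in lines[1:]):
            found.append(lines[0].split()[0])  # e.g. "03:00.0"
    return found

print(devices_with_rom(sample))
```

Those are the candidates whose option ROMs you could disable, leaving just one enabled as described above.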
 
  • Like
Reactions: coolrunnings82

pc-tecky

Active Member
May 1, 2013
202
25
28
Is this related at all to this article: ASRock Rack 3U8G-C612 8-way GPU server? I just got to thinking about the airflow direction of all the fans. Does it matter? It seems to me that the chassis fans are counteracting and overpowering the GPUs' blower fans. Any reason why this wouldn't be the case? Blower fans on the GPU cards [I've worked with] almost always pull from the center and push out the back slot. But if the back slots of the GPU cards serve as the front intake, then the blower fans are effectively neutralized by the chassis fans. Remove the blower fans and save energy. I suppose that's why some of the NVIDIA Tesla compute module cards up for auction either [a)] had the blower fan and shell casing removed or [b)] never had an outer shell and rely solely on airflow from the chassis fans for cooling.
 

vrod

Active Member
Jan 18, 2015
233
33
28
I remember Linus from LTT had somewhat the same issue with his virtualized monster PC. I think he had to enable something like "PCI 64-bit decoding" or similar to get the system to POST. But that was "just" 7 GPUs. :)
 

RyC

Active Member
Oct 17, 2013
357
89
28
I remember Linus from LTT had somewhat the same issue with his virtualized monster PC. I think he had to enable something like "PCI 64-bit decoding" or similar to get the system to POST. But that was "just" 7 GPUs. :)
Do you have a link to where he said that? I'm running into similar issues trying to run several GPUs.
 

vrod

Active Member
Jan 18, 2015
233
33
28
I don't remember it being in any of those videos, but I haven't seen them anyway. :)


This is the one as far as I remember.
 

Patriot

Moderator
Apr 18, 2011
1,311
695
113
PCI 64-bit is already enabled, as is >4GB decoding and the correct MMIOH settings.
That was what stopped 2 cards from working: the system would POST, but the driver would call-trace when nvidia-smi was run.

Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 256G
Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 128G

There should be enough pcie map space between those for the cards... but maybe another device is getting in the way and using some of the BAR space? @Patrick might try increasing that MMIOH Base number.
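To put numbers on "enough PCIe map space", the check is just multiplication. The 256 MiB BAR size below is an assumption for illustration (real per-GPU sizes come out of `lspci -vv`), but it shows that a 128G window should be nowhere near full:

```python
# Back-of-envelope check: do all the GPU BARs fit in the 64-bit MMIO window?
# The per-GPU BAR size is an illustrative assumption, not measured from this
# system; read the real sizes from `lspci -vv`.
GIB = 1 << 30

num_gpus = 32
bar_per_gpu = 256 * 2**20       # assume a 256 MiB 64-bit prefetchable BAR per GPU
mmio_high_size = 128 * GIB      # BIOS "MMIO High Size = 128G"

needed = num_gpus * bar_per_gpu
print(f"need {needed / GIB:.0f} GiB of 64-bit MMIO; window is {mmio_high_size / GIB:.0f} GiB")
assert needed <= mmio_high_size  # plenty of headroom, so a failure points elsewhere
```

If that assertion holds with the real BAR sizes plugged in, the hang is more likely another device eating BAR space than the window itself being too small.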

Fun fact: I was having the same issue with my Tesla K40 in my SL270 Gen8; I thought the card was just dead, as it had been sitting around for years.
When I saw he had the same error, I helped track his down so that I could use the same fix on mine. I just had to enable PCI 64-bit support, but it is in a lovely hidden menu that you press Ctrl+A to unveil... I was not amused.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
I remember Linus from LTT had somewhat the same issue with his virtualized monster PC. I think he had to enable something like "PCI 64-bit decoding" or similar to get the system to POST. But that was "just" 7 GPUs. :)
We ran into it when he got the SM 8x GPU system; that is the same box @William tested.

With 8x AMD GPUs you actually run into a case where in Windows you have more display outputs than the system can address, even if they are not in use. William may remember the exact number.

These GRID GPUs do not have display outputs so they fix that problem (albeit with other oddities).
 
  • Like
Reactions: William

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
@Patriot - I will try that on the BAR. The issue is that these are in the data center in an ASRock Rack server that will not POST with the 32x GPUs installed.
 
  • Like
Reactions: balamit

William

Well-Known Member
May 7, 2015
785
250
63
Sadly, you might have to remove GPUs down to maybe 3 or so to get it to POST into the BIOS, then make the change to enable 4G decoding. Unless it was changed since, it should already be set, as I had to enable it to get the other GPUs to work.
 

superempie

Member
Sep 25, 2015
30
5
8
119
Assuming you use the 3U8G-C612 with K1s, it officially supports zero GRID cards. Only the Tesla M60 is supported, with a maximum of 4: Certified Servers | NVIDIA GRID
If your goal is to use vGPU, also check out the BIOS section in the vGPU Deployment Guide on the NVIDIA website.
I think William has a point: remove cards first, check the BIOS settings, and then add them back one by one. Hope this helps.
 

Patriot

Moderator
Apr 18, 2011
1,311
695
113
Official support is a mishmash... though the guide suggests that you don't need >4G decoding for the M40, as it doesn't have >4GB per GPU.
Nvidia vdi guide

It also shows "MMCFG base: Set this parameter to 0xC0000000."
That appears to be for 2011-era hardware; 2011-3 switches to showing those values in decimal.
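On the hex-versus-decimal confusion: they are the same number, just displayed differently. A quick conversion shows what the 0xC0000000 MMCFG base means in the decimal style the 2011-3 boards use:

```python
# 0xC0000000 in the 2011-era BIOS is just a hex address; newer 2011-3
# firmware shows the same value in decimal/GiB form.
mmcfg_base = 0xC0000000
print(mmcfg_base)               # 3221225472
print(mmcfg_base // (1 << 30))  # 3 -> the MMCFG window starts at 3 GiB
```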

Does anyone understand this stuff?
Reading here... I have not messed with those settings terribly much.
Linux/Kernel/PCIDynamicResourceAllocationManagement – TJworld
 
Last edited:

superempie

Member
Sep 25, 2015
30
5
8
119
There is an updated version of the document here http://images.nvidia.com/content/pd...Guide_Citrix_on_vSphere_TechPub_v01_final.pdf , but it isn't any different with regards to the BIOS settings.

"MMCFG base: Set this parameter to 0xC0000000." is in 2011 era hardware. My SM servers have it. Had to specifically look for that feature to get my GRID K2 working. Can't help out on newer hardware.

Edit: I remember an earlier failed attempt on a consumer-grade motherboard with 1 GRID K2: without these BIOS options, the system would start and the card was seen, but vGPU wouldn't work.
 
Last edited:
  • Like
Reactions: Patriot