Issue - Booting with 32x GPUs / 128GB of video RAM stuck at initializing

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
I just tried putting 32x GPUs into the ASRock Rack GPU compute system, and it is stuck at System Initializing...

Any ideas on how to resolve? I am thinking that maybe 32 is just too many.
 

pc-tecky

Active Member
May 1, 2013
202
25
28
Wow! You said 32 GPUs?!? All from one motherboard? Yikes! I'd love to play with such a system, but I don't envy the power bill :eek: at the end of the month. Joining the Bitcoin mining bandwagon? How is that even physically possible? o_O Is the rack the chassis? :p I'm not a gamer, but that should yield some serious killer frame rates.

@Patrick, consider removing just one card (is that 1, 2, or 4 GPUs per card?)?
 
Last edited:

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
8 cards with 4 GPUs per card. It is in the data center so power is not really an issue.
 
  • Like
Reactions: Chuntzu

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
I just tried putting 32x GPUs into the ASRock Rack GPU compute system, and it is stuck at System Initializing...

Any ideas on how to resolve? I am thinking that maybe 32 is just too many.
SO MUCH FUN can start with the phrase "...32x GPUs...". I'm on the edge of my seat waiting to hear more!

I've had a similar problem when adding tons of HBAs to a server. The answer for me was to disable the BIOS on all but one of the cards. If that's a workable option for the GPUs, give it a try.
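If it helps anyone trying this, here is a quick way to see which cards even expose an expansion (option) ROM before you start toggling them in the BIOS. This is only a sketch: the `lspci -v` output below is a made-up sample, and on a live system you would feed in real output (`lspci -v > lspci.txt`).

```python
# Sketch: find PCI devices that advertise an expansion (option) ROM in
# `lspci -v` output. The sample text here is hypothetical; on a real box,
# read the actual output of `lspci -v` instead.
sample = """\
03:00.0 Serial Attached SCSI controller: LSI SAS2308
\tExpansion ROM at df200000 [disabled] [size=1M]

04:00.0 3D controller: NVIDIA Corporation GK107GL
\tMemory at de000000 (32-bit, non-prefetchable) [size=16M]

05:00.0 Serial Attached SCSI controller: LSI SAS2308
\tExpansion ROM at df300000 [size=1M]
"""

def devices_with_rom(lspci_text):
    """Return the bus addresses of devices whose stanza mentions an expansion ROM."""
    found = []
    for stanza in lspci_text.strip().split("\n\n"):
        lines = stanza.splitlines()
        if any("Expansion ROM" in line for line in lines[1:]):
            found.append(lines[0].split()[0])  # e.g. "03:00.0"
    return found

print(devices_with_rom(sample))
```

Those are the candidates whose option ROMs you could disable, leaving just one enabled as described above.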
 
  • Like
Reactions: coolrunnings82

pc-tecky

Active Member
May 1, 2013
202
25
28
Is this related at all to this article: ASRock Rack 3U8G-C612 8-way GPU server? I just got to thinking about the airflow direction of all the fans. Does it matter? It seems to me that the chassis fans are counteracting and overpowering the GPUs' blower fans. Any reason why this wouldn't be the case? Blower fans on the GPU cards [I've worked with] almost always pull from the center and push out the back slot. But if the back slots of the GPU cards serve as the front intake, then the blower fans are effectively neutralized by the chassis fans. Remove the blower fans and save energy. I suppose that's why some of the NVIDIA Tesla compute module cards up for auction either [a)] had the blower fan and shell casing removed or [b)] never had an outer shell and rely solely on airflow from the chassis fans for cooling.
 

vrod

Active Member
Jan 18, 2015
233
33
28
I remember Linus from LTT had somewhat the same issue with his virtualized monster PC. I think he had to enable something like "PCI 64-bit decoding" or similar to get the system to POST. But that was "just" 7 GPUs. :)
 

RyC

Active Member
Oct 17, 2013
357
89
28
I remember Linus from LTT had somewhat the same issue with his virtualized monster PC. I think he had to enable something like "PCI 64-bit decoding" or similar to get the system to POST. But that was "just" 7 GPUs. :)
Do you have a link to where he said that? I'm running into similar issues trying to run several GPUs.
 

vrod

Active Member
Jan 18, 2015
233
33
28
I don't remember it being in any of those videos, but I haven't seen them anyway. :)


This is the one as far as I remember.
 

Patriot

Moderator
Apr 18, 2011
1,311
695
113
PCI 64-bit is already enabled, as is >4GB decoding and the correct MMIOH settings.
That was what stopped 2 cards from working: the system would POST, but the driver would call-trace when nvidia-smi was run.

Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 256G
Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 128G

There should be enough pcie map space between those for the cards... but maybe another device is getting in the way and using some of the BAR space? @Patrick might try increasing that MMIOH Base number.
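To put numbers on "enough PCIe map space", the check is just multiplication. The 256 MiB BAR size below is an assumption for illustration (real per-GPU sizes come out of `lspci -vv`), but it shows that a 128G window should be nowhere near full:

```python
# Back-of-envelope check: do all the GPU BARs fit in the 64-bit MMIO window?
# The per-GPU BAR size is an illustrative assumption, not measured from this
# system; read the real sizes from `lspci -vv`.
GIB = 1 << 30

num_gpus = 32
bar_per_gpu = 256 * 2**20       # assume a 256 MiB 64-bit prefetchable BAR per GPU
mmio_high_size = 128 * GIB      # BIOS "MMIO High Size = 128G"

needed = num_gpus * bar_per_gpu
print(f"need {needed / GIB:.0f} GiB of 64-bit MMIO; window is {mmio_high_size / GIB:.0f} GiB")
assert needed <= mmio_high_size  # plenty of headroom, so a failure points elsewhere
```

If that assertion holds with the real BAR sizes plugged in, the hang is more likely another device eating BAR space than the window itself being too small.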

Fun fact: I was having the same issue with my Tesla K40 in my SL270 Gen8; I thought the card was just dead, as it had been sitting around for years.
When I saw he had the same error, I helped track his down so that I could use the same fix on mine. I just had to enable PCI 64-bit support, but it is in a lovely hidden menu that you press Ctrl+A to unveil... I was not amused.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
I remember Linus from LTT had somewhat the same issue with his virtualized monster PC. I think he had to enable something like "PCI 64-bit decoding" or similar to get the system to POST. But that was "just" 7 GPUs. :)
We ran into it when he got the SM 8x GPU system; that is the same box @William tested.

With 8x AMD GPUs you actually run into a case where in Windows you have more display outputs than the system can address, even if they are not in use. William may remember the exact number.

These GRID GPUs do not have display outputs so they fix that problem (albeit with other oddities).
 
  • Like
Reactions: William

Patrick

Administrator
Staff member
Dec 21, 2010
11,964
4,921
113
@Patriot - I will try that on the BAR. The issue is that these are in the data center in an ASRock Rack server that will not POST with the 32x GPUs installed.
 
  • Like
Reactions: balamit

William

Well-Known Member
May 7, 2015
785
250
63
Sadly, you might have to remove GPUs down to maybe 3 or so to get it to POST into the BIOS, then make the change to enable 4G decoding. Unless it was changed since, it should already be set, as I had to enable it to get the other GPUs to work.
 

superempie

Member
Sep 25, 2015
30
5
8
119
Assuming you use the 3U8G-C612 with K1s, it officially supports zero GRID cards. Only the Tesla M60 is supported, with a maximum of 4: Certified Servers | NVIDIA GRID
If your goal is to use vGPU, also check out the BIOS section in the vGPU Deployment Guide on the NVIDIA website.
I think William has a point: remove cards first, check the BIOS settings, and then add them back one by one. Hope this helps.
 

Patriot

Moderator
Apr 18, 2011
1,311
695
113
Official support is a mishmash... though the guide suggests that you don't need >4G decoding for the M40, as it doesn't have >4GB per GPU.
Nvidia vdi guide

It also shows "MMCFG base: Set this parameter to 0xC0000000."
That appears to be for 2011-era hardware; 2011-3 switches to showing those values in decimal.
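On the hex-versus-decimal confusion: they are the same number, just displayed differently. A quick conversion shows what the 0xC0000000 MMCFG base means in the decimal style the 2011-3 boards use:

```python
# 0xC0000000 in the 2011-era BIOS is just a hex address; newer 2011-3
# firmware shows the same value in decimal/GiB form.
mmcfg_base = 0xC0000000
print(mmcfg_base)               # 3221225472
print(mmcfg_base // (1 << 30))  # 3 -> the MMCFG window starts at 3 GiB
```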

Does anyone understand this stuff?
Reading here... I have not messed with those settings terribly much.
Linux/Kernel/PCIDynamicResourceAllocationManagement – TJworld
 
Last edited:

superempie

Member
Sep 25, 2015
30
5
8
119
There is an updated version of the document here http://images.nvidia.com/content/pd...Guide_Citrix_on_vSphere_TechPub_v01_final.pdf , but it isn't any different with regards to the BIOS settings.

"MMCFG base: Set this parameter to 0xC0000000." is in 2011 era hardware. My SM servers have it. Had to specifically look for that feature to get my GRID K2 working. Can't help out on newer hardware.

Edit: I remember an earlier failed attempt on a consumer-grade motherboard with 1 GRID K2: without these BIOS options, the system would start and the card was seen, but vGPU wouldn't work.
 
Last edited:
  • Like
Reactions: Patriot