Issue - Booting with 32x GPUs / 128GB of video RAM stuck at initializing

Discussion in 'Processors and Motherboards' started by Patrick, Apr 10, 2016.

  1. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,718
    Likes Received:
    4,650
    I just tried putting 32x GPUs into the ASRock Rack GPU compute system, and it is stuck at System Initializing...

    Any ideas on how to resolve? I am thinking that maybe 32 is just too many.
     
    #1
    Matt84, Rain and dba like this.
  2. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,938
    Likes Received:
    1,538
    :eek: 32 GPU !! Wow :D :D
     
    #2
  3. PigLover

    PigLover Moderator

    Joined:
    Jan 26, 2011
    Messages:
    2,824
    Likes Received:
    1,153
    Show off...;)

    Sent from my SM-G925V using Tapatalk
     
    #3
    Chuntzu likes this.
  4. pc-tecky

    pc-tecky Member

    Joined:
    May 1, 2013
    Messages:
    200
    Likes Received:
    24
    Wow! You said 32 GPUs?!? All from one motherboard? Yikes! I'd love to play with such a system, but I don't envy the power bill :eek: at the end of the month. Joining the Bitcoin mining bandwagon? How is that even physically possible? o_O The rack is the chassis? :p I'm not a gamer, but that should yield some seriously killer frame rates.

    @Patrick, consider removing just one card (is that 1, 2, or 4 GPUs per card?).
     
    #4
    Last edited: Apr 10, 2016
  5. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,718
    Likes Received:
    4,650
    8 cards with 4 GPUs per card. It is in the data center so power is not really an issue.
     
    #5
    Chuntzu likes this.
  6. dba

    dba Moderator

    Joined:
    Feb 20, 2012
    Messages:
    1,478
    Likes Received:
    181
    SO MUCH FUN can start with the phrase "...32x GPUs...". I'm on the edge of my seat waiting to hear more!

    I've had a similar problem when adding tons of HBAs to a server. The answer for me was to disable the BIOS on all but one of the cards. If that's a workable option for the GPUs, give it a try.
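
    If it helps to see which cards actually carry an option ROM before touching anything, here is a minimal sketch, assuming a Linux host that boots with a reduced card count; it only reads the standard sysfs attributes:

        import os

        PCI_ROOT = "/sys/bus/pci/devices"

        for dev in sorted(os.listdir(PCI_ROOT)):
            dev_path = os.path.join(PCI_ROOT, dev)
            # The 'rom' attribute only exists when the device advertises an expansion ROM BAR.
            has_rom = os.path.exists(os.path.join(dev_path, "rom"))
            with open(os.path.join(dev_path, "class")) as f:
                dev_class = f.read().strip()
            # Class 0x03xxxx covers display controllers (VGA, 3D controller, etc.).
            is_gpu = dev_class.startswith("0x03")
            if has_rom or is_gpu:
                print(f"{dev}  class={dev_class}  option ROM: {'yes' if has_rom else 'no'}")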
     
    #6
    coolrunnings82 likes this.
  7. pc-tecky

    pc-tecky Member

    Joined:
    May 1, 2013
    Messages:
    200
    Likes Received:
    24
    Is this related to the article ASRock Rack 3U8G-C612 8-way GPU server? I just got to thinking about the airflow direction of all the fans. Does it matter? Seems to me the chassis fans are counteracting and overpowering the GPUs' blower fans. Any reason why this wouldn't be the case?

    Blower fans on the GPU cards I've worked with almost always pull air in at the center and push it out the back slot. But if the back slots of the GPU cards serve as the front intake, then the blower fans are effectively being neutralized by the chassis fans. Remove the blower fans and save energy. I suppose that's why some of the NVIDIA Tesla compute cards up for auction either (a) had the blower fan and shell casing removed or (b) never had an outer shell at all and rely solely on airflow from the chassis fans for cooling.
     
    #7
  8. Keljian

    Keljian Active Member

    Joined:
    Sep 9, 2015
    Messages:
    429
    Likes Received:
    71
    First, check all of the 8-pin connectors on the boards. Often a bad connection can cause a boot failure.
     
    #8
  9. vrod

    vrod Active Member

    Joined:
    Jan 18, 2015
    Messages:
    233
    Likes Received:
    33
    I remember Linus from LTT had somewhat the same issue with his virtualized monster PC. I think he had to enable something like "PCI 64-bit decoding" or similar to get the system to POST. But that was "just" 7 GPUs. :)
     
    #9
  10. RyC

    RyC Active Member

    Joined:
    Oct 17, 2013
    Messages:
    355
    Likes Received:
    83
    Do you have a link to where he said that? I'm running into similar issues trying to run several GPUs.
     
    #10
  11. Mkvarner

    Mkvarner Member

    Joined:
    Jan 3, 2015
    Messages:
    58
    Likes Received:
    14
    It's one of these videos. Don't have time to watch so can't say which one.



     
    #11
  12. vrod

    vrod Active Member

    Joined:
    Jan 18, 2015
    Messages:
    233
    Likes Received:
    33
    I don't remember it being in any of those videos, but I haven't seen them anyway. :)



    This is the one as far as I remember.
     
    #12
  13. Patriot

    Patriot Moderator

    Joined:
    Apr 18, 2011
    Messages:
    1,293
    Likes Received:
    674
    PCI 64-bit is already enabled, as is >4G decoding and the correct MMIOH settings.
    That was what had been stopping 2 cards from working: the system would POST, but the driver would call trace when nvidia-smi was run.

    Advanced -> PCIe/PCI/PnP Configuration -> Above 4G Decoding = Enabled
    Advanced -> PCIe/PCI/PnP Configuration -> MMIOH Base = 256G
    Advanced -> PCIe/PCI/PnP Configuration -> MMIO High Size = 128G

    There should be enough PCIe map space between those for the cards... but maybe another device is getting in the way and using some of the BAR space? @Patrick might try increasing that MMIOH Base number.
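
    For what it's worth, here is a rough way to check that from the OS once a reduced card count boots. A minimal sketch, assuming a Linux host; the "PCI Bus" region naming in /proc/iomem can vary by platform, and it needs root so the addresses are not masked:

        import re

        FOUR_GB = 1 << 32
        total_high = 0

        with open("/proc/iomem") as f:  # run as root so addresses are not shown as zeros
            for line in f:
                # Top-level windows are not indented; nested BARs are, so they won't match here.
                m = re.match(r"([0-9a-f]+)-([0-9a-f]+) : (.+)", line)
                if not m:
                    continue
                start = int(m.group(1), 16)
                end = int(m.group(2), 16)
                name = m.group(3).strip()
                # Host-bridge windows usually show up as "PCI Bus 0000:xx" (assumed naming).
                if start >= FOUR_GB and name.startswith("PCI Bus"):
                    size = end - start + 1
                    total_high += size
                    print(f"{name}: {size / (1 << 30):.1f} GiB above 4G")

        print(f"Total high MMIO claimed by PCI windows: {total_high / (1 << 30):.1f} GiB")

    Each GPU's 64-bit BARs land inside those windows, so you can extrapolate from a few cards to 32 GPUs and compare against the 128G MMIO High Size.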

    Fun fact: I was having the same issue with my Tesla K40 in my SL270 Gen8. I thought the card was just dead, as it had been sitting around for years.
    When I saw he had the same error, I helped track his down so that I could use the same fix on mine. I just had to enable PCI 64-bit support, but it is in a lovely hidden menu that you press Ctrl+A to unveil... I was not amused.
     
    #13
  14. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,718
    Likes Received:
    4,650
    We ran into it when he got the SM 8x GPU box; that is the same box @William tested.

    With 8x AMD GPUs you actually run into a case where in Windows you have more display outputs than the system can address, even if they are not in use. William may remember the exact number.

    These GRID GPUs do not have display outputs so they fix that problem (albeit with other oddities).
     
    #14
    William likes this.
  15. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,718
    Likes Received:
    4,650
    @Patriot - I will try that on the BAR. The issue is that these are in the data center in an ASRock Rack server that will not POST with the 32x GPUs installed.
     
    #15
    balamit likes this.
  16. William

    William Active Member

    Joined:
    May 7, 2015
    Messages:
    768
    Likes Received:
    240
    Sadly you might have to remove GPUs down to maybe 3 or so to get it to POST into the BIOS, then make the change to enable 4G decoding. Unless it was changed since, it should already be set, as I had to enable it to get the other GPUs to work.
     
    #16
  17. superempie

    superempie New Member

    Joined:
    Sep 25, 2015
    Messages:
    28
    Likes Received:
    5
    Assuming you use the 3U8G-C612 with K1s, it officially supports 0 GRID cards. Only the Tesla M60 is supported, with a maximum of 4: Certified Servers | NVIDIA GRID
    If your goal is to use vGPU, also check out the BIOS section in the vGPU Deployment Guide on the NVIDIA website.
    I think William has a point about removing cards first, checking the BIOS settings, and then adding them back one by one. Hope this helps.
     
    #17
  18. Patriot

    Patriot Moderator

    Joined:
    Apr 18, 2011
    Messages:
    1,293
    Likes Received:
    674
    Official support is a mishmash... Though the guide suggests that you don't need >4G decoding for the M40, as it doesn't have >4GB per GPU.
    Nvidia vdi guide

    It also shows "MMCFG base: Set this parameter to 0xC0000000."
    That appears to be for 2011-era hardware; 2011-3 switches to showing those values in decimal (quick conversion at the end of this post).

    Does anyone understand this stuff?
    Reading here... I have not messed with those settings terribly much.
    Linux/Kernel/PCIDynamicResourceAllocationManagement – TJworld
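
    For reference, the hex and decimal forms are just two presentations of the same numbers; a quick conversion (plain arithmetic, not tied to any particular board):

        # Line up the hex and decimal presentations of the BIOS values mentioned above.
        mmcfg_base = 0xC0000000            # how the 2011-era firmware shows it (hex)
        print(mmcfg_base / (1 << 30))      # -> 3.0, i.e. the MMCFG window starts at 3 GiB

        mmioh_base = 256 * (1 << 30)       # "MMIOH Base = 256G" as 2011-3 firmware shows it (decimal)
        print(hex(mmioh_base))             # -> 0x4000000000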
     
    #18
    Last edited: Apr 11, 2016
  19. superempie

    superempie New Member

    Joined:
    Sep 25, 2015
    Messages:
    28
    Likes Received:
    5
    There is an updated version of the document here http://images.nvidia.com/content/pd...Guide_Citrix_on_vSphere_TechPub_v01_final.pdf , but it isn't any different with regards to the BIOS settings.

    "MMCFG base: Set this parameter to 0xC0000000." is for 2011-era hardware. My SM servers have it. I had to specifically look for that setting to get my GRID K2 working. Can't help out on newer hardware.

    Edit: I remember an earlier failed attempt on a consumer-grade motherboard with 1 GRID K2: without these BIOS options the system would start and the card was seen, but vGPU wouldn't work.
     
    #19
    Last edited: Apr 11, 2016
    Patriot likes this.
  20. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,718
    Likes Received:
    4,650
    Wow! [attached screenshot: upload_2016-4-11_12-51-40.png]
     
    #20
    Rain, Chuntzu and William like this.