Microsoft HGX-1 at the AI Hardware Summit

Discussion in 'STH Main Site Posts' started by Patrick Kennedy, Sep 19, 2018.

  1. #1
  2. Mam89

    Mam89 Member

    Joined:
    Jan 14, 2016
    Messages:
    52
    Likes Received:
    7
    Is there any redundancy in this system, or is it a fill-the-rack-and-hope-nothing-fails system?

    I'm curious how the PCIe fabric between chassis fares vs. the internal NVLink of the GPUs. Wouldn't it be slower moving data between the nodes, or does it even matter?

    Did Microsoft say what workflows, specifically, this was targeted at?

    Overall, I'm really confused about how this works.
     
    #2
  3. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,496
    Likes Received:
    4,440
    Great questions. There are 6 power supplies IIRC for power redundancy. Fans are also paired up for redundancy. If the GPUs die, they die. Fair point.

    NVLink is usually faster, but it turns out a lot of the deep learning guys are pushing data back to the CPUs because it is easier. So PCIe is a known quantity while NVLink is a bit more exotic. You are right that the DGX-2 / HGX-2 would be faster with the 3kW+ NVLink switching fabric.
     
    #3
  4. Mam89

    Mam89 Member

    Thanks for the reply Patrick. I think I just needed a sanity check for why this even exists.

    I can see a software-driven, SAN-like GPGPU system with a separate fabric, redundant heads, switches, etc. being pretty cool.

    In fact, something like that would be killer for all kinds of workloads, done right.

    But this just looks like a bomb taking up rackspace. Unless I'm seriously missing something...
     
    #4