Microsoft HGX-1 at the AI Hardware Summit


Mam89

Member
Jan 14, 2016
SoCal
Is there any redundancy in this system, or is it a fill-the-rack-and-hope-nothing-fails system?

I'm curious how the PCIe fabric between chassis fares vs. the internal NVLink of the GPUs. Wouldn't it be slower moving data between the nodes, or does it even matter?

Did Microsoft say what workflows, specifically, this was targeted at?

Overall I'm really confused about how this works.
 

Patrick

Administrator
Staff member
Dec 21, 2010
Great questions. There are six power supplies, IIRC, for power redundancy. Fans are also paired up for redundancy. If the GPUs die, they die. Fair point.

NVLink is usually faster, but it turns out a lot of the deep learning guys are pushing data back to the CPUs because it is easier. PCIe is a known quantity, while NVLink is a bit more exotic. You are right that the DGX-2 / HGX-2 would be faster with its 3 kW+ NVLink switching fabric.
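To make the PCIe vs. NVLink point concrete, here is a minimal sketch of the kind of comparison involved (not anything from Microsoft or NVIDIA; it assumes a box with at least two CUDA GPUs, PyTorch installed, and an arbitrary ~256 MB tensor). It times a direct GPU-to-GPU copy, which rides NVLink peer-to-peer when the GPUs are linked, against staging the same data through pinned host memory, i.e. bouncing off the CPU over PCIe:

Code:
import time
import torch

# Sketch only: compares a direct device-to-device copy (NVLink P2P when the
# GPUs share a link, otherwise over PCIe) against staging through pinned host
# memory, which is the "push it back to the CPU" path discussed above.

def time_copy(fn, iters=20):
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    return (time.perf_counter() - start) / iters

src = torch.randn(64 * 1024 * 1024, device="cuda:0")  # 64M fp32 values, ~256 MB

# Direct GPU0 -> GPU1 copy.
direct_s = time_copy(lambda: src.to("cuda:1", non_blocking=True))

# Staged GPU0 -> pinned host memory -> GPU1 copy.
host = torch.empty_like(src, device="cpu").pin_memory()

def staged():
    host.copy_(src)                        # device-to-host over PCIe
    host.to("cuda:1", non_blocking=True)   # host-to-device over PCIe

staged_s = time_copy(staged)

gb = src.numel() * 4 / 1e9
print(f"direct copy : {gb / direct_s:.1f} GB/s")
print(f"via host    : {gb / staged_s:.1f} GB/s")

On an NVLink-linked pair the direct number should come out well ahead; across a plain PCIe fabric between chassis the two paths converge, which is roughly the trade-off described above.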
 

Mam89

Member
Jan 14, 2016
SoCal
Thanks for the reply, Patrick. I think I just needed a sanity check on why this even exists.

I can see a software-driven, SAN-like GPGPU system with a separate fabric, redundant heads, switches, etc. being pretty cool.

In fact, something like that would be killer for all kinds of workloads, done right.

But this just looks like a bomb taking up rack space. Unless I'm seriously missing something...