Questions about AMD EPYC NUMA & Rome


AmusedGoose

New Member
Mar 16, 2019
I would like to build an AMD-based GPU server for deep learning with 8x GTX/RTX cards, but it seems the server I want doesn't exist.
If I understand correctly, for deep learning I would want all my GPUs and my 50G NIC on a single NUMA node, as on the ASUS ESC8000 G4. The AMD servers I can find spread the PCIe lanes across the NUMA nodes (e.g., the EPYCD8-2T motherboard and the G291-Z20 server).

Will the Rome architecture solve this problem? That is, will all cores be able to talk to all PCIe devices, and PCIe devices to each other, at low latency and high bandwidth?
Is there really no AMD server that runs 8 GPUs on a single root, for example by using 2x PEX8796 switches?

Other questions:
Does the NIC have to be connected to the same NUMA node as the GPUs (when running in a cluster)?
How well do GTX cards scale across multiple machines for machine learning (e.g., 4x 8-GPU servers)?
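For context, here's how I've been checking which NUMA node each PCIe device lands on (a quick Linux-only sketch reading sysfs; nvidia-smi topo -m prints a similar matrix for GPUs and NICs):

    # Map GPUs and NICs to NUMA nodes via sysfs (Linux).
    # PCI class prefixes: 0x0300 = VGA controller (GTX/RTX cards),
    # 0x0302 = 3D controller (compute cards), 0x0200 = Ethernet NIC.
    import glob, os

    LABELS = {"0x0300": "GPU (VGA)", "0x0302": "GPU (3D)", "0x0200": "NIC"}

    for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
        cls = open(os.path.join(dev, "class")).read().strip()[:6]
        if cls in LABELS:
            node = open(os.path.join(dev, "numa_node")).read().strip()
            print(os.path.basename(dev), LABELS[cls], "NUMA node", node)

A device with no reported affinity shows numa_node -1.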
 

AmusedGoose

New Member
Mar 16, 2019
Thanks. Switches would still be needed to serve 8 GPUs, since some PCIe lanes are always needed for other purposes.

It looks like the G291-Z20 would be useful in cluster setups, though it's mainly suited for passively cooled GPUs. I hope to come across a server like the ESC8000 built for AMD; with Rome it would be a monster, if the latency is good.
 

Frank173

Member
Feb 14, 2018
I have the ESC8000 and it's by far the best GPU server I have ever had, seen, or heard of. You can actually change the PCIe topology in the BIOS and via IPMI, which I have not seen on any other server.

AMD-based boards so far are great for general server needs and storage, but they are horrific platforms for GPU compute servers or any high-performance task. I will never touch AMD again; my string of bad experiences is just too long to give them another try. The math and vector units in their Threadripper, and even in two different EPYC CPUs I owned, hugely underperformed Intel's AVX-512 at identical clock speeds. Also, look at how most AMD boards cannot even get more than four PCIe x16 slots out of a 128-lane CPU. I find the entire AMD story overhyped, while Intel is definitely on the upswing again after several years of complacency.
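To put rough numbers on the vector-unit gap (a back-of-envelope sketch from published peak figures; it ignores AVX-512 clock offsets and memory bottlenecks, which narrow the gap in practice):

    # Theoretical peak double-precision FLOPs per cycle per core.
    zen1_dp = 2 * 2 * 2    # Zen 1: 2 x 128-bit FMA pipes x 2 doubles x 2 ops (FMA)
    skl_dp  = 2 * 8 * 2    # Skylake-SP: 2 AVX-512 FMA units x 8 doubles x 2 ops
    print(zen1_dp, skl_dp) # 8 vs 32 -> ~4x per clock in ideal vectorized code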

Do yourself a favor and pay a little more for the superior quality and design of Intel's Scalable series, especially when most of your money is going to GPUs anyway. You may also want your GPU server to be equipped with more powerful CPUs for CPU-bound workloads, and here again AMD falls way short in the server CPU market.

I highly recommend not starting with lesser issues such as NUMA nodes. First consider what problems those GPU compute servers are meant to solve. In almost all cases (unless you are dealing with GPU server clusters), inter-GPU I/O is completely irrelevant to the training performance and training time of your AI models. In most cases the NUMA nodes are irrelevant as well. Start by posing the problem and asking whether you can even parallelize your GPU-bound work across multiple GPUs. I benefit most from running multiple training sessions at the same time, each session on a single GPU but with a different parameterization. In that context, NUMA nodes and which root each GPU sits on are completely irrelevant.
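A minimal sketch of what I mean, assuming a hypothetical train.py that takes its hyperparameters on the command line (both the script name and the --lr flag are placeholders):

    # One independent training process per GPU, each with its own parameters.
    # CUDA_VISIBLE_DEVICES restricts each process to a single GPU.
    import os, subprocess

    learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]  # one parameterization per GPU

    procs = []
    for gpu, lr in enumerate(learning_rates):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(["python", "train.py", "--lr", str(lr)], env=env))

    for p in procs:
        p.wait()  # block until every session finishes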




AmusedGoose

New Member
Mar 16, 2019
I've already got an EPYCD8-2T with a 7401P on the way, so I'll be able to test your statements and compare performance against Xeon systems.
That server is designed for four separate VMs with one GPU each, so there will be no cross-GPU communication anyway (but I'll test it nevertheless).
AVX-512 is not something I'm particularly looking for; the benchmarks that matter to me are nearly all in favor of AMD.

It would mean a good deal of savings to base future cluster systems on AMD, so I hope I can prove you wrong when Rome comes out.
 

Frank173

Member
Feb 14, 2018
Confused... you asked about Rome but ordered non-Rome. You asked and worried about NUMA nodes, but you already know that your specific workload won't be bottlenecked by PCIe lanes on different nodes. Maybe I missed it, but what exactly were you trying to ask?

Would you mind sharing exactly which benchmarks are in favor of EPYC over any comparable Intel Scalable CPU, other than price? Because for HPC I found none. The Gold 6146, for example, runs circles around most EPYCs. And it's not just AVX-512 that shines in Intel Scalable; the overall floating-point performance of the EPYC CPUs I tested leaves a lot to be desired.

But sure, if your focus is entirely on GPU-bound workloads that are independent of each other and you want to go as cheap as possible, then an EPYC-based single-CPU board might work. But as you said, AMD does not (yet) have the smarts to turn that into a GPU server solution with vendors, with a matching PCIe extension board housing the switches and lanes. Perhaps some will come with the new Rome CPUs. I myself have decided not to hold my breath for AMD; however, I hope your build works out for you. Best wishes.


AmusedGoose

New Member
Mar 16, 2019
Confused... you asked about Rome but ordered non-Rome. You asked and worried about NUMA nodes, but you already know that your specific workload won't be bottlenecked by PCIe lanes on different nodes. Maybe I missed it, but what exactly were you trying to ask?
For the moment I'm building a machine that won't serve multi-GPU training purposes, so it's not related to my questions, but it will be able to answer just how bad it is to not stay within a single NUMA node, and how single-GPU setups compare to Intel.
I indeed did not order Rome, the reason being that it is not for sale yet.

Would you mind sharing exactly which benchmarks are in favor of EPYC over any comparable Intel Scalable CPU, other than price? Because for HPC I found none. The Gold 6146, for example, runs circles around most EPYCs. And it's not just AVX-512 that shines in Intel Scalable; the overall floating-point performance of the EPYC CPUs I tested leaves a lot to be desired.
We might be looking at different measurements then, as price/performance is what counts in my book. These CPUs won't do much more than pre-chew data and feed the GPUs; by far most of their time will be spent keeping the GPUs as busy as possible, so as little money as possible will be spent on them. I think you won't disagree that EPYC offers competitive performance per dollar.
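To be concrete about "pre-chew data and feed the GPUs", this is roughly the shape of the CPU-side work (a hedged PyTorch sketch on synthetic data; the worker count and batch size are arbitrary):

    # CPU workers pre-process batches in parallel; pinned memory plus
    # non_blocking copies overlap host-to-device transfers with GPU compute.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(2048, 3, 64, 64),     # stand-in images
                            torch.randint(0, 1000, (2048,)))  # stand-in labels
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=8,     # CPU-side loading/augmentation
                        pin_memory=True)   # page-locked staging buffers

    device = torch.device("cuda")
    for images, labels in loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        # ... forward/backward pass here ...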

This thread is not meant to be the millionth AMD vs. Intel discussion; it's about how Rome will perform for GPU clusters, so I'd rather not descend further down that road.
 

Frank173

Member
Feb 14, 2018
How Rome will perform is way too early to tell, IMO. But again, for pure GPU compute requirements the switch topology matters more than the CPU topology; any setup where GPUs sit behind switches, so inter-GPU I/O does not go through the CPU, is a good thing. So I am still confused: why the question about Rome CPUs when you need switches anyway for that many GPUs? Anyway, I have shared my thoughts. Hardly ever does a GPU compute server perform only pure GPU tasks; quite some of the workload will be CPU-bound as well. You talk about price/performance, but then seem to dislike the idea of expanding a GPU compute server into an HPC server that serves GPU workloads as well as CPU-bound ones. Unless you have no CPU-bound workloads at all, even on other machines, a price/performance comparison should in my opinion always include the option of combining GPU- and CPU-bound workloads in a single machine. Unless, of course, you are running very specific low-latency workloads, which does not seem to be the case.

Re: your question on whether the NIC should be on the same node as the GPUs, it again depends on what exactly you want to do. And who is going to answer your question about Rome in this context? There are simply no performance metrics out yet, only marketing hype. To give an example: for any AI training that does not distribute batches across a cluster, it is irrelevant whether your NIC is on the same node as the GPUs. I can't stress enough how important it is to start by stating and understanding the problem, not the other way around with hardware first.


zir_blazer

Active Member
Dec 5, 2016
So I am still confused: why the question about Rome CPUs when you need switches anyway for that many GPUs?
Because the whole point is that Rome may provide an extremely simplified topology where the I/O chiplet has a stupidly wide 128-lane PCIe controller. On a proper motherboard it would kill any need for PCIe switches, since Rome itself should be able to drive seven x16 PCIe slots from the same I/O die. For reference, the PEX8796 has 96 lanes; minus 16 for the uplink, the remaining 80 can feed five x16 slots.
Current EPYC has four Zen dies with 32 PCIe lanes each, so you can have at most two x16 video cards in the same NUMA node, and video cards attached elsewhere mean extra latency. By unifying all the I/O in a single die, Rome should make both memory and PCIe latency uniform.
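The lane budget is simple arithmetic; just to make the figures above explicit:

    # How many x16 slots a given lane budget can feed.
    def x16_slots(total_lanes, uplink_lanes=0):
        usable = total_lanes - uplink_lanes
        return usable // 16, usable % 16   # (full x16 slots, leftover lanes)

    print(x16_slots(96, uplink_lanes=16))  # PEX8796: (5, 0) -> five x16 slots
    print(x16_slots(128))                  # Rome: (8, 0) -> eight x16 slots,
                                           # or seven x16 plus 16 lanes for I/O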
 

Frank173

Member
Feb 14, 2018
That's an admirable dream, but you are forgetting the lanes needed for NVMe drives and other peripherals. You can't drive 8 GPUs with 128 lanes. On the other hand, you can couple multiple switches with minimal performance impact. For a GPU compute server you won't get around switches for the time being. I rather believe that Mellanox or the like will come up with an ASIC that any number of PCIe lanes can hook up to, where the switch topology is configurable via virtual switches on the software side. Now I am dreaming...

 

zir_blazer

Active Member
Dec 5, 2016
That's an admirable dream, but you are forgetting the lanes needed for NVMe drives and other peripherals. You can't drive 8 GPUs with 128 lanes.
Which is why I specifically said SEVEN slots. You still have 16 lanes free for basic I/O; you can get a few NICs, some SATA ports, and an NVMe drive or two from there.

On the other hand, you can couple multiple switches with minimal performance impact. For a GPU compute server you won't get around switches for the time being.
Only if you're going above 7 GPUs. Rome is actually better than a single five-slot PEX8796 or PEX9797, which is what GPU-heavy platforms typically use. How Rome handles this in practice is the bigger question, but on paper there is no reason it should not scale perfectly to 7 GPUs. Sure, if you want 8 or more, you can't avoid PCIe switches.

Also, I would be surprised if there were no use cases where you can saturate the x16 uplink between the PCIe switch and the processor's integrated PCIe controller. Rome will absolutely kick ass in such a scenario, as the closest current thing to it is a 48-lane Skylake-X with 3 GPUs.
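For scale, PCIe 3.0 per-direction bandwidth works out as follows (ignoring protocol overhead beyond the 128b/130b line encoding):

    # Usable PCIe 3.0 bandwidth of an x16 link, per direction.
    lanes, gt_per_s, encoding = 16, 8, 128 / 130
    gb_per_s = gt_per_s * lanes * encoding / 8   # bits -> bytes
    print(f"x16 uplink: {gb_per_s:.2f} GB/s per direction")   # ~15.75 GB/s

    # Five GPUs behind one PEX8796 share that one uplink to the host,
    # so each sees ~3.15 GB/s of host bandwidth under full contention.
    print(f"per GPU behind one switch: {gb_per_s / 5:.2f} GB/s")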
 

Frank173

Member
Feb 14, 2018
I agree with everything you said, IF AMD delivers what you hope for. But GPU compute servers generally house 8 or more GPUs; a few house less, but I don't consider those serious GPU compute servers. Rome may be interesting in many regards, though admittedly I have lost hope in AMD for anything concerning low-latency or HPC needs. They have disappointed time and again: their entire GPU compute lineup is laughable, to this day they have not managed to ship a library that interfaces with all the major AI toolsets, all their CPUs have serious HPC flaws, and the CPU package interconnects disqualify them for anything latency-sensitive. I hope they get their act together at some point, but I am not going to hold my breath. I am not an Intel fanboy, but I trust their technology a million times more.
