I built a system based on an EPYC 7441P with 128 GB of DDR4-2666 RAM. The motherboard is an ASRock EPYCD8-2T, equipped with 4x 1080 Ti.
The PCIe extenders are 3M 50 cm models, which have given zero issues so far.
Benchmarks so far show 85-90% scaling efficiency for multi-GPU training (ResNet-50, ResNet-152, Inception-v4), even though traffic has to pass through the CPU. Testing was done in a virtual machine allocated 40 vCPUs (10 per NUMA node) and 120 GB of RAM, using TensorFlow 1.13 with a parameter server. Not perfect, but not bad either.
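For reference, the scaling-efficiency figure above is just multi-GPU throughput divided by the ideal linear throughput. A minimal sketch, with hypothetical images/sec numbers that are not my actual measurements:

```python
# Illustrative only: the per-GPU throughput numbers below are hypothetical
# placeholders, not the benchmark results from the post.

def scaling_efficiency(single_gpu_ips, multi_gpu_ips, n_gpus):
    """Multi-GPU throughput relative to ideal linear scaling."""
    return multi_gpu_ips / (single_gpu_ips * n_gpus)

# e.g. one 1080 Ti at 200 images/sec, four GPUs together at 700 images/sec
eff = scaling_efficiency(200.0, 700.0, 4)
print(f"{eff:.0%}")  # → 88%
```

Anything in the 85-90% range means you lose roughly half a GPU's worth of throughput out of four to communication overhead.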
As far as I can tell, crossing NUMA nodes has little influence, which was my main concern with AMD's EPYC CPUs. This machine will usually serve four virtual machines with one GPU each, however, so the point is moot in this case anyway.
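If you do want to rule out cross-node traffic when testing, you can pin a training process to one NUMA node's CPUs. A minimal Linux sketch using only the standard library; the assumption that the first 10 OS CPU ids belong to node 0 is hypothetical, so check the real mapping with `numactl --hardware` first:

```python
import os

def pin_to_cpus(cpu_ids):
    """Restrict the current process to the given CPU set (Linux only)."""
    os.sched_setaffinity(0, cpu_ids)       # 0 = this process
    return os.sched_getaffinity(0)         # read back the effective set

available = sorted(os.sched_getaffinity(0))
# Assumption: the first 10 CPUs map to NUMA node 0 ("10 vCPUs per node").
node0 = set(available[:10]) if len(available) >= 10 else set(available)
print(sorted(pin_to_cpus(node0)))
```

The same effect can be had from the shell with `numactl --cpunodebind=0 --membind=0 <command>`, which also keeps memory allocations local to the node.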
Can anyone confirm whether GPUDirect RDMA works on GTX cards? I'm convinced it does not, but I find it strange/amazing that some teams manage to get excellent efficiency on GTX clusters.