Looking for a solution with 6 full x16 PCIe slots

Frank173

Member
Feb 14, 2018
Hi,

I am in need of a solution that offers six full x16 PCIe slots: four slots for four GPUs (for deep learning; with certain models and toolboxes I use, GPUs in an x8 slot perform significantly slower than in x16 slots), one slot for a 100-gigabit Mellanox ConnectX-4 NIC, and one slot for my HighPoint 4x NVMe PCIe card. For full performance I can't make do with x8 bandwidth; all six PCIe slots need to offer full x16 bandwidth.

I sense there is currently no motherboard and CPU combination out there that satisfies this. Six x16 slots add up to 96 lanes, and unless I am mistaken, current dual Xeon solutions only offer 2x 48 lanes; I have not come across a board that makes all of those lanes available through six PCIe slots. So the only CPU that would currently provide the lane count I require is AMD's EPYC. While I find the base clock speeds of the entire EPYC lineup very low, I wonder how the EPYC CPUs perform in boost mode, how hot they run, and for how long they can sustain boost. More importantly, I have not found any EPYC motherboard that satisfies my PCIe slot needs.
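The lane arithmetic above can be sketched quickly. This is a hedged back-of-the-envelope calculation assuming PCIe 3.0 (8 GT/s per lane, roughly 0.985 GB/s usable after 128b/130b encoding), not vendor-quoted figures:

```python
# Rough PCIe lane/bandwidth budget for the six-slot build described above.
# Assumes PCIe 3.0: ~0.985 GB/s usable per lane after 128b/130b encoding.

GB_S_PER_LANE = 0.985  # approximate usable bandwidth per PCIe 3.0 lane

slots = {
    "gpu_1": 16, "gpu_2": 16, "gpu_3": 16, "gpu_4": 16,
    "mellanox_cx4_100gbe": 16,  # 100 Gb/s is ~12.5 GB/s, so x8 (~7.9 GB/s) would bottleneck it
    "highpoint_nvme": 16,       # carrier for four x4 NVMe drives
}

lanes_needed = sum(slots.values())
print(f"Lanes needed:          {lanes_needed}")        # 96
print(f"Dual Xeon SP (2 x 48): {2 * 48} lanes")        # 96
print(f"Single-socket EPYC:    128 lanes")
print(f"Usable per x16 slot:   ~{16 * GB_S_PER_LANE:.2f} GB/s")
```

With 128 lanes, a single EPYC socket is the only mainstream part of that era with headroom for all six x16 links plus other I/O, which matches the conclusion above.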

Here are my questions:

1) Is there a dual Xeon motherboard out there that provides six PCIe slots which can concurrently run at x16 bandwidth?
2) Is there an EPYC board out there that provides six PCIe slots which concurrently run at x16 bandwidth?
3) Is there any hope that AMD will release new EPYC processors with higher clock speeds any time soon? By higher I mean > 3 GHz sustainable.
4) Are there OEM servers that satisfy my needs?
5) Do you know of a custom EPYC solution where I could use risers and route the lanes to an external PCIe enclosure in case of space limitations?

Thanks for your suggestions,
Matt
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
I assume your workload can't be split across multiple nodes?

For four 16x GPU slots (assuming they take double-slot GPUs) you're already probably looking at eight PCI slots and thus likely a custom form factor; two additional 16x slots would take you up to ten and thus likely a custom case format as well.

Closest in terms of semi-off-the-shelf stuff are custom whitebox barebones like SM's GPU-optimised range, e.g.:
Supermicro | Products | SuperServers | 1U | 1028GQ-TRT
...but I don't see any in that list that can accommodate the four GPUs and still have another two x16 slots free, other than the huge-ass >4-GPU jobs.

Edit: Tyan make a barebones 1P EPYC mobo/chassis that'll take 4x GPUs and apparently has three x16 slots left over (though I only see riser space for one to be used as an expansion card), although it's currently only listed as "coming soon":
https://www.tyan.com/Barebones_GA88B8021_B8021G88V2HR-2T-N
 

Frank173

Member
Feb 14, 2018
Thanks for replying. I am confused, though, by your "eight PCI slot" comment. I need six x16 PCIe slots, not eight or ten. As long as the slots are spaced apart so that standard double-width GPUs fit, there shouldn't be a problem, though I have not seen such a board. I saw the Tyan board (not the server barebone you linked to), but its PCIe configuration looks incredibly strange, and I do not see how Tyan offers six x16 links on those expansion/riser slots. I have zero issue with risers or with routing PCIe lanes to an external enclosure. Heck, I do not even mind going for a full server-based setup, but even there I have not found what I am looking for, though the Tyan barebone you linked to comes close. But am I willing to shell out USD 1,800 just for the server case with board alone?...


 

kapone

Well-Known Member
May 23, 2015
I seriously doubt such a beast exists. Even if it did, at full throttle on all those PCIe slots, I suspect the CPU(s) would run out of steam long before.
 

Patrick

Administrator
Staff member
Dec 21, 2010
Just wondering, are you not using NCCL for this?

Most of the push for EPYC and GPUs right now is focused on HPC applications, where you have more CPU/RAM-to-GPU traffic instead of the GPU-to-GPU communication that you tend to see in deep learning.

If you are using the Mellanox ConnectX-4, we often see those attached to each NUMA node, which means four in a single-socket EPYC system. It is not uncommon to see two cards in a dual Xeon system to avoid QPI/UPI hops.
 

Frank173

Member
Feb 14, 2018
Not sure how you relate NCCL to my problem. NCCL does not necessarily speed up training of models; it completely depends on what you are doing. I actually run multiple models in parallel, one on each of the GPUs, and for that NCCL does not offer any benefit. Ah, OK, after reading on I guess I see why you brought up NCCL. The ConnectX-4 is entirely unrelated to deep learning: I use it in peer-to-peer mode to move large data sets to a different machine for post-processing, and also to access large data sets for statistical modeling on a connected workstation.

Long story short, I want to combine a deep learning training server with a data repository and file server. This is not for a corporate setup but for a 2-3 person personal lab.
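For what it's worth, the "one independent model per GPU" pattern described above needs no NCCL at all: each training job simply pins itself to a single device before the framework initialises. A minimal sketch using one worker process per GPU (the body of `train_model` is a hypothetical placeholder, not any specific framework's API):

```python
import os
import multiprocessing as mp

def train_model(gpu_id, model_name):
    # Pin this worker to one GPU; a framework imported afterwards
    # (TensorFlow, PyTorch, etc.) then sees only that device as device 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # ... load data, build model_name, run the training loop here ...
    return f"{model_name} trained on GPU {gpu_id}"

if __name__ == "__main__":
    # Four unrelated models, one per GPU, no inter-GPU communication needed.
    jobs = [(0, "resnet"), (1, "lstm"), (2, "unet"), (3, "gan")]
    with mp.Pool(processes=len(jobs)) as pool:
        for result in pool.starmap(train_model, jobs):
            print(result)
```

Since the jobs never exchange gradients, each GPU only talks to the CPU/RAM and storage, which is exactly why per-slot x16 host bandwidth matters more here than GPU-to-GPU topology.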

 

Patrick

Administrator
Staff member
Dec 21, 2010
Makes sense on the CX4 then.

And on NCCL: we work with quite a few larger AI / deep learning shops. The general rule is single root for GeForce GTX / Titan cards; for Tesla V100 you want NVLink.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
I am confused though about your "eight PCI slot" comment. I need 6 x16 PCIe slots, not 8 or 10.
Yes, but if you're using standard mobo and case layouts with double-wide graphics cards (you don't specify which cards you're using), then that means a board and a case featuring ten PCI "spaces", which was basically a lead-up to saying "no, you won't get this with standard COTS form factors".

As for that Tyan board, as far as I can make out, its physical PCIe layout means you can have four double-slot GPUs and one x16 PCIe expansion card.

Frank173 said:
I want to combine a deep learning training server with a data repository and file server. This is not for a corporate setup but in a 2-3 people personal lab.
Those are quite different use cases and, unless you've got some quite serious IO requirements, it might make more economic sense to split the data off onto a separate server; as you say, that way you might be able to avoid spending X grand on a custom barebones (especially if you're trying to combine a GPU compute chassis with a file server chassis: those SM GPU chassis with 24 drive bays are likely wandering into "if you have to ask you can't afford it" territory; a reseller on this side of the pond has the bottom-level spec starting at £8,300 ex. VAT without any GPUs).
 

Frank173

Member
Feb 14, 2018
Correct. The Titan V does not expose the NVLink connector.

 

Frank173

Member
Feb 14, 2018
Fair suggestions and food for thought, thanks for sharing. I use four striped NVMe modules for over 10 GB/s sequential throughput, plus mirrored physical drives for redundancy. This is just for a small lab, not a corporate or larger-scale setup. I will take another look at the Tyan board. Thanks again.
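As a sanity check on those throughput numbers: assuming four PCIe 3.0 x4 NVMe drives at roughly 2.8 GB/s sequential read each (an assumed per-drive figure for illustration, not the actual hardware in this build), a RAID 0 stripe lands right around the quoted 10+ GB/s, which is why the carrier card needs a full x16 link:

```python
# Back-of-the-envelope: aggregate sequential throughput of a 4-drive NVMe stripe.
per_drive_gb_s = 2.8      # assumed sequential read per PCIe 3.0 x4 NVMe drive
drives = 4

stripe_gb_s = per_drive_gb_s * drives
x16_usable = 16 * 0.985   # ~15.76 GB/s usable on a PCIe 3.0 x16 link
x8_usable = 8 * 0.985     # ~7.88 GB/s on x8, below the stripe's potential

print(f"RAID 0 stripe:    ~{stripe_gb_s:.1f} GB/s")
print(f"Fits in x16 link: {stripe_gb_s < x16_usable}")  # True
print(f"Fits in x8 link:  {stripe_gb_s < x8_usable}")   # False
```

The same arithmetic shows where an x8 slot would cap the array at well under its aggregate speed, consistent with the "no x8 anywhere" requirement in the opening post.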
