Interesting.
My workload also has a CPU component, so Threadripper would make sense for me as well, but I want IPMI and no sTRX4 boards currently support it. Also, Epycs are basically underclocked Threadrippers, and that might not be such a bad thing (see below).
I will be ordering my first system this week. Haven't really considered getting a company to build it, I want to do this myself and learn more about it, and save money. I've built several home servers over the years.
Watercooling sounds flaky and unnecessary.
Re. GPU underclocking, the 2080 Ti seems to have a clear sweet spot at 160 Watts:
Watts (vs 160 W) | training speed in samples/s (vs 160 W) |
150 (-6.3%) | 808 (-6.3%) |
155 (-3.1%) | 837 (-2.9%) |
160 | 862 |
170 (+6.3%) | 880 (+2.1%) |
180 (+12.5%) | 893 (+3.6%) |
200 (+25.0%) | 911 (+5.7%) |
220 (+37.5%) | 926 (+7.4%) |
240 (+50.0%) | 936 (+8.6%) |
This is all on Ubuntu 18.04, set via "nvidia-smi -i 0 --power-limit=160" and so on. Going above 160 watts buys only small performance gains, while going below 160 watts hits a cliff of some kind. These numbers are from training, but I've seen the same pattern for inference, on two different cards. Not sure why these cards are clocked so high by default; over its lifetime, the running cost of a 2080 Ti is dominated by power consumption.
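To make the tradeoff explicit, here's a quick sketch (numbers taken straight from the table above) computing samples per joule at each power limit. Efficiency is essentially flat across 150-160 W and falls off steadily above that, while absolute speed only creeps up:

```python
# Power limit (W) -> measured training speed (samples/s), from the table above.
results = {150: 808, 155: 837, 160: 862, 170: 880,
           180: 893, 200: 911, 220: 926, 240: 936}

baseline_w, baseline_sps = 160, results[160]

for watts, sps in sorted(results.items()):
    efficiency = sps / watts  # samples per joule of GPU energy
    print(f"{watts:>3} W: {sps} samples/s, {efficiency:.2f} samples/J, "
          f"{100 * (sps / baseline_sps - 1):+.1f}% speed for "
          f"{100 * (watts / baseline_w - 1):+.1f}% power")
```

At 240 W the card is only ~8.6% faster than at 160 W while drawing 50% more power; per joule, the 150-160 W range is the clear winner.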
I also have a Threadripper 2990WX on my training server but haven't tried underclocking that. GPU power consumption will be the primary cost of the inference servers.