Interesting.
My workload also has a CPU component, so Threadripper would make sense for me as well, but I want IPMI and no sTRX4 boards currently support it. Also, Epycs are basically underclocked Threadrippers, and that might not be such a bad thing (see below).
I will be ordering my first system this week. Haven't really considered getting a company to build it, I want to do this myself and learn more about it, and save money. I've built several home servers over the years.
Watercooling sounds flaky and unnecessary.
Re. GPU underclocking, the 2080 Ti seems to have a clear sweet spot at 160 Watts:
Watts (vs 160 W) | training speed in samples/s (vs 160 W) |
150 (-6.3%) | 808 (-6.3%) |
155 (-3.1%) | 837 (-2.9%) |
160 | 862 |
170 (+6.3%) | 880 (+2.1%) |
180 (+12.5%) | 893 (+3.6%) |
200 (+25.0%) | 911 (+5.7%) |
220 (+37.5%) | 926 (+7.4%) |
240 (+50.0%) | 936 (+8.6%) |
This is all on Ubuntu 18.04, set via "nvidia-smi -i 0 --power-limit=160" and so on. Going above 160 watts buys only small performance gains, while going below 160 watts hits a cliff of some kind. These numbers are from training, but I've seen the same pattern for inference, on two different cards. Not sure why these cards are clocked so high by default; over its lifetime, the running cost of a 2080 Ti is dominated by power consumption.
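To make the tradeoff explicit, here's a quick sketch (numbers taken straight from the table above) computing samples per joule at each power limit. Efficiency is essentially flat across 150-160 W and falls off steadily above that, while absolute speed only creeps up:

```python
# Power limit (W) -> measured training speed (samples/s), from the table above.
results = {150: 808, 155: 837, 160: 862, 170: 880,
           180: 893, 200: 911, 220: 926, 240: 936}

baseline_w, baseline_sps = 160, results[160]

for watts, sps in sorted(results.items()):
    efficiency = sps / watts  # samples per joule of GPU energy
    print(f"{watts:>3} W: {sps} samples/s, {efficiency:.2f} samples/J, "
          f"{100 * (sps / baseline_sps - 1):+.1f}% speed for "
          f"{100 * (watts / baseline_w - 1):+.1f}% power")
```

At 240 W the card is only ~8.6% faster than at 160 W while drawing 50% more power; per joule, the 150-160 W range is the clear winner.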
I also have a Threadripper 2990WX on my training server but haven't tried underclocking that. GPU power consumption will be the primary cost of the inference servers.