Why aren't ThreadRippers dominating the HPC market?


fp64

Member
Jun 29, 2019
tinco,
Your best bet is to get an account on AMD's DevGurus forum. AMD staff are not very responsive (unless there is a software bug involved), but it's probably better than collecting random guesses. You may have to wait a day or two for an answer, if you get a response at all. When I was programming OpenCL I got most of my questions answered by AMD staffers in 1 to 3 days. When I was programming Fortran/C on an Intel chip 4 years ago, I went on its forums for optimization help, got up to 5 responses per day (weekdays only), and ended up with 40% faster code within 5 days. Threadripper is not a good computational chip, and you most probably would have been better off with a second-hand v3/v4 server.
--
 

tinco

New Member
Apr 22, 2020
Hi @fp64, I'd love to get my hands on the source code, but unfortunately we don't have that luxury. The software is off the shelf; we're just trying to figure out the best hardware to run it on. If we're lucky the software vendor will find out they can do better on server hardware, but as far as I'm aware their target platform is workstations. There's other photogrammetry software out there, but none that fits our business case as well as Metashape at the moment.
 

tinco

New Member
Apr 22, 2020
I totally forgot, but earlier this morning DDR4-4000 32GB modules came out, which means I can have nearly 30% faster memory in my Threadripper machines. I wonder if that's better than having double the number of memory channels.

Is double the number of memory channels really just double the bandwidth? I did notice that if your memory is located on the wrong side of the board, you get an extra ~100ns penalty for having to go through the other socket's memory controller. I want to run 8 separate processes, each with 64GB allocated to it and locked to a 1/8th subset of the cores; can I avoid having them allocate memory on the wrong side?

I just got a quote from our supplier. We're at roughly $10k per Threadripper machine and $24k for a dual EPYC 7542 system; comparing two Threadrippers to one dual EPYC, that's a $4k difference, but the $2,000 lower license costs for the EPYC make it a really close race. I don't think the power efficiency is going to make up that difference anytime soon.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Before you spend that amount of money, you need to think long and hard whether 30% faster memory will actually equate to any substantial gains in your workload. Better still would be to test long and hard!

Data on this sort of stuff is still thin on the ground since most benches you see of threadrippers et al usually only test different memory against games but there are a few knocking around. Just don't get taken in by all of the tests that were done for Zen/Zen+ chips - Zen 2 is a completely different beast when it comes to memory performance, and much less in need of high-speed/low-latency memory in order to reach peak performance. For workloads that are memory-sensitive, latency usually matters more than bandwidth.

Going from two to four memory channels is a doubling of theoretical bandwidth - whether your applications actually make use of it all is another matter. If you want to make sure that the data in RAM is held "closest" to the core(s) processing that data, then that's in the realms of NUMA but IIRC that's not applicable to a single-socket threadripper system (since a 3000 series threadripper is a single NUMA domain due to the IO die). 2P Epyc 7002 systems are two NUMA domains. If your workload isn't NUMA-aware but is sensitive to memory bandwidth/latency, better to keep it on a 1P system where NUMA isn't an issue.

It's an orthogonal issue I guess, but why does licensing for a 2P Epyc system (2*7542 = 64 cores) cost less than for a 1P threadripper (max. 64 cores)?
 

tinco

New Member
Apr 22, 2020
They've got good old per-machine licensing, not per-core. This software is really workstation-oriented, even though they have a fully functional network processing system.

I'm definitely thinking long and hard. The thing is, we've already got one of these Threadripper systems up and running and we're already happy with the performance. I'm not sure how to test this without grabbing the wallet and buying a system. It's not like people have these 8-GPU dual-socket 1TB-RAM systems lying around doing nothing. So for now, going for the Threadripper is the safe choice, and going big is the risky option.

Is there any way, through monitoring or instrumenting, to determine whether memory bandwidth is a bottleneck in this software? The fact that it needs at least 64GB is a bit of a tell, of course.
 

TXAG26

Active Member
Aug 2, 2016
I'm not sure how many total systems you need, but if it's more than one or two, I'd recommend buying one of each and benchmarking them with your actual load.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
They've got good old per-machine licensing, not per-core. This software is really workstation-oriented, even though they have a fully functional network processing system.
If it's per-machine then shouldn't the licensing be the same for either system, not $2000 cheaper for one of them?

I'm definitely thinking long and hard. The thing is, we've already got one of these Threadripper systems up and running and we're already happy with the performance. I'm not sure how to test this without grabbing the wallet and buying a system. It's not like people have these 8-GPU dual-socket 1TB-RAM systems lying around doing nothing. So for now, going for the Threadripper is the safe choice, and going big is the risky option.
I don't know whether it's possible with your current setup or not, but when I was testing my new X470D4U + 3700X combo (the 3000 series were brand new at the time and no-one had done any real memory comparisons at that point), I did a faux memory test comparing my 2666 ECC modules (stock and overclocked to 3200) against my 3200 modules (stock and downclocked to 2666) to see if, like the 2000 series before it, memory speed was important. For my ffmpeg-based workloads, it wasn't (a 1-1.5% improvement); the difference only became apparent in synthetics.

If you can try different memory modules and/or clocks on your existing setup and see if it makes any difference to performance, it should give you some idea whether moving to faster memory on this or another platform is beneficial or not.

Is there any way, through monitoring or instrumenting, to determine whether memory bandwidth is a bottleneck in this software? The fact that it needs at least 64GB is a bit of a tell, of course.
I assume you mean it needs a minimum of 64GB per instance rather than 64GB/s per instance or anything like that? There's no method I know of that'll accurately gauge how bottlenecked your system is on memory (although you can try inferring from observable memory stats) but if you're already running multiple instances of the software on a single machine and still showing near-linear scaling, it would indicate that memory isn't a bottleneck.
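If you want to test that inference directly, a crude probe is to run a memory-heavy kernel with 1 worker and then with N at once and compare aggregate throughput. A minimal sketch of the idea (buffer sizes and worker counts are illustrative, nothing here is Metashape-specific):

```python
import time
from multiprocessing import Pool

def stream_pass(size_mb):
    # Stream through a buffer; for a real probe make it much larger than
    # the CPU's L3 cache so you actually hit DRAM rather than cache.
    buf = bytearray(size_mb * 1024 * 1024)
    passes = 4
    t0 = time.perf_counter()
    for _ in range(passes):
        _ = bytes(buf)  # one full read of buf plus one full write of the copy
    elapsed = time.perf_counter() - t0
    return passes * len(buf) * 2 / elapsed / 1e9  # rough GB/s (read + write)

def aggregate_gbs(workers, size_mb=64):
    # Run the same kernel in N processes at once and sum their throughput.
    with Pool(workers) as pool:
        return sum(pool.map(stream_pass, [size_mb] * workers))

if __name__ == "__main__":
    solo = aggregate_gbs(1)
    packed = aggregate_gbs(4)
    print(f"1 worker: {solo:.1f} GB/s, 4 workers: {packed:.1f} GB/s")
    # If N workers deliver far less than N times the solo figure, shared
    # DRAM bandwidth (not core count) is the likely bottleneck.
```

Tools like Linux perf or AMD uProf can read the actual memory controller counters if you want real numbers instead of an inference.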
 

tinco

New Member
Apr 22, 2020
Ah sorry, I didn't really make the situation clear. We've been running a cluster of simple workstations. Then we figured out we could add a GPU to each 3900X-based system and double the performance by running 2 instances on one machine.

Now I've got budget to roughly quadruple our throughput, and based on this new information I experimentally acquired a Threadripper system with 4 GPUs, which performs as we hoped, doing roughly 4 times the throughput of our single-GPU systems. So now I've got three quarters of the budget left and I've proven the scaling works, so EPYC isn't off the table anymore.

The goal is to run 16 instances; each instance takes up 64GB of RAM, does best at around 8 CPU cores, and takes up one GPU (we're using 2080 Supers). I can either buy 3 more of the Threadripper systems, which sadly don't have IPMI yet but do have a proven track record, or go further off the beaten track and add the 8-GPU EPYC to our already colorful (literally :p) server rack (and either fill out the budget with another Threadripper, or make starry eyes at the investors to shell out for another dual-socket EPYC; a single-socket EPYC 4-GPU machine isn't cost effective).

Offtopic: the reason for the 8 GPUs is actually a bit silly. The instances only use the GPUs about 10% of the time, so they're mostly idle, but since the machines all work on chunks of the same project, all the instances receive a chunk at the same time, so all the instances use their GPUs at roughly the same time. I could add some delays, but that would not be very efficient, and it still wouldn't 100% eliminate the issue.
 

alex_stief

Well-Known Member
May 31, 2016
Is double the amount of memory channels really just double the amount of bandwidth?
There can be some losses, but it's pretty close to double. So you won't get anywhere near the total bandwidth of 8-channel Epyc with a Threadripper CPU. No matter how high you overclock that quad-channel memory.
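For reference, the theoretical peak is simple arithmetic: channels × transfer rate × 8 bytes per 64-bit transfer. A quick back-of-the-envelope sketch (the function name is just illustrative):

```python
# Theoretical peak DDR4 bandwidth: channels * MT/s * 8 bytes per 64-bit transfer
def peak_bw_gbs(channels, mts):
    return channels * mts * 8 / 1000  # GB/s (decimal)

tr_quad   = peak_bw_gbs(4, 3200)   # quad-channel DDR4-3200 Threadripper
tr_oc     = peak_bw_gbs(4, 4000)   # same, overclocked to DDR4-4000
epyc_octa = peak_bw_gbs(8, 3200)   # eight-channel DDR4-3200 Epyc (per socket)

print(tr_quad, tr_oc, epyc_octa)   # 102.4, 128.0, 204.8 GB/s
```

So even the DDR4-4000 overclock only closes a fraction of the gap to a single eight-channel socket.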

I did notice that if your memory is located on the wrong side of the board, you get an extra 100ns penalty for having to go through the other sockets bandwidth. I want to run 8 separate processes, each with 64gb allocated to it, and locked to 1/8th subset of the cores, is it avoidable to have them allocate on the wrong side?
By default, the "first touch" rule applies (unless of course the programmers did something funny): the core that first accesses (not allocates) the data determines where that data physically resides in RAM.
So all you have to do is prevent the operating system from ruining memory locality by shifting threads to other cores, which for some reason most OSes still do liberally.
This is usually done by pinning the threads to certain cores, e.g. using taskset, or whatever alternative your OS offers. Automatic page migration is supposed to improve memory locality further by moving data in memory closest to the cores that access it the most, but of course that can't work when the OS is shifting threads to different cores several times a second. And if your threads have been pinned correctly from the start, page migration can't improve things anyway.
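On Linux that looks something like the sketch below; `os.sched_setaffinity` is the stdlib equivalent of taskset, and the command name in the usage comment is made up:

```python
import os
import subprocess

CORES_PER_INSTANCE = 8

def launch_pinned(cmd, instance):
    # Pin instance i to its own contiguous block of cores. Affinity set
    # before exec() is inherited by every thread the process later spawns,
    # so first-touch allocations stay local and the scheduler can't
    # migrate the threads away.
    wanted = set(range(instance * CORES_PER_INSTANCE,
                       (instance + 1) * CORES_PER_INSTANCE))
    # Fall back to whatever cores exist so the sketch also runs on small boxes.
    cores = wanted & os.sched_getaffinity(0) or os.sched_getaffinity(0)
    return subprocess.Popen(cmd,
                            preexec_fn=lambda: os.sched_setaffinity(0, cores))

# Hypothetical usage: 8 instances pinned to cores 0-7, 8-15, ..., 56-63
# procs = [launch_pinned(["photogrammetry_node"], i) for i in range(8)]
```

From the shell, `taskset -c 0-7 <cmd>` does the same thing, and on a multi-node system `numactl --cpunodebind=0 --membind=0 <cmd>` additionally forces the allocations onto the local node.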
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
I'd take this supposed leak with a truckload of salt, but if the image is to be trusted, it looks like there might be a range of Threadripper "Pro" parts with high clocks, eight-channel memory and support for RDIMMs (as well as UDIMMs and LRDIMMs). Essentially an up-clocked version of Epyc, it'll likely need a new socket/chipset (or just be based on the Epyc boards).

These would certainly make things more interesting in the workstation arena although I worry they'll be OEM-only like the rest of AMD's "Pro" marque.
 