Why aren't ThreadRippers dominating the HPC market?

tinco · Jun 21, 2020

For context: my company is running a small photogrammetry data processing cluster. Photogrammetry (like most data processing) doesn't scale super well horizontally, so high clock speed CPU's dominate. In addition to a fast CPU, photogrammetry also makes use of the GPU at certain steps of the process. We started out on consumer level hardware last year (3900X + 2080 Ti) and just recently discovered we can make more efficient use of the hardware by splitting the data in chunks and running multiple GPU's in each machine.

So we bought an experimental machine. It's a 3970X in a rack with 4 2080 supers. I bolted a Noctua TR4 to it, with 2 Delta 11kRPM 92mm fans, keeping it at 84 degrees whilst maintaining 100% load on 32 cores at 3700mhz.

32 cores at 3700mhz! That is absolutely insane. You have to go dual socket to come close with the 7F52. And then you're looking at a significantly more expensive system, that takes more power to boot (2x 240W cpu's? just looking at the spec sheet, haven't actually tested this).

I know basically all software performs best at high clock speeds. Why isn't running threadrippers more popular? There's only a couple boards out there, as far as I know Tyan is the only one that's decently specced and has IPMI. We went with ASRock just because it has 2x 10gbit, and we don't have PCI-e to spare for a network card, so we just don't have IPMI.

Am I missing something? Why isn't everyone running these CPU's?

ari2asem · Jun 21, 2020

my guess is because there is not that much choise for boards with server specifications ?? (ipmi, ecc-reg, 10gbps network). although you can have ecc-udimm (unbuffered) on TR-platform (i am running 1920x cpu with ecc-udimm on asrock phantom gaming board).

real server boards based on TR-platform (1st, 2nd or 3rd gen) are very rare and small market. and therefore more pricey

tinco · Jun 21, 2020

Yeah but wouldn't a big player just ask ASRock to make them a board? Basically slap an ast2500 on that Trx40 creator board and you're there. If the chips that good, why aren't they making those boards anyway? Is it just that the markets still too small? Companies are just waking up to Epyc Rome, maybe ryzen is just not on the horizon.

ramblinreck47 · Jun 21, 2020

tinco said:
Yeah but wouldn't a big player just ask ASRock to make them a board? Basically slap an ast2500 on that Trx40 creator board and you're there. If the chips that good, why aren't they making those boards anyway?

ASRock just announced this recently: ASRock Rack > TRX40D8-2N2T

tinco · Jun 22, 2020

Well I'll be damned. Why did they have to go and make that last slot smaller? That board looks absolutely perfect except for that last pcie slot..

So I guess we're not the only ones

Edit: ohh I just noticed that last slot is open ended so it will definitely fit the 4th card. That's excellent, thanks for bringing it too my attention @ramblinreck47 I think I'll ordering a couple!

fp64 · Jun 22, 2020

Not enough memory channels, the extra memory hop @ having to go thru the io chip and floating point hardware still not competitive with those of intel.
--

tinco · Jun 22, 2020

Hey thanks for pitching in @fp64 that's interesting you'd say that. I haven't been able to benchmark an EPYC yet for our software, though I recently did get my hands on a system so I should be able to do that soon. I do know that the Ryzen's outperform both the Xeon's and the i9's on our workload, though mostly because they've got more high clocked cores.

So would you say it's likely the EPYC might outperform the Threadripper even if it's lower clocked because of the extra memory delay? I suppose maybe that's something that machine learning systems deal with because of their random access patterns and huge ram requirements?

EffrafaxOfWug · Jun 22, 2020

I'm not entirely sure where the "Zen 2 has <Intel FPU performance" claim is coming from either. Other than AVX512 code I wasn't aware of any large performance deficit on the AMD side in FPU workloads; StH's benches don't seem to have included any large discrepancies at least.

alex_stief · Jun 22, 2020

Because TR is cut down from Epyc to avoid internal competition, putting it into the "prosumer" category.
No support for RDIMM means low overall memory capacity. 256GB max.
Many HPC applications run into memory bandwidth bottlenecks with this many cores on only 4 memory channels. Advantage Epyc.
Power efficiency is an important topic for HPC clusters. The lower clocked Epyc CPUs are better at that.
Node size: Epyc still allows you to have 2 CPUs on one board. While many HPC applications can run on distributed memory systems, having more performance within a shared memory system is still favorable.
Density: With Epyc, you can get 4 dual-CPU nodes within 2U. Even if floor space is not too high on the priority list, the difference in density is too big to overlook.
The list goes on...

hmw · Jun 22, 2020

alex_stief said:
No support for RDIMM means low overall memory capacity. 256GB max.
Many HPC applications run into memory bandwidth bottlenecks with this many cores on only 4 memory channels. Advantage Epyc.

^^^^this 10x

For any medium to large company wanting to do HPC, EPYC just makes more sense.

Price is usually not an issue in these cases however, even for medium companies, EPYC is available at discounts, making any price difference between TR and EPYC really a non-issue. Look at the HP deals for EPYC processors - they undercut the equivalent Threadripper SKU.

Memory configuration is a huge factor in decision making as is ESXi support. A lot of boxes sold are virtualized so that teams can use VMs with GPUs via SR-IOV. TR simply doesn't have the same virtualization support - think about all the literature available for ESXi and NPS=1,4 etc, there isn't the same knowledge base available for TR

And finally there's ancillary support. If you're adding three or four 100 GbE NICs to your box, and you need support because of some PCIe incompatibilities - you're more likely to get it from tier 1 vendors such as Supermicro, Tyan and Quanta. Supermicro or Tyan might not care about the small shop that uses a dozen motherboards but will care when the numbers are larger.

Since most of the HPC market is medium to larger sized companies - and they tend to gravitate towards folks other than Asus etc, what you see in the market is simply a reflection of this.

fp64 · Jun 22, 2020

the impediments to performance of the current tr generation are cumulative. u have to overcome all to make tr attractive for hpc. zen2 trails badly intel in floating point performance. see the multicore scores in the puget hpc blog:

HPC Parallel Performance for 3rd gen Threadripper, Xeon 3265W and EPYC 7742 (HPL HPCG Numpy NAMD)

On March 19, 2020 I did a webinar titled, "AMD Threadripper 3rd Gen HPC Parallel Performance and Scaling ++(Xeon 3265W and EPYC 7742)" The "++(Xeon 3265W and EPYC 7742)" part of that title was added after we had scheduled the webinar. It made the presentation a lot more interesting than the...

www.pugetsystems.com

hmw · Jun 22, 2020

fp64 said:
the impediments to performance of the current tr generation are cumulative. u have to overcome all to make tr attractive for hpc. zen2 trails badly intel in floating point performance. see the multicore scores in the puget hpc blog:

HPC Parallel Performance for 3rd gen Threadripper, Xeon 3265W and EPYC 7742 (HPL HPCG Numpy NAMD)

On March 19, 2020 I did a webinar titled, "AMD Threadripper 3rd Gen HPC Parallel Performance and Scaling ++(Xeon 3265W and EPYC 7742)" The "++(Xeon 3265W and EPYC 7742)" part of that title was added after we had scheduled the webinar. It made the presentation a lot more interesting than the...

www.pugetsystems.com

That is a confusing benchmark discussion. The EPYC machine is actually an Azure instance? Why not test apples to apples. Having said that - there are now 'faster' EPYC chips targeted at the HPC market: https://www.servethehome.com/amd-epyc-7f52-benchmarks-review-and-market-perspective/4/

And I'd be cautious about AVX-512. I asked developers why they didn't use it more often and the answer was that even when used judiciously, they still have the potential to use up so much thermal headroom that they can end up being slower or the same speed as AVX2 instructions

badskater · Jun 22, 2020

as @hmw said, AVX2 is the favorite one, due to thermal headroom. I'm working a lot on HPC deployments and most of the time people go to Xeon to prepare for AVX-512 in the future when it's more stable, Epyc if they want to future proof high core counts (I've seen a lot of 7f52 being used there) or ARM a lot and a lot more now.

The thing with HPC is all what @alex_stief said.

Some environments have multi generation nodes and upgrade their systems in phases, others just go with a huge setup at the start for 5-6 years. (you rarely see hpcs lasting more than that in single phase deployments)

The thing that's always taken into consideration is floor space, power per chassis and expected uses. 99% of HPCs use 100gbps+ for interconnections and now 25gbps, before 10gbps for normal networking. Some customers, due to their use case, we even recommend going Azure/AWS, because it's not worth it for them to get a permenant HPCs in their DCs.(Or they simply don't have the space)

TR uses too much floor space to even be considered. (and if you want GPUs, it's 1U for 4GPUs, 2U for 6 and 4U for 8 in a node)

TXAG26 · Jun 22, 2020

It is also a dual socket EPYC setup as well. It sounds like they may be working up some AMD Epyc workstations, so benchmarks on that hardware in a more apples-to-apples compare would be helpful.

tinco · Jun 23, 2020

I guess we're maybe just at a sort of weird spot. We're using software that's dealing with some weird concurrency issue, it doesn't scale well past ~8 cores. Possibly lock contention or whatever, but we don't have control over it. The way we deal with it is splitting the datasets, running the processing over more smaller chunks, but this costs a lot of both regular and video memory, so scaling doesn't come easy.

Our sweet spot lies at 64GB ram for every 8-12 cores, with 8GB of vmem and a reasonably fast GPU. So we went for the 3970X with 256GB ram and 4 2080 supers. I'm trying to figure out if the high frequency EPYC's or a Xeon Gold/Plat might make sense, but then the ThreadRipper just beats them at cost efficiency. I don't think we're limited on memory throughput, not sure how I'd measure that besides maybe squeezing the pipe and seeing if it hurts..

alex_stief · Jun 23, 2020

Well, the definition of HPC I agree with the most is "computing at a bottleneck". That could be floating point throughput, memory bandwidth or capacity, latency or whatever else might be capping execution speed. A poor implementation of an algorithm that does not scale well with over a handful of cores does not necessarily qualify as high performance computing. Anyway, it might be worth investigating what exactly is the limiting factor for your application. That would either allow you to tune the application, or at least choose the hardware best suited for it.

traderjay · Jun 23, 2020

fp64 said:
Not enough memory channels, the extra memory hop @ having to go thru the io chip and floating point hardware still not competitive with those of intel.
--

8 Channels still not enough? Intel can't touch AMD in many HPC/server performance unless it uses a very specialized instruction set that is missing on the current gen of EPYC, and it will be added in the third Gen.

alex_stief · Jun 23, 2020

I think he was referring to TR CPUs with that comment.

blinkenlights · Jun 23, 2020

I have some experience with both HPC cluster design and photogrammetry/remote sensing prior to the GPU accelerated era. Most large companies have clusters like the one @tinco described, but few would be running production or customer code on it at this point - maybe as a proof of concept. As @alex_stief and @badskater pointed out, everything from density (sockets, memory capacity) to power efficiency and floor space are against running desktop processors in HPC clusters, no matter how impressive their performance.

In a way, the HPC market disruption you describe already happened with Beowulf, what, about 25 years ago now? That paved the way for hyperscaling with GPUs and high-density nodes. Oddly enough, it looks like the
Chinese are pushing HPC back towards custom silicon and hybrid solutions (June 2020 | TOP500).

As an aside, this is a great article about modern HPC design - there is a lot more involved than slapping a few compute nodes in a rack with an IB fabric and calling it a day: Construction of a Supercomputer - Architecture and Design

badskater · Jun 23, 2020

@blinkenlights not just the chinese i'd say. The whole world is moving toward all of those custom stuff and hybrid solutions. Fugaku is gonna provoke a new change in how HPCs are designed for sure.

The CodeProject article is the first thing we read when we work with the HPC team at work, so I can confirm it's quite complex and a nice thing to understand all of it. (I'm a technical architect for mostly Private Cloud at work, but work often with the HPC team who lives in a completely different world) It's quite nice to see the difference in mentality between what I usually do and what they do, tho the designs are often the same when it comes to how we choose the components on both sides, since we do big customers usually.

Why aren't ThreadRippers dominating the HPC market?

New Member

Active Member

New Member

Active Member

New Member

Member

New Member

Radioactive Member

Well-Known Member

Active Member

Member

Active Member

Automation Architect

Active Member

New Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Automation Architect