Request for feedback: AMD EPYC under I/O load (PCIe utilization)


blinkenlights

Active Member
May 24, 2019
I am going to pose an honest question and request honest feedback, without this turning into a flame war. My Intel 2nd Gen Scalable on an X11SPi-TF home server suffered an unexpected catastrophic fault (CATERR) this weekend. The board would show the BMC booting up, then reset, turn on without display output, reset, rinse, repeat, recycle, etc. After about three hours I was able to get it back online, thank God, so all ended well.

I began to ponder the possibility of replacing the processor, main board, or both. Thoughts of "you too can live the dream of 256MB L3 cache" danced through my head, but then some recent comments by everyone's favorite Canuck (don't judge, I watch for the drops and hardware on fire) about EPYC struggling under I/O load came to mind:


Basically, what Linus and Wendell were saying is that Intel has fewer PCIe lanes available but struggles less than EPYC when those PCIe lanes are fully utilized. Wendell hinted at the Intel products being "better engineered" in that regard. Has anyone here seen that in real life? Perhaps, with an all-NVMe NAS similar to what Linus was building?
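For rough context on the lane counts in question, here is a back-of-the-envelope comparison of per-socket PCIe throughput; the lane counts and per-lane figures below are the commonly quoted nominal numbers, not measurements from either video:

```python
# Nominal per-socket PCIe lane counts and aggregate bandwidth (rough figures only).
PCIE_GBS_PER_LANE = {3: 0.985, 4: 1.969}  # approx. usable GB/s per lane after encoding overhead

platforms = {
    "Xeon Scalable 2nd Gen (1S)": {"lanes": 48, "gen": 3},
    "EPYC 7002 'Rome' (1S)": {"lanes": 128, "gen": 4},
}

for name, p in platforms.items():
    total = p["lanes"] * PCIE_GBS_PER_LANE[p["gen"]]
    print(f"{name}: {p['lanes']} x Gen{p['gen']} lanes ≈ {total:.0f} GB/s aggregate")
```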
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
What "struggles" have they encountered?

FWIW we're soak-testing some EPYCs at work right now, although it's a compute-heavy workload with comparatively little in terms of network/disc IO. Certainly no IO problems though.

P.S. Are you talking about Linus Torvalds? He's Finnish, not Canadian, and he only just got himself a Threadripper, but I've not seen anything about him having problems with it (quite the opposite, in fact).
 

blinkenlights

Active Member
May 24, 2019
What "struggles" have they encountered?
The second linked video provides the details; there's a good summary within the first five minutes. TL;DR, an AMD-based new build was outpaced by a large NVMe array capable of consuming up to half of the memory bandwidth. I warn you, their videos can be addictive, especially where $10k+ processors falling off a counter are involved :eek:
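To sanity-check the "half of the memory bandwidth" figure, here is a quick sketch; the drive count and the link/memory numbers are assumptions for a dense Gen3 all-NVMe build, not the exact configuration from the video:

```python
# Can a big NVMe array plausibly consume ~half of the memory bandwidth?
# All numbers are nominal/theoretical, purely for a sanity check.
drives = 24             # assumed drive count for a dense all-NVMe chassis
gbs_per_drive = 3.94    # PCIe Gen3 x4 link, approx. usable GB/s per drive
mem_channels = 8        # memory channels per EPYC socket
gbs_per_channel = 25.6  # DDR4-3200, theoretical GB/s per channel

array_bw = drives * gbs_per_drive
memory_bw = mem_channels * gbs_per_channel
print(f"NVMe array: ~{array_bw:.0f} GB/s, memory bus: ~{memory_bw:.0f} GB/s")
print(f"Array is about {array_bw / memory_bw:.0%} of theoretical memory bandwidth")
```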

P.S. Are you talking about Linus Torvalds? He's Finnish, not Canadian, and he only just got himself a Threadripper, but I've not seen anything about him having problems with it (quite the opposite, in fact).
No, Linus Sebastian of Linus Tech Tips (the linked videos). I worked with Linus Torvalds years ago - he is definitely nothing like the younger Linus :rolleyes: And he is indeed Finnish, but IIRC both he and Tove became American citizens a while back.
 

blinkenlights

Active Member
May 24, 2019
I did a bit more digging and found another discussion about this same topic: AMD Epyc has problems when you max out PCIe lanes. The general theme of that thread was "he built a storage system that was faster than the memory bandwidth" - something that I think is not quite right. Wendell over at Level1Techs actually did have a post related to Linus' server upgrade: Fixing Slow NVMe Raid Performance on Epyc.

From Wendell's post, it sounds to me like the "problem" is really ZFS overhead, not the hardware, and that makes sense to me. I am a big supporter of ZFS - billions of dollars poured into developing something enterprise-y that is now free to use? Sign me up! - and I personally use it all over the place. However, the reality is that the ZFS foundation was laid a long time ago, when flash devices were small and generally only useful for SLOG and caching. I doubt the architects ever envisioned having to deal with very large, multi-device flash arrays connected directly to the CPU(s) via PCIe.
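If anyone wants to separate filesystem overhead from platform limits on their own hardware, one way is to benchmark the same class of drive raw and then through a ZFS dataset. Below is a minimal sketch driving fio from Python, assuming a reasonably recent fio with JSON output; /dev/nvme0n1 and /tank/bench/testfile are placeholder paths, and the ZFS number will depend heavily on recordsize, ARC settings, and pool layout:

```python
# Compare raw-device vs. through-ZFS sequential read throughput using fio.
# Placeholder paths: /dev/nvme0n1 (raw drive) and /tank/bench/testfile (file on a ZFS dataset).
import json
import subprocess

def fio_seq_read(target: str, size: str = "8G") -> float:
    """Run a 30-second sequential read against `target` and return throughput in GB/s."""
    cmd = [
        "fio", "--name=seqread", f"--filename={target}", f"--size={size}",
        "--rw=read", "--bs=1M", "--iodepth=32", "--numjobs=4",
        "--ioengine=libaio", "--direct=1",  # O_DIRECT; ZFS may treat this as best-effort
        "--time_based", "--runtime=30", "--group_reporting", "--output-format=json",
    ]
    data = json.loads(subprocess.run(cmd, capture_output=True, check=True).stdout)
    return data["jobs"][0]["read"]["bw_bytes"] / 1e9

raw_gbs = fio_seq_read("/dev/nvme0n1")          # one drive, no filesystem
zfs_gbs = fio_seq_read("/tank/bench/testfile")  # same class of drive behind ZFS
print(f"raw: {raw_gbs:.1f} GB/s, zfs: {zfs_gbs:.1f} GB/s, overhead: {1 - zfs_gbs / raw_gbs:.0%}")
```

If the raw numbers keep scaling as you add drives but the ZFS numbers do not, that points at the filesystem rather than the platform.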

TL;DR I think this is less about EPYC and more about file system design.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Every system's got a bottleneck somewhere; upgrade one section to get rid of a bottleneck and you just end up with a larger bottleneck somewhere else.

FWIW, after you mentioned it I tried throwing as much at the 1P EPYC systems we're trialling at work as I had access to (basically a few 40GbE NICs and a half dozen NVMe drives) and we weren't able to observe any bottlenecks.

Didn't watch the video (don't like them, sorry) but it would have been useful for them to say what their methodology was. Filesystems always run into scaling problems when you effectively remove any IO constraints. The post on the Level1Techs forum was very interesting though - I've done a fair amount of tweaking md-raid on SSDs (it too has scaling problems with ultra-fast storage, especially if you use parity RAID) but haven't run into this myself.
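For reference, the usual knob for that kind of parity-RAID ceiling on md is the stripe cache. A minimal sketch, assuming a RAID5/6 array named md0 (a placeholder) and root privileges; the value used is a starting point, not a recommendation:

```python
# Inspect and raise the md RAID5/6 stripe cache, a common ceiling for parity-RAID
# throughput on fast SSDs. Requires root; "md0" is a placeholder array name.
from pathlib import Path

def stripe_cache_path(md: str = "md0") -> Path:
    return Path(f"/sys/block/{md}/md/stripe_cache_size")

path = stripe_cache_path("md0")
print("current stripe_cache_size:", path.read_text().strip())

# Larger values trade RAM (cached stripes per device) for fewer read-modify-write stalls.
path.write_text("8192\n")
print("new stripe_cache_size:", path.read_text().strip())
```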
 
  • Like
Reactions: blinkenlights

Patrick

Administrator
Staff member
Dec 21, 2010
A few notes here:
  • We covered the NVMe hot-plug issue a lot back in 2017-2019. It is mostly fixed. The right way to look at it is that a lot of server vendors assumed NVMe hot-plug would just work, but did not realize there were Intel-specific bits in the process. Many of the server vendors did not even realize that was the case during the EPYC 7001 launch. When you inserted drives, the entire system would lock, reboot, or the drives would not be seen. I asked about this on every EPYC server call I had for a good year and a half since I saw it as a key feature.
  • Linus is totally right: everything was developed on Xeon first and is now being moved to EPYC and others. Arm servers all have similar challenges, but with an even smaller support budget and install base. This is a very big deal.
  • Let me give you an example. We had a test system for the Naples launch. If you issued a "restart" command in the OS, you would need to go to the data center, open up the server, pull the CMOS battery, wait a few seconds, put it back together, and turn it on. This was an AMD AGESA issue near launch, but it is also why I refused to review servers on launch day and why some of our benchmarks were delayed. I flat out refused to review a server/platform that could not reboot. It was fixed a few days later.
  • With PCIe Gen4, something interesting is happening. DGX A100 is using EPYC, so support for that type of platform on AMD is going from essentially zero in 2019 to very good today. That is where AMD is going to forge ahead. Companies with PCIe Gen4 accelerators are using AMD EPYC to showcase their products today.
  • For most general-purpose workloads, EPYC is awesome
  • For higher-end storage, that video was somewhat strange. Most people are not doing ZFS or basic RAID on these boxes. ZFS was designed around disks plus flash/RAM caching; we can go beyond that, but that was the base design for the solution. Usually we see custom software or scale-out storage for high-end NVMe arrays. Remember, Microsoft/QCT got over 25GB/s (big B) per node using only four PCIe Gen4 SSDs per node. Software is a big deal when getting to this level of performance. Generally though, these solutions are focused on saturating per-node network bandwidth, not local array performance. If you only have 100Gbps of uplink, you do not need 500Gbps of local performance (see the rough numbers sketched after this list).
  • Intel VROC would show a big performance loss well before 25GB/s if you were using it for RAID.
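To put rough numbers on the uplink point above (nominal figures and an assumed four-drive Gen4 node, in the spirit of the Microsoft/QCT example rather than their actual measurements):

```python
# Uplink vs. local array throughput: why the network link, not the NVMe array,
# is often the per-node ceiling that matters. Nominal figures only.
uplink_gbps = 100             # assumed 100GbE per node
uplink_gbs = uplink_gbps / 8  # -> 12.5 GB/s of network payload at best
gbs_per_gen4_ssd = 7.0        # rough sequential read for a PCIe Gen4 x4 SSD
ssds_per_node = 4             # assumed, matching the four-SSD example above

local_gbs = ssds_per_node * gbs_per_gen4_ssd
print(f"local array: ~{local_gbs:.0f} GB/s, uplink: ~{uplink_gbs:.1f} GB/s")
print(f"The network can expose only about {uplink_gbs / local_gbs:.0%} of local throughput")
```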
I hit the point where I was comfortable enough that we now have a lot of EPYC 7001 nodes for test lab infrastructure and hosting infrastructure. EPYC 7002 is better. The maturity of the platform is better. What you are starting to see in 2020 is that the big vendors have now had enough time to iron out the bugs in their systems, so they are getting very good.

Software, like the NVMe hot-plug issue, takes more time. Where you are seeing the big, high-end EPYC Rome support is in workloads with PCIe Gen4 accelerators. Companies can get to market faster with their Gen4 devices on AMD than on Intel, so they are developing on AMD first, which is not how the industry ran for the last decade.