A dedicated Epyc Rome 128-core PCIe Gen4 NVMe-oF storage server - completed


Dreece

Active Member
Jan 22, 2019
503
161
43
Well, I finally got rid of all my PCIe 3 NVMe drives and replaced them with 12.8 TB Gen4 U.3 drives and 6.4 TB Gen4 HHHL cards.

All I can say is that the latest tech really is blisteringly fast. I did have to track down some networking bottlenecks (mainly configuration not optimised for NVMe-oF), but once all the 100G gates were opened correctly at both the switch and NIC driver levels (Windows was a nightmare! Linux was a rockstar!), I now have the bandwidth and latency of a quick and efficient datacentre in my homelab.
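
In case it helps anyone chasing the same bottlenecks, the Linux-side sanity checks were mostly of this sort (interface and device names here are just examples rather than my actual layout, and the PFC/ECN settings on the switch side are too switch-specific to be worth pasting):

# confirm the RDMA-capable 100G devices are actually visible to the kernel
ibv_devinfo | grep -E 'hca_id|link_layer|active_mtu'
rdma link show

# jumbo frames end to end (switch ports included) or latency suffers
ip link set dev enp65s0f0 mtu 9000
ip -d link show enp65s0f0 | grep mtu

# make sure the link negotiated the full 100G and isn't quietly dropping or pausing frames
ethtool enp65s0f0 | grep Speed
ethtool -S enp65s0f0 | grep -iE 'discard|pause|drop'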

I'm still relying on a SAS SSD pool (15.36 TB drives) for archives, and a small pool of 12 TB SAS spinners strictly for backups, with monthly dumps to LTO-8 (although I did consider just using some of the SSD pool for this, I figured it's nice to keep something from the medieval days in the setup out of respect).

Anyway, this is where the whole debate on power savings kicks in: fewer, larger, slower SSDs for cold data keep the overall watts down. I have Z-Wave infrastructure monitoring with remote per-server rack power control via openHAB all configured (didn't really need it, but hey, it looks great in Grafana). Right now it's just under 2 kW for my whole setup, including all the VMs plus one physical workstation plus three virtual workstations (missus, kid, media centre).

Hot data is now on the HHHL cards and warm data on a hot-swappable U.3 pool. I'm not using RAID anywhere except in the RAID 10 backup pool; if an SSD fails, I swap it out and do a restore. I ran through a few disaster scenarios, literally a few clicks, zero perspiration lol (honestly, with the way my hair has been magically vanishing over the past few decades, this is a godsend, my brows are here to stay :D)

The idea going forward is simple: everything from software to hardware is configured in such a way that I can easily upgrade as and when. Having no hard reliance on any particular software/hardware platform keeps things exceptionally flexible. The only things in this setup that will probably remain constant are the spinner pool and the tape infrastructure for backups, because those are what make the whole system stress-free, production class.

Anyhow, 2021 is here, and it appears the nightmare of 2020 isn't over yet, but I wish you all a Happy New Year nonetheless. All the best!
 

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
What software stack are you running on, and what was the associated cost? ;)
 

Dreece

Active Member
Jan 22, 2019
503
161
43
What software stack are you running on, and what was the associated cost? ;)
No stack as such. Just a bare-bones install of Debian, manually configured via some scripts as the target storage server: a bit of nvmetcli to create the configs for the drives until I got my head around all the config variables, then a fair bit of scripting and automation via sed to do drive-ID updates and nvme reloads automatically on swap-outs. As long as I'm only swapping out one drive at a time it works great; I could enhance the script to do more later if needed. For Windows I use StarWind's NVMe-oF initiator on the consuming servers, and just the regular native Linux functionality for the Linux servers. From that point on everything is as standard, like physical drives on the servers themselves.
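
For anyone wanting to roll something similar, the target side boils down to roughly this once the kernel modules are in place; the NQN, device path and address below are placeholders rather than my actual config, and nvmetcli will happily save/restore the equivalent as JSON:

# on the Debian target: export one drive over RDMA via the kernel nvmet target
modprobe nvmet
modprobe nvmet-rdma

SUB=/sys/kernel/config/nvmet/subsystems/nqn.2021-01.lab.u3pool01
mkdir -p $SUB
echo 1 > $SUB/attr_allow_any_host            # fine for a lab; restrict hosts in production

mkdir -p $SUB/namespaces/1
echo -n /dev/nvme0n1 > $SUB/namespaces/1/device_path
echo 1 > $SUB/namespaces/1/enable

PORT=/sys/kernel/config/nvmet/ports/1
mkdir -p $PORT
echo rdma      > $PORT/addr_trtype
echo ipv4      > $PORT/addr_adrfam
echo 10.10.0.1 > $PORT/addr_traddr
echo 4420      > $PORT/addr_trsvcid
ln -s $SUB $PORT/subsystems/nqn.2021-01.lab.u3pool01

# on a Linux consumer: plain nvme-cli, after which the namespace behaves like a local drive
modprobe nvme-rdma
nvme discover -t rdma -a 10.10.0.1 -s 4420
nvme connect  -t rdma -a 10.10.0.1 -s 4420 -n nqn.2021-01.lab.u3pool01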

Associated costs always start off low and then you lose track; you get to the end of it and would rather not look back at the bills. On a positive note, I did get some great B2B deals as I bought a few things in bulk for other jobs. Overall it wasn't cheap, but considering what fully packaged off-the-shelf servers go for, I achieved the same at poundland rates :cool:
 
  • Like
Reactions: Rand__

Dreece

Active Member
Jan 22, 2019
503
161
43
Now that's interesting since I just got a PM1735 (not yet installed). I wonder if you have already done some benchmarking on it, since users here and there complain about its performance.

I don't take the way most people typically benchmark all that seriously. I now tend to look more at queue depth across multiple servers in parallel, considering a storage server is serving all nodes (physical and virtual). Running a few NVMe-oF tests in parallel, I found that each server was getting roughly its share of the 100G pipe out of the storage server, capped just under maximum drive speed. When I doubled up the pipe by distributing the load across multiple paths, the aggregate throughput went up accordingly. In the end I settled for just one 100G link for the data, as I could never see a scenario in my case where I'd need serious bandwidth in parallel across multiple servers.
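
To put rough numbers on that: 100 GbE works out to roughly 12 GB/s of usable bandwidth, so with two clients hammering the target at once each one sees around 6 GB/s, which is already in the region of what a single enterprise Gen4 drive will sustain; add more clients and it's the pipe doing the dividing rather than the drives.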

PS: Yes, at times the PCIe x8 PM1735s are indeed slower than the PCIe x4 CM6-V U.3 models, though to be frank, as far as enterprise Gen4 SSDs are concerned, both deliver quite well across my lab. Personally I wouldn't recommend either of these for high-performance, cutting-edge workstation requirements; better to get the high-throughput QD1 prosumer drives for that (believe it or not!). The load/delivery curves on these drives are geared to their intended use, as we all know, and NVMe drives are quite reliant on having good, work-free cores available to them. One really shouldn't assume that a core Task Manager shows as free is actually a free core; Windows does bizarre things with NVMe drives, every minute is a different scenario with Windows lol. Testing on bare-metal Linux, with a configuration that replicates the intended workloads, is the best way forward, as some here already do religiously. (Except the Drag Racing Queen workload scenario isn't really an STH'er thing, and that can be fudged using all manner of methods as we know.)
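
For what it's worth, the sort of fio runs I mean look roughly like this, fired off on each consuming node at the same time against its own NVMe-oF namespace (device path and numbers are placeholders; tune bs/iodepth/numjobs to the workload you actually care about, and note these are read-only so nothing gets clobbered):

# steady-state 4k random read at a realistic queue depth, 60s, direct I/O
fio --name=randread --filename=/dev/nvme1n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
    --time_based --group_reporting

# and the QD1 latency run that actually matters for workstation feel
fio --name=qd1 --filename=/dev/nvme1n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 --runtime=60 \
    --time_based --group_reporting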
 
  • Like
Reactions: balnazzar

balnazzar

Active Member
Mar 6, 2019
221
30
28
PS: Yes, at times the PCIe x8 PM1735s are indeed slower than the PCIe x4 CM6-V U.3 models, though to be frank, as far as enterprise Gen4 SSDs are concerned, both deliver quite well across my lab. Personally I wouldn't recommend either of these for high-performance, cutting-edge workstation requirements; better to get the high-throughput QD1 prosumer drives for that (believe it or not!).
Indeed, I bought the PM1735 for workstation use, but mine are quite peculiar workstation requirements. I needed 3 DWPD endurance since the machine will hibernate many times per day, and hibernating with 128/256 GB of RAM is quite a write-intensive task, so prosumer Gen4 M.2 SSDs were not an option for me.
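To put rough numbers on it: the box hibernating, say, eight times a day with most of 256 GB in use means on the order of 2 TB of hibernation-image writes per day before any real work. A 1.6 TB drive rated at 3 DWPD gives roughly 4.8 TB per day of write budget, whereas a typical sub-1-DWPD prosumer drive of the same size would already be past its rating.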
The only prosumer drive that could maybe take it would be the 970 Pro, which is the only (AFAIK) MLC SSD left on the market, but I really wanted to take advantage of Gen4 PCIe (I have a ROMED8-2T mobo).
I use Linux, so I'm wondering if you could run some quick benchmarks with fio, so that we can compare once I get my workstation built and the PM1735 installed. I'm mainly interested in random read. Thanks!
 
  • Like
Reactions: Dreece

Dreece

Active Member
Jan 22, 2019
503
161
43
I personally feel that the new range of Optane drives that STH covered on their front page are going to be the unicorns of ultimate workstation builds... when a workstation build has a BIG budget, the new Optanes are the only way forward, and that 100 DWPD figure just blows everything out of the water (maybe overkill for workstations, though big workstation budgets tend to veer into the overkill department anyway).

For more sensible builds with mere-mortal budgets in mind, if endurance is a must, then these enterprise PCIe Gen4 drives fit the need quite well, hands down.
 

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
No stack as such. Just a bare-bones install of Debian, manually configured via some scripts as the target storage server: a bit of nvmetcli ...
So basically Debian as the NVMe target with the Linux-native / Windows StarWind initiators. Naturally RDMA, with 25 or 100G cards?

Associated costs always start off low and then you lose track; you get to the end of it and would rather not look back at the bills...
Lol, I feel ya
 
  • Like
Reactions: Dreece

balnazzar

Active Member
Mar 6, 2019
221
30
28
I personally feel that the new range of Optane drives that STH covered on their front page are going to be the unicorns of ultimate workstation builds... when a workstation build has a BIG budget, the new Optanes are the only way forward, and that 100 DWPD figure just blows everything out of the water (maybe overkill for workstations, though big workstation budgets tend to veer into the overkill department anyway).
You mean the P4800X? Agreed. They are the best out there by a huge margin.


For more sensible builds with mere-mortal budgets in mind, if endurance is a must, then these enterprise PCIe Gen4 drives fit the need quite well, hands down.
And indeed, the best price I could find for the 1.6 TB PM1735 was €400, while the best I could find for the 1.5 TB P4800X was €4,430.
Man, that's eleven times the price of the PM1735. Definitely out of my grasp.
 

Dreece

Active Member
Jan 22, 2019
503
161
43
I made a pact with my purchasing account (my alter ego) that I would never consider buying any SSD smaller than 6.4 TB and would ideally always aim for 12.8 TB (except for boot drives). In the end the numbers always work out similar anyway, so why beat about the bush when we already know that eventually we'll need more drives.

I can only imagine that if they did do a 6.4 TB model of this P5800X, it would be priced at such a level that only those with 'fool' written across their forehead would be paying for the privilege. (PC isn't my strong point lol)
 
  • Like
Reactions: balnazzar

acquacow

Well-Known Member
Feb 15, 2017
787
439
63
42
Why so many cores? You don't need a fraction of those for a storage server.

I was building 16-core systems 10 years ago that pushed over 50 GB/s as storage servers...

We were limited to about 25 GB/s per CPU back then, though... Dell and Supermicro made some 2U servers with four dual-socket nodes; we would use those as storage heads and push 200 GB/s out across a few InfiniBand links.
 

Dreece

Active Member
Jan 22, 2019
503
161
43
Why so many cores? You don't need a fraction of those for a storage server.

I was building 16-core systems 10 years ago that pushed over 50 GB/s as storage servers...
There are a few other things going on in the server which are heavily data-reliant, but that is a whole different affair.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,641
2,058
113
I'm intrigued why you need it all ;) Most usually need a piece of the pie or two... lots of cores, super-fast storage, super-fast networking, high-capacity drives... yet you've got a beast with IT ALL. Can you share what your home lab does that you need multiple 12 TB+ SSDs and millions of IOPS, etc., for? Are you actually getting near any % of utilization on this, 5%, 10%, 25%, or is it for future / fun :D