Supermicro 24x NVMe Server Build (Proxmox + TrueNAS)


89giop

Member
Dec 4, 2020
Hi All,

After years of trying to get the hardware together for this, I have finally made it happen.

Got a good deal on a 2113S-WN24RT (2U, 24x NVMe backplane), finally fulfilling my dream of having enough PCIe lanes for every drive plus a 100GbE adapter (do I need it? NO! Is it cool? Heck yeah!!!)

I also scored a good deal on 12x 11TB 9200 ECO NVMe drives, and I just got a decent deal on 2x 12.8TB 9300 MAX. Also got a 50GbE CX5 that I crossflashed to 100GbE (thanks to the forum).

The system unfortunately came with a rev 1.0 H11SSW-NT board, which meant no Rome upgrade. I then decided to pull the trigger on a H12SSW-NTR plus the relevant PCIe 4.0 risers and retimer cards so that the whole system would be PCIe 4.0 capable (supposedly the included BPN-NVME3-216N-S4 can handle PCIe 4.0 signals anyway).

I am yet to receive the riser boards, but I did get a 7443 as well as 8x 64GB 3200MHz with the H12SSW-NTR. Unfortunately that board has an issue, but I was able to find a rev 2.0 H11SSW-NT and a 7K62.

I will definitely continue with my PCIe 4.0 upgrade at a later date, but for now I will set up the system.

I am thinking that 48 cores would be wasted as a pure storage node, plus the thing idles at 260W, so I want to consolidate all my systems onto it. I have installed Proxmox and am running TrueNAS as a VM.

All 14 drives and the 100GbE NIC were easy to pass through. I am now trying to optimise my TrueNAS setup to see if I can saturate 100GbE (because why not???)

So far I have played around a bit, and it seems that an 8-drive RAIDZ1 pool (9200 ECOs) with 32 threads will just about saturate it.
Single-threaded performance, though, is only 389MB/s. Not sure if this is normal. I have only tested this with TN-Bench so far.
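
As a cross-check on that single-threaded number, I might try something like the below with plain fio (the pool path and test file are placeholders, and ZFS may ignore direct=1 depending on the OpenZFS version):

# single-job sequential read from a file on the pool (path is a placeholder)
fio --name=seq1 --filename=/mnt/tank/fio.test --ioengine=libaio --direct=1 \
    --rw=read --bs=1M --iodepth=32 --numjobs=1 --size=20G --runtime=60 --time_based

# same workload fanned out over 32 jobs to mimic the multi-threaded TN-Bench run
fio --name=seq32 --filename=/mnt/tank/fio.test --ioengine=libaio --direct=1 \
    --rw=read --bs=1M --iodepth=32 --numjobs=32 --size=20G --runtime=60 \
    --time_based --group_reporting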

I have a few questions for you guys:

1. Given the fault tolerance of SSDs vs HDDs (and that I will have a rust pool for backup), plus the ability to resilver a lot faster, would it be OK to do a 12-wide RAIDZ1 vdev?

I feel like 8-wide RAIDZ1 would be good, but then I am left with 4 disks. Alternatively, is 12-wide RAIDZ2 OK (or would 2x 6-wide RAIDZ1 in a pool be better)? This pool will mainly just be media storage.

2. I was thinking about mirroring the 2x 12.8TB 9300 MAX and using those to store all my critical data. Am I better off just chucking those in as part of the above pool instead, so I have more resilience and store my critical data along with my media pool? (Realistically I probably only have 4TB at best of data that is important to me.)

3. Any particular advice on tuning the pool/system for NVMe? I haven't tried SMB transfers yet, but I am worried it's going to run single threaded. I've put a rough sketch of the layouts and settings I'm weighing right below.
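
For reference, a minimal sketch of the layouts and first-pass dataset settings I have in mind (device, pool and dataset names are placeholders, and on TrueNAS you would normally build all of this through the UI rather than the CLI):

# Option A: one pool made of two 6-wide raidz1 vdevs (example device names only)
zpool create tank \
  raidz1 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 \
  raidz1 nvme6n1 nvme7n1 nvme8n1 nvme9n1 nvme10n1 nvme11n1

# Option B: a single 12-wide raidz2 vdev instead (two-disk fault tolerance, one vdev's worth of IOPS)
# zpool create tank raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme6n1 nvme7n1 nvme8n1 nvme9n1 nvme10n1 nvme11n1

# Critical data on the two 9300 MAX as a simple mirror in its own pool
zpool create critical mirror nvme12n1 nvme13n1

# Media dataset: big records, no atime, cheap compression - the usual first knobs, not gospel
zfs create -o recordsize=1M -o compression=lz4 -o atime=off tank/media

On the SMB single-thread worry, Samba does have a "server multi channel support" option that can spread one client's traffic over several connections, but I'd treat that as something to test rather than assume.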

I have only seen a few posts of someone setting up a system like this, and they are a bit dated now. So whilst I am a noob when it comes to this stuff, I thought it would be good to share this project.

Cheers,

Gio
 
  • Like
Reactions: Boris and abq

89giop

Member
Dec 4, 2020
Just thought I would upload the benchmark as well.

I do find the results interesting, because I turned off the cache, so I am a bit confused about the last result. How can 8 drives yield 38GB/s read when the max bandwidth for 8 drives (x4 lanes each) is 4GB/s per drive (32GB/s total)?
[attached: TN-Bench benchmark screenshot]
 

89giop

Member
Dec 4, 2020
12-wide RAIDZ1 is pretty much the same as before for writes, whilst read performance is 10% better and now within the confines of PCIe bandwidth for 12 drives. I feel that by allocating more threads I could get a bit more out of it.
[attached: 12-wide RAIDZ1 benchmark screenshot]

2x 6-wide RAIDZ1 vdevs offer very similar write performance, but they net almost 48GB/s for reads.
[attached: 2x 6-wide RAIDZ1 benchmark screenshot]
 

89giop

Member
Dec 4, 2020
Back for more, this time with 48 threads just to see what happens.

12-wide RAIDZ1 - again, questionable read speed with 48 threads
[attached: 48-thread 12-wide RAIDZ1 benchmark screenshot]


2x 6-wide RAIDZ1 vdevs - breaking the laws of PCIe bandwidth again, now with just 24 cores (headscratch)
[attached: 48-thread 2x 6-wide RAIDZ1 benchmark screenshot]

Not sure if this all needs to be taken with a grain of salt, but it looks like allocating 48 threads on a 12-wide RAIDZ1 is enough to saturate the 100GbE link from the drives. I guess I should try a 12-wide RAIDZ2 pool tomorrow.

I guess iperf is a story for another day :)

Also regarding power consumption:

525W peak when running the 48-thread test and hammering 12 drives.

Idle is approx 260W with

8x64GB DDR4 3200
1x Dual 100GBE Mellanox CX5
1x AOC-SLG3-4E4R redriver
2x AOC-SLG3-2E4T retimers (I should try the AOC-SLG3-2E4R, which is a redriver instead of a retimer, but I accidentally ordered the AOC-SLG3-2E4, which has a PLX chip and will probably use even more power)

[attached: idle power consumption screenshot]
 
  • Like
Reactions: SnJ9MX

nexox

Well-Known Member
May 3, 2023
How can 8 drives yield 38GB/s read when the max bandwidth for 8 drives (x4 lanes each) is 4GB/s per drive (32GB/s total)?
Your PCIe math is off by a bit - if you're running PCIe 4.0, each U.2 connector can do 64GT/s, so a bit over 6GB/s per drive.

As far as getting that speed over a network, you're going to want to learn about RDMA/RoCE and the various file serving protocols that use them.
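
Rough back-of-envelope numbers, assuming the usual ~0.985GB/s of usable bandwidth per gen3 lane and ~1.97GB/s per gen4 lane after encoding overhead:

PCIe 3.0 x4 ≈ 3.9GB/s per drive, so 8 drives ≈ 31.5GB/s aggregate
PCIe 4.0 x4 ≈ 7.9GB/s per drive, so 8 drives ≈ 63GB/s aggregate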
 
  • Like
Reactions: abq

89giop

Member
Dec 4, 2020
Your PCIe math is off by a bit - if you're running PCIe 4.0, each U.2 connector can do 64GT/s, so a bit over 6GB/s per drive.

As far as getting that speed over a network, you're going to want to learn about RDMA/RoCE and the various file serving protocols that use them.
Well, it will be once I upgrade to PCIe 4.0, but for now it's a PCIe 3.0 system (plus the drives are still PCIe 3.0 anyway).

Totally agree that getting that speed over the network will require some tinkering!
 

nexox

Well-Known Member
May 3, 2023
Well, it will be once I upgrade to PCIe 4.0, but for now it's a PCIe 3.0 system (plus the drives are still PCIe 3.0 anyway).
Ah, I admittedly skimmed some of that. I would guess the benchmark isn't quite accurate at those IO rates, especially if it doesn't run for very long.
 
  • Like
Reactions: 89giop

89giop

Member
Dec 4, 2020
But will this equal the workload you will run on it? If not, then these kinds of benchmarks are only useful for bragging :D
Couldn't agree more, I don't actually need 100GbE. 10GbE has felt slow for a while now when I'm migrating a couple of terabytes of video at a time, but if we're honest, 25GbE would've been a significant enough upgrade.

At this point it's just become something I want to do for the sake of learning and having fun. Now that I think about it... this isn't that different from a decade ago, when I was trying to make my car go in a straight line as fast as possible LOL
 
  • Like
Reactions: abq

89giop

Member
Dec 4, 2020
OK, after a really long struggle with being capped at around 40Gbps on iperf, I realised TrueNAS SCALE is still shipping iperf3 version 3.12, which does not support multi-threaded tests.

I used this to run 4 instances on the server side:

# start 4 backgrounded iperf3 server instances on ports 5201-5204
for i in {0..3}; do
  port=$((5201 + i))
  iperf3 -s -p $port &
done

and this on the client side:

# run 4 backgrounded iperf3 clients in parallel, one per server port
serverIP="<your_server_IP>"
for i in {0..3}; do
  port=$((5201 + i))
  iperf3 -c $serverIP -p $port &
done

and now suddenly I was getting 100Gbps.

[attached: iperf3 results screenshot]
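
As an aside (I haven't verified this on SCALE yet), iperf3 itself became multi-threaded from version 3.16, so once a newer build ships, a single iperf3 -c <your_server_IP> -P 8 should spread the parallel streams across threads without the port juggling above.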


This is with the 2 servers directly connected to each other. RDMA means there is basically no CPU impact - it's some pretty great stuff! I will need to see if RoCE works once I connect the CX5s through the MikroTik CRS504-4XQ-IN; if not, I will just have all my 100GbE NICs connected directly.

Next I will need to play around with moving data to and from the NVMe pool at 100GbE. It looks like SMB does not support RDMA, so I will need to play around with NFS and iSER. NVMe-oF is also unfortunately not an option at the moment, unless you have an Enterprise version of TrueNAS.
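
For when I get to NFS, my understanding of the client side over RDMA on a stock Linux box is roughly the below (server IP, export path and mountpoint are placeholders, and the server also needs its RDMA transport enabled, which is the part TrueNAS may or may not expose):

# load the NFS RDMA transport module on the client
modprobe xprtrdma

# mount the export over RDMA; 20049 is the conventional NFSoRDMA port
mount -t nfs -o rdma,port=20049,vers=4.2 <server_ip>:/mnt/tank/media /mnt/remote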
 

kapone

Well-Known Member
May 23, 2015
ROCE = RDMA Over Converged Ethernet...

If RDMA works, ROCE will work. Whether your applications/drivers etc can/will be configured correctly to use it, is a different issue. ROCE/RDMA will work with zero switch support...up to a point. If the switch is heavily utilized and is not configured for PFC/ECN/DCB etc, RDMA may crash and burn.

SMB does not support RDMA
It does.

so I will need to play around with NFS and iSER
NFS supports RDMA as well. NFSoRDMA. iSER is...iSCSI with RDMA.

NVMe-oF is also unfortunately not an option at the moment, unless you have an Enterprise version of TrueNAS
Barring TrueNAS, it's built into modern Linux kernels.
 
  • Like
Reactions: abq and nexox

89giop

Member
Dec 4, 2020
ROCE = RDMA Over Converged Ethernet...

If RDMA works, ROCE will work. Whether your applications/drivers etc can/will be configured correctly to use it, is a different issue. ROCE/RDMA will work with zero switch support...up to a point. If the switch is heavily utilized and is not configured for PFC/ECN/DCB etc, RDMA may crash and burn.
Thank you. I probably misphrased what I wanted to say. My rudimentary understanding (from what I admit was a very cursory search) is that the RoCE implementation of RDMA (at least v2, it seemed) would work without any switch support (kind of like you said). Originally the MikroTik did not support RDMA, so I was curious to see how it was going to play out. However, it now seems that they have support for RoCE v2 anyway, so it's all good :D


I did not realise SMB Direct was available on TrueNAS. Have I missed something here?

Barring TrueNAS, it's built into modern Linux kernels.
Good point, however this is a TrueNAS build. Perhaps as I get more comfortable with Linux, I can use a different Linux OS for the NVMe storage (as it seems TrueNAS is still limited in taking advantage of NVMe pools anyway).
 
  • Like
Reactions: nexox

Iaroslav

Active Member
Aug 23, 2017
We're using 24x NVMe dual-CPU X11/H11-H12 systems in production - I wouldn't advise messing with these gen3/gen4 risers and retimers; you may end up spending the same as buying an out-of-the-box system. For a few disks and direct BPN connections it may be OK, but I haven't made it work on an H12 with a gen4 BPN plus gen3 disks, gen3/gen4 risers and gen3 retimers - within a few days under real load a disk just drops out. Gen4 retimers are too expensive, not to mention the cables - expensive or almost impossible to find.
Respect your hardware choice, but for us it's only RAID1 pairs or software JBOD where acceptable, and Intel 100G cards; skip 25G and for sure 40G)
 

89giop

Member
Dec 4, 2020
We're using 24x NVMe dual-CPU X11/H11-H12 systems in production - I wouldn't advise messing with these gen3/gen4 risers and retimers; you may end up spending the same as buying an out-of-the-box system. For a few disks and direct BPN connections it may be OK, but I haven't made it work on an H12 with a gen4 BPN plus gen3 disks, gen3/gen4 risers and gen3 retimers - within a few days under real load a disk just drops out. Gen4 retimers are too expensive, not to mention the cables - expensive or almost impossible to find.
Respect your hardware choice, but for us it's only RAID1 pairs or software JBOD where acceptable, and Intel 100G cards; skip 25G and for sure 40G)
Thanks for the advice.

The box was only $750, plus $750 for a H12SSW-NTR, $150 for risers and $140 for the SLG4-4E4T (admittedly I have not yet bought the 2x SLG4-2E4T, but they will run me another $230). Unless I find another amazing deal on a 2114S-WN24RT, I would be nowhere near that price.
Even if I only manage to skip the retimers and have the 16 onboard ports running at PCIe 4.0 and the other 8 at PCIe 3.0 for $1,500, I'd call that a massive win! The other $150 for the risers is more so I can run the CX5 at PCIe 4.0 anyway.

What remains to be seen is how the gen3 BPN behaves with PCIe 4.0 drives. But honestly, for what I paid for the original box, it's almost criminal not to try it haha
 
  • Like
Reactions: Iaroslav

tubs-ffm

Active Member
Sep 1, 2013
I am thinking that 48 cores would be wasted as a pure storage node, plus the thing idles at 260W, so I want to consolidate all my systems onto it. I have installed Proxmox and am running TrueNAS as a VM.
Interesting to read that someone is running TrueNAS virtualized on a "real", big storage server.

This is exactly what I am doing. To be precise, I was running Proxmox before and have now changed to TrueNAS under XCP-ng, with SAS controller passthrough and also NIC passthrough. It runs well and performs well. But in my case it is a home lab, and I am compromising to keep the number of boxes standing around in a family household small and the energy consumption low. I never thought someone would do this with a "real" server.
 

nereith

Member
Mar 23, 2019
OK, after a really long struggle with being capped at around 40Gbps on iperf, I realised TrueNAS SCALE is still shipping iperf3 version 3.12, which does not support multi-threaded tests.
Just wondering if you have tested SMB between your server and a Windows client? What transfer speeds are you getting (with or without RDMA)?
 