NVMe Storage Solution


CookiesLikeWhoa

Active Member
Sep 7, 2016
Hey all,

So, in typical fashion, I've gotten myself way in over my head and have come here asking for help. The goal was an all-NVMe storage solution to serve as the datastores for all of my ESXi nodes.

Right now I'm sitting on a Supermicro 2028U-TN24R4T+ running two E5-2623 v3s and 32GB of Samsung DDR4-2400 RAM. I have four Intel P3520 1.2TB drives and four Optane 900P 280GB drives.

Initially I was going to run this as a FreeNAS system and take the hit on performance, since I assumed that with this many NVMe drives, even a 50% performance hit would still let me max out a 40Gb/s connection. Well... not so much. The quote below is from my post on the FreeNAS forum, where I was looking for tips on NVMe pool tuning.

So here are the commands and some of the results. This is with the four Intel 900Ps striped.

I had gstat open in another shell to monitor drive usage. During the first write test the drives maxed out at about 75% usage and the CPU hit 80%, with 1.4GB/s writes. Next came the reads: drives hit 60% usage and the CPU 30%, with 1.3GB/s reads. The next write test used a 1M block size; drives hit 55% usage and the CPU spiked to 55% before dropping to 40%, with 1.8GB/s writes. Finally, the 1M block size reads: drives hit 50% usage and the CPU sat at 20%, with 1.36GB/s reads.

root@freenas:~ # dd if=/dev/zero of=/mnt/test/tmp.dat bs=2048k count=80k
81920+0 records in
81920+0 records out
171798691840 bytes transferred in 120.296428 secs (1428127962 bytes/sec)

root@freenas:~ # dd if=/mnt/test/tmp.dat of=/dev/null bs=2048k count=80k
81920+0 records in
81920+0 records out
171798691840 bytes transferred in 131.082277 secs (1310617234 bytes/sec)

root@freenas:~ # dd if=/dev/zero of=/mnt/test/tmp.dat bs=1M count=40k
40960+0 records in
40960+0 records out
42949672960 bytes transferred in 23.600602 secs (1819854979 bytes/sec)

root@freenas:~ # dd if=/mnt/test/tmp.dat of=/dev/null bs=1M count=40k
40960+0 records in
40960+0 records out
42949672960 bytes transferred in 31.544821 secs (1361544365 bytes/sec)

Next up are the P3520s. I didn't bother with the 2048K block size since the 1M block size seemed to work better here. The first test was writes, which maxed at 40% drive usage and 40% CPU usage with 1.95GB/s writes. Then I did the read test and saw 95% drive usage with 10% CPU usage and 900MB/s reads.

root@freenas:~ # dd if=/dev/zero of=/mnt/test1/tmp.dat bs=1M count=40k
40960+0 records in
40960+0 records out
42949672960 bytes transferred in 21.942736 secs (1957352666 bytes/sec)

root@freenas:~ # dd if=/mnt/test1/tmp.dat of=/dev/null bs=1M count=40k
40960+0 records in
40960+0 records out
42949672960 bytes transferred in 47.276556 secs (908477220 bytes/sec)

And finally, all eight drives striped together. CPU usage never went above 5-10% for either test. The write test saw all the drives hit around 20-30% usage at 1.9GB/s, while in the read test the P3520s hit 90% usage while the 900Ps sat at 15%, at 1.4GB/s.

root@freenas:~ # dd if=/dev/zero of=/mnt/test/tmp.dat bs=1M count=50k
51200+0 records in
51200+0 records out
53687091200 bytes transferred in 28.401039 secs (1890321400 bytes/sec)

root@freenas:~ # dd if=/mnt/test/tmp.dat of=/dev/null bs=1M count=40k
40960+0 records in
40960+0 records out
42949672960 bytes transferred in 30.062220 secs (1428692634 bytes/sec)


So this is where I stand. The 900Ps should be capable of nearly 10GB/s reads and 8GB/s writes when striped, yet I'm barely seeing 20% of that. The P3520s should be capable of 6.4GB/s reads and 5.2GB/s writes when striped, yet I'm seeing around 40% of that. The odd thing is that the numbers are really close to each other no matter how I set up the drives.

So just for kicks I decided to build a pool from a single 900P and see what the performance was. And guess what? It was exactly the same as the striped pool! So it seems I'm only getting one drive's worth of performance out of a pool... no idea what gives.

root@freenas:~ # dd if=/dev/zero of=/mnt/test/tmp.dat bs=1M count=50k
51200+0 records in
51200+0 records out
53687091200 bytes transferred in 36.308151 secs (1478651213 bytes/sec)

root@freenas:~ # dd if=/mnt/test/tmp.dat of=/dev/null bs=1M count=40k
40960+0 records in
40960+0 records out
42949672960 bytes transferred in 32.836506 secs (1307985465 bytes/sec)
The tl;dr version of it is: I'm only getting one drive's worth of performance (if that), no matter how I set up the pools.
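One caveat worth checking before blaming the pool layout: FreeNAS enables LZ4 compression on new pools by default, and `/dev/zero` produces the most compressible data possible, so `dd` from `/dev/zero` can end up measuring the compressor (one CPU-bound stream) rather than the drives. That could even explain why the numbers look similar regardless of layout. A quick local demonstration of just how compressible zeros are (the paths are throwaway scratch files):

```shell
# Zeros compress by orders of magnitude, so a compressed dataset
# barely touches the disks during a dd-from-/dev/zero "benchmark".
dd if=/dev/zero of=/tmp/zeros.dat bs=1M count=16 2>/dev/null
gzip -c /tmp/zeros.dat > /tmp/zeros.dat.gz
orig=$(wc -c < /tmp/zeros.dat)
comp=$(wc -c < /tmp/zeros.dat.gz)
echo "original: ${orig} bytes, compressed: ${comp} bytes"
rm -f /tmp/zeros.dat /tmp/zeros.dat.gz
```

When benchmarking ZFS with dd, either set `compression=off` on a scratch dataset or read from a pre-generated incompressible file instead of `/dev/zero`.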

So I decided to try Windows Storage Spaces. Local storage performance was better, but still somewhat of a mystery. The 900Ps in a striped pool yielded a mixed bag, ranging from a quarter of one drive's performance (512B-16KB block sizes) to almost 100% of all four drives (64KB-1MB), and then back down to 50% of all four drives (2MB-64MB). All of this was at a queue depth of 4. The P3520s showed odd write > read performance below 32KB, at about one drive's worth of performance; at 64KB and up it started to look the way it should.

Right now I'm leaning towards Windows, since it seems to yield more performance than FreeNAS, but I'm open to suggestions, or any pro tips for tuning FreeNAS/Windows for an all-NVMe pool. Looking for any guidance!
 

Nizmo

Member
Jan 24, 2018
You should optimize the connection settings, for example: RDMA (on), send and receive buffers (max), and jumbo packets.

Also, for some reason the "Balanced" power profile is the default for most server installs; set it to "High Performance".
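For what it's worth, the rough Linux-side equivalents of those knobs look like this (the interface name `eth0` is a placeholder, and the ring sizes depend on what the NIC supports; a sketch, not a recipe):

```shell
# Jumbo frames: the switch and every endpoint must agree on the MTU.
ip link set dev eth0 mtu 9000
# Ring buffers: query the hardware maximums first, then raise toward them.
ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096
```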
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
You should optimize the connection settings, for example: RDMA (on), send and receive buffers (max), and jumbo packets.

Also, for some reason the "Balanced" power profile is the default for most server installs; set it to "High Performance".
This, plus more.

You're going to have to tweak ZFS to maximize performance for your workload, and then start tweaking network configurations, etc...
 

Nizmo

Member
Jan 24, 2018
running two E5-2623 v3s and 32GB of Samsung DDR4-2400 RAM
Are you feeding the CPUs 8 x 4GB modules, or are you only giving dual channels to each CPU? These are quad-channel CPUs, so correct memory installation would be 8 x 8GB or 8 x 4GB, AFAIK.

I would also note that the CPUs listed only support 1600/1866MHz RAM per Intel, which is odd; I haven't seen DDR4 much slower than 2133MHz.
Intel® Xeon® Processor E5-2623 v3 (10M Cache, 3.00 GHz) Product Specifications
 

Patrick

Administrator
Staff member
Dec 21, 2010
Sorry to say this, but you need to be off of FreeNAS if you want to get near 40GbE speeds these days.

FreeNAS is a great solution for 1GbE and maybe 10GbE NAS, but remember it is still primarily focused on being a homelab / SMB office tool.
 

SlickNetAaron

Member
Apr 30, 2016
Sorry to say this, but you need to be off of FreeNAS if you want to get near 40GbE speeds these days.

FreeNAS is a great solution for 1GbE and maybe 10GbE NAS, but remember it is still primarily focused on being a homelab / SMB office tool.
Interesting... huge statement (and I don’t doubt it at the moment)

So where do we go after FreeNAS? Preferably somewhere with data integrity.


Sent from my iPhone using Tapatalk
 

gea

Well-Known Member
Dec 31, 2010
The three Open-ZFS platforms are FreeBSD, Illumos (the free Solaris fork, e.g. OmniOS), and ZoL (ZFS on Linux). Only the base OS is performance-relevant, not the management GUI.

I use OmniOS (Illumos) with 40G links to my filers, not for a single 40G client but for many 10G users (especially video workstations). I have tested 4 x Optane 900P in a RAID-0 setup on OmniOS and did not come anywhere near 5 GB/s throughput in a local benchmark, only about half that. Since ZFS is a high-security filesystem, not a high-performance one, it is not easy to get near 5 GB/s. ZFS mainly relies on its superior RAM caching for high performance.

What I have seen is that genuine Oracle Solaris 11.3 with ZFS v37 matched or beat the OmniOS result with four Optanes while using only two of them. I was not able to test Solaris with four Optanes, as only two were detected. The new Solaris 11.4 with the newer ZFS v43 detects all of them. I can add a benchmark, but it may be limited since it is beta code.

See http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf, chapter 5.
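For readers who want to try tuning before switching platforms, these are the dataset properties most often adjusted for this kind of sequential test (the pool/dataset name `tank/bench` is a placeholder; `sync=disabled` is only acceptable for throwaway benchmark data):

```shell
zfs set recordsize=1M tank/bench    # match large sequential I/O
zfs set atime=off tank/bench        # skip access-time updates on reads
zfs set compression=off tank/bench  # avoid benchmarking the compressor
zfs set sync=disabled tank/bench    # benchmark-only: loses data on crash
```

Measure after each change rather than applying them all blindly; some (like `recordsize`) only affect newly written data.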
 

acquacow

Well-Known Member
Feb 15, 2017
Your fastest solution is going to be RDMA from a Linux host running mdadm for any RAID work you want done...

Any more layers you add are going to increase latency and eat up I/O cycles.

Benchmark the raw devices to get your max theoretical numbers, then put a filesystem on them and bench that overhead, then add any RAID levels and benchmark those.

-- Dave
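A minimal sketch of that approach (the device names and mount point are assumptions; check them against `lsblk` on the actual host first):

```shell
# Stripe four NVMe drives with mdadm (RAID-0: no redundancy).
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
# Benchmark /dev/md0 raw first to get the ceiling, then add a
# filesystem and benchmark again to see its overhead.
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/nvme
```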
 

SlickNetAaron

Member
Apr 30, 2016
@SlickNetAaron what is a huge statement? That FreeNAS struggles to achieve 40GbE speeds? I think we have discussed that many times.

Hmm... I guess it's not such a huge statement... it just struck a chord with me somehow. I'm not sure I can quantify why.

I'll have to look around a bit. I've been lurking for a while and don't recall a big discussion on FreeNAS + 40Gb. I'm no fan of FreeNAS, just trying to find that "magic" solution lol
 

CookiesLikeWhoa

Active Member
Sep 7, 2016
You should optimize the connection settings, for example: RDMA (on), send and receive buffers (max), and jumbo packets.

Also, for some reason the "Balanced" power profile is the default for most server installs; set it to "High Performance".
I have the buffers maxed and RDMA on. I turned off jumbo packets, as they actually hurt network performance, though that could be a matter of more tuning on my end to get them working.

I found out about the High Performance bit when I was benchmarking and every run came out different. While trying to figure out what would cause such inconsistency, it dawned on me that MS likes "Balanced" as the default. Once I enabled "High Performance" the numbers went up, and so did the consistency. I also enabled "Performance" in the CPU power management section of the BIOS; that didn't really change the numbers, but it did make them more consistent between runs.

Are you feeding the CPUs 8 x 4GB modules, or are you only giving dual channels to each CPU? These are quad-channel CPUs, so correct memory installation would be 8 x 8GB or 8 x 4GB, AFAIK.

I would also note that the CPUs listed only support 1600/1866MHz RAM per Intel, which is odd; I haven't seen DDR4 much slower than 2133MHz.
Intel® Xeon® Processor E5-2623 v3 (10M Cache, 3.00 GHz) Product Specifications
I did not know about the max RAM frequency for that CPU; my miss there. I wonder how much that hurts performance overall. It is running eight 4GB modules, though, so all the channels should be in use.

The other thing I've been thinking about is that I know this backplane runs through two PLX chips, each with 16 lanes, which gives me 32 lanes total for 24 drives. I assume that will have some negative effect on performance eventually, but how much I can't figure out.
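Back-of-envelope numbers for that topology, assuming PCIe 3.0 delivers roughly 985MB/s of usable bandwidth per lane: with only eight x4 drives installed, the 32 uplink lanes are a 1:1 match, so the PLX chips shouldn't be the bottleneck yet; a fully populated 24-drive backplane would be oversubscribed about 3:1.

```shell
# 2 PLX switches x 16 upstream lanes = 32 lanes shared by the backplane.
uplink_lanes=32
per_lane_mb=985          # approx. usable MB/s per PCIe 3.0 lane
drives=24                # fully populated backplane
lanes_per_drive=4
uplink_mb=$(( uplink_lanes * per_lane_mb ))
demand_mb=$(( drives * lanes_per_drive * per_lane_mb ))
echo "uplink ceiling:    ${uplink_mb} MB/s"
echo "drive-side demand: ${demand_mb} MB/s"
```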

I will look into Linux options, as I've seen that mentioned a few times for performance, though I'm not nearly as comfortable with Linux. Here's to learning!

Edit: Thank you to Patrick and the others as well! I figured FreeNAS might not be able to handle this setup, but there was hope!
 

_alex

Active Member
Jan 28, 2016
To get this up to speed you will need to try and tune a lot.
As said, mdadm instead of ZFS may be your best choice.

You will certainly want to use blk-mq; have a look at the I/O schedulers and related settings.
Also, watching how (and whether) interrupts are spread over the CPUs can help find bottlenecks
(at these speeds, one core often gets hammered because something just doesn't use the other cores).
With your setup, NUMA nodes will also be of interest.

For the transport, if possible skip iSCSI and try NVMe-oF -> SRP -> iSER, in that order
(with the corresponding tuning, e.g. num_channels for SRP, which requires blk-mq).

Good Luck,
Alex
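On a Linux host, the usual places to look for those bottlenecks are sysfs and /proc (device names are assumed; a sketch for inspection, not a tuning guide):

```shell
# Which scheduler is each NVMe queue using? 'none' is typical under blk-mq.
cat /sys/block/nvme0n1/queue/scheduler
# Are interrupts spread across cores, or is one core eating them all?
grep nvme /proc/interrupts
# Which NUMA node owns the controller? Keep the I/O threads on that node.
cat /sys/class/nvme/nvme0/device/numa_node
```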
 

CookiesLikeWhoa

Active Member
Sep 7, 2016
Update/Advice Request!

So I managed to iron out the performance on the Windows Server box and got all the drives running at their theoretical speeds locally with the advice provided in this thread! Over the network is another story, with "meh" 10Gb connections (slightly better sequential speeds, but 4x the 4K performance of my FreeNAS running HDDs) and 40Gb connections that are worse than 1Gb (not sure what is going on there yet).

To simplify things, I was considering going with vSAN to keep this all on VMware. I have a VMUG subscription, so I have the key for it, and the cluster would be 5 nodes. The only downside is that none of my nodes except the NVMe one have local storage, since my datastores are on a FreeNAS server via iSCSI, and from what I've gathered each node needs to contribute some storage to the vSAN. Would that likely be a better idea than trying to stand up a separate OS to serve out storage?
 

CookiesLikeWhoa

Active Member
Sep 7, 2016
Ah, thank you for the link. It seems vSAN is a hot mess.

I wish I had saved my screenshots from my work with the W2k16 server I'm on. I have the four 900P drives in a striped array and created a 475GB target for the ESXi host. Connected via 10Gb to the ESXi host, I was able to get 576MB/s reads/writes at QD16 with 16 threads. QD1 with 16 threads sits at around 500MB/s. 4K performance was around 20MB/s r/w.

With the current FreeNAS system I get around 500MB/s r/w sequential and 5MB/s r/w 4K. That system has 256GB of RAM and twelve 4TB HDDs in six mirrored vdevs.

All of this is from a Windows 10 VM with 4GB of RAM and 4 vCPUs @ 3.1GHz on a 35GB drive.

Looks like the search will continue.
 

Rand__

Well-Known Member
Mar 6, 2014
Well, you should be able to get that with an Optane as SLOG (via a datastore disk, as it is still a lottery to get an Optane passed through to a FreeNAS VM) - have you tried that?
 

CookiesLikeWhoa

Active Member
Sep 7, 2016
The original goal was to set this system up with eight 900Ps in a striped array and sixteen P3520s in four RAIDZ vdevs. The 900Ps would be the datastore and the P3520s would be the rsync target. After setting up the four 900Ps in a striped array, I tested performance over iSCSI to an ESXi host and got results worse than my current HDD FreeNAS: around 300MB/s sequential and 3MB/s 4K. I did some dd testing, as seen in the OP, found that the system just couldn't really use the drives, and following the advice given above moved away from FreeNAS and ZFS as a whole.

The next step was to try a 2k16 server to see if I could get better performance. Initial performance was way better than the FreeNAS system, and after working with it for a while I was able to get incredible performance on the local machine. The next step is to carry that over to the network. That is proving to be more difficult.
 

acquacow

Well-Known Member
Feb 15, 2017
The original goal was to set this system up with eight 900Ps in a striped array and sixteen P3520s in four RAIDZ vdevs. The 900Ps would be the datastore and the P3520s would be the rsync target. After setting up the four 900Ps in a striped array, I tested performance over iSCSI to an ESXi host and got results worse than my current HDD FreeNAS: around 300MB/s sequential and 3MB/s 4K. I did some dd testing, as seen in the OP, found that the system just couldn't really use the drives, and following the advice given above moved away from FreeNAS and ZFS as a whole.

The next step was to try a 2k16 server to see if I could get better performance. Initial performance was way better than the FreeNAS system, and after working with it for a while I was able to get incredible performance on the local machine. The next step is to carry that over to the network. That is proving to be more difficult.
You shouldn't use dd as your testing tool. You need something multi-threaded, with proper control over thread count, queue depth, etc.

I suggest looking into using "fio"

-- Dave

Sent from my Moto Z (2) using Tapatalk
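A starting point for such a run might look like the following (the file path, size, and job counts are placeholders; `posixaio` is the engine that works on FreeBSD/FreeNAS, with `libaio` being the usual choice on Linux):

```shell
# 4 jobs x queue depth 16, 1M sequential reads, direct I/O,
# results aggregated into one summary.
fio --name=seqread --filename=/mnt/test/fio.dat --size=10G \
    --rw=read --bs=1M --ioengine=posixaio --iodepth=16 \
    --numjobs=4 --direct=1 --runtime=30 --time_based \
    --group_reporting
```

Swapping in `--rw=randread --bs=4k` shows the small-block behavior that the Storage Spaces results above hinted at.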
 

CookiesLikeWhoa

Active Member
Sep 7, 2016
You shouldn't use dd as your testing tool. You need something multi-threaded, with proper control over thread count, queue depth, etc.

I suggest looking into using "fio"

-- Dave

Sent from my Moto Z (2) using Tapatalk
Thank you for the info! I hadn't even heard of that, to be honest. I did try the FreeNAS install again to see what performance was like over the network, and it was worse than my current setup on all counts. I think I can officially rule FreeNAS out for now.

I'm thinking a lot of the over-the-network performance issues are latency-related. When running the benchmarks I notice the throughput jumping around on the network cards: it will jump to 10Gb/s, then down to 2, and back up, a lot. With the HDD array on the FreeNAS it just stays right at 5Gb/s. This makes me think something on the network is killing the drives' performance.
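One way to separate the network from the storage here is to take the disks out of the picture entirely with a raw TCP test such as iperf3 (the address is a placeholder):

```shell
# On the storage server:
iperf3 -s
# On the client, four parallel streams for 30 seconds:
iperf3 -c 192.168.1.10 -P 4 -t 30
```

If iperf3 also bounces between 2 and 10Gb/s, the problem is in the network stack or switch rather than the drives.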

I decided to install Hyper-V onto the W2k16 server to see what local VM performance was like. I was not disappointed. (Attached: 900P.PNG, Hyper-V.PNG)
I'm pretty sure the 4MB ATTO results were actually Hyper-V starting to cache the benchmark in RAM, since the theoretical max for the drives is around 10.4GB/s reads and 9.2GB/s writes.

I might end up going down this path and using this server as a Hyper-V server. Unfortunately, outside of the desktop world, I do not have a whole lot of experience with Windows.
 

whitey

Moderator
Jun 30, 2014
Do tell us a lil' bit more about your network setup: switch vendor/model, jumbo frames in the mix or not, and your desired/preferred protocol for network storage access. Remember, not all switches are up to the task of IP SAN duties; you have to be REAL careful here, otherwise it is an exercise in futility (AKA 'what's the definition of insanity'...) you know the rest of that quote, right? :-D