NVMe Storage Solution

CookiesLikeWhoa · Apr 17, 2018

For the 40GbE side I have an Arista DCS-7050QX as the switch, with both nodes using Chelsio T580-SO-CR Ethernet cards with FS modules and fiber. No jumbo frames were turned on. I didn't dive too much into what was going on with the 40Gb, as that was just another layer of complexity that I shouldn't be sorting out just yet.

The 10Gb side is a UniFi US-16-XG switch, also without jumbo frames. The cards are an Intel 520-DA in the ESXi node, and the W2k16 server uses the onboard Intel 540 (10GBASE-T). This might be where a lot of the problems are. The UniFi switch is showing 10Gb link speeds, but those ports are known to be hit or miss. I don't have a spare 520-DA card at the moment to test with, but my feeling is that it might help.

whitey · Apr 20, 2018

Well hell, I am pretty sure the Arista switch is up to the task packet-buffer-wise; not so sure about the Ubiquiti. Everything else looks like it passes the sniff test. When you go to 10GbE and above, jumbo frames are almost a necessity if you really want to squeeze every last bit out of the network. But if you are not focusing on the network layer at this time, maybe make it a go-back item for when you do decide to circle back around to that.
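A rough sketch of why jumbo frames matter at these speeds. It assumes IPv4 + TCP with no options (40 bytes of headers) on untagged Ethernet, where every frame also costs 38 bytes on the wire (14-byte header, 4-byte FCS, 8-byte preamble/SFD, 12-byte inter-frame gap). The payload-efficiency gain from MTU 9000 is modest; the bigger effect is that the host has far fewer frames per second to process.

```python
# Per-frame overhead, assuming IPv4 + TCP without options on untagged Ethernet.
ETH_OVERHEAD = 14 + 4 + 8 + 12  # header + FCS + preamble/SFD + inter-frame gap
IP_TCP_HEADERS = 20 + 20        # IPv4 header + TCP header


def tcp_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that are TCP payload for a full-size frame."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETH_OVERHEAD
    return payload / on_wire


def frames_per_second(link_gbps: float, mtu: int) -> float:
    """Full-size frames per second needed to fill the link."""
    on_wire_bits = (mtu + ETH_OVERHEAD) * 8
    return link_gbps * 1e9 / on_wire_bits


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_efficiency(mtu):.1%} payload, "
          f"{frames_per_second(10, mtu):,.0f} frames/s at 10GbE")
```

Under these assumptions it prints about 94.9% payload and 812,744 frames/s for MTU 1500, versus 99.1% and 138,305 frames/s for MTU 9000.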

bogi · Apr 23, 2020

Hello all,

This thread has been a favorite of mine. I have something to add and to ask at the same time.

Since the last post there have been many hardware and OS updates that take advantage of 10Gb and 40Gb networking.

I'm building a setup for a VFX (computer graphics) office, so an SMB share for fast file transfers is the priority: primarily reads, with writes almost equally important.
All artists in the studio are connected with MLNX ConnectX-2 cards with RDMA support, running Windows 10 Pro on the workstations.

There are 10 to 12 workstations in use, and they are not all under full network load all the time. We mostly read file sequences; writes are ~30 to 35% of the traffic, sometimes even less.
We primarily load image sequences: sometimes the files are 1 MB to 3 MB each, sometimes 40 MB to 60 MB each, with or without compression. We load 60 to 600 files at a time, all depending on the shot we are working on.
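To put those ranges in perspective, a small sketch of the data volume per shot and the time to pull it at a sustained rate. It assumes decimal megabytes and ignores protocol overhead and per-file latency; the rates are illustrative: 10 Gb/s, the 17 Gb/s peak reported for the current FreeNAS server, and the 40 Gb/s line rate.

```python
# Rough size of one shot's image sequence, using the ranges from the post:
# 60 to 600 frames, 1 MB to 60 MB per frame (decimal megabytes assumed).
def sequence_size_gb(frames: int, mb_per_frame: float) -> float:
    return frames * mb_per_frame / 1000


def load_seconds(size_gb: float, link_gbps: float) -> float:
    """Time to read the sequence at a sustained rate in gigabits/s."""
    return size_gb * 8 / link_gbps


smallest = sequence_size_gb(60, 1)    # 0.06 GB
largest = sequence_size_gb(600, 60)   # 36 GB
for rate in (10, 17, 40):  # Gb/s, illustrative sustained rates
    print(f"{rate} Gb/s: largest shot in {load_seconds(largest, rate):.1f} s")
```

The largest case works out to about 28.8 s at 10 Gb/s, 16.9 s at 17 Gb/s and 7.2 s at 40 Gb/s, before any overhead.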

Our current setup is FreeNAS 11.1-U7 (ZFS), connected to a Cisco Nexus 3064 switch through a single 40Gb Mellanox ConnectX-3 adapter.
It has 128 GB of DDR3 RAM for the ARC, a single 10-core E5 v2 CPU, 12 x HGST 6TB SAS HDDs, and a 500GB Samsung M.2 drive as L2ARC. The pool has 3 RAIDZ2 vdevs with lz4 compression. Jumbo frames are ON.
The maximum transfer rate I got over that Mellanox ConnectX-3 40Gb adapter was 17 Gb/s, peaking briefly at 20 Gb/s.
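For reference, converting those rates to bytes per second and link utilization. This uses decimal units and ignores Ethernet/TCP/SMB overhead, so the real payload ceiling on a 40GbE link is somewhat below the nominal rate.

```python
# Convert the reported transfer rates into MB/s and 40GbE link utilization.
LINK_GBPS = 40  # nominal 40GbE line rate


def gbps_to_mb_per_s(gbps: float) -> float:
    """Gigabits/s to decimal megabytes/s."""
    return gbps * 1000 / 8


for observed in (17, 20):
    print(f"{observed} Gb/s = {gbps_to_mb_per_s(observed):.0f} MB/s, "
          f"{observed / LINK_GBPS:.1%} of {LINK_GBPS}GbE")
```

That is roughly 2125 MB/s (42.5% of the link) sustained and 2500 MB/s (50%) at the peak.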

I'm building a new storage system on the same X9 (C602) platform: a Supermicro 2U chassis with 12 x HGST 8TB SAS3 drives (128MB cache) on an LSI 9300 in IT mode.

This time, though, it has a Solarflare 40Gb dual-port QSFP SFN8542 SR204, plus 4 x 2TB XG5-P NVMe drives installed directly in PCIe slots. I have reserved 256 GB of DDR3 RAM for this system.
It is a dual-CPU system with E5-2667 v2s.

The question is: which road should I take to get the best SMB file-share performance over a 40Gb network?

1. ZFS: FreeNAS 11.3-U2, or wait for the new TrueNAS train

2. Windows Storage Spaces

3. Throw out the LSI 9300 (IT mode) and get a 9300-series SAS3 hardware RAID card with 2GB RAM and BBU, running Windows Server 2016 or 2019 for the SMB share.

4. Or something else you would recommend?
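Since options 1 and 3 imply different layouts for the same 12 x 8TB drives, a quick capacity comparison may help. This is a sketch: the vdev widths are assumptions (the current pool's 3 RAIDZ2 vdevs presumably split its 12 drives 4-wide), and it ignores ZFS metadata/padding overhead and the TB vs TiB difference.

```python
# Usable capacity (before filesystem overhead) of 12 x 8 TB drives under the
# layouts discussed in this thread. Vdev widths are assumptions.
DRIVES, TB = 12, 8


def raidz_usable(vdevs: int, parity: int) -> int:
    """Equal-width RAIDZ vdevs: each loses `parity` drives to parity."""
    width = DRIVES // vdevs
    return vdevs * (width - parity) * TB


def raid10_usable() -> int:
    """Striped two-way mirrors: half the drives hold copies."""
    return DRIVES // 2 * TB


print("3 x RAIDZ2 (4-wide):", raidz_usable(3, 2), "TB")
print("2 x RAIDZ2 (6-wide):", raidz_usable(2, 2), "TB")
print("RAID10 / mirrors:   ", raid10_usable(), "TB")
```

With these assumptions, 3 x 4-wide RAIDZ2 yields the same 48 TB as RAID10, while 2 x 6-wide RAIDZ2 yields 64 TB. Because ZFS random IOPS scale roughly with vdev count, striped mirrors (6 vdevs) would generally be the faster ZFS layout for the same capacity, at the cost of tolerating only one failure per mirror.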

FreeNAS 11.1-U7 has turned out to be a very stable system, but it isn't snappy and doesn't take advantage of the 40Gb connection.

Please advise!

Best,

Bogdan

i386 · Apr 23, 2020

bogi said:
2. Windows Storage Spaces

Unless you go with S2D and all mirrors, Storage Spaces performance is very disappointing.

gea · Apr 23, 2020

Another option would be a Solarish-based system, e.g. Oracle Solaris 11.4 with native ZFS, or OmniOS with the very newest Open-ZFS features like ZFS encryption, sequential resilvering, system checkpoints, trim etc., plus the kernel-based, multithreaded SMB server and S3 cloud support. The ConnectX-3 is only supported in Solaris, but the newer X4, X5 and X6 are supported in OmniOS 151034. OmniOS offers stable and long-term-stable releases and a commercial support option. Even without it, you get regular security fixes.

Current features: omniosorg/omnios-build
Next stable release, due in May: omniosorg/omnios-build

Setup and web-based management of a ZFS storage appliance based on a regular Solarish Unix OS:
see napp-it // webbased ZFS NAS/SAN appliance for OmniOS, OpenIndiana and Solaris : Manual

bogi · Apr 24, 2020

i386 said:
Unless you go with S2D and all mirrors, Storage Spaces performance is very disappointing.

Can S2D run on a single node?
My guess was that by moving from FreeNAS to Storage Spaces or S2D, I would get better 40Gb network performance on the same hardware.
FreeNAS doesn't handle 40Gb as well out of the box as Windows does. All the hardware I have would make a decent software RAID solution.
NVMe caching in S2D sounds good too. The 4 x NVMe XG5-P drives on top of the 12 x 8TB SAS3 drives could make it fast and improve IOPS, I guess, and then saturate the dual-port Solarflare 40Gb adapter.

I'm hoping it can also run with SMB Multichannel on Windows; is that right?

@gea Thank you for your suggestions and answer. I will take it as an option and explore it further for sure.
Napp-it was on my mind earlier. The problem is that not many people share information about it online. It will take time for me to explore it and set it up properly.
Maybe I didn't look for it in the right places earlier. I feel much more secure with ZFS and what it offers as software RAID, and I'm not a big fan of Windows. But the tools that we use in visual effects are made for the Windows platform in most cases, and Windows-to-Windows networking works well.
FreeNAS 11.1-U7 has run stably in our office for a full year, but its performance over 40Gb is far from what can be expected. I'm not sure how complete the NVMe support is in FreeNAS 11.3. As a friend of mine said, if he were going with it, he would wait for the U8 release.
Time flies, and I want to use the available hardware to improve our file server's read and write performance.

i386 · Apr 24, 2020

bogi said:
Can S2D run on a single node?

No. S2D (Storage Spaces Direct) requires at least two identical nodes: Storage Spaces Direct Hardware Requirements

bogi · Apr 24, 2020

i386 said:
No. S2D (Storage Spaces Direct) requires at least two identical nodes: Storage Spaces Direct Hardware Requirements

So you are saying classic Storage Spaces is not the way to go for performance?

Rand__ · Apr 24, 2020

From what you describe, it sounds as if most of your data should be in the cache most of the time. That means your actual pool speed is (hopefully) a secondary concern, and speeding it up will only help partially. Enlarging the cache (memory) might help if you are getting a lot of cache misses (check with the arcstat Python scripts).
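The hit ratio arcstat reports can also be computed by hand from two snapshots of the raw ARC counters. A minimal sketch, assuming the FreeBSD/FreeNAS sysctl names `kstat.zfs.misc.arcstats.hits` and `kstat.zfs.misc.arcstats.misses` and the usual `name: value` output of `sysctl`; the counter values in the example are made up for illustration.

```python
# Interval ARC hit ratio from two snapshots of the ZFS arcstats counters.
HITS = "kstat.zfs.misc.arcstats.hits"      # assumed FreeBSD sysctl name
MISSES = "kstat.zfs.misc.arcstats.misses"  # assumed FreeBSD sysctl name


def parse_sysctl(text: str) -> dict:
    """Parse 'name: value' lines as printed by `sysctl name ...`."""
    out = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition(":")
        out[name.strip()] = int(value)
    return out


def arc_hit_ratio(before: dict, after: dict) -> float:
    """Hits / (hits + misses) over the interval between the snapshots."""
    hits = after[HITS] - before[HITS]
    misses = after[MISSES] - before[MISSES]
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative (made-up) snapshots taken some seconds apart.
before = parse_sysctl(f"{HITS}: 1000\n{MISSES}: 100")
after = parse_sysctl(f"{HITS}: 1900\n{MISSES}: 200")
print(f"ARC hit ratio over interval: {arc_hit_ratio(before, after):.1%}")
```

Measuring over an interval while artists are loading shots says more than the lifetime ratio, which is dominated by history since boot.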

Network performance is another issue.
Of course there are some optimization options for FreeNAS network performance (buffers et al.), as well as for your clients (have you optimized them, btw?). You could also change to newer NICs to make use of architectural improvements (at least for ConnectX-3s the cost should be negligible, $20 a pop used).
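One way to reason about "buffers et al." is the bandwidth-delay product: to keep a single TCP stream from stalling, the socket buffers need to hold roughly bandwidth x round-trip time worth of data. A sketch with hypothetical LAN round-trip times (measure your own with ping); on FreeBSD/FreeNAS these limits are commonly tuned through sysctls such as `kern.ipc.maxsockbuf`, `net.inet.tcp.sendbuf_max` and `net.inet.tcp.recvbuf_max`.

```python
# Bandwidth-delay product: bytes in flight needed to keep a link busy.
def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * rtt_ms / 1000


for rtt in (0.05, 0.1, 0.5):  # hypothetical LAN round-trip times in ms
    print(f"40 Gb/s, RTT {rtt} ms -> {bdp_bytes(40, rtt) / 1024:.0f} KiB")
```

At 40 Gb/s that is about 244 KiB, 488 KiB and 2441 KiB respectively, so a buffer limit that is fine at 1GbE can easily cap a single stream well below line rate.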

However, the only real major step you can take here (after optimization has been done) is to move to an RDMA-capable environment, which (unfortunately) at this point still means a Windows server, for example running a StarWind cluster.

I'd consider that only if you are sure you have maxed out the optimization options, since, well, it's still Windows with Windows filesystems...

bogi · Apr 24, 2020

Thank you Rand!
So there is no quick solution, except a classic hardware RAID card and, bam!, RAID10 on twelve drives under Windows.

All of you guys helped so much.
Sys admins in visual effects industry are exploring and looking for large and fast file storage all the time.

Building Formula 1 file storage is not an easy job.

SRussell · Apr 26, 2020

Patrick said:
Sorry to say this, but you need to be off of FreeNAS if you want to get near 40GbE speeds these days.

FreeNAS is a great solution for 1GbE and maybe 10GbE NAS, but remember it is still primarily focused on being a homelab / SMB office tool.

As a couple of years have passed since this comment...

Do you still see FreeNAS as not a great solution for 10+GbE?

Rand__ · Apr 26, 2020

Curious what Patrick will say, but I'd say it still holds true, despite my being able to squeeze this out of it:

Code:

Z:\>fio --direct=1 --refill_buffers --norandommap --randrepeat=0 --group_reporting --ioengine=windowsaio  --size=100G --bs=128k --iodepth=1 --numjobs=16 --rw=write --filename=fio.test -name win1
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
win1: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=windowsaio, iodepth=1
...
fio-3.16
Starting 16 threads
Jobs: 16 (f=16): [W(16)][100.0%][w=4038MiB/s][w=32.3k IOPS][eta 00m:00s]
win1: (groupid=0, jobs=16): err= 0: pid=3640: Tue Dec 31 15:49:51 2019
  write: IOPS=31.5k, BW=3939MiB/s (4130MB/s)(1600GiB/415932msec)
    slat (usec): min=5, max=386, avg=11.77, stdev= 1.52
    clat (usec): min=58, max=459371, avg=466.82, stdev=1164.02
     lat (usec): min=73, max=459382, avg=478.59, stdev=1164.00
    clat percentiles (usec):
     |  1.00th=[  347],  5.00th=[  359], 10.00th=[  367], 20.00th=[  379],
     | 30.00th=[  388], 40.00th=[  396], 50.00th=[  408], 60.00th=[  420],
     | 70.00th=[  433], 80.00th=[  457], 90.00th=[  594], 95.00th=[  996],
     | 99.00th=[ 1074], 99.50th=[ 1352], 99.90th=[ 2638], 99.95th=[ 3228],
     | 99.99th=[ 3982]
   bw (  MiB/s): min=  299, max= 4752, per=45.66%, avg=1798.68, stdev=105.62, samples=13154
   iops        : min= 2384, max=38009, avg=14381.82, stdev=844.97, samples=13154
  lat (usec)   : 100=0.01%, 250=0.01%, 500=86.59%, 750=6.42%, 1000=2.18%
  lat (msec)   : 2=4.66%, 4=0.13%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 250=0.01%, 500=0.01%
  cpu          : usr=5.80%, sys=2.30%, ctx=0, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13107200,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3939MiB/s (4130MB/s), 3939MiB/s-3939MiB/s (4130MB/s-4130MB/s), io=1600GiB (1718GB), run=415932-415932msec

Note this was with 12 HGST SS300s striped, 512 GB RAM, a Gold 5122 and 100G networking (Chelsio with TOE active), so there was a lot of caching (despite the 100G test size), and it did not scale up much further on my system.
I think if your user-to-disk-to-CPU-core-to-QD ratio is ideal, you can squeeze a lot out of it. But this is only about 1/3 of the actual disk performance, with one device (mirror) per job, and it only scales with more users/more disks (overall) or faster drives (per user).
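The fio summary is internally consistent, and Little's law suggests why it stops scaling: with 16 jobs at iodepth=1 there are only 16 I/Os in flight, so IOPS are capped at roughly 16 divided by the average latency. A sketch using only numbers taken from the output above:

```python
# Cross-check the fio summary using its own reported numbers.
BS_KIB = 128           # --bs=128k
JOBS, QD = 16, 1       # --numjobs=16, --iodepth=1
TOTAL_GIB = 1600       # io=1600GiB
RUNTIME_S = 415.932    # run=415932msec
AVG_LAT_US = 478.59    # lat avg (slat + clat)

bw_mib_s = TOTAL_GIB * 1024 / RUNTIME_S          # ~3939 MiB/s
iops = bw_mib_s * 1024 / BS_KIB                  # ~31.5k IOPS
gbit_s = bw_mib_s * 1024**2 * 8 / 1e9            # ~33 Gb/s of payload
# Little's law: throughput = I/Os in flight / average latency.
littles_law_iops = JOBS * QD / (AVG_LAT_US / 1e6)  # ~33.4k IOPS

print(f"{bw_mib_s:.0f} MiB/s, {iops:,.0f} IOPS, {gbit_s:.1f} Gb/s, "
      f"Little's law cap {littles_law_iops:,.0f} IOPS")
```

The Little's-law estimate sits just above the measured 31.5k IOPS, so at QD1 more throughput needs more concurrent jobs (users) or lower per-I/O latency (faster drives), which is consistent with the scaling observation above.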
