NAS OS for NVME-OF and/or 100GbE

Discussion in 'NAS Systems and Networked Home and SMB Software' started by Drooh, May 4, 2019.

  1. Drooh

    Drooh New Member

    Joined:
    May 4, 2019
    Messages:
    28
    Likes Received:
    0
    I’m new, so hopefully I am posting in the correct forum.

    I’ve been through several rounds of OS trials and testing over the last several months, across several hardware systems, and I’m still looking for an easy-to-implement OS that provides the software needed to saturate 100GbE.

    So far I’ve tried QNAP and Synology hardware, Ethernet and InfiniBand, FreeNAS, XigmaNAS, NAS4Free, Xpenology, and UnRAID, and I am sure there are some I have forgotten.

    Now that I’ve established I like the UI and functionality of UnRAID, I need to get my NVMe server going. The aforementioned software options didn’t let me consistently saturate anything over 10GbE, so I’m looking elsewhere.

    The original plan was a tiered system with sixteen 1TB or 2TB NVMe drives, 16 SSDs, and 64 HDDs.

    Since I couldn’t find an OS that would do the three-tier system while maintaining high transfer speeds and low latency, I am at the crossroads where I need to finally take care of the NVMe server.

    I have so much surplus hardware that my configuration options are open.

    At the very least, I want to:
    -utilize 16x 2TB Intel NVMe drives
    -have connections to the UnRAID servers that can saturate 10GbE at a minimum
    -be able to achieve at least 40GbE speeds, and preferably 100GbE
    -be able to do so consistently with both Windows and Mac machines.

    The client machines all have Chelsio 6200s or 580s, as do most servers. Some servers have Mellanox ConnectX-3 or ConnectX-4. And I have a few spare T580s and ConnectX-4s.

    I haven’t been able to find NAS software that will handle this. My last thought was to use Windows Server 2019 for the NVMe server, then hope to find a solution for the potential problems with Mac transfers.

    Any other options I should be considering?
     
    #1
  2. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,659
    Likes Received:
    400
    Do you need a single server?
    Or can you do a cluster, something like ceph or glusterfs?
     
    #2
  3. Drooh

    Drooh New Member

    A single server is preferable, but I am open to utilizing a cluster. My knowledge of GlusterFS and Ceph is limited. Ceph has been recommended for the future, when I will eventually need to scale to multiple PBs and beyond. I could reallocate an additional server without too much trouble.
     
    #3
  4. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    609
    Likes Received:
    153
    I am interested to see what you end up with. My experience with open source software and some serious SSD/NVMe combos got me nowhere near the performance I expected. I tried ZFS, hardware RAID, and Ubuntu with mdadm on 24x HGST 12Gb/s SAS 800GB drives, each rated at almost 1GB/s read and 600MB/s write, and I still found it hard to saturate dual 10Gb NICs (accessed via iSCSI or NFS). I initially thought it was because of the 12Gbps backplane in my case, so I bought a new case with a BPN-SAS3-216A-N4 backplane and 3x 93000 LSI controllers, expecting a significant performance increase, with less than satisfactory results. I then tried four Samsung NVMe 1725 (3.2TB) drives in RAID 0 and got similar results. I'm not sure whether the motherboard/CPU combo is the limiting factor, but I concluded that it is extremely difficult to build a NAS that can sustain 10Gb/s of throughput with off-the-shelf hardware. It sounds like you have a bigger budget, so your results might be better than mine.
     
    #4
  5. Drooh

    Drooh New Member


    About 90% of your experience mirrors mine, hardware included, almost the whole nine yards.

    I will note it doesn’t have to be open source. I’m pretty open to any option. I’d like to keep the additional expenditures minimal, but the need is relevant and urgent enough that I’m happy to go in any direction to arrive at the end goal.

    I tried enterprise offerings from QNAP and Synology, along with some lower-end offerings from Dell and IBM. Performance was no better.

    I don’t know why there are inconsistencies, sometimes high performance, sometimes low. And I don’t know where the bottlenecks are. I even hired consultants throughout the experience, and while they helped make some headway, I’m still short.

    Sad but true: if I had purchased a high-end enterprise solution up front, I’d likely have spent less money. But the knowledge gained this route is worthwhile. I’m just tired of dealing with it, and it’s near the point where I won’t be able to push forward with my products unless this is solved.

    The servers and clients are all high-end, maxed-out machines: iMac Pros, modern-gen servers with Xeon Scalable processors, custom workstations with i9 X-series processors. All the machines have 128GB+ RAM. So the machines are powerful, the NICs are powerful, and the arrays are full of some of the fastest drives. No question, the OS, network protocol, or file system has to be the issue. I want to believe there is something out there that will do it. There’s gotta be.

    Believe it or not, the fastest solution I found was within UnRAID. It isn’t supported, but an NVMe array got me the closest: I could always come close to saturating 20GbE, with no inconsistency. But with no TRIM on the array, no support, and false errors, it just doesn’t make sense to push in that direction.

    On every single OS, a single NVMe outperformed a RAID array over the network. I have no explanation for that. VMs I hosted on the array reported 25 gigabytes per second r/w sustained, so there was no thermal throttling and no local issue. Over the network, though, the array wouldn’t saturate 10GbE, while a single NVMe would. I just can’t explain it.

    I had an array of 16 SSDs with an LSI 9300, and it did saturate 10GbE, but again, sometimes it moved at a snail’s pace.

    I had an array of IronWolf spinners that outperformed the flash storage in terms of sequential r/w over the network (on an LSI 93xx RAID). No idea why.

    Recently someone suggested to me Windows Server Datacenter edition with the new Storage Spaces. I think it’s Storage Spaces Direct or something like that. It calls for 2 nodes, but supposedly it will do the task. That’s pretty much my last hope. The other final thought is revisiting InfiniBand. Maybe that’s the ticket. NVMe-oF should be the go-to protocol, but my research is just beginning, and I’m still unclear on where to begin.
     
    #5
  6. dandanio

    dandanio Member

    Joined:
    Oct 10, 2017
    Messages:
    57
    Likes Received:
    20
    I wanna yell: "you are doing it wrong"! :)

    Here is what I do each time I get a call like this: I pull out an iperf (3 preferably) on both sides and I measure. If I can saturate the 100 GbE then it means the networking stack is optimal (FreeBSD has by far the fastest TCP/IP stack I have measured). If not, I play with: cables, switches, switch buffers, drivers, network card configs, network card hardware, selective acks, other settings (MTU). Lather, rinse, repeat until close to 100 GbE is achieved.
    Then I worry about my storage. I pull out a fio and/or a bonnie++ and I measure: IOPS, latency, concurrency, CPU/memory load against the throughput. If I can saturate 100 GbE, I move on. If not, I play with: hardware, more hardware, queues, RAID levels (10! a must in your case), maybe filesystems. Lather, rinse, repeat until close to 100 GbE is achieved.
    Then I worry about the networking protocol stack: SMB, iSCSI, NFS, etc. If I measure 100 GbE, I am happy. If not, I play with: versions of protocols (SMB!, NFS!), software implementations (iSCSI!), configuration, underlying filesystem, buffers, multipathing, while watching iostat, iotop, top, etc.
    Then I look over the hardware. I make sure everything benchmarks properly and no problems are found.
    Then I pull out a tcpdump/wireshark and look for: packet flow, packet order, packet drops, packet retransmissions, and anything else that might look out of order.
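
    A minimal sketch of the layer-by-layer check described above, assuming you have already collected one throughput figure per layer (for example from iperf3 for the network, fio for local storage, and a file copy over the share for the protocol). The stage names, the 90% tolerance, and all the numbers are hypothetical placeholders, not measurements from this thread.

```python
from dataclasses import dataclass

LINK_GBITS = 100.0  # target link speed in Gbit/s


@dataclass
class Stage:
    name: str
    measured_gbits: float  # throughput measured for this layer alone, in Gbit/s


def first_bottleneck(stages, target_gbits=LINK_GBITS, tolerance=0.9):
    """Return the first stage that falls below tolerance * target, or None."""
    for stage in stages:
        if stage.measured_gbits < tolerance * target_gbits:
            return stage
    return None


# Hypothetical figures for illustration only; replace with your own measurements.
results = [
    Stage("network (iperf3)", 94.0),
    Stage("local storage (fio)", 200.0),
    Stage("file protocol (SMB/NFS/iSCSI)", 18.0),
]

slow = first_bottleneck(results)
print(slow.name if slow else "all stages near line rate")
```

    The stages are ordered bottom-up on purpose: there is no point tuning SMB or NFS until the layers underneath it have been shown to reach line rate on their own.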
    I am sorry, but I do not use web clients and I do not use proprietary tools so I can't help you with UnRAID...

    Have fun and Light speed (pun intended), I love such challenges!
     
    #6
  7. Drooh

    Drooh New Member

    I like your process, dandanio. Trust me, I’m happy for people to tell me I’m wrong.

    I’m very pragmatic and practical. I’m superb at a few things, decent at several, and still learning others. I’m a creative and a vision guy with high capabilities in technology, but network design and complex systems design just aren’t my thing.

    Your process is not dissimilar to what I’ve done over the past several months. There’s a chicken-and-egg problem, though: the OS. That’s the only thing you didn’t mention, and it’s where I struggle. Starting out like that, using iperf to test the network stack, means testing from a Windows or Mac client machine, but the server has to have an OS to test against. That brings me full circle.

    I started with the intent of following the suggested process, but my projects had several other pieces in process, so I had to establish an OS with functionality and a UI acceptable to my needs, one that could also provide temporary safe storage for my data.

    I’m running essentially an innovation lab, so as we’ve spun up ideas, I’ve had to flex and shift.

    Replicated JBOD concatenation was my best solution for the majority of data that’s either archival or only accessed by a few people with low frequency.

    The same replicated JBOD concatenation would work fine for the first tier of NVMe storage utilized for testing. With NVMe, a single drive will saturate (or nearly saturate) a 40GbE line. For that purpose a triple-replicated array serves well. The data on this tier is important, but it’s test data. My goal, on failure of array 1, is to switch testing to array 2 and rebuild array 2 from array 3. Then array 1 is now array 2, in the event of another failure. The 2.5” SSD array in the next tier is there in case of catastrophe. This is a larger 32TB RAID 10. It is also mainly used as a 4th replication of the NVMe tier, and it can rebuild the data from the NVMe array in a suitable time frame. And then the final tier is the aforementioned JBOD, which also has critical data replicated in a hardware RAID 10.

    The goal is to not be reliant on long rebuilds in the event of failure.

    For my usage, low latency is vital, and sequential and sustained transfer speed is the most important transfer metric.

    In the future, IOPS will become more important, as will being able to quickly scale on the order of petabytes.

    That gets me back to the matter at hand: the OS to use for the first step in your process. I’ll give FreeBSD a go today, on a test server.
     
    #7
  8. Drooh

    Drooh New Member


    I forgot FreeNAS is FreeBSD-based. I had the same trouble with FreeNAS.

    Is the problem specific to that distribution?

    Is there a distro you recommend for my purpose?

    Finally, is the assumption ZFS for the array?
     
    #8
  9. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,213
    Likes Received:
    722
    ZFS is not a high-performance filesystem but a high-security, high-capacity one. While it offers superior RAM-based read/write caches to compensate for this, there are some basic limits.

    Example:
    A single modern disk can give up to around 200 MB/s, but only on a purely sequential load on the outer tracks. Count on more like 100 MB/s on an average load. This means around 1 Gb/s of network. For 100G = 10 GB/s you need around 100 such disks in a RAID 0 setup.

    Even with the fastest NVMe of all, an Intel Optane, you can count on around 2 GB/s, which means you need 5 of them in a RAID 0 setup to achieve 10 GB/s = 100 Gb/s theoretically. Trying UnRAID for 100G, where only 1 disk is active at a time, is absurd.
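
    The arithmetic above can be written out as a short sketch. It assumes the same rule of thumb used in the post (roughly 10 bits on the wire per byte of payload, so 100 Gb/s is about 10 GB/s) and perfect RAID 0 scaling, which real arrays will not reach. The function names are made up for illustration; passing `bits_per_byte=8` gives the raw conversion with no overhead allowance.

```python
def payload_mb_per_s(link_gbits, bits_per_byte=10):
    """Approximate payload rate in MB/s for a link speed given in Gbit/s."""
    return link_gbits * 1000 // bits_per_byte


def devices_needed(link_gbits, device_mb_per_s, bits_per_byte=10):
    """Devices required in an ideal stripe to fill the link (rounded up)."""
    target = payload_mb_per_s(link_gbits, bits_per_byte)
    return -(-target // device_mb_per_s)  # integer ceiling division


# Per-device figures taken from the post: ~100 MB/s per disk, ~2 GB/s per Optane.
print("100GbE, 100 MB/s disks:", devices_needed(100, 100))
print("100GbE, 2000 MB/s NVMe:", devices_needed(100, 2000))
print("100GbE, 2000 MB/s NVMe, 8 bits/byte:", devices_needed(100, 2000, bits_per_byte=8))
```

    These are lower bounds on device count only. They say nothing about whether the filesystem, the CPU, or the sharing protocol can keep up, which is the point the rest of the thread keeps coming back to.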

    The OS differences between the Unix options (FreeBSD, Solarish, or Linux) are not so essential. It's more a matter of driver quality and settings. But even with the fastest (and most feature-rich) ZFS server, Oracle Solaris 11.4 with genuine ZFS, I would see steady 100G = 10 GB/s read/write rates as not possible. Maybe a cluster filesystem with many, many nodes would be capable after a lot of fine tuning. Mostly you are doing very well if you achieve close to 10Gb = 1 GB/s steady pool performance with a single server.
     
    #9
  10. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,469
    Likes Received:
    503
    What about a bunch of SAS3 SSDs ?:)
     
    #10
  11. RageBone

    RageBone Active Member

    Joined:
    Jul 11, 2017
    Messages:
    230
    Likes Received:
    45
    Or a few "consumer" M.2 NVMe drives? RAID 0 across, say, four of them at 3GB/s each on paper gives 12GB/s total in paper and synthetic speed.
    I assume the 16 2TB SSDs should be capable of way more than saturating 40GbE.

    Though I guess that software support is the biggest problem, because without RDMA, your hardware will cap you quickly.
    I had the best performance and experience with iSER, Linux to Linux, whereas Windows can't do iSCSI+RDMA (iSER), though I'm a noob and don't have much experience. But I guess iSCSI is not suitable in your use case.
    For me, your scenario with mixed Windows and Mac clients is rather problematic.

    SMB Direct is currently the one I have my hopes on and that I fiddle with, but there is the problem: way too much fiddling.
    Linux and Samba RDMA / SMB Direct seems doable. Windows is wonky on RDMA support depending on the version (Home, Pro, whatever), and I have no clue about the Macs. I doubt it a bit, but hey, they should be running Samba too. If it isn't a version from hell, then that should be good too.
    FreeNAS added experimental SMB Direct support a while ago, but that is still experimental and can kill things.
    So unless you test that seriously, FreeNAS and probably FreeBSD are out as options.
    That leaves Linux and Windows.
    I guess UnRAID is an Ubuntu derivative?
    SMB Direct could and should, in my opinion, work on a reasonably up-to-date Linux.

    If you want to be certain, as strongly as I dislike Windows, I guess Windows Server is your only option on the SMB Direct route.

    I have no clue how good NVMe-oF support is on any of the OSes. I assume Mac will be a big no, and Windows a very strong maybe and probably no.
     
    #11
  12. kapone

    kapone Active Member

    Joined:
    May 23, 2015
    Messages:
    615
    Likes Received:
    245
    Nutanix?
     
    #12
  13. Drooh

    Drooh New Member

    All great points and great info to consider. Thank you.

    I figured, somewhat, that would be the case. I didn’t know if the NVMe drives would make much difference.



    The drives themselves aren’t the issue. If I directly attach half of the NVMe array, I’m at nearly 25 gigabytes per second read and write.

    The 2.5” SSD array of 16 gets close to that as well.

    In UnRAID, the NVMe drives and SSDs are two separate BTRFS arrays. Speed there is not the problem. My VMs that live on the NVMe array have r/w just as good as direct-attached storage on a bare-metal Windows or Mac install. I run a disk benchmark on the array, and it tests well. Transfers from the NVMe array to the SSD array achieve that speed as well.

    So it’s the network protocol that can’t keep up.



    After reading again and again, I’m starting to gain a new perspective. It does all come down to the protocol for network stack and for the file share.

    I didn’t give FreeBSD a run today, because of the realizations.

    I’m starting to think Windows Server is the only option. I can demo Acronis Files Connect, which is supposedly a high-performance file sharing protocol for Windows to Mac.

    And I guess the last thought is, what are my big boy enterprise options to achieve what I need? I’d like to not spend more than another $10k, but I don’t have much option but to go back to the bank for whatever it takes.

    And maybe it’s not possible to achieve at this point for Windows and Mac Clients, given current protocols.

    My product is infrastructure as a service, cloud-delivered. It’s quite a blow to the product if I can’t deliver at least 40GbE speeds, and without being able to achieve 100GbE speed, I’m dead in the water before the end of the year.

    DAS could work for testing, but at some point that won’t suffice.

    If you’re curious about the product and why I need this speed, what I can say is that the data is all extremely high fidelity AR/VR content, from multiple octadeca camera arrays.
     
    #13
  14. amalurk

    amalurk Member

    Joined:
    Dec 16, 2016
    Messages:
    130
    Likes Received:
    22
    Never tried it but from its FAQ and free.....
    What is BeeGFS?
    BeeGFS is the leading parallel cluster file system, developed with a strong focus on performance and designed for very easy installation and management. If I/O intensive workloads are your problem, BeeGFS is the solution.
    https://www.beegfs.io/
     
    #14
  15. Rand__

    Rand__ Well-Known Member

    My comment re SAS3 SSDs was aimed more at @gea, since they were suspiciously missing from his list (i.e. NVMe too expensive/complicated, SATA too slow), so SAS3 might be a compromise.

    But in general I agree it's not the raw disk speed that is the problem. I have had similar experiences with a bunch of Optanes in RAID 10 that were slower over the network than a single one locally.

    I agree with @RageBone that RDMA will be the way to go on this, but pickings are slim with that, and even slimmer with IB instead of Ethernet. Linux and Windows are the only choices, I think, unless you go proprietary.

    As @gea already said, the common solution to this would be to have dozens of nodes, but that will of course depend on having enough consumers to make use of the distributed approach.
    Very interested in this since I was looking for similar options (albeit with less cash and no actual business need;))
     
    #15
  16. BoredSysadmin

    BoredSysadmin Active Member

    Joined:
    Mar 2, 2019
    Messages:
    233
    Likes Received:
    52
    #16
  17. Drooh

    Drooh New Member

    I want to clarify the single NVMe vs. array point: a single NVMe over the network is faster than the entire array of NVMe drives over the network. That's the weirdest problem to assess. The only thing I can think of is that the BTRFS RAID array's interaction with the network stack is somehow different from a single attached drive's interaction with the network stack. I haven't a clue why, or where to start an assessment. Given this, there may be some legitimacy to trying to implement some sort of pooling software for Linux. Or I could just upgrade to 4TB NVMe drives, possibly 8TB ones (some are on the way); keeping single files below 4TB should not be an issue. That may work as a low(ish) cost solution.

    My preference is Ethernet as opposed to Infiniband. With many of the servers outfitted with the capability, I thought there may be some benefit with that protocol.

    I'm going to look into the node approach today. Seems like it may be the only solution to get there. I might start by:

    -giving Storage Spaces Direct a shot today, with two nodes.
    -meanwhile, researching other clustered options. I have three servers I could easily reallocate, and perhaps an additional 2-3 I could, if absolutely necessary.
    -If I can achieve 40GbE saturation with single NVMe drives, that's a potential solution that will buy me 6-12 months.
    -And, finally, it may be time to choke down the expenditure on a proprietary solution. I'll start researching that today, as well.

    This was a PHENOMENAL read. Thank you for posting. I need to read it several more times, to process a little more of the info contained therein. I understood about half of what they were saying, and the other half, I will have to familiarize myself with.

    In general, I didn't have any concept that there were this many factors at play. NUMA Scaling, RAM Speed, RAM Utilization, many layers of fine tuning CPU performance, and many layers of tuning tweaks, all have significant impact. I am kinda thinking, the key is contained in that article.
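
    One of the factors listed above, NUMA placement, can be checked quickly on a Linux server. The sketch below is only a starting point: it assumes Linux sysfs exposes a `numa_node` file under each PCI device directory, and the device names `eth0` and `nvme0` are placeholders rather than anything from this thread. It reports whether a NIC and an NVMe controller hang off the same NUMA node.

```python
from pathlib import Path


def numa_node_of(device_dir):
    """Read the NUMA node from a sysfs device directory; None if unknown."""
    try:
        node = int((Path(device_dir) / "numa_node").read_text().strip())
    except (OSError, ValueError):
        return None  # file missing (non-Linux, wrong path) or not a number
    return node if node >= 0 else None  # -1 means the kernel has no NUMA info


def share_numa_node(dev_a, dev_b):
    """True only when both devices report the same, known NUMA node."""
    a, b = numa_node_of(dev_a), numa_node_of(dev_b)
    return a is not None and a == b


if __name__ == "__main__":
    # Assumed device names; substitute your own NIC and NVMe controller.
    nic_dir = "/sys/class/net/eth0/device"
    nvme_dir = "/sys/class/nvme/nvme0/device"
    print("NIC node:", numa_node_of(nic_dir))
    print("NVMe node:", numa_node_of(nvme_dir))
    print("Same node:", share_numa_node(nic_dir, nvme_dir))
```

    If the NIC and the drives it serves sit on different nodes, every transfer crosses the inter-socket link, which is one plausible (but unconfirmed here) reason a large array can look slower over the network than it does locally.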
     
    #17
  18. RageBone

    RageBone Active Member

    I guess at this point the key is still broken up and scattered over multiple problem areas until you dig deeper into one of those.
    Protocol stack, OS capability and insight: how far does Windows let you look into it? Then there are other hardware aspects like NUMA, sometimes hidden NUMA, as on the Xeon E5-2699 v4 22c, where 6 or so cores are not directly connected to memory, or on Epyc in general except the upcoming Rome, etc.
    PLX PCIe on board, peer-to-peer NVMe-oF? ....... Never mind.
     
    #18
  19. oxynazin

    oxynazin New Member

    Joined:
    Dec 10, 2018
    Messages:
    25
    Likes Received:
    4
    Have you looked at spdk? (Storage Performance Development Kit)
    https://dqtibwqq6s6ux.cloudfront.ne...e-reports/SPDK_nvmeof_perf_report_19.01.1.pdf
    I want to try it myself, just for educational purposes, but have no time at the moment.
    I don't know if it is production ready, but if you are experimenting you may try it and post results here.

    Also for RAID for NVMe you can try Intel VROC. The key is inexpensive for you to try it I think: https://www.amazon.com/Intel-Compon...ds=vroc+intel&qid=1557198183&s=gateway&sr=8-3
    https://www.intel.com/content/dam/w...fs/virtual-raid-on-cpu-vroc-product-brief.pdf
    Again, I have no experience with it; I'm just brainstorming.
     
    #19
  20. RageBone

    RageBone Active Member

    I think VROC is only relevant on Intel X299 and above, when you want to boot off an NVMe RAID.
    AMD obviously has their free equivalent, but I don't think that is currently relevant.
     
    #20
