Yet another scale-out storage on commodity hardware question


dualamd

New Member
Feb 23, 2017
Dear STH friends,

I work for a retail company. We need a scale-out storage product that runs on commodity hardware.
There is another thread asking the same question, but our needs are different, so I'm creating this thread to ask for the community's wisdom.

Requirements:
_ At least 1M concurrent 4K IOPS at a 7:3 read/write mix on SATA SSDs (e.g. 32x-64x Samsung SM863 in Supermicro MicroCloud; fewer SSDs to reach that mark is preferred). A fio sketch of how we check this follows the list.
_ Block storage or file storage is OK.
_ 10Gbps Ethernet (SFP+).
_ Best price/performance.
_ Majority usage is VM images.
_ Minor usage is replication across datacenters (we have a single-mode fiber direct link between the DCs).
_ Data integrity + safety.
_ Clients are a mix of Linux LXC, Xen, KVM, ESXi (migrating away), and Windows (if available).
_ We're running ~400 VMs (SQL, IIS, Redis, Nginx cache, CDN content, ...). The DB VMs generate most of the IOPS demand.

Products that have been installed/tested:
_ Ceph (RADOS)
_ ScaleIO
_ LizardFS
_ BeeGFS
_ GlusterFS (not suitable for VM images)
(EDIT)
_ Compuverde vNAS (can't test because the free license does not allow all-flash)
_ Open vStorage

So far, ScaleIO is the fastest product we have tested.
Are there any alternative solutions we should consider?

Any input is appreciated.
Best regards.
 

capn_pineapple

Active Member
Aug 28, 2013
Honestly, to me at least, this seems like the kind of thing you'd want to stick on Nutanix or a similar vendor.
If you're talking about a distributed (multiple brick-and-mortar stores accessing over the net), production, scale-out storage setup with requirements like that... I'm not trying to be a corporate shill, but to me you're in the realm of vendor solutions. Nutanix fits the bill from my point of view due to its multi-hypervisor support and storage that scales with every node added (with both VM and node redundancy built into the software). vSAN would also work, but you're moving away from ESXi.

Aside from that, you've looked at all the solutions I would've suggested already.
 

gea

Well-Known Member
Dec 31, 2010
A single SM863 can give around 20k-30k random write IOPS under steady load. If you stripe them for performance or capacity, you need at least 30 SSDs in a RAID-0 or 60 SSDs in a RAID-10-like config for 1M raw write IOPS; read IOPS are a lesser problem.

So your goal is probably not achievable with redundancy, no matter the solution. Another problem is reliability with that many SSDs. This is where ZFS is the best freely available option, especially as you mention data security and reliability as a must.

Although ZFS was developed first for data security rather than performance, it offers some mechanisms that help a lot with larger arrays, mainly the option to add as many mirrors per pool as needed for the wanted IOPS performance, plus quite the best read/write cache strategies.

What you can try (you can download Solaris for free for test, demo or development):
The fastest ZFS server comes from the mother of ZFS, Oracle Solaris 11.3 with its genuine ZFS v37. For 32 SSDs you should use a server with 4x LSI 3008-8i HBAs (avoid expander-based solutions) and add at least 1-2 GB of RAM per TB of data. Add a very fast NVMe as L2ARC to additionally cache sequential read workloads (the RAM read cache is for metadata and small random reads; you also want around 5 GB of extra RAM for write cache per 10G network link, or 20 GB for a 40G uplink).

With 32 SSDs in a striped RAID-10 setup you can get up to around 16 x 25k = 400k raw write IOPS, and a bit more from the user's view due to the RAM caches. You will probably find that you do not need that many raw IOPS and that a pool of 3 x 10-disk RAID-Z2 vdevs is fast enough and cheaper, with higher capacity.
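
As a sketch only (the c#t#d# disk names are placeholders for your own controller enumeration), such a pool of striped mirrors with an NVMe L2ARC is built like this:

zpool create tank \
  mirror c1t0d0 c1t1d0 \
  mirror c1t2d0 c1t3d0 \
  mirror c1t4d0 c1t5d0
# ... continue with one "mirror diskA diskB" group per SSD pair, up to 16 mirrors for 32 SSDs
zpool add tank cache c2t0d0   # very fast NVMe as L2ARC read cache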

I am not sure you can reach 1M IOPS, but as you are asking here, you are probably after "affordable" solutions. This may be a quite cheap one.
 

dualamd

New Member
Feb 23, 2017
Honestly, to me at least, this seems like the kind of thing you'd want to stick on Nutanix or a similar vendor. ...
Thank you for your input. I will evaluate Nutanix.

Can you share how BeeGFS compared to the others?
Will do :)

A single SM863 can give around 20k-30k random write IOPS ...
Thank you Gea.
Actually, we've been using ZFS (OpenIndiana + OmniOS) since 2012.
ZFS beats every filesystem we have tested on performance, and it ticks every data-safety box.

Its disadvantages are:
_ No HA. There is a commercial add-on for this, but it is not scale-out enough for our needs. IIRC, each HA pair requires 2 machines sharing a SAS JBOD; SAS SSD pricing is not good, and a high-performance all-NVMe server is another problem.
_ We're already running ~10 ZFS pods with 12x/24x SSDs/HDDs per pod. Maintaining those pods now costs too much effort (backups, balancing VMs among pods, ...).
 

ghandalf

New Member
Dec 30, 2012
Hi,

I know this thread is from February, but maybe I also have a nice possible solution.

We will start to evaluate ZetaVault in the next few weeks, so maybe I can share some info about it later.
ZetaVault is based on ZFS on Linux and doesn't use JBODs; instead it uses storage servers which serve their storage to the head nodes via InfiniBand/network or even NVMe-oF.
You can have up to 100 head nodes and about 1,000 storage nodes.

It would be great if you could share your BeeGFS experiences!
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
@dualamd you could also skip the SM863 and use the Intel S3710 instead, going from 25K write IOPS to 40K; that's a huge jump just from changing SSDs.

I've not worked on any solutions this large, but it may be worthwhile to start looking into SATA queue depth limitations and whether they will be a factor in your workload, as well as controller-based queue depth limitations. I know some HBAs can be flashed to a higher queue depth than their default, and this may simply be down to how LSI firmware differs between model numbers / OEMs / etc.

I googled this quickly and found a couple of good articles:
"As I listed in my other post, a RAID device for LSI for instance has a default queue depth of 128 while a SAS device has 254 and a SATA device has 32"

I thought this was very useful information:
In Synchronet’s VSAN labs just between the two firmware on H310/LSI 2008’s we saw a difference of 10x better IOPS and 30x better latency.

Why Queue Depth matters!
Disk Controller features and Queue Depth?

There are also other performance tests out there showing that ultimately you should only be using 4-5 SSDs per HBA to get optimal performance. That is obviously very hard to do, but maybe 6 is possible: 3 per channel, 6 total per HBA.
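
If you want to see what a given setup actually allows, on Linux you can read the per-device queue depth straight from sysfs (sda is just an example device name, adjust for your system):

cat /sys/block/sda/device/queue_depth   # queue depth the SCSI/SATA layer negotiated for this device
cat /sys/block/sda/queue/nr_requests    # block-layer request queue size for the same device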
 

Jeggs101

Well-Known Member
Dec 29, 2010
Which nodes will you have in the MicroCloud? Don't most of these require you to have like 2-4 data disks per node?

If you really want that kind of IOPS, NVMe would be the first place to start; go outside the MicroCloud and then pipe the data into the chassis.
 

dualamd

New Member
Feb 23, 2017
@dualamd you could also skip the SM863 and use the Intel S3710 instead, going from 25K write IOPS to 40K ...
We didn't choose the Intel S3700 because its price per GB is high.
Actually, we have 3 pools of storage: HDD, SATA SSD, and NVMe.

The test SSDs are connected to the Intel AHCI controller, so QD32 is the maximum.
I would like to see the distributed filesystem's efficiency, so the majority of the tests are at low queue depth.


Which nodes will you have in the MicroCloud? Don't most of these require you to have like 2-4 data disks per node?
If you really want that kind of IOPS, NVMe would be the first place to start; go outside the MicroCloud and then pipe the data into the chassis.
Sorry, I don't understand your question; could you rephrase it?

Each MicroCloud node has 2x 3.5" HDDs in the front bays and 2x Samsung 960 Pro NVMe drives in internal M.2 slots.
It also has 2x M.2 slots on a riser card for future expansion.
Super Micro Computer, Inc. - Products | Accessories | Add-on Cards | AOC-SLG3-2M2
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
If you would consider used drives (with warranty) on eBay, you can buy them for more than 50% off :)
$150 for 400GB is not bad, and $200-250 for 800GB, depending on what you can get and how many you need. These numbers hold true for SAS2 and SAS3, assuming you don't need 24 at once ;) You should be able to find what you want if you have a little time and don't need large quantities.
 

dualamd

New Member
Feb 23, 2017
If you would consider used drives (with warranty) on eBay, you can buy them for more than 50% off :) ...
Thank you.
But I'm working for a corporation, and purchasing from eBay (and handling warranty claims there) is a troublesome process.
 

dualamd

New Member
Feb 23, 2017
Here are our evaluation results:
Due to NDAs with some vendors, some names have had to be removed.

Test hardware and setup:
_ Supermicro FatTwin F618R2-RTPTN+ and MicroCloud 5038MD-H8TRF
_ 32x SM863 2TB SSDs across 8 FatTwin sleds; each sled has 4 SSDs
_ Each FatTwin node has 2x 10GbE SFP+ links to an IBM G8264F switch.
_ Each MicroCloud node has 2x 10GbE SFP+ links to the IBM G8264F switch.
_ Storage OS: Ubuntu 16.04 running as a VM under ESXi 5.5U3 (most storage products) or running natively (StorPool)
_ The storage packages are installed on the Ubuntu VMs.
_ The storage service is exported to ESXi 5.5 (NFS, iSCSI LUN).
_ 8 Ubuntu 16.04 VMs are spawned as data consumers, running on the same physical FatTwin servers.
_ 8 more Ubuntu 16.04 VMs are spawned as data consumers, running on separate Supermicro MicroCloud servers.
_ If a storage product has a native protocol, the 8 VMs on the MicroCloud are also given access to its mount point.
_ Each data client formats btrfs on its vdisk.
_ fio is run concurrently on all 16 clients. IO depth: 4, random read/write, rwmixread=60; block sizes: 4K, 8K, 16K, 64K (an example invocation is sketched after this list).
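
Roughly, each client ran the following fio invocation per block size; the directory is just an example path on the btrfs-formatted vdisk:

# one of the 16 clients: 4K run, 60% read / 40% write, queue depth 4
fio --name=dfs-eval --directory=/mnt/testvol \
    --size=20G --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=60 --bs=4k \
    --iodepth=4 --time_based --runtime=300 \
    --group_reporting
# repeated with --bs=8k, --bs=16k and --bs=64k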

These are the results of our evaluation, tailored to our use case; they may be meaningless to somebody else.
Your use case/hardware may be vastly different from ours, so please don't hesitate to contact vendor representatives for POC sessions. They're very happy to help.

This is read+write throughput in KB/s.
http://i14.photobucket.com/albums/a343/dualathlon/Esxi/stor1.png

------------------

Then the two fastest products exported block devices to the clients using their own native protocols.
An OmniOS ZFS pod with 24x SM863 SSDs in a RAID-0 scheme also exported an iSCSI LUN to the same clients.
fio was run directly on the raw LUN device (/dev/lun).
I think this result is bottlenecked somewhere around the OmniOS NIC; I didn't have time to fine-tune this ZFS pod.
However, I'm posting it so you can see the differences between the two fastest products.

Raw device read:
http://i14.photobucket.com/albums/a343/dualathlon/Esxi/storrawread.png
Raw device write:
http://i14.photobucket.com/albums/a343/dualathlon/Esxi/storrawwrite.png

----------

Tested/contacted products:
_ BeeGFS
_ Ceph (RADOS): a little disappointing
_ GlusterFS: good performance, but not suitable for VM images (brick healing locks the files)
_ LizardFS
_ Open vStorage: I can't get the free version to work properly
_ ScaleIO: good performance; single-node and aggregated bandwidth are both good
_ Compuverde vNAS: good performance; supports replicas + erasure coding; aggregated bandwidth is good
_ Nutanix: the local reseller refused to sell a software-only license, so we didn't test a product we won't be using
_ DataCore Virtual SAN: each instance runs on Windows, and the demo license is limited to 4 nodes
_ StorPool: good performance; supports replica-2 + replica-3; single-node and aggregated bandwidth are good; the native protocol has a limit of 64 nodes
_ Hedvig: didn't have time to evaluate
_ Scality RING: too large for our use case
_ Atlantis USX: didn't reply to our contact form
_ StarWind Virtual SAN: each instance runs on Windows, and the demo license is limited to 2 nodes.
 