Shared storage options for VMware ESXi


ecosse

Active Member
Jul 2, 2013
I haven't looked at this yet, but see the link below - a freemium option with a 24TB raw limit - should be enough for most home lab block storage :)

Maxta goes Freemium and enlarges VP count with new hires

What I am really struggling with at the moment is failure avoidance at a reasonable cost. For VSAN you need 5 nodes to set failures to tolerate = 2 (and I'm not sure yet whether that means just 2 drives, a la RAID-6, or some other type of failure). ScaleIO is mirroring. My simple replicated VSA with its RAID-10 can survive at least 1 node failure and 4 drive failures. Am I missing something with VSAN / ScaleIO?
 

grogthegreat

New Member
Apr 21, 2016
You are correct that ScaleIO mirrors data chunks and reserves one node's worth of storage as 'spare', since ScaleIO will instantly start re-duplicating blocks that have only one copy left the moment there is a node or drive failure. It doesn't wait. Since this reads from and writes to nearly all the drives at once, you get back to a protected state very quickly. If you have nodes that are more likely to fail together (in the same rack, for example) you can group them into fault sets so that the two copies of any piece of data are never stored within the same fault set. That way more than one node can go down at once without data loss, as long as they are all in the same fault set.
Since it is free with no restrictions for personal use, I'm running three nodes hyperconverged with ESXi 6.0. I'll be moving to 10Gb networking and a 4th node this week. ScaleIO works fine with small clusters, but I'd suggest 10Gb and SSDs, since a small number of nodes and spinning HDDs get overwhelmed by random IO fairly quickly.
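If it helps, here is a toy sketch in Python of the fault-set rule I mean - my own illustration, not ScaleIO's actual placement code, and the rack and node names are made up. The only invariant it enforces is that the two copies of a chunk never land in the same fault set:

```python
import random

# Toy model: nodes grouped into fault sets (e.g. racks). The two copies of a
# chunk must land in different fault sets, so losing an entire fault set
# never destroys both copies of anything.
fault_sets = {
    "rack-A": ["node1", "node2"],
    "rack-B": ["node3", "node4"],
    "rack-C": ["node5", "node6"],
}

def place_chunk():
    """Pick two nodes from two different fault sets for a mirror pair."""
    fs_primary, fs_secondary = random.sample(list(fault_sets), 2)
    return random.choice(fault_sets[fs_primary]), random.choice(fault_sets[fs_secondary])

# Sanity check: no chunk ever has both copies inside one fault set.
node_to_fs = {n: fs for fs, nodes in fault_sets.items() for n in nodes}
for _ in range(10_000):
    a, b = place_chunk()
    assert node_to_fs[a] != node_to_fs[b]
print("10,000 chunks placed; every mirror pair spans two fault sets")
```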
 

spazoid

Member
Apr 26, 2011
Copenhagen, Denmark
You are correct that ScaleIO mirrors data chunks and reserves one node's worth of storage as 'spare'...
Would you care to go into a bit more detail about your ScaleIO setup? As far as I know, not a lot of people here run it in small setups (3-4 nodes). I'm thinking seriously about going the ScaleIO route instead of VSAN for the next iteration of my homelab, but a lot of people say the performance is underwhelming in such small configurations.

Any info would be very helpful - specs, performance, hiccups, experiences, thoughts - you get the point.
 

grogthegreat

New Member
Apr 21, 2016
Sorry if this derails the thread a bit!
My ScaleIO experience is limited since I'm only at 3 nodes, only on 1Gb Ethernet, and haven't tried a separate storage/compute layout yet. That said, here are some of my thoughts:
-Installing is easy if you are going hyperconverged (use the vSphere plugin) or two-layer (use the Installation Manager). Trying to mix the two during the initial setup is much more manual, so I didn't attempt it.
-I've failed nodes and drives as tests. ScaleIO, even in my setup, is pretty quick to get you back to a safe state. With ScaleIO you are back to a protected state before you'd even get a dead drive replaced in a ZFS or conventional RAID setup.
-The GUI requires 64-bit Java and I had trouble getting it working on my desktop, but it works perfectly on another computer. Once it is working, the GUI is very nice - lots of data and control without getting confusing.
-One missing feature that seems odd: I don't see any way for it to send email alerts in case of a failure, only SNMP. I might be missing something here.
-You will need extra storage in one node to initially run vCenter and the web GUI VM, since using the plugin for the initial setup makes things easy; otherwise 16GB or 32GB SATADOMs are all you need. The plugin isn't super user-friendly, but after the setup you don't really use it much.
-I chose ScaleIO for the flexibility it offers. Nearly any computer can serve storage, access storage, or both. Add or remove hard drives or servers at will. My experience here has been awesome, although rebalancing data onto a new drive is a little slow due to my setup.

My setup:
3 nodes (going to a 4th in 2 weeks) with Xeon E3-1230 CPUs, 32GB RAM, a 16GB SATADOM, and two 3TB 5400rpm SATA drives each. Running ESXi 6.0U1 and ScaleIO 2.0. Each node is using just a single 1Gb connection with everything on one subnet. My LB6M just arrived, so I'll be going to 10Gb this week. I'm waiting until Black Friday to get some SSDs. Basically, my cluster is the absolute smallest and slowest you could build.

Performance:
Moving large amounts of data runs at about 50MB/s, and running multiple VMs at once results in a lot of random IO instead of sequential IO. Random IO to 5400rpm drives over 1Gb isn't ideal. For what I'm doing (bulk media storage plus some VMs) the performance is perfectly fine. If you have a more serious load, I believe 10Gb and SSDs would perform great even at small cluster sizes. I'm moving to 4 nodes since I'll have the hardware, and usable space will increase from 1/3 to 3/8 of raw. I'll start my own thread with performance notes once I see what changes 10Gb, SSDs, and a 4th node bring. Screenshot from my file server VM running on the ScaleIO storage is attached.
 

Attachments


ecosse

Active Member
Jul 2, 2013
Thanks grog (the great) - very interesting. Why do you think, though, that a ScaleIO disk rebuild will be any faster than a RAID-0 hot spare kicking in? Yeah, I understand the fault sets, and I suppose at scale it might work, but the redundancy still appears light years behind what you can get out of SANs or even VSAs.
 

grogthegreat

New Member
Apr 21, 2016
Did you mean a RAID-10 instead of a RAID-0? ScaleIO will be much faster. Here is why:
When a drive fails in a RAID-10, it rebuilds by copying data from the surviving drive in the mirror to the hot spare drive. You are reading from one drive and writing to one drive, so you are limited to the write speed of a single drive.
When a drive in ScaleIO fails, the software identifies which blocks now have only one copy and immediately starts making duplicates. Those single-copy blocks are spread across every drive in the cluster except the drives in the same node as the failed drive. In a 5-node cluster that is 4 out of 5, or 80%, of all the drives in your cluster. The rebuild reads from all those drives at once, and the newly duplicated blocks get written to (spread across) every remaining drive in the cluster. The rebuild speed is therefore limited by the combined write speed of your entire cluster. Depending on how many drives you have, this can be dozens of times faster than writing to a single drive. ScaleIO also has a setting that throttles the rebuild so it doesn't affect the performance of your VMs.
I'm not a storage expert, so if anyone has documentation that says I'm wrong, feel free to post it. In my small ScaleIO cluster with 1Gb networking, testing showed rebuild times were very fast (under 20 minutes for a failed 3TB drive), for the reasons I just mentioned plus the fact that ScaleIO only moves actual data. This might be an apples-to-oranges comparison, but most RAID cards I'm familiar with have to copy the entire drive's worth of space, empty or not.
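To put rough numbers on it, here is a back-of-the-envelope calculation in Python. The drive speed, amount of data on the failed drive, and drive count are assumptions chosen just to show the shape of the math, not measurements from my cluster:

```python
# Back-of-the-envelope rebuild-time comparison (illustrative numbers only).

def hours(data_gb, write_mb_s):
    """Time to rewrite data_gb of data at write_mb_s, in hours."""
    return data_gb * 1024 / write_mb_s / 3600

drive_write_mb_s = 120.0      # assumed sustained write speed of one HDD

# RAID10 hot-spare rebuild: read one surviving mirror, write one hot spare,
# and copy the whole 3TB drive whether it holds data or not.
raid10_h = hours(3000, drive_write_mb_s)

# Distributed mirroring: only the actual data on the failed drive (assume
# 1TB) is re-copied, and the writes are spread over (say) 8 remaining drives.
scaleio_h = hours(1000, drive_write_mb_s * 8)

print(f"RAID10 hot-spare rebuild  : ~{raid10_h:.1f} hours")
print(f"Distributed mirror rebuild: ~{scaleio_h * 60:.0f} minutes")
```

With those made-up numbers the distributed rebuild comes out around 18 minutes - the same ballpark as the sub-20-minute rebuild I actually saw - while the single hot spare takes several hours.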
 

ecosse

Active Member
Jul 2, 2013
Did you mean a RAID-10 instead of a RAID-0? ScaleIO will be much faster. Here is why...
No, that makes sense - thank you. Still not enough to make me go with a storage configuration where I have data loss if simultaneous disk failures exceed a count of one! At least not for my "critical" stuff.
 

grogthegreat

New Member
Apr 21, 2016
Even though ScaleIO does a lot beyond the fast rebuilds (checksumming, scrubs, snapshots, etc.) to protect your data, it still doesn't remove the need for backups. Earlier you mentioned RAID-10 volumes replicated to another server; since that is a backup and not part of the same storage volume, it isn't really fair to compare it to ScaleIO. Comparing ScaleIO with a single RAID-10 volume, both will lose data to two simultaneous disk failures if it is the wrong two disks. There are pros and cons to each. With RAID-10, it has to be one specific disk (the mirror partner). With ScaleIO, it has to be a disk in a different server, and it has to fail within a very small time window. Personally, I've never seen two disks in different servers fail within 30 minutes of each other, but I have seen power supply issues kill more than one disk in the same server at once. Your mileage may vary, random internet advice not under warranty, please have backups.
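To make the time-window point a little more concrete, here is a crude calculation. The failure rate and the two rebuild windows are assumptions pulled out of the air for illustration, so treat the outputs as shapes rather than facts:

```python
# Crude comparison of the "second failure inside the rebuild window" risk.
# The AFR and the window lengths below are assumptions for illustration only.
AFR = 0.03                          # assumed 3% annual failure rate per drive
p_per_hour = AFR / (24 * 365)       # per-drive failure probability per hour

def p_second_failure(exposed_drives, window_hours):
    """P(at least one of the exposed drives fails during the window)."""
    return 1 - (1 - p_per_hour) ** (exposed_drives * window_hours)

# RAID10 with a hot spare: only the mirror partner is fatal, but the window
# is a whole-drive rewrite (assume ~8 hours).
print(f"RAID10-ish : {p_second_failure(1, 8):.1e}")

# Distributed mirroring: any drive in another node is fatal (assume 6 of
# them), but the window is a fast parallel rebuild (assume ~0.5 hours).
print(f"ScaleIO-ish: {p_second_failure(6, 0.5):.1e}")
```

With those assumptions the short window more than offsets the larger number of exposed drives, which is part of why I care so much about rebuild speed.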

A missing feature of ScaleIO is the ability to sync or send its volumes to a separate ScaleIO cluster. My best guess is that enterprise users aren't clamoring for that feature, since any disaster would be met by restoring backups onto a replacement cluster, or they are syncing the data at a higher level from inside a VM.

@spazoid I have not tested VSAN, and I don't know how its features (or how it handles rebuilds) compare with ScaleIO. It is possible that it performs a lot better than ScaleIO for 3-node clusters. I went with ScaleIO because I store a lot of data and care about that more than running VMs. The fact that I could take an old desktop (too slow or unsupported for VMware), install CentOS, and use it as a storage node was important for me.
 

ecosse

Active Member
Jul 2, 2013
Even though ScaleIO does a lot beyond the fast rebuilds (checksumming, scrubs, snapshots, etc.) to protect your data, it still doesn't remove the need for backups.
Granted, but going back to backups is a PITA, pure and simple. I want to resort to backups for corruption or FLR, not for simple HA. BTW, I'm not comparing ScaleIO to a single RAID-10; I'm comparing it with a simple 2x VSA replication - it's only fair, since the HC SANs are replicating :)

Having said that, I have a 3TB all-flash VSAN sitting here going begging, so I suppose I should at least test it :)

 

grogthegreat

New Member
Apr 21, 2016
VSA replication is nice, and if you are using RAID-10 as the base for each host then yes, it can handle more simultaneous failures than ScaleIO - it would take a lot to force you back to backups. I've never used it, but everything I've read says it is fast as well, since you are accessing local storage. The downside for a data hoarder like me is that such a setup only gives you 25% of raw as usable. ScaleIO gets you a higher percentage as you add nodes; even with only 4 nodes you get 37.5% usable, which is 50% more space than the replicated VSA. Of course, you give up some speed and redundancy to get it.
I don't mean to sound like I'm preaching ScaleIO, but it isn't built on top of another RAID. Aside from not needing a RAID controller and that extra point of failure (ever had drives not play nice with a RAID card?), it gives you flexibility. Got various-sized hard drives? It doesn't care, as long as the nodes are somewhat balanced. Have some old 1TB drives? Toss them in. If they start to fail, you can replace them with drives of any size you want, or choose not to replace them at all! Don't try that at home with a normal RAID or even ZFS. I don't know how this compares with VSAN, but that flexibility should hopefully save me from a huge storage upgrade as I run out of space or my hardware gets old.
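The capacity fractions are easy to sanity-check. A quick sketch, using my simplified model of two copies of everything minus one node's worth of spare (it ignores metadata overhead and any extra spare you choose to configure):

```python
# Usable-capacity comparison: replicated VSA on RAID10 vs two-copy mirroring
# with one node's worth of capacity reserved as rebuild spare.

def vsa_replicated_raid10():
    # RAID10 halves raw capacity; replicating to a second host halves it again.
    return 0.5 * 0.5

def scaleio_usable(nodes):
    # Two copies of everything, minus one node held back as spare.
    return (nodes - 1) / (2 * nodes)

print(f"replicated VSA (RAID10 x2): {vsa_replicated_raid10():.1%}")
for n in (3, 4, 8, 16):
    print(f"ScaleIO-style, {n:2d} nodes    : {scaleio_usable(n):.1%}")

# 4 nodes -> 37.5% usable, i.e. 50% more usable space than the 25% VSA setup.
print(f"4-node advantage over VSA : {scaleio_usable(4) / vsa_replicated_raid10() - 1:.0%}")
```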
 

spazoid

Member
Apr 26, 2011
Copenhagen, Denmark
A missing feature of ScaleIO is the ability to sync or send its volumes to a separate ScaleIO cluster...
Isn't that exactly what Protection Domains are for? Making sure that at least one replica of your data is on a specific set of SDSs, so that even if DC1 goes down, all your data is still available from the SDSs running in DC2.
 

ecosse

Active Member
Jul 2, 2013
VSA replication is nice, and if you are using RAID-10 as the base for each host then yes, it can handle more simultaneous failures than ScaleIO...
Fair enough - this comes down (largely) to a different design philosophy, for block storage at least. I want conformity as much as possible, since that should give repeatable performance and availability. I don't want to throw any old 1TB drive at my premium storage tier, and I want it to survive (so yes, I have seen two drives fail at roughly the same time, both as straight drive failures and especially when I was using a Norco 24-bay chassis with those crappy SAS connectors - they used to drop out like crazy). I can't speak for ScaleIO, but VSAN is not completely HBA-agnostic from either a supportability or a performance perspective - there are queue depths to consider, and there were corruption issues if you used onboard Intel chipsets (in the beta version, if I remember correctly).
The no-forklift upgrade path is a plus though - assuming that survives as ScaleIO moves through its versions - though I could easily extend my existing RAID-10 volumes if I wanted.
BTW, for file storage I am using SnapRAID, so that's more akin to your philosophy, but again based on a 6 data + 2 parity configuration. That has saved me more than once :)
 

grogthegreat

New Member
Apr 21, 2016
@spazoid Protection domains are groups of storage servers that act together to form a cluster. If you have multiple node failures within a protection domain, it will only take down that protection domain; the other protection domains will be okay. They are really only useful if you have a very large number of nodes. Protection domains do not sync any data between them; they are completely separate from each other from an IO and performance standpoint. Protection domains are grouped into a ScaleIO system for management purposes only, and the client software can access volumes from any number of protection domains.
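Here is how I picture the hierarchy, as a toy Python model (my own mental model, not ScaleIO's actual API): protection domains are independent pools that never share data, while fault sets only constrain where the two copies land inside a single domain.

```python
from dataclasses import dataclass

# Toy model of the hierarchy described above - not ScaleIO's real API.
@dataclass
class FaultSet:
    name: str
    nodes: list[str]

@dataclass
class ProtectionDomain:
    """Independent pool: chunks and their mirror copies never leave this domain."""
    name: str
    fault_sets: list[FaultSet]

    def legal_mirror_pair(self, node_a: str, node_b: str) -> bool:
        # Both copies must live inside this domain, in different fault sets.
        fs_of = {n: fs.name for fs in self.fault_sets for n in fs.nodes}
        return node_a in fs_of and node_b in fs_of and fs_of[node_a] != fs_of[node_b]

pd1 = ProtectionDomain("pd-1", [FaultSet("fs-a", ["n1", "n2"]),
                                FaultSet("fs-b", ["n3", "n4"])])

print(pd1.legal_mirror_pair("n1", "n3"))   # True: same domain, different fault sets
print(pd1.legal_mirror_pair("n1", "n2"))   # False: both copies in one fault set
print(pd1.legal_mirror_pair("n1", "n5"))   # False: n5 is not in this domain at all
```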

@ecosse I've also been saved by RAID-6 arrays when more than one drive drops out. My storage before ZFS was that same Norco 24-bay chassis! I just haven't seen two drives in different servers fail within 30 minutes of each other, although it certainly could happen. The cost of these forklift upgrades is something I've come to hate, since whatever solution I move to needs enough storage to hold all the data before I can repurpose the old hardware - I'm paying to double my storage each time. Even if ScaleIO doesn't work out long term, at least I can remove drives from it one at a time as I transition to my next storage adventure. That is a feature I wish regular RAID or ZFS had. Does anyone know if VSAN can do that?
 

spazoid

Member
Apr 26, 2011
Copenhagen, Denmark
@grogthegreat Right, I was thinking of fault sets. Define each site as a fault set, and ScaleIO will make sure that the mirrored copies land in different fault sets. It's not replication, I suppose, as in hourly async replication, but data is mirrored between your sites. It won't increase your fault tolerance in terms of numbers of disks or nodes (and won't use extra capacity either), but it will give you an entire copy of your data in another fault set, which could for example be a datacenter.
 

grogthegreat

New Member
Apr 21, 2016
I don't think you can use fault sets like that. ScaleIO treats a fault set sort of like one large node. The fault sets are still within the same cluster, so the performance of your cluster would be really, really bad if part of it were separated by a WAN link.
 

grogthegreat

New Member
Apr 21, 2016
Moved to a 10Gb switch (LB6M) without making any other changes. Running a low-QD test on a single VM against a single volume probably isn't the best test, but more data is better than none. I have not done any performance tuning or even tested the node-to-node bandwidth yet. At some point soon I'll be adding a 4th node as well as buying some SSDs to form an SSD volume for VM boot drives. I'll share any tests you'd like to see.
 

Attachments


spazoid

Member
Apr 26, 2011
Copenhagen, Denmark
This is great! Do you by any chance have a similar test result with the 1Gbit equipment? I have a new server on its way and have started prepping my current ESXi hosts, so in another week or two I'll hopefully be up and running.
Have you done any testing on the effects of RFcache and/or RAM cache?
 

grogthegreat

New Member
Apr 21, 2016
This is great! Do you by any chance have a similar test result with the 1Gbit equipment?...

Just above, in post #24, I ran the same test on 1Gb with the same setup. I have not played with RFcache. These results were with 2GB of RAM cache per node enabled, but I don't think the test was served from cache, since we would probably see higher numbers if it had been.