Hyper-converged 2 node build is done!

Poll: Red or Blue

  • Red: 5 votes (50.0%)
  • Blue: 5 votes (50.0%)
  • Total voters: 10

Jeff Robertson

Active Member
Oct 18, 2016
421
113
43
Chico, CA
Build’s Name: S2D hyper-converged 2 node cluster

Node1 (Blue1):

Node2 (Red1):

Other hardware used in build:

Usage Profile: I am an MSP, and this cluster is going to house a dozen or two VMs: some for fun (Plex), some for my business (an Altaro offsite backup server). It will also host a virtual firewall or two, as well as all of the tools I use to manage my clients.


Other information:
I’ve been running VMs on an older overclocked 4790K with a single SSD for quite a while and decided it was time to step up my game. I’ve been adding VMs regularly and had pushed the server to its limits, so a higher capacity system was in order. A single system would be simple but no fun, and MS came out with that newfangled Storage Spaces Direct technology, so I decided I would be an early adopter and see how much pain I could endure.

I’ve been in the technology game since I was about 14, when I made friends with the owner of a local computer shop. I ended up managing the shop and eventually went to school and became a full time Sys Admin. Since I’ve been rolling my own hardware for so long, that is generally my preferred way to go when it comes to personal projects. I am also space limited, so I had to figure out a way to do this without a rack. I decided to build the two node cluster using off the shelf equipment so that it would look, perform, and sound the way I wanted. Did I mention I hate fan noise? If it’s not silent it doesn’t boot in my house… A rack will definitely be in my future, and the use of standardized equipment means I can migrate the two servers to rack mountable chassis with minimum effort.

The end result is a two node hyper-converged Windows cluster using Supermicro motherboards, Xeon processors, and Noctua fans (silence is something you can’t buy from HP or Dell unfortunately). It only has the ability to withstand a single drive or server failure but that won’t be a problem since it is “on premise.”

The two nodes are connected on the back end at 40Gbps using HP branded ConnectX-3 cards and a DAC cable, which handle the S2D (Storage Spaces Direct), heartbeat, and live migration traffic. On the front end they are each connected via fiber at 10Gbps to a Ubiquiti US-16-XG switch and at 1Gbps to an unmanaged switch connected to a Comcast modem (this allows me to use a virtual firewall and move it between the nodes).

Each node is running a Supermicro microATX board and an Intel v4 Broadwell-based CPU. I found a great deal on a 14 core (2.5GHz) QS (qualification sample) CPU on eBay for one of the nodes. I purchased a 6 core (3.6GHz) Xeon for the other node. The logic behind the two different CPUs is to have one node with higher clocks but a lower core count and the other with a high core count but lower clocks. This does impose a limit of 12 virtual processors per VM (the number of threads on the 6 core CPU) for any VM that has to be able to run on both nodes. Give a VM any more than that and it could only run on the 14 core node.
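The per-VM limit above is just the smaller of the two hosts' thread counts. A minimal sketch of that arithmetic, assuming Hyper-Threading (2 threads per core) on both CPUs; the function and dictionary names are made up for illustration:

```python
# Sketch only: assumes 2 threads per core (Hyper-Threading) on both CPUs.
THREADS_PER_CORE = 2


def max_migratable_vcpus(cores_per_node):
    """Largest vCPU count a VM can have and still start on every node."""
    return min(cores * THREADS_PER_CORE for cores in cores_per_node.values())


# Core counts taken from the post; the labels are placeholders.
cluster = {"14-core node": 14, "6-core node": 6}

limit = max_migratable_vcpus(cluster)
for name, cores in cluster.items():
    print(f"{name}: {cores * THREADS_PER_CORE} logical processors")
print(f"A VM with up to {limit} vCPUs can run on either node")
```

Anything sized above that number is pinned to the bigger node and loses the ability to fail over.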

In order to use S2D on a 2 node cluster with all flash storage you need a minimum of 4 SSDs, two per server. I am using a total of 6 Samsung PM863 drives, 3 per server. Since a 2 node cluster can only use mirroring, I am able to utilize half of the total storage, about 1.3TB. This configuration can only handle a single drive failure, and even that leaves the cluster at significant risk, so I will be adding an additional drive to each server in the future. Having an additional drive’s worth of “unclaimed” space allows S2D to immediately repair itself into the unused space if there is a failure, similar to a hot spare in a RAID array. Performance is snappy but not terribly fast on paper: 60K read IOPS, 10K write IOPS.
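For anyone checking the capacity math, here is a rough sketch of the "half of the total" rule for a two-way mirror. The post does not state the drive size, so the 480 GB figure below is an assumption, and pool metadata overhead is ignored:

```python
# Assumption: 480 GB per drive. The post does not give the drive size.
DRIVE_BYTES = 480 * 10**9


def two_way_mirror_usable_bytes(drive_bytes, drives_per_node, nodes=2):
    """Every block is stored twice, so usable space is half of raw capacity."""
    raw = drive_bytes * drives_per_node * nodes
    return raw // 2


usable = two_way_mirror_usable_bytes(DRIVE_BYTES, drives_per_node=3)
print(f"{usable / 10**12:.2f} TB (decimal)")
print(f"{usable / 2**40:.2f} TiB (binary units, as the OS usually reports)")
```

With that assumed drive size, the binary figure lands close to the "about 1.3TB" quoted above.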

I also decided to play with virtualized routers and am currently running pfSense and Sophos XG in VMs. By creating a network dedicated to my Comcast connection, I am able to migrate the VMs between the nodes with no downtime other than a single lost packet if the move is being lazy. I will be trying out firewalls from Untangle and a few others to see which works best.

The hardware build process went very smoothly thanks to the helpful people on this forum and the deals section. I saved a lot of money by buying used when I could. I had that nifty Supermicro fluctuating fan problem, which I managed to fix thanks to more help, and both nodes are essentially silent. The Noctua fans run around 300rpm on average and haven’t gone above 900rpm under Prime95.

Power consumption is right in line with what I was hoping for.
  • Node1 idle: 50W
  • Node2 idle: 45W
  • Node2 Prime95: 188W
Neither node puts out enough heat to mean anything, and even under full load they are both dead silent. Due to the oddities of the Styx case, airflow is actually back to front and top to bottom. This works great with the fanless power supplies, as they have constant airflow.

The software configuration was a good learning experience. There are about a million steps that need to be done, and while I can do most of them in my sleep, S2D was a new experience. Since S2D automates your drive system, I ran into a problem I wasn’t expecting. The major gotcha I found involved S2D grabbing an iSCSI share the moment I added it to the machines. It tried to join it to the pool and ended up breaking the whole cluster… twice. Admittedly I knew what would happen the second time, but I’m a glutton for punishment apparently. Other than that everything has worked flawlessly. Rebooting a node causes a 10 minute window in which the storage system is in a degraded state while the rebooted node has its drives resynchronized. I can move all VMs from one node to the other in just a few seconds (over a dozen VMs at the same time, which only uses about 12Gbps of bandwidth) or patch/reboot a node without shutting anything else down.

Overall I think MS hit a home run with Server 2016. The drive configuration is one of the most flexible of all of the hyper-converged solutions out there and their implementation has been rock solid no matter what I’ve thrown at it. The biggest drawback of a 2 node setup is that it can only handle a single point of failure but I will be mitigating most of that in the near future.

This project came about mostly because of STH, so thank you Patrick for the great site! I also want to thank everyone who helped by answering all of my questions, especially those about the ConnectX cards. This thing would have died young without the help. I have some commands and a mostly finished outline available if anyone wants to build something similar.

Here are some links I found useful while doing this project:
Storage Spaces Direct in Windows Server 2016
Fault tolerance and storage efficiency in Storage Spaces Direct
Don’t do it: consumer-grade solid-state drives (SSD) in Storage Spaces Direct
Windows Server 2016: Introducing Storage Spaces Direct
Resize Storage Spaces Direct Volume

Some random thoughts:
I wouldn't put a 2 node cluster into production for any of my clients. The minimum recommended is a 4 node cluster, but I think 3 nodes would work well, if a little inefficiently in terms of storage capacity.

S2D has a mind of its own. Don't screw with it, just let it do its job. If you fight it you will lose, ask me how I know...

I've built a 2 node cluster using 1Gbps links. Don't bother: it's functional but not much more. The 10Gbps recommendation is pretty accurate.

Most of what I googled was for Server 2012 since 2016 is so new. There are enough resources available to find an answer to almost anything, but it does require a bit of searching. I had issues multiple times because the pre-release versions of Server 2016 used slightly different syntax than the release version, thanks MS.


vl1969

Active Member
Feb 5, 2014
611
68
28
Can I ask a few questions, please? :)
I am interested in how you did your storage setup.
Maybe you can help me figure out whether I can convert my setup in the near future.

I have built out a Hyper-V HA cluster at work based on Hyper-V Server 2012 R2.
We needed a reliable virtualization solution, as we run all of our servers as VMs on the cluster:
file server, SQL Server, 2 domain controllers, etc.


I have 2 Dell PowerEdge R730xd servers, but they are not 100% identical.

Server 1 is Intel® Xeon® Processor E5-2620 v3
Server 2 is Intel® Xeon® Processor E5-2620 v4

The rest is the same: 64GB RAM, 2x 200GB SSDs in RAID 1 for the system disk on a PERC H730 RAID controller, and
4x 2TB in RAID 10 for data on the same PERC.

At the moment I am using the free StarWind setup (2 node license), but as they are changing the license to exclude the GUI in the free version, I might be looking for a replacement.

How should I reconfigure my storage to implement Server 2016 with S2D?
Will a setup like yours work properly for an HA cluster? What do you mean when you say it can only tolerate a single point of failure?

Can you please elaborate?


 

Jeff Robertson

Of course, let me see if I can help out:

The storage setup consists of a single PLP (power loss protection) Intel drive to boot each machine. For the S2D pool I have 3 PLP Samsung PM863 SATA drives in each machine. PLP is an absolute requirement for SSDs; without it S2D goes into write-through mode and your performance becomes worse than an HDD's. Since I have a 2 node cluster (wouldn't recommend, you need at least 3 for the proper amount of resiliency) and the volume is using mirroring, I can use exactly half of the space.

As far as a single point of failure goes, I mean that in my 2 node setup the cluster can tolerate the failure of a single piece of hardware, either an entire server or a single drive in either server. BUT with the loss of a single drive in a server, the second server MUST remain online, since all of the missing data is coming from it. There is a solution to this: you add more drives and create a volume that has a size of N-1 (e.g. with 4 drives your volume would span a max of 3 of the drives, leaving 1 drive's worth of space "unclaimed"). Then if you have a drive fail, S2D will automatically rebuild itself into the spare area. There is still a small window in which the whole thing could go down due to a second failure, but it should only be a few minutes while it rebuilds. Does that make sense? If not let me know and I'll elaborate further.
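To put numbers on the N-1 idea, here is a small sketch. It assumes a 2 node two-way mirror where each server holds one full copy of the volume, and the 480 GB drive size is only an example value, not something stated in this thread:

```python
def max_volume_gb(drives_per_server, drive_gb, reserve_drives=1):
    """Largest volume that still leaves `reserve_drives` worth of space
    unclaimed on each server, assuming each server holds one full copy."""
    if drives_per_server <= reserve_drives:
        raise ValueError("need more drives than the reserve")
    return (drives_per_server - reserve_drives) * drive_gb


# Example drive size only (480 GB is an assumption).
for drives in (3, 4):
    size = max_volume_gb(drives, drive_gb=480)
    print(f"{drives} drives per server -> volume of at most {size} GB")
```

So going from 3 to 4 drives per server would let the volume stay about the size of today's 3-drive footprint while gaining a drive's worth of rebuild room.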

On to your servers: I'm not positive about mixing generations of processors but I *think* it should work. The only issue I see is that the V4 has new instructions, so that might stop it from working with the V3. I don't have any way to test, but I would do some searching to see if anyone has gotten it to work. I use two completely different processors from the same gen without issue, so there is some flexibility there. The simple fix is to purchase a V4 CPU and then you know it will work properly.

Here is the big BUT in your equation: you may not be able to use those servers with S2D, period. I say may because I need to do more research on the PERC H730 controller. S2D only supports SATA, SAS, and NVMe drives, NOT drives sitting behind a RAID card... That means the only way to use drives connected to the H730 is to put it in passthrough mode. I use HP almost exclusively and you can't do it with their equipment, but I'm not sure about Dell... If you can put it in passthrough mode then you should be able to use the servers, with the caveat of either using a bare drive to boot from or a second RAID controller so you can boot off of a RAID array. That may or may not bother you, but in a production server I would prefer a RAID 1 boot drive at least.


So a few final points:
I wouldn't put a 2 node cluster into production. It's better than a single node but not by much. 3 should be the actual minimum.

S2D is an automated system; it will claim any drive you plug in. This means that if you are going to use iSCSI alongside it, you need to set it up before enabling S2D on the cluster. I had to redo my setup because my witness disk was iSCSI based and I tried to set it up after S2D. It wasn't having it.

eBay is your friend. For the back end you can purchase ConnectX-3 cards and cables for very little, and there are some 40Gbps switches on eBay that are also inexpensive. I think there is an 8 port switch that I may get if I build a third node at some point.

MS recommends using at a minimum a 10Gbps network, and I would agree. I built a test cluster on a 1Gbps network and it was functional but not much more. Ubiquiti has a decent inexpensive 10Gbps fiber/copper switch.

If you do decide to create a cluster, PM me for my notes. I have an outline that I created as I went along.
 

vl1969

Thanks Jeff,
well, my situation is a bit more complicated than you can imagine :).

#1. My employer is not tech oriented; we are a factory. The IT department is basically me and 2 other people
who have other responsibilities as well. Most hardware things are on me.

#2. Both servers are relatively new. One is about one year old, give or take a few months;
the second is about 5 months old. So I cannot buy new CPUs. What I have is what I've got.

#3. eBay is not my friend, as the owner is totally against shopping for this anywhere but official suppliers.

#4. I believe I can set the PERC to simple HBA mode. I think it came set to simple and I configured the RAID on it, so that should be OK.

#5. My concern is that I do not have drives with PLP. All my drives are enterprise type, but I do not believe they have PLP. The servers are on a UPS though.

Also, I do not have SSDs for cache. The SSDs I have are for the OS install.


A third server is not an option at the moment either.
But isn't my current setup (a 2 node Hyper-V cluster) in the same boat?

Forgot to mention that both servers have a dual port 10Gb NIC and a quad port 1Gb NIC.
 

superfula

Member
Mar 8, 2016
88
14
8
Hey great write-up!!

I've been messing with a 3 node and 4 node S2D setup as a proof-of-concept.

We are using the Xeon D-1541 Supermicro boards as the base.
Each node has:
1x X520-DA2 for the 10 GbE speeds with a US-16-XG switch
1x Intel S3700 SSD drive
32 GB of memory

I've also played with adding 2x 4 TB SATA drives (Seagate and Hitachi enterprise drives) to each node and using the S3700 as cache.

Now the problem... regardless of the drive setup, writing to the S2D CSVs is terribly slow. Like 11 MB/second slow.
Copying from the CSVs to any node's C: drive is fine, as is C: to C:, and they are certainly using the 10 GbE cards given the speeds I've seen. Yet for some reason write speed to the CSVs is awful.
I've tried creating the CSVs as mirrored or striped in 4 node, or using -PhysicalDiskRedundancy 1 in 3 or 4 node with just the SSD drives.

I thought I'd throw out the conundrum to someone who has set up S2D in a similar fashion. Perhaps I'm just missing something easy.
 

Jeff Robertson

Well, if I were you I would do the best with what I've got. A 2 node cluster is in many ways better than a single node, so go for it! I have not seen very good speeds out of hard drives, so keep that in mind. It's possible that without SSDs the performance will be so low as to be unusable, but it's definitely worth trying out. As far as I know PLP only matters with SSDs (don't quote me on that); I've never heard of an HDD with PLP. Good luck!
 

Jeff Robertson

Sounds like a fun project! My experience using HDDs was with a proof of concept 2 node cluster in which I used 2x 320GB SATA drives per node that were from circa 1985. In other words they were SLOW. I still managed to get about 35MB/s write from a VM, so I'm not sure why yours is even lower (and I was only using 1Gbps links). Are you using dual or triple mirroring or parity on the volume? I know parity slows things down but I'm not sure by how much. You may also want to check things like RDMA to make sure the network isn't the bottleneck. Are you getting low speeds even with just the SSDs? The minimum is 2x SSDs per node, so if you are doing it with 1 drive per node that could cause issues. I can send you my build notes if you want to go through them. They aren't terribly detailed but you might find something you missed that makes all the difference. To be honest, what you've described should work pretty well, so make sure to post any fix you find, as I'm curious to know what the heck is going on.
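When ruling the network in or out, it can help to compare the observed write speed against the raw ceiling of each link speed. A quick sketch of that conversion (raw line rate only, protocol overhead ignored; the function name is made up):

```python
def line_rate_mb_per_s(mbps):
    """Raw ceiling in MB/s for a link speed in Mbps (8 bits per byte,
    no protocol overhead accounted for)."""
    return mbps / 8


for label, mbps in [("100 Mbps", 100), ("1 Gbps", 1000),
                    ("10 Gbps", 10000), ("40 Gbps", 40000)]:
    print(f"{label}: at most {line_rate_mb_per_s(mbps):g} MB/s")

# Figures reported in this thread, for comparison with the ceilings above.
observed = {"CSV writes on the 10 GbE cluster": 11, "VM writes over 1 Gbps": 35}
for what, mb_s in observed.items():
    print(f"{what}: {mb_s} MB/s")
```

The 11 MB/s figure sits just under what a 100 Mbps link can carry, which may be a coincidence, but it is cheap to confirm the negotiated speed on every port in the write path.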
 

superfula

Thanks for the feedback!

Man, even 35 MB/s would be better but that's still less than I was expecting in a mirrored setup. Parity sure.

I've tried dual and triple mirroring as well as parity on the volumes. Oddly each is seemingly capped at 11 MB/s.

Slow speeds even if the SSDs are the only drives that S2D pulls in, so they are the data drive. The cards and switch aren't RDMA capable, but I can't imagine it's a network issue since speeds are fine when copying FROM the CSV to a C:, a different node or even a USB drive attached to a different node.

I did think that one SSD cache drive per node may not help, but I ran across a post on reddit by one of the devs stating that 1 NVMe drive or 1 SSD drive as the cache is fine, you just don't get the protection in case one dies.

Yeah, go ahead and PM me your notes. I have to think it's just something dumb I'm missing but who knows.
 

vl1969

Frankly, I am not sure what my current speed is.
I am using StarWind for my 2 node cluster setup. I can't tell how fast or slow it is though :-0
 

Jeff Robertson

That is an easy one to solve. Simply run a drive benchmark utility from within a VM; it will give you a rough idea of what the underlying drive system can do. ATTO is a good, simple, quick one to download that doesn't require an install and will give you throughput numbers. Samsung Magician prior to version 5 is my favorite if you can find it. Its disk benchmark also gives you IOPS numbers as well as throughput.
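If you can't get a benchmark tool onto the VM, a few lines of Python will give a very rough sequential write number. This is only a sketch, not a substitute for ATTO: it measures one large sequential write with a single thread, and the function name, sizes, and target directory are placeholders:

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=64, block_kb=1024):
    """Write total_mb of random data to a temp file in `directory`,
    force it to disk, and return the average MB/s."""
    block = os.urandom(block_kb * 1024)
    blocks = total_mb * 1024 // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually reached the disk
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return total_mb / elapsed


# Placeholder target: point this at a folder on the volume you want to test.
target = tempfile.gettempdir()
print(f"{sequential_write_mb_per_s(target, total_mb=16):.1f} MB/s sequential write")
```

Use a larger total_mb for a steadier number, since a small file can be absorbed by caches along the way.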
 

Jeff Robertson

OK, with a bit of thought it seems to me your cluster is acting like the CSV is in write-through mode. Good reads but horrendous writes fits the bill perfectly. I am not sure how to determine if that is the case, but if you find a good PowerShell command that tells you, post it here for the rest of us! The S3700 has PLP so that shouldn't be a problem. Do you have the latest SATA drivers from Intel installed by any chance? I wonder if the RSTe drivers need to be present for the OS to detect PLP... Also, using a single SSD instead of two *might* put it into write-through mode since there is no protection. If you can, try building a 2 node cluster with 2 S3700s in each node; you might be able to nail that down as the cause. If the 2 node cluster is fast then it looks like 2 SSDs per node is essentially a hard requirement. Good luck and post anything you find!
 

DaveP

Member
Jul 30, 2016
50
2
8
39
Dang it, I don't know why I look at these threads. Just makes me want to go spend money I don't have...