New Proxmox VE 4.0 Cluster Build

Patrick · Dec 6, 2015

Had a bit of fun today setting up a 7-node all-flash (SATA, SAS and NVMe) Proxmox cluster in one of the racks.

Here is the picture I took while just doing the ugly test run (you can even see the KVM cart attached to one of the Xeon D nodes). I ended up working on the cabling after the fact but forgot to snap a picture.

For those wondering here is the Proxmox VE 4.0 cluster control:

This is now the 4th Proxmox cluster (third VE 4.0) in the rack. Here is how the previous three fared:

The VE 3.4 4-node cluster was destroyed to swap to 4.0. Seemed easier than doing the upgrade but it was running fine.
One Proxmox cluster died due to 4x Kingston V200 drives not handling Ceph logging in three nodes.
Had an issue with bringing up a NIC in one of the four nodes. Ended up just making a new cluster (#4) then migrating a few VMs using ZFS send/ receive. Significantly easier than I would have expected. Probably could have kept 3 alive but decided to just use the 3 new nodes to make cluster #4 and then re-join four nodes to the new cluster.

Speeds are extremely fast though. The NVMe drives in the 2U Intel systems are all set up as ZFS mirrors. Transferring between nodes is >1GB/s using ZFS send/ receive.
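For anyone who has not used ZFS send/receive between nodes, here is a rough sketch of the commands involved. It only builds the command strings and does not run anything. The dataset name, snapshot name and target IP are made-up placeholders, not the actual names on this cluster, and the function name is hypothetical.

```python
import shlex


def zfs_send_receive_commands(dataset, snapshot, target_host, target_dataset):
    """Build the shell commands for a snapshot-based copy to another node.

    All arguments are placeholders chosen for illustration; nothing is executed.
    """
    snap = f"{dataset}@{snapshot}"
    # 1) Take a point-in-time snapshot of the source dataset.
    take_snapshot = shlex.join(["zfs", "snapshot", snap])
    # 2) Stream the snapshot over SSH and receive it on the target node.
    send = shlex.join(["zfs", "send", snap])
    receive = shlex.join(["ssh", target_host, "zfs", "receive", target_dataset])
    return [take_snapshot, f"{send} | {receive}"]


if __name__ == "__main__":
    # Hypothetical dataset on an NVMe mirror, sent over the internal 10Gb network.
    for cmd in zfs_send_receive_commands(
        "nvme-mirror/vm-100-disk-1", "migrate", "10.0.2.202", "nvme-mirror/vm-100-disk-1"
    ):
        print(cmd)
```

Review the printed commands before running them by hand; an existing dataset on the target or a running VM on the source needs extra care that this sketch does not cover.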

Being that this is now V4 of the cluster (not including previous VE 3.x generation clusters elsewhere) here are a few tips:

As you can see, each of the nodes is labeled with a number -0x. That 0x is the last two digits of the IP address for the NICs. So fmt-pve-01's main external NIC would be 10.0.1.201 and the internal Ceph 10Gb NIC is 10.0.2.201. That naming convention makes troubleshooting really easy.
Running too few nodes e.g. 3-4 was an issue when I added Ceph to the cluster, specifically when a node or two would fail. Now have extra nodes just to buffer the quorum in the event a node or two is rebooting/ failing:

Code:

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      7
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.0.1.201
0x00000003          1 10.0.1.202
0x00000005          1 10.0.1.203
0x00000006          1 10.0.1.204
0x00000002          1 10.0.1.205
0x00000007          1 10.0.1.206 (local)
0x00000001          1 10.0.1.207

Note - a better way to do this would have been to use 01 as the master so it got Nodeid 01 but I was not able to in this instance.

Using NVMe backed ZFS mirrors has been awesome. If something goes wrong, it is very simple to re-import those nodes.
Having a few nodes with larger root drives has worked well. I have two nodes now that I can use to create local VMs then migrate the VMs quickly to other nodes/ storage.
The Proxmox VE 4.0 initial installation checklist is really nice to keep handy. It has greatly reduced my provisioning time for new nodes.
The total setup (4x Xeon D and 3x 2P E5 nodes) idles in the 800W range even with a fairly expansive set of disks.
If things go wrong in your installation, troubleshoot first, then add everything to a cluster. Adding too few nodes due to issues or adding troubled nodes just to get the node count up is not a wise idea.
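The node naming tip above can be written down as a tiny helper. This is only a sketch: the 10.0.1.x and 10.0.2.x networks and the fmt-pve-01 example come from the post, while the offset of 200 is inferred from that one example and the function name is made up.

```python
def node_addresses(hostname, external_net="10.0.1", ceph_net="10.0.2", offset=200):
    """Map a hostname like 'fmt-pve-01' to its external and Ceph addresses.

    Assumes the trailing number of the hostname plus `offset` is the last
    octet on both networks (offset=200 is inferred from the one example).
    """
    node_number = int(hostname.rsplit("-", 1)[1])
    host_octet = offset + node_number
    if node_number < 1 or host_octet > 254:
        raise ValueError(f"node number {node_number} does not fit in the last octet")
    return {
        "external": f"{external_net}.{host_octet}",
        "ceph": f"{ceph_net}.{host_octet}",
    }


if __name__ == "__main__":
    print(node_addresses("fmt-pve-01"))
```

For fmt-pve-01 this gives 10.0.1.201 for the external NIC and 10.0.2.201 for the Ceph NIC, matching the example in the tip.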

For those wondering, it is a different Netgear XS712T switch in there. I swapped it out this weekend.

LXC works well but, to be frank, I would rather have seen Docker used.

Biren78 · Dec 7, 2015

What is the Ceph disk/ capacity there?

Patrick · Dec 7, 2015

23 OSDs and something like 13.5TB formatted. There is an additional ~2.4TB / node (on the E5 V3 nodes) of NVMe and there are also boot SSDs that are local ZFS stores.

PigLover · Dec 7, 2015

Very nice. The 10GbE is all base-T, right? What are you using for a switch?

You're starting to reach some scale here but it looks like you are still running it hyper-converged (storage and VMs sharing nodes). Have you considered what issues might tip you towards isolating storage - or is the converged model looking like it could still grow for a bit?

Also, this comment is interesting:

Patrick said:
Running too few nodes e.g. 3-4 was an issue when I added Ceph to the cluster

You've touched on one of the biggest issues of scale-out designs: they have to have enough scale to make sense before they really work well. You might not have even reached that scale yet.

Patrick · Dec 7, 2015

PigLover said:
Very nice. The 10GbE is all base-T, right? What are you using for a switch?

You're starting to reach some scale here but it looks like you are still running it hyper-converged (storage and VMs sharing nodes). Have you considered what issues might tip you towards isolating storage - or is the converged model looking like it could still grow for a bit?

You've touched on one of the biggest issues of scale-out designs: they have to have enough scale to make sense before they really work well. You might not have even reached that scale yet.

Right now it is just using the Netgear XS712T. 12 ports (2 can be SFP+ instead of base-T). There is a 1Gbase-T HP V1910-24G (external facing) and one of those Tripp-Lite combo switches (management NICs only).

One thing I learned from the mixed disk/ SSD previous cluster is that the disks slow you down a LOT. Went from 500MB/s writes to 1.2GB/s writes. Removing 6 disks had a huge impact on performance.

I think the hyper converged model is working well for now. Limited a bit by power I had available for this/ hardware. I did move networking out to a separate box and there are a few other nodes there for admin/ monitoring.

The big benefit is that I can mix storage between local ZFS and clustered Ceph. If I split the nodes up, I would have 2-3 storage nodes and 4-5 hypervisor nodes. The advantage of using this model is that I have a quorum at 4 nodes but have 7 nodes total. That gives me a bit of a buffer to take a node down for an upgrade/ replacement without having to worry about a cluster node failing during that period and causing a ruckus.
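To put numbers on that buffer, here is a small sketch of the majority arithmetic. It assumes every node carries exactly one vote and quorum is a simple majority, which is consistent with the votequorum output posted earlier (7 votes, quorum 4). The function name is made up.

```python
def quorum_summary(total_votes):
    """Simple-majority quorum, assuming one vote per node."""
    if total_votes < 1:
        raise ValueError("need at least one vote")
    quorum = total_votes // 2 + 1  # strict majority of all votes
    return {
        "total_votes": total_votes,
        "quorum": quorum,
        # Nodes that can be down while the rest still hold a majority.
        "tolerated_failures": total_votes - quorum,
    }


if __name__ == "__main__":
    for nodes in (3, 4, 7):
        s = quorum_summary(nodes)
        print(f"{nodes} nodes: quorum {s['quorum']}, can lose {s['tolerated_failures']}")
```

With these assumptions, 3 or 4 nodes can each lose only one node before losing quorum, while 7 nodes can lose three. That covers the Proxmox cluster vote only; how Ceph behaves with OSDs missing is a separate question that depends on the pool settings.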

Naeblis · Dec 8, 2015

@Patrick can I give you a -1 vote, or should I just go back and unlike one of your other posts?

Took me 3 edits and 45+ minutes redoing my post to meet the "Posting your build: Guidelines". I guess if you are the admin.
My inability to spell or organize my thoughts into a paragraph that anyone but me can comprehend had nothing to do with those 3 edits. I'll plead the 5th for the other 4 edits and 2 hours spent on the build post.

Nice build by the way

Patrick · Dec 8, 2015

I doubt I can remember all of the hardware in there.

Naeblis · Dec 8, 2015

Patrick said:
I doubt I can remember all of the hardware in there.

That's why we document items while we build so we don't have to. If there was only a forum that we frequently visited, and said forum had a section dedicated to post our builds in. Wouldn't that be awesome.

T_Minus · Apr 29, 2016

It's been ~6mo now... how is this setup working for you?
Would you still go hyper converged or run separate storage nodes?

Patrick · Apr 29, 2016

T_Minus said:
It's been ~6mo now... how is this setup working for you?
Would you still go hyper converged or run separate storage nodes?

Really well actually. It is still running 4.1 but I really like the setup. The all flash Ceph is working well too.

There are a few bits that I might change but I have not touched the setup in quite some time.

gigatexal · Apr 29, 2016

Naeblis said:
That's why we document items while we build so we don't have to. If there was only a forum that we frequently visited, and said forum had a section dedicated to post our builds in. Wouldn't that be awesome.

chill bro

msvirtualguy · Apr 29, 2016

PigLover said:
Very nice. The 10GbE is all base-T, right? What are you using for a switch?

You're starting to reach some scale here but it looks like you are still running it hyper-converged (storage and VMs sharing nodes). Have you considered what issues might tip you towards isolating storage - or is the converged model looking like it could still grow for a bit?

Also, this comment is interesting:

You've touched on one of the biggest issues of scale-out designs: they have to have enough scale to make sense before they really work well. You might not have even reached that scale yet.

You simply can't lump all scale-out solutions into a single category as they are not the same. For instance, this statement does not apply to the Nutanix solution. I have several customers running the minimum 3-node configuration getting stellar performance with all the resiliency they require for their business.

While it's true we say the first day is the worst day with our solution, that only means that it gets better as you scale because we add Controllers (CVM) and SSD with each node add whether storage only or compute/storage mix. It doesn't mean that it won't provide the performance/resiliency.

Evan · Apr 29, 2016

I am really convinced these converged scale out solutions will be the only way forward, either using the hypervisor, OS, or application clusters.
Just look at how the vendors like Microsoft have driven MSX or SQL in a world without shared storage.
SAN as we know it does, I think, have a limited life... It will still be around a long while yet, but I am sure start-from-scratch solutions will use it less and less. High speed networking is making a big difference finally.

msvirtualguy · Apr 30, 2016

Evan said:
High speed networking is making a big difference finally.

It's going to be fun to see NVMe/3dXpoint HCI because Networking will be key. Those without Data Locality may be in for a challenge.

Patrick · Jun 22, 2016

Interestingly enough, one of these nodes had a blip today. I just rebooted it and everything seems fine. Ceph stayed up, zero downtime on the storage/ applications in the cluster. The only reason I found out something went wrong was a Proxmox monitoring alert.

Will diagnose the cause soon but it is a nice feeling to see that happen.

PigLover · Jun 22, 2016

Curious if you could expand on "zero downtime on the storage / applications in the cluster".

Got it that Ceph didn't hiccup. That is as expected.

Were the applications on the node that faulted set up for HA - and they restarted on another node? Or are you saying they just restarted cleanly when the node restarted? Or perhaps there were no applications on that node - only storage - and you are saying that the rest of the cluster just didn't care?

Patrick · Jun 22, 2016

PigLover said:
Curious if you could expand on "zero downtime on the storage / applications in the cluster".

Got it that Ceph didn't hiccup. That is as expected.

Were the applications on the node that faulted set up for HA - and they restarted on another node? Or are you saying they just restarted cleanly when the node restarted? Or perhaps there were no applications on that node - only storage - and you are saying that the rest of the cluster just didn't care?

It was a storage node and had a few small VMs that were sitting behind HAproxy. So the node went down, Ceph did not have an issue. HAproxy routed traffic to VMs on other nodes.

So the OSDs went offline, the VMs went offline, but from an application standpoint nothing went down.

frogtech · Jun 23, 2016

It isn't immediately clear to me but does Proxmox have built in converged block storage solutions? Just reading the storage area of the site doesn't indicate such.

PigLover · Jun 23, 2016

Proxmox has several options to support block storage. In this case Patrick is using Ceph distributed across the Proxmox cluster for hyperconverged block storage.

Proxmox has built in tools to deploy and manage Ceph.
