Direct Connect vSphere 7 with 3 or 4 Hosts Possible?

CA_Tallguy

New Member
May 19, 2020
28
4
3
Looking for feedback and ideas....

BACKGROUND: I want to set up vSphere 7.0 and planned to go with a 2 host direct connect cluster. After realizing that I would still need a witness, I figured why not just go with a 3 host cluster, since a 3rd host is more useful than a witness? And the more I looked at a 3 node cluster, the more a 4 node cluster seemed attractive for extra features like erasure coding and having some slack space for rebuilds.

PROBLEM: The build is ballooning and I'm drawing the line at direct connect. Many people say that you shouldn't really **direct connect** more than 2 hosts. I do NOT want to get a separate switch (I want to use 10gbe/40gbe for vSAN, vMotion etc., and those speeds are only available to me if I direct connect). I'm building up a Dell C6300/C6320 box with up to 4 nodes and want to keep the whole build inside the chassis.

HYPOTHESIZED SOLUTIONS:

(1) FORGE AHEAD and try to chain 3 or 4 hosts together and experience the pain everyone predicts
(2) SCALE BACK to 2 nodes + witness on third (seems like a waste of a node but provides for high speed direct connect)
(3) "STRETCHED" FOUR HOST CLUSTER: Host A<=>B @40gbe, C<=>D @40gbe with slower link between as if stretched between sites (back to needing witness?)
(4) 2X TWO NODE CLUSTERS: similar to #3 but as two distinct two-node clusters (with a witness for each on the opposite cluster)
(5) PFSENSE: somewhere within the cluster or dedicate one of the four server nodes to it (single point of failure?)
(6) PROXMOX: try to do a MESH setup (maybe ditch vSphere): Full Mesh Network for Ceph Server - Proxmox VE

Some options for physical connections:

(A) QSFP DAISY CHAIN: 3 or 4 nodes + try to avoid a loop with NIC teaming and failover? set "routing based on originating virtual port" (?)
(B) QSFP TO 4x SFP+ breakout cables on 2 nodes.... so all 4 nodes meet up on a SINGLE physical port (mirrored on another node for failover)
(C) 1GBE SWITCH MODE: run PFsense bare metal or in another virtualized environment, and maybe have failover instance within the cluster
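For option (A), the teaming policy mentioned above is set per vSwitch and can be scripted from the ESXi shell. A rough sketch, assuming vSwitch1 carries the direct-connect links and vmnic2/vmnic3 are the 40gbe ports (the vSwitch and vmnic names here are placeholders for illustration):

```shell
# "Route based on originating virtual port ID" (portid) avoids any link
# aggregation across the point-to-point links; one uplink active, one standby
esxcli network vswitch standard policy failover set \
    --vswitch-name=vSwitch1 \
    --load-balancing=portid \
    --active-uplinks=vmnic2 \
    --standby-uplinks=vmnic3

# Verify the resulting teaming/failover policy
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch1
```

The same settings are reachable in the vSphere client under the vSwitch's Teaming and failover page; the CLI just makes it repeatable across nodes.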

I'm also hoping to route WAN and private network traffic through all nodes WITHOUT physical uplinks to every node. I'd like to route that traffic through vSphere using whatever interconnections I'm setting up for other traffic. Otherwise the back of this server is going to be ridiculous with 4x iDRAC cables, 4x WAN cables, 4x private network cables, and then all the cross/direct connect 40gbe/10gbe cables.

Here is what solution #3 and #4 might look like roughly....

cross-connect.png

And below is an option for using QSFP to SFP+ fanout cables. Could run WAN and LAN over one and then vMotion and vSAN over the other. I think the only way this may work well is if there is some huge benefit to concentrating traffic on the fanout ports.... wondering if the NIC could insulate the CPU? I don't know enough about resource costs for packet switching.

fanout.png
 
Last edited:

Spartacus

Well-Known Member
May 27, 2019
685
264
63
Austin, TX
So you're wanting to do a vSAN setup for this then? What kind of use case/workload do you have?
1Gb is generally fine for average/lab usage loads as long as you dedicate one of the ports for the vSAN traffic and the other for vMotion/VM traffic.
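Dedicating ports like this comes down to tagging a separate vmkernel interface for each traffic type. A minimal sketch from the ESXi shell, assuming vmk1 and vmk2 already exist on the respective uplinks (the vmk names are placeholders):

```shell
# Tag vmk1 to carry vSAN traffic
esxcli vsan network ip add -i vmk1

# Tag vmk2 for vMotion
esxcli network ip interface tag add -i vmk2 -t VMotion

# Confirm which interfaces vSAN will use
esxcli vsan network list
```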

I know you don't want to, but $150 for a cheap 10Gb switch is my recommendation if you want 10G.

You could potentially snag a used 40gb enterprise switch off the bay too.


If you wanna get real advanced you could set up one of the nodes to act as a switch (install RouterOS on it or something similar), and use the QSFP breakout cable with SFP+ to QSFP adapters (but that's likely more cost than the switch route).


Edit: per your pictures, how are you plugging in those breakout cables? I only see 1 QSFP port on each of those nodes
 

CA_Tallguy

Thanks for the ideas and feedback. I forgot about RouterOS so will look into that. I did some quick research on pfSense switching and it sounds like it probably couldn't handle 10Gb+.

For external switches, it's a number of factors.... (1) just that the build keeps getting bigger and bigger and never ends because you don't get all the bells and whistles until you are maybe at 6 servers at two locations, (2) it's a single point of failure unless I get two, (3) an external switch solution will likely be slower 10gbe when I have 40gbe dual ports on each server node I'd prefer to make use of, (4) for my use cases, I really have no need for more than a single server except I'd like redundancy and some swap/slack space, so that takes me maybe to two + witness. And then I crept up to wanting 3 nodes instead of 2+witness. And now I'm to FOUR lol. Things are already getting ridiculous so I have to come back down to earth.

The breakout cables in my second drawing are pretty cool..... I am just learning about QSFP and I guess there are 4x 10gbe channels inside a 40gbe QSFP, so it can physically break out into 4x SFP+ cables. This seems too cool and I really want to use them somehow LOL. You can get them starting at $25 on eBay.... the transceivers are built right in like DAC cables.

dell_40ge_qsfp_-4xsf_breakout_cable_50cm_powerconnect_8000_series_no_part_num.jpg

s-l1600 (4).jpg
 

Spartacus

Ahh so there are 2x QSFP ports on each node?
That might make it doable then, but you still need a stupid amount of SFP+ to QSFP adapters, ~6 of them by my count.
I found a renewed version of that adapter that's cheaper, but I'm unsure if it will be compatible with the slots.

This one might be a secondary option if the Mellanox is incompatible; gtek makes pretty good stuff.

Those adapters are technically intended to be connected to a switch though, not a port on a node, so there's a high likelihood they won't work for that purpose.

All this also hinges on vmware seeing each of the SFP+ connections as a separate 10g port.
 

CA_Tallguy

Looks like VMware already advises the use of "stretched" topologies in a single location, for example in this document for "rack awareness".

The other thing in more recent versions is the move away from multicast to unicast, and they also seem to be saying that L3 routed networking may offer a lot more options than trying to do "L2 everywhere" switched networking.

These topologies should work for my needs, although some features are not available in a stretched topology so I may lose out on some of that. What I'd really like is to replace the witness with another node/cluster/rack.... effectively a third site. But everything I find about stretched clusters is about two site topologies with witness at third. One site/rack/cluster etc is primary and the other is failover kinda thing.
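On the unicast point: since vSAN 6.6 the hosts talk unicast and each one keeps an explicit list of its cluster peers, which is easy to sanity-check from any host's ESXi shell (standard esxcli vsan namespace commands):

```shell
# Show the unicast peer list this host knows about
esxcli vsan cluster unicastagent list

# Overall cluster membership/state from this host's perspective
esxcli vsan cluster get
```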

48c9a762-a29b-48ce-828f-1f7600a98473.png
 

CA_Tallguy

Ahh so there are 2x QSFP ports on each node?
Actually there are 2x QSFP+ AND 2x SFP+ ports. I don't have any onboard standard RJ45, so I will need to use some of them for WAN and LAN uplinks. I could possibly add in some quad port 1GBase-T NICs for uplinks, then use the 4x SFP/QSFP ports on each node to max speed for inter-node connections.

That might make it doable then, but you still need a stupid amount of SFP+ to QSFP adapters, ~6 of them by my count.
These cables all have built-in transceivers. You can assemble your own out of cables and transceivers, but it's much more cost effective to buy the cables with them built in. My thought in the second drawing under the first post was to just get TWO of the fanouts from QSFP to 4x SFP+. All the traffic for the ENTIRE CLUSTER would end up running through just one cable into a SINGLE port. A second cable on another node would provide failover. (Alternatively could use one for vSAN traffic and the other for WAN/LAN or some split, hopefully with the ability to fail everything over to just one.)

The big question about such a setup would be whether that single port/card on a single node would be effective in processing all the traffic. Not to mention whether that node's CPU is going to get bogged down switching/routing everything for all the nodes instead of distributing that workload. In practice, there won't be all that much traffic across any of the links, except they'll be maxed out under large data operations or when moving storage or VMs around.

Those adapters are technically intended to be connected to a switch though, not a port on a node, so there's a high likelihood they won't work for that purpose. All this also hinges on vmware seeing each of the SFP+ connections as a separate 10g port.
 

Spartacus

Ah I didn't realize there were SFP+ ports too (you didn't provide a whole lotta info in the first post :p at least about the equipment you have).
The only thing you're missing is connectivity from node b<->d then, and some SFP to RJ45 adapters or some add-on cards like you noted.
This all seems really overly complicated when you could pick up that $150 SFP+ switch I mentioned, or the 8 port version, so you don't have to get this convoluted.
 

Rand__

Well-Known Member
Mar 6, 2014
4,589
912
113
The breakout cables in my second drawing are pretty cool..... I am just learning about QSFP and I guess there are 4x 10gbe channels inside a 40gbe QSFP, so it can physically break out into 4x SFP+ cables. This seems too cool and I really want to use them somehow LOL. You can get them starting at $25 on eBay.... the transceivers are built right in like DAC cables.
I hope you are aware that these cannot be used to connect a QSFP NIC to 4 SFP+ NICs - they only work in a switch or with special hardware support - else only one SFP+ connection will work
 

CA_Tallguy

This was my initial plan until I started reading today that people really frown on trying to direct connect 3 nodes. I really don't get why it's a problem when the whole point of virtualization was to move away from physical equipment and vSphere is supposed to have robust networking built in.

This setup below isn't convoluted at all. It allows me to direct connect between 3 nodes and have failover..... all with just three QSFP cables and a handful of SFPs for RJ45 uplinks......

That is, if it isn't a can of worms for some unknown reasons that all the people on VMware reddit think it will be to connect more than 2 nodes....
https://www.reddit.com/r/vmware/comments/75pl2m
The main drawback with any of this stuff is that traffic is passing through more than 1 node in many cases rather than a switch. For example, vMotion from node B to C would pass through node A, unless somehow the failover link was ALSO an available direct path instead of just a failover. I don't know enough about networking to know, but maybe layer 3 routing instead of trying to do layer 2 switching could do it?

As for a separate switch -- I just can't. I'm holding the line because this is already getting to be too big of a deployment. I really wanted to just go for a 3 node like below, or perhaps 4 in a "stretched" configuration.


3node.png
 

Rand__

Ah you don't need the splitting capability then, ok :)
It might work if you can get VMware to accept vSAN traffic on multiple interfaces (doable) for the same cluster (not sure),
but it's certainly worth a try. Worst case is for you to get two SX6012's, which add another 1U to your setup but are reasonably cheap (as EMC with conversion) - the rack kit might be more expensive for them.
 

CA_Tallguy

I hope you are aware that these cannot be used to connect a QSFP NIC to 4 SFP+ NICs - they only work in a switch or with special hardware support - else only one SFP+ connection will work
I am not sure..... right now they are not a primary option I'm considering, but in the cursory research I've done, it sounded like all 40gbe actually has 4x 10gb "lanes" underneath. If that is the case then it just seemed natural that the connection could be split into these lanes, and that's why it's so easy. Nothing special about it, because QSFP is just 4x (quad) SFP+ connections underneath, so they actually get aggregated to make 40gbe? I am brand new to this but that was my working assumption.
 

CA_Tallguy

Interesting, thanks. I see some posts on Mellanox forums that seem to confirm what you are saying, at least for their CX3 and earlier products

but this is interesting on Intel forums.... looks like some of their NICs can handle it!
 

Rand__

There are also HP NICs that can do it, but out of the box most can't ...
 

CA_Tallguy

Looks like VMware does NOT recommend placing the witness on another 2 node cluster, or on another stretched cluster.


Kind of sucks.... seemed like it might be a nice solution for a number of use cases. I wonder what the issue is. In theory, wouldn't the witness be more resilient on another 2 host cluster than on a single ESXi host? Or not... because there is a lot of infrastructure that could break? Plus, if a witness goes down for a short while it doesn't take down the cluster (if I'm not mistaken), so I don't see why this is so risky.
 

CA_Tallguy

So I have caved in and purchased a MikroTik 10G switch, but I am STILL wanting to configure some direct connections between hosts.

The main thing about hardware switches is that I don't want to pay for an extra 1U at my data center. So I'm actually going to rig up the switch in an empty sled slot on my C6320! I purchased an empty sled to use as a drawer for easy access to this and possibly another switch while the system is racked. (I previously wanted to avoid this setup so I could keep the option to add another server sled in the future.)

1593633517287.png

I may cave in and add a SECOND MikroTik for redundancy, but more likely will add a 1GbE switch with more ports so I can have physical iDRAC connections as well. The cables on the back of these 3 server sleds are getting insane.... 3x QSFP 40G interconnects, 3x SFP 10G, 3x Cat6 connected between sleds and hardware switches, 3x Cat6 connected to the iDRACs, then 2 network uplinks to 2 different hardware switches, and probably an interconnect between switches.

My earlier plan would have just relied upon 3 QSFP cables and then 1 network uplink on 2 different sleds.

The MikroTik switch is pretty cool as it has dual power supply inputs and a wide input voltage range, so I am going to try to hook it up to both power supplies. In the C6300 chassis, big fat cables run 12V, and under the black plastic cover there are 4 terminals, one set for each power supply. There seem to be two of these boards (or at least two sets of cables), one under the other. But BOTH power supplies are present on BOTH boards (assuming the other board is the same), so they are NOT dedicated 1:1 to a PSU. One set of cables on each board corresponds to one PSU.

The PDU boards probably each support 2 sleds. I'll be interested to see if it's actually a duplicate board underneath the top one, or if there are just some power distribution cables down there, or if one of these is a daughter card to the other. When I was trying to update the firmware I didn't read anything about needing to do it on two separate boards.

s-l1600 (7).jpg

The other interesting thing, back on the direct connect setup, is that Mellanox cards have some HARDWARE switching capabilities. Something called an eSwitch.....

Now trying to figure out if that is available on both ConnectX-3 and ConnectX-3 Pro cards, and if it works with the normal firmware or the OFED firmware they offer.
 