Switchless 10GbE Point-to-Point Connection between ESXi Servers [how?]

svtkobra7 · Sep 20, 2018

Myth said:
does ESX use SMB to communicate between the two servers? What protocol would it use? TCP/IP, anything under that?

There is a "default TCP/IP stack" in ESXi ... I don't know beyond that, but @Rand__ would in his sleep ...

[didn't mean to skip over your post]

Rand__ · Sep 20, 2018

I run x10's(e5)/x11's(e3) atm.

And no SMB3 on ESX, various custom protocols for different kinds of traffic types (the vlans i have are for traffic type separation, almost all are esx traffic types). O/C you can use nfs/iscsi from the higher protocol levels as well as IB, RDMA (with appropriate cards and newer versions)

Also, how many optane drives do you run now?

svtkobra7 · Sep 20, 2018

Rand__ said:
I run x10's(e5)/x11's(e3) atm.

And no SMB3 on ESX, various custom protocols for different kinds of traffic types (the vlans i have are for traffic type separation, almost all are esx traffic types). O/C you can use nfs/iscsi from the higher protocol levels as well as IB, RDMA (with appropriate cards and newer versions)

See I told you nothing to be jealous of, i.e. I'd love to have X10's (better than E5 v2s), but the cost of 512 GB of DDR4 is a lot. o/c I guess I don't really need 256 GB per server, but ZFS is quite RAM hungry.
What would 16 x 16 GB of DDR4 cost? ~$1500?

Rand__ said:
Also, how many optane drives do you run now?

ATM, only 1 900p installed in each.
I'll install the rest sitting on the sidelines as soon as I clean up this epic mess I've created with my storage pool on the primary, it wont automount due to it thinking a SLOG is still there, can't remove the SLOG (the pool is borked) ... which means I need to copy upwards of 25TB @ 1 Gbps (or hopefully 10 Gbps) to the second server, and then blow away the pool. Addition/removal of SLOG is "non-destructive" my arse!
No big deal (just need to figure out how to copy from a non-existent mount point to a mount on the other server), after all that is why I went from the 836 to 2 x 826, but it would have been nice to copy at my pleasure (not necessity) after fully testing SLOG config etc.
Me, NVMe, and ZFS = ARE NOT FRIENDS

Rand__ · Sep 20, 2018

Well prices are coming down again so yeah maybe 1500 will cover it.

Usually slog removal should not cause issues if its not part of the pool ie stripe

I assume forced removal does not help either?

svtkobra7 · Sep 20, 2018

Rand__ said:
Usually slog removal should not cause issues if its not part of the pool ie stripe I assume forced removal does not help either?

My acumen is somewhere between retarded and below average, not brain dead.
So no, I didn't accidentally add a SLOG as another vdev to the pool.
Well, I suppose it is a vdev as a log device, but you know what I mean.

Rand__ · Sep 20, 2018

Didnt think so but with that FreeNas Gui that happend to me a few times already... thankfully usually not on my main pool

svtkobra7 · Sep 26, 2018

Rand__ said:
Fat Twin and external JBOD?

Hopefully no need to do this again, but (a) wanting to get my data off that borked pool quickly and (b) tired of being limited by ~1Gbps, I figured I would complete the copy using the create your own JBOD approach, i.e.:

C1 - Integrated "HBA" => SFF-8087 Cable => C1 - SAS2-EL1 Backplane => SFF-8087 Cable => C2 SAS2-EL1 Backplane, where C1 and C2 are Chassis 1 & 2. Resulting in FreeNAS seeing both "pools" and enabling a fast copy.

O/C, I had to keep the chassis on to power the backplane, but it copied upwards of 500 MB/s, so it didn't take too long.

Now just to (a) figure out how to get the network situation under control, (b) finalize a stable SLOG solution, (c) install vCenter server, etc. LOL

Rand__ said:
so 4 10gbe is the limit you usually can get on a combo switch (24gbe + 2-4 10 gbe), else you need a 10g switch (eg netgear XS708E).

I imagine that I'd have many more options at a lower price point, if I opted out of more than 2 ports, i.e. 2x 10GbE + balance 1GbE, right? (forgoing future 10 GbE options). I'm still going to give the VLANs and direct connection a shot, but at some point, I think I may just pull the trigger on this (head will hurt from beating it against the wall).

Rand__ · Sep 26, 2018

So you just need 2 int to ext SAS cables and you are good

And nowadays the 2 10g ports options are older tech, so yes you might get them cheaper. Whether it's worth it if you can direct connect... probably not.

I'd go and look for a soft wall spot

vanfawx · Oct 2, 2018

@svtkobra7 If you import the pool with -m you should be able to "zpool remove POOLNAME DEVICENAME" to permanently remove the SLOG from your pool. You'll need to do this from the CLI. But once you've done this, your pool should import fine again without -m

svtkobra7 · Oct 3, 2018

vanfawx said:
@svtkobra7 If you import the pool with -m you should be able to "zpool remove POOLNAME DEVICENAME" to permanently remove the SLOG from your pool. You'll need to do this from the CLI. But once you've done this, your pool should import fine again without -m

Thanks for your reply, not sure how I missed it.
I'm perhaps too comfortable from CLI (vs. younger me) and as an executive summary: I'm not as smart as the average bear, being lazy and not properly slicing the slog into a manageable size = root cause and I screwed myself, and in conclusion I don't deserve NVMe SLOGs and perhaps a return to COTS appliances, with a specially ordered iteration disallowing user SSH access is in order?
Trust I put many hours into fixing the pool, getting slightly agresssive with fixes, but we were well beyond the -m flag resulting in a GUI importable and mounted pool. Thankfully, it did allow for a copy of 25TB (encrypted pool).
Joking a bit on bullet #2 ... but I've been attempting to arrive at a "stable" environment, primarily centered on SLOGGING in FreeNAS and I lazily replaced an Optane 900p 20GB v disk SLOG with a 800 GB P3700 SLOG (for bench-marking purposes). Stupid short cut, i.e. not slicing appropriately, but I was time constrained. Unfortunately size does not matter here (or maybe it does, depending on perspective), but metaslabs do and I lazily / not understanding the impact of such went from 160 to 6,400 metaslabs with that size diff. That reasonable SLOG not be removed properly resulted in the pool going bonkers trying to handle the metadata.
All is well again, all data copied off that borked pool, and I FINALLY have arrived at a STABLE SLOG config, nicely and correctly benched with and without SLOGs, after trying every combination possible. I'm testing that for stability before final implementation as FreeBSD / ESXi doesn't play nice with pass through of the 900p ATM (pass through map workaround doesn't work for me + latest patch of 6.7 with NVMe Controller used for vdisks = PSOD (thanks for the heads up @Rand__ )).

Thanks for lending the kind advice. Quick update: The 2208, er IT mode flashed 9207-8i hasn't had a single blip BTW ...

SIDEBAR: In enterprise, and what I'm accustomed to is dev => QA => prod, but you can't really call a "stable" homelab "prod" so what is the best "label" for such? My stupid question for the day.

vanfawx · Oct 3, 2018

I would personally call it "dev/test", or maybe "staging"...

And regardless of using a whole device or partition for the SLOG, you should still be able to remove it without issue. Though do you really need a SLOG? You could just set sync=disabled and save yourself an expensive NVMe device!

Regardless, I'm glad to hear your pool is back in a stable state. It's scary when your storage isn't reliable!

svtkobra7 · Oct 3, 2018

vanfawx said:
And regardless of using a whole device or partition for the SLOG, you should still be able to remove it without issue.

Trust, I don't disagree. It is supposed to be non-destructive after all, right? I've lost 2 pools (total), both times due to edge case BS related to SLOGs.

vanfawx said:
Though do you really need a SLOG? You could just set sync=disabled and save yourself an expensive NVMe device!

No, I don't need a SLOG, I need 4 for 2 servers for my storage pools + another for 2 iSCSI (so 6). Yes, I'm laughing for you. NVMe is somewhat addictive (to me) ... enough is never enough.
Everyone needs a log, but not cache. 2 striped logs per pool @ sync=always was the only way to achieve parity with sync=disabled speeds (1 = marked improvement, 2 = got me there).
To your question, no of course I don't need a single log, but I'm quite satisfied with very well in excess of 500 MB/s, sync=always (and do prefer the protection afforded, even though using a UPS, and however minor).
Cost wise the 900p has come down a bit, and I reduced my net on each by reselling the Star Citizen Sabre Raven code (@ ~$70), i.e. not as bad as you think. And then you have to factor in "extraordinary" items (sorry finance background and BS of course) such as cost savings not needing two "HBAs" ...
Fitment of 256 RAM per may have been a less meritorious build config decision than filling lonely PCIe slots with NVMe.

vanfawx said:
Regardless, I'm glad to hear your pool is back in a stable state. It's scary when your storage isn't reliable!

Your feedback is much appreciated, and helpful as always. After all in the long path here, your feedback regarding the cross flash helped get me here - so a sincere thank you.
As to the reliability, as mentioned earlier in the thread, my inclination to "tinker" / "fiddle" (coupled with lack of background / experience) is what gets me in trouble. With a complete, albeit quite large backup on site now, I sleep much better about storage integrity now (reason for moving from 1 x 826 to 2 x 826, and even though that is a downgrade in RUs per chassis, I'm space constrained.

All the best, and thanks again.

svtkobra7 · Oct 4, 2018

@Rand__ I'd love your take on the below bold, italicized remarks. Nearing the end of this dilemma. I have an interest in "leaving no stone un-turned" after it taking two blown pools to get here.

Rand__ said:
Btw, with the newest esxi hw upgrade (v14 i think) Freenas U6 complains about the nvme controller, so don't do that. Have not looked into it tbh since it does not seem to impact anything (and going back would mean rebuild):.

Soo ... where I'm going with this, I've been testing all possibilities of Optane config + FreeNAS, but in 6.7 (and on 10176752, 10/2) to land on a zPool + SLOG config that I can live with until I get the P4800x. [kidding]

Briefly, my findings, and I'll report back in a more digestable / informative manner once complete ...

6.7 + 1 Optane + pass thru map trick = no good (no boot)
6.7 + 1 Optane + RDM + NVMe Controller = no good (no boot)
6.7 + 1 or more Optanes + RDM + SCSI Controller | LSI Logic SAS for RDM (LSI Logic SAS existing controller) = good
6.7 + 1 or more Optanes + vDisk + SCSI Controller | VMware Paravirtual for VDisk (LSI Logic SAS existing controller) = no good (boot, but couldn't see disks, which I guess that makes sense).
6.7 + 1 or more Optanes + vDisk + SCSI Controller | LSI Logic SAS for VDisk (LSI Logic SAS existing controller) = good

So without introducing any further data, would you agree with the following asserts culled from your prior comments:
(1) vDisks are preferred to RDM (but essentially the same), true / false
(2) Using a vDisk + SCSI Controller | LSI Logic SAS = preferred (at least more so than NVMeController - which is moot as that controller doesn't even work in 6.7, true /false

(3) In regards to deploying this schema, which route would you take: (a) or (b) below, or (c) other?

(a) would you slice in ESXi, i.e.

nvme0 = 20GB vDisk (SLOG1 for HDD pool) + 220 GB vDisk for ISCSI (slightly smaller due to ESXi + FreeNAS boot);
nvme1 = 20 GB vDisk (SLOG2 for HDD pool) + 240 GB vDisk for ISCSI;
so I end up with two mirrored SLOGs for the HDD pool, but they are on different devices;
and I end up with 460GB of mirrored storage for iSCSI, again different devices; so
provided both devices are taking a beating and only one or the other is, I believe this is my optimal play. And if neither are getting hammered, it definitely is more performant than not striping;

(b) or would you simply attach 2 vDisks (240GB + 260GB) and slice in ZFS gpart create / add etc. to end up with the same slices?

Hopefully, those 3 questions are easy for you, but before I add a forth, let me throw you for a loop:

#1 above did briefly work (physical pass thru), but the benchmark (not a tell all o/c) shows RDM (#3) w/ ESXi overhead outperforming at higher sizes. How is that possible (rhetorical, not adding a 5th question). And this is a "clean" bench, in that it was super controlled, no other VMs, identical config, etc. The numbers are somehow lying to us, man.

6.7 - current

[using P4800x VIB]

And now to take it a step further into bizarro world, while the above was on 6.7, here is the same command executed on a 6.5 vDisk + NVMe Controller, suggesting a much more performant config @ 2,140 MB/s peak. This is higher than report on the FreeNAS forum for a data collection thread for ANY physical pass through device (it must be wrong).

6.5 - prior

[can't recall VIB, but I think it was 1.25]

So that brings me to question #4, which is hopefully also easy, which wins here ...
(a) 6.7 using LSI SAS SCSI Controller or (b) a reversion to 6.5 to use the NVMe Controller to obtain the stronger benchmarks? Which I don't see how they can be correct anyway ... I should note that as soon as I'm done testing + integrate your comments, I'm secure erasing, starting from a clean state (clean install etc) and no more tinkering, so that marginal time for a 6.5 install is no trouble.

Thanks very much for your valued feedback. I don't mean to bombard you with questions and I only ask as I think this is a personal interest of yours (my apologies if I'm being a pest).

NB1: Hopefully my thoughts weren't too disjointed and easy enough to follow
NB2: I've strayed from my normal sarcasm as I'm exhausted and don't want to be more trouble than I already am, but I believe that later graph warrants your favorite comment on the P3700, eh?

Rand__ · Oct 4, 2018

(1) vDisks are preferred to RDM (but essentially the same), true
(2) Using a vDisk + SCSI Controller | LSI Logic SAS = preferred (at least more so than NVMeController - which is moot as that controller doesn't even work in 6.7, true | while 6.7 NVMe is not working correctly in FN which I assume will change eventually

(3) In regards to deploying this schema, which route would you take: (a) or (b) below, or (c) other?
(c) - Create two identical sized vdisks and create a mirror'ed partition in FN

(4) Depends on your future plans
- don't touch it then 6.5 clearly.
- Upgrade regularly - then it does not matter as you might not have the ultimate performance all the time anyway

svtkobra7 · Oct 4, 2018

Rand__ said:
(1) vDisks are preferred to RDM (but essentially the same), true
(2) Using a vDisk + SCSI Controller | LSI Logic SAS = preferred (at least more so than NVMeController - which is moot as that controller doesn't even work in 6.7, true | while 6.7 NVMe is not working correctly in FN which I assume will change eventually

(3) In regards to deploying this schema, which route would you take: (a) or (b) below, or (c) other?
(c) - Create two identical sized vdisks and create a mirror'ed partition in FN

(4) Depends on your future plans
- don't touch it then 6.5 clearly.
- Upgrade regularly - then it does not matter as you might not have the ultimate performance all the time anyway

This is with 6.5 + 1 & also 2 16 GB vDisk slices (I didn't disregard your comments, just decided to head down a certain path much earlier than I received a rely). Feedback is much appreciated as always. Preliminary tests look booth plausible and promising.

An additional slog costs me 1% at 128k but buys me 15% at 1M. Buy it.
Ratios look in sync across the board and there are clear hard limits across the board. Buy it.
I can say I have a pool that performs at better than 500 MB/s sync write, which isn't too shabby for a raidz2 6x2x6TB pool, right?
I'm close to saturating 10 GbE @ 1M, which I think is also quite solid.
Maybe I'm second guessing myself after dozens of failures, but I finally have benchmarks that make sense and seem good to me.
The 2x is a stripe not a mirror. More to come.

dd MB/s

Rand__ · Oct 4, 2018

The most important thing is its stable and you are happy with it. Or the other way round

svtkobra7 · Oct 4, 2018

That bad hunh? I could be stable and happy with an i386 if I didn't know better existed!

Rand__ · Oct 4, 2018

Lol, nah I didn't mean it that way but I see how you can read it along these lines.
Values are good and its reasonable, so sounds fine.
O/C I'd do it differently but thats me

svtkobra7 · Oct 4, 2018

Rand__ said:
Lol, nah I didn't mean it that way but I see how you can read it along these lines.
Values are good and its reasonable, so sounds fine.
O/C I'd do it differently but thats me

Different in that 6.7 instead of 6.5. Mirror SLOG instead of stripe. And iSCSI instead of NVMe Controller?
Please don't think that I disregarded your comments, I was just very far down one path by the time I saw your reply (granted you still replied quickly and thanks).
Not finished yet, aim on day off was really just to get a "reasonable" number for a baseline and continue this new and nice feeling of stability, continue testing/learning.
We both know, I'll try everything you say as well. And we both know, that will likely be the path taken.

I didn't share the separate bench of those 2 x 240GB slices getting beaten along with the 2x SLOGs. Still validating results. Somewhat tough when you don't really know what the "right" answer is (if you will).

I don't know if it counts, but I took the blocksize to 4M and got 2GB/s+ (probably meaningless).

But the Optane v. Optane (2xSLOG stripe vs. 2xlow volblocksize zvol stripe battle was epic. "Reflexive" in a word ... aggregate throughput held well above 3GB/s. I thought that was decent for my Optane literally being drawn and quartered. I was curious where the weak link would be, low record/blocksize pool or the SLOGs (and thus the storage pool). It was really neither ... where one side of the stripe seemed slightly strained, it looked like the other side was lending a hand, I suppose like a quite rigid rubber band.

Remind me ... apples

ranges ... but your target under vSan (a tougher realm right) was 500 MB/s async and I achieved that right? I wish I had thought this through more as if I had, I could be what doubling speed with the other chassis? (theoretically)

Rand__ · Oct 4, 2018

Well i do prefer mirrors over stripes or or RAID-Z's thats true so i'd have used those - I don't have hard data to proove they are better, they are just better from a theoretical point of view for my use cases (random IO). Maybe optane alleviates this to a great part, but I have not tested it (against raidz for example). Performance will decline with use (disk filled state) on ZFS and I am far from a fresh pool. And benches are o/c not real world performance unfortunately, so you might see less in the field.

Yes 500MB/s was and is my goal, but vsan is indeed a different breed, way less performance than what could be expected. Too many variables and too little time to scientifically test all of them for me (next to the unexplainable variance one still gets [driver versions, energy saver settings, solar storms are all explanations o/c but which one was which at a certain time is hard to tell) ...

Switchless 10GbE Point-to-Point Connection between ESXi Servers [how?]

Active Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Active Member

Active Member

Active Member

Active Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Well-Known Member