So I have this Mellanox ConnectX-3 and ESXi 6.7U1 and I want to tune ..

i386 · Jan 6, 2019

Rand__ said:
Those have CX3 level connectors, not the smoother CX4/5 level ones, thats why I don't think they are cx4. Also very very cheap for cx4.

These are first generation connectx devices, from 2009.

svtkobra7 · Jan 6, 2019

Rand__ said:
No, the weirdness of the problem which might be one of my weird problems

Rand__ said:
Individually or all? I think it was just one which was not working properly.

Sorry I wasn't more clear, I didn't test 40G at each and every change like I did 10G.
Well I was dropping packets on 10G, but more so on 40G. Further, 40G speeds stunk.
I think I was missing the boat a bit on "tuning," I took a step back to reanalyze my approach, tested at the hypervisor level and had no problem achieving nearly line speed at 10G and ~33-34G / 40G.
But when I would run a test for 60 seconds, immediately run it again, and it would drop from ~33-34G to 14G, I realized where my problem was, and it was as expected, in ESXi.
I think I've gotten to where 10G runs at nearly line speed without dropping a single packet (as it should) and I'm working on 40G now which is dropping fewer packets as well.
Back to my point => your tunables in FreeNAS could be 100% correct or the most asinine values ever conceived and it wouldn't matter if your memory settings at the hypervisor level weren't correct (I had to bump a lot of the "system => advanced settings" and dive into esxcli to optimize).

Rand__ said:
Hm yeah does not look as easy as it should.

VMware is a complete PITA lately. Granted I don't have your background so it takes me 100x longer to learn something / troubleshoot / and type long messages about it , but I feel like ESXi adds so much complexity when it comes to root cause analysis and if I was on bare metal, the issue would have been remedied some time ago.
Additionally, I think I've recently realized that you can only optimize for one objective: (a) cost efficiency or (b) stability / maximum uptime. If you go for (a) that's great, as I certainly did, but when you spend 25 bucks on gen n - 2 tech, be prepared to spend dozens of hours trying to make your X-3 work with the latest release of ESXi. Regarding (b), you mention 40G in a bit, so I should point out that you can deploy 100G no problem and you can pay 10x as much, but also know you won't have any compatibility issues.

Rand__ said:
Got it, no worries. Just a case of me being a knowitall and commenting on obvious stuff

There is a bit of an interesting point here, though.
How many people may just run iperf (which doesn't show packet loss directly with the most often referenced commands), get a good enough speed, and call it a day?
I think there are three pieces to this puzzle: (a) iperf3 (as it shows retransmissions and iperf does not); (b) the iperf/iperf3 -u switch (UDP), which directly shows packet loss; and finally (c) monitoring esxtop for actual packet loss (as packet loss shown in iperf could simply be a reporting issue). Specific to iperf3 there is a bug with how -u loss is shown (as an example).
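To make pieces (a) and (b) a bit more concrete, here is a minimal sketch of pulling the loss indicators out of an `iperf3 -J` (JSON) report in Python. The field names used (`end.sum_sent.retransmits` for TCP, `end.sum.lost_packets` / `lost_percent` for UDP) are what I'd expect from iperf3's JSON output, but treat them as an assumption and check them against the version you run. The function name and the sample report are made up for illustration, and none of this replaces checking esxtop.

```python
import json


def summarize_iperf3(report: dict) -> dict:
    """Collect loss indicators from a parsed `iperf3 -J` report.

    Assumed layout (verify against your iperf3 version):
      TCP: report["end"]["sum_sent"]["retransmits"]
      UDP: report["end"]["sum"]["lost_packets"], ["packets"], ["lost_percent"]
    """
    end = report.get("end", {})
    summary = {}

    # TCP runs: retransmits are reported in the sender-side summary.
    sent = end.get("sum_sent", {})
    if "retransmits" in sent:
        summary["retransmits"] = sent["retransmits"]

    # UDP runs (-u): lost and total datagrams are reported in the "sum" summary.
    udp = end.get("sum", {})
    if "lost_packets" in udp:
        summary["lost_packets"] = udp["lost_packets"]
        summary["packets"] = udp.get("packets")
        summary["lost_percent"] = udp.get("lost_percent")

    return summary


if __name__ == "__main__":
    # Hypothetical, heavily trimmed report; real reports carry many more fields.
    sample = json.loads('{"end": {"sum_sent": {"retransmits": 12}}}')
    print(summarize_iperf3(sample))
```

A run with a non-zero retransmit or loss count is only a hint to go look at the hypervisor counters, not proof of where the packets went missing.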

Rand__ said:
Lots of (relatively) cheap 40G options out there nowadays
Glad you are getting there

May I ask the dumb question: What does 40G or 100G actually get you? You can saturate 10G on ZFS without having to sell a kidney; however, I think making it to 20G would be quite impressive, difficult, and expensive.
For me, it seems that 2 x 10G is more than enough, 10G for storage and 10G for vMotion / VMs / Management is plenty with room to spare. But I don't have the most amazing lab in the world, so unless you have a cluster beyond 3 hosts, I really don't see a need for a 40G fabric. (but just because I don't see it doesn't mean it isn't there)
Thanks for the kind words.

Rand__ · Jan 6, 2019

svtkobra7 said:
There is a bit of an interesting point here, though.

How many people may just run iperf (which doesn't show packet loss directly with the most often referenced commands), get a good enough speed, and call it a day?

I think there are three pieces to this puzzle: (a) iperf3 (as it shows retransmissions and iperf does not); (b) the iperf/iperf3 -u switch (UDP), which directly shows packet loss; and finally (c) monitoring esxtop for actual packet loss (as packet loss shown in iperf could simply be a reporting issue). Specific to iperf3 there is a bug with how -u loss is shown (as an example).

Many, I assume. Most will not be measuring raw network performance though, unless they are troubleshooting or dabbling with new network cards; I agree the ones who do run it should be aware.

svtkobra7 said:
May I ask the dumb question: What does 40G or 100G actually get you? You can saturate 10G on ZFS without having to sell a kidney; however, I think making it to 20G would be quite impressive, difficult, and expensive.

For me, it seems that 2 x 10G is more than enough, 10G for storage and 10G for vMotion / VMs / Management is plenty with room to spare. But I don't have the most amazing lab in the world, so unless you have a cluster beyond 3 hosts, I really don't see a need for a 40G fabric. (but just because I don't see it doesn't mean it isn't there)

Thanks for the kind words.

You mean absolutely hard facts of advantages of 40/100 vs 10 for my personal setup?
Absolutely nothing as it is running now

What it should be though, if at some point RDMA etc is running, and if all my weird issues were to be resolved - it would allow the jump from ~1000 MB/s to 3000 MB/s as theoretical maximum of the underlying hardware (nvme).
Do I really *need* that? No

svtkobra7 · Jan 6, 2019

Rand__ said:
Many, I assume. Most will not be measuring raw network performance though, unless they are troubleshooting or dabbling with new network cards; I agree the ones who do run it should be aware.

Interesting ... I suppose you don't necessarily have to catch it on the front end (although ideally you would), because it will eventually force you to troubleshoot it.

Rand__ said:
You mean absolutely hard facts of advantages of 40/100 vs 10 for my personal setup?
Absolutely nothing as it is running now

What it should be though, if at some point RDMA etc is running, and if all my weird issues were to be resolved - it would allow the jump from ~1000 MB/s to 3000 MB/s as theoretical maximum of the underlying hardware (nvme).
Do I really *need* that? No

I know it's NVMe damn you (I named it for you "RANdOF") [jealous]
I know I know, need / want for us = need (same thing), but still 3000 MB/s = 24 Gbps. That's a bit of headroom. "Theoretical max" = famous words!
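For anyone who wants to check the MB/s to Gbps arithmetic, a quick sketch in Python (decimal units, 1 MB = 10^6 bytes and 1 Gbps = 10^9 bits/s; the function names are just mine):

```python
def mb_per_s_to_gbps(mb_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_per_s * 8 / 1000


def link_utilisation(mb_per_s: float, link_gbps: float) -> float:
    """Fraction of a link of the given speed consumed by a storage throughput."""
    return mb_per_s_to_gbps(mb_per_s) / link_gbps


if __name__ == "__main__":
    print(mb_per_s_to_gbps(3000))      # 3000 MB/s expressed in Gbps
    print(link_utilisation(3000, 40))  # share of a 40G link
    print(link_utilisation(3000, 100)) # share of a 100G link
```

This ignores protocol overhead, so the real link usage would be somewhat higher than the raw conversion suggests.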
I suppose I should have better calibrated my question, 40 / 100G + fabrics don't exist because a few home labbers started stomping their feet and demanding it. A fair % of "normal" consumers have been desiring affordable, compatible 1G+ for years, and only recently we have seen roll out of not even that, but a half (and quarter) measure of 2.5 / 5 G.
It's the DC, but what is the use case for that for enterprise? (I can google it - you don't have to answer - those are crazy speeds and I don't imagine it is simply to facilitate access to storage)

Rand__ · Jan 6, 2019

Well I assume if I were to reach 3 GB/s (if tech like NVMe RAID allowed it) I'd probably be looking for more... and more... and maybe more

And you shouldn't forget that DCs are not used by single persons like many homelabs are. So it's all about density/aggregation ...

svtkobra7 · Jan 6, 2019

Rand__ said:
Well I assume if I were to reach 3GB probably(if tech like nvme Raid allowed it) I'd be looking for more... and more... and maybe more

I'll just keep on spinning

Rand__ said:
And you shouldn't forget that DC's are not used by single persons like many homelabs are. So its all about density/aggregation ...

Understood, thus my comment suggesting they aren't the end users.
I was just trying to take your figure of 3,000 MB/s which I would imagine is fast enough for many enterprises, but yet doesn't consume but a quarter of a 100G link.
Density / aggregation makes sense. Using SMCI's 1 PB in 1U example = 13 million IOPs & 52 GB/s. So I guess you need several orders of magnitude more network capacity to support it. I knew you could fit that much density into 1U these days, but didn't realize the storage was that fast. I don't want to know what that would cost.

Rand__ · Jan 6, 2019

It's not only storage per se, but also application traffic - think large hosters, banks, insurers etc; not to mention amazon/google/chinese companies or state actors needing to sniff all traffic to ensure our safety

i386 · Jan 6, 2019

1 million IOPS @ 4k ~ 4 GByte/s
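The arithmetic behind that, as a small Python sketch. It assumes "4k" means 4096-byte I/Os and uses decimal GB (10^9 bytes); the function name is only illustrative.

```python
def iops_to_gb_per_s(iops: float, block_bytes: int = 4096) -> float:
    """Throughput in GB/s (10^9 bytes) for a given IOPS rate and I/O size."""
    return iops * block_bytes / 1e9


if __name__ == "__main__":
    print(iops_to_gb_per_s(1_000_000))   # 1 million IOPS at 4 KiB
    print(iops_to_gb_per_s(13_000_000))  # the 13 million IOPS figure quoted earlier
```

At 4 KiB per I/O, 13 million IOPS comes out at roughly 53 GB/s, which is in the same range as the 52 GB/s quoted for the 1U example above.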

fohdeesha · Jan 6, 2019

svtkobra7 said:
May I ask the dumb question: What does 40G or 100G actually get you? You can saturate 10G on ZFS without having to sell a kidney; however, I think making it to 20G would be quite impressive, difficult, and expensive.

It just takes a lot of drives. Once you give up on caring about power draw and noise a lot of fun stuff becomes possible. The ZFS array I built last year can sustain ~25gbps no problem, all spinning drives (other than slog). Don't think I spent more than $2k total. R720 with 2x MD1200's full of drives. Massive and fast, just how I like it

Rand__ · Jan 6, 2019

Nice !

And there it is again -> The "Cost - Noise|Heat|Power Draw - Performance" triangle.

svtkobra7 said:
Additionally, recently I think I've realized that you can only optimize for one objective: (a) cost efficiency or (b) stability / maximum uptime.

svtkobra7 · Jan 7, 2019

fohdeesha said:
It just takes a lot of drives. Once you give up on caring about power draw and noise a lot of fun stuff becomes possible. The ZFS array I built last year can sustain ~25gbps no problem, all spinning drives (other than slog). Don't think I spent more than $2k total. R720 with 2x MD1200's full of drives. Massive and fast, just how I like it

Specs please (not that I'm doubting ... just curious what you are running).
I spent in excess of the amount mentioned just filling up a SC826 with 10TB Easystores LOL (but I suppose with a home lab budget, or my conception of one, you can't optimize for performance and size simultaneously).
That's an impressive # that I can't conceive (when looking at my set up). Lots of smaller drives (as you noted) and forgoing redundancy?
- Sure I get that 16 drives in RAID 0 = 3,200 MB/s but that is in RAID 0.
- As soon as you add fault tolerance your pool takes a massive hit, as in RAID 5 those same 16 drives = 1,280 MB/s.
- It isn't until you add 24 more drives (40 total), that you hit 3,200 MB/s again.
- Assumes performance = 200 MB/s per drive and as calculated here: RAID Performance Calculator - WintelGuy.com
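For reference, the three figures in that list can be reproduced with the conventional write-penalty model, sketched below in Python. The RAID 5 write penalty of 4 is the usual textbook value, and the 50/50 read/write mix is an assumption inferred from the numbers (it is the mix that makes 16 drives come out at 1,280 MB/s); neither is a confirmed setting of the calculator.

```python
def raid_throughput(drives: int, per_drive_mb_s: float,
                    write_penalty: float, read_fraction: float) -> float:
    """Effective throughput under the classic write-penalty model.

    Each logical read costs 1 backend I/O; each logical write costs
    `write_penalty` backend I/Os (1 for RAID 0, 4 for RAID 5).
    """
    raw = drives * per_drive_mb_s
    write_fraction = 1.0 - read_fraction
    return raw / (read_fraction + write_penalty * write_fraction)


if __name__ == "__main__":
    # Assumed: 200 MB/s per drive, 50% reads / 50% writes.
    print(raid_throughput(16, 200, write_penalty=1, read_fraction=0.5))  # RAID 0
    print(raid_throughput(16, 200, write_penalty=4, read_fraction=0.5))  # RAID 5
    print(raid_throughput(40, 200, write_penalty=4, read_fraction=0.5))  # RAID 5, 40 drives
```

This models traditional parity RAID only; it says nothing about how RAIDZ behaves.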

fohdeesha · Jan 7, 2019

specs:

dell R720
2x e5-2643 v2 (best single-threaded performance you can get for this socket, and since NFS & SMB are single threaded under freenas, this was most important to me)
96GB RAM (had most of this laying around luckily)
optane SLOG

2x Dell MD1200
24x HGST 4TB SAS
split into 3 VDEVs which are each 8 disk RAIDZ2's, so full redundancy across the board

The performance penalty and raid performance calculator you mention are not applicable to ZFS. On oldschool raid modes like RAID5, whenever you modify data that doesn't exactly match the stripe width in size, which is almost always, it has to read the old data, compute the new parity, then write the new stripe synchronously which is incredibly slow. With ZFS RAIDZx, writes are always written a full stripe at a time asynchronously across all disks (thanks to variable stripe width), so reads and writes in ZFS RAIDZx should roughly match what you'd get reading and writing to that many disks (minus the parity disks) in one big raid0 stripe. Of course parity calculation and other things come into play, but on modern hardware that should never drop performance down to the levels you speak of

reason #432978347 I have no clue why people still use oldschool raid. more reading here - RAID-Z

Of course that's speaking of bandwidth, IOPS are another matter. (with 3 VDEVs you have about 3x the IOPS of a single drive, which in my low IOPS high transfer size application is way more than enough, but even VM's are incredibly snappy)

Code:

fio randwrite:
Run status group 0 (all jobs):
  WRITE: io=41121MB, aggrb=2466.8MB/s, minb=2466.8MB/s, maxb=2466.8MB/s, mint=16673msec, maxt=16673msec

fio randread:
Run status group 0 (all jobs):
  READ: io=41121MB, aggrb=2664.1MB/s, minb=2664.1MB/s, maxb=2664.1MB/s, mint=15437msec, maxt=15437msec

barely 150 MB/s per data disk and I know they can do more. Judging by stats it's a CPU limit at this point
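The per-data-disk figure follows from the layout above (3 VDEVs of 8-disk RAIDZ2, so 6 data disks per VDEV). A minimal Python sketch of that division, with function names chosen only for illustration:

```python
def raidz_data_disks(vdevs: int, disks_per_vdev: int, parity: int) -> int:
    """Number of disks holding data (not parity) across all RAIDZ vdevs."""
    return vdevs * (disks_per_vdev - parity)


def per_data_disk_mb_s(aggregate_mb_s: float, vdevs: int,
                       disks_per_vdev: int, parity: int) -> float:
    """Aggregate pool throughput divided evenly over the data disks."""
    return aggregate_mb_s / raidz_data_disks(vdevs, disks_per_vdev, parity)


if __name__ == "__main__":
    print(raidz_data_disks(3, 8, 2))            # data disks in the pool
    print(per_data_disk_mb_s(2466.8, 3, 8, 2))  # fio randwrite result
    print(per_data_disk_mb_s(2664.1, 3, 8, 2))  # fio randread result
```

With 18 data disks that works out to roughly 137 MB/s per disk for the write run and 148 MB/s for the read run.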

Rand__ · Jan 7, 2019

Very interesting - but that is async speed, is it not? Sync speed should be limited by the optane drive (limited as in ~600 MB/s vs 2500 MB/s)

svtkobra7 · Jan 8, 2019

Rand__ said:
Very interesting - but that is async speed is it not? sync speed should be limited by the optane drive (limited as in ~600 MB vs 2500 MB)

It must be async ...
O/c you are thinking, the "theoretical" max write speed of Optane is 2,000 MB/s, so how could he be sync write @ 467 MB/s more than that?
- Technically he could if for the dataset he tested, zfs attribute logbias was set to throughput, which wouldn't use a separate log device.
- But o/c the default is latency which would use a separate log device, and last I checked you couldn't push 2,467 MB/s sync write through a device capable of 2,000 MB/s writes.
But you are implying he should see sync write around 600 MB/s with ZFS default attribute values ... why?
- Not that I disagree and its interesting that you mention this figure as for some reason I seemed to end up right around there when benchmarking ...
- I spent a bit of time benching my Easystores in (a) RaidZ2 6x2x10.0 TB vs. (b) RaidZ 3x4x10.0 TB and if you look at the below benchmark summary, 1 x 20G 900p as SLOG = 700+ MB/s, 2 x 20G 900p as SLOG = 800+ MB/s, and 2 x 20G 900p as SLOG, mirrored = 600 MB/s.
- My 12 HGST Deskstar NAS 6 TB drives benched a bit slower without a SLOG, so when they clocked ~600 MB/s sync write, that was a higher % of async writes vs. what you see below ...
- ... I thought it a limitation of pool speed. But now that I see that number again, with a faster pool, and you having referenced 600 MB/s I think it may be something else.

Code:

Write Speed - MB/s
1MB recordsize
compression=off
                      No slog    1 Optane 900p    2 Optane 900p    2 Optane 900p
                                 20G vDisk        20G vDisks       20G vDisks - Mirrored
                      async      sync             sync             sync
RaidZ2 6x2x10.0 TB    1043       734              806              594
RaidZ 3x4x10.0 TB     1149       747              856              611

NOTE: I was really looking for raw HDD speed to compare various zpool configurations and my system was "starved" of resources (ran those on either 4GB or 8GB of RAM) so it actually gets a tad better than that thanks to cache effect.
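To put the "% of async" comparison into numbers, here is a small Python sketch that expresses each sync result from the table above as a percentage of the no-SLOG async result for the same pool layout. The dictionary keys are just labels I made up; the values are copied from the table.

```python
# Write speeds in MB/s, copied from the benchmark table above.
RESULTS = {
    "RaidZ2 6x2x10.0 TB": {"async": 1043, "1 slog": 734, "2 slog": 806, "2 slog mirrored": 594},
    "RaidZ 3x4x10.0 TB": {"async": 1149, "1 slog": 747, "2 slog": 856, "2 slog mirrored": 611},
}


def sync_as_percent_of_async(row: dict) -> dict:
    """Each sync result as a percentage of the async baseline in the same row."""
    baseline = row["async"]
    return {name: 100.0 * value / baseline
            for name, value in row.items() if name != "async"}


if __name__ == "__main__":
    for layout, row in RESULTS.items():
        for config, pct in sync_as_percent_of_async(row).items():
            print(f"{layout:20s} {config:16s} {pct:5.1f}% of async")
```

Across both layouts the sync results land somewhere between roughly half and three quarters of the async baseline.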

zxv · Jan 8, 2019

svtkobra7 said:
Sure I get that 16 drives in RAID 0 = 3,200 MB/s but that is in RAID 0.

As soon as you add fault tolerance your pool takes a massive hit, as in RAID 5 those same 16 drives = 1,280 MB/s.

It isn't until you add 24 more drives (40 total), that you hit 3,200 MB/s again.

The intuition that striping across mirrored pairs is faster than raidz1 or raidz2 does not hold in certain perspectives, especially for read bandwidth, and more surprisingly, even for rewrite bandwidth.

Here is an excerpt of old bonnie++ benchmark of large pools of 4TB drives zfs from calomel.org:
ZFS Raidz Performance, Capacity and Integrity Comparison @ Calomel.org

Code:

                               capacity   write       rewrite      read
24x 4TB, 12 striped mirrors,   45.2 TB,  w=696MB/s , rw=144MB/s , r=898MB/s 
24x 4TB, raidz,                86.4 TB,  w=567MB/s , rw=198MB/s , r=1304MB/s
24x 4TB, raidz2,               82.0 TB,  w=434MB/s , rw=189MB/s , r=1063MB/s

Writing the additional parity information in Z1 and Z2 reduces write bandwidth, but the percentage change in capacity is far greater.

This suggests that, ignoring IOPS, it's reasonable to have a huge single vdev and choose whichever redundancy you want, with no penalty in terms of read or rewrite speed, and relatively efficient capacity usage.

In my experience, to get the benefit of striping across mirrored pairs, you need far more effort on optimizing all of the elements: controller, cabling, drive selection, whereas a large single raidz2 vdev requires little tuning.

Rand__ · Jan 8, 2019

6-900 is the speed that most tests show as the typical optane slog speed regardless of the disk system behind it. I have not found out why sometimes you get 600 and sometimes more, pretty much independent from the actual pool speed (within that range which of course is not small, but still the same ballpark).
I think in the end it's a latency topic again which limits the speed, which might explain why a mirror of optanes is slower (added time to write in parallel) while a stripe of optanes might be even faster (and not even less secure, since the loss of a single striped device would only cause a rewrite of 5-10s worth of data and not a total volume loss)

Rand__ · Jan 11, 2019

If you want CX4 these might be an option:
Mellanox MCX4121A-ACAT ConnectX-4 Lx EN NIC 25GbE Dual-Port SFP28 PCIe3.0 x8 | eBay

pcie 3 x8, 25GbE should be fine for your use case

They also have the old style connector but that seems not to be an indicator as I had thought. Seems the x8 cards have that and the x16 cards have the newer one (from what I have observed without actually verifying)
