2x 56GbE Optimization, slow ceph recovery, and MLNX-OS S.O.S.

jwest5637

New Member
Nov 1, 2019
I had this as a comment in another thread, but moved it here to its own thread.


I have a three-node Ceph cluster whose throughput is far lower than I'd expect, and I am trying to diagnose why.

Each node has a dual-port QSFP+ ConnectX-3 Pro. I was originally running them via ib_ipoib; in an attempt to get past the low-throughput issue, I've moved away from IPoIB and am now using them in Ethernet mode, alongside an SX6036 switch also in Ethernet mode (with the licence key installed).


My primary cause for concern is low throughput across the ConnectX-3s, to the tune of ~50-200MB/s (not Mbit) of real-world throughput to and from ramdisks on both ends, depending on the test, and ~14Gbit/s when using tools like iperf3 to avoid disk IO completely.

While I understand that even with LACP I should not expect the full combined throughput of both links (112Gbit/s), having seen closer to 50Gbit/s over a single IPoIB connection, I was expecting a lot more from Ethernet across the board (which eliminates the translation to and from InfiniBand).
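A quick way to check whether the limit is per-flow or aggregate (a sketch; the address 192.168.2.2 is a placeholder for the peer node):

```shell
# On one node, start an iperf3 server:
iperf3 -s

# From another node, a single stream tests per-flow throughput:
iperf3 -c 192.168.2.2 -t 30

# Eight parallel streams test aggregate throughput; the layer3+4
# LACP hash needs multiple distinct flows before it can use both links:
iperf3 -c 192.168.2.2 -t 30 -P 8
```

If `-P 8` roughly doubles the single-stream number, the bond is hashing correctly and the ceiling is per-flow.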



What prompted the whole situation was an upgrade to Ceph Pacific, where the long-only-partially-supported Ceph RDMA messenger appears to be totally broken, forcing my hand.
I am currently backfilling a failed OSD, but at a rate of ~20MB/s (137 days...), which does a good job of highlighting what I am dealing with, haha.

I am pretty confident the issue is not the hardware, but the configuration on either my Linux nodes or, equally likely, the SX6036 switch itself.
--> I highlight the switch because it was a struggle getting LACP working. I note the nodes because I am deeply unfamiliar with this degree of networking. This whole thing is a learning opportunity for me, but as I am sure others can relate, there are some really steep learning curves when self-teaching this stuff.


At the moment I have two mlx4_en Ethernet ports per node, running in a two-port 802.3ad (LACP) Linux bond:
Code:
Settings for bond0:
    Supported ports: [  ]
    Supported link modes:   Not reported
    Supported pause frame use: No
    Supports auto-negotiation: No
    Supported FEC modes: Not reported
    Advertised link modes:  Not reported
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Advertised FEC modes: Not reported
    Speed: 112000Mb/s <------------------------------------------- Pretty confident I should exceed current throughput.
    Duplex: Full
    Auto-negotiation: off
    Port: Other
    PHYAD: 0
    Transceiver: internal
    Link detected: yes

The keen among us will want more details on the Ceph setup, so while I am fairly confident I have a networking issue, the high-level summary is:
Three nodes, each with 8x 8TB He8 SAS drives, direct-attached via an LSI HBA in JBOD mode + backplane (man, the acronyms in this space are nuts, lol).
There is technically a fourth node, but it's not actually hosting anything Ceph-related.
Again, I am pretty sure it's a network-level issue, as testing is slow outside of Ceph too.
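For splitting Ceph out of the picture in the other direction, benchmarking the cluster directly is also useful (a sketch; the pool name `testpool` is a placeholder for a scratch pool):

```shell
# 30-second write benchmark against a scratch pool:
rados bench -p testpool 30 write --no-cleanup

# Sequential reads of the objects just written, then clean up:
rados bench -p testpool 30 seq
rados -p testpool cleanup
```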




Typical bonds look something like:
Code:
Features for bond0:
rx-checksumming: off [fixed]
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: on
    tx-tcp-mangleid-segmentation: on
    tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [requested on]
tx-esp-segmentation: off
tx-udp-segmentation: off [requested on]
tx-gso-list: off [requested on]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off
esp-tx-csum-hw-offload: off
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: on [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]

Following random optimization articles online (so a misconfig here is very likely):
Code:
auto enp65s0
iface enp65s0 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000

auto enp65s0d1
iface enp65s0d1 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000


auto bond0
iface bond0 inet static
    address 192.168.2.2/24
    mtu 9000
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up /sbin/ip link set dev bond0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000
    post-up /sbin/ip link set dev bond0 mtu 9000
    post-up /sbin/ip link set dev enp65s0 mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 mtu 9000
    post-up   iptables -t nat -A POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    post-up ifconfig bond0 192.168.2.2
    post-up /usr/sbin/ethtool -K bond0 lro on
    post-down iptables -t nat -D POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    bond-slaves enp65s0 enp65s0d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-min-links 1
    bond-lacp-rate 1
The switch is also set to 9000 MTU, with LACP enabled.
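One sanity check worth doing here: jumbo frames fail quietly if any hop disagrees on MTU. A quick end-to-end test (a sketch; 192.168.2.3 is a placeholder for a peer node):

```shell
# 9000-byte MTU minus 20 bytes (IP header) and 8 bytes (ICMP header)
# leaves an 8972-byte payload. -M do sets don't-fragment, so this
# fails loudly if any hop on the path has an MTU below 9000:
ping -M do -s 8972 -c 3 192.168.2.3
```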

Code:
#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables.
# See sysctl.conf (5) for information.
#

#kernel.domainname = example.com

# Uncomment the following to stop low-level messages on console
#kernel.printk = 3 4 1 3

##############################################################3
# Functions previously found in netbase
#

# Uncomment the next two lines to enable Spoof protection (reverse-path filter)
# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks
#net.ipv4.conf.default.rp_filter=1
#net.ipv4.conf.all.rp_filter=1

# Uncomment the next line to enable TCP/IP SYN cookies
# See http://lwn.net/Articles/277146/
# Note: This may impact IPv6 TCP sessions too
#net.ipv4.tcp_syncookies=1

# Uncomment the next line to enable packet forwarding for IPv4
net.ipv4.ip_forward=1

# Uncomment the next line to enable packet forwarding for IPv6
#  Enabling this option disables Stateless Address Autoconfiguration
#  based on Router Advertisements for this host
net.ipv6.conf.all.forwarding=1


###################################################################
# Additional settings - these settings can improve the network
# security of the host and prevent against some network attacks
# including spoofing attacks and man in the middle attacks through
# redirection. Some network environments, however, require that these
# settings are disabled so review and enable them as needed.
#
# Do not accept ICMP redirects (prevent MITM attacks)
#net.ipv4.conf.all.accept_redirects = 0
#net.ipv6.conf.all.accept_redirects = 0
# _or_
# Accept ICMP redirects only for gateways listed in our default
# gateway list (enabled by default)
# net.ipv4.conf.all.secure_redirects = 1
#
# Do not send ICMP redirects (we are not a router)
#net.ipv4.conf.all.send_redirects = 0
#
# Do not accept IP source route packets (we are not a router)
#net.ipv4.conf.all.accept_source_route = 0
#net.ipv6.conf.all.accept_source_route = 0
#
# Log Martian Packets
#net.ipv4.conf.all.log_martians = 1
#

###################################################################
# Magic system request Key
# 0=disable, 1=enable all, >1 bitmask of sysrq functions
# See https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
# for what other values do
#kernel.sysrq=438
kernel.sysrq=1
#vm.swappiness=20



net.ipv4.tcp_window_scaling = 1


vm.overcommit_memory=1
#vm.swappiness=60
#vm.vfs_cache_pressure=200

net.ipv4.ip_forward=1
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv6.conf.default.forwarding=1
#net.ipv4.conf.all.mc_forwarding=1
#net.ipv4.conf.default.mc_forwarding=1

#vm.max_map_count = 262144
#vm.dirty_writeback_centisecs = 1500
#vm.dirty_expire_centisecs = 1500

vm.overcommit_memory=1
vm.nr_hugepages = 1024


net.core.rmem_max=1677721600
net.core.rmem_default=167772160
net.core.wmem_max=1677721600
net.core.wmem_default=167772160

# set minimum size, initial size, and maximum size in bytes
net.ipv4.tcp_rmem="1024000 8738000 2147483647"
net.ipv4.tcp_wmem="1024000 8738000 2147483647"
net.ipv4.tcp_mem="1024000 8738000 2147483647"
net.ipv4.udp_mem="1024000 8738000 2147483647"

net.core.netdev_max_backlog=250000
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv4.tcp_adv_win_scale=1
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_window_scaling = 1
net.ipv6.conf.default.forwarding=1
net.ipv4.tcp_moderate_rcvbuf=1
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
kernel.sysrq = 1
net.link.lagg.0.use_flowid=1
net.link.lagg.0.lacp.lacp_strict_mode=1
net.ipv4.tcp_sack=1


Switch setup, where the issue is most likely present:
Code:
Mellanox MLNX-OS Switch Management

Password:
Last login: Sun Jul 11 13:33:25 2021 from 192.168.10.132

Mellanox Switch

switch-625810 [standalone: master] > enable
switch-625810 [standalone: master] # show running-config
##
## Running database "initial"
## Generated at 2021/07/12 01:03:16 +0000
## Hostname: switch-625810
##

##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable

##
## License keys
##
license install REDACTED (Ethernet and all the bells and whistles)

##
## Interface Ethernet configuration
##
interface port-channel 1-4
interface ethernet 1/1-1/2 speed 56000 force
interface ethernet 1/13-1/14 speed 56000 force
interface ethernet 1/27-1/30 speed 56000 force
interface ethernet 1/1-1/2 mtu 9000 force
interface ethernet 1/7-1/8 mtu 9000 force
interface ethernet 1/13-1/14 mtu 9000 force
interface ethernet 1/25-1/30 mtu 9000 force
interface port-channel 1-4 mtu 9000 force
interface ethernet 1/1-1/2 channel-group 1 mode active
interface ethernet 1/7-1/8 channel-group 4 mode active
interface ethernet 1/13-1/14 channel-group 2 mode active
interface ethernet 1/25-1/26 channel-group 3 mode active

##
## LAG configuration
##
lacp
interface port-channel 1-4 lacp-individual enable force
port-channel load-balance ethernet source-destination-port
interface ethernet 1/1-1/2 lacp rate fast
interface ethernet 1/7-1/8 lacp rate fast
interface ethernet 1/13-1/14 lacp rate fast
interface ethernet 1/25-1/26 lacp rate fast

##
## STP configuration
##
no spanning-tree

##
## L3 configuration
##
ip routing vrf default


##
## DCBX PFC configuration
##
dcb priority-flow-control priority 3 enable
dcb priority-flow-control priority 4 enable
interface ethernet 1/1-1/36 no dcb priority-flow-control mode on force


##
## LLDP configuration
##
lldp

##
## IGMP Snooping configuration
##
ip igmp snooping proxy reporting
ip igmp snooping
vlan 1 ip igmp snooping querier

##
## PIM configuration
##
protocol pim

##
## IP Multicast router configuration
##
ip multicast-routing

##
## Network interface configuration
##
interface mgmt0 ip address REDACTED /16

##
## Local user account configuration
##
username admin password REDACTED

##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********


##
## Network management configuration
##
# web proxy auth basic password ********

##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID REDACTED
# (public-cert config omitted since private-key config is hidden)

##
## Persistent prefix mode setting
##
cli default prefix-modes enable
Any chance some kind soul can point out my mistake, or point me in the right direction?

Edit: Redacting stuff that shouldn't be public.
 

Fallen Kell

Member
Mar 10, 2020
Well, for one thing, a dual-port ConnectX-3 or ConnectX-4 card will not have the PCIe bandwidth to use both ports. A PCIe 3.0 x8 link is limited to 64Gbit/s raw (less once you add protocol overhead; this is why the FDR speed of 56Gbit/s existed: factoring in bus overhead, it was the fastest achievable rate). Dual-port cards only exist for network redundancy (i.e. connecting to two different switches so that if a switch fails you are still connected) or for creating star-style networks wherein the number of hops to reach a different computer is shorter over one port than the other.

Only the ConnectX-5 cards with an x16 link will be able to reach 100Gbit/s of speed (and only on a single port).
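The arithmetic behind that, as a rough sketch (the ~12% protocol-overhead figure is an assumption for illustration; real overhead depends on TLP payload size):

```python
# Raw PCIe 3.0 bandwidth available to an x8 card, before protocol overhead.
GT_PER_LANE = 8.0        # PCIe 3.0: 8 giga-transfers/s per lane
ENCODING = 128 / 130     # 128b/130b line encoding
LANES = 8

raw_gbit = GT_PER_LANE * LANES * ENCODING
print(f"PCIe 3.0 x8 raw: {raw_gbit:.1f} Gbit/s")   # ~63.0 Gbit/s

# TLP headers and flow control typically cost another 10-15% (assumed
# 12% here), which lands right around FDR InfiniBand's ~54.5 Gbit/s
# usable data rate -- hence 56G FDR being a sensible match for x8.
usable_gbit = raw_gbit * 0.88
print(f"Rough usable estimate: {usable_gbit:.1f} Gbit/s")
```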
 

necr

Member
Dec 27, 2017
I wouldn't expect great performance on a NAT'ted interface. Did you have the same on IPoIB? Also, what are your buffers set to (`ethtool -g`)? The bond config seems correct on both sides, but I'd start without it while debugging.
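For reference, checking and raising the ring buffers looks like this (a sketch; the interface name is taken from the config above, and the right values are whatever `ethtool -g` reports as maximums, not necessarily 8192):

```shell
# Show current and maximum RX/TX ring sizes for one slave interface:
ethtool -g enp65s0

# Raise them toward the reported "Pre-set maximums" (8192 is a common
# mlx4 maximum, but use whatever your firmware actually reports):
ethtool -G enp65s0 rx 8192 tx 8192
```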
 

jwest5637

New Member
Nov 1, 2019
Well, for one thing, a dual port connectx3 or connectx4 card will not have the PCI-E bandwidth to use both ports. A PCI-E 3.0 8x link is limited to 64gbits per second (less when adding in the overhead of the protocol). Only the connectx5 cards with 16x link will be able to reach 100gbits of speed (and only for a single port).
Hey, that's a really great catch, thank you for highlighting it.

I'll be swapping half the ports back to IPoIB for further testing, and will post some benchmarks in the next day or so while debugging further!

This is because of the PCIe saturation point above, to avoid the bonding during testing, and so I have side-by-side comparisons with IPoIB. It takes a bit to swap everything, as I am still learning, but I will report back before too long.
 

tsteine

Member
May 15, 2019
Dual port cards only exist for network redundancy (i.e. connecting to two different switches so that if a switch fails you are still connected) or for creating star style networks where-in the number of hops to reach a different computer is shorter over one port than the other.

Only the connectx5 cards with 16x link will be able to reach 100gbits of speed (and only for a single port).
This is a somewhat broad and sweeping statement. Redundancy is not the *only* reason you might want a dual-port card.
The ConnectX-5 Ex EN cards run in PCIe 4.0 x16 slots and allow you to saturate 100Gbit on both ports simultaneously.

If you were to run a multi-chassis load-balanced LAG with something like an extremely powerful VPP software router, there is no reason you couldn't hit >100Gbit speeds from a single ConnectX-5 card using both ports.
 

Fallen Kell

Member
Mar 10, 2020
This is a somewhat broad and sweeping statement. Redundancy is not the *only* reason why you might want a dual port card.
The ConnectX-5 EX EN cards run PCIe 4.0 16x slots and allow you to saturate 100gbit on both ports simultaneously.
Yes, at Gen 4.0 PCIe x16 you can run both ports, but not all ConnectX-5 cards are PCIe 4.0; in fact only the specific Ex EN card is Gen 4.0. All other ConnectX-5 cards are Gen 3.0, and as such my statement stands. And I also didn't say that "redundancy is the *only* reason"; it was "redundancy or creating star-style networks to reduce the number of hops to reach another computer".

Again, my point is that until very recent cards, there simply wasn't the bandwidth to run 100Gbit. Since this thread is about ConnectX-3, I was pointing out that there isn't enough PCIe bandwidth to run at 100Gbit even though the card has two ports. A surface reading of the specs leads you to believe "wow, it has 2x 56Gbit ports, I should be able to LAG them together and get over 100Gbit", when the truth is you can't: it will only support ~56Gbit, because that is all the PCIe bus can supply to those cards in a 3.0 x8 slot.
 

tsteine

Member
May 15, 2019
@Fallen Kell, since we are going to engage in pedantry, I should note that I specifically said "Ex EN" and not just "EN" precisely because the "Ex" naming convention for Mellanox ConnectX-5 cards specifically refers to the card being PCIe 4.0 capable.

On the topic of PCIe back pressure and bandwidth we are in complete agreement.
 

MichalPL

Member
Feb 10, 2019
I haven't maxed out 100GbE yet, but here is what I have seen:

ConnectX-3 40/56 (PCIe 3.0 x8) with LACP can do about 7000MB/s in real life when copying files to an NVMe RAID or a single PCIe 4.0 NVMe drive.

ConnectX-3 40/56 (PCIe 3.0 x8) without LACP: 4420MB/s over a single Ethernet port at 40G.

ConnectX-4 (PCIe 3.0 x16) can do ~11000MB/s (when I tested it, it was only slightly more than 7000, but that was an NVMe limit).
 

Rand__

Well-Known Member
Mar 6, 2014
Just keep in mind that we are talking about aggregated bandwidth here: with sufficient threads, the X3s can max out a single port or even the PCIe link, but for single threads you are limited to sub-20Gbit.

Coincidentally, the 14Gbit you see is in line with what my experiments have given for a single connection as well...

Now, I have not played with Ceph in a long time, since it was just darn slow (despite using all NVMe), so I am not sure how the 8x8 drives will be accessed (RAID 0, 1, 5 equivalent?), but I would not think the network is the culprit behind the sync only running at 20MB/s...
You can test that simply by running a file transfer on top of the running sync (ideally not to the same disk/set); if it impacts the sync speed, the network is the limit; if not, then it's not the cause...
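That test can be done without touching Ceph at all, e.g. (a sketch; the target address is a placeholder for a peer node on the cluster network):

```shell
# While the backfill is running, push 4GB of zeros between the same
# two nodes over the cluster network and note the throughput:
dd if=/dev/zero bs=1M count=4096 | ssh 192.168.2.3 'cat > /dev/null'

# If this transfer is fast and the backfill rate does not drop,
# the network is not what's limiting the 20MB/s recovery.
```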