2x 56GbE Optimization, slow ceph recovery, and MLNX-OS S.O.S.

jwest5637

New Member
Nov 1, 2019
I had this as a comment in another thread, but moved it here to its own thread.


I have a three-node Ceph cluster whose throughput is far lower than I'd expect, and I am trying to diagnose why.

Each node has a dual-port QSFP+ ConnectX-3 Pro. I was originally running them via ib_ipoib; in an attempt to get past the low-throughput issue, I've moved away from IPoIB and am now using them in Ethernet mode, alongside an SX6036 switch also in Ethernet mode (with the licence key installed).


My primary cause for concern is low throughput across the ConnectX-3s, to the tune of ~50-200MB/s (not Mbit) of real-world throughput to and from ramdisks on both ends, depending on the test, and ~14Gbit/s when using tools like iperf3 to avoid disk IO completely.

While I understand that even with LACP I should not expect the full combined throughput of both links (112Gbit/s), having seen closer to 50Gbit/s over a single IPoIB connection, I was expecting a lot more from Ethernet across the board (which eliminates the translation to and from InfiniBand).
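A quick way to check whether the limit is per-flow or aggregate (a sketch; the address 192.168.2.2 is a placeholder for the peer node):

```shell
# On one node, start an iperf3 server:
iperf3 -s

# From another node, a single stream tests per-flow throughput:
iperf3 -c 192.168.2.2 -t 30

# Eight parallel streams test aggregate throughput; the layer3+4
# LACP hash needs multiple distinct flows before it can use both links:
iperf3 -c 192.168.2.2 -t 30 -P 8
```

If `-P 8` roughly doubles the single-stream number, the bond is hashing correctly and the ceiling is per-flow.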



What prompted the whole situation was an upgrade to Ceph Pacific, where the long-only-partially-supported Ceph RDMA messenger appears to be totally broken, forcing my hand.
I am currently backfilling a failed OSD, but at a rate of ~20MB/s (137 days...), which does a good job of highlighting what I am dealing with, haha.

I am pretty confident the issue is not the hardware, but the configuration on either my Linux nodes or, equally likely, the SX6036 switch itself.
--> I highlight the switch because it was a struggle getting LACP working. I note the nodes because I am deeply unfamiliar with this degree of networking. This whole thing is a learning opportunity for me, but as I am sure others can relate, there are some really steep learning curves when self-teaching this stuff.


At the moment I have two mlx4_en Ethernet ports per node, running in a two-port 802.3ad (LACP) Linux bond:
Code:
Settings for bond0:
    Supported ports: [  ]
    Supported link modes:   Not reported
    Supported pause frame use: No
    Supports auto-negotiation: No
    Supported FEC modes: Not reported
    Advertised link modes:  Not reported
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Advertised FEC modes: Not reported
    Speed: 112000Mb/s <------------------------------------------- Pretty confident I should exceed current throughput.
    Duplex: Full
    Auto-negotiation: off
    Port: Other
    PHYAD: 0
    Transceiver: internal
    Link detected: yes

The keen among us will want more details on the Ceph setup, so while I am fairly confident I have a networking issue, the high-level summary is:
Three nodes, each with 8x 8TB He8 SAS drives, direct-attached via an LSI HBA in JBOD mode + backplane (man, the acronyms in this space are nuts, lol).
There is technically a fourth node, but it's not actually hosting anything Ceph-related.
Again, I am pretty sure it's a network-level issue, as testing is slow outside of Ceph too.
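For splitting Ceph out of the picture in the other direction, benchmarking the cluster directly is also useful (a sketch; the pool name `testpool` is a placeholder for a scratch pool):

```shell
# 30-second write benchmark against a scratch pool:
rados bench -p testpool 30 write --no-cleanup

# Sequential reads of the objects just written, then clean up:
rados bench -p testpool 30 seq
rados -p testpool cleanup
```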




Typical bonds look something like:
Code:
Features for bond0:
rx-checksumming: off [fixed]
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: on
    tx-tcp-mangleid-segmentation: on
    tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [requested on]
tx-esp-segmentation: off
tx-udp-segmentation: off [requested on]
tx-gso-list: off [requested on]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off
esp-tx-csum-hw-offload: off
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: on [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]

Following random optimization articles online (so a misconfig here is very likely):
Code:
auto enp65s0
iface enp65s0 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000

auto enp65s0d1
iface enp65s0d1 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000


auto bond0
iface bond0 inet static
    address 192.168.2.2/24
    mtu 9000
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up /sbin/ip link set dev bond0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000
    post-up /sbin/ip link set dev bond0 mtu 9000
    post-up /sbin/ip link set dev enp65s0 mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 mtu 9000
    post-up   iptables -t nat -A POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    post-up ifconfig bond0 192.168.2.2
    post-up /usr/sbin/ethtool -K bond0 lro on
    post-down iptables -t nat -D POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    bond-slaves enp65s0 enp65s0d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-min-links 1
    bond-lacp-rate 1
The switch is also set to 9000 MTU, with LACP enabled.
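One sanity check worth doing here: jumbo frames fail quietly if any hop disagrees on MTU. A quick end-to-end test (a sketch; 192.168.2.3 is a placeholder for a peer node):

```shell
# 9000-byte MTU minus 20 bytes (IP header) and 8 bytes (ICMP header)
# leaves an 8972-byte payload. -M do sets don't-fragment, so this
# fails loudly if any hop on the path has an MTU below 9000:
ping -M do -s 8972 -c 3 192.168.2.3
```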

Code:
#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables.
# See sysctl.conf (5) for information.
#

#kernel.domainname = example.com

# Uncomment the following to stop low-level messages on console
#kernel.printk = 3 4 1 3

##############################################################3
# Functions previously found in netbase
#

# Uncomment the next two lines to enable Spoof protection (reverse-path filter)
# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks
#net.ipv4.conf.default.rp_filter=1
#net.ipv4.conf.all.rp_filter=1

# Uncomment the next line to enable TCP/IP SYN cookies
# See http://lwn.net/Articles/277146/
# Note: This may impact IPv6 TCP sessions too
#net.ipv4.tcp_syncookies=1

# Uncomment the next line to enable packet forwarding for IPv4
net.ipv4.ip_forward=1

# Uncomment the next line to enable packet forwarding for IPv6
#  Enabling this option disables Stateless Address Autoconfiguration
#  based on Router Advertisements for this host
net.ipv6.conf.all.forwarding=1


###################################################################
# Additional settings - these settings can improve the network
# security of the host and prevent against some network attacks
# including spoofing attacks and man in the middle attacks through
# redirection. Some network environments, however, require that these
# settings are disabled so review and enable them as needed.
#
# Do not accept ICMP redirects (prevent MITM attacks)
#net.ipv4.conf.all.accept_redirects = 0
#net.ipv6.conf.all.accept_redirects = 0
# _or_
# Accept ICMP redirects only for gateways listed in our default
# gateway list (enabled by default)
# net.ipv4.conf.all.secure_redirects = 1
#
# Do not send ICMP redirects (we are not a router)
#net.ipv4.conf.all.send_redirects = 0
#
# Do not accept IP source route packets (we are not a router)
#net.ipv4.conf.all.accept_source_route = 0
#net.ipv6.conf.all.accept_source_route = 0
#
# Log Martian Packets
#net.ipv4.conf.all.log_martians = 1
#

###################################################################
# Magic system request Key
# 0=disable, 1=enable all, >1 bitmask of sysrq functions
# See https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
# for what other values do
#kernel.sysrq=438
kernel.sysrq=1
#vm.swappiness=20



net.ipv4.tcp_window_scaling = 1


vm.overcommit_memory=1
#vm.swappiness=60
#vm.vfs_cache_pressure=200

net.ipv4.ip_forward=1
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv6.conf.default.forwarding=1
#net.ipv4.conf.all.mc_forwarding=1
#net.ipv4.conf.default.mc_forwarding=1

#vm.max_map_count = 262144
#vm.dirty_writeback_centisecs = 1500
#vm.dirty_expire_centisecs = 1500

vm.overcommit_memory=1
vm.nr_hugepages = 1024


net.core.rmem_max=1677721600
net.core.rmem_default=167772160
net.core.wmem_max=1677721600
net.core.wmem_default=167772160

# set minimum size, initial size, and maximum size in bytes
net.ipv4.tcp_rmem="1024000 8738000 2147483647"
net.ipv4.tcp_wmem="1024000 8738000 2147483647"
net.ipv4.tcp_mem="1024000 8738000 2147483647"
net.ipv4.udp_mem="1024000 8738000 2147483647"

net.core.netdev_max_backlog=250000
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv4.tcp_adv_win_scale=1
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_window_scaling = 1
net.ipv6.conf.default.forwarding=1
net.ipv4.tcp_moderate_rcvbuf=1
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
kernel.sysrq = 1
net.link.lagg.0.use_flowid=1
net.link.lagg.0.lacp.lacp_strict_mode=1
net.ipv4.tcp_sack=1


Switch setup, where the issue is most likely present:
Code:
Mellanox MLNX-OS Switch Management

Password:
Last login: Sun Jul 11 13:33:25 2021 from 192.168.10.132

Mellanox Switch

switch-625810 [standalone: master] > enable
switch-625810 [standalone: master] # show running-config
##
## Running database "initial"
## Generated at 2021/07/12 01:03:16 +0000
## Hostname: switch-625810
##

##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable

##
## License keys
##
license install REDACTED (Ethernet and all the bells and whistles)

##
## Interface Ethernet configuration
##
interface port-channel 1-4
interface ethernet 1/1-1/2 speed 56000 force
interface ethernet 1/13-1/14 speed 56000 force
interface ethernet 1/27-1/30 speed 56000 force
interface ethernet 1/1-1/2 mtu 9000 force
interface ethernet 1/7-1/8 mtu 9000 force
interface ethernet 1/13-1/14 mtu 9000 force
interface ethernet 1/25-1/30 mtu 9000 force
interface port-channel 1-4 mtu 9000 force
interface ethernet 1/1-1/2 channel-group 1 mode active
interface ethernet 1/7-1/8 channel-group 4 mode active
interface ethernet 1/13-1/14 channel-group 2 mode active
interface ethernet 1/25-1/26 channel-group 3 mode active

##
## LAG configuration
##
lacp
interface port-channel 1-4 lacp-individual enable force
port-channel load-balance ethernet source-destination-port
interface ethernet 1/1-1/2 lacp rate fast
interface ethernet 1/7-1/8 lacp rate fast
interface ethernet 1/13-1/14 lacp rate fast
interface ethernet 1/25-1/26 lacp rate fast

##
## STP configuration
##
no spanning-tree

##
## L3 configuration
##
ip routing vrf default


##
## DCBX PFC configuration
##
dcb priority-flow-control priority 3 enable
dcb priority-flow-control priority 4 enable
interface ethernet 1/1-1/36 no dcb priority-flow-control mode on force


##
## LLDP configuration
##
lldp

##
## IGMP Snooping configuration
##
ip igmp snooping proxy reporting
ip igmp snooping
vlan 1 ip igmp snooping querier

##
## PIM configuration
##
protocol pim

##
## IP Multicast router configuration
##
ip multicast-routing

##
## Network interface configuration
##
interface mgmt0 ip address REDACTED /16

##
## Local user account configuration
##
username admin password REDACTED

##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********


##
## Network management configuration
##
# web proxy auth basic password ********

##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID REDACTED
# (public-cert config omitted since private-key config is hidden)

##
## Persistent prefix mode setting
##
cli default prefix-modes enable
Any chance some kind soul can point out my mistake, or point me in the right direction?

Edit: Redacting stuff that shouldn't be public.
 

Fallen Kell

Member
Mar 10, 2020
Well, for one thing, a dual-port ConnectX-3 or ConnectX-4 card will not have the PCIe bandwidth to use both ports. A PCIe 3.0 x8 link is limited to 64Gbit/s raw (less once you add protocol overhead; this is why the FDR speed of 56Gbit/s existed: factoring in bus overhead, it was the fastest achievable rate). Dual-port cards only exist for network redundancy (i.e. connecting to two different switches so that if a switch fails you are still connected) or for creating star-style networks wherein the number of hops to reach a different computer is shorter over one port than the other.

Only the ConnectX-5 cards with an x16 link will be able to reach 100Gbit/s of speed (and only on a single port).
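The arithmetic behind that, as a rough sketch (the ~12% protocol-overhead figure is an assumption for illustration; real overhead depends on TLP payload size):

```python
# Raw PCIe 3.0 bandwidth available to an x8 card, before protocol overhead.
GT_PER_LANE = 8.0        # PCIe 3.0: 8 giga-transfers/s per lane
ENCODING = 128 / 130     # 128b/130b line encoding
LANES = 8

raw_gbit = GT_PER_LANE * LANES * ENCODING
print(f"PCIe 3.0 x8 raw: {raw_gbit:.1f} Gbit/s")   # ~63.0 Gbit/s

# TLP headers and flow control typically cost another 10-15% (assumed
# 12% here), which lands right around FDR InfiniBand's ~54.5 Gbit/s
# usable data rate -- hence 56G FDR being a sensible match for x8.
usable_gbit = raw_gbit * 0.88
print(f"Rough usable estimate: {usable_gbit:.1f} Gbit/s")
```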
 

necr

Member
Dec 27, 2017
I wouldn't expect great performance on a NAT'ted interface. Did you have the same on IPoIB? Also, what are your buffers set to (`ethtool -g`)? The bond config seems correct on both sides, but I'd start without it while debugging.
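For reference, checking and raising the ring buffers looks like this (a sketch; the interface name is taken from the config above, and the right values are whatever `ethtool -g` reports as maximums, not necessarily 8192):

```shell
# Show current and maximum RX/TX ring sizes for one slave interface:
ethtool -g enp65s0

# Raise them toward the reported "Pre-set maximums" (8192 is a common
# mlx4 maximum, but use whatever your firmware actually reports):
ethtool -G enp65s0 rx 8192 tx 8192
```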
 

jwest5637

New Member
Nov 1, 2019
Well, for one thing, a dual port connectx3 or connectx4 card will not have the PCI-E bandwidth to use both ports. A PCI-E 3.0 8x link is limited to 64gbits per second (less when adding in the overhead of the protocol). Only the connectx5 cards with 16x link will be able to reach 100gbits of speed (and only for a single port).
Hey, that's a really great catch, thank you for highlighting it.

I'll be swapping half the ports back to IPoIB for further testing, and will post some benchmarks in the next day or so while debugging further!

This is because of the PCIe saturation point above, to avoid the bonding during testing, and so I have side-by-side comparisons with IPoIB. It takes a bit to swap everything, as I am still learning, but I will report back before too long.
 

tsteine

Member
May 15, 2019
Dual port cards only exist for network redundancy (i.e. connecting to two different switches so that if a switch fails you are still connected) or for creating star style networks where-in the number of hops to reach a different computer is shorter over one port than the other.

Only the connectx5 cards with 16x link will be able to reach 100gbits of speed (and only for a single port).
This is a somewhat broad and sweeping statement. Redundancy is not the *only* reason you might want a dual-port card.
The ConnectX-5 Ex EN cards run in PCIe 4.0 x16 slots and allow you to saturate 100Gbit on both ports simultaneously.

If you were to run a multi-chassis load-balanced LAG with something like an extremely powerful VPP software router, there is no reason you couldn't hit >100Gbit speeds from a single ConnectX-5 card using both ports.
 

Fallen Kell

Member
Mar 10, 2020
This is a somewhat broad and sweeping statement. Redundancy is not the *only* reason why you might want a dual port card.
The ConnectX-5 EX EN cards run PCIe 4.0 16x slots and allow you to saturate 100gbit on both ports simultaneously.
Yes, at Gen 4.0 PCIe x16 you can run both ports, but not all ConnectX-5 cards are PCIe 4.0; in fact only the specific Ex EN card is Gen 4.0. All other ConnectX-5 cards are Gen 3.0, and as such my statement stands. And I also didn't say that "redundancy is the *only* reason"; it was "redundancy or creating star-style networks to reduce the number of hops to reach another computer".

Again, my point is that until very recent cards, there simply wasn't the bandwidth to run 100Gbit. Since this thread is about ConnectX-3, I was pointing out that there isn't enough PCIe bandwidth to run at 100Gbit even though the card has two ports. A surface reading of the specs leads you to believe "wow, it has 2x 56Gbit ports, I should be able to LAG them together and get over 100Gbit", when the truth is you can't: it will only support ~56Gbit, because that is all the PCIe bus can supply to those cards in a 3.0 x8 slot.
 

tsteine

Member
May 15, 2019
@Fallen Kell, since we are going to engage in pedantry, I should note that I specifically said "Ex EN" and not just "EN" precisely because the "Ex" naming convention for Mellanox ConnectX-5 cards specifically refers to the card being PCIe 4.0 capable.

On the topic of PCIe back pressure and bandwidth we are in complete agreement.
 

MichalPL

Member
Feb 10, 2019
I haven't maxed out 100GbE yet, but here is what I have seen:

ConnectX-3 40/56 (PCIe 3.0 x8) with LACP can do about 7000MB/s in real life when copying files to an NVMe RAID or a single PCIe 4.0 NVMe drive.

ConnectX-3 40/56 (PCIe 3.0 x8) without LACP: 4420MB/s over a single Ethernet port at 40G.

ConnectX-4 (PCIe 3.0 x16) can do ~11000MB/s (when I tested it, it was only slightly more than 7000, but that was an NVMe limit).
 

Rand__

Well-Known Member
Mar 6, 2014
Just keep in mind that we are talking about aggregated bandwidth here: with sufficient threads, the X3s can max out a single port or even the PCIe link, but for single threads you are limited to sub-20Gbit.

Coincidentally, the 14Gbit you see is in line with what my experiments have given for a single connection as well...

Now, I have not played with Ceph in a long time, since it was just darn slow (despite using all NVMe), so I am not sure how the 8x8 drives will be accessed (RAID 0, 1, 5 equivalent?), but I would not think the network is the culprit behind the sync only running at 20MB/s...
You can test that simply by running a file transfer on top of the running sync (ideally not to the same disk/set); if it impacts the sync speed, the network is the limit; if not, then it's not the cause...
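That test can be done without touching Ceph at all, e.g. (a sketch; the target address is a placeholder for a peer node on the cluster network):

```shell
# While the backfill is running, push 4GB of zeros between the same
# two nodes over the cluster network and note the throughput:
dd if=/dev/zero bs=1M count=4096 | ssh 192.168.2.3 'cat > /dev/null'

# If this transfer is fast and the backfill rate does not drop,
# the network is not what's limiting the 20MB/s recovery.
```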