I had this as a comment in another thread, but moved it here to its own thread.
I have a three-node Ceph cluster whose (presumed) slow throughput I am trying to diagnose.
Each node has a dual-port QSFP+ ConnectX-3 Pro. I was originally running them via ib_ipoib, but in an attempt to get past the low throughput issue I've moved away from ib_ipoib and am now using them in Ethernet mode, alongside an SX6036 switch that is also in Ethernet mode (with the license key).
My primary cause for concern is low throughput across the ConnectX-3s: ~50-200MB/s (not Mbit) of real-world throughput to and from ramdisks on both ends, depending on the test, and ~14Gbit/s when using tools like iperf3 to avoid disk IO completely.
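To put those numbers on one scale, here is a small Python sketch that converts them to Gbit/s and compares them with the 56Gbit/s per-port rate forced on the switch. The function names are made up for illustration, and decimal units (1 MB = 10^6 bytes) are an assumption.

```python
# Rough unit conversion for the throughput figures quoted above.
# Assumption: decimal units, 1 MB = 10**6 bytes, 1 Gbit = 10**9 bits.

LINK_GBIT = 56.0  # per-port rate forced on the switch ("speed 56000")


def mbytes_to_gbit(mbytes_per_s: float) -> float:
    """Convert MB/s to Gbit/s."""
    return mbytes_per_s * 8 / 1000


def percent_of_link(gbit_per_s: float, link_gbit: float = LINK_GBIT) -> float:
    """Share of a single port's line rate, in percent."""
    return 100 * gbit_per_s / link_gbit


if __name__ == "__main__":
    for label, gbit in [
        ("ramdisk copy, low end (50 MB/s)", mbytes_to_gbit(50)),
        ("ramdisk copy, high end (200 MB/s)", mbytes_to_gbit(200)),
        ("iperf3 (14 Gbit/s)", 14.0),
    ]:
        print(f"{label}: {gbit:.1f} Gbit/s, "
              f"{percent_of_link(gbit):.1f}% of one port")
```

Even the iperf3 figure is only a quarter of what one port should carry, and the ramdisk copies are a small fraction of that.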
While I understand that even with LACP I should not expect the full combined throughput of both connections (112Gbit/s), I had seen closer to 50Gbit/s with a single IPoIB connection, so I was expecting a lot more from Ethernet across the board (since it eliminates the translation to and from InfiniBand).
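For context on why a bond does not simply double throughput: with 802.3ad, each flow is pinned to one member link by a transmit hash, so a single TCP stream can never go faster than one port. The sketch below is a toy stand-in for a layer3+4 style policy, not the kernel's actual hash. `pick_slave`, the peer address 192.168.2.3 and the source ports are hypothetical.

```python
import ipaddress


def pick_slave(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
               n_slaves: int = 2) -> int:
    """Toy layer3+4 style hash: XOR ports and addresses, fold, take modulo.

    Illustrative only; the real bonding driver computes its own hash.
    """
    h = src_port ^ dst_port
    h ^= int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % n_slaves


if __name__ == "__main__":
    # One flow (fixed addresses and ports) always lands on the same link.
    one_flow = {pick_slave("192.168.2.2", "192.168.2.3", 50000, 5201)
                for _ in range(1000)}
    print("links used by a single flow:", sorted(one_flow))

    # Parallel flows differ in source port, so they can spread out.
    many_flows = {pick_slave("192.168.2.2", "192.168.2.3", 50000 + i, 5201)
                  for i in range(16)}
    print("links used by 16 flows:", sorted(many_flows))
```

In iperf3 terms this is the difference between a single stream and several parallel streams (`-P`): only the latter has any chance of using both ports of the bond.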
What prompted the whole situation was an upgrade to Ceph Pacific, where Ceph's RDMA support (which has long been only partially supported) appears to be totally broken, forcing my hand.
I am currently backfilling a failed OSD, but at a rate of ~20MB/s (137 days...), which does a good job of highlighting what I am dealing with, haha.
I am pretty confident the issue is not the hardware, but the configuration on either my Linux nodes or, equally likely, the SX6036 switch itself.
--> I highlight the switch because it was a struggle getting LACP working. I note the nodes because I am deeply unfamiliar with this degree of networking. This whole thing is a learning opportunity for me, but as I am confident others can relate, there are some really steep learning curves when self-teaching this stuff.
At the moment, I have two mlx4_en Ethernet ports per node, running as a 2x LACP Linux bond:
Code:
Settings for bond0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 112000Mb/s <------------------------------------------- Pretty confident I should exceed current throughput.
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
The keen among us will want more details on the Ceph setup, so while I am fairly confident I have a networking issue, the high-level summary is:
Three nodes, each with 8x 8TB He8 SAS drives, direct-attached via a JBOD LSI HBA + backplane (man, the acronyms in this space are nuts, lol).
There is technically a fourth node, but it's not actually hosting Ceph-related stuff.
Again, I am pretty sure it's a network-level issue, as testing is slow outside of Ceph too.
The offload features on a typical bond look something like:
Code:
Features for bond0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [requested on]
tx-esp-segmentation: off
tx-udp-segmentation: off [requested on]
tx-gso-list: off [requested on]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off
esp-tx-csum-hw-offload: off
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: on [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
The interface and sysctl settings below came from following random optimization articles online (so very likely a misconfig here):
Code:
auto enp65s0
iface enp65s0 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000

auto enp65s0d1
iface enp65s0d1 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000

auto bond0
iface bond0 inet static
    address 192.168.2.2/24
    mtu 9000
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up /sbin/ip link set dev bond0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000
    post-up /sbin/ip link set dev bond0 mtu 9000
    post-up /sbin/ip link set dev enp65s0 mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 mtu 9000
    post-up iptables -t nat -A POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    post-up ifconfig bond0 192.168.2.2
    post-up /usr/sbin/ethtool -K bond0 lro on
    post-down iptables -t nat -D POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    bond-slaves enp65s0 enp65s0d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-min-links 1
    bond-lacp-rate 1
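As a cross-check on the bond config above, `/proc/net/bonding/bond0` lists an Aggregator ID per slave, and in a healthy 802.3ad bond both ports should report the same one. A minimal Python sketch for pulling that out follows. It assumes the usual layout of that file (a `Slave Interface:` line followed further down by an `Aggregator ID:` line for each port), which is worth confirming against the real output on each node. The function names are made up.

```python
from pathlib import Path


def slave_aggregators(status_text: str) -> dict[str, int]:
    """Map each slave interface to the aggregator ID reported beneath it."""
    result: dict[str, int] = {}
    current = None
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "Aggregator ID" and current is not None:
            # Aggregator IDs seen before the first slave belong to the
            # bond-wide summary and are skipped.
            result[current] = int(value)
    return result


def same_aggregator(status_text: str) -> bool:
    """True when at least two slaves exist and all share one aggregator."""
    ids = slave_aggregators(status_text)
    return len(ids) >= 2 and len(set(ids.values())) == 1


if __name__ == "__main__":
    path = Path("/proc/net/bonding/bond0")
    if path.exists():
        text = path.read_text()
        print(slave_aggregators(text), "same aggregator:", same_aggregator(text))
    else:
        print("no bond0 status file on this machine")
```

If the two ports ever show different aggregator IDs, only one of them is carrying traffic for the bond, no matter what the link speed says.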
Code:
#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables.
# See sysctl.conf (5) for information.
#
#kernel.domainname = example.com
# Uncomment the following to stop low-level messages on console
#kernel.printk = 3 4 1 3
##############################################################3
# Functions previously found in netbase
#
# Uncomment the next two lines to enable Spoof protection (reverse-path filter)
# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks
#net.ipv4.conf.default.rp_filter=1
#net.ipv4.conf.all.rp_filter=1
# Uncomment the next line to enable TCP/IP SYN cookies
# See http://lwn.net/Articles/277146/
# Note: This may impact IPv6 TCP sessions too
#net.ipv4.tcp_syncookies=1
# Uncomment the next line to enable packet forwarding for IPv4
net.ipv4.ip_forward=1
# Uncomment the next line to enable packet forwarding for IPv6
# Enabling this option disables Stateless Address Autoconfiguration
# based on Router Advertisements for this host
net.ipv6.conf.all.forwarding=1
###################################################################
# Additional settings - these settings can improve the network
# security of the host and prevent against some network attacks
# including spoofing attacks and man in the middle attacks through
# redirection. Some network environments, however, require that these
# settings are disabled so review and enable them as needed.
#
# Do not accept ICMP redirects (prevent MITM attacks)
#net.ipv4.conf.all.accept_redirects = 0
#net.ipv6.conf.all.accept_redirects = 0
# _or_
# Accept ICMP redirects only for gateways listed in our default
# gateway list (enabled by default)
# net.ipv4.conf.all.secure_redirects = 1
#
# Do not send ICMP redirects (we are not a router)
#net.ipv4.conf.all.send_redirects = 0
#
# Do not accept IP source route packets (we are not a router)
#net.ipv4.conf.all.accept_source_route = 0
#net.ipv6.conf.all.accept_source_route = 0
#
# Log Martian Packets
#net.ipv4.conf.all.log_martians = 1
#
###################################################################
# Magic system request Key
# 0=disable, 1=enable all, >1 bitmask of sysrq functions
# See https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
# for what other values do
#kernel.sysrq=438
kernel.sysrq=1
#vm.swappiness=20
net.ipv4.tcp_window_scaling = 1
vm.overcommit_memory=1
#vm.swappiness=60
#vm.vfs_cache_pressure=200
net.ipv4.ip_forward=1
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv6.conf.default.forwarding=1
#net.ipv4.conf.all.mc_forwarding=1
#net.ipv4.conf.default.mc_forwarding=1
#vm.max_map_count = 262144
#vm.dirty_writeback_centisecs = 1500
#vm.dirty_expire_centisecs = 1500
vm.overcommit_memory=1
vm.nr_hugepages = 1024
net.core.rmem_max=1677721600
net.core.rmem_default=167772160
net.core.wmem_max=1677721600
net.core.wmem_default=167772160
# set minimum size, initial size, and maximum size in bytes
net.ipv4.tcp_rmem="1024000 8738000 2147483647"
net.ipv4.tcp_wmem="1024000 8738000 2147483647"
net.ipv4.tcp_mem="1024000 8738000 2147483647"
net.ipv4.udp_mem="1024000 8738000 2147483647"
net.core.netdev_max_backlog=250000
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv4.tcp_adv_win_scale=1
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_window_scaling = 1
net.ipv6.conf.default.forwarding=1
net.ipv4.tcp_moderate_rcvbuf=1
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
kernel.sysrq = 1
net.link.lagg.0.use_flowid=1
net.link.lagg.0.lacp.lacp_strict_mode=1
net.ipv4.tcp_sack=1
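For a sense of scale on the buffer values above: the socket buffer a single TCP stream needs to fill a link is roughly the bandwidth-delay product. The sketch below works that out for one 56Gbit/s port. The 0.1 ms round-trip time is a placeholder assumption, not a measurement (substitute whatever ping reports between the nodes), and the 4096-byte page size is the common x86-64 default. It also converts the `tcp_mem` maximum, which the kernel counts in pages rather than bytes.

```python
PAGE_SIZE = 4096  # bytes; assumed x86-64 default, check with `getconf PAGESIZE`


def bdp_bytes(link_gbit: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    bits_per_s = link_gbit * 1e9
    return bits_per_s * (rtt_ms / 1e3) / 8


def pages_to_bytes(pages: int, page_size: int = PAGE_SIZE) -> int:
    """tcp_mem and udp_mem are expressed in memory pages, not bytes."""
    return pages * page_size


if __name__ == "__main__":
    need = bdp_bytes(56, 0.1)  # 0.1 ms RTT is an assumed placeholder
    print(f"BDP for 56 Gbit/s at 0.1 ms RTT: {need / 1e6:.2f} MB")
    print(f"rmem_max as posted: {1677721600 / 1e6:.0f} MB")
    print(f"tcp_mem max as posted: {pages_to_bytes(2147483647) / 2**40:.1f} TiB")
```

Under that assumed RTT, the buffer sizes copied from those articles are orders of magnitude beyond what a LAN link of this speed needs, so they are at least worth revisiting.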
The switch is also set to an MTU of 9000, with LACP enabled.
Switch setup, where the issue is likely present:
Code:
Mellanox MLNX-OS Switch Management
Password:
Last login: Sun Jul 11 13:33:25 2021 from 192.168.10.132
Mellanox Switch
switch-625810 [standalone: master] > enable
switch-625810 [standalone: master] # show running-config
##
## Running database "initial"
## Generated at 2021/07/12 01:03:16 +0000
## Hostname: switch-625810
##
##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable
##
## License keys
##
license install REDACTED (Ethernet and all the bells and whistles)
##
## Interface Ethernet configuration
##
interface port-channel 1-4
interface ethernet 1/1-1/2 speed 56000 force
interface ethernet 1/13-1/14 speed 56000 force
interface ethernet 1/27-1/30 speed 56000 force
interface ethernet 1/1-1/2 mtu 9000 force
interface ethernet 1/7-1/8 mtu 9000 force
interface ethernet 1/13-1/14 mtu 9000 force
interface ethernet 1/25-1/30 mtu 9000 force
interface port-channel 1-4 mtu 9000 force
interface ethernet 1/1-1/2 channel-group 1 mode active
interface ethernet 1/7-1/8 channel-group 4 mode active
interface ethernet 1/13-1/14 channel-group 2 mode active
interface ethernet 1/25-1/26 channel-group 3 mode active
##
## LAG configuration
##
lacp
interface port-channel 1-4 lacp-individual enable force
port-channel load-balance ethernet source-destination-port
interface ethernet 1/1-1/2 lacp rate fast
interface ethernet 1/7-1/8 lacp rate fast
interface ethernet 1/13-1/14 lacp rate fast
interface ethernet 1/25-1/26 lacp rate fast
##
## STP configuration
##
no spanning-tree
##
## L3 configuration
##
ip routing vrf default
##
## DCBX PFC configuration
##
dcb priority-flow-control priority 3 enable
dcb priority-flow-control priority 4 enable
interface ethernet 1/1-1/36 no dcb priority-flow-control mode on force
##
## LLDP configuration
##
lldp
##
## IGMP Snooping configuration
##
ip igmp snooping proxy reporting
ip igmp snooping
vlan 1 ip igmp snooping querier
##
## PIM configuration
##
protocol pim
##
## IP Multicast router configuration
##
ip multicast-routing
##
## Network interface configuration
##
interface mgmt0 ip address REDACTED /16
##
## Local user account configuration
##
username admin password REDACTED
##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********
##
## Network management configuration
##
# web proxy auth basic password ********
##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID REDACTED
# (public-cert config omitted since private-key config is hidden)
##
## Persistent prefix mode setting
##
cli default prefix-modes enable
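One thing that can be cross-checked mechanically from the running-config above is whether every port-channel member also has `speed 56000 force` applied. The Python sketch below does that on the relevant lines copied from the config. The regular expression only covers the `1/A-1/B` range form that appears here, and the function names are hypothetical.

```python
import re

# Relevant lines copied from the running-config above.
CONFIG = """\
interface ethernet 1/1-1/2 speed 56000 force
interface ethernet 1/13-1/14 speed 56000 force
interface ethernet 1/27-1/30 speed 56000 force
interface ethernet 1/1-1/2 channel-group 1 mode active
interface ethernet 1/7-1/8 channel-group 4 mode active
interface ethernet 1/13-1/14 channel-group 2 mode active
interface ethernet 1/25-1/26 channel-group 3 mode active
"""

LINE = re.compile(
    r"interface ethernet 1/(\d+)-1/(\d+) (speed|channel-group) (\d+)")


def ports_by_setting(config: str) -> tuple[set[int], dict[int, int]]:
    """Return (ports with a forced speed, port -> channel-group number)."""
    forced: set[int] = set()
    groups: dict[int, int] = {}
    for line in config.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        first, last, kind, number = int(m[1]), int(m[2]), m[3], int(m[4])
        for port in range(first, last + 1):
            if kind == "speed":
                forced.add(port)
            else:
                groups[port] = number
    return forced, groups


def lag_ports_without_forced_speed(config: str) -> list[int]:
    """Channel-group members that have no forced speed line."""
    forced, groups = ports_by_setting(config)
    return sorted(p for p in groups if p not in forced)


def forced_ports_outside_lags(config: str) -> list[int]:
    """Ports with a forced speed that belong to no channel-group."""
    forced, groups = ports_by_setting(config)
    return sorted(p for p in forced if p not in groups)


if __name__ == "__main__":
    print("LAG members without 'speed 56000 force':",
          lag_ports_without_forced_speed(CONFIG))
    print("forced-speed ports in no channel-group:",
          forced_ports_outside_lags(CONFIG))
```

On the config as posted, ports 1/7, 1/8, 1/25 and 1/26 are in port-channels without a forced speed, while 1/27 through 1/30 have the forced speed but are in no channel-group. Whether that matters depends on which node is plugged into which ports, so treat it as something to verify rather than a diagnosis.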
Edit: Redacting stuff that shouldn't be public.