100gbe mellanox with 60% loss - how to diagnose?

eskiuzmi · Dec 20, 2022

I finally got these two machines to talk and it's fast, but so far it's only 15gbe fast.

Windows 10 both, both with Mellanox 455a. On the server the NIC is on a x16 gen3 and on the other the 455 is on a x8 (gen5 going to waste) because that's the best that z790 can offer. In any case 60gbe is better than 2.5gbe so I won't complain. Except that throughput shows 60% loss.

I lowered the jumbo size from 9000 to 1500 just in case. Since the twinax connecting them directly is rated for 100gbe, I don't know what could be wrong.

Do you know what I should look into to diagnose this?

nice bumps to 50gbe

i386 · Dec 20, 2022

eskiuzmi said:
Do you know what I should look into to diagnose this?

I would try a known good dac cable and If IT Shows the same Numbers. I suspect the 10gtek cable...

Stephan · Dec 21, 2022

As first measure try NTttcp on the command line. At 100 Gbps you need to make sure any performance test is using all cores and multiple TCP streams between machines.

To rule out windows, boot a recent Linux from USB sticks (using Ventoy, and e.g. some Ubuntu ISO with persistence) and run GitHub - microsoft/ntttcp-for-linux: A Linux network throughput multiple-thread benchmark tool.

Also on Linux you can use ethtool -S [interface] to get detailed statistics:

ethtool -S eno1 | grep -E "(drop|error|fail|coll)"

rx_errors: 0
tx_errors: 0
tx_dropped: 0
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_failed: 0
rx_csum_offload_errors: 0
alloc_rx_buff_failed: 0
dropped_smbus: 0
rx_dma_failed: 0
tx_dma_failed: 0
uncorr_ecc_errors: 0
corr_ecc_errors: 0

eskiuzmi · Dec 21, 2022

i386 said:
I would try a known good dac cable and If IT Shows the same Numbers. I suspect the 10gtek cable...

Alright i ordered two cables: one twinax from another vendor and an AOC.

CyklonDX · Dec 21, 2022

This is potentially the most useful 'guide'. (not only to udp)

https://events.static.linuxfound.org/sites/events/files/slides/LinuxConJapan2016_makita_160712.pdf

eskiuzmi · Dec 22, 2022

The other DAC arrived. Loss by 15%. Bending the new cable a certain way increases loss further. The first one was immune to bending loss.
So cable explains a bit of the loss, but not the assymetry between up/down and since both machines run the same OS, things could be hardware related. One machine runs at x16 and the other at x8, maybe that explains some of it but not that download is unaffected... unless these NIC allocated up and down circuit by pcie lane? then the x8 machine has all required lanes to downstream but only a few to upstream? I don't know how this symetrical 100gbe works at the hardware level.

CyklonDX said:
This is potentially the most useful 'guide'. (not only to udp)

https://events.static.linuxfound.org/sites/events/files/slides/LinuxConJapan2016_makita_160712.pdf

I didn't read all the word but the graphs look interesting, it's like multi threaded programming which I do some so I can understand some of it. Is it the same for 100gbe? Seems that bottlenecks move as we go faster. Someone mentioned increasing "cache/buffer size on the OS"

CyklonDX · Dec 22, 2022

eskiuzmi said:
Is it the same for 100gbe? Seems that bottlenecks move as we go faster. Someone mentioned increasing "cache/buffer size on the OS"

Its the same process, but don't follow exact same numbers they used - keep in mind your own hardware; and test, test and test again.
(best to keep some excel/google sheets with your results), and change only single thing at the time so you know exactly what does what.

Search

100gbe mellanox with 60% loss - how to diagnose?

eskiuzmi

New Member

i386

Well-Known Member

Stephan

Well-Known Member

eskiuzmi

New Member

CyklonDX

Well-Known Member

eskiuzmi

New Member

CyklonDX

Well-Known Member