This journey started when I noticed that my network, which used to get 36 Gbit/sec of throughput, was only getting 4.5 Gbit/sec in one direction and 9.8 Gbit/sec in the other. After many, many hours of searching, trying different settings, and finally cross-flashing Mellanox's latest firmware (and force-reinstalling the drivers), I caved and tried the stupid thing I should have tried earlier: jumbo frames. 9000 MTU -> 16.0 Gbit/sec, an improvement of 6+ Gbit/sec over the 1500 MTU case. Still a very far cry from 36 Gbit/sec.
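In case anyone wants to reproduce this, here's roughly what the test looks like, run from an elevated PowerShell prompt. The adapter name "Ethernet 2" and the address 10.0.0.2 are placeholders for my setup, and the exact "Jumbo Packet" property name/value can differ between driver versions:

    # enable jumbo frames via the driver's advanced property (name/value vary by driver)
    Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -DisplayName "Jumbo Packet" -DisplayValue "9014"

    # receiving box
    iperf3 -s

    # sending box: single TCP stream for 30 seconds
    iperf3 -c 10.0.0.2 -t 30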
With 9000 MTU: (iperf3 output, ~16 Gbit/sec single stream)
It's been a long time since I've seen jumbo frames matter this much, and something seems very wrong here. It clearly looks CPU-bottlenecked, which is amazing, since this is on a watercooled 4.4 GHz Haswell-E.
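(If anyone wants to check the same thing on their machine: pinning iperf3 to a known core with -A makes the single-core saturation easy to spot in Task Manager's per-core view. The core numbers here are arbitrary and 10.0.0.2 is a placeholder.)

    # -A n,m pins the client to core n and asks the server to pin to core m
    iperf3 -c 10.0.0.2 -t 30 -A 2,2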
For reference, it gets the same speeds over fiber through a switch and over a DAC directly from one card to another (two PCs in both cases).
iperf3 numbers are also pretty horrible to localhost (not using any NIC at all) - around 7.x Gbit/sec for TCP. Again, clearly CPU-limited. But that's where I thought the Mellanox TCP offload stuff was supposed to shine?
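For reference, the localhost test is just server and client on the same box:

    # window 1: server
    iperf3 -s

    # window 2: client against loopback, so no NIC or NIC driver is involved
    iperf3 -c 127.0.0.1 -t 30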
Before anyone asks: all the cards are in PCIe x8 electrical slots running at 8.0 GT/s (Gen3); this was one of the first things I verified.
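How I checked, in case it's useful (PowerShell; the adapter name is a placeholder, and I'm assuming the Mellanox driver populates these fields):

    # a Gen3 x8 link should show 8.0 GT/s and a width of 8
    Get-NetAdapterHardwareInfo -Name "Ethernet 2" | Format-List PcieLinkSpeed, PcieLinkWidth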
Has anyone ever seen this kind of issue? I'm on the latest Mellanox drivers (5.35.12978.0, dated 3/8/2017) and the latest Mellanox firmware (7000). I should have mentioned this before, but this is on Windows 10 x64 (latest).
Thanks in advance!
P.S. One more note: before jumbo frames, -P N (where N > 1) wasn't helping performance; it just split the same total throughput into more streams. I retested with jumbo frames, and it now does make a positive impact - up to 25 Gbit/sec at P=4 - but that still doesn't get performance to 36...
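For completeness, the multi-stream run was just (address is a placeholder again):

    # 4 parallel TCP streams
    iperf3 -c 10.0.0.2 -P 4 -t 30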