Turbocharge your Quanta LB6M - Flash To Brocade TurboIron - Big Update!

TheBloke · Jan 19, 2018

Guys, newbie question.. could you help me make a LAG? I haven't done this for quite a few years, and last time I did it, it just worked first time.

EDIT: Actually this might be a Windows question..

I just created an active LAG on Solaris to the LB6M and it worked perfectly in seconds. That's with the same Intel X520 NICs as I'm using in my Win 10 desktop.

But if anyone does have any ideas why I can't get it working on Windows 10, I'd be most grateful.

I'm configuring this as follows (EDIT: which is now confirmed to work with Solaris, but won't with Win10):

Code:

interface ethernet 23
 port-name 10Gdesktop1
 link-aggregate active
!
interface ethernet 24
 port-name 10Gdesktop2
 link-aggregate active

Then on my desktop, I've teamed my two Intel X520-DA1s. It has various Team options. I first tried "IEEE 802.3ad Dynamic Link Aggregation", but this keeps failing immediately - all the network connections (NIC 1, NIC 2 and the Team) keep dropping, showing "Network cable unplugged", then connected, then unplugged. This happens every 5 seconds or so, and causes the OS to keep lagging as well.

So then I tried "Static Link Aggregation". This at first appears to work - I get a stable Team up on the workstation, I assign an IP to it. But then it can't ping out, and nothing can ping it. And when I check the switch, it shows no active LAG.

Here's what the switch shows when the Win10 NICs are in Static Link Aggregation mode:

Code:

switch10g#show int br
Port    Link    State   Dupl Speed Trunk Tag Pvid Pri MAC            Name
..
23      Up      Blocked Full 10G   None  No  1    0   089e.0193.0832  10Gdeskt
24      Up      Blocked Full 10G   None  No  1    0   089e.0193.0832  10Gdeskt

switch10g#show link-aggregate eth 23
System ID: 089e.0193.0832
Default Key:        2
Port  [Sys P] [Port P] [  Key ] [Act][Tio][Agg][Syn][Col][Dis][Def][Exp][Ope]
23         1        1        2   Yes   S   Agg  Syn  Col  Dis  Def  No   Ina
switch10g#show link-aggregate eth 24
System ID: 089e.0193.0832
Default Key:        2
Port  [Sys P] [Port P] [  Key ] [Act][Tio][Agg][Syn][Col][Dis][Def][Exp][Ope]
24         1        1        2   Yes   S   Agg  Syn  Col  Dis  Def  No   Ina

There are two obvious differences in this output compared to what I saw when I later successfully made a LAG on Solaris: 1) the interfaces in "show int br" show as Blocked, not Forward, and that's probably because 2) the link-aggregates are "Inactive" ("Ina" in the last column).

The Windows Intel drivers have a "Test Switch" button under the Team settings, and I've run this several times in SLA mode and it always says "No problems detected with the switch configuration" - and I've confirmed that if I turn off link-aggregation on the switch, this "Test Switch" button does instead show an error. So the Intel drivers are at least able to confirm whether LAG is turned on on the switch.

Maybe Static Link Aggregation mode is the wrong mode anyway? Perhaps the real problem is that in full LACP mode, the NICs and Team keep going "Cable disconnected" for some unknown reason.

I realise this might be a Windows or Intel question, so I'll take it elsewhere if needed. But if anyone has any suggestions I can try, I'd be most grateful.

TheBloke · Jan 19, 2018

Second question, re documentation:

In order to make the LAG, I read the docs provided on BrokeAid, but although the L2 guide does mention the TurboIron 24x at the start of the document, in the case of LAGs they don't seem to apply to this switch - they talk about commands lag, lag-hash and show lag, none of which works on the switch.

I did some Googling and I found a doc labelled as being for the TurboIron, a single huge document that appears to be all the individual Brocade docs combined into a single file. It's also old - dated 2010, and appears to be for FastIron version 4.2.

This describes LAG commands that do work on my OS. Instead of creating a top-level, named LAG with the lag command, they are set under each individual interface: link-aggregate with options active, configure, off and passive.

Any ideas why the docs for 8.0.1 describe commands the TurboIron doesn't have? Maybe these were added in a later FastIron release, which the TurboIron never officially supported? Although the 8.0 docs do mention TurboIron at the start, I have noticed that mention of the TurboIron thereafter is nonexistent - eg it never appears in any of the feature availability tables. So I'm often confused about which features the TurboIron / LB6M does have versus the various other switch models the doc deals with.

TheBloke · Jan 19, 2018

Wow, my desperate endeavours to get LAG working with Windows have managed to crash the switch completely.

I had the Win 10 NICs Teamed in SLA mode, having just set the switch to 'link-aggregate passive'. I then tried changing the Win 10 NICs back to full LACP mode, as I couldn't remember if I'd recently tried LACP with passive. Then I got this in the serial console:

Code:

Exception Type 0300 (Data Storage Interrupt), appl
0202d030(msr)
247f0000(dar)
00800000(esr)
20f7e878(pc)
202a8a9c(lr)
4c49442c
End of Trace

Console stopped responding, and all links went down. Having to hard power cycle.

AT S37=0 · Jan 19, 2018

TheBloke said:
Wow, my desperate endeavours to get LAG working with Windows have managed to crash the switch completely.

Console stopped responding, and all links went down. Having to hard power cycle.

My guess is that log file space was overrun or something. There is a register in the powerpc named for this condition. fohdeesha probably knows more about that.

I'd be curious if you let the switch alone, would it reboot? It should have a watchdog reboot timer.

I know nothing of windows LACP or Brocade LACP, but the passive / active refers to whether the LACP protocol is actively sending packets. It's ok to have two active sides or one active and one passive. There is a deep and frustrating history here: Link Aggregation Confusion
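As a rough illustration of the active/passive rule above, here is a minimal Python sketch. The `Mode` names and the `lacp_negotiates` function are made up for this example and are not Brocade or Windows terms; the static case is included because a statically teamed side never speaks LACP at all.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"    # sends LACPDUs without waiting for the peer
    PASSIVE = "passive"  # only answers LACPDUs it receives
    STATIC = "static"    # no LACP at all (hypothetical label for a static team)


def lacp_negotiates(a: Mode, b: Mode) -> bool:
    """Return True if the two ends can bring up a LAG through LACP negotiation."""
    if Mode.STATIC in (a, b):
        # One side never sends or answers LACPDUs, so nothing is negotiated.
        return False
    # At least one side has to start the conversation.
    return Mode.ACTIVE in (a, b)


for pair in [(Mode.ACTIVE, Mode.ACTIVE), (Mode.ACTIVE, Mode.PASSIVE),
             (Mode.PASSIVE, Mode.PASSIVE), (Mode.ACTIVE, Mode.STATIC)]:
    print(pair[0].value, "+", pair[1].value, "->", lacp_negotiates(*pair))
```

The last case matches what was reported earlier in the thread: with the switch set to active and the Windows team in Static Link Aggregation mode, the switch showed the LAG as inactive.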

fohdeesha · Jan 20, 2018

I think @AT S37=0 is right, it looks like it flooded the small logs area, though you'd think they'd have planned for that. It does say data storage interrupt though, so probably that. I know you can use the port dampening command to prevent such violent link flapping, probably a good idea: link-error-disable

I know @vrod was having the same problem making a lacp group work, on both fastpath and brocade. He even said his switch froze too

AT S37=0 · Jan 20, 2018

Maybe setting up syslog in order to send logs off the switch might fix it?

fohdeesha · Jan 20, 2018

@TheBloke regarding the documentation it's hard to say, the TurboIron was a bit of a one-off outsider to the existing FastIron product line, so there's probably a couple commands that are missing or different. So far it seems to be 99% identical though, I was even able to copy over an entire config from one of my FCX's

fohdeesha · Jan 20, 2018

@mixmansc grabbed some more relevant docs from Ruckus's site (who acquired the fastiron line) - EDIT: they are now in the main firmware zip, linked in the guide

mixmansc · Jan 20, 2018

Happy to have at least contributed a little...

TheBloke · Jan 20, 2018

TheBloke said:
On a different matter, could someone confirm that the following RJ45 SFP should work: AVAGO ABCU-5710RZ GbE SFP Copper RJ45 GBIC Transceiver 1000BASE-T

I went ahead and bought four of these from eBay, and yeah they work fine in the LB6M. I will likely get a few more similar, whatever I can get for super cheap, and aim to have the majority or all my 1G links go direct into the single LB6M, alongside my handful of 10G. Saves a few pennies on the electricity, and more importantly noise pollution, not needing to use my LB4M as well. Plus the software is now better on the 6M

I'll keep the LB4M though, might come in handy if I later need to expand to many more 1G ports.

One thing I didn't realise: that the 1G SFP can only link at 1Gb/s. Now I google I see this is common knowledge, but it was news to me.

The reason I noticed is that for some reason before my server (a Tyan 2xLGA1366 motherboard) is booted, its main NICs and the IPMI NIC sync at 10Mbit. As soon as it boots they switch to 1G, but this does mean initial IPMI contact - eg to tell it to boot in the first place - happens at 10M.

I initially plugged all the 1G NICs into these new 1G SFPs, but then I realised I could no longer access the IPMI unless the server was already running. So I've reverted to having the IPMI cable go into one of the LB6M's copper ports.
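The IPMI behaviour can be pictured as a simple "highest common speed" rule. This is only a sketch under assumptions: that the standby NICs advertise nothing but 10 Mbit (the post only says they sync at 10Mbit before boot), that the SFP links at 1G only, and that the switch's copper ports accept 10/100/1000. The names below are hypothetical.

```python
def negotiated_speed(port_speeds, nic_speeds):
    """Return the highest speed (Mbit/s) both ends support, or None if there is no link."""
    common = set(port_speeds) & set(nic_speeds)
    return max(common) if common else None


SFP_1G_ONLY = {1000}           # assumption: copper SFP that links at 1G only
COPPER_PORT = {10, 100, 1000}  # assumption: built-in copper port, multi-speed
NIC_STANDBY = {10}             # assumption: server powered off, IPMI/NIC at 10M
NIC_BOOTED = {10, 100, 1000}   # server running

print("SFP + standby NIC:   ", negotiated_speed(SFP_1G_ONLY, NIC_STANDBY))
print("SFP + booted NIC:    ", negotiated_speed(SFP_1G_ONLY, NIC_BOOTED))
print("copper + standby NIC:", negotiated_speed(COPPER_PORT, NIC_STANDBY))
```

Under those assumptions the standby NIC and the 1G-only SFP have no speed in common, which would explain why IPMI was unreachable until the server had booted.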

Live and learn! Anyway, I'm glad to see it is possible to add 1G ports to the switch for very little money.

Terry Kennedy · Jan 20, 2018

TheBloke said:
One thing I didn't realise: that the 1G SFP can only link at 1Gb/s. Now I google I see this is common knowledge, but it was news to me.

There are actually 2 kinds of copper SFPs - ones that do speed conversion internally and ones that require the switch ASICs to handle speed conversion. Which kind any particular switch wants depends on the switch hardware and software. Speed-converting ones need switch software support to enable the user to specify the speed, otherwise they usually auto-negotiate. Non-speed-converting ones require the switch port ASICs to support multiple speeds, as well as the aforementioned software support to let the user specify the speed. On top of that, you have various vendor lock-in where non-"qualified" (== expensive) SFPs are either ignored or generate an error and are disabled.

If you have a SFP coding tool, you can get creative if you need (for example) 100 Mbit on a switch that claims to not support this. For example, the Cisco Catalyst 3750 only supports 100Mbit on short range optics (and copper, so not really on-topic). If you need to do 100Mbit on long-range optics, you get a re-codable LR optic and code it to report itself as a SR part.

TheBloke · Jan 20, 2018

Thanks @Terry Kennedy , that's useful to know.

In other news, I got my Win 10 LACP working! In order to solve the constant connection dropping, I had to disable spanning-tree on the two 10G ports connected to the two Win 10 X520 NICs. I don't know why; I didn't have to do that to get LACP working on Solaris. I got lucky with a Google search that showed spanning tree as a possible solution to an unrelated connection-drop problem on Windows with the X520, so I tried it here without much hope and was pleased it worked immediately.

Sadly that's where the happiness ended - the bad news is that in benchmarking SMB transfers over the new LACP connections (both desktop and server using LACP on their 2 x 10G links), I haven't once achieved bandwidth in excess of a single 10G link. In fact it's actually running slower than even 1 x 10G link - averaging 7-8Gb/s in both upload and download tests.

My previous non-LACP tests achieve 14Gb/s writes & 17Gb/s reads across the two links, thanks to the multi-channel connections introduced in SMB 3.0 and supported by Windows 10 and Samba. That's spreading transfers across two Samba instances on the Solaris server, which isn't too realistic of normal usage. But I can get 12-13Gb/s using a single Samba instance before it bottlenecks on CPU.

So I don't think I'll be sticking with LACP. Which is a bit disappointing and surprising, given that I thought Windows was meant to detect Teamed/LACP links and use them with more threads, as part of its SMB 3.1.1 multichannel code. I'm using 16 concurrent transfers (iozone) hitting two separate SMB servers on the Solaris server, so I should have more than enough concurrency. But I'm getting results as much as 60% slower than the same test performed over 2 x non-teamed links, and 30% slower than a single link.

Oh well, at least I got it working

EDIT: Actually maybe this is the problem: only one of the NICs in both the server and desktop LACPs is both sending and receiving data:

Code:

switch10g(config)#show int e23 to 24 | include rate
  300 second input rate: 1559040664 bits/sec, 39546 packets/sec, 15.64% utilization
  300 second output rate: 120 bits/sec, 0 packets/sec, 0.00% utilization
  300 second input rate: 1373458528 bits/sec, 19403 packets/sec, 13.76% utilization
  300 second output rate: 4614377120 bits/sec, 91905 packets/sec, 46.28% utilization

switch10g(config)#show int e1 to 2 | include rate
  300 second input rate: 1821802448 bits/sec, 30660 packets/sec, 18.26% utilization
  300 second output rate: 1200 bits/sec, 1 packets/sec, 0.00% utilization
  300 second input rate: 1851207400 bits/sec, 30948 packets/sec, 18.55% utilization
  300 second output rate: 774666264 bits/sec, 25531 packets/sec, 7.78% utilization

I guess 'output rate' means the NIC sending data. So one NIC in each of the two trunks is receiving but not sending; the other is doing both. That would certainly stop me going over 10Gb/s. Though I don't know why it'd give me speeds ~30% below a single link.
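For anyone wanting to total these counters per trunk, a minimal Python sketch follows. It only parses the "300 second ... rate" lines in the format pasted above and keeps the switch's own input/output labels; which end each label refers to is not settled in this post. `sum_rates` is a hypothetical helper.

```python
import re

# Matches lines such as:
#   300 second input rate: 1559040664 bits/sec, 39546 packets/sec, 15.64% utilization
RATE_RE = re.compile(r"300 second (input|output) rate: (\d+) bits/sec")


def sum_rates(text):
    """Add up the input and output bit rates found in 'show int | include rate' output."""
    totals = {"input": 0, "output": 0}
    for direction, bits in RATE_RE.findall(text):
        totals[direction] += int(bits)
    return totals


E23_E24 = """\
  300 second input rate: 1559040664 bits/sec, 39546 packets/sec, 15.64% utilization
  300 second output rate: 120 bits/sec, 0 packets/sec, 0.00% utilization
  300 second input rate: 1373458528 bits/sec, 19403 packets/sec, 13.76% utilization
  300 second output rate: 4614377120 bits/sec, 91905 packets/sec, 46.28% utilization
"""

totals = sum_rates(E23_E24)
for direction, bits in totals.items():
    print(f"{direction}: {bits / 1e9:.2f} Gb/s across both ports")
```

Totalling per trunk makes the imbalance easier to see: nearly all of the "output" figure comes from one port of the pair.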

I don't know if this is fixable, but the fact it's on both the Solaris and Windows 10 LACPs makes me wonder if it's standard behaviour.

Maybe I'm just misunderstanding what LACP should be able to do for me in terms of performance. I knew it couldn't help with single connections, but I thought lots of concurrency would enable full use of it. Anyway, I suppose a protocol-level multi-channel system - ie SMB multichan and iSCSI MPIO or MCS - is probably always going to do better than a general network-level one.

Terry Kennedy · Jan 20, 2018

TheBloke said:
Sadly that's where the happiness ended - the bad news is that in benchmarking SMB transfers over the new LACP connections (both desktop and server using LACP on their 2 x 10G links), I haven't once achieved bandwidth in excess of a single 10G link. In fact it's actually running slower than even 1 x 10G link - averaging 7-8Gb/s in both upload and download tests.

You're probably mis-understanding how LACP works. A single TCP connection (set of source and destination IP addresses + source and destination ports) is always going to get assigned to one of the links in the LACP bundle. You can control how that decision is made (in the FASTPATH software, it is "hashing-mode #"; in Cisco IOS it is "port-channel load-balance-hash-algo X" or "port-channel load-balance X" depending on the platform). But you can't get round-robin with any LACP hash mode.

There are subtleties in LACP that require this behavior. LACP was initially intended for switch-to-switch links, not switch-to-host links. The idea was that the switch would be aggregating traffic from a number of hosts and could then distribute that traffic (relatively) evenly across the LACP bundle to another switch, avoiding a bottleneck on the switch-to-switch link.
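To make the per-connection assignment described above concrete, here is a minimal Python sketch. It is not the hash any real switch or NIC driver uses: CRC32 over the address/port tuple is a stand-in chosen for illustration, and `pick_link` and the example addresses are hypothetical.

```python
import zlib


def pick_link(src_ip, dst_ip, src_port, dst_port, n_links):
    """Map one flow to one link index; the same flow always gets the same link."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_links


# Two hypothetical SMB connections from one desktop to one server.
flows = [
    ("192.168.1.10", "192.168.1.20", 50001, 445),
    ("192.168.1.10", "192.168.1.20", 50002, 445),
]

for flow in flows:
    print(flow, "-> link", pick_link(*flow, n_links=2))
```

Whatever hash is used, every packet of a given connection follows the same member link, so a single connection can never exceed one link's bandwidth, and a small number of connections can easily end up sharing a link.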

Consider the following - an ISP has a switched backbone with nodes in New York City, NY; Newark, NJ and Philadelphia, PA. For redundancy, each node's switch has a LACP bundle of 4 10GbE links to every other node's switch. 2 of those 10GbE links go directly between the 2 nodes, while the other 2 go "through" the 3rd node (via an OEO repeater). This means that 2 of the LACP links have a shorter round trip time than the other 2. [NYC to Newark directly is 15 miles; NYC to Newark via Philly is 186 miles.] If the traffic going over the LACP bundle is UDP, there is no out-of-sequence reassembly at the UDP level, and if the application does not handle out-of-sequence packets, a simple round-robin distribution across the 4 LACP links will virtually guarantee that packets will appear at the destination out of sequence.

Also, a single 10GbE link is good for about 1250MBytes/sec. If the client system's disk subsystem is limited to less than that, LACP wouldn't give you more bandwidth even if it could do round-robin.
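The 1250 MBytes/sec figure is simply 10 Gbit/s divided by 8 bits per byte, ignoring protocol overhead. A tiny Python sketch of the conversion, with hypothetical helper names and decimal (SI) units assumed:

```python
def gbit_to_mbyte(gbit_per_s):
    """Convert Gbit/s to MByte/s (decimal units, no protocol overhead)."""
    return gbit_per_s * 1000 / 8


def mbyte_to_gbit(mbyte_per_s):
    """Convert MByte/s back to Gbit/s."""
    return mbyte_per_s * 8 / 1000


print("10 Gb/s  =", gbit_to_mbyte(10), "MB/s")
print("14 Gb/s  =", gbit_to_mbyte(14), "MB/s")
print("17 Gb/s  =", gbit_to_mbyte(17), "MB/s")
```

By the same arithmetic, the 14 - 17 Gb/s reported earlier for two un-teamed links works out to roughly 1750 - 2125 MB/s.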

TheBloke · Jan 20, 2018

Thanks very much for the details, Terry.

I did understand about the single connection limitation, but I had thought that by using two Samba servers on the server side, and by virtue of Windows' Multi Channel features in the latest SMB protocol, I would be running enough TCP connections to balance. However, I've just checked my actual outgoing TCP connections and it appears I have only one to each of the two Samba instances I'm using in this test.

That's strange and disappointing - the OS is supposedly meant to make more outgoing connections when you have appropriate hardware, ie NICs with Receive Side Scaling Queues, and/or when Teamed. Here's the Microsoft doc I refer to on this.

Admittedly it's a feature promoted as being part of Windows Server 2012 and later. Windows 10 definitely has it, I know for sure multi-channel connections work across two un-teamed links as I've been benchmarking that for a few days. That's how I can get 14 - 17 Gb/s across two unteamed links. I haven't actually checked how many TCP connections it opens in those tests when I'm not Teamed - I will check that in a minute.

Still, I do definitely have two outbound TCP connections, and I know it is actually using both NICs - I can see traffic on both ports of the switch. But I don't get near the performance of two links. And I have the issue of seemingly only one of the two NICs in each LACP actually sending, though both receive.

I'm going to do a quick test to see if it works any better with four or more connections. That's an academic test because I'd never use that setup in real life, but it'll be interesting to see at least.

Assuming more TCP connections does help, it would mean a key part of the problem is that Windows isn't creating multiple TCP connections to a given remote server instance, as I thought that Microsoft document indicated it would. Their diagram seems to indicate at least two connections per NIC to each server when using teamed NICs, where I only see one.

It could be that the Teaming part of the SMB MultiChannel only works on Windows server editions. For a while it was never even officially confirmed if Windows 10 supported MultiChannel at all - there are posts on this forum from 1-2 years ago of people trying and failing to get it to work. But it definitely does work great for me now on un-Teamed links. Just seemingly it's not helping at all with LACP.

On your last point: I'm definitely not hitting any other bottlenecks in this test. Doing the same test without LACP, using SMB 3.1.1's built-in multi-channelling to spread the load over two un-teamed links, gives me speeds of 14 - 17 Gb/s.

That's still not the limit of my server disk subsystem which is 2.5GB/s - 3GB/s on my Seagate WarpDrive PCIe SSD array (tested locally from the server using iozone, and fairly close to published specs.) In fact I haven't quite worked out what the bottleneck is in that test, given I'm hitting none of CPU, disk or network limits. But 14 - 17 Gb/s out of two 10Gb/s links is quite satisfactory.

However I only get those speeds when using two Samba instances on the server end. With a single Samba server - which is what I would actually use day-to-day - I hit limits of 12-13Gb/s in both reads and writes, and in that case I do know the bottleneck: the smbd instance on the server tops out at 100% of a single core, as it's not multi-threaded. My server has 12 cores/24 threads, but its LGA1366 Westmere CPUs are pretty old now, and as they only run at 2.93GHz (with only 800MHz DDR3 RAM), single threaded performance isn't too great.

I'm going to do some more tests now, just for the hell of it. I'll try four remote Samba instances, so I get more TCP connections, and see if that makes much of a difference with LACP. And I'm now curious to see how many TCP connections are opened by SMB MultiChannel when I don't use LACP. It must be at least two per server, which is more than I'm getting when LACP is on.

mixmansc · Jan 20, 2018

Interesting stuff. With what my company does, single threaded performance is of greater importance than a high number of cores (even though many of the applications we use are multi threaded). That's specifically why I went with E5-2643v3 processors in both the servers and the workstations. They are only 6 core parts but their base core speed is 3.4GHz. There are other parts with a ton more cores but at the expense of raw speed, albeit many can "turbo" up to the same speeds. I did also have to balance what I considered the best bang for the buck based on my budget. Was talking to someone just the other day who is in the same business and he was all giddy about his new workstation and that he configured it with dual 14 core processors. Apparently he did not do his homework, even the heavily multi-threaded Photoshop sees rapidly diminishing returns on the number of cores once you hit about 6. At that point raw speed becomes much more beneficial.

Of course for other industries and applications more cores might be far more beneficial. Interested in seeing all of these results and tests though. I was planning on connecting the servers to the switch via dual DAC cables. Ideally I'm also considering a second LB6M for fault tolerance reasons but I also wonder on the same for load sharing as well. Single attach to each switch, dual attached to each, etc (can end up needing a lot of dang ports and cabling not to mention all the configuration that must be perfect).

TheBloke · Jan 20, 2018

Right - I just repeated my benchmark using four separate Samba instances on the server, thus forcing the Windows client to make at least 4 TCP connections. (Again, it should really have been 4 * 2 or more connections, but it seems MultiChan won't do that with a Team.)

This got me - in bursts - over 10Gb/s. Total speed in a 128GB write test was 10.12Gb/s. Hardly exceptional, still much slower than without LACP, but at least it's not slower than 1 link. And when I check the switch stats in this 4 x TCP connection test, I see all four ports both sending and receiving.

So yeah, with LACP I need more TCP connections to have a chance to use the bandwidth. Maybe 8 would be even better. But without using a silly number of server instances, I am limited in how many TCP connections I can open because SMB MultiChan on Windows 10 doesn't seem to detect the Team and make more than one connection per server as it states it will (at least, as it states it will on Windows Server.)
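A back-of-the-envelope way to see why more connections help: if each connection is hashed onto one of the links independently and uniformly (an idealising assumption, real hashes and real traffic are not that tidy), the chance that every connection lands on the same link shrinks quickly as connections are added. `prob_single_link` is a hypothetical helper for this sketch.

```python
def prob_single_link(n_flows, n_links=2):
    """Probability that all flows hash onto the same link.

    Assumes each flow picks a link uniformly and independently.
    """
    return n_links * (1 / n_links) ** n_flows


for n in (1, 2, 4, 8):
    print(f"{n} connections over 2 links: "
          f"{prob_single_link(n):.4f} chance they all share one link")
```

Under that assumption, two connections share a link half the time, while with eight it becomes rare. This says nothing about how evenly the load is split, only whether the second link is used at all.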

Anyway, it's not fully LACP's fault

mixmansc · Jan 20, 2018

One thing that just crossed my mind that I had not considered.... what if RDMA is coming into play? RDMA does not play nice with nic teaming because the packets go directly to the adapter and bypass the network stack.

Edit to add - this is specific to Server 2012 though. No idea on Windows 10 or Server 2016 if anything is different.

SMB Multichannel and SMB Direct

SMB Multichannel is the feature responsible for detecting the RDMA capabilities of network adapters to enable SMB Direct. Without SMB Multichannel, SMB uses regular TCP/IP with the RDMA-capable network adapters (all network adapters provide a TCP/IP stack along with the new RDMA stack).

With SMB Multichannel, SMB detects whether a network adapter has the RDMA capability, and then creates multiple RDMA connections for that single session (two per interface). This allows SMB to use the high throughput, low latency, and low CPU utilization offered by RDMA-capable network adapters. It also offers fault tolerance if you are using multiple RDMA interfaces.

Note: You should not team RDMA-capable network adapters if you intend to use the RDMA capability of the network adapters. When teamed, the network adapters will not support RDMA. After at least one RDMA network connection is created, the TCP/IP connection used for the original protocol negotiation is no longer used. However, the TCP/IP connection is retained in case the RDMA network connections fail.

TheBloke · Jan 20, 2018

mixmansc said:
Of course for other industries and applications more cores might be far more beneficial. Interested in seeing all of these results and tests though. I was planning on connecting the servers to the switch via dual DAC cables. Ideally I'm also considering a second LB6M for fault tolerance reasons but I also wonder on the same for load sharing as well. Single attach to each switch, dual attached to each, etc (can end up needing a lot of dang ports and cabling not to mention all the configuration that must be perfect).

Yeah it's a challenge that's been bothering the industry for years, ever since it became apparent that it was getting harder and harder to squeeze more MHz out of the CPUs. Many apps have moved to multi-core, but many haven't, or can't. It's amazing to think that with these 12+ core CPUs, we have upwards of 40GHz per CPU. But a given single app may be no faster than years ago unless it's been written carefully to take advantage. Of course it has helped the rise and rise of virtualisation - stick 12+ VMs on one box instead of having a dozen servers. But that doesn't help every case.

I was pretty surprised when I realised smbd was single threaded. I guess it's because they scale by launching new processes for each new connection - it launches 4 or 5 when you start it. Like Apache and other httpd's. I am sending multiple connections, but I have a feeling it directs all connections from a given client (ie by IP or Netbios name) to a single instance. So multiple additional processes from a single Windows desktop aren't helping me. I haven't investigated it much further yet - I do plan to ask their mailing list.

TheBloke · Jan 20, 2018

Guys, got another issue.. I just turned off LACP on both server and desktop, and disabled it on the switch. I wanted to go back to simple 2 x separate links.

The problem is that the switch is now blocking two of the ports, the first port in each of the old LACP groups:

Code:

switch10g#show int br eth 1 to 2 eth 23 to 24
Port    Link    State   Dupl Speed Trunk Tag Pvid Pri MAC            Name
1       Up      Blocked Full 10G   None  No  1    0   089e.0193.0832  10Gserve
2       Up      Forward Full 10G   None  No  1    0   089e.0193.0832  10Gserve
23      Up      Blocked Full 10G   None  No  1    0   089e.0193.0832  10Gdeskt
24      Up      Forward Full 10G   None  No  1    0   089e.0193.0832  10Gdeskt

I removed all LACP config from all the individual ports, eg:

Code:

interface ethernet 1
 port-name 10Gserver1
!
interface ethernet 2
 port-name 10Gserver2

No trunks exist either. Looking through the full details via "show int", I can't see anything different between a non-working Blocked port and a working Forwarding port, besides it saying Blocked or Forwarding.

I've just been trawling the docs, and I see that this Blocked is part of STP. But I don't know why disabling LACP has caused this. I tried disabling spanning-tree globally on the switch, which made no difference, nor did re-enabling it (besides kicking me out of all TCP connections while it initially put all ports into 'Learn'!)

I've tried bringing up and down the NICs on both server and client multiple times, and pulling cables. I even power cycled the server completely, but as soon as it powered on, the switch port changed from Down, back to Blocked. So it can't be any OS config, as it showed Blocked before the server OS had even booted.

My next step is just to power-cycle the switch, but it'd be good to learn the proper way. I can't find anything in the docs that indicates how to clear this out - I guess it's there because some state exists it doesn't like. But I have no idea what that is.

EDIT: OK I power cycled it and the ports are working fine now. Would still be grateful if anyone can figure out what happened and how I could resolve it without a cold boot.

TheBloke · Jan 20, 2018

mixmansc said:
One thing that just crossed my mind that I had not considered.... what if RDMA is coming into play? RDMA does not play nice with nic teaming because the packets go directly to the adapter and bypass the network stack.

Good thoughts, thanks, but my X520s don't have RDMA, only RSS
