Hmm, interesting.
Assuming you've confirmed that all your backplanes always work when connected directly to your HBA, I'm wondering if your issue 1 is not only related to mine, but is in fact the exact same issue as my "staggered power-on" - just with far less severe and intermittent symptoms.
Way back on page 1 I first described the issue I was having. Specifically, when I powered up the expander at the same time as the drives/backplanes, I would only see two or three (forget which) drives out of the 15 I had connected at the time. But it was always the same drives. Or rather, it was always the same lanes. So just like you, if I swapped drives around, which drives were detected would change, according to which expander port they were connected to.
Therefore I wonder if we're seeing the same thing, just for me it affected 13 out of 16 lanes, and for you it only sometimes affects 1 out of 16 lanes.
If I'm right, then you will *never* see the issue if you do the same staggered power-on (or staggered cabling) that I do. The way I found out about staggered power-on was testing in the LSI BIOS: I was doing the sort of cable swapping I describe above, thinking that I must have bad ports on the expander or something. I was in the process of unplugging cables from the expander, and then happened to start up the server before I plugged them back in. I then plugged them back in with the system on, sitting in the LSI BIOS - therefore after the drives had powered up - and lo and behold, all drives were detected.
So I wonder if you can do the same tests. First, power everything on at the same time with all cables connected, and drop into the LSI BIOS rather than boot the OS. Check your available drives. If you can consistently do this and get the full 16 drives, then that's already different to me. But try this a few times (powering down between each test). See if you can get 15 drives a few times. We know your loss of 1 lane is intermittent, but perhaps it happens often enough that you can confirm it repeatedly. Below, I theorise it could be timing related.
If you can prove that you can sometimes lose 1 lane in the above test, then follow by duplicating what I did: boot everything (server, BPs and disks) with no drive/BP cables connected to the expander. So the LSI HBA is cabled to the expander, but the expander is not cabled to the BPs. Drop into the LSI BIOS, then connect up the backplane cables while watching the BIOS. See if this method always gives you 16 drives.
If your issue is intermittent you might have to repeat these tests a number of times. But if I'm right, you should notice that the missing lane only happens in the first test, and never happens when the drives are allowed to power up without being connected to the expander. In which case, it should be solved by a later FW.
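If it helps to keep track across a lot of reboots, here is a rough Python sketch for tallying the results. It's only bookkeeping: you read the drive count off the LSI BIOS screen yourself and type it in. The method names ("all-at-once", "staggered"), the `summarise` function and the numbers in the example list are all made up for illustration, not results from either of our systems.

```python
from collections import defaultdict

EXPECTED_DRIVES = 16  # assumption: 16 drives attached, as in your setup


def summarise(trials, expected=EXPECTED_DRIVES):
    """Tally (method, drives_seen) pairs into per-method statistics."""
    summary = defaultdict(lambda: {"runs": 0, "short": 0, "min_seen": expected})
    for method, seen in trials:
        stats = summary[method]
        stats["runs"] += 1
        if seen < expected:
            stats["short"] += 1  # this boot came up with at least one drive missing
        stats["min_seen"] = min(stats["min_seen"], seen)
    return dict(summary)


# Placeholder numbers only, to show the shape of the input.
example_trials = [
    ("all-at-once", 16),
    ("all-at-once", 15),
    ("all-at-once", 16),
    ("staggered", 16),
    ("staggered", 16),
]

for method, stats in summarise(example_trials).items():
    print(method, stats)
```

If the theory holds, the "short" count for the staggered method should stay at zero however many times you repeat it, while the all-at-once method should show a missing drive every so often.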
Maybe your issue is different, but it certainly sounds like it could be the same or similar, despite the seeming difference in symptoms. The problem could well be related to the hardware attached - like maybe your BPs or disks initialise faster than mine. Or even that your BP/drives initialise slower than mine. Either way, a timing issue could explain why the issue always happens severely to me, but only happens intermittently, and much less severely, for you.
In my tests, the problem goes away both with a staggered power-on, and with staggered cabling. This indicates that the issue only occurs when the expander can see the drives while the expander is first initialising. To me this implies that the expander isn't ready to connect to the drives, but then for some reason also stops polling for them; as if they 'half detect', and therefore it stops looking. Maybe it's erroneously checking the lanes before it's ready to connect to them, and because it's not ready, those lanes go into a hung/limbo state until either it restarts or the cables are re-plugged.
If it is anything like that, then the symptoms could definitely be dependent on the attached equipment. For example, if my drives/BPs initialise faster than yours - i.e. they're ready to start communicating with the expander sooner after power-on than yours are - that could put them into the window during which the expander is not ready. Whereas in your case, perhaps the drives and/or BPs taking longer to get ready would be similar in effect to a staggered connection: by the time they are ready, the expander is also ready. This might also explain why you sometimes get all drives, and sometimes miss one: the spin-up time of drives can vary a little, so perhaps a small fluctuation in the start-up time of the drives decides whether you get all, or miss one.
The fact that I do get 3 drives working also implies that the lanes are checked sequentially. If so, that means that the 3 lanes that work for me are the last three it checks, and by the time it gets to them, it's completed its initialisation and is ready to actually connect. The other 13 are checked too soon and always hang. I recall that the three drives that worked were always on a single BP port, which fits that theory.
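To make the guess concrete, here is a toy Python model of it. Everything in it is an assumption invented for illustration: that the expander first checks lane N at N poll intervals after power-on, that a lane hangs only if its drive is already up when checked and the expander has not yet finished initialising, and all of the timing numbers. It says nothing about what the real firmware does.

```python
def detected_lanes(drive_ready_at, expander_ready_at, poll_interval=1.0):
    """Return the lanes that end up detected under the toy timing model.

    drive_ready_at: per-lane time at which the drive/BP starts talking.
    expander_ready_at: time at which the expander can really connect.
    Assumed rule: lane N is first checked at N * poll_interval, and it
    hangs if the drive is already up but the expander is not yet ready.
    """
    detected = []
    for lane, drive_ready in enumerate(drive_ready_at):
        first_check = lane * poll_interval
        hangs = drive_ready <= first_check < expander_ready_at
        if not hangs:
            detected.append(lane)
    return detected


# Fast drives (my case in this model): everything is up before the first check.
fast = detected_lanes([0.0] * 16, expander_ready_at=13.0)
print(len(fast), fast)  # 3 [13, 14, 15]

# Slow drives, or staggered power-on/cabling: drives appear after the expander is ready.
slow = detected_lanes([20.0] * 16, expander_ready_at=13.0)
print(len(slow))  # 16

# Borderline drives (possibly your case): one drive comes up slightly early.
borderline = [12.5] * 16
borderline[12] = 11.5
print(len(detected_lanes(borderline, expander_ready_at=13.0)))  # 15
```

In this model only the last lanes to be checked survive when the drives are fast, nothing is lost when the drives come up late, and a small variation in one drive's start-up time is enough to lose a single lane.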
As you can tell I'm completely guessing on all this. But a timing-based problem does sound plausible to me given the resolution I've found, and could describe both our sets of symptoms.
Regarding your third point - no, I've never verified it outside Solaris. Nor do I think I can, as to do so needs enough drives to perf test 2 x 6Gb connections versus 1 x 6Gb connections, and I just don't have enough spare drives. In fact I barely have any (working) spare drives left.
I suppose it should be possible to USB boot my server into Linux and run a raw read-only test - e.g. using dd to read raw data from every drive - that could verify the speed difference using the same drives I use in Solaris. But that would also mean re-flashing an expander to 634A, knowing that even if it did work, I couldn't benefit from it.
So I never bothered to test that. If it becomes a big issue for someone I could try it out. But it's easy enough to test yourself.
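For anyone who wants to try, here is a minimal Python sketch of that kind of raw read-only test, as a stand-in for running one dd per drive in parallel. The function names and the default sizes are my own choices, and any device paths you pass in (something like /dev/sdX under Linux) are placeholders you'd need to substitute. It only ever opens devices read-only. I haven't run it against an expander, so treat it as a starting point.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def read_throughput(path, total_bytes=1 << 30, block_size=1 << 20):
    """Sequentially read up to total_bytes from path; return (bytes_read, seconds)."""
    fd = os.open(path, os.O_RDONLY)  # read-only, never writes to the device
    done = 0
    start = time.monotonic()
    try:
        while done < total_bytes:
            chunk = os.read(fd, min(block_size, total_bytes - done))
            if not chunk:  # reached end of device/file
                break
            done += len(chunk)
    finally:
        os.close(fd)
    return done, time.monotonic() - start


def aggregate_mb_per_s(paths, **kwargs):
    """Read all paths at once (one thread each) and return combined MB/s."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        results = list(pool.map(lambda p: read_throughput(p, **kwargs), paths))
    elapsed = max(time.monotonic() - start, 1e-9)  # guard against a zero interval
    return sum(n for n, _ in results) / elapsed / 1e6
```

Run it as root with the list of drives behind the expander, once with one HBA link cabled and once with two, and compare the aggregate figures. Reads go through the OS page cache, so read enough data per drive (or a different region each time) that caching doesn't flatter the numbers.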
And anyway, I still find it hard to imagine there being an OS-driver-specific issue that could be impacted by the FW of an expander. And not even in an "only one port works" kind of way, but in a way where both ports connect, and bandwidth is slightly higher with two, just not high enough. Just seems like it must be purely related to the expander FW alone, or at most expander FW + HBA FW. But not OS driver as well.