This one has me stumped, and no amount of Google-fu has turned up anything quite like it. Either that, or I need to make this post so Murphy's Law can throw me just the right document right after I've posted (isn't that always the way?).
Anyway:
I have two TrueNAS SCALE servers and a workstation running Windows 10. Both servers connect directly to the Windows box over copper 40GbE DAC cables, with matching Mellanox ConnectX-3 CX354A cards in each machine, and everything is set up with static IP addresses. The Windows box specs are as follows:
- HP Z8 G4
- 2 Xeon Platinum 8180s (retired server equipment, not ES or QS)
- 128 GB DDR4
- 1TB NVMe boot drive
- One Mellanox ConnectX-3 CX354A 40/56GbE NIC (in Ethernet 40GbE mode - Driver v5.50.14740, FW v2.42.5000)
Shortly after installing the 8180s, I started having issues where - under fairly heavy use with some 3D creation software (working with files over the network) - the Windows Task Manager stops updating. I'd call it a freeze, except I can page through each tab just fine, even though the graphs and per-program resource figures are frozen. A right click will effectively freeze Task Manager, though I can still force-quit it (sometimes it auto-restarts), at which point it may load again - still without updating.
During one of these odd tantrums I can also get the Control Panel and a few other admin-oriented parts of Windows to freeze up and become unresponsive, and some websites won't load (while others will). Most every other program loads and works flawlessly while all of this is happening.
I checked the processor specs in CPU-Z and verified them with the various Intel diagnostic and validation tools as it seemed like the logical thing to do; everything came back perfect.
I was left scratching my head at this until I looked at the Event Viewer, where I've discovered this behavior is almost always concurrent with a consistent system error occurring every 10 seconds from the Mellanox mlx4_bus - right now, I'm looking at a solid hour and a half of them in a row. The modifiers sometimes change, but it's always an Event ID 48, "Post FW command failed:"
Code:
Native_166_0_0: Post FW command failed. op 0x24, status 0xff, errno -5, token 0xcdd1, in_modifier 0x1, op_modifier 0x3, in_param 22bca000.
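In case it helps anyone compare these entries against their own, here is a minimal Python sketch that splits the error line above into its named fields. The regex and the `parse_fw_failure` helper are my own illustration, not anything from the Mellanox tooling, and mapping `errno -5` to `EIO` assumes the driver uses POSIX-style errno numbering, which I have not confirmed.

```python
import errno
import re

# The Event ID 48 message exactly as it appears in the Event Viewer.
LINE = ("Native_166_0_0: Post FW command failed. op 0x24, status 0xff, "
        "errno -5, token 0xcdd1, in_modifier 0x1, op_modifier 0x3, "
        "in_param 22bca000.")

# A field is "<name> <value>" followed by a comma or the closing period.
FIELD_RE = re.compile(r"(\w+) (-?(?:0x)?[0-9a-fA-F]+)(?=[,.])")


def parse_fw_failure(line):
    """Return the name/value pairs that follow 'Post FW command failed.'"""
    _, _, tail = line.partition("Post FW command failed.")
    fields = {}
    for name, raw in FIELD_RE.findall(tail):
        # errno is printed in decimal; every other field looks hexadecimal,
        # with or without a 0x prefix (assumption based on this one sample).
        base = 10 if name == "errno" else 16
        fields[name] = int(raw, base)
    return fields


fields = parse_fw_failure(LINE)
for name, value in fields.items():
    print(f"{name:12} {value}")

# Assumption: POSIX-style numbering, where 5 is EIO (I/O error).
print("errno name:", errno.errorcode.get(abs(fields["errno"]), "unknown"))
```

Running the same function over an exported list of these events would make it easier to see which modifiers change between entries and which stay constant.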
These are almost always preceded by a few warnings:
Code:
Mellanox ConnectX-3 Ethernet Adapter #2 device reports a "TX cq stuck" on cqn #0xa2 uncompleted send #0x2. Therefore, the HCA Nic will be reset. (The issue is reported in Function LogTxCqError).
and
Code:
Adapter Mellanox ConnectX-3 Ethernet Adapter #2 detected that OID 0x1010103 is stuck.
Speaking of stuck, so am I. I really haven't found anything to go on. I've roughly confirmed the Mellanox cards go down at this point. I can't load the Network and Sharing Center during this time, but I can check the server transfer speeds through AJA System Test (it's not iperf, but it gives me a rough idea of whether my connection is OK). Instead of 1000 MB/s writes and reads, I'm seeing 100 MB/s transfers, which suggests the connection is falling back to the 1GbE NICs (all three machines are also connected to an unmanaged 1GbE switch through their motherboard NICs).
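The reasoning behind that fallback guess can be sketched in a few lines of Python. This assumes the AJA figures are in megabytes per second and ignores protocol overhead, so the ceilings are theoretical maximums rather than what either link would actually deliver; the function names are just for illustration.

```python
def link_ceiling_mb_s(link_gbit_s):
    """Theoretical ceiling in MB/s for a link rated in Gbit/s (no overhead)."""
    return link_gbit_s * 1000 / 8


def fits_on_link(observed_mb_s, link_gbit_s):
    """True if the observed rate is physically possible on the given link."""
    return observed_mb_s <= link_ceiling_mb_s(link_gbit_s)


for observed in (1000, 100):       # normal rate vs. rate during the fault
    for link in (1, 40):           # motherboard 1GbE vs. ConnectX-3 40GbE
        verdict = "possible" if fits_on_link(observed, link) else "impossible"
        print(f"{observed} MB/s over {link}GbE "
              f"(ceiling {link_ceiling_mb_s(link):.0f} MB/s): {verdict}")
```

The normal 1000 MB/s figure cannot come from a 1GbE link, whose ceiling is 125 MB/s, while the 100 MB/s seen during the fault sits just under that ceiling, which is what you would expect from traffic rerouted over the 1GbE switch.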
Other than a curious reference on Dell's site to an earlier ConnectX-3 driver that had stability issues on boxes built to handle more than 128 processor cores, I haven't found anything online that comes close to this scenario. Besides, the two 8180s only bring me to 112 logical processors with Hyper-Threading, and I'm not getting a BSOD.
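For reference, the 112 figure works out as below. This is a quick sketch, with the 28 cores per socket taken from the Xeon Platinum 8180 spec and the 128 threshold taken from the Dell note as I remember it.

```python
SOCKETS = 2
CORES_PER_SOCKET = 28      # Xeon Platinum 8180
THREADS_PER_CORE = 2       # Hyper-Threading enabled
DELL_NOTE_THRESHOLD = 128  # figure quoted in the Dell driver note

physical_cores = SOCKETS * CORES_PER_SOCKET
logical_processors = physical_cores * THREADS_PER_CORE

print("physical cores:    ", physical_cores)
print("logical processors:", logical_processors)
print("over the threshold:", logical_processors > DELL_NOTE_THRESHOLD)
```

Whether you count physical cores or logical processors, this box stays under the threshold in that note.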
Glad to provide anything I've missed. It's 2 in the morning as I write this so I could have missed quite a bit.
Any insight or ideas greatly appreciated.
-Kurt