Win 10: Task Manager won't update, NOT frozen; constant 40GbE NIC errors in Event Viewer.


cudak888

New Member
May 27, 2022
This one has me stumped, and no amount of Google-fu seems to have brought up anything quite like this. Either that, or I need to make this post so Murphy's Law throws me just the right document afterward (isn't that always the way?).

Anyway:

I have two TrueNAS SCALE servers and a workstation running Windows 10. Each server is DAC-connected to the Windows box over a copper 40GbE cable, with matching Mellanox ConnectX-3 CX354A cards in all three boxes, all on static IP addresses. The Windows box specs are as follows:
  • HP Z8 G4
  • 2 Xeon Platinum 8180s (retired server equipment, not ES or QS)
  • 128 GB DDR4
  • 1TB NVMe boot drive
  • One Mellanox ConnectX-3 CX354A 40/56GbE NIC (in Ethernet 40GbE mode - Driver v5.50.14740, FW v2.42.5000)
The 8180s are a recent addition; the box was previously equipped with Xeon Silver 4110s.

Shortly after installing the 8180s, I've been having issues where - under fairly heavy use with some 3D creation software (working on files over the network) - the Windows Task Manager stops updating. I'd call it a freeze, except I can page through each tab just fine, even though the graphs and per-process resource readouts are frozen. A right-click will effectively freeze Task Manager, but it can be force-quit (sometimes with an auto-restart), at which point it may load again - once again, with no updates.

During one of these odd tantrums, I can also get the Control Panel and a few other admin-oriented parts of Windows to freeze up and become unresponsive, and some websites won't load (while others will) - yet nearly every other program loads and works flawlessly while all of this is happening.

I checked the processor specs in CPU-Z and verified them with the various Intel diagnostic and validation tools as it seemed like the logical thing to do; everything came back perfect.

I was left scratching my head until I looked at the Event Viewer, where I discovered this behavior almost always coincides with a system error logged every 10 seconds by the Mellanox mlx4_bus driver - right now, I'm looking at a solid hour and a half of them in a row. The modifiers sometimes change, but it's always Event ID 48, "Post FW command failed":

Code:
Native_166_0_0: Post FW command failed. op 0x24, status 0xff, errno -5, token 0xcdd1, in_modifier 0x1, op_modifier 0x3, in_param 22bca000.
These are almost always preceded by a few warnings:

Code:
Mellanox ConnectX-3 Ethernet Adapter #2 device reports a "TX cq stuck" on cqn #0xa2 uncompleted send #0x2. Therefore, the HCA Nic will be reset. (The issue is reported in Function LogTxCqError).
and

Code:
Adapter Mellanox ConnectX-3 Ethernet Adapter #2 detected that OID 0x1010103 is stuck.
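(An aside for anyone following along at home: these entries are easy to pull in bulk with PowerShell. A minimal sketch, assuming the events are registered under the provider name mlx4_bus as shown above:)

Code:
# Dump the 50 most recent mlx4_bus entries from the System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'mlx4_bus' } -MaxEvents 50 |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-Table -Wrap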
Speaking of stuck, so am I - I really haven't found anything to go on during this entire ordeal.

I've roughly confirmed the Mellanox cards go down at this point. I can't load the Network and Sharing Center during these episodes, but I can check the server transfer speeds through AJA System Test (it's not iperf, but it works well enough to give me a rough idea whether the connection is OK). Instead of ~1,000 MB/s reads and writes, I'm seeing ~100 MB/s transfers, which suggests traffic is falling back to the 1GbE NICs (all three boxes are also connected to an unmanaged 1GbE switch via their motherboard NICs).
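(For a more direct check than AJA, iperf3 confirms which link the traffic actually takes. A quick sketch - the 10.0.0.2 address here is made up, so substitute the server's 40GbE static IP; iperf3 is available from the TrueNAS SCALE shell, and a Windows build just needs to be unzipped on the client:)

Code:
# On the TrueNAS SCALE server:
iperf3 -s

# On the Windows box, aimed at the server's 40GbE address:
iperf3.exe -c 10.0.0.2 -t 10 -P 4
# Tens of Gbit/s = the 40GbE path is up; ~940 Mbit/s = traffic
# has fallen back to the 1GbE motherboard NIC.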

Other than a curious reference at Dell to an earlier ConnectX-3 driver that had stability issues on boxes built to handle more than 128 processor cores, I haven't found anything online that even comes close to this scenario. Plus, the two 8180s only take me to 112 logical processors counting Hyper-Threading, and I'm not getting a BSOD.


Glad to provide anything I've missed. It's 2 in the morning as I write this, so I could have missed quite a bit.

Any insight or ideas greatly appreciated.

-Kurt
 

cudak888

New Member
May 27, 2022
Quick update - the Mellanox driver finally BSOD'ed an hour later:

Code:
On Sat 10/8/2022 2:43:55 AM your computer crashed or a problem was reported

Crash dump file:  C:\Windows\Minidump\100822-42171-01.dmp (Minidump)
Bugcheck code:  0x9F(0x3, 0xFFFFA703F09ECA40, 0xFFFFC08ED942F110, 0xFFFFA70445BE6A70)
Bugcheck name: DRIVER_POWER_STATE_FAILURE
Driver or module in which error occurred:  mlx4_bus.sys (mlx4_bus+0x77663)
File path: C:\Windows\System32\drivers\mlx4_bus.sys
Description:  MLX4 Bus Driver
Product:  OpenFabrics Windows
Company:  Mellanox
Bug check description: This bug check indicates that the driver is in an inconsistent or invalid power state.
Analysis: A device object has been blocking an IRP for too long a time. This is likely caused by a hardware problem, but there is a possibility that this is caused by a misbehaving driver.
This bugcheck indicates that a timeout has occurred. This may be caused by a hardware failure such as a thermal issue or a bug in a driver for a hardware device.
Read this article on thermal issues
A full memory dump will likely provide more useful information on the cause of this particular bugcheck. A third party driver was identified as the probable root cause of this system error.
It is suggested you look for an update for the following driver:
mlx4_bus.sys (MLX4 Bus Driver, Mellanox).

Google query: mlx4_bus Mellanox DRIVER_POWER_STATE_FAILURE



The kernel memory dump at C:\Windows\MEMORY.DMP records the identical bugcheck and points to the same driver.
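(A note for anyone digging into the dump itself: WinDbg can usually name the blocked device stack directly. A minimal sketch using the bugcheck arguments shown above - on a 0x9F with first argument 0x3, the second argument is the device object and the fourth is the blocked IRP:)

Code:
* Summarize the bugcheck and let the analyzer name the culprit:
!analyze -v
* Arg 2 of a 0x9F/0x3 is the device object; show the driver stack under it:
!devstack 0xFFFFA703F09ECA40
* Arg 4 is the power IRP that never completed:
!irp 0xFFFFA70445BE6A70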
 

oneplane

Well-Known Member
Jul 23, 2021
Probably a bad PCIe fabric in your new CPUs, to the point where it can't handle some PCIe scenarios (be it load, lanes, commands, or otherwise).
If you put the old CPUs back and everything works normally again, that will verify it.
 

cudak888

New Member
May 27, 2022
Let me run this theory by you then - and add something I had forgotten to mention: could the Platinum 8180s be using parts of the motherboard's PCIe fabric that are themselves part of the problem?

The reason I ask is that the board in this box has - from Day 1 - occasionally reported a POST error as follows. It has happened just as often on the 4110s as it does on the 8180s:

Code:
928-Fatal PCIe error.
PCIe Surprise Link Down error detected on Slot 8. B:20 D:2 F:0
928-Fatal PCIe error.
PCIe Completion Timeout error detected on Slot 8. B:20 D:2 F:0

Press ENTER to continue.
This board doesn't actually have a Slot 8; according to HP, this error refers either to a lane on the Intel C622 chipset that's not used on the Z8 G4 board, or to a fault in one of the chipset's own PCIe controllers, which could affect the performance of the other slots/lanes.

Notably, there have never been any issues with the installed PCIe cards despite this warning.
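(One way to see what Windows itself thinks sits at bus 20, device 2, function 0 is to dump the PCI location strings of all present devices - a minimal PowerShell sketch, nothing HP-specific assumed:)

Code:
# Map present PnP devices to their PCI bus/device/function locations
Get-PnpDevice -PresentOnly | ForEach-Object {
    $loc = (Get-PnpDeviceProperty -InstanceId $_.InstanceId `
            -KeyName 'DEVPKEY_Device_LocationInfo' -ErrorAction SilentlyContinue).Data
    if ($loc -like 'PCI bus*') {
        [PSCustomObject]@{ Device = $_.FriendlyName; Location = $loc }
    }
} | Sort-Object Location | Format-Table -AutoSize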

Again, sorry to throw this left-field curveball into the mix, but given your response, I figured it very well could be part of the puzzle.

-Kurt
 

oneplane

Well-Known Member
Jul 23, 2021
It's possible, but that would mean that either the numbering is mismatched (so the firmware thinks Slot 8 is something different than the actual slot numbering on the PCB), or it's not talking about chipset lanes but about CPU lanes (since most PCIe lanes will be directly connected to the CPU and not the PCH).

There are a number of PCIe-via-PCH lanes, but the topology really depends on what HP thought would work best. Generally, on-board devices that they know won't need many lanes will be PCH-connected, AFAIK.

If you move the PCIe card to a different slot and observe the same problem, it might be the CPU or the card. Otherwise it could be the slot or the specific lanes wired to that slot. In general I haven't seen PCIe breakage that only affects a few lanes unless the PCB or CPU socket got mangled.

I have seen some PCIe lanes on the PCH do weird things, but that was mostly due to bandwidth constraints.

edit: perhaps it's simpler than that and it's just some link power management; if that's what causes the disconnects, you might have to disable it in firmware.
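(For what it's worth, ASPM can also be switched off from Windows as a first test, via powercfg's documented power-setting aliases - a sketch; 0 = off. A BIOS-level link power option, if present, would still need its own toggle:)

Code:
:: Disable PCIe Active State Power Management on the active power plan
powercfg /setacvalueindex SCHEME_CURRENT SUB_PCIEXPRESS ASPM 0
powercfg /setdcvalueindex SCHEME_CURRENT SUB_PCIEXPRESS ASPM 0
powercfg /setactive SCHEME_CURRENT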
 