Fix Intel I219-V Detected Hardware Unit Hang


René!

New Member
Jan 1, 2018
When you're using VLANs on your Intel I219-V, you might run into the following issue:
Code:
May 22 20:31:21 x759100 kernel: [7294020.176745] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
May 22 20:31:21 x759100 kernel: [7294020.176745]   TDH                  <e8>
May 22 20:31:21 x759100 kernel: [7294020.176745]   TDT                  <f>
May 22 20:31:21 x759100 kernel: [7294020.176745]   next_to_use          <f>
May 22 20:31:21 x759100 kernel: [7294020.176745]   next_to_clean        <e7>
May 22 20:31:21 x759100 kernel: [7294020.176745] buffer_info[next_to_clean]:
May 22 20:31:21 x759100 kernel: [7294020.176745]   time_stamp           <16caf8cc5>
May 22 20:31:21 x759100 kernel: [7294020.176745]   next_to_watch        <e8>
May 22 20:31:21 x759100 kernel: [7294020.176745]   jiffies              <16caf91c8>
May 22 20:31:21 x759100 kernel: [7294020.176745]   next_to_watch.status <0>
May 22 20:31:21 x759100 kernel: [7294020.176745] MAC Status             <40080083>
May 22 20:31:21 x759100 kernel: [7294020.176745] PHY Status             <796d>
May 22 20:31:21 x759100 kernel: [7294020.176745] PHY 1000BASE-T Status  <3800>
May 22 20:31:21 x759100 kernel: [7294020.176745] PHY Extended Status    <3000>
May 22 20:31:21 x759100 kernel: [7294020.176745] PCI Status             <10>
May 22 20:31:21 x759100 kernel: [7294020.400450] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
May 22 20:31:22 x759100 kernel: [7294020.491058] vmbr0: port 1(eno1) entered disabled state
This issue resets the networking and might also cause kernel panics. I didn't dig too deep into it, but it seems that with VLANs and lots of bridges the hardware exhausts its memory and crashes.

To fix this issue I've applied the following change to `/etc/network/interfaces`:
Code:
    post-up /sbin/ethtool -K $IFACE tso off gso off
This disables TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO), so there will probably be some sort of performance hit on the CPU. However, after this change I'm no longer able to trigger the bug. The hardware is a Lenovo m720q and I'm running Debian 11 (bullseye).
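For context, a complete stanza could look like the sketch below. The `eno1` and `vmbr0` names mirror the kernel log above; the VLAN-aware bridge options are illustrative placeholders that assume ifupdown2 as shipped by Proxmox, and only the `post-up` line is the actual workaround:

```
auto eno1
iface eno1 inet manual
    post-up /sbin/ethtool -K $IFACE tso off gso off

auto vmbr0
iface vmbr0 inet dhcp
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

After bringing the interface up you can check that the offloads really are off with `ethtool -k eno1 | grep segmentation`.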
 

gb00s

Well-Known Member
Jul 25, 2018
Poland
If I'm not mistaken this is an I219-V issue with TSO & GSO per se, known at least since 2019. I thought it was fixed by a kernel update around 5.4 LTS.
 

René!

New Member
Jan 1, 2018
Well, I'm running kernel version 5.10 and it doesn't seem to be solved unless I disable TSO and GSO.
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
IMHO it's not solved even if you disable TSO and GSO in kernels > 5.4. I was still getting a hung interface on anything newer and had to downgrade, so I recommend watching the kernel log like a hawk for a bit. Fortunately 5.4 is supported upstream until Dec 2025, at which point I will either throw out all machines that have this chip or get a separate PCIe NIC. The last solid NICs Intel brought out were the i210, i211, X520 and X540, and those are all 10 years old now. I wasted 1-2 full days debugging this and won't buy a machine with an i219 ever again.

Udev rule for /etc/udev/rules.d/99-intel.rules (udev only reads files ending in .rules):

SUBSYSTEM=="net", ACTION=="add", ATTRS{vendor}=="0x8086", ATTRS{device}=="0x15b7", RUN+="/usr/bin/ethtool -K $name tso off gso off"
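As a sanity check: `0x8086` is Intel's PCI vendor ID and `0x15b7` should be this NIC's device ID (confirm yours with `lspci -nn`), and udev substitutes `$name` with the matched interface name before running the command. A tiny sketch of that expansion, using a hypothetical helper function that is not part of udev:

```shell
#!/bin/sh
# Hypothetical helper mirroring the command udev's RUN clause executes
# once $name has been replaced with the matched interface name.
expand_run_clause() {
    printf '/usr/bin/ethtool -K %s tso off gso off\n' "$1"
}

# For an interface named eno1 the rule ends up running:
expand_run_clause eno1
# → /usr/bin/ethtool -K eno1 tso off gso off
```

After editing the file, `udevadm control --reload` picks up the rule without a reboot.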
 

diskdiddler

Member
Mar 3, 2017
Stephan said:
IMHO it's not solved even if you disable TSO and GSO in kernels > 5.4. ... I wasted 1-2 full days debugging this, won't buy a machine with i219 ever again.
Sorry, can I just clarify for a moment here and make sure I'm reading this correctly:
Are you saying that kernels AFTER 5.4 are actually worse for this bug than ones before 5.4?

I am currently knee-deep in a file corruption issue which I believe was caused by this.
I'm currently using Proxmox -
Linux 6.8.8-2-pve (2024-06-24T09:00Z)

I'm performing testing now that I've finally learnt the correct command as mentioned here:
How To Fix Proxmox Detected Hardware Unit Hang On Intel NICs - First2Host

Will update if that has not fixed my file corruption problems.
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
@diskdiddler Been running 6.6.x kernels for a while now and haven't seen the e1000e driver drop the link. I still use tso off gso off in udev though, because I mistrust any acceleration on this product. Back when I wrote this, the 5.4 LTS kernel was the only reliable kernel I could find.

The latest bug I have seen on 12xxx/13xxx/14xxx CPU systems with this NIC was the ME producing IPv6 multicast (i.e. broadcast) storms in the megabit/s range while Windows is asleep or shut down. If you google "HBH ICMP6 multicast listener storm" you can see this affecting all sorts of vendors. The only workaround is to disable AMT in the BIOS. Or wait a little for the Intel CPU to die due to (per my last info) corroded vias on the chip.

Hire engineers, fire suits and DEI. Unpopular, until the damage to the company becomes life-threateningly large.
 

diskdiddler

Member
Mar 3, 2017
I can confirm turning TSO and GSO off entirely has fixed my problem, it's been a heck of a thing, heck of a thing. Really caused me some problems this last week.

I couldn't disable AMT; for a homelab guy it's my life, it's incredibly useful!

Thanks for your reply. I wish Intel or someone could nail this long term, OR that the driver shipped with these flags disabled unless manually enabled, not the other way around.
 

TopQuark

New Member
Mar 7, 2018
I recently updated my VMware install and all my VMs no longer worked. It was an install going all the way back to ESXi 6.7, upgraded along the way to 8.0U3. It took me days to figure out until I saw this thread.

It was an Ethernet Connection (7) I219-LM, to be exact.
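For anyone hitting this on ESXi rather than Linux: if memory serves, the analogue of the Linux workaround is turning off hardware TSO in the host's advanced settings. This is a sketch from memory (the `/Net/UseHwTSO` and `/Net/UseHwTSO6` option names are documented by VMware, but double-check them on your build before relying on this):

```
# On the ESXi host, disable hardware TSO for IPv4 and IPv6 traffic:
esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
```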
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
Maybe after a takeover of Intel by Qualcomm the problem will be 'solved' for good. Or by following the example of Broadcom after the VMware takeover.