Dead/dying onboard NICs?


EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Noticed last night that one of the ports coming out of my NAS (set up in balance-alb) was only showing up at 10Mb/s on the switch. These are the onboard i210s on an ASRock E3C226D2I, the OS is Debian stretch, and MAC addresses have been changed to protect the innocent.
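For anyone wanting to compare against their own box, the negotiated speed per slave and the bonding driver's view of the links can be checked with something like this (just a sketch; interface and bond names as in the output below):
Code:
# negotiated link speed as reported by the driver (needs ethtool installed)
ethtool eth0 | grep -i speed
ethtool eth1 | grep -i speed
# the bonding driver's view of each slave's link state
cat /proc/net/bonding/bond0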

Device shows up in ifconfig but refuses to come up;
Code:
effrafax@wug:~# ifconfig -a
<snip>
eth0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 00:01:02:03:04:05  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xf7200000-f727ffff

eth1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
        ether 00:01:02:03:04:06  txqueuelen 1000  (Ethernet)
        RX packets 1467379  bytes 1505855146 (1.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1810211  bytes 1890508373 (1.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xf7100000-f717ffff
</snip>

effrafax@wug:~# ifconfig eth0 up
SIOCSIFFLAGS: No such device
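In case anyone wants to poke at theirs the same way, the iproute2 equivalent and a quick PCI-level sanity check would be roughly the following (bus address 0000:03:00.0 taken from the lshw output below, so adjust for your own hardware):
Code:
ip link set eth0 up                          # iproute2 equivalent of ifconfig eth0 up
lspci -vv -s 0000:03:00.0 | grep -i lnksta   # does the port still answer on the PCIe bus?
ethtool -i eth0                              # driver/firmware details, if the kernel can still reach it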
Similarly it shows as "disabled" in lshw;
Code:
effrafax@wug:~# lshw -C network
  *-network DISABLED
       description: Ethernet interface
       product: I210 Gigabit Network Connection
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0000:03:00.0
       logical name: eth0
       version: 03
       serial: 00:01:02:03:04:05
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi msix pciexpress cap_list ethernet physical
       configuration: broadcast=yes driver=igb latency=0 multicast=yes
       resources: irq:18 memory:f7200000-f727ffff ioport:d000(size=32) memory:f7280000-f7283fff
  *-network
       description: Ethernet interface
       product: I210 Gigabit Network Connection
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0000:04:00.0
       logical name: eth1
       version: 03
       serial: 00:01:02:03:04:06
       size: 1Gbit/s
       capacity: 1Gbit/s
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi msix pciexpress bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.4.0-k duplex=full firmware=3.16, 0x800004d6 latency=0 link=yes multicast=yes port=twisted pair slave=yes speed=1Gbit/s
       resources: irq:19 memory:f7100000-f717ffff ioport:c000(size=32) memory:f7180000-f7183fff
<snip>
No obvious errors in dmesg, device node appears but never seems to initialise;
Code:
effrafax@wug:~# dmesg|egrep -i "eth|net|igb"|grep -v vlan
[    0.119508] NET: Registered protocol family 16
[    0.274181] Method parse/execution failed
[    0.274241] ACPI Exception: AE_NOT_FOUND, while evaluating GPE method [_L24] (20160831/evgpe-646)
[    0.290277] NET: Registered protocol family 2
[    0.290755] NET: Registered protocol family 1
[    0.576882] audit: initializing netlink subsys (disabled)
[    0.625172] NET: Registered protocol family 10
[    0.625784] NET: Registered protocol family 17
[    0.626829] microcode: Microcode Update Driver: v2.01 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
[    0.986498] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[    0.986500] igb: Copyright (c) 2007-2014 Intel Corporation.
[    1.220391] igb 0000:03:00.0: added PHC on eth0
[    1.220392] igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
[    1.220393] igb 0000:03:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:01:02:03:04:05
[    1.220463] igb 0000:03:00.0: eth0: PBA No: 001300-000
[    1.220464] igb 0000:03:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[    1.461859] igb 0000:04:00.0: added PHC on eth1
[    1.461860] igb 0000:04:00.0: Intel(R) Gigabit Ethernet Network Connection
[    1.461861] igb 0000:04:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 00:01:02:03:04:06
[    1.461930] igb 0000:04:00.0: eth1: PBA No: 001300-000
[    1.461931] igb 0000:04:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[   13.271969] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[   13.276105] bond0: Adding slave eth0
[   13.277178] bond0: Adding slave eth1
[   13.323402] 8021q: adding VLAN 0 to HW filter on device eth1
[   13.323495] bond0: Enslaving eth1 as an active interface with a down link
[   13.352312] igb 0000:04:00.0: changing MTU from 1500 to 9000
[   13.392457] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[   13.857829] FS-Cache: Netfs 'nfs' registered for caching
[   13.924800] ip_tables: (C) 2000-2006 Netfilter Core Team
[   13.937973] NET: Registered protocol family 15
[   13.942361] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   17.894638] igb 0000:04:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   18.014044] bond0: link status up for interface eth1, enabling it in 0 ms
[   18.014219] bond0: link status definitely up for interface eth1, 1000 Mbps full duplex
[   18.014220] bond0: making interface eth1 the new active one
[   18.014755] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[   35.448634] NFSD: starting 90-second grace period (net ffffffffbb6dbe00)
Has anyone else seen behaviour like this before, or got any clues as to what might be wrong? This is a production system so I've not been able to do any more in-depth investigation with bootable OSes etc., but it's got me rather worried that the other NIC might also go pop; doubly worrisome is that a near-identical E3C224D2I had both its NICs go pop on us last year (albeit after an electrician tripped a breaker). As it's an mITX system with the slot already taken by an HBA, there's no room for a PCIe NIC, so I might need to look at a pre-emptive replacement.
 

Blinky 42

Active Member
Aug 6, 2015
PA, USA
Sadly I have seen it happen on systems over the years with no real rhyme or reason: a mix of vendors, and a mix of on-board and expansion cards with the issue. I have also seen individual ports die on the switch side, so it may be easier for you to test the switch side first and see if that is the problem in your case.
If it wasn't something simple like a cable that needed replacing, then in my experience the ports have always been unreliable from that point on and I've had to work around them.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
I've already tried with a different cable and switchport on the same NIC, no difference. Looks like I need to break glass and grab the Contingency Credit Card.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Just in case anyone else runs into this, it eventually turned out to be a software problem.

Once the board was out of commission it went onto the testbed; with our regular Debian install the problem kept happening, but with bootable/live ISOs the NICs behaved fine. I eventually traced the problem back to the following udev rule that was present in the initramfs:
Code:
ACTION=="add", SUBSYSTEM=="pci", ATTR{power/control}="auto"
It seems that, sometime in the last year or two, changes to the driver/kernel meant that if ASPM was enabled for the i210s, the hardware would initialise but then stop responding. So when the system booted the NICs appeared fine (hence why they showed up in dmesg and ifconfig looking healthy), but as soon as PCIe ASPM kicked in they'd become unresponsive. Disabling PCIe ASPM and reloading the igb module fixed the problem temporarily, and it was fixed entirely once the udev rule was expunged from the initrd.
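A rough sketch of both fixes in command form, in case it saves anyone else some head-scratching (device address and initrd handling are from my Debian box, so adjust as needed):
Code:
# temporary workaround: keep the NIC out of runtime power management and reinitialise igb
echo on > /sys/bus/pci/devices/0000:03:00.0/power/control
modprobe -r igb && modprobe igb

# permanent fix: check the offending rule is gone from the initrd, then rebuild it
lsinitramfs /boot/initrd.img-$(uname -r) | grep 'rules.d'
update-initramfs -u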