X9DAI on Debian Buster freezes up overnight

andybaran

New Member
Jul 7, 2020
13
0
1
I have an older X9DAI running Debian Buster for a pi-hole, hass-io, a few docker containers and some NFS shares backed by OpenZFS. It runs very well, except that if it's left idle for a while it just freezes up. When it freezes, it will not respond to network traffic (the Mellanox card is the only one plugged in) or to whacking keys on the USB keyboard that is physically plugged into it. I am running the latest BIOS for the board itself and the latest firmware for the PCIe adapters mentioned below. I have turned off all power saving in the BIOS and in the OS. Has anyone seen anything like this? Otherwise, although old, it's a great board and I'd hate to replace it :(


  1. 2 x e5 2620 v2's
  2. 40 GB RAM (5 x 8GB LDDR DIMM)
  3. Mellanox X3 with 10Gb DAC
  4. LSI 9211-8i in IT mode (3 x 3TB SAS drives for NFS shares)
  5. StarTech M.2 PCIe adapter (ZIL and L2ARC for ZFS)
  6. ATI Radeon 5500 for video
  7. SSD for boot
  8. A bunch of other random drives
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,243
421
83
When you say "freezes up" do you mean it crashes...? Is there anything shown on the console (a kernel panic or oops, for example) and are there any memory errors reported in either edac-utils or the IPMI log? Has it only just started doing this or has it been this way since forever?

Edit: doesn't look like the X9DAI has an IPMI chip or alternate out-of-band management, is that correct?
 

andybaran

Correct, no OOB. edac-utils isn't something I'm familiar with, so I'll RTFM on that tonight and look into it before I inevitably need to hit the reset button tomorrow AM :) It's weird: it really seems like it might just be going into some sort of low power state and not coming out of it, but other than entries simply no longer being written, I can't find anything in my syslogs.
 

j_h_o

Active Member
Apr 21, 2015
470
111
43
California, US
Assuming you can take it out of production for a while, can you run an extended RAM/memory test off an ISO/CD? Does it lock up while that test is running overnight?
 

EffrafaxOfWug

If you're using Debian, it should have memtest86+ as an installable package that'll show up in your boot menu.

Without IPMI you're not going to be able to tell if it's a memory error that resulted in the machine crashing, although edac-util might tell you if there were corrected single-bit memory errors (if there are, it's a fair guess that memory problems are involved). However from the symptoms you describe it might just be a piece of hardware or subsystem crashing for whatever reason so I'd be interested to know if anything makes it to the console - normally the kernel survives long enough to at least print something even if it doesn't make it to the discs.
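
For anyone who wants to look at the same counters without edac-util: a rough Python sketch, assuming the kernel's EDAC driver is loaded and exposes the usual ce_count / ue_count files under /sys/devices/system/edac/mc. The function name is made up for illustration, and this is not a replacement for edac-util, just a quick way to see whether corrected errors are piling up.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc / name
            # A missing file is reported as None rather than guessed as 0.
            entry[key] = int(counter.read_text().strip()) if counter.is_file() else None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (is the EDAC driver loaded?)")
    for mc, c in result.items():
        print(f"{mc}: corrected={c['ce']} uncorrected={c['ue']}")
```

A non-zero corrected count would point towards memory; an empty result only means the counters aren't available, not that the RAM is fine.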

Is the ATI video chipset an add-on card or one of those embedded jobs?
 

MBastian

Member
Jul 17, 2016
70
12
8
Düsseldorf, Germany
You can up the console output level of systemd and kernel messages. In case you're running X, make sure you switch to a text console and don't forget to disable the screen blanker.

Also, you can try some Magic SysRq.

Edit: I had similar freezes with an X9DRi-LN4F and SanDisk Extreme Pro NVMes (aka WD SN750) due to their shitty ASPM implementation. Check if ASPM support is turned on in the BIOS. Also, pull back on the BIOS power saving options. Most old boards don't play well beyond S3.
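
To see where the console log level and Magic SysRq currently stand before changing anything, something like the following Python sketch should do. It assumes the standard /proc/sys/kernel/printk and /proc/sys/kernel/sysrq files; the function names and the dictionary keys are just labels picked for this example.

```python
def parse_printk(text):
    """Parse the contents of /proc/sys/kernel/printk into four named levels."""
    names = ("console", "default_message", "minimum_console", "default_console")
    values = [int(v) for v in text.split()]
    if len(values) != len(names):
        raise ValueError(f"expected 4 values, got {len(values)}")
    return dict(zip(names, values))


def sysrq_enabled(text):
    """True if /proc/sys/kernel/sysrq holds any non-zero value.

    0 disables SysRq entirely; other values enable all or some functions.
    """
    return int(text.strip()) != 0


if __name__ == "__main__":
    with open("/proc/sys/kernel/printk") as f:
        levels = parse_printk(f.read())
    with open("/proc/sys/kernel/sysrq") as f:
        enabled = sysrq_enabled(f.read())
    # A higher console level lets more kernel messages reach the console.
    print(f"console log level: {levels['console']}")
    print(f"Magic SysRq enabled (at least partially): {enabled}")
```

If the console level is low, the last messages before a freeze may never be printed, which is the whole point of raising it.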
 

andybaran

So I gave memtest a run overnight and it got through a full pass without any errors. All CPU power saving is flat out disabled in the BIOS. I also disabled RAPL b/c why not?

I'll check on ASPM later this evening. I have no idea what it is, so I will need to research a bit as well.

That's a really good call on X, upping output and switching to a console. I'm going to uninstall X altogether this evening since it's not needed.

Also, the ATI is an add-in card.
 

MBastian

andybaran said: "I'll check on ASPM later this evening. I have no idea what it is so will need to research a bit as well."
Check if "ASPM support" under "PCIe/PCI/PnP Configuration" is set to disabled.
Normally you don't have to configure anything if you set it to enabled, but there are quite a few devices out there that can freeze the system if they enter low power states. I think I remember seeing issues regarding certain Mellanox cards in the Supermicro support base for my motherboard, but I am not sure.
If you are paranoid like me, double check with "dmesg | grep ASPM" and/or add pcie_aspm=off to the kernel command line (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub) just to be sure.

Edit: I've got it working for my misbehaving WD/Sandisk NVMEs by limiting the lower power states. It does not have any measurable impact on the power consumption on my system but idle NVME temperatures did go down a bit. Strangely the OEM variant SN730 was not affected.
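
A small Python sketch for double checking after a reboot that the setting actually took: it looks for the pcie_aspm=off token in /proc/cmdline and reads the active policy from /sys/module/pcie_aspm/parameters/policy, where the kernel marks the current choice with square brackets. The function names are made up for this example, and the policy file may not be present on every kernel build.

```python
def aspm_disabled_on_cmdline(cmdline):
    """True if the kernel command line contains the pcie_aspm=off token."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry from the ASPM policy file, or None."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(f.read()))
    try:
        with open("/sys/module/pcie_aspm/parameters/policy") as f:
            print("active ASPM policy:", active_aspm_policy(f.read()))
    except OSError:
        print("ASPM policy file not readable on this kernel")
```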
 

andybaran

Did a quick dmesg | grep ASPM while at work and got some output that certainly suggests we might be on the right track. I'll dig into the BIOS settings tonight, disable ASPM and let you know the results.


[ 1.054093] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[ 1.069846] pci 0000:00:1c.0: ASPM: current common clock configuration is broken, reconfiguring
[ 1.079851] acpi PNP0A03:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[ 1.082397] acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[ 1.085961] acpi PNP0A03:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
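
In case it helps anyone reading along: a rough Python sketch that picks out only the lines in output like the above where a specific PCI device reports an ASPM message, skipping the generic _OSC capability lines. The regular expression is an assumption based on the line format shown here, and the names are invented for this example. Pipe dmesg into it, e.g. "dmesg | python3 aspm_lines.py" (the file name is arbitrary).

```python
import re
import sys

# Matches e.g. "[ 1.069846] pci 0000:00:1c.0: ASPM: <message>"
ASPM_LINE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+\S+\s+"
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"ASPM:\s+(?P<message>.+)$"
)


def aspm_device_messages(dmesg_text):
    """Return (pci_address, message) pairs for per-device ASPM log lines."""
    found = []
    for line in dmesg_text.splitlines():
        match = ASPM_LINE.match(line.strip())
        if match:
            found.append((match.group("device"), match.group("message")))
    return found


if __name__ == "__main__":
    for device, message in aspm_device_messages(sys.stdin.read()):
        print(f"{device}: {message}")
```

The address it prints can then be looked up with lspci -s to see which device or port the message refers to.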
 

andybaran

Wahoo! It stayed up overnight. I did not end up uninstalling X but will likely do so at a later date. I'm going to give it a few more days and report back in case others find this useful in the future. Thank you all for being so helpful. I really haven't had to dig into hardware this deeply for something like 20 years.
 

MBastian

You can just disable sddm or whatever your display manager is. No need to uninstall X, with all the repercussions that might have. Some packages have rather extensive dependencies and could be uninstalled too.
Unless something is really borked in your BIOS, ASPM should've been disabled. Keep watching; if it happens again, maybe you'll have some meaningful console output to work with.
 

andybaran

Awesome, thanks. Yeah, it was disabled, but setting it to Auto seems to have fixed it. I'll keep an eye out though. If it's the Mellanox card trying to go into low power when there's no network traffic in the middle of the night, it would at least make some sense.
 

andybaran

A long overdue follow-up on this. It's still working flawlessly... this is a dangerous thing to say on a Friday.