Looking for advice on a persistent NVMe failure issue I've been chasing for over a month. Three different drives have exhibited the same failure
pattern, so I'm confident it's the board/slot, not the drives.
Hardware:
- Board: Generic Intel N100 NAS board (MW-N100-NAS branding, AMI BIOS 2.22.1287, DMI data all "Default string")
- CPU: Intel N100 (Alder Lake-N)
- RAM: 32GB DDR5
- Case: JONSBO N2
- NVMe: Crucial CT1000P5SSD8 1TB Gen3 (firmware P4CR324) — previously 2x Samsung 990 EVO Plus 2TB which both failed with the same symptoms
- HDDs: 5x 4TB in ZFS RAIDZ2
- OS: Proxmox VE 9.1.5, kernel 6.17.9-1-pve
PCIe topology (all slots are PCH-routed):
| Root Port | Width | Speed | Device |
|-----------|-------|-------|--------|
| 00:1c.0 | x1 | 8GT/s | NVMe (Crucial P5) ← the problem slot |
| 00:1c.3 | x1 | 5GT/s | Intel I226-V 2.5GbE |
| 00:1c.6 | x1 | 5GT/s | Intel I226-V 2.5GbE |
| 00:1d.0 | x2 | 8GT/s | Aquantia AQC113C 10GbE |
| 00:1d.2 | x1 | 8GT/s | JMicron SATA controller |
The NVMe is a x4-capable drive running in a x1 slot. There is no x4 M.2 slot on this board.
The problem:
The NVMe controller dies every ~12 hours with `CSTS=0xffffffff` and `PCI_STATUS=0xffff`, followed by `Unable to change power state from D3cold to D0` in dmesg. The ZFS pool suspends and a reboot is required to recover. This has happened with all three drives (the 2x Samsung in a mirror pool and retried as a single drive, and now the 1x Crucial).
What I've tried (with guidance from Claude, since I'm new and still learning this stuff):
- Kernel params: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off processor.max_cstate=1 intel_idle.max_cstate=1
- BIOS: ASPM disabled on all root ports, L1 substates disabled, clock/power gating disabled, CPU C6DRAM disabled, SATA aggressive LPM disabled, DMI link ASPM disabled
- Systemd service at boot: d3cold_allowed=0 on the NVMe device, all PCIe root ports, and critical PCH devices
- Udev rule disabling d3cold on all PCIe bridges (matched by device class); power/control=on forced on the NVMe and its root port
- Removed broken Thunderbolt controller (00:0d.0, stuck in D3cold error at every boot) from PCIe bus at boot
- NVMe keepalive timer (4KB read every 2 minutes)
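For reference, the keepalive is just a tiny periodic read driven by a systemd timer; it looks roughly like this (the unit names and the /dev/nvme0n1 path are my setup's, adjust for yours):

```ini
# /etc/systemd/system/nvme-keepalive.service
[Unit]
Description=Small periodic NVMe read to keep the controller active

[Service]
Type=oneshot
ExecStart=/usr/bin/dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1 iflag=direct

# /etc/systemd/system/nvme-keepalive.timer
[Unit]
Description=Run nvme-keepalive every 2 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=2min

[Install]
WantedBy=timers.target
```

iflag=direct bypasses the page cache so the read actually hits the drive instead of being served from RAM.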
I wrote a script logging PCIe link status every 30 seconds. It captured the actual failure:
```
07:45:19 link=3013 (Gen3/8GT x1) pool=ONLINE d3cold=0 runtime=active
07:45:49 link=1011 (Gen1/2.5GT x1) pool=ONLINE d3cold=0 runtime=active
07:46:19 link=1011 (Gen1/2.5GT x1) pool=SUSPENDED d3cold=0 runtime=active
```
The link spontaneously retrained from Gen3 to Gen1, and 30 seconds later the controller died. All power management fixes were holding perfectly —
d3cold=0, runtime=active, power/control=on. The D3cold error in dmesg seems to be a consequence of the link failure, not the cause.
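For anyone following along: the raw link= values in the log are PCIe Link Status (LNKSTA) register reads, e.g. `setpci -s 00:1c.0 CAP_EXP+0x12.w`. A minimal sketch of how they decode (bit layout per the PCIe spec: bits 3:0 = current link speed, bits 9:4 = negotiated width):

```shell
#!/bin/sh
# Decode a hex LNKSTA value into link width and speed.
# Speed encoding: 1 = 2.5GT/s (Gen1), 2 = 5GT/s (Gen2), 3 = 8GT/s (Gen3).
decode_lnksta() {
    val=$((0x$1))
    speed=$((val & 0xF))
    width=$(((val >> 4) & 0x3F))
    case $speed in
        1) gen="Gen1/2.5GT/s" ;;
        2) gen="Gen2/5GT/s" ;;
        3) gen="Gen3/8GT/s" ;;
        *) gen="unknown" ;;
    esac
    echo "x${width} ${gen}"
}

decode_lnksta 3013   # healthy sample from the log -> x1 Gen3/8GT/s
decode_lnksta 1011   # post-retrain sample -> x1 Gen1/2.5GT/s
```

So the width never changed, only the speed collapsed to Gen1, which is what pointed me at signal integrity rather than lane dropout.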
My current theory:
The x1 PCIe lane on this board seems to have marginal signal integrity at Gen3 (8GT/s). Something triggers a retrain down to the lowest speed, and the NVMe controller never recovers from it. My next step is forcing the link to Gen2 (5GT/s) via setpci, trading bandwidth for signal margin while still getting ~500 MB/s raw (roughly 400 MB/s usable after 8b/10b encoding).
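In case it helps anyone searching later, the plan is roughly the following sketch. The 00:1c.0 address comes from the topology table above; SETPCI defaults to a dry-run echo here, so override it with the real setpci and run as root on the actual box:

```shell
#!/bin/sh
# Sketch: cap the NVMe root port's target link speed at Gen2, then retrain.
# Offsets per the PCIe spec: LNKCTL2 = CAP_EXP+0x30 (bits 3:0 = target speed),
# LNKCTL = CAP_EXP+0x10 (bit 5 = Retrain Link), LNKSTA = CAP_EXP+0x12.
# setpci's value:mask syntax does a read-modify-write of just the masked bits.
SETPCI="${SETPCI:-echo setpci}"   # dry run by default; SETPCI=setpci for real
PORT=00:1c.0                      # NVMe root port from the topology table

$SETPCI -s "$PORT" CAP_EXP+0x30.w=0x0002:0x000f   # target link speed = Gen2
$SETPCI -s "$PORT" CAP_EXP+0x10.w=0x0020:0x0020   # set the Retrain Link bit
$SETPCI -s "$PORT" CAP_EXP+0x12.w                 # read LNKSTA back to verify
```

This targets the root port rather than the drive because the downstream port is the side that owns the Retrain Link bit.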
Questions:
- Has anyone seen similar link retraining issues on these Aliexpress Intel N100 NAS boards?
- Is forcing Gen2 via setpci on the root port a reliable long-term fix, or just a bandaid?
- Any other ideas for what could cause periodic link retraining on a x1 lane? The ~12h interval doesn't correlate with temperature; I've monitored thermals and they're stable right up to each failure.
- Or is replacing the motherboard simply the best next step?
Thanks for any insight.