I've had my MS-01 for about a month. It runs Proxmox, mainly hosting OPNsense, on two Samsung SSD 990 PRO 2TB drives in a ZFS mirror. The first thing I had to do was turn off Turbo mode in the BIOS; otherwise the CPU temps climbed too high, above 70C. (At some point I will try replacing the thermal paste.)
I periodically run `sensors` to keep track of the running temps, and disabling Turbo has been a good solution.
Yesterday I was wondering why I never got any email from my Proxmox box, and found that it was being delivered but classified as spam.
One message of particular interest was the sole indication that one of the mirror devices in the ZFS root pool had gone bad, about 18 days earlier.
I had rebooted the MS-01 many times and never noticed. One NVMe device had disappeared entirely.
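A degraded mirror is easy to miss when the only alert is an email that ends up in spam. Here is a minimal sketch of a check that could run from cron, assuming `zpool status -x` prints `all pools are healthy` when nothing is wrong. The function names are my own, not part of any tool:

```python
import subprocess


def pools_healthy(status_output: str) -> bool:
    """Return True only if the `zpool status -x` output reports no problems."""
    return status_output.strip() == "all pools are healthy"


def check_pools() -> bool:
    """Run `zpool status -x` and report whether every pool looks healthy."""
    result = subprocess.run(
        ["zpool", "status", "-x"], capture_output=True, text=True, check=False
    )
    # Treat a failed command the same as an unhealthy pool.
    return result.returncode == 0 and pools_healthy(result.stdout)
```

If `check_pools()` returns False, the script could print a warning or push a notification through some channel other than email, so a spam filter can't hide it.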
I migrated all my VMs off, powered down, removed one SSD, booted, and came up on the system from May 6.
`smartctl` indicated zero issues, zero errors. (I need to remember the SSD-specific diag utils.)
I put the other SSD back, rebooted, and both /dev/nvme devices were there; `zpool` had already finished resilvering 800GB by the time I looked.
While I was doing this, I had the case off and a portable fan directed at the SSD fan. Temperatures were OK.
Here is an excerpt of the log during reboot(?) when I had two /dev/nvme:
Code:
May 06 21:28:07 pve sensors[1228]: nvme-pci-5800
May 06 21:28:07 pve sensors[1228]: Adapter: PCI adapter
May 06 21:28:07 pve sensors[1228]: Composite: +47.9°C (low = -273.1°C, high = +81.8°C)
May 06 21:28:07 pve sensors[1228]: (crit = +84.8°C)
May 06 21:28:07 pve sensors[1228]: Sensor 1: +47.9°C (low = -273.1°C, high = +65261.8°C)
May 06 21:28:07 pve sensors[1228]: Sensor 2: +62.9°C (low = -273.1°C, high = +65261.8°C)
May 06 21:28:07 pve sensors[1228]: acpitz-acpi-0
May 06 21:28:07 pve sensors[1228]: Adapter: ACPI interface
May 06 21:28:07 pve sensors[1228]: temp1: +27.8°C
May 06 21:28:07 pve sensors[1228]: coretemp-isa-0000
May 06 21:28:07 pve sensors[1228]: Adapter: ISA adapter
May 06 21:28:07 pve sensors[1228]: Package id 0: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 0: +43.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 4: +44.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 8: +40.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 12: +43.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 16: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 20: +43.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 24: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 25: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 26: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 27: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 28: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 29: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 30: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: Core 31: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 06 21:28:07 pve sensors[1228]: nvme-pci-5900
May 06 21:28:07 pve sensors[1228]: Adapter: PCI adapter
May 06 21:28:07 pve sensors[1228]: Composite: +49.9°C (low = -273.1°C, high = +81.8°C)
May 06 21:28:07 pve sensors[1228]: (crit = +84.8°C)
May 06 21:28:07 pve sensors[1228]: Sensor 1: +49.9°C (low = -273.1°C, high = +65261.8°C)
May 06 21:28:07 pve sensors[1228]: Sensor 2: +59.9°C (low = -273.1°C, high = +65261.8°C)
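To compare drive temperatures across boots without eyeballing the log, something like the following could pull the Composite reading for each NVMe adapter out of journal lines in the format above. This is only a sketch; the regexes are written against the lines shown here and may need adjusting for other `sensors` output:

```python
import re

# Matches an adapter header such as "nvme-pci-5800" at the end of a journal line.
ADAPTER_RE = re.compile(r"sensors\[\d+\]:\s+(\S+-(?:pci|acpi|isa)-\S+)\s*$")
# Matches the first temperature on a "Composite:" line.
COMPOSITE_RE = re.compile(r"Composite:\s+\+?(-?\d+(?:\.\d+)?)°C")


def composite_temps(lines):
    """Map each NVMe adapter name to its Composite temperature in °C."""
    temps = {}
    adapter = None
    for line in lines:
        header = ADAPTER_RE.search(line)
        if header:
            # Remember which adapter the following readings belong to.
            adapter = header.group(1)
            continue
        reading = COMPOSITE_RE.search(line)
        if reading and adapter and adapter.startswith("nvme"):
            temps[adapter] = float(reading.group(1))
    return temps
```

Feeding it one boot's worth of lines (for example from `journalctl -t sensors`) gives one reading per adapter, and an adapter missing from the result is a hint that the drive was not visible at that point.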
Here is the error:
Code:
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 464 (b1d0) opcode 0x2 (I/O Cmd) QID 4 timeout, aborting req_op:READ(0) size:12288
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 111 (806f) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 265 (3109) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 464 (c1d0) opcode 0x2 (I/O Cmd) QID 4 timeout, aborting req_op:READ(0) size:12288
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 111 (906f) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 265 (4109) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 464 (d1d0) opcode 0x2 (I/O Cmd) QID 4 timeout, aborting req_op:READ(0) size:12288
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 111 (a06f) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve systemd[1]: apt-daily.service: Deactivated successfully.
May 09 19:27:08 pve systemd[1]: Finished apt-daily.service - Daily apt download activities.
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 265 (5109) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 464 (e1d0) opcode 0x2 (I/O Cmd) QID 4 timeout, aborting req_op:READ(0) size:12288
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 111 (b06f) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 265 (6109) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 464 (f1d0) opcode 0x2 (I/O Cmd) QID 4 timeout, aborting req_op:READ(0) size:12288
May 09 19:27:08 pve kernel: nvme nvme0: I/O tag 111 (c06f) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:4096
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: nvme nvme0: Abort status: 0x0
May 09 19:27:08 pve kernel: INFO: task z_wr_iss_h:475 blocked for more than 122 seconds.
May 09 19:27:08 pve kernel: Tainted: P O 6.8.4-2-pve #1
May 09 19:27:08 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 09 19:27:08 pve kernel: task:z_wr_iss_h state:D stack:0 pid:475 tgid:475 ppid:2 flags:0x00004000
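Since nothing obvious flagged these timeouts when they happened, a small scanner over the kernel log could at least count them per controller. A rough sketch, with a regex written against the message format in the excerpt above (other kernel versions may word it differently):

```python
import re
from collections import Counter

# Matches kernel lines like:
#   nvme nvme0: I/O tag 464 (b1d0) opcode 0x2 (I/O Cmd) QID 4 timeout, aborting ...
TIMEOUT_RE = re.compile(r"kernel: nvme (nvme\d+): I/O tag \d+ .*\btimeout\b")


def count_nvme_timeouts(lines):
    """Count I/O timeout messages per NVMe controller name."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Any non-zero count for a controller is probably worth an alert, since in my case these messages showed up alongside the hung ZFS task in the same excerpt.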
On May 08 I still had two `/dev/nvme`:
Code:
May 08 02:56:49 pve kernel: nvme1n1: p1 p2 p3
May 08 02:56:49 pve kernel: nvme0n1: p1 p2 p3
...
May 08 02:56:51 pve systemd[1]: Started dbus.service - D-Bus System Message Bus.
May 08 02:56:51 pve sensors[1239]: nvme-pci-5800
May 08 02:56:51 pve sensors[1239]: Adapter: PCI adapter
May 08 02:56:51 pve sensors[1239]: Composite: +49.9°C (low = -273.1°C, high = +81.8°C)
May 08 02:56:51 pve sensors[1239]: (crit = +84.8°C)
May 08 02:56:51 pve sensors[1239]: Sensor 1: +49.9°C (low = -273.1°C, high = +65261.8°C)
May 08 02:56:51 pve sensors[1239]: Sensor 2: +63.9°C (low = -273.1°C, high = +65261.8°C)
May 08 02:56:51 pve sensors[1239]: acpitz-acpi-0
May 08 02:56:51 pve sensors[1239]: Adapter: ACPI interface
May 08 02:56:51 pve sensors[1239]: temp1: +27.8°C
May 08 02:56:51 pve sensors[1239]: coretemp-isa-0000
May 08 02:56:51 pve sensors[1239]: Adapter: ISA adapter
May 08 02:56:51 pve sensors[1239]: Package id 0: +50.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 0: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 4: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 8: +43.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 12: +44.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 16: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 20: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 24: +50.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 25: +49.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 26: +49.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 27: +49.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 28: +47.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 29: +47.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 30: +47.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: Core 31: +47.0°C (high = +100.0°C, crit = +100.0°C)
May 08 02:56:51 pve sensors[1239]: nvme-pci-5900
May 08 02:56:51 pve sensors[1239]: Adapter: PCI adapter
May 08 02:56:51 pve sensors[1239]: Composite: +51.9°C (low = -273.1°C, high = +81.8°C)
May 08 02:56:51 pve sensors[1239]: (crit = +84.8°C)
May 08 02:56:51 pve sensors[1239]: Sensor 1: +51.9°C (low = -273.1°C, high = +65261.8°C)
May 08 02:56:51 pve sensors[1239]: Sensor 2: +61.9°C (low = -273.1°C, high = +65261.8°C)
May 08 02:56:51 pve smartd[1222]: smartd 7.3 2022-02-28 r5338 [x86_64-linux-6.8.4-2-pve] (local build)
but on May 14, there was only one:
Code:
May 14 09:36:51 pve kernel: nvme 0000:59:00.0: platform quirk: setting simple suspend
May 14 09:36:51 pve kernel: nvme 0000:58:00.0: platform quirk: setting simple suspend
May 14 09:36:51 pve kernel: i40e 0000:02:00.0: fw 9.120.73026 api 1.15 nvm 9.20 0x8000d8c5 0.0.0 [8086:1572] [8086:0000]
May 14 09:36:51 pve kernel: nvme nvme0: pci function 0000:59:00.0
May 14 09:36:51 pve kernel: nvme 0000:59:00.0: enabling device (0000 -> 0002)
May 14 09:36:51 pve kernel: nvme nvme1: pci function 0000:58:00.0
May 14 09:36:51 pve kernel: usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.08
May 14 09:36:51 pve kernel: usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
May 14 09:36:51 pve kernel: usb usb2: Product: xHCI Host Controller
May 14 09:36:51 pve kernel: usb usb2: Manufacturer: Linux 6.8.4-2-pve xhci-hcd
May 14 09:36:51 pve kernel: usb usb2: SerialNumber: 0000:00:14.0
May 14 09:36:51 pve kernel: hub 2-0:1.0: USB hub found
May 14 09:36:51 pve kernel: hub 2-0:1.0: 4 ports detected
May 14 09:36:51 pve kernel: nvme nvme1: Shutdown timeout set to 10 seconds
May 14 09:36:51 pve kernel: nvme nvme1: 16/0/0 default/read/poll queues
May 14 09:36:51 pve kernel: nvme1n1: p1 p2 p3
...
May 14 09:36:53 pve smartd[1258]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
May 14 09:36:53 pve systemd[1]: Started dbus.service - D-Bus System Message Bus.
May 14 09:36:53 pve sensors[1273]: coretemp-isa-0000
May 14 09:36:53 pve sensors[1273]: Adapter: ISA adapter
May 14 09:36:53 pve sensors[1273]: Package id 0: +49.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 0: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 4: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 8: +44.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 12: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 16: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 20: +46.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 24: +48.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 25: +48.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 26: +48.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 27: +48.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 28: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 29: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 30: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: Core 31: +45.0°C (high = +100.0°C, crit = +100.0°C)
May 14 09:36:53 pve sensors[1273]: acpitz-acpi-0
May 14 09:36:53 pve sensors[1273]: Adapter: ACPI interface
May 14 09:36:53 pve sensors[1273]: temp1: +27.8°C
May 14 09:36:53 pve sensors[1273]: nvme-pci-5800
May 14 09:36:53 pve sensors[1273]: Adapter: PCI adapter
May 14 09:36:53 pve sensors[1273]: Composite: +56.9°C (low = -273.1°C, high = +81.8°C)
May 14 09:36:53 pve sensors[1273]: (crit = +84.8°C)
May 14 09:36:53 pve sensors[1273]: Sensor 1: +56.9°C (low = -273.1°C, high = +65261.8°C)
May 14 09:36:53 pve sensors[1273]: Sensor 2: +68.8°C (low = -273.1°C, high = +65261.8°C)
May 14 09:36:53 pve smartd[1258]: Device: /dev/nvme1, opened
May 14 09:36:53 pve systemd[1]: Started ksmtuned.service - Kernel Samepage Merging (KSM) Tuning Daemon.
May 14 09:36:53 pve lxcfs[1274]: Starting LXCFS at /usr/bin/lxcfs
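In the May 14 excerpt both controllers get a "pci function" line, but only `nvme1n1` reports partitions. A sketch of how that pattern could be spotted automatically is below. Keep in mind my excerpt is trimmed with "...", so this only reflects the lines it is given, and the function name is made up for illustration:

```python
import re

# "nvme nvme0: pci function 0000:59:00.0" -> controller was probed at that PCI address.
PROBE_RE = re.compile(r"kernel: nvme (nvme\d+): pci function (\S+)")
# "nvme1n1: p1 p2 p3" -> the kernel read a partition table from that namespace.
PARTS_RE = re.compile(r"kernel:\s+(nvme\d+)n\d+:")


def controllers_without_partitions(lines):
    """Return {controller: pci_address} for probed controllers with no partition line."""
    probed = {}
    seen = set()
    for line in lines:
        probe = PROBE_RE.search(line)
        if probe:
            probed[probe.group(1)] = probe.group(2)
            continue
        parts = PARTS_RE.search(line)
        if parts:
            seen.add(parts.group(1))
    return {name: addr for name, addr in probed.items() if name not in seen}
```

Run against the May 14 lines, this points at `nvme0` on `0000:59:00.0`, which lines up with `sensors` only listing `nvme-pci-5800` on that boot.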
I do not know if my issues were due to thermals being exceeded. `smartctl` shows nothing much of use.
Nevertheless, if anyone comes up with a half-decent cooling option for these, other than my current "stand the case on its side and blow air on it" approach, please let me know.