Minisforum MS-A2

VivienM

Member
Jul 7, 2024
55
10
8
Toronto, ON
I'm running Proxmox 9.2.2 with 7.0 kernel. Works great. Not doing anything stressful though.
Okay, that's encouraging to hear. I've been chasing a crazy problem for months now - mine will spontaneously reboot after a few hours running any kernel newer than 6.14.8, so 6.14.11 or any 6.17/7.0 I've tried. I've posted on the proxmox forums with not much luck. Rock solid on 6.14.8.

So... I hope you'll forgive me asking some questions - what SSD are you running? how much RAM? did you do any BIOS tweaks?
 

Brian Stretch

New Member
Jan 26, 2017
25
2
3
Crucial T500 2TB SSD, 128GB Crucial DDR5-5600 RAM, assorted BIOS tweaks to minimize power consumption and turn off LED lights. BIOS is 1.02. They pulled 1.03 which is believed to have issues and attempting to downgrade from 1.03 to 1.02 may brick your system. I did clean install Proxmox 9.1 and upgrade from there. I see that there's a new 9.2 .iso, maybe grab that and try clean installing?
 

VivienM

Okay, so we have the same RAM, different SSDs (mine is a Samsung 990 Pro). Also running BIOS 1.02, they pulled 1.03 before I had a chance to think about upgrading.

Do you remember what those assorted BIOS tweaks were? I mostly haven't done any, although I did try a few of the documented ones.

Clean installing sounds like such a... Windows... way of dealing with things, but... that does give me an idea, I could always boot the 9.2 ISO, leave it sitting there for a few hours, and see if it spontaneously reboots. If it doesn't then think about reinstalling.
 

Brian Stretch

Sounds like a plan. There was a guide linked to earlier in this thread for low-power operation... but on BIOS 1.02 there's a setting for 45W operation that probably makes all that redundant. Your 990 Pro runs hotter than my T500 and should have a heatsink but I doubt that's the problem.
 

VivienM

I will look for that guide and... I don't think I noticed that 45W setting. Will look again.

The SSD running hotter could be part of the issue. Maybe the newer kernels enable some PCIe feature or other that makes it run even hotter?
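One way to check the temperature theory would be to log what the drive reports under each kernel and compare. Below is a minimal Python sketch, assuming the kernel exposes the NVMe drive through the hwmon sysfs interface (a device whose `name` file reads `nvme`, with `temp1_input` holding the composite temperature in millidegrees Celsius). The function name `read_nvme_temps` is made up for illustration, and whether this tells you anything about the reboots is untested.

```python
from pathlib import Path


def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir_name: temperature_in_C} for hwmon devices named 'nvme'.

    Assumes the standard hwmon sysfs layout: each hwmonN directory has a
    'name' file and a 'temp1_input' file in millidegrees Celsius.
    """
    temps = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        temp_file = dev / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue  # not a temperature-reporting device
        if name_file.read_text().strip() != "nvme":
            continue  # skip CPU, NIC and other sensors
        temps[dev.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps
```

Calling `print(read_nvme_temps())` once a minute from a cron job or a loop on 6.14.8, and again on a newer kernel, would show whether the drive really runs hotter there.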
 

maars

New Member
Jan 17, 2023
2
1
3
I'm running the same setup as you with Samsung 990 Pros, except that I've got a 3-node Proxmox cluster in my basement rack. The basement temperature is always around 20-21°C. All the 990 Pros run on average as shown in the image (Zabbix monitoring).
[Attachment: 1779897412611.png]


I had a similar issue at the beginning of this year with one of my nodes. The reboots, or rather system lockups, mostly happened during heavy load, such as backups to a NAS.
It took me a while to figure out that, in my case, the cause was a faulty Samsung 990 Pro. Heavy read/write activity caused the NVMe SSD to lock up, which then froze the whole system. It was never related to the NVMe temperature.
I replaced the faulty 990 Pro with a new one, and since then everything has been rock solid. I never had this issue with my other two nodes.
Of course, your reboots may have a completely different root cause, but after reading your comments, I thought I would share my experience.
 

VivienM

How did you figure out the SSD was the issue?

I've certainly considered the SSD. Updated the firmware too which did not help...
 

zzjin

New Member
Apr 10, 2026
6
0
1
@VivienM I have exactly the same issue as you, which is why I have now installed PVE 9.1 and am staying on 6.14.8.
Codex says it may be a crash caused by an AMD microcode update that came with the kernel update.
A few days ago I returned my machine to the factory; they tested it and said everything was fine. They just installed a fresh Windows 11, which works fine even with the CPU at full boost and the SSD under load. So I assume SSD temperature is not the cause.
I'll keep watching this thread.
 

VivienM

What SSD/RAM do you have?
 

VivienM

Power supply?
If that's aimed at me, then... maybe, but I guess my question would be, why is my power supply fine for 6.14.8 and older, but can't handle anything newer?

Also, I would note one other thing - I've noticed the spontaneous reboots even without running any VMs so the load on the machine should be very low...
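To pin down when the reboots happen and how idle the box was just before, a small heartbeat logger could help. This is only a sketch under assumptions: the log path `/var/log/heartbeat.log`, the 60-second interval, and the helper names are all made up for illustration. It writes a timestamp and the 1-minute load average, syncs each line to disk so it survives a sudden reset, and a second helper scans the log for gaps.

```python
import os
import time
from pathlib import Path


def heartbeat_line(now: float, load1: float) -> str:
    """Format one log line: Unix timestamp and 1-minute load average."""
    return f"{int(now)} {load1:.2f}"


def find_gaps(lines, max_gap_s=120):
    """Return (last_seen, next_seen, load_before) for each gap over max_gap_s."""
    gaps = []
    prev_ts, prev_load = None, None
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # e.g. a line truncated by the crash
        try:
            ts, load = int(parts[0]), float(parts[1])
        except ValueError:
            continue
        if prev_ts is not None and ts - prev_ts > max_gap_s:
            gaps.append((prev_ts, ts, prev_load))
        prev_ts, prev_load = ts, load
    return gaps


def run(log_path="/var/log/heartbeat.log", interval_s=60):
    """Append a heartbeat forever; hypothetical path and interval."""
    path = Path(log_path)
    while True:
        with path.open("a") as f:
            f.write(heartbeat_line(time.time(), os.getloadavg()[0]) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the line to disk before a possible reset
        time.sleep(interval_s)
```

Start `run()` in the background, and after a reboot pass the log lines to `find_gaps()`. A gap only shows that the logger stopped for a while; it cannot say why, so it is a starting point for correlating with other logs rather than a diagnosis.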
 

TrevorH

Active Member
Oct 25, 2024
208
90
28
More directed at zzjin who has SSD problems.

I had an old Cubox arm machine at one point and it had an SSD hung off a USB port and I retired it when the 3rd SSD in a few weeks decided to crap out. I am fairly sure that in my case the USB voltage was fluctuating and causing the problem. I retired the box and left all 3 SSDs on the shelf and came back to them about 3 years later and they had all recovered and were all completely blank. They're all still working now.
 

zzjin

Sounds weird. My three SSDs all work fine on another machine (Intel i9 10850), so for me it's not an SSD problem.