I just bought some Micron NVMe SSDs for my Supermicro server motherboard. The server motherboard sits in a full ATX case with decent airflow.
I haven't even put a filesystem on the drives yet, so there shouldn't be any disk I/O happening.
On power-up, I boot ESXi. Within 10 minutes of ESXi booting, one of the SSDs fails out. I pop over to the BMC web GUI and it's filled with errors like:
Code:
Temperature [IPMI-1011] M2NVMeSSD Temp2, Upper Non-recoverable - going high - Assertion
Temperature [IPMI-1011] M2NVMeSSD Temp1, Upper Non-recoverable - going high - Assertion
Temperature [IPMI-1009] M2NVMeSSD Temp2, Upper Critical - going high - Assertion
Temperature [IPMI-1009] M2NVMeSSD Temp1, Upper Critical - going high - Assertion
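To make it easier to see which sensor tripped which threshold, here is a minimal Python sketch that tallies the entries above. It assumes every line follows exactly the format pasted here; the regex and the `summarize_events` helper are my own illustration, not part of any Supermicro or IPMI tool.

```python
import re
from collections import defaultdict

# Assumed line format, copied from the BMC entries pasted above:
# "<type> [IPMI-<n>] <sensor>, <threshold> - going <dir> - <Assertion|Deassertion>"
EVENT_RE = re.compile(
    r"^(?P<kind>\w+) \[(?P<code>IPMI-\d+)\] (?P<sensor>[^,]+), "
    r"(?P<threshold>.+?) - going (?P<direction>high|low) - "
    r"(?P<state>Assertion|Deassertion)$"
)


def summarize_events(lines):
    """Group asserted thresholds by sensor name (hypothetical helper)."""
    by_sensor = defaultdict(list)
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not fit the assumed format
        if match["state"] == "Assertion":
            by_sensor[match["sensor"]].append(match["threshold"])
    return dict(by_sensor)


LOG = """\
Temperature [IPMI-1011] M2NVMeSSD Temp2, Upper Non-recoverable - going high - Assertion
Temperature [IPMI-1011] M2NVMeSSD Temp1, Upper Non-recoverable - going high - Assertion
Temperature [IPMI-1009] M2NVMeSSD Temp2, Upper Critical - going high - Assertion
Temperature [IPMI-1009] M2NVMeSSD Temp1, Upper Critical - going high - Assertion
"""

summary = summarize_events(LOG.splitlines())
for sensor, thresholds in sorted(summary.items()):
    print(sensor, "->", ", ".join(thresholds))
```

Both sensors show the Upper Critical and Upper Non-recoverable thresholds asserted, so whatever is happening affects both M.2 temperature sensors, not just one.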
I didn't have this issue when I used my previous consumer Samsung NVMe SSDs in the exact same slots for years.
Is there something about enterprise NVMe SSDs that makes them less tolerant of high temperatures?
Is the BMC the thing that is shutting down the port because of some threshold value? I'm very shocked to see this and don't know what to make of it.
My theory is that the consumer SSDs and the new enterprise SSDs run at the same temperatures, BUT the enterprise SSDs support reporting their temperature to the BMC, which makes it freak out and shut down the port. Could that be possible, or is it more likely that the SSD controller is shutting itself down because of the heat?
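To make the theory concrete, here is a rough Python sketch of how I picture a threshold-based sensor behaving. The three upper threshold names are the standard IPMI ones, but the numeric limits and the `asserted_events` function are made up for illustration; the real limits live in the BMC's sensor records and I have not looked them up.

```python
from typing import List, Optional

# Hypothetical upper thresholds in degrees Celsius. The real values are
# defined by the BMC firmware and will differ per board.
UPPER_THRESHOLDS = [
    ("Upper Non-critical", 70.0),
    ("Upper Critical", 75.0),
    ("Upper Non-recoverable", 80.0),
]


def asserted_events(reading_c: Optional[float]) -> List[str]:
    """Return the upper thresholds a reading has reached or crossed.

    None models a drive that reports no temperature to the BMC at all.
    """
    if reading_c is None:
        return []  # no reading, so the BMC has nothing to compare
    return [name for name, limit in UPPER_THRESHOLDS if reading_c >= limit]


# Same physical temperature, two different outcomes:
print(asserted_events(None))   # drive that does not report to the BMC
print(asserted_events(82.0))   # drive that does report to the BMC
```

Under this model, two drives at the same temperature would produce very different logs depending on whether the BMC can read them. It says nothing about who actually takes the drive offline, which is the part I can't figure out.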