Samsung drives newer than the 970 (the last to use a Samsung nvme driver) are not reliable with Windows Server.
Over the past several months I've been debugging issues with both Intel and AMD servers that reset to the bios and the SSD is not detected until a power off/on cycle. I was really reluctant to suspect that the entire product line of SSDs was buggy since this is a very popular drive but I'm certain now and it isn't specific to a particular firmware version.
On 4 different systems using Asus, AsRock Rack, or Supermicro motherboards a high load system would crash every couple weeks or so without ever writing a minidump. I suspected defective drives and swapped a 1TB drive for a 2TB one, or swapped the 980 pro for a 990 but the behavior persisted. Meanwhile several systems with 970 pros and the same workload run stable for months.
After I swapped my X13SAE-F motherboard for an Asus W680-ACE (thinking the X13SAE was at fault) I tried restoring a sql server database from backup and observed this was 100% effective at crashing the SSD controller, causing it to disappear until a power cycle. I checked every related bios setting and all the sql server fixes regarding drive sector size to no avail.
Some of the servers that crash are running MySQL instead of Sql Server and the only common denominator is high IO load, Windows Server 2016 or later, and Samsung 980 or later. With so many bug reports related to Samsung's firmware issues it's hard to find corroborating bug reports so I thought I'd share this here.
Over the past several months I've been debugging issues with both Intel and AMD servers that reset to the bios and the SSD is not detected until a power off/on cycle. I was really reluctant to suspect that the entire product line of SSDs was buggy since this is a very popular drive but I'm certain now and it isn't specific to a particular firmware version.
On 4 different systems using Asus, AsRock Rack, or Supermicro motherboards a high load system would crash every couple weeks or so without ever writing a minidump. I suspected defective drives and swapped a 1TB drive for a 2TB one, or swapped the 980 pro for a 990 but the behavior persisted. Meanwhile several systems with 970 pros and the same workload run stable for months.
After I swapped my X13SAE-F motherboard for an Asus W680-ACE (thinking the X13SAE was at fault) I tried restoring a sql server database from backup and observed this was 100% effective at crashing the SSD controller, causing it to disappear until a power cycle. I checked every related bios setting and all the sql server fixes regarding drive sector size to no avail.
Some of the servers that crash are running MySQL instead of Sql Server and the only common denominator is high IO load, Windows Server 2016 or later, and Samsung 980 or later. With so many bug reports related to Samsung's firmware issues it's hard to find corroborating bug reports so I thought I'd share this here.