Avoid Samsung 980 and 990 with Windows Server


yeryer

New Member
Sep 24, 2022
23
6
3
Samsung drives newer than the 970 (the last to use a Samsung NVMe driver) are not reliable with Windows Server.

Over the past several months I've been debugging issues with both Intel and AMD servers that reset to the BIOS, after which the SSD is not detected until a power off/on cycle. I was really reluctant to suspect that the entire product line of SSDs was buggy since this is a very popular drive, but I'm certain now, and it isn't specific to a particular firmware version.

On 4 different systems using Asus, ASRock Rack, or Supermicro motherboards, a high-load system would crash every couple of weeks or so without ever writing a minidump. I suspected defective drives and swapped a 1TB drive for a 2TB one, or swapped the 980 Pro for a 990, but the behavior persisted. Meanwhile several systems with 970 Pros and the same workload run stable for months.

After I swapped my X13SAE-F motherboard for an Asus W680-ACE (thinking the X13SAE was at fault) I tried restoring a SQL Server database from backup and observed this was 100% effective at crashing the SSD controller, causing it to disappear until a power cycle. I checked every related BIOS setting and all the SQL Server fixes regarding drive sector size to no avail.

Some of the servers that crash are running MySQL instead of SQL Server, and the only common denominators are high IO load, Windows Server 2016 or later, and Samsung 980 or later. With so many bug reports related to Samsung's firmware issues it's hard to find corroborating reports, so I thought I'd share this here.
 

i386

Well-Known Member
Mar 18, 2016
4,319
1,594
113
35
Germany
Too late :D

I bought a 990 Pro 1TB for a boot drive in a sale and encountered the problem you described multiple times...
In my case there is no load besides the stuff Windows Server does on it. Everything IO related goes to an ioDrive2 SSD.
 

drdepasquale

Member
Dec 1, 2022
85
34
18
I have a Samsung 980 Pro 2 TB with a Supermicro H12DSi-NT6 running Windows. I haven't experienced any issues with reliability or drivers, though I've only been using the drive for two months. Try updating the firmware to 5B2QGXA7 or newer. Samsung Magician and server versions of Windows are iffy at best. Try performing this firmware update by connecting the drives to another machine running the standard desktop/client version of Windows.
 

yeryer

I've been running 5B2QGXA7 and that's still the latest version. I hope your build remains solid but I can confirm there are standard workloads that will crash the SSD with that firmware version as well as several preceding versions. FWIW I haven't yet seen this behavior with Windows 10 or 11, just Server 2016, 2019 and 2022.
 

i386

drdepasquale said:
I have a Samsung 980 Pro 2 TB with a Supermicro H12DSi-NT6 running Windows. I haven't experienced any issues with reliability or drivers, though I've only been using the drive for two months. Try updating the firmware to 5B2QGXA7 or newer. Samsung Magician and server versions of Windows are iffy at best. Try performing this firmware update by connecting the drives to another machine running the standard desktop/client version of Windows.

I used the Samsung live CD/DVDs to update the firmware (they're in the same download section as Samsung Magician, just scroll down to firmware).
 

CyklonDX

Well-Known Member
Nov 8, 2022
940
326
63
So this issue is related to overheating: when it's beyond the thermal threshold it will shut down the sensor, and after some time the controller (and the disk will 'die' until reboot). This is a common issue with NVMe drives.

Put a decent heatsink on it, or get decent airflow over the NVMe. They aren't made for constant r/w, but for bursts.
 

i386

Mine is in a Supermicro 836 chassis with an X10 board and a front-to-back airflow layout. The chassis has 3x 7k rpm midwall fans pushing and 2x 6.7k rpm fans pulling air through an air shroud over the mainboard and SSD :D
 

CyklonDX

My reply was meant for OP.

(Most often fans are not running at 100%, or the air isn't going the right way to cool certain components. I would still recommend a heatsink on them even when you are pushing a lot of air. What I did notice is that some NVMe drives do try to manage temps and slow down to keep under a certain °C, while others just keep going until they crash.)
 

yeryer

CyklonDX said:
So this issue is related to overheating: when it's beyond the thermal threshold it will shut down the sensor, and after some time the controller (and the disk will 'die' until reboot). This is a common issue with NVMe drives.
Put a decent heatsink on it, or get decent airflow over the NVMe. They aren't made for constant r/w, but for bursts.

I believe that would explain many similar SSD crashes but it's not what I observed. Most of the time I had heat sinks, generous case airflow, and 64 degree ambient temps. There was that particular database restore operation that crashed the SSD 7-10 times while I watched the temperatures and fiddled with every setting I could find. That really speaks to this being a logic glitch somewhere between the OS and the drive. That workload crashed the drive after 5-10 minutes and yet it could pass a full drive scan and a 60 minute stress test.

It's more than just brute load and temperature, there's a bug as well.
 

yeryer

I should add that I've installed about 20 of the Samsung 970, 980 or 990 drives and have been a big fan of Samsung until this point. My systems with 970s and two systems with Sabrent Rockets are perfectly stable. Today I'll start testing a Hynix P41 as an alternative until Gen5 availability improves.
 

CyklonDX

Most NVMe drives crash or thermally throttle at 75°C.

(Samsung, if I recall, only has sensors on the controller, but the flash itself also has a thermal limit where it'll just shut down; I think a bit higher, at 85-something °C.)

Here is a visualization of 90 days with different NVMe drives in my home lab setup (it's from 2 different systems; the names are kind of the same, but it shows how they behave).

1678282794319.png


nvme0_1 = PM983a (crashes at 85°C, no thermal throttling; this is the one that hit 81.4°C max, and its sensors can get bugged at 70°C). I'm planning on removing it, as it's going to be too unpredictable during summer time, so atm I use it as a backup disk. It was originally also used as L2ARC cache, but it thermally shut down on many occasions.
nvme0_2 = Hynix P41 (seems to thermally throttle at 65°C @ gen4)
nvme1_1 = Toshiba XD5 (reaches 62.9°C at full load, no thermal throttle at all)
nvme1_2 = Micron 3400 (seems to be stable at gen4 PCIe at 63°C)
nvme2_1 = Micron 3400 (ZFS L2ARC cache, gen3, stable below 57°C)
nvme3_1 = Hynix P41 (ZFS L2ARC cache, gen3, stable below 61°C)

(Disk temps on both systems kind of line up, as temps typically rise when I'm copying new media onto the server, from one to the other, or running backups.)

The heatsinks used are below. Both systems are in the same rack and have roughly the same airflow (high-CFM negative pressure setups).
Note: Running gen4 PCIe NVMe drives on gen3 resulted in much better temps.


m.2 2280 (I also use those at work for servers; probably the best heatsink. I can't find a current link to them anymore, but I have at least 10 spare at work.)
1678282385133.png

m.2 22100 (not that great if you don't have a fan blowing onto it from the top)
1678283094881.png


I had more NVMe drives here, but the stats got truncated due to age (I don't keep more than 90 days).
In the past I used ADATA drives (crash at 70°C, though some people reported they keep going to 85°C), a Samsung 970 Evo (crashes at 75°C), and a WD Black (crashes at 80°C).
(These were used as L2ARC cache for a media center.)
 

yeryer

CyklonDX said:
Hynix P41 (seems to thermally throttle at 65°C @ gen4)

These stats are really useful! I'd much rather see a drive throttle early than risk failing, and I'm glad my replacement drive appears to do so. My most problematic system has a nice M.2 heatsink built into the Asus motherboard, but I ordered a few copper heatsinks for the other servers I'll migrate to P41s.
 

yeryer

I double-checked the temps on my 990 that crashes so often and it really doesn't get very hot under load. If HWiNFO can be trusted I'm seeing just 49 degrees which again points to the crashes being a bug instead of environmental.

1678292794880.png
 

CyklonDX

It depends on how often it's polling the stats, and how quickly it overheats.

Some NVMe controllers do not report temps in real time (the PM983a reports temp with a 15 sec delay).
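
To make the reporting-delay point concrete, here is a minimal Python sketch. The temperature trace is entirely invented (49 and 85 are just numbers borrowed from this thread), and `sampled_peak` is a hypothetical helper, not part of any monitoring tool. It only shows that a monitor getting a fresh reading every 15 seconds can report a lower peak than the sensor actually saw; whether a real drive can heat up that quickly is a separate question this does not answer.

```python
def sampled_peak(trace, report_every):
    """Highest value seen by a monitor that only gets a fresh reading
    every `report_every` seconds from a 1-sample-per-second trace."""
    return max(trace[t] for t in range(0, len(trace), report_every))


# Hypothetical trace: 60 s at 49 C with an 8 s excursion to 85 C starting at t=20.
# These numbers are invented for illustration, not measured on any drive.
trace = [49.0] * 60
for t in range(20, 28):
    trace[t] = 85.0

true_peak = max(trace)               # what the sensor itself experienced
seen_15s = sampled_peak(trace, 15)   # readings land at t = 0, 15, 30, 45
seen_1s = sampled_peak(trace, 1)     # readings every second

print(f"true peak {true_peak}, 15 s reporting sees {seen_15s}, 1 s reporting sees {seen_1s}")
```

With this made-up trace the 15-second view never lands inside the excursion, so its peak stays at the idle value.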
 

drdepasquale

I don't trust consumer grade SSDs in general, and especially not in servers. It's not worth taking the risk to cut costs. Servers should be using enterprise grade drives and should be actively cooled if they run too hot. Solving the thermal issues can prevent many problems.
 

yeryer

CyklonDX said:
It depends on how often it's polling the stats, and how quickly it overheats.
Some NVMe controllers do not report temps in real time (the PM983a reports temp with a 15 sec delay).

It's good to know that the reporting period can be 15 seconds. I stressed the drive today, and 8 hours later I still get a peak temp of 49 in HWiNFO even though load was reporting 100%, so I struggle to imagine a situation where the thermals could spike so quickly that this would be a hardware failure instead of a software one.
 

CyklonDX

That's why I recommended running telegraf and saving stats like temps to another system.

The 15 sec delay on temp sensor polling is something I've noticed on the PM983a; different models, or even individual disks, can have different polling times.
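
If temps are already being logged somewhere, the reporting interval can be estimated from the log itself. The sketch below is one possible approach, not what telegraf does: it assumes you polled faster than the sensor updates (once per second here) while the temperature was changing, and it measures the gaps between changes in the reported value. `update_intervals` and the sample data are hypothetical.

```python
from statistics import median


def update_intervals(samples):
    """samples: list of (timestamp_seconds, temperature), polled faster than
    the sensor updates. Returns the gaps between consecutive changes in the
    reported value."""
    change_times = [
        t for (t, value), (_, previous) in zip(samples[1:], samples[:-1])
        if value != previous
    ]
    return [later - earlier for earlier, later in zip(change_times, change_times[1:])]


def reading_at(t):
    """Invented step pattern: the reported value only moves every 15 s."""
    return {0: 40, 1: 44, 2: 47, 3: 49}[t // 15]


# Hypothetical 1 Hz log taken while load (and temperature) is ramping up.
samples = [(t, reading_at(t)) for t in range(50)]

gaps = update_intervals(samples)
print(gaps, median(gaps))
```

This only gives an upper bound: if the temperature is flat, the reported value will not change even when the sensor does refresh, so the measurement has to be taken while the drive is warming up or cooling down.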

The 980 is supposed, per its specs, to have a controller that regulates temps ('Dynamic Thermal Guard'). I would recommend checking how soon the data is being polled (push some load for a few secs and see how soon the temp updates). If you have a thermal gun, you could manually check each component's temp too, as there are 5 different modules on the 980 that can heat up. Also dump your SMART log; there could be something there.

(The spec allows for a 'burst mode', where it can hit up to 9W for a short amount of time; if polling is also 10-15 sec you may not see it.)
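
For the SMART log suggestion, the NVMe health log keeps a few thermal counters that survive between polls, so they can show an over-temperature event even if live monitoring missed it. Here is a rough Python sketch of what to look for. The dictionary keys are an assumption, modelled on the `nvme_smart_health_information_log` section of `smartctl --json` output, so adjust them to whatever your tool actually emits; `thermal_flags` and both sample logs are hypothetical.

```python
def thermal_flags(health_log):
    """Return notes for thermal-related fields in an NVMe SMART/health log dict.
    Key names are assumed to follow smartctl's JSON output."""
    notes = []
    # Bit 1 of the critical warning byte reports a temperature threshold violation.
    if health_log.get("critical_warning", 0) & 0x02:
        notes.append("critical warning: temperature threshold exceeded")
    # Both counters are cumulative minutes spent above the drive's thresholds.
    for key in ("warning_temp_time", "critical_comp_time"):
        minutes = health_log.get(key, 0)
        if minutes:
            notes.append(f"{key} = {minutes} min")
    return notes


# Invented example logs, not taken from any drive in this thread.
clean_log = {"critical_warning": 0, "temperature": 49,
             "warning_temp_time": 0, "critical_comp_time": 0}
hot_log = {"critical_warning": 2, "temperature": 81,
           "warning_temp_time": 12, "critical_comp_time": 0}

print(thermal_flags(clean_log))
print(thermal_flags(hot_log))
```

If the counters stay at zero across several crashes, that would be one more data point against a thermal cause, assuming the drive's firmware maintains them correctly.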