Avoid Samsung 980 and 990 with Windows Server


mattlach

Active Member
Aug 1, 2014
407
170
43
To chime in with an update: I just had my first issue with a Samsung SSD as part of a ZFS pool under Linux, after ~11 years of using various models of them (though to be fair, most of those were older SATA drives like the Samsung 850 Pro).

I found that one of the mirrored 512GB Samsung 980 Pro boot drives in the rpool had just gone unresponsive. A power cycle brought it back online.

Might just have been a fluke. Uptime wasn't too long (only about 90 days) and the system does have double fault tolerant registered ECC, but it might also have been related to the issues you guys have been seeing. I am continuing to monitor.

I have regular backups of the boot partition, and other than base configuration there really isn't anything stored on the boot pool I could lose (other than a little uptime as I restore it) so I am not too concerned, but if this keeps up, I may migrate away from Samsung in this role in the future.

My eight budget "ghetto" Phison reference-design Inland Premium drives (Microcenter's in-house brand) continue to be rock solid, as do my two striped 4TB WD Black SN850X cache devices.

Only time will tell whether this was an outlier/fluke or whether it is related to some of the issues others have had under Windows Server with these drives.
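
If anyone wants to do similar monitoring, it basically boils down to something like the sketch below (the device and pool names are just examples; adjust for your system):

Code:
# Is ZFS unhappy with any pool/device right now?
zpool status -x rpool

# Full SMART/health dump for the NVMe controller and namespace
smartctl -a /dev/nvme0

# Any controller resets or timeouts the kernel logged around the dropout
dmesg -T | grep -i nvme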
 
  • Like
Reactions: Fritz

mattlach

Active Member
Aug 1, 2014
407
170
43
Also, for what it is worth on these, smartctl reports the two temperature sensors at 40°C and 44°C respectively on both drives (with the drive heatsinks mounted as in the pics below, and a small fan blowing on them). This is at a relatively high ambient temp of 81.3°F / 27.4°C (I have had some AC troubles lately, which I am working to fix).

H12SSL-NT_SSD_01.jpg
(Before install of heatsinks and expansion cards)

H12SSL-NT_SSD_02.jpg
(After install of heatsinks, expansion cards and a little 40mm Noctua fan to keep some direct airflow on them)

So I don't think the temperature is a contributing factor here.

Both drives report 0 time spent above warning temps, for whatever that is worth.
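
In case anyone wants to pull the same numbers on their own drives, these all come from the NVMe SMART log; something like this should do it (assuming the drive shows up as /dev/nvme0, run as root):

Code:
# Both temperature sensors plus the composite temperature
smartctl -a /dev/nvme0 | grep -i temperature

# Same data via nvme-cli, including the counters for time spent above
# the warning and critical temperature thresholds
nvme smart-log /dev/nvme0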
 
  • Like
Reactions: Fritz and nexox

Fritz

Well-Known Member
Apr 6, 2015
3,600
1,548
113
71
That's a busy box.
 
  • Like
Reactions: mattlach

mattlach

Active Member
Aug 1, 2014
407
170
43
It's an all-in-one VM & NAS box.

Runs Proxmox VE (Debian-based, with a frontend for KVM & LXC). 32C/64T Milan EPYC CPU, 512GB registered ECC RAM, 24 3.5" drive bays (only 12 currently populated, with 16TB Seagate enterprise drives).

It's running a Supermicro H12SSL-NT. The PCIe slots are populated (from left to right) with an LSI 9305-24i SAS HBA, an Intel XL710-QDA2 dual-port 40Gbit NIC, and three Gen 4 16x -> 4x M.2 risers, each holding four M.2 drives, in addition to the two M.2 slots on board. One of the two SlimSAS connectors is configured for 8x PCIe mode and connected to two Optane 905p U.2 drives in a 2.5" metal drive cage I have ghetto-mounted to the side of the case.
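
If anyone is curious how that all enumerates, a quick way to sanity-check that every NVMe device actually shows up behind the risers is something like this (the PCIe address in the last command is just an example; substitute one from the list):

Code:
# Every NVMe namespace the kernel sees, with model, serial and firmware
nvme list

# The NVMe controllers on the PCIe bus, one line per SSD
lspci | grep -i "non-volatile memory"

# Negotiated link speed/width for one of them (a Gen4 x4 drive should
# report "Speed 16GT/s, Width x4" under LnkSta)
lspci -vv -s 41:00.0 | grep -i lnksta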

I got the NT model to give me more options for networking, but I found that the integrated 10gig NIC ports used a ton of power and ran very hot, so I opted not to use them.

...all in an old Supermicro SC846 chassis, with a custom modded center fan wall to allow for three 120mm Noctua fans instead of the stock 80mm things that were as loud as turbine engines...

I always find that for my needs, I usually run out of expansion first. Number of cores, amount of RAM, or any other aspect of the server is rarely a concern. The little 1U and 2U Dell/HP type servers would never work for me. That's just not what my workload looks like.

That, and I enjoy building things myself.

Years ago I picked up a 2U HP DL180 G6 because I found an incredible deal on it, and I wound up regretting that big time. Now I have resolved to only build my own going forward.

1719792843436.png
 
  • Like
Reactions: Fritz and nexox

CyklonDX

Well-Known Member
Nov 8, 2022
1,618
577
113
Those NVMe's should be screaming at high temps when in use.

*(I had the same setup until recently, before turning the SC24 bay into a JBOD; those NVMe heatsinks are really bad for heavy use)
 

mattlach

Active Member
Aug 1, 2014
407
170
43
Yeah, the fins are not as big as they ought to be, but the M.2 slots on this board sit underneath the first three 16x PCIe slots, so there isn't much room for bigger heatsinks.

That's why I added a small fan for some forced airflow, and this made a pretty huge difference.

The thin, low fin heatsinks alone are pretty bad, but with some customized forced air, they are actually quite good.

I have punished those NVMe drives in my picture pretty hard on occasion, but I have never seen them at more than ~20°C over ambient, and since the warning temp is at 81.5°C, that should be fine. If it is ever 61.5°C or above in my house, I have a much MUCH bigger problem.
 
  • Like
Reactions: nexox

CyklonDX

Well-Known Member
Nov 8, 2022
1,618
577
113
I've completely removed the aluminum shroud and mounted normal copper heatsinks while trying to fight those temps.

(those JEYI ones are the best - they keep my NVMe's at 55°C under load)
(but I also ended up switching everything to Micron and Hynix Gen4 NVMe's and running them at Gen3 - it's much better thermally speaking)
1719801631926.png
(the JEYI ones do take a bit more space - not sure you have it, since you have those PCIe cards right next to each other)

Still, I think it's preferable to remove the aluminum shroud and have fans on the side.
In your case, moving those cards further apart - if the PCIe lanes allow it - might also help, as they are likely heating each other up.
 

mattlach

Active Member
Aug 1, 2014
407
170
43
I would look into it further, but in monitoring my temps I simply have not had any temperature issues.

The two M.2 drives on the board stay at decent temps even under load, as does everything in the 16x risers. (The entire aluminum case on those Asus 16x risers is a heat spreader though, and they also have a built-in fan that pulls air in and exhausts it out the PCI slot bracket in the back.)

I was worried about the clearance between them at first, but in my testing there was only a degree or two of difference from separating them, and there were other things on the board in the way if I used the other 16x slots, so they stayed where they are. Apparently the limited couple of millimeters of clearance provides sufficient airflow to stay on top of it.

If I had NVME temperature issues I'd do something about it, but I simply don't. All the temps are fine, even under load, even at my current elevated ambients due to no AC.

Of the 16 NVMe devices, only six are Gen4 though. The rest are Gen3. If I had Gen5 or more Gen4 devices I might have more of an issue, but I simply don't need Gen5 bandwidth (and the CPU and board don't support it anyway).
 

mattlach

Active Member
Aug 1, 2014
407
170
43

Just to clarify and provide more data if anyone is interested, this is what the NVMe drives in my server are used for, where they are located, and what temps and wear that results in.

But first, understanding the ZFS storage configuration may help when reading the sections below. It looks like this:

Code:
Boot Pool

    data
      mirror-0
        Samsung 980 Pro 500GB
        Samsung 980 Pro 500GB
    logs
      mirror-1
        Intel Optane 900p 280GB Drive 0 Partition 0
        Intel Optane 900p 280GB Drive 1 Partition 0


VM Datastore

    data
      mirror-0
        Samsung 980 Pro 1TB
        Samsung 980 Pro 1TB
    logs 
      mirror-1
        Intel Optane 900p 280GB Drive 0 Partition 1
        Intel Optane 900p 280GB Drive 1 Partition 1


Scheduled Recordings

    data
      mirror-0
        Inland Premium 1TB
        Inland Premium 1TB


Mass File / Media Storage

    data
      raidz2-0
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
      raidz2-1
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
        Seagate Exos x18 16TB
    special 
      mirror-0
        Inland Premium 2TB
        Inland Premium 2TB
        Inland Premium 2TB
    logs 
      mirror-1
        Intel Optane 900p 280GB Drive 0 Partition 2
        Intel Optane 900p 280GB Drive 1 Partition 2
    cache
      WD Black SN850x 4TB
      WD Black SN850x 4TB


#1 Boot drives 2x ZFS mirrored 500GB Samsung 980 Pro
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 43 Celsius
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 43 Celsius

These are Gen4, but idle most of the time. They are the ones directly on the motherboard, partially buried under all the junk, with the shitty Chinese heatsinks on them and a 40mm slim Noctua fan for good measure. 0% wear reported, but after the OS install they see only limited writes, so that is expected.

#2 VM & Container Datastore drives. 2x ZFS mirrored 1TB Samsung 980 Pro
Temperature Sensor 1: 41 Celsius
Temperature Sensor 2: 52 Celsius
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 47 Celsius

These are also Gen4, but get a little more exercise, as all of the VMs and containers are running off of them, including a couple with some database activity. They are inside the 16x risers, one in the leftmost, the other in the rightmost. Of all the storage pools in the system, the wear is typically fastest on this one (or at least it used to be), but since installing the Samsung drives here ~7 months ago, they still report 0% wear.


#3 Dedicated pool for scheduled MythTV PVR recordings. 2x ZFS mirrored 1TB Inland Premium (appear to be reference Phison drives manufactured by Team)
Temperature: 29 Celsius
Temperature: 30 Celsius

These are Gen3. Mostly idle unless recording scheduled shows. I used to record straight to the hard drive pool, but got some strange stutter, so I decided to record to a SATA SSD pool instead, and transitioned it to NVMe when I replaced all of the SATA SSDs with NVMe drives. A cron script moves the oldest recordings to my hard drive pool every night at 4am to free up recording space; this is when they see the most activity. These are in the 16x risers, one left, one center (the one in the left riser at 29°C has unrestricted airflow; the one in the center riser has that ~2mm clearance, but only seems to suffer by 1°C for it). After almost 3 years in this role they are reporting only 2% wear.
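
The nightly move-out job is nothing fancy; conceptually it's just a cron entry plus a loop that moves the oldest recordings until the SSD pool has breathing room again. A rough sketch (the paths, script name and 80% threshold are made up for illustration, not my actual script):

Code:
#!/bin/bash
# archive-oldest-recordings.sh -- run from cron at 4am, e.g.:
#   0 4 * * * root /usr/local/bin/archive-oldest-recordings.sh
set -euo pipefail
SRC=/srv/recordings-ssd    # NVMe recording pool mountpoint
DST=/srv/recordings-hdd    # hard drive pool mountpoint

# Move the oldest recording until the SSD pool drops below 80% full
while [ "$(df --output=pcent "$SRC" | tail -1 | tr -dc '0-9')" -gt 80 ]; do
    oldest=$(find "$SRC" -type f -printf '%T@ %p\n' | sort -n | head -n1 | cut -d' ' -f2-)
    [ -n "$oldest" ] || break
    mv -- "$oldest" "$DST"/
done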


#4 "Special" ZFS VDEV for storing metadata and small files in hard drive pool. 3x ZFS Mirrored 2TB Inland Premium (appear to be reference Phison drives manufactured by Team)
Temperature: 29 Celsius
Temperature: 30 Celsius
Temperature: 30 Celsius

These are Gen3. They see constant slow use, as the hard drive pool is almost always active, night and day. (I did a 3-way mirror because my hard drive vdevs are two RAIDZ2 vdevs, each with two-drive fault tolerance, and the recommendation is to match fault tolerance across vdevs.) One of these three is in each of the 16x risers. The 29°C one is unobstructed on the left; the 30°C ones are center and right. After almost 3 years in this role they report only about 1% wear.


#5 Cache / L2ARC drives for hard drive pool. 2x Striped 4TB WD Black SN850X
Temperature: 44 Celsius
Temperature: 46 Celsius

These are Gen4. They see near-constant activity as things are moved in and out of the hard drive pool's read cache. The level 1 cache (ARC, 384GB) is in RAM. What gets evicted from there gets moved to the cache drives (L2ARC, 8TB), and the least desirable data in that cache gets evicted to make space for it. Since no data is lost if these fail, they are striped to maximize size. One is in the left 16x riser, and one is in the right one. Despite near-constant write activity, these still report 0% wear after ~7 months in this role.
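
If anyone wants to see how much mileage the L2ARC is actually getting, the usual places to look are something like this (the pool name is a placeholder):

Code:
# ARC / L2ARC sizes and hit rates
arc_summary

# Raw counters, if arc_summary isn't installed
grep -E '^(l2_hits|l2_misses|l2_size)' /proc/spl/kstat/zfs/arcstats

# Per-device I/O on the pool, including the cache devices, refreshed every 5s
zpool iostat -v tank 5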


#6 ZFS SLOG (Separate ZFS Intent Log). 2x ZFS Mirrored 280GB Intel Optane 900p
Temperature: 48 Celsius
Temperature: 48 Celsius

These are Gen3. They see near-constant activity, as they each have 4 partitions, 3 of which are in use. Because Optanes are crazy (insanely high queue-depth IOPS and very low write latency), there is very little drawback to partitioning them, having each partition mirrored between the two drives, and using them to support SLOG activity on three different pools (boot pool, VM datastore pool and hard drive storage pool). The 2.5" U.2 drives are connected via the 8x SlimSAS connector on the motherboard to the ghetto-mounted drive cage on the right side of the case. 0% wear as expected despite the heavy-ish load, considering they are Optanes. Crazy write endurance on Optanes.
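
The partitioning itself is nothing exotic: three small partitions per Optane, then a mirrored log vdev added to each pool, pairing partition N of one drive with partition N of the other. Roughly like this (sizes, device names and pool names are placeholders):

Code:
# Three partitions per Optane (repeat for the second drive)
sgdisk -n 1:0:+16G -n 2:0:+16G -n 3:0:+16G /dev/nvme16n1

# One mirrored SLOG per pool, one partition pair each
zpool add rpool  log mirror /dev/nvme16n1p1 /dev/nvme17n1p1
zpool add vmpool log mirror /dev/nvme16n1p2 /dev/nvme17n1p2
zpool add tank   log mirror /dev/nvme16n1p3 /dev/nvme17n1p3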


#7 mdadm Mirror for Swap Space. 2x mdadm mirrored 1TB Inland Premium (appear to be reference Phison drives manufactured by Team)
Temperature: 29 Celsius
Temperature: 30 Celsius

I used mdadm on these, as ZFS is not appropriate for swap because it can create a feedback loop under memory pressure (RAM is low, so more gets written to swap; swap is on ZFS, which increases ZFS's own RAM use, reducing available RAM and pushing even more to swap, and so on). Traditional software RAID via mdadm is more appropriate here. One is in the middle 16x riser, one in the right riser. The one on the right is hotter for some reason, despite both having their airflow restricted by roughly the same amount. These drives were previously in the cache role, but were moved to the swap role when I picked up the WD drives ~7 months ago. In the swap role they see very, very light writes, probably negligible. But they did spend 2+ years in the cache role, and that is probably why they are reporting 5% wear.
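
For reference, the swap mirror is plain mdadm rather than anything ZFS-aware; the setup amounts to something like this (device names are placeholders):

Code:
# RAID1 across the two drives (or partitions on them)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme8n1 /dev/nvme9n1

# Format and enable it as swap (and add it to /etc/fstab by UUID so it
# comes back after a reboot)
mkswap /dev/md0
swapon /dev/md0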


#8 Standalone (ext4) drive for MythTV Live TV. 1x 256GB Inland Premium (appears to be a reference Phison drive manufactured by Team)
Temperature: 30 Celsius

MythTV kind of "hacks" Live TV by treating them like recordings, writing them to disk, and immediately playing htem back (with a small buffer). This allows you pause/rewind or to mark something as keep/record while watching it and it stays, otherwise they are auto-deleted within 24 hours, or if space is needed. Just like with Scheduled pool, if something is marked record, an overnight script moves it to the scheduled pool (where it remains until it ages out to the hard drive pool or is deleted) With the mother-in-law here in the house, at least one of the MythTV frontends is constantly watching live tv, so this one sees a near constant slow trickle of writes during the day, but only like 1.5 to 4MB/s somewhere. I expected to eat up this drive with writes fast, and then replace it (it was only $29, so why not?) but it has been in use in this role for almost 3 years and is only reporting 9% wear. The drive is in the middle 16x riser.


I guess I intentionally over-dimensioned the capabilities here, in part because that's just what I do, but also in part because I realized I was using consumer NVMe SSDs in a server role and I didn't want to push my luck, so I split things up to distribute the workload a little.

Nothing runs too hot even at its highest load on a hot day without AC, and nothing has really ever given me any trouble except that one 500GB Samsung boot drive, which just kind of dropped offline last week. The mirror kept it going though, and a power cycle brought it back up again (it had to resilver 1.56GB from the other side of the mirror while it was down). I'm going to have to monitor it going forward, but I'm not too concerned. Worst comes to worst, I'll just buy a couple of other 500+GB NVMe drives and resilver them in place to replace them.
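
If it does come to swapping them out, the nice thing about ZFS is that it's basically a one-liner per drive; a sketch, with the pool and device names as placeholders:

Code:
# Clear the error counters once the flaky drive looks healthy again
zpool clear rpool

# If a drive does get replaced outright, resilver the new one in place
zpool replace rpool <old-device> /dev/disk/by-id/nvme-<new-drive>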

It's interesting to me that the drive that dropped out was one of the boot drives, one of the coolest-running of the Gen4 drives and one of the lowest-load drives as well, with only 2.94TB written over 7 months, while the MUCH more heavily loaded VM datastore drives, also Samsung 980 Pros, are fine. But this is small-sample-size stuff, so it is difficult to draw any real conclusions.


Welp. That wound up being more of a writeup than I had planned. I hope someone finds it interesting/useful. Aren't digressions fun?
 
Last edited:
  • Like
Reactions: pimposh and Fritz

bootynugget

New Member
Jul 2, 2024
3
1
3
Hi guys and gals. I just created this account to let you know that it is NOT just Windows Server doing this. I built a new system for my dad a couple months ago - 14th gen i7, 32GB RAM, Windows 11 Pro, etc. I decided to go with a Samsung 990 PRO 2TB drive (with heatsink). A week in, my dad calls me and says the system is sitting at the BIOS screen. My assumption is that it must've crashed overnight, as it was in the AM when he reported this. Strange... so I had him reboot, and it booted right into Windows. Rinse, repeat at random for the next month or so. One time, it happened while he was sitting at the system working on it. Finally, I got fed up, ordered a WD NVMe drive for it, and cloned the 990 PRO over to it. Not a single issue since...

So, this definitely isn't isolated to just Windows Server... It happens on Windows 11 as well.
 
  • Like
Reactions: Fritz

mattlach

Active Member
Aug 1, 2014
407
170
43
This sounds like something different, as the occurrence rate is very high.

Did you flash it with the latest firmware using Samsung Magician?

I have two 2TB 990 Pros in my desktop which runs both Windows 11 and Linux Mint (dual boot) and have not had a single issue in over a year of use.

It's possible you received a defective unit. Did you file an RMA request? These drives carry a 5-year warranty.
 
Last edited:

bootynugget

New Member
Jul 2, 2024
3
1
3
I did update it to the latest firmware immediately (using Samsung Magician). I have not yet filed an RMA request for it, but will probably do so soon. There are other people having the same issue - I've found a few other threads elsewhere describing the same problem I had.
 

semicycler

New Member
Sep 8, 2022
5
5
3
I have a handful of home servers/computers, most running fine with the 990 Pro. But I have seen the M.2 drop a handful of times, under Server 2022 and Win 10, on two different computers. Latest firmware via Samsung Magician. Different M.2 slots make no difference.

My main PC has been running fine with a 2TB 990 Pro for months. I recently started building a replacement and swapped in a slightly older 2TB 990 Pro so that I could move the newer one to the new build. Now my main PC with the swapped-in older M.2 is dropping the drive - it has happened twice now: the computer boots to BIOS, the drive is not present, and I need to cycle power before everything shows up and works again. The only difference here is that the M.2 was cloned and swapped for an identical version. This is looking like a Samsung quality problem - why did one M.2 work fine for months, then within days of swapping for an identical make/model it starts dropping the drive?

I'm testing the 'Full Power Mode' option inside Samsung Magician, which 'Prevents SSD from going to sleep or idle state', with the bad drive. You can find this option in Samsung Magician under Performance Optimization --> Custom Mode, or Performance Optimization --> Full Performance Mode. The difference between the two is over-provisioning, which I did not want to do.

1720618675079.png

If this doesn't fix it then I'm looking at an RMA, not sure what else to try.
 

semicycler

New Member
Sep 8, 2022
5
5
3
Eight days in and no dropped 990 Pros! I'm not saying this is the fix but am saying things have been stable so far.

All my home servers and clients with 990 Pros in them have this change. Before the change, 2 of 8 computers were randomly dropping their 990 Pro M.2, needing a cold restart for it to reappear in the BIOS. One is the 4TB with-heatsink flavor, the other the 2TB with-heatsink flavor. Latest 4B2QJXD7 firmware in all of them. Since changing to 'Full Power Mode' in Samsung Magician as shown above, zero computers have dropped their drives. Definitely promising after eight days of testing...
 

mattlach

Active Member
Aug 1, 2014
407
170
43
Hmm.

I wonder if there is a way to do this without Magician from - say - Linux.

I'm not really having any issues with mine except for that one random drop-out which may not have been related, but it wouldn't hurt to enable it just for good measure.

It's going to be a MASSIVE pain to get them into a Windows machine that can run Magician though.
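
The closest analog I know of on the Linux side is disabling APST (autonomous power state transitions), which keeps the drive out of its deep idle states. I haven't verified that it addresses this particular issue, but roughly:

Code:
# Show whether APST is enabled and the drive's power state table (feature 0x0c)
nvme get-feature /dev/nvme0 -f 0x0c -H

# Disable APST by capping the allowed transition latency at 0:
# add nvme_core.default_ps_max_latency_us=0 to the kernel command line
# (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub), then update-grub and reboot.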
 

HellDiverUK

Active Member
Jul 16, 2014
291
52
28
48
I've had similar issues in my Proxmox machine, originally using a 990 2TB. Various crashes, which was very unexpected considering the machine (brand new Optiplex with i5-13500). Eventually replaced the Samsung with a SKHynix Gold 2TB and had zero issues since.
 

drdepasquale

Active Member
Dec 1, 2022
126
43
28
No issues here using the Standard NVM Express driver with a Samsung 980 Pro with Windows Server 2012 R2