Samsung 850 Pros fall over under heavy read/write workloads

Lance Joseph

Member
Oct 5, 2014
82
40
18
I have a handful or two of Windows Server 2012 R2 systems that can, with some consistency, make various flavors of Samsung 850 Pro drives disappear from the OS.
My test systems include a number of different hardware configurations.
These include combinations of Xeon E5 v2/v3 processors with LSI SAS 2308 / 3008 and Intel ICH10 disk controllers.
The OS has all patches applied, and the drivers, controllers, and disks have all been updated to their latest versions.

The behavior where the disks are "surprise removed" from the OS happens with these drives:
128G Samsung 850 Pro
256G Samsung 850 Pro
512G Samsung 850 Pro

I've not been able to reproduce this with the following models of SSD:
128G Samsung 840 Pro (slightly different failure mode)
250G Samsung 850 Evo
480G Samsung PM863 **
512G Samsung 950 Pro M.2 *
1T Samsung 850 Pro
2T Samsung 850 Evo

I was first notified of this issue by a colleague who was running various types of MS SQL workloads that were causing the drives to slow down and, in some cases, disappear from the OS.

My methods for reproducing this include the use of iometer.
Assign a drive to a worker and set 'outstanding IO' to 128.
Use the workload 'all in one' and set hours to 2.
Let the drive fill up then monitor I/O.
Eventually it slows & falls over!

Read and write workload bandwidth should be equal.
So I'm seeing simultaneous ~30MB/s read and write to the 850 Pro drives.
After a while, IO drops to zero before the drives disappear and need to be power cycled.
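
For anyone without Iometer handy, here is a minimal Python sketch of a comparable mixed read/write load. It is not what Iometer does internally: the 4 KiB block size, the 50/50 read/write split, the thread count, and the names `run_mixed_load` and `worker` are all my own assumptions, and the I/O goes through the OS cache, so treat it as a rough stand-in rather than an equivalent test.

```python
import os
import random
import threading
import time

BLOCK = 4096  # bytes per I/O; an assumption, not Iometer's "All in one" mix


def worker(path, file_size, stop_at, totals, lock):
    """Alternate random reads and writes on the file until the deadline."""
    rng = random.Random()
    payload = os.urandom(BLOCK)
    read_bytes = 0
    written_bytes = 0
    with open(path, "r+b", buffering=0) as f:
        while time.monotonic() < stop_at:
            # Pick a block-aligned offset somewhere in the scratch file.
            f.seek(rng.randrange(0, file_size // BLOCK) * BLOCK)
            if rng.random() < 0.5:
                read_bytes += len(f.read(BLOCK))
            else:
                written_bytes += f.write(payload)
    with lock:
        totals["read"] += read_bytes
        totals["written"] += written_bytes


def run_mixed_load(path, file_size=64 * 1024 * 1024, seconds=5.0, threads=8):
    """Create a scratch file at `path`, load it, and return MB/s per direction."""
    with open(path, "wb") as f:
        f.truncate(file_size)
    totals = {"read": 0, "written": 0}
    lock = threading.Lock()
    stop_at = time.monotonic() + seconds
    pool = [
        threading.Thread(target=worker, args=(path, file_size, stop_at, totals, lock))
        for _ in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return {name: count / seconds / 1e6 for name, count in totals.items()}
```

Point `path` at a scratch file on the drive under test and call `run_mixed_load` repeatedly; a sustained drop toward zero in both numbers would correspond to the stall I describe above.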

Apologies if there are somehow gaps in my testing, methodology, or descriptions.
Would anyone else out there be able to try to reproduce this in their environments?
I'm happy to modify workloads or answer questions about my steps / thought processes.


Thanks
Lance


*Edit: forgot to add 950 Pro M.2 to list of unaffected devices.
**Edit#2: added 480G Samsung PM863 to list of unaffected devices.

keybored

Active Member
May 28, 2016
280
66
28
Don't have an 850 Pro to play with but I'm curious what temperatures you're seeing on those drives before they disappear. Are you hitting a thermal threshold of some sort?
 

Deslok

Well-Known Member
Jul 15, 2015
1,122
125
63
34
deslok.dyndns.org
The EVO models and larger drives have more over-provisioned area. I would suggest trying that first using Samsung Magician.
 

Lance Joseph

Member
Oct 5, 2014
82
40
18
Don't have an 850 Pro to play with but I'm curious what temperatures you're seeing on those drives before they disappear. Are you hitting a thermal threshold of some sort?
Thanks for pointing that out. I forgot to mention that temps seem okay.
The drives are in Supermicro systems supplied with A/C and steady airflow.
I've observed them maintaining a temperature from 20C all the way up to 30C.
So I don't believe I'm hitting a critical temperature threshold that's causing shutdown.
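
If anyone wants to log temperatures during a run, a rough sketch along these lines could work. It assumes smartmontools is installed and that the drive exposes a temperature attribute in the usual `smartctl -A` table; the function names and the device path in the usage note are placeholders of mine, not anything specific to my setup.

```python
import subprocess


def parse_temperature(smartctl_output):
    """Return the raw value of the first temperature attribute, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns: ID, name, flag, value,
        # worst, thresh, type, updated, when_failed, raw value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if "Temperature" in fields[1] and fields[9].isdigit():
            return int(fields[9])
    return None


def read_temperature(device):
    """Run `smartctl -A` on `device` (needs smartmontools and admin rights)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_temperature(result.stdout)
```

Calling `read_temperature("/dev/sda")` once a minute alongside the workload would show whether the drive heats up before it drops out; the device name is only an example.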
 

Lance Joseph

Member
Oct 5, 2014
82
40
18
Did you give OP a try on those drives ?
Thanks for bringing this up. I'd not bothered to modify or tweak the over provisioning until you'd mentioned it.
All tests performed prior to the ones below were using the default drive configuration which does not allocate over provisioning.

The evo models and larger drives have more over provisioned area I would suggest trying that first using Samsung's ssd magician.
I just fired up a series of tests on three 256G 850 Pro drives which are now using over provisioning.
They have 1%, 5%, and 10% drive capacity (respectively) allocated for over provisioning using Samsung Magician.
So far, the drives with 1% and 10% have fallen over ~10 minutes after testing began. The drive with 5% should fall over shortly.

Again, just to reiterate, the EVO drives appear to be unaffected.
I've only been able to produce this issue on the 128-512G 850 Pro SSDs.

Some additional info that I neglected to include in my initial post:
The uptime for these drives ranges anywhere from a month to 2.5 years.
The number of terabytes written to the drives ranges from 0.1T up to 26T.
The failure mode doesn't appear to correlate to any particular SMART counter.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
I would try 20% OP; that's what I see in other benchmarks, and it seems to really make a difference.

Maybe next test could be 15% 20% 25% :)

Awesome tests!
 

wildchild

Active Member
Feb 4, 2014
389
57
28
At least 20% OP.
I found 35% OP to be the sweet spot on 840 Pros using ZFS (no TRIM support).
I never used the Magician tool, though; always hdparm.
Also, never forget to do a secure erase before setting the OP.
 

Lance Joseph

Member
Oct 5, 2014
82
40
18
The drive with 5% should fall over shortly.
This drive finally dropped out as well.

I would try 20% OP that's what I see on other benchmarks that seem to really make a difference.

Maybe next test could be 15% 20% 25% :)

Awesome tests!
Thanks! I ran the tests on another system with 10%, 15%, 20%, and 25%.
The drives with 10% and 25% also dropped out, and the 15% and 20% OP drives followed suit a half-hour later.

At least OP 20%.
I found OP of 35% to be the sweet spot on 840 pro's using ZFS (no trim support)
Never used magician tool though.. always hpdarm
Also never forget to do a secure erase before the OP
I fired up a new system and set the OP to 25%, 35%, 45%, and 50% (the maximum OP setting) on four 256G 850 Pros.
The 25% and 50% OP drives quickly did as expected. I didn't bother to continue testing on the 35% and 45% drives.
I haven't done the secure erase yet on any of these drives. I'll get started on another system and report back.
 

keybored

Active Member
May 28, 2016
280
66
28
Have you tried running these tests on different hardware, assuming you have the option, and/or OS? You did mention that a colleague faced the same issue. Was that on the same hardware? Maybe boot up Ubuntu off of a USB stick and run a test there? Back in 2011-ish, I had a couple of Intel 320 SSDs brick themselves when they didn't like the controller I connected them to. Granted, this is a totally different manufacturer, but still, eliminating h/w and s/w issues would be my next step if I faced this issue.
 

Lance Joseph

Member
Oct 5, 2014
82
40
18
I haven't done the secure erase yet on any of these drives. I'll get started on another system and report back.
It looks like Samsung Magician doesn't have the option to perform a Secure Erase in Windows 8 and above (including Server 2012 R2).

Have you tried running these tests on different hardware, assuming you have the option, and/or OS? You did mention that a colleague faced the same issue. Was that on the same hardware? Maybe boot up Ubuntu off of a USB stick and run a test there? Back in 2011-ish, I had a couple of Intel 320 SSDs brick themselves when they didn't like the controller I connected them to. Granted, this is a totally different manufacturer, but still, eliminating h/w and s/w issues would be my next step if I faced this issue.
Yes, my colleague was using the same systems on which I'm testing and reproducing these issues.
In my first post, I listed some of the hardware configurations with which I've been performing these tests.
The majority of my tests are being performed on Supermicro systems with LSI 2308 and 3008 SAS controllers.
In an attempt to rule out the LSI HBAs as the culprit, I connected a 128G Samsung 850 Pro to an Intel ICH10 in a Dell T5610.
I was able to reproduce this issue on all of those systems running Windows Server 2012 R2; however, I'm happy to run tests in a Linux environment.
I'll need some time to get a system or two reinstalled with CentOS, but I'm happy to entertain any other troubleshooting steps on the current systems in the meantime.

Thanks for the suggestions!
 

wildchild

Active Member
Feb 4, 2014
389
57
28
This drive finally did the thing.



Thanks! I ran the tests on another system with 10%, 15%, 20%, and 25%.
The drives with 10% and 25% also did the thing and the 15% and 20% OP drives followed suit a half-hour later.



I fired up a new system and set the OP to 25%, 35%, 45%, and 50% (the maximum OP setting) on four 256G 850 Pros.
The 25% and 50% OP drives quickly did as expected. I didn't bother to continue testing on the 35% and 45% drives.
I haven't done the secure erase yet on any of these drives. I'll get started on another system and report back.
Actually, the secure erase is pretty much the most important step.
Otherwise there will be leftover bits on the drive that the controller can no longer cope with.
You can boot from a Linux live CD, then follow the Thomas Krenn method to secure erase and create a host protected area.
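
As a rough sketch of the arithmetic involved, the helper below works out how many sectors stay visible for a given OP percentage and builds the hdparm commands without running them. The function names, the throwaway password, and the sector count in the usage note are assumptions for illustration; read the real native max from `hdparm -N` on your own drive. Some hdparm versions may not need or accept the `--yes-i-know-what-i-am-doing` flag, and a drive reported as "frozen" will refuse the security commands.

```python
def visible_sectors(total_sectors, op_percent):
    """Sectors left visible after reserving op_percent of the drive."""
    if not 0 <= op_percent < 100:
        raise ValueError("op_percent must be in [0, 100)")
    return total_sectors * (100 - op_percent) // 100


def secure_erase_and_hpa_commands(device, total_sectors, op_percent, password="p"):
    """Build (but do not run) hdparm commands for secure erase plus HPA."""
    keep = visible_sectors(total_sectors, op_percent)
    return [
        # Set a temporary user password, which the erase command requires.
        ["hdparm", "--user-master", "u", "--security-set-pass", password, device],
        # ATA secure erase; this destroys all data on the device.
        ["hdparm", "--user-master", "u", "--security-erase", password, device],
        # Permanently limit the visible sectors, leaving the rest as HPA.
        ["hdparm", f"-Np{keep}", "--yes-i-know-what-i-am-doing", device],
    ]
```

For example, a drive reporting 500118192 sectors with 20% OP would keep 400094553 sectors visible. Review the generated commands by hand before running anything, since the erase is irreversible.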
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
Actually the secure erase is pretty much the most important step.
Otherwise there will left over bits causing the controller not getting it anymore.
You can boot up using a linux live cd, then follow thomas krenn method to secure erase and create a host protected area
I would second this suggestion.

And, for those reading in the future: be sure to 'secure erase' any drive before putting it into use, be it new or used. I've had some rather poorly performing used drives 'come back' after a secure erase, likely because someone just yanked them from service and sold them without actually doing what they claimed (secure erase). Most eBay sellers and IT recyclers claim "Secure Erase", but I've found it's a 50/50 chance of that actually having occurred.
 

unwind-protect

Active Member
Mar 7, 2016
415
156
43
Boston
My 850 was also failing on me in unfortunate ways (read block out of the blue). I previously thought the Samsung 850s could be the first SATA SSDs I'd like, but it wasn't meant to be.
 

sd11

New Member
Jun 2, 2016
28
1
3
39
Any update on this?

I recently put an 850 Pro 512 into production on an ESX box with low load, with about 15% OP.

Does the disk just drop off, or completely die?

I'm a bit nervous after reading this.