Sudden SAS SSD death - looking for understanding

marelooke

New Member
Mar 23, 2020
13
1
3
Hi all,

I bought a couple of second-hand Dell EMC SAS SSDs. The Dell model number is V4-2S6F-100, but the underlying devices are Samsung (SMART output at the bottom of this post).
According to the SMART data, both had ample life left: one reported a "Percentage used endurance indicator" of 8% and the other 0% (potentially used as a hot spare, judging by its read/write totals). Over the weekend both bombed out on me, one on Friday, the other on Sunday.

Both now show this message in SMART:
Code:
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x73 [asc=5d, ascq=73]
According to what I found, this means they ran out of endurance; the specific message is:
MEDIA IMPENDING FAILURE ENDURANCE LIMIT MET
Given that they supposedly have between 92% and 100% of their endurance remaining, that seems... odd?
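
For anyone wanting to check their own drives for the same mismatch, here is a minimal Python sketch. It parses the two relevant lines of smartctl text output (the line formats are taken from the output in this post) and flags a drive that reports "endurance limit met" while its endurance indicator is still low. The function names, the two-entry lookup table, and the 50% cut-off are assumptions made for illustration; none of them come from smartmontools or from Samsung.

```python
import re

# Small excerpt of SCSI additional sense code qualifiers for ASC 0x5D.
# Only the 0x73 entry is confirmed in this thread; 0x00 is the generic case.
ASC_5D_MESSAGES = {
    0x00: "FAILURE PREDICTION THRESHOLD EXCEEDED",
    0x73: "MEDIA IMPENDING FAILURE ENDURANCE LIMIT MET",
}

HEALTH_RE = re.compile(r"\[asc=([0-9a-fA-F]+), ascq=([0-9a-fA-F]+)\]")
ENDURANCE_RE = re.compile(r"Percentage used endurance indicator:\s*(\d+)%")


def parse_health(text):
    """Return (asc, ascq) from the SMART Health Status line, or None."""
    m = HEALTH_RE.search(text)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


def parse_endurance_used(text):
    """Return the 'percentage used' value as an int, or None if absent."""
    m = ENDURANCE_RE.search(text)
    return int(m.group(1)) if m else None


def describe(text, suspicious_below=50):
    """Summarise the health status; the 50% cut-off is an arbitrary assumption."""
    health = parse_health(text)
    used = parse_endurance_used(text)
    if health is None:
        return "no failure prediction reported"
    asc, ascq = health
    if asc != 0x5D:
        return "unknown asc"
    msg = ASC_5D_MESSAGES.get(ascq, "unknown ascq")
    if ascq == 0x73 and used is not None and used < suspicious_below:
        return f"{msg}, but only {used}% endurance used: suspicious"
    return msg


sample = """SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x73 [asc=5d, ascq=73]

Percentage used endurance indicator: 0%
"""
print(describe(sample))
```

Fed the two lines from the drive above, this reports the endurance message together with the 0% figure, which is exactly the contradiction described here.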

Given that these bombed out in exactly the same way in very short succession, I suspect a firmware issue. However, I couldn't dig up anything about these models having firmware issues, and Dell EMC's support seems to be locked off.

The seller offered to replace them; however, if this is indeed a firmware issue, any replacements are likely to break in exactly the same way...

Does anyone here know anything about these drives, and whether (and where) I could get updated firmware? I fully accept that I may just have been looking in the wrong places, but I haven't found much more than a generic spec sheet for these...

Full SMART data of one of the SSDs (the other is basically the same, just with a lot more use on it):
Code:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.106-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SAMSUNG
Product:              SS162511 CLAR100
Revision:             DC0D
Compliance:           SPC-4
User Capacity:        100,030,242,816 bytes [100 GB]
Logical block size:   512 bytes
Physical block size:  8192 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5002538453b04ee0
Serial number:        xxx
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Apr 19 11:53:01 2021 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x73 [asc=5d, ascq=73]

Percentage used endurance indicator: 0%
Current Drive Temperature:     28 C
Drive Trip Temperature:        58 C

Accumulated power on time, hours:minutes 36736:41
Manufactured in week 43 of year 2013
Accumulated start-stop cycles:  322
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0        373.721           0
write:         0        0         0         0          0        724.017           0
verify:        0        0         0         0          0      26705.219           0

Non-medium error count:        6

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   36656                 - [-   -    -]
# 2  Background long   Completed                   -   36513                 - [-   -    -]
# 3  Background long   Completed                   -   36507                 - [-   -    -]
# 4  Background long   Completed                   -   36471                 - [-   -    -]
# 5  Background long   Completed                   -   36442                 - [-   -    -]
# 6  Background long   Completed                   -   36431                 - [-   -    -]
# 7  Background short  Completed                   -   36431                 - [-   -    -]

Long (extended) Self-test duration: 90 seconds [1.5 minutes]
 

marelooke

That number looks familiar.
Just checked, and the second one sits at
Code:
Accumulated power on time, hours:minutes 36722:59
So they likely failed at (almost) exactly the same power-on time.

The famous "sudden death" cases that got a lot of attention mentioned 32,768 and 40k hours of power-on time, and involved Western Digital/SanDisk devices.

So would this be yet another, not really publicised, case entirely, as these are Samsung devices?
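
To put numbers on that, here is a quick Python sketch that compares the two readings and checks them against the two publicised thresholds. The helper name `power_on_minutes` is made up for this example, and both readings were taken after the drives had already failed, so the gap only shows how close the drives are in age, not the exact moment each one died.

```python
def power_on_minutes(line):
    """Convert '... hours:minutes 36736:41' into total minutes."""
    hours, minutes = line.rsplit(None, 1)[1].split(":")
    return int(hours) * 60 + int(minutes)


drive_a = power_on_minutes("Accumulated power on time, hours:minutes 36736:41")
drive_b = power_on_minutes("Accumulated power on time, hours:minutes 36722:59")

# Difference in power-on age between the two drives.
gap_hours, gap_minutes = divmod(abs(drive_a - drive_b), 60)
print(f"gap between drives: {gap_hours}h {gap_minutes}m")

# Distance (in whole hours) from the two publicised failure thresholds.
for threshold in (32768, 40000):
    print(f"{threshold} h: offset {drive_a // 60 - threshold:+d} h")
```

The two drives are 13 h 42 min apart in power-on time, and the first drive sits 3,968 hours past 32,768 and 3,264 hours short of 40,000, so neither of the known thresholds lines up with these readings.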
 

BLinux

cat lover server enthusiast
Jul 7, 2016
2,570
1,012
113
artofserver.com
@marelooke I know this was a month ago, but what conclusion did you arrive at? I have some Samsung SM1625 SSDs with EMC firmware that seem to be doing the same thing.
 

marelooke

BLinux said: @marelooke I know this was a month ago, but what conclusion did you arrive at? I have some Samsung SM1625 SSDs with EMC firmware that seem to be doing the same thing.
The vendor sent me new ones, same model, which I've plugged in (no reformatting to 512-byte sectors, nor any use, just pure power-on time) to see what they'll do. So far the "oldest" of the two has made it to:
Code:
Accumulated power on time, hours:minutes 37098:25
without spitting out errors in SMART, at least so far...
 

marelooke

BLinux said: @marelooke thanks for sharing that info. What is the actual Samsung P/N for these drives?
It should be in the SMART output posted in my OP. If those are not the numbers you're looking for, I'm afraid I've got nothing else at this point. There might be more to find if I open up a drive, but I'm not willing to do that just yet, not until I'm sure the two new ones either pass or fail, as I might still end up having to send them back to the vendor.

BLinux said: Also: were you by any chance running a SMART long test when the SSDs failed?
I was. I run a regular long test on my drives, and these were just taken along with the spinning rust. Is that a bad idea on these?
 

BLinux

marelooke said: I was. I run a regular long test on my drives, and these were just taken along with the spinning rust. Is that a bad idea on these?
I don't know, but that's exactly what happened here. The drive worked fine until a SMART long test was run, and then it showed the same error as yours.
 

marelooke

I disabled SMART on these and they've been fine so far, though I haven't dared to use them yet. I'll add them as an L2ARC for a while and see how that goes...
 

marelooke

Nearly a month of L2ARC use on one of them and it's still going strong. So it seems likely that it is the SMART self-tests that can brick these.
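
If the self-tests really are the trigger, one cautious workaround is to leave these drives out of any scheduled long test. Below is a rough Python sketch, assuming the `smartctl -i` text for a drive has already been captured into a string. The `SS1625` product prefix is taken from the SMART output in the first post; `should_skip_selftest`, `info_field`, and the skip list are hypothetical names for this example and would need adjusting for other affected models.

```python
import re

# (vendor, product prefix) pairs to keep out of scheduled self-tests.
# Only this one pair comes from the thread; anything else is guesswork.
SKIP_SELFTEST = [("SAMSUNG", "SS1625")]


def info_field(info_text, name):
    """Return the value of a 'Name:   value' line from smartctl -i output."""
    m = re.search(rf"^{re.escape(name)}:[ \t]*(.*)$", info_text, re.MULTILINE)
    return m.group(1).strip() if m else ""


def should_skip_selftest(info_text):
    """True if the drive matches an entry on the skip list."""
    vendor = info_field(info_text, "Vendor")
    product = info_field(info_text, "Product")
    return any(vendor == v and product.startswith(p) for v, p in SKIP_SELFTEST)


info = """Vendor:               SAMSUNG
Product:              SS162511 CLAR100
Revision:             DC0D
"""
print(should_skip_selftest(info))
```

Whatever script kicks off the periodic long tests could call a check like this first and skip the drive when it returns true.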