Very excited, first Intel DC S3700 failing

Patrick

Administrator
Staff member
Dec 21, 2010
12,364
5,496
113
From a longer-term perspective, a reminder came up that we needed to do a refresh of Used enterprise SSDs: Dissecting our production SSD population

As I was looking at the data a few weeks ago, we still had not experienced an Intel SSD failure. In fact, our failures in the three quarters after that article were minimal.

Just when I thought all was lost, I received this notification last evening:
Code:
This message was generated by the smartd daemon running on:
   host name:  fmt-pve-07
   DNS domain: servethebiz.com
The following warning/error was logged by the smartd daemon:
Device: /dev/sdf [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Device info:
INTEL SSDSC2BA400G3E, S/N:_____________, WWN:_________, FW:5DV10250, 400 GB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
An Intel DC S3700 400GB failing!
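
For anyone who wants to feed these emails into a ticketing or inventory system, here is a minimal Python sketch that pulls the host, device, model, and firmware out of an alert like the one above. The helper name `parse_smartd_alert` and the regular expressions are assumptions based on this single message, not an official smartd format, so treat it as a starting point only.

```python
import re

# Alert text copied from the message above (S/N and WWN are redacted in the original).
ALERT = """\
This message was generated by the smartd daemon running on:
   host name:  fmt-pve-07
   DNS domain: servethebiz.com
The following warning/error was logged by the smartd daemon:
Device: /dev/sdf [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Device info:
INTEL SSDSC2BA400G3E, S/N:_____________, WWN:_________, FW:5DV10250, 400 GB
"""


def parse_smartd_alert(text):
    """Hypothetical helper: extract a few fields from a smartd warning email."""
    fields = {}

    host = re.search(r"host name:\s*(\S+)", text)
    if host:
        fields["host"] = host.group(1)

    # "Device: /dev/sdf [SAT], <message>"; the bracketed type is optional.
    dev = re.search(r"^Device:\s*(\S+)\s*(?:\[[^\]]*\])?,\s*(.+)$", text, re.MULTILINE)
    if dev:
        fields["device"] = dev.group(1)
        fields["message"] = dev.group(2).strip()
        fields["failed"] = "FAILED SMART self-check" in fields["message"]

    # The line after "Device info:" starts with the model and ends with the capacity.
    info = re.search(
        r"^Device info:[ \t]*\n([^,\n]+),.*?FW:([^,\s]+),\s*(.+)$",
        text,
        re.MULTILINE,
    )
    if info:
        fields["model"] = info.group(1).strip()
        fields["firmware"] = info.group(2)
        fields["capacity"] = info.group(3).strip()

    return fields


if __name__ == "__main__":
    for key, value in parse_smartd_alert(ALERT).items():
        print(f"{key}: {value}")
```

Other smartd warnings (pending sectors, temperature, and so on) use different wording on the "Device:" line, so the `failed` check here only covers the self-check case shown in this thread.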

For those who are not familiar, this is one of the drive failure alert features in newer versions of Proxmox VE. I do not remember exactly when they added SMART monitoring, but I do know it is in PVE 4.4 and was not in PVE 4.1.
 

Patrick

That's a nice alert feature. TBW?
I am going to pull the drive and double-check.

I think it is an actual failure, not a wear-out issue. 63 power cycles and 1 week shy of 2 years.
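
For reference, the power cycle count, power-on time, and a rough TBW figure can all be read from the attribute table printed by `smartctl -A /dev/sdf`. Below is a minimal Python sketch that parses that table. The sample output is illustrative: only the power cycle count of 63 comes from this thread, the other raw values are made up. It also assumes the drive reports host writes as `Host_Writes_32MiB` (units of 32 MiB), which is how smartctl typically labels this on Intel SSDs, so check your own output before trusting the numbers.

```python
# Illustrative `smartctl -A` excerpt. Only the Power_Cycle_Count raw value (63)
# is taken from the thread; the other raw values are invented for the example.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10000
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       63
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       100000
"""


def parse_attributes(text):
    """Map attribute name -> integer raw value for rows of the attribute table."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have the raw value in column 10.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs


def host_writes_tb(units_32mib):
    """Convert a count of 32 MiB units into decimal terabytes written."""
    return units_32mib * 32 * 1024**2 / 1e12


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    hours = attrs["Power_On_Hours"]
    print(f"Power cycles:  {attrs['Power_Cycle_Count']}")
    print(f"Power-on time: {hours} h = {hours / 24:.1f} days = {hours / 8766:.2f} years")
    print(f"Host writes:   {host_writes_tb(attrs['Host_Writes_32MiB']):.2f} TB")
```

Comparing the host writes figure against the drive's rated endurance is a quick way to separate a wear-out case from an outright failure.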

We have over 200 Intel SSDs deployed at this point, so finally seeing one fail is good.
 

Logan

Member
Feb 22, 2017
64
8
8
Would be interesting to understand the cause. Why do these tend to fail, if not from wear? Overheating? Over/undervoltage?

Still under warranty? I recently bought a used S3700 400GB and Intel's warranty site shows coverage estimated into 2018: Warranty Information
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,333
1,791
113
CA
Do you have temp history? I'd be curious about that. As I'm sure you're aware, the S3700 can get extremely hot during writes if not cooled properly.
 

Patrick

It says every drive is 19-23C in that chassis right now. It is actually just an OS boot drive, so it sees very few "heavy" writes. The S3700 is a very cool-running drive compared to the XS1715 or the P3700. NVMe drives can hit 25W, while the S3700 draws a small fraction of that.
 

T_Minus

Not to get off track, but that's why I specified "if not cooled properly" ;) I've seen them approach max temp in small cases with improper airflow, and wasn't sure if you jammed them into a 1U or something :) Thus the question ;)
 

Patrick

2U Intel WT chassis. Cooling is not an issue.

@William - designed to have drives fail in place.
 

Son of Homer

Member
May 9, 2016
171
15
18
47
Are there any long term studies on SSD longevity with large numbers of drives, similar to the Backblaze and Google studies on hard drives?
 

ATS

Member
Mar 9, 2015
96
32
18
47
Only Patrick gets excited about a drive failing LOL
The rest of us go... Oh Crap, I better do something fast.
Sometimes it is good to see a failure, just so you know that everything around a failure actually works, like, say, being told that there is a failure in the first place.
 

ATS

Long-term SSD studies like the Backblaze and Google ones for hard drives are unlikely to appear outside of the vendors at the moment. To publish reasonable numbers you need a pretty high volume of drives and a reasonable failure history. It is unlikely that any given customer has had a large enough population for a long enough time to collect data worth releasing: widespread use in servers has been fairly recent, and the systems used by early adopters have likely already been replaced by newer, faster hardware given the ramp in performance over the last 3-4 years. It has only recently become reasonable to provision all-SSD systems outside of niche roles.
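
To put a rough number on why population size and time in service matter, here is a minimal Python sketch that turns a failure count and a number of drive-years into an annualized failure rate (AFR) with a one-sided 95% upper bound, assuming failures follow a Poisson process. The function names are hypothetical, and the 300 drive-years in the example is purely an assumption for illustration (the thread only says "over 200" drives and one failure, not how long each has been in service).

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson distribution with mean mu."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))


def upper_bound_failures(k, confidence=0.95):
    """Largest expected failure count still consistent with observing k failures.

    Solves poisson_cdf(k, mu) = 1 - confidence for mu by bisection.
    """
    lo, hi = 0.0, 2.0 * k + 20.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if poisson_cdf(k, mid) > 1 - confidence:
            lo = mid  # mu is still too small to rule out k or fewer failures
        else:
            hi = mid
    return hi


def afr_with_bound(failures, drive_years, confidence=0.95):
    """Return (point estimate, upper bound) of the annualized failure rate."""
    point = failures / drive_years
    upper = upper_bound_failures(failures, confidence) / drive_years
    return point, upper


if __name__ == "__main__":
    # ASSUMPTION: 300 drive-years is an invented figure, not data from this thread.
    point, upper = afr_with_bound(failures=1, drive_years=300.0)
    print(f"AFR estimate: {point:.2%}, 95% upper bound: {upper:.2%}")
```

With a single observed failure, the upper bound is several times the point estimate, which is the statistical version of the point above: a few hundred drives over a couple of years is not enough to publish a tight failure rate.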
 