Very excited, first Intel DC S3700 failing

Discussion in 'Hard Drives and Solid State Drives' started by Patrick, Apr 24, 2017.

  1. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,543
    Likes Received:
    4,467
    From a longer-term perspective, a reminder came up that we needed to do a refresh of Used enterprise SSDs: Dissecting our production SSD population

    As I was looking at the data a few weeks ago, we still had not experienced an Intel SSD failure. In fact, our failures in the three quarters after that article were absolutely minimal.

    Just when I thought all was lost, I received this notification last evening:
    Code:
    This message was generated by the smartd daemon running on:
       host name:  fmt-pve-07
       DNS domain: servethebiz.com
    The following warning/error was logged by the smartd daemon:
    Device: /dev/sdf [SAT], FAILED SMART self-check. BACK UP DATA NOW!
    Device info:
    INTEL SSDSC2BA400G3E, S/N:_____________, WWN:_________, FW:5DV10250, 400 GB
    For details see host's SYSLOG.
    You can also use the smartctl utility for further investigation.
    Another message will be sent in 24 hours if the problem persists.
    An Intel DC S3700 400GB failing!

    For those who are not familiar, this is one of the Proxmox VE drive failure alert features in newer versions. I do not remember when they added SMART, but I do know it is in PVE 4.4 and was not in PVE 4.1.
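For anyone who wants to feed these alerts into their own tooling, here is a minimal sketch that pulls the host, device, and message out of an alert body like the one quoted above. The function name `parse_smartd_alert` and the regular expressions are illustrative assumptions based only on the text of this one message, not a documented smartd format, so other smartd versions or alert types may need different patterns.

```python
import re

# Sample text copied from the alert quoted above (serial number and WWN lines omitted).
SAMPLE_ALERT = """\
This message was generated by the smartd daemon running on:
   host name:  fmt-pve-07
   DNS domain: servethebiz.com
The following warning/error was logged by the smartd daemon:
Device: /dev/sdf [SAT], FAILED SMART self-check. BACK UP DATA NOW!
"""


def parse_smartd_alert(text):
    """Return host, device, device type, and message from an alert body.

    Hypothetical helper: the patterns are inferred from the sample above.
    """
    host = re.search(r"host name:\s*(\S+)", text)
    # "." does not match newlines, so the message group stops at end of line.
    device = re.search(r"Device:\s*(\S+)\s*\[([^\]]+)\],\s*(.*)", text)
    if host is None or device is None:
        raise ValueError("text does not look like a smartd alert")
    return {
        "host": host.group(1),
        "device": device.group(1),
        "type": device.group(2),
        "message": device.group(3).strip(),
    }


if __name__ == "__main__":
    print(parse_smartd_alert(SAMPLE_ALERT))
```

The parsed device path can then be handed to smartctl for the further investigation the alert itself suggests.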
     
    #1
    balamit, William and niekbergboer like this.
  2. Logan

    Logan Member

    Joined:
    Feb 22, 2017
    Messages:
    56
    Likes Received:
    7
    That's a nice alert feature. TBW?
     
    #2
  3. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,543
    Likes Received:
    4,467
    I am going to pull the drive and double-check.

    I think it is an actual failure, not a wear-out issue. 63 power cycles and 1 week shy of 2 years.

    We are over 200 deployed Intel SSDs at this point, so finally seeing one fail is good.
     
    #3
  4. Logan

    Logan Member

    Joined:
    Feb 22, 2017
    Messages:
    56
    Likes Received:
    7
    Would be interesting to understand the cause. Why do these tend to fail, if not from wear? Overheating? Over/undervoltage?

    Still under warranty? I recently bought a used S3700 400GB and Intel's warranty site shows coverage estimated into 2018: Warranty Information
     
    #4
  5. MiniKnight

    MiniKnight Well-Known Member

    Joined:
    Mar 30, 2012
    Messages:
    2,941
    Likes Received:
    857
    You're an odd duck.
     
    #5
  6. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,782
    Likes Received:
    1,457
    Do you have temp history? I'd be curious about that. As I'm sure you're aware the S3700 can get extremely hot during writes if not cooled properly.
     
    #6
  7. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,543
    Likes Received:
    4,467
    It says every drive is 19-23C in that chassis right now. It is actually just an OS boot drive, so it sees very few "heavy" writes. The S3700 is a very cool running drive compared to the XS1715 or the P3700. NVMe drives can hit 25W while the S3700 is a small fraction of that.
     
    #7
  8. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,782
    Likes Received:
    1,457
    Not to get off track, but that's why I specified "if not cooled properly" ;) I've seen them approach max temp in small cases with improper airflow, and wasn't sure if you jammed them into a 1U or something :) thus the question ;)
     
    #8
  9. William

    William Active Member

    Joined:
    May 7, 2015
    Messages:
    748
    Likes Received:
    228
    Only Patrick gets excited about a drive failing LOL
    The rest of us go... Oh Crap, I better do something fast.
     
    #9
  10. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,543
    Likes Received:
    4,467
    2U Intel WT chassis. Cooling is not an issue.

    @William - designed to have drives fail in place.
     
    #10
    T_Minus likes this.
  11. Son of Homer

    Son of Homer Member

    Joined:
    May 9, 2016
    Messages:
    173
    Likes Received:
    15
    Looking forward to the autopsy report.
     
    #11
  12. Son of Homer

    Son of Homer Member

    Joined:
    May 9, 2016
    Messages:
    173
    Likes Received:
    15
    Are there any long-term studies on SSD longevity with large numbers of drives, similar to the Backblaze and Google studies on hard drives?
     
    #12
  13. ATS

    ATS Member

    Joined:
    Mar 9, 2015
    Messages:
    96
    Likes Received:
    32
    Sometimes it is good to see a failure, just so you know that everything around a failure actually works. Like, say, knowing there is a failure.
     
    #13
    Patrick likes this.
  14. ATS

    ATS Member

    Joined:
    Mar 9, 2015
    Messages:
    96
    Likes Received:
    32
    Unlikely to see those outside of the vendors atm. In order to do reasonable releases you need a pretty high volume and a reasonable failure history. It is unlikely that any given customer at this point has had a large enough population for a long enough time to collect data worth releasing. Widespread use in servers has been fairly recent, and the types of systems used by early adopters have likely already been replaced by newer, faster hardware due to the ramp of performance over the last 3-4 years. It has only recently become reasonable to provision all-SSD systems outside of niche roles.
     
    #14
    Son of Homer likes this.
  15. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,543
    Likes Received:
    4,467
    @ATS is totally on point here.

    @Son of Homer We will have an update to our SSD population reliability figures next quarter. A quick guess will be that it is something like 7-9 million working hours total logged.

    What I can say is this: SSD failure rates are under 1/10th of hard drive failure rates.
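For readers who want to turn logged hours into a comparable rate, here is a minimal sketch of the usual annualized failure rate arithmetic. The 8 million device-hours is simply the midpoint of the rough guess above, and the failure count of 1 is a placeholder for illustration only, not a figure from our population data.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annualized_failure_rate(failures, device_hours):
    """Failures per device-year, given total logged device-hours."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return failures / device_hours * HOURS_PER_YEAR


if __name__ == "__main__":
    # Placeholder inputs: 1 failure (assumed) over 8 million hours (midpoint of the guess).
    afr = annualized_failure_rate(failures=1, device_hours=8_000_000)
    print(f"AFR: {afr:.4%}")
```

The same function works for a hard drive population, which makes the two rates directly comparable once real counts are available.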
     
    #15
  16. Son of Homer

    Son of Homer Member

    Joined:
    May 9, 2016
    Messages:
    173
    Likes Received:
    15
    Thanks Patrick. That is very useful information, and the updated report will be interesting.
     
    #16