SAS drives with high ECC corrected errors

Discussion in 'Hard Drives and Solid State Drives' started by levak, Sep 27, 2015.

  1. levak

    levak Member

    Joined:
    Sep 22, 2013
    Messages:
    49
    Likes Received:
    10
    Hello!

    I have the following new setup:
    • server with LSI 9207 HBA
    • Supermicro 837E26-RJBOD1 28bay JBOD
    • 28x Seagate Enterprise capacity 3.5 HDD v4 4TB SAS drives
    All drives are brand new and arrived a few days ago from Supermicro, sealed in antistatic bags.

    I mounted them all in the JBOD and started running a badblocks test on them. After one and a half passes, I checked the SMART stats and saw LOTS of ECC-corrected errors.

    Code:
    smartctl -a /dev/sdh
    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.el7.x86_64] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Vendor:  SEAGATE
    Product:  ST4000NM0034
    Revision:  E001
    User Capacity:  4,000,787,030,016 bytes [4.00 TB]
    Logical block size:  512 bytes
    Physical block size:  4096 bytes
    Lowest aligned LBA:  0
    Logical block provisioning type unreported, LBPME=0, LBPRZ=0
    Rotation Rate:  7200 rpm
    Form Factor:  3.5 inches
    Logical Unit id:  0x5000c5008375b0db
    Serial number:  Z4F03BS20000R524FN4B
    Device type:  disk
    Transport protocol:  SAS
    Local Time is:  Sun Sep 27 12:37:16 2015 CEST
    SMART support is:  Available - device has SMART capability.
    SMART support is:  Enabled
    Temperature Warning:  Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART Health Status: OK
    
    Current Drive Temperature:  39 C
    Drive Trip Temperature:  60 C
    
    Manufactured in week 15 of year 2015
    Specified cycle count over device lifetime:  10000
    Accumulated start-stop cycles:  68
    Specified load-unload count over device lifetime:  300000
    Accumulated load-unload cycles:  74
    Elements in grown defect list: 0
    
    Vendor (Seagate) cache information
      Blocks sent to initiator = 3519259482
      Blocks received from initiator = 1152791088
      Blocks read from cache and sent to initiator = 3489960
      Number of read and write commands whose size <= segment size = 8488
      Number of read and write commands whose size > segment size = 1
    
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 47.02
      number of minutes until next internal SMART test = 50
    
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:   1953503667         0         0  1953503667            0     32007.073            0
    write:           0         0         0           0            0      4988.356            0
    
    Non-medium error count:  0
    
    [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
    SMART Self-test log
    Num  Test              Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                  number   (hours)
    # 1  Background short  Completed        -         5              - [-   -    -]
    Long (extended) Self Test duration: 24300 seconds [405.0 minutes]
    
    If I look at some of the other drives:
    Code:
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:      3886545         0         0     3886545            0     32007.052            0
    write:           0         0         0           0            0      4985.349            0
    Non-medium error count:  0
    
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:   1953499760         0         0  1953499760            0     32007.046            0
    write:           0         0         0           0            0      4992.851            0
    Non-medium error count:  0
    
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:      3716742         0         0     3716742            0     32006.910            0
    write:           0         0         0           0            0      4981.957            0
    Non-medium error count:  0
    
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:   1953490099         0         0  1953490099            0     32006.947            0
    write:           0         0         0           0            0      4976.416            0
    Non-medium error count:  0
    
    I also tried a brand new hard drive, out of the box. It reports the following data:
    Code:
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 0.25
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:         2266         0         0        2266            0         0.037            0
    write:           0         0         0           0            0         0.005            0
    Non-medium error count:  0
    
    Are those numbers normal?
    I also looked at the SMART stats of the server's 10k SAS drives, and they report 0 ECC-corrected errors.
    Is it possible that 112 drives are bad?
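    One rough way to compare the drives is to normalize the corrected-error count by the amount of data read. A minimal Python sketch, assuming the row layout of the smartctl "Error counter log" shown in this post; the names `parse_error_counters` and `corrected_per_gb` are made up for illustration and are not part of smartmontools:

```python
import re

# Matches one data row of smartctl's SCSI "Error counter log" table, e.g.
# "read:  1953503667  0  0  1953503667  0  32007.073  0"
ROW = re.compile(
    r"^\s*(read|write|verify):\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\d+)\s*$",
    re.MULTILINE,
)

def parse_error_counters(text):
    """Return {operation: counters} for each row found in smartctl -a text."""
    rows = {}
    for m in ROW.finditer(text):
        rows[m.group(1)] = {
            "ecc_fast": int(m.group(2)),
            "ecc_delayed": int(m.group(3)),
            "rereads_rewrites": int(m.group(4)),
            "total_corrected": int(m.group(5)),
            "algorithm_invocations": int(m.group(6)),
            "gigabytes": float(m.group(7)),
            "uncorrected": int(m.group(8)),
        }
    return rows

def corrected_per_gb(counters):
    """Corrected errors per 10^9 bytes processed, or None if nothing was processed."""
    if counters["gigabytes"] == 0:
        return None
    return counters["total_corrected"] / counters["gigabytes"]

# Read rows copied from three of the drives in this post.
samples = {
    "sdh": "read:  1953503667  0  0  1953503667  0  32007.073  0",
    "other drive": "read:  3886545  0  0  3886545  0  32007.052  0",
    "new drive": "read:  2266  0  0  2266  0  0.037  0",
}
for name, line in samples.items():
    read = parse_error_counters(line)["read"]
    rate = corrected_per_gb(read)
    print(f"{name}: {rate:.0f} corrected/GB, {read['uncorrected']} uncorrected")
```

    On these three rows, the first drive and the out-of-the-box drive both come out at roughly 61,000 corrected errors per GB read, while the second drive is at roughly 120. All three report 0 uncorrected errors.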

    Matej
     
    #1
  2. Stanza

    Stanza Active Member

    Joined:
    Jan 11, 2014
    Messages:
    205
    Likes Received:
    40
    Bad Card / Bad Cable / Dicky PSU ?
     
    #2
  3. levak

    This is happening on:
    4 different JBODs (2 brand new, 2 old ones)
    4 different SFF external cables (although, all are new)
    2 different LSI HBA cards
    122 hard drives

    Matej
     
    #3
  4. bds1904

    bds1904 Active Member

    Joined:
    Aug 30, 2013
    Messages:
    271
    Likes Received:
    76
    What firmware is on the HBA? I had a 9200-8e arrive with the latest firmware and it did the exact same thing on OmniOS.
     
    #4
  5. levak

    I used P18 and P19. I contacted my reseller and he is working this out with Supermicro.

    I'm wondering what the outcome will be.
     
    #5
  6. abstractalgebra

    Joined:
    Dec 3, 2013
    Messages:
    169
    Likes Received:
    21
    Anything electrically connected can cause errors. I would isolate things down to the bare minimum and then add them back one at a time.
    Start with one drive connected to power and data in only one JBOD, remove the other controller, etc. Can you connect one drive directly to the controller and/or bypass the backplane?

    Please let us know how this develops.
     
    #6
  7. levak

    It looks like ST4000NM0034 is not playing well with the backplane of the JBOD.

    They will replace all drives with a SAS2 model, and that should fix the issue.

    Too bad I don't have any other SAS3 drives (from another manufacturer) to test, but the SAS2 HGST is working like a charm.

    Matej
     
    #7
  8. levak

    Today I got a new batch of hard drives, this time Seagate ST4000NM0023.

    Same problems! As soon as I start reading from the drives, the counter goes crazy. There are no problems with writing. Actually, just inserting a brand new hard drive into the hot-swap slot and running smartctl already gives me about 6000 errors.

    Oh, and another thing: the HGST Ultrastar 7K4000 SAS2 4TB HUS724040ALS640 drives work without a problem; the counter stays at 0.

    I will try a totally different SM case on Thursday, in case all 4 JBODs, the server, or the controller are broken. I will also try putting the drive into an Infortrend SAN and do some reading there to see if the counters go up.

    Matej
     
    #8
  9. MikeC

    MikeC Member

    Joined:
    Apr 27, 2013
    Messages:
    59
    Likes Received:
    11
    This may be relevant...

    Seagate SER, RRER & HEC
     
    #9
  10. levak

    Yea, this report is for SATA drives. I don't know if it's relevant for SAS drives, since they report different SMART counters.

    I guess I will have to check with Seagate as well.

    Matej
     
    #10
  11. levak

    Today I tested the same drive in 3 different JBODs with 2 different servers and 3 different controllers (but all same brand/model/firmware).

    I remembered I also have a brand new IBM server in the rack with 3.5" hard drives. I plugged my Seagate in and powered it on. When the system had booted, I did some dd reads from the disk and checked the SMART stats. The errors were again through the roof, but then I checked the other drives in the server, which are also IBM-branded Seagates, and their counters are high as well.

    SMART data from the Seagate I tested in the other JBODs:
    Code:
    === START OF INFORMATION SECTION ===
    Vendor:  SEAGATE
    Product:  ST4000NM0034
    Revision:  E001
    User Capacity:  4,000,787,030,016 bytes [4.00 TB]
    Logical block size:  512 bytes
    Physical block size:  4096 bytes
    Lowest aligned LBA:  0
    Logical block provisioning type unreported, LBPME=0, LBPRZ=0
    Rotation Rate:  7200 rpm
    Form Factor:  3.5 inches
    Logical Unit id:  0x5000c500837568b7
    Device type:  disk
    Transport protocol:  SAS
    Local Time is:  Fri Oct 16 01:22:58 2015 CEST
    SMART support is:  Available - device has SMART capability.
    SMART support is:  Enabled
    Temperature Warning:  Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART Health Status: OK
    
    Current Drive Temperature:  28 C
    Drive Trip Temperature:  60 C
    
    Manufactured in week 15 of year 2015
    Specified cycle count over device lifetime:  10000
    Accumulated start-stop cycles:  70
    Specified load-unload count over device lifetime:  300000
    Accumulated load-unload cycles:  96
    Elements in grown defect list: 0
    
    Vendor (Seagate) cache information
      Blocks sent to initiator = 3531092253
      Blocks received from initiator = 2485440232
      Blocks read from cache and sent to initiator = 3952476
      Number of read and write commands whose size <= segment size = 449484
      Number of read and write commands whose size > segment size = 4437
    
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 167.23
      number of minutes until next internal SMART test = 0
    
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:   1956084437         0         0  1956084437            0     32055.540            0
    write:           0         0         0           0            0      5678.375            0
    verify:        634         0         0         634            0         0.000            0
    
    Non-medium error count:  0
    
    Brand new IBM branded Seagate:
    Code:
    Vendor:  LENOVO-X
    Product:  ST300MM0006
    Revision:  L56Q
    User Capacity:  300,000,000,000 bytes [300 GB]
    Logical block size:  512 bytes
    Formatted with type 2 protection
    Logical block provisioning type unreported, LBPME=0, LBPRZ=0
    Rotation Rate:  10500 rpm
    Form Factor:  2.5 inches
    Logical Unit id:  0x5000c5008e4287cb
    Device type:  disk
    Transport protocol:  SAS
    Local Time is:  Fri Oct 16 01:22:52 2015 CEST
    SMART support is:  Available - device has SMART capability.
    SMART support is:  Enabled
    Temperature Warning:  Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART Health Status: OK
    
    Current Drive Temperature:  31 C
    Drive Trip Temperature:  65 C
    
    Elements in grown defect list: 0
    
    Vendor (Seagate) cache information
      Blocks sent to initiator = 0
    
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 10.40
      number of minutes until next internal SMART test = 0
    
    Error counter log:
                 Errors Corrected by             Total   Correction     Gigabytes        Total
                     ECC          rereads/      errors    algorithm     processed  uncorrected
                  fast | delayed  rewrites   corrected  invocations  [10^9 bytes]       errors
    read:    167106149         0         0   167106149            0        53.495            0
    write:           0         0         0           0            0         2.297            0
    verify:       7808         0         0        7808            0         0.002            0
    
    Non-medium error count:  8
    
     
    #11
  12. canta

    canta Well-Known Member

    Joined:
    Nov 26, 2014
    Messages:
    1,012
    Likes Received:
    184
    I am thinking the Seagate firmware may not be playing nicely with the SAS2 HBA, since the HUS724040ALS640 is SAS2.

    Make sure "Elements in grown defect list" is not growing.

    Are there any errors in the system messages related to the HBA card and the Seagate SAS3 drives?
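    Checking whether those counters grow can be done by comparing two smartctl snapshots of the same drive. A minimal Python sketch, assuming the field labels printed in the smartctl output earlier in this thread; `parse_defect_counters` and `grown_since` are hypothetical helper names, and the second snapshot below is invented purely to show a non-empty result:

```python
import re

# Labels as they appear in smartctl -a output for SAS drives in this thread.
FIELDS = {
    "grown_defects": r"Elements in grown defect list:\s*(\d+)",
    "non_medium_errors": r"Non-medium error count:\s*(\d+)",
}

def parse_defect_counters(text):
    """Extract the counters named in FIELDS from smartctl text (None if absent)."""
    out = {}
    for key, pattern in FIELDS.items():
        m = re.search(pattern, text)
        out[key] = int(m.group(1)) if m else None
    return out

def grown_since(earlier, later):
    """Return {counter: increase} for counters that rose between two snapshots."""
    return {
        key: later[key] - earlier[key]
        for key in earlier
        if earlier[key] is not None
        and later.get(key) is not None
        and later[key] > earlier[key]
    }

# First snapshot: values as reported in the thread's first smartctl dump.
before = parse_defect_counters(
    "Elements in grown defect list: 0\nNon-medium error count:  0\n"
)
# Second snapshot: HYPOTHETICAL values, not taken from any drive in the thread.
after = parse_defect_counters(
    "Elements in grown defect list: 3\nNon-medium error count:  0\n"
)
print(grown_since(before, after))
```

    An empty result means neither counter increased between the two snapshots.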
     
    #12
  13. levak

    There are no errors on the system...

    I have both SAS2 (ST4000NM0023) and SAS3 (ST4000NM0034) drives, and they all produce the same amount of errors. I will try to get newer firmware, and I hope to get the new LSI SAS9300 HBA.

    Matej
     
    #13
  14. levak

    I heard back from Seagate. They say this is a normal value for hard drives and that Seagate drives report raw error counts, whereas other manufacturers report filtered values or rates...

    I got a green light from them and we can proceed with the setup.
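    In line with Seagate's answer, a check could ignore the raw corrected-ECC counts and look only at the health status and the uncorrected-error column. A minimal Python sketch under that assumption; the `triage` function and its rules are illustrative only, not a Seagate or smartmontools recommendation:

```python
import re

def triage(smartctl_text):
    """Return a list of reasons for concern; an empty list means none were found.

    Assumption for illustration: corrected-ECC counts are ignored, and only
    the health status and the "Total uncorrected errors" column are checked.
    """
    reasons = []
    health = re.search(r"SMART Health Status:[ \t]*(.+)", smartctl_text)
    if health and health.group(1).strip() != "OK":
        reasons.append(f"health status is {health.group(1).strip()}")
    # The last field of each read/write/verify row is the uncorrected total.
    rows = re.findall(
        r"^[ \t]*(read|write|verify):.*[ \t](\d+)[ \t]*$",
        smartctl_text,
        re.MULTILINE,
    )
    for op, last in rows:
        if int(last) > 0:
            reasons.append(f"{int(last)} uncorrected {op} errors")
    return reasons

# Values copied from the first smartctl dump in this thread.
from_thread = """SMART Health Status: OK
read:  1953503667  0  0  1953503667  0  32007.073  0
write:  0  0  0  0  0  4988.356  0
"""
# HYPOTHETICAL output with a non-zero uncorrected count, for contrast.
made_up = """SMART Health Status: OK
read:  1953503667  0  0  1953503667  0  32007.073  2
"""
print(triage(from_thread))
print(triage(made_up))
```

    With this rule the drive from the first post raises no flags despite its large corrected count, while the invented second sample is flagged for its uncorrected read errors.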

    Matej
     
    #14