SAS drives with high ECC corrected errors

levak · Sep 27, 2015

Hello!

I have the following new setup:

server with LSI 9207 HBA
Supermicro 837E26-RJBOD1 28bay JBOD
28x Seagate Enterprise capacity 3.5 HDD v4 4TB SAS drives

All drives are brand new and arrived a few days ago from Supermicro, sealed in antistatic bags.

I mounted them all in the JBOD and started running a badblocks test on them. After one and a half passes, I checked the SMART stats and saw LOTS of ECC corrected errors.

Code:

smartctl -a /dev/sdh
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:  SEAGATE
Product:  ST4000NM0034
Revision:  E001
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Logical block size:  512 bytes
Physical block size:  4096 bytes
Lowest aligned LBA:  0
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Logical Unit id:  0x5000c5008375b0db
Serial number:  Z4F03BS20000R524FN4B
Device type:  disk
Transport protocol:  SAS
Local Time is:  Sun Sep 27 12:37:16 2015 CEST
SMART support is:  Available - device has SMART capability.
SMART support is:  Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:  39 C
Drive Trip Temperature:  60 C

Manufactured in week 15 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  68
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  74
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3519259482
  Blocks received from initiator = 1152791088
  Blocks read from cache and sent to initiator = 3489960
  Number of read and write commands whose size <= segment size = 8488
  Number of read and write commands whose size > segment size = 1

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 47.02
  number of minutes until next internal SMART test = 50

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1953503667        0         0  1953503667          0      32007.073           0
write:           0        0         0           0          0       4988.356           0

Non-medium error count:  0

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       5                 - [-   -    -]
Long (extended) Self Test duration: 24300 seconds [405.0 minutes]

If I look at some of the other drives:

Code:

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      3886545        0         0     3886545          0      32007.052           0
write:           0        0         0           0          0       4985.349           0
Non-medium error count:  0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1953499760        0         0  1953499760          0      32007.046           0
write:           0        0         0           0          0       4992.851           0
Non-medium error count:  0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      3716742        0         0     3716742          0      32006.910           0
write:           0        0         0           0          0       4981.957           0
Non-medium error count:  0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1953490099        0         0  1953490099          0      32006.947           0
write:           0        0         0           0          0       4976.416           0
Non-medium error count:  0

I also tried a brand new hard drive, out of the box. It reports the following data:

Code:

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 0.25
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:         2266        0         0        2266          0          0.037           0
write:           0        0         0           0          0          0.005           0
Non-medium error count:  0

Are those numbers normal?
I also looked at the SMART stats of the 10k SAS drives in the server, and they report 0 ECC corrected errors.
Is it possible that 112 drives are bad?
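
To compare drives that have read different amounts of data, it can help to normalize the counter by the "Gigabytes processed" column. The following is only a minimal sketch, assuming the row layout shown in the smartctl output above; the function names `parse_error_counter_row` and `corrected_per_gb` are made up for illustration and are not part of smartmontools.

```python
import re

# Column order of one row in smartctl's "Error counter log" for SAS drives,
# as it appears in the output quoted in this thread.
FIELDS = (
    "ecc_fast", "ecc_delayed", "rereads_rewrites",
    "total_corrected", "algorithm_invocations",
    "gigabytes", "uncorrected",
)

def parse_error_counter_row(smart_text, op="read"):
    """Return one row ('read', 'write' or 'verify') as a dict, or None."""
    match = re.search(rf"^\s*{op}:\s+(.+)$", smart_text, re.MULTILINE)
    if match is None:
        return None
    values = match.group(1).split()
    row = {}
    for name, value in zip(FIELDS, values):
        # Only the gigabytes column is fractional; the rest are counters.
        row[name] = float(value) if name == "gigabytes" else int(value)
    return row

def corrected_per_gb(row):
    """Corrected errors per 10^9 bytes processed, or None if nothing was read."""
    if row["gigabytes"] == 0:
        return None
    return row["total_corrected"] / row["gigabytes"]

# Rows copied from the drives in this post.
used_drive = "read:  1953503667  0  0  1953503667  0  32007.073  0\n"
new_drive = "read:  2266  0  0  2266  0  0.037  0\n"

for label, text in (("after badblocks", used_drive), ("out of the box", new_drive)):
    row = parse_error_counter_row(text)
    print(label, round(corrected_per_gb(row)))
```

For the two rows above, both drives come out at roughly 61,000 corrected errors per gigabyte read, which suggests the counter on these two scales with the amount of data read rather than with drive age.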

Matej

Stanza · Sep 27, 2015

Bad Card / Bad Cable / Dicky PSU ?

levak · Sep 27, 2015

This is happening on:
4 different JBODs (2 brand new, 2 old ones)
4 different SFF external cables (although, all are new)
2 different LSI HBA cards
122 hard drives

Matej

bds1904 · Sep 29, 2015

What firmware is on the hba? I had a 9200-8e arrive with the latest firmware and it did the exact same thing on omnios.

levak · Sep 29, 2015

I used P18 and P19. I contacted my reseller and he is working this out with Supermicro.

I'm wondering what the outcome will be.

abstractalgebra · Sep 29, 2015

Anything electrically connected can cause errors. I would isolate things down to the bare minimum and then add them back one at a time.
Connect one drive to the power/data in only one JBOD. Remove the other controller, etc. Can you connect one drive directly to the controllers and/or bypass the backplane?

Please let us know how this develops.

levak · Oct 3, 2015

It looks like ST4000NM0034 is not playing well with the backplane of the JBOD.

They will replace all drives with a SAS2 model and that should fix the issue.

Too bad I don't have any other SAS3 drives (from another manufacturer) to test, but the SAS2 HGST is working like a charm.

Matej

levak · Oct 13, 2015

Today I got a new batch of hard drives, this time Seagate ST4000NM0023.

Same problems! As soon as I start reading from the drives, the counter goes crazy. No problems with writing. Actually, just inserting a brand new hard drive into the hot-swap slot and running smartctl already gives me about 6000 errors.

Oh, and another thing: the HGST Ultrastar 7K4000 SAS2 4TB HUS724040ALS640 drives are working without a problem; the counter stays at 0.

I will try a totally different SM case on Thursday, in case all 4 JBODs/servers/controllers are broken. I will also try putting the drive into an Infortrend SAN and doing some reads there to see if the counters go up.
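
One way to check whether the counter climbs during reads in each enclosure is to snapshot the "Total errors corrected" value before and after a read test and look at the difference. This is a rough sketch, assuming smartctl is installed and run with sufficient privileges; the helper names and the `/dev/sdX` path are placeholders for illustration, not anything taken from the setup above.

```python
import re
import subprocess

def read_corrected_total(smart_text):
    """Extract 'Total errors corrected' from the read row of the error counter log."""
    match = re.search(r"^\s*read:\s+(.+)$", smart_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'read:' row found in smartctl output")
    # Fourth numeric column of the row is the total corrected count.
    return int(match.group(1).split()[3])

def smartctl_output(device):
    """Run 'smartctl -a' on a device and return its text output."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # is not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

def counter_delta(before_text, after_text):
    """Number of corrected read errors logged between two snapshots."""
    return read_corrected_total(after_text) - read_corrected_total(before_text)

# Usage on a live system (placeholder device path):
#   before = smartctl_output("/dev/sdX")
#   ... run the read test against the drive ...
#   after = smartctl_output("/dev/sdX")
#   print(counter_delta(before, after))
```

Running the same read test in each chassis and comparing the deltas would show whether the enclosure makes any difference at all.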

Matej

MikeC · Oct 14, 2015

levak said:
Today I got a new batch of hard drives, this time Seagate ST4000NM0023.

Same problems! As soon as I start reading from the drives, the counter goes crazy. No problems with writing. Actually, just inserting a brand new hard drive into the hot-swap slot and running smartctl already gives me about 6000 errors.

Oh, and another thing: the HGST Ultrastar 7K4000 SAS2 4TB HUS724040ALS640 drives are working without a problem; the counter stays at 0.

I will try a totally different SM case on Thursday, in case all 4 JBODs/servers/controllers are broken. I will also try putting the drive into an Infortrend SAN and doing some reads there to see if the counters go up.

Matej

This may be relevant...

Seagate SER, RRER & HEC

levak · Oct 14, 2015

Yea, this report is for SATA drives. I don't know if it's relevant for SAS drives, since they report different SMART counters.

I guess I will have to check with Seagate as well.

Matej

levak · Oct 15, 2015

Today I tested the same drive in 3 different JBODs with 2 different servers and 3 different controllers (but all same brand/model/firmware).

I remembered I also have a brand new IBM server in the rack with 3.5" hard drives. I plugged my Seagate in and powered it on. When the system booted, I did some dd reads from the disk and checked the SMART stats. The errors were again through the roof, but then I checked the other drives in the server, which are also IBM-branded Seagates, and their counters are high as well.

SMART output from the Seagate I tested in the other JBODs:

Code:

=== START OF INFORMATION SECTION ===
Vendor:  SEAGATE
Product:  ST4000NM0034
Revision:  E001
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Logical block size:  512 bytes
Physical block size:  4096 bytes
Lowest aligned LBA:  0
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Logical Unit id:  0x5000c500837568b7
Device type:  disk
Transport protocol:  SAS
Local Time is:  Fri Oct 16 01:22:58 2015 CEST
SMART support is:  Available - device has SMART capability.
SMART support is:  Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:  28 C
Drive Trip Temperature:  60 C

Manufactured in week 15 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  70
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  96
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3531092253
  Blocks received from initiator = 2485440232
  Blocks read from cache and sent to initiator = 3952476
  Number of read and write commands whose size <= segment size = 449484
  Number of read and write commands whose size > segment size = 4437

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 167.23
  number of minutes until next internal SMART test = 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1956084437        0         0  1956084437          0      32055.540           0
write:           0        0         0           0          0       5678.375           0
verify:        634        0         0         634          0          0.000           0

Non-medium error count:  0

Brand new IBM branded Seagate:

Code:

Vendor:  LENOVO-X
Product:  ST300MM0006
Revision:  L56Q
User Capacity:  300,000,000,000 bytes [300 GB]
Logical block size:  512 bytes
Formatted with type 2 protection
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:  10500 rpm
Form Factor:  2.5 inches
Logical Unit id:  0x5000c5008e4287cb
Device type:  disk
Transport protocol:  SAS
Local Time is:  Fri Oct 16 01:22:52 2015 CEST
SMART support is:  Available - device has SMART capability.
SMART support is:  Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:  31 C
Drive Trip Temperature:  65 C

Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 10.40
  number of minutes until next internal SMART test = 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    167106149        0         0   167106149          0         53.495           0
write:           0        0         0           0          0          2.297           0
verify:       7808        0         0        7808          0          0.002           0

Non-medium error count:  8

canta · Oct 15, 2015

I'm thinking the Seagate firmware is not playing nicely with the SAS2 HBA, since the HUS724040ALS640 is SAS2.

Make sure "Elements in grown defect list" is not growing.

Are there any errors in the system messages related to the HBA card and the Seagate SAS3 drives?
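
As a rough illustration of the grown-defect check, the value can be pulled out of successive smartctl dumps and compared over time. This is just a sketch under the assumption that the line is formatted as in the outputs posted in this thread; `grown_defects` and `defects_growing` are hypothetical helper names.

```python
import re

def grown_defects(smart_text):
    """Return the 'Elements in grown defect list' value, or None if absent."""
    match = re.search(r"Elements in grown defect list:\s*(\d+)", smart_text)
    return int(match.group(1)) if match else None

def defects_growing(snapshots):
    """True if the grown defect count increases between consecutive snapshots.

    'snapshots' is a list of smartctl text dumps of the same drive,
    ordered from oldest to newest.
    """
    counts = [grown_defects(text) for text in snapshots]
    counts = [c for c in counts if c is not None]
    return any(later > earlier for earlier, later in zip(counts, counts[1:]))

# The value reported by the drives in this thread so far.
snapshot = "Elements in grown defect list: 0\n"
print(defects_growing([snapshot, snapshot]))
```

A count that stays flat while the ECC counter climbs would point away from real media damage.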

levak · Oct 15, 2015

There are no errors on the system...

I have both SAS2 (ST4000NM0023) and SAS3 (ST4000NM0034) drives, and they all produce the same amount of errors. I will try to get newer firmware, and I hope to get the new LSI SAS9300 HBA.

Matej

levak · Oct 22, 2015

I heard back from Seagate. They are saying that this is a normal value for hard drives and that Seagate drives report raw error values, rather than a filtered value or a rate as other manufacturers do...

I got a green light from them and we can proceed with the setup.

Matej
