SAS drives with high ECC corrected errors

levak

Member
Sep 22, 2013
49
10
8
Hello!

I have the following new setup:
  • server with LSI 9207 HBA
  • Supermicro 837E26-RJBOD1 28bay JBOD
  • 28x Seagate Enterprise capacity 3.5 HDD v4 4TB SAS drives
All drives are brand new and arrived a few days ago from Supermicro, sealed in antistatic bags.

I mounted them all in the JBOD and started running a badblocks test on them. After one and a half passes, I checked the SMART stats and saw LOTS of ECC-corrected errors.

Code:
smartctl -a /dev/sdh
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:  SEAGATE
Product:  ST4000NM0034
Revision:  E001
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Logical block size:  512 bytes
Physical block size:  4096 bytes
Lowest aligned LBA:  0
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Logical Unit id:  0x5000c5008375b0db
Serial number:  Z4F03BS20000R524FN4B
Device type:  disk
Transport protocol:  SAS
Local Time is:  Sun Sep 27 12:37:16 2015 CEST
SMART support is:  Available - device has SMART capability.
SMART support is:  Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:  39 C
Drive Trip Temperature:  60 C

Manufactured in week 15 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  68
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  74
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3519259482
  Blocks received from initiator = 1152791088
  Blocks read from cache and sent to initiator = 3489960
  Number of read and write commands whose size <= segment size = 8488
  Number of read and write commands whose size > segment size = 1

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 47.02
  number of minutes until next internal SMART test = 50

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1953503667        0         0  1953503667          0      32007.073           0
write:           0        0         0           0          0       4988.356           0

Non-medium error count:  0

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       5                 - [-   -    -]
Long (extended) Self Test duration: 24300 seconds [405.0 minutes]
If I look at some of the other drives:
Code:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      3886545        0         0     3886545          0      32007.052           0
write:           0        0         0           0          0       4985.349           0
Non-medium error count:  0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1953499760        0         0  1953499760          0      32007.046           0
write:           0        0         0           0          0       4992.851           0
Non-medium error count:  0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      3716742        0         0     3716742          0      32006.910           0
write:           0        0         0           0          0       4981.957           0
Non-medium error count:  0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1953490099        0         0  1953490099          0      32006.947           0
write:           0        0         0           0          0       4976.416           0
Non-medium error count:  0
I also tried a brand new hard drive, out of the box. It reports the following data:
Code:
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 0.25
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:         2266        0         0        2266          0          0.037           0
write:           0        0         0           0          0          0.005           0
Non-medium error count:  0
Are those numbers normal?
I also looked at the SMART stats of the server's 10k SAS drives, and those report 0 ECC-corrected errors.
Is it possible that 112 drives are bad?
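
One way to compare drives that have read very different amounts of data is to normalize the counter by the "Gigabytes processed" column. The following is only a minimal sketch: it assumes the row layout shown in the smartctl output above (seven numeric fields after the label), and the helper names `parse_counter_row` and `corrected_per_gb` are made up for illustration.

```python
def parse_counter_row(line):
    """Parse one 'read:' / 'write:' / 'verify:' row of the error counter log."""
    label, rest = line.split(":", 1)
    fields = rest.split()
    # Assumed column order, taken from the header in the posts above:
    # ECC fast, ECC delayed, rereads/rewrites, total corrected,
    # algorithm invocations, gigabytes processed, total uncorrected.
    return {
        "op": label.strip(),
        "total_corrected": int(fields[3]),
        "gigabytes": float(fields[5]),
        "uncorrected": int(fields[6]),
    }


def corrected_per_gb(row):
    """Corrected errors per 10^9 bytes processed, or None if nothing was processed."""
    if row["gigabytes"] == 0:
        return None
    return row["total_corrected"] / row["gigabytes"]


# Rows copied from the outputs in this post.
rows = [
    "read:  1953503667  0  0  1953503667  0  32007.073  0",
    "read:  3886545  0  0  3886545  0  32007.052  0",
    "read:  2266  0  0  2266  0  0.037  0",
]

for line in rows:
    row = parse_counter_row(line)
    rate = corrected_per_gb(row)
    print(f"{row['total_corrected']:>12} corrected / {row['gigabytes']:>10.3f} GB = {rate:,.1f} per GB")
```

Looked at this way, the out-of-the-box drive and the drives with the largest counters come out in the same order of magnitude per GB read, which hints that the counter scales with the amount of data read rather than with drive age.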

Matej
 

levak

This is happening on:
4 different JBODs (2 brand new, 2 old ones)
4 different SFF external cables (although, all are new)
2 different LSI HBA cards
122 hard drives

Matej
 

bds1904

Active Member
Aug 30, 2013
271
76
28
What firmware is on the HBA? I had a 9200-8e arrive with the latest firmware and it did the exact same thing on OmniOS.
 

levak

I used P18 and P19. I contacted my reseller and he is working this out with Supermicro.

I'm wondering what the outcome will be.
 

abstractalgebra

Active Member
Dec 3, 2013
180
26
28
MA, USA
Anything electrically connected can cause errors. I would isolate things down to the bare minimum and then add them back one at a time.
Start with one drive connected to power/data in only one JBOD. Remove the other controller, etc. Can you direct-connect one drive to the controllers and/or bypass the backplane?

Please let us know how this develops.
 

levak

It looks like ST4000NM0034 is not playing well with the backplane of the JBOD.

They will replace all drives with a SAS2 model and that should fix the issue.

Too bad I don't have any other SAS3 drives (from another manufacturer) to test, but the SAS2 HGST is working like a charm.

Matej
 

levak

Today I got a new batch of hard drives, this time Seagate ST4000NM0023.

Same problems! As soon as I start reading from the drives, the counter goes crazy. No problems with writing. Actually, just inserting a brand new hard drive into the hot-swap slot and running smartctl already gives me about 6000 errors.

Oh, and another thing: HGST Ultrastar 7K4000 SAS2 4TB HUS724040ALS640 drives are working without a problem; their counter stays at 0.

I will try a totally different SM case on Thursday, in case all 4 JBODs/servers/controllers are broken. I will also try putting the drive into an Infortrend SAN and doing some reads there to see if the counters go up.
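
To pin down how much the counter moves during a single read test, it can help to diff two saved smartctl outputs rather than eyeball absolute values. This is a rough sketch only: it assumes the text layout of `smartctl -a` shown earlier in the thread, and the function names and the `before.txt` / `after.txt` file names are placeholders.

```python
import re
import sys

# Matches the 'read:' / 'write:' / 'verify:' rows of the error counter log.
ROW = re.compile(r"^(read|write|verify):[ \t]+(.+)$", re.MULTILINE)


def corrected_totals(smartctl_text):
    """Map each operation to (total errors corrected, gigabytes processed)."""
    totals = {}
    for op, rest in ROW.findall(smartctl_text):
        fields = rest.split()
        if len(fields) >= 7:
            totals[op] = (int(fields[3]), float(fields[5]))
    return totals


def counter_delta(before_text, after_text):
    """Difference between two snapshots of the same drive, per operation."""
    before = corrected_totals(before_text)
    after = corrected_totals(after_text)
    delta = {}
    for op, (errors, gigabytes) in after.items():
        prev_errors, prev_gigabytes = before.get(op, (0, 0.0))
        delta[op] = (errors - prev_errors, round(gigabytes - prev_gigabytes, 3))
    return delta


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python delta.py before.txt after.txt
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        result = counter_delta(f_before.read(), f_after.read())
    for op, (errors, gigabytes) in result.items():
        print(f"{op}: +{errors} corrected errors over {gigabytes} GB")
```

The snapshots can be saved with `smartctl -a /dev/sdX > before.txt` before the read test and again into `after.txt` afterwards, always for the same drive.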

Matej
 

MikeC

Member
Apr 27, 2013
59
11
8
UK
This may be relevant...

Seagate SER, RRER & HEC
 

levak

Yeah, but this report is for SATA drives. I don't know if it's relevant for SAS drives, since they report different SMART counters.

I guess I will have to check with Seagate as well.

Matej
 

levak

Today I tested the same drive in 3 different JBODs with 2 different servers and 3 different controllers (but all same brand/model/firmware).

I remembered I also have a brand new IBM server in the rack with 3.5" hard drives. I plugged my Seagate in and powered it on. Once the system had booted, I did some dd reads from the disk and checked the SMART stats. Errors were again through the roof, but then I checked the other drives in the server, which are also IBM-branded Seagates, and their counters are high as well.

SMART output from the Seagate I tested in the other JBODs:
Code:
=== START OF INFORMATION SECTION ===
Vendor:  SEAGATE
Product:  ST4000NM0034
Revision:  E001
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Logical block size:  512 bytes
Physical block size:  4096 bytes
Lowest aligned LBA:  0
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Logical Unit id:  0x5000c500837568b7
Device type:  disk
Transport protocol:  SAS
Local Time is:  Fri Oct 16 01:22:58 2015 CEST
SMART support is:  Available - device has SMART capability.
SMART support is:  Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:  28 C
Drive Trip Temperature:  60 C

Manufactured in week 15 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  70
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  96
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3531092253
  Blocks received from initiator = 2485440232
  Blocks read from cache and sent to initiator = 3952476
  Number of read and write commands whose size <= segment size = 449484
  Number of read and write commands whose size > segment size = 4437

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 167.23
  number of minutes until next internal SMART test = 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1956084437        0         0  1956084437          0      32055.540           0
write:           0        0         0           0          0       5678.375           0
verify:        634        0         0         634          0          0.000           0

Non-medium error count:  0
Brand new IBM branded Seagate:
Code:
Vendor:  LENOVO-X
Product:  ST300MM0006
Revision:  L56Q
User Capacity:  300,000,000,000 bytes [300 GB]
Logical block size:  512 bytes
Formatted with type 2 protection
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:  10500 rpm
Form Factor:  2.5 inches
Logical Unit id:  0x5000c5008e4287cb
Device type:  disk
Transport protocol:  SAS
Local Time is:  Fri Oct 16 01:22:52 2015 CEST
SMART support is:  Available - device has SMART capability.
SMART support is:  Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:  31 C
Drive Trip Temperature:  65 C

Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 10.40
  number of minutes until next internal SMART test = 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    167106149        0         0   167106149          0         53.495           0
write:           0        0         0           0          0          2.297           0
verify:       7808        0         0        7808          0          0.002           0

Non-medium error count:  8
 

canta

Well-Known Member
Nov 26, 2014
1,034
207
63
41
I suspect the Seagate firmware is not playing nicely with the SAS2 HBA, since the HUS724040ALS640 is SAS2.

Make sure "Elements in grown defect list" is not growing.

Are there any errors in the system messages related to the HBA card and the Seagate SAS3 drives?
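
A small sketch of that check, pulling out the counters that point to real trouble (grown defects and uncorrected errors) rather than the ECC-corrected count. It assumes the `smartctl -a` text layout from the posts above; the function names and the "greater than zero" rule are illustrative assumptions, not an official threshold.

```python
import re

ROW = re.compile(r"^(?:read|write|verify):[ \t]+(.+)$", re.MULTILINE)


def failure_indicators(smartctl_text):
    """Extract grown defects, non-medium errors and total uncorrected errors."""

    def grab(pattern):
        match = re.search(pattern, smartctl_text, re.MULTILINE)
        return int(match.group(1)) if match else None

    uncorrected = 0
    for rest in ROW.findall(smartctl_text):
        fields = rest.split()
        if len(fields) >= 7:
            uncorrected += int(fields[6])  # last column: total uncorrected errors

    return {
        "grown_defects": grab(r"^\s*Elements in grown defect list:\s*(\d+)"),
        "non_medium_errors": grab(r"^\s*Non-medium error count:\s*(\d+)"),
        "uncorrected_errors": uncorrected,
    }


def needs_attention(indicators):
    """Assumed rule of thumb: any grown defect or uncorrected error is worth a look."""
    return (indicators["grown_defects"] or 0) > 0 or indicators["uncorrected_errors"] > 0


# Excerpt copied from the Lenovo-branded drive's output earlier in the thread.
sample = """Elements in grown defect list: 0
read:  167106149  0  0  167106149  0  53.495  0
write:  0  0  0  0  0  2.297  0
verify:  7808  0  0  7808  0  0.002  0
Non-medium error count:  8
"""

report = failure_indicators(sample)
print(report, "needs attention:", needs_attention(report))
```

Running this periodically and comparing `grown_defects` between runs would show whether the list is growing.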
 

levak

There are no errors on the system...

I have both SAS2 (ST4000NM0023) and SAS3 (ST4000NM0034) drives, and they all produce the same amount of errors. I will try to get newer firmware, and I hope to get the new LSI SAS9300 HBA.

Matej
 

levak

I heard back from Seagate. They say these values are normal for hard drives, and that Seagate drives report raw error counts rather than the filtered values or rates that other manufacturers report.

They gave me the green light, so we can proceed with the setup.

Matej