HE10 SAS performance issues

n17ikh · Feb 1, 2021

I've been chasing some performance issues with my NAS lately and could use some additional eyes on this. The hardware is a Supermicro X9DRD-7LN4F-JBOD with a BPN-SAS2-836EL1 backplane, eight 10TB HUH721010AL4204 drives, and eight 3TB mixed SAS and SATA drives, arranged into two md RAID6 arrays (one for the 3TB drives and one for the 10TB drives). After poking around with measurements from ioping, I noticed some suspiciously high latency numbers on one of the 10TB drives (in the tens of seconds while writing).
I also noticed that a whole lot of read errors occurred while running a plain dd if=/dev/sdb of=/dev/null bs=1M - the drive would read for a few GB and then kick back a read error, always on the same sector. Continue past it and it would do the same at the next bad spot. The drive steadfastly refuses to report any bad sectors in SMART data.
I figured I'd try an sg_format to force the drive to reallocate sectors (what could go wrong?).
Well, the sg_format runs (takes about 28 hours - write speed around 100 MB/sec if it's actually writing the whole drive), and then when the drive becomes available again, I get this nonsense:

Code:

% sudo dd if=/dev/zero of=/dev/sdb bs=1M count=1000 conv=fdatasync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 70.5465 s, 14.9 MB/s

Now, the drive reads and writes perfectly, except for the fact that sequential writes are ridiculously slow!
Doesn't matter where I write to the drive, same results.
To add to my confusion, reads are plenty fast:

Code:

% sudo dd if=/dev/sdb of=/dev/null bs=1M count=1000 skip=10000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.16192 s, 252 MB/s

The other 7 of these 10TB drives don't have any issues like this. They all happily read at 250MB/sec and write at 210MB/sec.
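
For anyone who wants to sanity-check the figures, here is a minimal Python sketch that recomputes the throughput from the byte counts and timings quoted in this post. The helper name `mb_per_s` is made up for illustration, and the 28-hour format time and 210 MB/s healthy write speed are the rough numbers given above, not precise measurements.

```python
def mb_per_s(nbytes: int, seconds: float) -> float:
    """Decimal megabytes per second, the same unit dd prints."""
    return nbytes / seconds / 1e6

# Byte counts and elapsed times copied from the two dd runs above.
write_rate = mb_per_s(1_048_576_000, 70.5465)
read_rate = mb_per_s(1_048_576_000, 4.16192)

# Approximate write speed reported for the other seven drives.
healthy_write = 210.0
slowdown = healthy_write / write_rate

# Average rate implied by formatting the full reported capacity in ~28 hours.
capacity_bytes = 10_000_831_348_736
format_rate = mb_per_s(capacity_bytes, 28 * 3600)

print(f"write:  {write_rate:.1f} MB/s")
print(f"read:   {read_rate:.1f} MB/s")
print(f"healthy drives write roughly {slowdown:.0f}x faster")
print(f"format: {format_rate:.1f} MB/s average over 28 h")
```

The implied format rate is close to 100 MB/s, which fits the guess that sg_format wrote the whole drive, and is itself well below what the healthy drives manage.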

Here's the SMART data from the drive currently:

Code:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL4204
Revision:             C386
Compliance:           SPC-4
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca251139f9c
Serial number:        7PGATKAS
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Feb  1 02:07:59 2021 PST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 1552
Current Drive Temperature:     39 C
Drive Trip Temperature:        85 C

Manufactured in week 32 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  202
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1479
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 42732088725602304

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0  3759158         0   3759158   47340653     135862.604          10
write:         0        0         0         0    2095762     495192.050           0
verify:        0        0         0         0    1522085          0.000           0

Non-medium error count:        1

No Self-tests have been logged

Background scan results log
  Status: halted - vendor specific cause
    Accumulated power on time, hours:minutes 33916:41 [2035001 minutes]
    Number of background scans performed: 147,  scan progress: 0.00%
    Number of background medium scans performed: 147

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 3 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca251139f9d
    attached SAS address = 0x5001940000fb023f
    attached phy identifier = 5
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca251139f9e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

Here's the same, but from when I bought the drive (used from eBay) in April:

Code:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-3-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL4204
Revision:             C386
Compliance:           SPC-4
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca251139f9c
Serial number:        7PGATKAS
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr 14 14:29:39 2020 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     28 C
Drive Trip Temperature:        85 C

Manufactured in week 32 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  192
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1187
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 41198396404400128

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         3     364776      33320.318           0
write:         0        0         0         0    1109549     471590.141           0
verify:        0        0         0         0      23172          0.000           0

Non-medium error count:        0

No Self-tests have been logged

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 26898:01 [1613881 minutes]
    Number of background scans performed: 125,  scan progress: 0.09%
    Number of background medium scans performed: 125

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 3 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca251139f9d
    attached SAS address = 0x5001940000fb023f
    attached phy identifier = 4
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca251139f9e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

Notice the "delayed" read ECC errors; it's a LOT higher now than it was.
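
To make that comparison less of an eyeball exercise, here is a rough Python sketch that parses the read/write/verify rows of the two error counter logs above and prints what changed. The function name `parse_error_counters` and the field names are my own labels for the columns, not anything smartctl defines, and the parser assumes the seven-column layout shown in these dumps.

```python
import re

# Column labels are illustrative names for smartctl's seven numeric columns.
FIELDS = ("ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
          "algorithm_invocations", "gigabytes", "uncorrected")

def parse_error_counters(text: str) -> dict:
    """Parse 'read:', 'write:' and 'verify:' rows into dicts of numbers."""
    rows = {}
    for line in text.splitlines():
        m = re.match(r"\s*(read|write|verify):\s+(.*)", line)
        if not m:
            continue
        values = m.group(2).split()
        if len(values) != len(FIELDS):
            continue  # unexpected layout, skip rather than guess
        rows[m.group(1)] = {
            name: float(v) if "." in v else int(v)
            for name, v in zip(FIELDS, values)
        }
    return rows

# Rows copied from the April 2020 and February 2021 dumps in this post.
APRIL_2020 = """\
read:          0        3         0         3     364776      33320.318           0
write:         0        0         0         0    1109549     471590.141           0
verify:        0        0         0         0      23172          0.000           0
"""
FEB_2021 = """\
read:          0  3759158         0   3759158   47340653     135862.604          10
write:         0        0         0         0    2095762     495192.050           0
verify:        0        0         0         0    1522085          0.000           0
"""

before = parse_error_counters(APRIL_2020)
after = parse_error_counters(FEB_2021)

# Collect and print only the counters that actually moved.
deltas = {}
for op, counters in after.items():
    for name, value in counters.items():
        change = value - before[op][name]
        if change:
            deltas[(op, name)] = change
            print(f"{op:6s} {name:22s} {before[op][name]} -> {value}")
```

The same function could be pointed at saved `smartctl -a` output from each of the other drives to see whether any of them show a similar climb in delayed ECC corrections.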

What the heck is wrong with this drive? Dying? Weird SAN vendor firmware? Why doesn't it report anything in SMART that makes sense for a drive that will only write at 15 MB/sec? Why only after an sg_format (where all I did was --format, nothing else)?

The seller actually has a 1-year warranty on these but wants me to send my address and email outside eBay so they can invoice me for the difference in cost for a (new) drive they have now, which seems kind of weird to me. Should I insist on replacement/refund? They don't have any more of this particular drive after selling hundreds so they'd probably send me a shucked WD whitelabel or something.

andrewbedia · Feb 1, 2021

I have a different He10 SAS disk (HUH721008AL5204) and it looks more like this:

Code:

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0    2876341     180279.233           0
write:         0        0         0         0     129277      44076.271           0
verify:        0        0         0         0      14964          0.000           0

Non-medium error count:        0

If you're already getting unrecoverable read errors (UREs), I think that's a bad sign. I would just get them to take the drive back and go find another He8 or He10. I've been running a whole bunch of 8TB SAS disks in a RAIDZ2 for multiple years at this point, and they have never had a URE: 5x He8, 1x He10, 6x Exos 7E8.

n17ikh · Feb 4, 2021

Yeah, I took the drive out for some additional bench testing and noticed the ominous clicking once it wasn't drowned out by all the server noise. I tried a full-disk read and it started fast and eventually dropped to a 10MB/s crawl.

So, I bought a "used" HE10 from eBay to replace it, which I got lucky on:

Code:

Manufactured in week 22 of year 2018
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  1
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1
Elements in grown defect list: 0
 
Vendor (Seagate Cache) information
  Blocks sent to initiator = 0
 
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.004           0
write:         0        0         0         0          0          0.000           0

andrewbedia · Feb 5, 2021

That's great to hear! I've bought lots of used drives myself and really never gotten something that had the siht run out of it. I think the worst thing I've bought was a 600GB Intel SSD DC S3500 that only had 60% life left (1 out of ~10)
