I've been chasing some performance issues with my NAS lately and could use some additional eyes on this. The hardware is a Supermicro X9DRD-7LN4F-JBOD with a BPN-SAS2-836EL1 backplane, eight 10TB HUH721010AL4204 drives, and eight 3TB mixed SAS and SATA drives, arranged into two md RAID6 arrays (one for the 3TB drives and one for the 10TB drives). After poking around with measurements from ioping, I noticed some suspiciously high latency numbers on one of the 10TB drives (in the tens of seconds while writing).
I also noticed that a whole lot of read errors occurred while running a plain
dd if=/dev/sdb of=/dev/null bs=1M
- the drive would read for a few GB and then kick back a read error, always on the same sector. Continue past it and it would do the same. The drive steadfastly refuses to report any bad sectors in SMART data.
I figured I'd try an sg_format to force the drive to reallocate sectors (what could go wrong?).
Well, the sg_format runs (takes about 28 hours - write speed around 100 MB/sec if it's actually writing the whole drive), and then when the drive becomes available again, I get this nonsense:
Code:
% sudo dd if=/dev/zero of=/dev/sdb bs=1M count=1000 conv=fdatasync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 70.5465 s, 14.9 MB/s
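Now, the drive reads and writes perfectly, except for the fact that sequential writes are ridiculously slow!
Doesn't matter where I write to the drive, same results.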
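For anyone who wants to repeat the "same results everywhere" check, here is a rough Python sketch that only builds dd command lines at a few offsets across the drive; it does not run them. The device name /dev/sdX, the chosen fractions, and the helper name dd_write_commands are placeholders of mine, not something from my actual session, and running the generated commands overwrites data on the target device.

```python
import shlex


def dd_write_commands(device, capacity_bytes, fractions=(0.0, 0.25, 0.5, 0.75),
                      block_size=1 << 20, count=1000):
    """Build (not run) dd write tests at several offsets across a drive.

    WARNING: the generated commands write zeros to the raw device and
    destroy whatever is stored at those offsets.
    """
    total_blocks = capacity_bytes // block_size
    commands = []
    for fraction in fractions:
        # seek= is measured in output blocks of size bs; keep the whole
        # test region inside the device.
        seek = max(0, min(int(total_blocks * fraction), total_blocks - count))
        commands.append(
            f"dd if=/dev/zero of={shlex.quote(device)} bs={block_size} "
            f"count={count} seek={seek} conv=fdatasync"
        )
    return commands


if __name__ == "__main__":
    # Capacity is the "User Capacity" figure from the smartctl output;
    # /dev/sdX is a placeholder device name.
    for command in dd_write_commands("/dev/sdX", 10_000_831_348_736):
        print(command)
```

Comparing the MB/s figure dd reports for each offset would show whether the slowdown depends on position; in my case it did not.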
To add to my confusion, reads are plenty fast:
Code:
% sudo dd if=/dev/sdb of=/dev/null bs=1M count=1000 skip=10000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.16192 s, 252 MB/s
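To put the two dd runs and the format time side by side, here is a small Python sketch of the arithmetic. All inputs are copied from this post; the 28 hours is my approximate figure, and the calculation assumes sg_format wrote the entire user capacity exactly once, which I have not confirmed.

```python
# Figures copied from the dd runs and smartctl output in this post.
BYTES_COPIED = 1_048_576_000         # 1000 MiB, same for both dd runs
WRITE_SECONDS = 70.5465
READ_SECONDS = 4.16192
CAPACITY_BYTES = 10_000_831_348_736  # "User Capacity" from smartctl
FORMAT_HOURS = 28                    # approximate sg_format duration


def mb_per_s(num_bytes, seconds):
    """Throughput in decimal megabytes per second, the unit dd reports."""
    return num_bytes / seconds / 1e6


def hours_to_write(num_bytes, rate_mb_s):
    """Hours needed to write num_bytes at a steady rate_mb_s."""
    return num_bytes / (rate_mb_s * 1e6) / 3600


write_rate = mb_per_s(BYTES_COPIED, WRITE_SECONDS)
read_rate = mb_per_s(BYTES_COPIED, READ_SECONDS)
# Assumption: the format wrote every byte of the drive exactly once.
format_rate = mb_per_s(CAPACITY_BYTES, FORMAT_HOURS * 3600)

if __name__ == "__main__":
    print(f"write:  {write_rate:.1f} MB/s")
    print(f"read:   {read_rate:.1f} MB/s ({read_rate / write_rate:.0f}x the write rate)")
    print(f"format: {format_rate:.1f} MB/s implied by a {FORMAT_HOURS} h full-drive pass")
    print(f"full-drive write at the dd write rate: "
          f"{hours_to_write(CAPACITY_BYTES, write_rate):.0f} h")
```

If the assumption holds, the format itself ran several times faster than the drive writes now, so the slowdown would have appeared after (or because of) the format rather than during it.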
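The other 7 of these 10TB drives don't have any issues like this. They all happily read at 250 MB/sec and write at 210 MB/sec.
Here's the SMART data from the drive currently: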
Code:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721010AL4204
Revision: C386
Compliance: SPC-4
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca251139f9c
Serial number: 7PGATKAS
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Feb 1 02:07:59 2021 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification = 0
Total blocks reassigned during format = 0
Total new blocks reassigned = 0
Power on minutes since format = 1552
Current Drive Temperature: 39 C
Drive Trip Temperature: 85 C
Manufactured in week 32 of year 2016
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 202
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1479
Elements in grown defect list: 0
Vendor (Seagate Cache) information
Blocks sent to initiator = 42732088725602304
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0  3759158         0   3759158   47340653     135862.604          10
write:         0        0         0         0    2095762     495192.050           0
verify:        0        0         0         0    1522085          0.000           0
Non-medium error count: 1
No Self-tests have been logged
Background scan results log
Status: halted - vendor specific cause
Accumulated power on time, hours:minutes 33916:41 [2035001 minutes]
Number of background scans performed: 147, scan progress: 0.00%
Number of background medium scans performed: 147
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 1
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; 3 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5000cca251139f9d
attached SAS address = 0x5001940000fb023f
attached phy identifier = 5
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
relative target port id = 2
generation code = 1
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000cca251139f9e
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
Here's the same, but from when I bought the drive (used from eBay) in April:
Code:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-3-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721010AL4204
Revision: C386
Compliance: SPC-4
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca251139f9c
Serial number: 7PGATKAS
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Apr 14 14:29:39 2020 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 28 C
Drive Trip Temperature: 85 C
Manufactured in week 32 of year 2016
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 192
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1187
Elements in grown defect list: 0
Vendor (Seagate Cache) information
Blocks sent to initiator = 41198396404400128
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         3     364776      33320.318           0
write:         0        0         0         0    1109549     471590.141           0
verify:        0        0         0         0      23172          0.000           0
Non-medium error count: 0
No Self-tests have been logged
Background scan results log
Status: scan is active
Accumulated power on time, hours:minutes 26898:01 [1613881 minutes]
Number of background scans performed: 125, scan progress: 0.09%
Number of background medium scans performed: 125
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 1
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; 3 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5000cca251139f9d
attached SAS address = 0x5001940000fb023f
attached phy identifier = 4
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
relative target port id = 2
generation code = 1
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000cca251139f9e
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
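Phy reset problem count: 0
Notice the "delayed" read ECC errors; the count is a LOT higher now than it was (3759158 versus 3).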
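To quantify how much the read error counters moved between the two smartctl dumps, here is a rough Python sketch. The numbers are the "read:" rows and power-on times from the two outputs above; the sketch assumes the counters are cumulative and were not reset by the format. The class and function names are just mine for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadCounters:
    """Selected fields from the 'read:' row of a smartctl error counter log."""
    delayed_ecc: int       # errors corrected by ECC, "delayed" column
    gigabytes: float       # gigabytes processed [10^9 bytes]
    uncorrected: int       # total uncorrected errors
    power_on_hours: float  # accumulated power on time


# Values copied from the two smartctl outputs in this post.
APRIL_2020 = ReadCounters(3, 33320.318, 0, 26898 + 1 / 60)
FEB_2021 = ReadCounters(3759158, 135862.604, 10, 33916 + 41 / 60)


def delayed_ecc_per_gb(before, after):
    """New delayed-ECC corrections per gigabyte read between two snapshots."""
    return (after.delayed_ecc - before.delayed_ecc) / (after.gigabytes - before.gigabytes)


def delayed_ecc_per_hour(before, after):
    """New delayed-ECC corrections per power-on hour between two snapshots."""
    return (after.delayed_ecc - before.delayed_ecc) / (after.power_on_hours - before.power_on_hours)


if __name__ == "__main__":
    lifetime_before = APRIL_2020.delayed_ecc / APRIL_2020.gigabytes
    print(f"before purchase: {lifetime_before:.6f} delayed ECC corrections per GB read")
    print(f"since purchase:  {delayed_ecc_per_gb(APRIL_2020, FEB_2021):.2f} per GB read")
    print(f"since purchase:  {delayed_ecc_per_hour(APRIL_2020, FEB_2021):.0f} per power-on hour")
    print(f"new uncorrected read errors: {FEB_2021.uncorrected - APRIL_2020.uncorrected}")
```

This only compares two snapshots, so it cannot say when the errors started piling up within that window.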
What the heck is wrong with this drive? Dying? Weird SAN vendor firmware? Why doesn't it report anything in SMART that makes sense for a drive that will only write at 15 MB/sec? Why only after an sg_format (where all I did was --format, nothing else)?
The seller actually has a 1-year warranty on these but wants me to send my address and email outside eBay so they can invoice me for the difference in cost for a (new) drive they have now, which seems kind of weird to me. Should I insist on replacement/refund? They don't have any more of this particular drive after selling hundreds so they'd probably send me a shucked WD whitelabel or something.