ZFS write errors on 7200 RPM drives, fine with badblocks


rhnet

New Member
Mar 18, 2023
9
1
3
I have been struggling to add refurbished 7200 RPM SATA drives (WUH721414AL) to my existing ZFS pool. The drives pass badblocks without issues, and I have tested them on another system as well.

When I try to add them to an existing ZFS mirror, I get lots of WRITE/CKSUM errors, and the drives eventually fault. Here's the output only 10% into a resilver:

Code:
        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            disk2_crypt            ONLINE       0     0     0
            disk16_crypt           ONLINE       0     4    93  (resilvering)
And in dmesg I get stuff like:

Code:
    [Sat Jun 15 12:39:59 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=7423255224320 size=12288 flags=1808aa
    [Sat Jun 15 12:39:59 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=7423255212032 size=12288 flags=1808aa
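For anyone decoding these lines, here is a minimal Python sketch. It assumes the standard Linux errno table and the zio type numbering as I understand it from OpenZFS (1 = read, 2 = write). The `parse_zio` helper and its regex are illustrative only and are not part of any ZFS tooling. Under those assumptions, both messages are writes that failed with EIO, and their offsets and sizes are multiples of 4096 bytes.

```python
import errno
import re

# zio type numbers as I understand OpenZFS's zio_type_t enum (assumption).
ZIO_TYPES = {1: "read", 2: "write", 3: "free", 4: "claim", 5: "ioctl", 6: "trim"}

# Hypothetical pattern for the "zio pool=... vdev=..." kernel log lines.
ZIO_RE = re.compile(
    r"vdev=(?P<vdev>\S+) error=(?P<error>\d+) type=(?P<type>\d+) "
    r"offset=(?P<offset>\d+) size=(?P<size>\d+)"
)

# The two lines from the dmesg excerpt above.
LINES = [
    "[Sat Jun 15 12:39:59 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt "
    "error=5 type=2 offset=7423255224320 size=12288 flags=1808aa",
    "[Sat Jun 15 12:39:59 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt "
    "error=5 type=2 offset=7423255212032 size=12288 flags=1808aa",
]


def parse_zio(line):
    """Return a dict describing one zio error line, or None if it does not match."""
    m = ZIO_RE.search(line)
    if m is None:
        return None
    offset, size = int(m["offset"]), int(m["size"])
    return {
        "vdev": m["vdev"],
        "errno": errno.errorcode.get(int(m["error"]), "unknown"),
        "type": ZIO_TYPES.get(int(m["type"]), "other"),
        "offset": offset,
        "size": size,
        # True when both the offset and the size are multiples of 4096 bytes.
        "aligned_4k": offset % 4096 == 0 and size % 4096 == 0,
    }


for line in LINES:
    print(parse_zio(line))
```

The same helper can be pointed at a longer dmesg capture to check whether every failure is a write to the same vdev.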
I have these attached to an HBA and SAS expander, but I've also tried with the SATA ports on the motherboard directly.

Setup:

* LSI SAS9340-8i ServeRAID M1215 12Gbps SAS (from artofserver)
* Adaptec 2283400-R AEC-82885T LENOVO 36Port 12Gb/s SAS Expander Card 82885T US
* 10Gtek# 12G Internal Mini SAS HD SFF-8643 to SFF-8643 Cable, with Sideband, 100-Ohm, 0.5-m(1.6ft), 2 Pack
* 3x AdcAudx 2Pack SFF-8643 to SATA: 1M SFF-8643 Mini-SAS to SATA-Cable SFF8643 to SATA Mini SAS HD to SATA Forward Breakout (3.3FT)
* Lots of 5400 RPM data drives that work fine: WD101EMAZ-11, WDC WD140EDFZ-11, WDC WD140EDGZ-11, WD80EMAZ-00W, ...

Several 7200 RPM drives from serverpartdeals and goharddrive, all WUH721414AL.

I run them with LUKS (cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256) in a ZFS mirror.
 

rhnet

New Member
Mar 18, 2023
9
1
3
Thanks for the suggestions!

Power supply is a good point. It's a 'Seasonic Vertex GX-750 | 750W | 80+ Gold | ATX 3.0 & PCIe 5.0 Ready'. My smart plug says it's pulling 190W during the resilver. If I crank the CPU I can get it to ~400W. There are 16 drives in there (2 7200rpm) though. And that's across 3 power lines coming out of the power supply (only three sata cables came with it, I'm using two splitters). I did try moving the power cables around a bit to see if I could spread the drives a little better, but no luck.

Here's smartctl of the drive above.
Code:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-112-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC  WUH721414ALE604
Serial Number:    <redacted>
LU WWN Device Id: 5 000cca 258d4159f
Firmware Version: LDGSW2G0
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 15 19:50:44 2024 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.

Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1555) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   081   081   001    Pre-fail  Always       -       382 (Average 382)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       36746
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1418
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1418
194 Temperature_Celsius     0x0002   056   056   000    Old_age   Always       -       38 (Min/Max 22/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36550         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
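As a rough way to read the attribute table above, the following Python sketch pulls out the raw values most often checked for media or link trouble. The choice of attribute IDs (5, 197, 198, 199) is my assumption about what is worth watching, and `nonzero_watched` is a hypothetical helper, not a smartmontools feature. The rows are copied from the output above.

```python
# Attribute rows copied from the smartctl output above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0
"""

# Assumption: these IDs are the ones usually tied to bad sectors (5, 197, 198)
# and to SATA link/cable CRC errors (199).
WATCHED = {5, 197, 198, 199}


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with the numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


print(nonzero_watched(SMART_ROWS))
```

An empty result means the drive itself has logged no reallocated or pending sectors and no link CRC errors, which matches the drive passing badblocks.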
 

pricklypunter

Well-Known Member
Nov 10, 2015
1,757
553
113
Canada
I think you are loading down the power supply...16 disks, plus everything else, is a lot to expect from a 750W power supply. Remember they calculate that 750W across the various rails, they are not equally split. For that number of disks, I would be expecting your supply to be at least in the 1000-1200W range :)
 

nexox

Well-Known Member
May 3, 2023
1,514
730
113
pricklypunter said:
For that number of disks, I would be expecting your supply to be at least in the 1000-1200W range
It's been a while, and I don't think SATA drive power consumption has changed all that much, but I ran up to 14 7200 RPM drives off a 380W PSU for close to a decade without issues. Perhaps the individual SATA power wires or splitters are overloaded, but a 1200W PSU is a waste of money and power in a system that tops out at under 350W on the DC side.
 

MountainBofh

Beating my users into submission
Mar 9, 2024
393
287
63
Seasonic uses a single 12V rail, so that isn't an issue. 750 watts is more than enough for 16 drives. The HC530s are rated to draw 6 watts while operating. Being conservative and saying 10 watts each gives 160 watts for all the drives maxed out.

Unless his power supply is defective (unlikely, given the information so far), I'd say that's not the issue.
 

rhnet

New Member
Mar 18, 2023
9
1
3
Here's the firmware of my HBA, though I also had the same issue when the drives were connected directly to the motherboard SATA ports.

Code:
sudo sas3flash -listall
Avago Technologies SAS3 Flash Utility
Version 15.00.00.00 (2016.11.17)
Copyright 2008-2016 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS3008(C0)  16.00.12.00    0e.01.00.07    08.37.00.00     00:01:00:00

        Finished Processing Commands Successfully.
        Exiting SAS3Flash
 

Chriggel

Active Member
Mar 30, 2024
157
85
28
It could be an issue with signal loss. You're on the limit of the specified SATA link length. Actually you're beyond the limit, because internal connections and PCB traces are also part of the total link length. Because of this, technically all 100cm SATA cables violate the spec.

I had tons of problems with that in the past. Different setups have more or less problems with it because there are so many contributing factors. I was never able to properly, scientifically verify this, but I had the suspicion that not all expanders are doing a particularly great job when it comes to repeating the signal. In that case, 100cm cables on an expander and 100cm cables on a HBA aren't the same thing. I also wondered if some expanders even do a refresh at all, as you'd expect from an active component.

This was many years ago, and because I needed to get the systems up and running, I didn't run the extensive tests needed to get to the bottom of this. Since then, I haven't used cables longer than 80cm (and 50cm whenever possible), especially not on the drive side of an expander. Actually, if possible, I use expander backplanes to avoid cables between the expander and the drives altogether. If that isn't an option, I use higher port count HBAs, which luckily are a thing today and are much easier to get than back in the day.
 

Mithril

Active Member
Sep 13, 2019
452
151
43
Check the RAM and potentially the CPU itself. A long time ago I was running ZFS on a board that I didn't know had a weak RAM VRM as well as old and failing RAM, and I would get similar "makes no sense" errors on scrubs. Running memtest86 overnight showed "sometimes errors, sometimes not" in multiple locations.
 

rhnet

New Member
Mar 18, 2023
9
1
3
unwind-protect said:
I would give it a shot without the crypt layer, just for testing.
I added another of the disks to the mirror without LUKS this time (while disk16_crypt was also resilvering) and got an interesting result: there were checksum errors but no write errors. I haven't actually tried this particular drive with crypt before (I've been trying different combinations of everything).

Code:
   pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jun 15 12:22:47 2024
        18.5T scanned at 427M/s, 18.5T issued at 426M/s, 60.2T total
        4.01T resilvered, 30.68% done, 1 days 04:31:31 to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        tank                                        DEGRADED     0     0     0
          mirror-0                                  DEGRADED     0     0     0
            disk2_crypt                             ONLINE       0     0   114  (awaiting resilver)
            replacing-1                             UNAVAIL      0     0     0  insufficient replicas
              9203485469624093271                   UNAVAIL      0     0     0  was /dev/mapper/disk1_crypt
              disk13_crypt                          OFFLINE      0     0     0
            18360366250662888740                    UNAVAIL      0     0     0  was /dev/mapper/disk14_crypt
            4915826231744496010                     UNAVAIL      0     0     0  was /dev/mapper/disk15_crypt
            disk16_crypt                            DEGRADED     0    30   114  too many errors  (resilvering)
            ata-WDC_WUH721414ALE604_<redacted/disk17>-part1  ONLINE       0     0    74  (resilvering)
          mirror-1                                  ONLINE       0     0     0
            disk9_crypt                             ONLINE       0     0     0
            disk10_crypt                            ONLINE       0     0     0
          mirror-2                                  ONLINE       0     0     0
            disk5_crypt                             ONLINE       0     0     0
            disk6_crypt                             ONLINE       0     0     0
          mirror-3                                  ONLINE       0     0     0
            disk7_crypt                             ONLINE       0     0     0
            disk8_crypt                             ONLINE       0     0     0
          mirror-4                                  ONLINE       0     0     0
            disk11_crypt                            ONLINE       0     0     0
            disk12_crypt                            ONLINE       0     0     0
          mirror-5                                  ONLINE       0     0     0
            disk3_crypt                             ONLINE       0     0     0
            disk4_crypt                             ONLINE       0     0     0
Some notes, as this is a larger view of zpool status than in my original post:

- disk2 checksum: I moved this pool from an older machine two weeks ago. When I did, I used a cheap 'Inspur LSI 9300-8i', which caused tons of problems. Files written during that time (all just zfs recv backups of other machines) had errors, so I swapped out the HBA for the presumed-good artofserver one.
- 9203485469624093271: a drive in the mirror that failed to come back after the transplant to the new server. I verified in another machine that it was bad.
- disk13, disk14, disk15 are all attempts at adding different 7200 rpm drives to the mirror.

dmesg continues to complain about disk16, but not about the unencrypted 'disk17':

Code:
[Sun Jun 16 03:03:23 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=560327979008 size=348160 flags=1808aa
[Sun Jun 16 03:03:23 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=560327630848 size=348160 flags=1808aa
[Sun Jun 16 03:29:24 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=2648509591552 size=233472 flags=1808aa
[Sun Jun 16 03:30:40 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=1719266394112 size=249856 flags=1808aa
[Sun Jun 16 03:31:50 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=5192659308544 size=348160 flags=1808aa
[Sun Jun 16 03:34:40 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=670064041984 size=237568 flags=1808aa
[Sun Jun 16 03:36:43 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=361201635328 size=258048 flags=1808aa
[Sun Jun 16 03:37:05 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=5847854530560 size=237568 flags=1808aa
[Sun Jun 16 03:37:56 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=5995892486144 size=307200 flags=1808aa
[Sun Jun 16 03:37:56 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=5995965739008 size=307200 flags=1808aa
[Sun Jun 16 03:39:29 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=4513760804864 size=311296 flags=1808aa
[Sun Jun 16 03:39:29 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=4514059792384 size=311296 flags=1808aa
[Sun Jun 16 04:00:02 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=7851966734336 size=278528 flags=1808aa
[Sun Jun 16 04:00:02 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=7852044771328 size=278528 flags=1808aa
[Sun Jun 16 04:01:10 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=906085408768 size=299008 flags=1808aa
[Sun Jun 16 05:31:28 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=2371654696960 size=311296 flags=1808aa
[Sun Jun 16 05:31:44 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=5914053931008 size=348160 flags=1808aa
[Sun Jun 16 05:33:54 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=5709860196352 size=270336 flags=1808aa
[Sun Jun 16 05:34:49 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=2423416188928 size=303104 flags=1808aa
[Sun Jun 16 05:35:35 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=7102999162880 size=348160 flags=1808aa
[Sun Jun 16 05:35:39 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=6722308485120 size=274432 flags=1808aa
[Sun Jun 16 05:35:39 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=6721856724992 size=274432 flags=1808aa
[Sun Jun 16 05:35:39 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=155387342848 size=282624 flags=1808aa
[Sun Jun 16 05:38:03 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=2594738585600 size=225280 flags=1808aa
[Sun Jun 16 05:38:03 2024] zio pool=tank vdev=/dev/mapper/disk16_crypt error=5 type=2 offset=2595291136000 size=225280 flags=1808aa
 

nabsltd

Well-Known Member
Jan 26, 2022
716
504
93
pricklypunter said:
I think you are loading down the power supply...16 disks, plus everything else, is a lot to expect from a 750W power supply.
16 disks at 15W each would be 240W. Even with a 300W CPU running at full power, a 750W Seasonic will have more than enough power to handle the load.
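The per-drive estimates quoted in this thread can be put side by side with a short Python sketch. The figures come from the posts above (10 W and 15 W per drive, a 300 W CPU, a 750 W supply). The `dc_load` helper is illustrative, and it only covers the steady-state total, not the load on an individual SATA power cable or splitter.

```python
PSU_WATTS = 750  # Seasonic Vertex GX-750, per the earlier post
DRIVES = 16      # drive count reported by the original poster


def dc_load(per_drive_watts, other_watts=0):
    """Total estimated load: all drives plus any other component budget."""
    return DRIVES * per_drive_watts + other_watts


estimates = {
    "10 W per drive, drives only": dc_load(10),
    "15 W per drive plus a 300 W CPU": dc_load(15, 300),
}

for label, watts in estimates.items():
    print(f"{label}: {watts} W, headroom {PSU_WATTS - watts} W")
```

Even the more pessimistic estimate leaves about 200 W of headroom, which is why the total capacity of the supply looks like an unlikely culprit here.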
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,618
577
113
I'd ask a few other questions:

Do you have ECC RAM? (Does it report errors?) Encryption exposes more CPU and RAM to checksum issues, especially if you aren't 100% stable.
How about your SAS controller? Anything in syslog? It could be a faulty cache on the controller (you are running a 9340, which has its own local cache unless disabled).
 

rhnet

New Member
Mar 18, 2023
9
1
3
The resilver on the unencrypted drive completed without error. I'm going to try a scrub and see if everything was written fine. It does seem suspicious. I also plan to try disk16_crypt without encryption to see if it has no errors on a resilver.

I do not have ECC ram in this machine. It is DDR5. The previous machine was ECC, I was hoping to get away without it.

Is there anything I should search for in syslog? There's a lot of other spam in there.

I did have the errors when attached directly to the motherboard sata ports, but I haven't tried again recently.
 

BackupProphet

Well-Known Member
Jul 2, 2014
1,276
846
113
Stavanger, Norway
intellistream.ai
One thing I have noticed with some hard drives is that they spend a lot of time trying to read a block; ZFS may give up after a few seconds and report a checksum error. These drives seem to work fine with badblocks, though they may take slightly longer to complete, depending on how bad it is. The easiest thing is to try another hard drive.

You can also check sudo zpool events -v tank

Another time, when I moved a pool with export and import to another system, I had bad checksums for about a month. I really did not have time to fix it, so I just cleared the errors and let it run for another month. After that month, the errors were gone when scrubbing. It magically fixed itself.
 

rhnet

New Member
Mar 18, 2023
9
1
3
"lot of other spam" anything error/warn wise?
No, not that I can tell. By spam I mean from docker etc. If there's a particular systemd unit I should be looking for I can check.

I really have no idea here, could be a firmware issue. What is the ZFS version?
Code:
$ uname -a
Linux <hostname> 5.15.0-112-generic #122-Ubuntu SMP Thu May 23 07:48:21 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ modinfo zfs | grep version
version:        2.1.5-1ubuntu6~22.04.3
srcversion:     C846B2D0C274CEADA66935D
vermagic:       5.15.0-112-generic SMP mod_unload modversions

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy
I'm happy to upgrade anything (as long as it's not terribly bleeding edge).
 

rhnet

New Member
Mar 18, 2023
9
1
3
My reply was delayed because I wanted to wait for the scrub on my pool to finish, which took a couple of days.

Before the scrub I attached disk16_crypt drive directly to the motherboard again. And I still had write errors.
I have not had any write/read errors on the unencrypted drive.

I've been reading the ZFS issues page and noticed that there have been complaints about very similar errors on LUKS devices (example), especially when the crypt layer uses 4k sectors. That led me to notice that these new drives were all using 4k LUKS sectors, while my old drive had 512-byte LUKS sectors.

So I've started another test with 512-byte LUKS sectors. As a bonus, I've left disk16 attached to the motherboard and attached another disk, disk17, to the SAS expander.
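For anyone wanting to check which sector size a LUKS2 volume uses, here is a minimal Python sketch that reads it out of `cryptsetup luksDump` text. The sample text is an illustrative excerpt in the layout luksDump prints for LUKS2 and is not output from this system. `luks_sector_sizes` is a hypothetical helper.

```python
import re

# Illustrative excerpt in the layout `cryptsetup luksDump` prints for LUKS2.
# This is NOT real output from the system discussed in this thread.
SAMPLE_DUMP = """\
Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 4096 [bytes]
"""

# Matches the "sector: N [bytes]" line of each data segment.
SECTOR_RE = re.compile(r"^\s*sector:\s*(\d+)\s*\[bytes\]", re.MULTILINE)


def luks_sector_sizes(dump_text):
    """Return the encryption sector size of each data segment found in the dump."""
    return [int(size) for size in SECTOR_RE.findall(dump_text)]


print(luks_sector_sizes(SAMPLE_DUMP))
```

In practice the text would come from running `cryptsetup luksDump` on the underlying device. As far as I know, the sector size is chosen at format time with the `--sector-size` option of `cryptsetup luksFormat` (LUKS2 only), so changing it means re-creating the container.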

Thanks to unwind-protect's suggestion to try without luks, I also have a fully working mirror between disk2 and the unencrypted drive. I was nervous putting so much load on disk2 without a healthy mirror.

Thank you all for your ideas so far!


Code:
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jun 22 06:16:53 2024
        3.66T scanned at 2.50G/s, 3.51T issued at 2.40G/s, 53.4T total
        800M resilvered, 6.58% done, 05:54:44 to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        tank                                        ONLINE       0     0     0
          mirror-0                                  ONLINE       0     0     0
            disk2_crypt                             ONLINE       0     0     0
            ata-WDC_WUH721414ALE604_<redacted>-part1  ONLINE       0     0     0
            disk16_crypt                            ONLINE       0     0     0  (resilvering)
            disk17_crypt                            ONLINE       0     0     0  (resilvering)
          mirror-1                                  ONLINE       0     0     0
            disk9_crypt                             ONLINE       0     0     0
            disk10_crypt                            ONLINE       0     0     0
          mirror-2                                  ONLINE       0     0     0
            disk5_crypt                             ONLINE       0     0     0
            disk6_crypt                             ONLINE       0     0     0
          mirror-3                                  ONLINE       0     0     0
            disk7_crypt                             ONLINE       0     0     0
            disk8_crypt                             ONLINE       0     0     0
          mirror-4                                  ONLINE       0     0     0
            disk11_crypt                            ONLINE       0     0     0
            disk12_crypt                            ONLINE       0     0     0
          mirror-5                                  ONLINE       0     0     0
            disk3_crypt                             ONLINE       0     0     0
            disk4_crypt                             ONLINE       0     0     0