ZFS on Hardware-Raid

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by Stril, Dec 12, 2018.

  1. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    Everywhere in internet forums, people say it's a "no-go" to use ZFS on hardware RAID because it "likes/needs" to see the raw disks, but:

    There is no REAL explanation why!


    I really like ZFS, but rebuild times are just awful... A hardware RAID rebuild is much faster. I am currently in a 120h rebuild for a 4TB drive in a mirror config...


    Can you give me some input?
    Is it really not an option to use hardware RAID with ZFS?

    Thank you and best wishes
    Stril
     
    #1
  2. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    873
    Likes Received:
    291
    The point of ZFS really is that it's a one-stop-shop handling RAID, volume management, snapshots and files - and to do this well it should have direct control of all the discs involved in the filesystem. A hardware RAID card (or indeed putting ZFS on top of another volume management layer like LVM and/or mdadm) is an abstraction away from that. If something goes wrong with a layer beneath ZFS, then ZFS itself may not be able to cope with it. You can run it on a hardware RAID if you like, but then you'll lose the portability and rebuild speed.

    Speaking of rebuilds... assuming you're talking about a regular two-disc RAID1 mirror and one or two vdevs, there's no way it should take anywhere near that long on healthy hardware; rebuilds of mirrors should basically just be a sequential copy from the old drive to the new drive, and even then only of the used data rather than the entire array - so it should be faster at rebuilding than regular RAIDs of the same size. This isn't likely to be a CPU bottleneck (since such an operation isn't typically computationally expensive); are your discs getting thrashed by random IO or something? If you've got a dying drive in the array somewhere, it'll likely also result in atrocious performance.
     
    #2
  3. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    @Resilver-Speed
    I am using a pool of 12 drives - 4TB each --> 6 mirror vdevs, 192GB memory, sync=disabled, and dedup is on.

    One drive crashed and has to be "resilvered".

    Code:
    zpool list
    NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
    zpool1  21,8T  6,63T  15,1T         -    66%    30%  3.58x  DEGRADED  -
    
    
    zpool status zpool1
      pool: zpool1
     state: DEGRADED
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Wed Dec 12 09:20:09 2018
        1,01T scanned out of 6,60T at 12,1M/s, 134h17m to go
        172G resilvered, 15,29% done
    config:
    
            NAME                                                 STATE     READ WRITE CKSUM
            zpool1                                               DEGRADED     0     0     0
              mirror-0                                           ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c50084864751-lun-0    ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c500848ab931-lun-0    ONLINE       0     0     0
              mirror-1                                           ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c5008489b6e5-lun-0    ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c5008482f9f5-lun-0    ONLINE       0     0     0
              mirror-2                                           ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c50084864631-lun-0    ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c500847f3ef9-lun-0    ONLINE       0     0     0
              mirror-3                                           DEGRADED     0     0     0
                replacing-0                                      OFFLINE      0     0     0
                  pci-0000:07:00.0-sas-0x5000c500847e7849-lun-0  OFFLINE      0     0     1
                  wwn-0x5000c500a6a16c4b                         ONLINE       0     0     0  (resilvering)
                pci-0000:07:00.0-sas-0x5000c50084830899-lun-0    ONLINE       0     0     0
              mirror-4                                           ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c50084864639-lun-0    ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c500848646a5-lun-0    ONLINE       0     0     0
              mirror-5                                           ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c50084864791-lun-0    ONLINE       0     0     0
                pci-0000:07:00.0-sas-0x5000c5008485d06d-lun-0    ONLINE       0     0     0
    
    errors: No known data errors
    
    --> the resilver seems to read ALL the data in the pool - not only the data of the affected vdev.

    This is VERY slow, so I thought about using hardware RAID.
     
    #3
  4. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    873
    Likes Received:
    291
    A resilver/rebuild should only ever read the parts of the discs that contain actual data; it's got no need to read empty/unallocated blocks since those don't need to be rebuilt. You're only using about a third of your entire pool, so it should only need to read a third of the discs.

    12MB/s seems outrageously slow (although I'm not as well versed in ZFS as many on this forum). Do you have iostat available? What's the load on the individual discs like? I assume there's no obvious thrashing of CPU/mem/swap? You've got dedupe enabled on at least 6TB of data, so it might be worth checking your dedupe tables in the pool status. Check your ashift value as well, especially if your new drive has a different sector layout (e.g. 4Kn vs. 512e) than the old ones.
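
    Something along these lines should show the dedup table (DDT) summary and the per-vdev ashift - pool name taken from your output, and zdb reads the pool cache file, so the exact invocation may differ on your distro:

    Code:
    # dedup table size, both on disk and in core
    zpool status -D zpool1
    # ashift recorded for each vdev
    zdb -C zpool1 | grep ashift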
     
    #4
  5. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    IOSTAT does not show a lot of load:

    Code:
    iostat
    Linux 3.16.0-4-amd64 (backuppc)         13.12.2018      _x86_64_        (24 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0,89    0,03    1,55   13,06    0,00   84,46
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sdc              90,24       594,45       802,49  393880996  531731448
    sdd              90,26       595,55       802,49  394610808  531731448
    sdf              66,01       597,04       789,64  395600084  523216384
    sde              65,89       596,48       789,64  395225568  523216384
    sdh              77,11       598,50       801,58  396563760  531125904
    sdj             126,82      1188,31       787,82  787374472  522011104
    sdg              77,16       600,13       801,58  397644860  531125904
    sdk              78,90       599,47       793,37  397209976  525684548
    sdm              94,02       597,05       783,63  395604380  519230232
    sdn              94,15       598,45       783,63  396532980  519230232
    sdl              79,03       601,23       793,37  398376356  525684548
    sda               1,32         6,38         4,24    4228203    2807809
    sdb               1,08         1,16         4,24     769704    2807809
    md0               1,53         7,54         4,08    4995631    2705894
    dm-0              1,42         7,53         4,08    4990797    2705892
    dm-1              0,00         0,00         0,00       1576          0
    sdi              15,32         0,03       391,19      19016  259204347
    
    sdi is the new "inserted" drive.

    It just confuses me why over 6TB has to be resilvered...
     
    #5
  6. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    ZFS works on hardware RAID, but with three caveats:

    1. ZFS uses a quite large RAM-based write cache. To protect the write cache you can enable sync writes, which log every single committed write to an on-pool ZIL device. A commit from the ZIL must mean the data is on disk; a hardware RAID with its own cache cannot guarantee this to ZFS (see the example at the end of this post).

    2. ZFS comes with data and metadata checksums. Every fault is detected by ZFS and repaired on the fly from ZFS redundancy/RAID. If you use hardware RAID, the array is a single disk from ZFS's point of view. This means ZFS cannot repair the fault, and the hardware RAID cannot either, as it is not aware of the problem.

    3. Write hole problem
    A hardware RAID updates disks sequentially. A crash during a write can mean a corrupt RAID and/or a corrupt filesystem. ZFS RAID is based on copy-on-write, which means an atomic write (e.g. writing a RAID stripe, or data + metadata) is either completed or discarded.
    "Write hole" phenomenon in RAID5, RAID6, RAID1, and other arrays.

    Regarding resilvering
    ZFS resilvering is a low-priority process. It reads all data to check whether it needs to be copied to the new disk. With little data on the pool this is done within minutes. If the pool is full with a lot of small files, pool iops and RAM are the limiting factors.

    A hardware RAID does not care about content. It simply duplicates the whole disk. In most cases this is slower than ZFS; only on a quite full pool, or with many small files, is ZFS slower.

    These caveats mean that you lose three essential ZFS features with a hardware RAID, which is additionally slower in most conditions - so indeed a no-go.
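
    As a small illustration of caveats 1 and 2 above (pool name taken from this thread; sync can also be set per dataset):

    Code:
    # caveat 1: force every committed write through the ZIL before it is acknowledged
    zfs set sync=always zpool1
    # caveat 2: read and verify every checksum, repairing from ZFS redundancy where possible
    zpool scrub zpool1
    zpool status -v zpool1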
     
    #6
    fossxplorer likes this.
  7. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,473
    Likes Received:
    353
    At the enterprise level you will see a lot of ZFS with enterprise SAN backing - either as single disks (yes, I know you lose the repair options) or as mirrors.

    Some Sun systems run 100+ TB per volume like this. The world won’t end; it works fine. Keep in mind we are talking about true enterprise-grade storage on the back end - not mid-range or anything else, but high-end enterprise. So the whole topic of writes not actually being written is not in the discussion, because if that happens you have petabytes of other problems and corrupted databases.

    Anyway, to your problem: I don’t know, but that’s certainly not right. Unless the host is totally 100% busy non-stop, ZFS should rebuild as fast as hardware RAID, or close enough to it.
     
    #7
  8. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    Thank you for your point - I will stay with HBAs.

    @resilvering:
    Is there any possibility to increase the resilver priority? The estimated time shown keeps increasing. I do not want to wait 140h to get back to a "protected" status...
     
    #8
  9. zxv

    zxv The more I C, the less I see.

    Joined:
    Sep 10, 2017
    Messages:
    90
    Likes Received:
    31
    On HP hardware, at least, there is an advantage to using a RAID controller that has a battery-backed write cache and configuring single-disk (RAID0) volumes for ZFS.

    It eliminates any performance decrease from the use of a ZIL journal, while providing power loss protection.

    At least with HP controllers and arrays, it is easier to locate drives by toggling LEDs using the RAID controller. I find that HBAs often have one-off issues with this, which can result in pulling the wrong drive.

    It's possible to get smart stats for each physical drive on the raid controller.
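
    For example, with smartmontools on an HP Smart Array controller, something along these lines reads a physical drive sitting behind the controller (device path and drive index are just examples):

    Code:
    # SMART data for physical drive 0 behind the Smart Array controller
    smartctl -a -d cciss,0 /dev/sg0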

    That's my experience.....
     
    #9
    Last edited: Dec 13, 2018
  10. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    You can increase the priority.
    For ZoL, google: increase resilver priority zol

    Your main problem seems to be the 12 MB/s throughput. With that, any resilver must be slow, and it should definitely be higher even with an iops-heavy load and disks. I am not sure you will get a massive boost from the priority.

    From iostat it is not clear which disks belong to the pool, but the load is not even. This may indicate an unbalanced pool or a weak disk. Outside Oracle Solaris and its dedup2, I would advise against using dedup - even with a lot of RAM.
     
    #10
  11. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi gea!

    The resilver is getting slower and slower... Yesterday it was at 25 MB/s, now it is at 11.3 MB/s.

    What I tried was setting zfs_scrub_delay=0 - without any effect.
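
    (On this ZoL version that is a module parameter, so the change would have been along the lines of:)

    Code:
    echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay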

    I know that dedup is hard to use, but this is a backup target and I achieve a dedup factor of >4...

    @iostat:
    All the disks belong to the pool, but sdj is in the same mirror group as sdi --> sdj should have more load, in my mind. The rest seems balanced (sda and sdb are the boot drives).
     
    #11
  12. sovking

    sovking Member

    Joined:
    Jun 2, 2011
    Messages:
    31
    Likes Received:
    1
    Could dedup=on slow down the resilvering? Dedup is usually discouraged: few benefits and a lot of trouble.

    I'm also noticing that you are using ZFS on Linux (Linux 3.16.0-4-amd64). I don't know which specific distribution, but that is not the best pairing. It is usually better to pair ZFS with Solaris/Illumos/OpenIndiana or FreeBSD/FreeNAS.

    If you want the best of both worlds, you can put everything on ESXi + FreeNAS (or another OS) with the HBA passed through using VT-d, and then export the storage to your Linux environments.
     
    #12
  13. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    I cannot disable dedup for the live data.

    Do you have any comparison between ZoL and ZFS on Solaris?

    Back to the resilver:
    Why does the resilver need to read the whole pool and not only the one vdev?
     
    #13
  14. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    If you look at ZFS as a filesystem, BSD, Illumos (the free Solaris fork) and Linux are quite similar. New features mostly appear first in Illumos, but they all try to keep the differences between Open-ZFS on the various OSes small.

    When you look at the integration of ZFS and storage features into the OS, Solarish is superior. Sun developed OpenSolaris more or less around ZFS and storage; ZFS is the only option there, not one among many. FreeBSD also has quite good integration, as it has used ZFS for some time. Additionally, the genuine ZFS v44 in Oracle Solaris is the fastest and most feature-rich storage server; Open-ZFS is not yet on par.

    About the resilver:
    ZFS cares about data security above everything. This is why you cannot simply duplicate a faulted disk from the other mirror half, as this would mean a copy without checksum control. Sun decided that a resilver must be based on checksums, which is why all data must be read to verify the checksums. This is also why resilver time depends on pool iops, fill rate, file size and the RAM available to cache metadata and data.

    Usually a whole-disk 1:1 copy is much slower than a resilver based on the real data on a disk.
    OK, you have a problem with IO, but that is not the fault of ZFS - there must be another reason.
     
    #14
  15. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    About the resilver:
    Yes, there is IO on that system (nearly 24/7), but isn't there any possibility to prioritize the resilver process over the "regular" IO?
     
    #15
  16. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    By far, the fastest resilvering is sequential resilvering on a genuine Oracle Solaris ZFS
    Sequential Resilvering

    There are tuning options on Open-ZFS, like zfs_scrub_delay, to increase resilver priority over regular IO.
    But this does not answer why your resilver IO is 12 MB/s (even a slow single disk should be 5x faster).

    A resilver of a mirror disk on Open-ZFS should be in the range of hours, not days.
    That is what I would expect on Illumos/OmniOS, which is what I mostly use.

    Tuning is about increasing performance by, say, 5-30%.
    If your result is around 1/10 of a good result, your problem is not tuning.
     
    #16
    Last edited: Dec 13, 2018
    dswartz likes this.
  17. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    873
    Likes Received:
    291
    Assuming these are bog-standard 4TB drives, the random IO from the current workload looks relatively high (but of course that depends on the nature of the workload). What does `iostat -x`* look like? TPS in the region of 80-100 might indicate a high random IO load, which would conceivably wreck your resilver times in such a fashion. I assume there aren't any SSDs, SLOG or L2ARC involved here? What's the nature of the data on the array? You say it's for backups, but is it lots of big archive/VM files, lots of little files/hardlinks/diffs, or a combination of the two? Do you have any blinkenlights on the drive trays to give you an indication of whether the drives might be being thrashed?

    (Sorry, assumed earlier that there wasn't any current IO on the array in question).

    * IIRC there's also an iostat subcommand within zpool which might give more consistent results but I'm not sure on the important differences.
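
    For example, roughly (5-second interval; pool name from earlier in the thread):

    Code:
    # extended per-device utilisation and latency
    iostat -x 5
    # per-vdev and per-disk view from ZFS itself
    zpool iostat -v zpool1 5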
     
    #17
  18. zxv

    zxv The more I C, the less I see.

    Joined:
    Sep 10, 2017
    Messages:
    90
    Likes Received:
    31
    This thread discusses tuning resilvers on ZFS on Linux:
    Resilvering raidz - why so incredibly slow? : zfs

    To show the current values:

    Code:
    head /sys/module/zfs/parameters/*{resilver,inflight}*

    To speed up the resilver:

    Code:
    echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
    echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
    echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

    To return to the default values:

    Code:
    echo 2 > /sys/module/zfs/parameters/zfs_resilver_delay
    echo 32 > /sys/module/zfs/parameters/zfs_top_maxinflight
    echo 3000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

    It's quite possible the tuning will slow down filesystem IO if they are competing for IOPS.
     
    #18
  19. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    Thank you!
    Last night the resilver became faster. It should now be finished in 35h...

    I am a bit worried about the resilver times, as I want to build a larger ZFS storage system within the next few weeks (24x 10TB). If that storage were 75% full, a rebuild would have to scan about 80 TB.
    What is your experience with these setups? Doesn't a resilver run for weeks?
     
    #19
  20. zxv

    zxv The more I C, the less I see.

    Joined:
    Sep 10, 2017
    Messages:
    90
    Likes Received:
    31
    One option would be to benchmark them.
    For instance, compare two contrasting pools:
    - one pool with six vdevs of mirrored pairs
    - one pool with one vdev of 12 drives in raidz2
    and see how they perform under your load.

    Sounds like your resilver performance is highly dependent on the load, so testing under that load could shed some light. The mirrored pairs should scrub and resilver faster when there's no other filesystem IO going on just because they perform better; however, under a heavy load, they may show far less difference.

    You might also consider ashift=12 for 10TB drives with 4K native sectors.
    If the backup files are large, a larger recordsize will reduce the RAM consumed by metadata, and could potentially allow you to use an L2ARC.
    Again, compare two pools with different values, and see what performs best.
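
    For instance, roughly like this (pool and device names are placeholders; the second pool would be created after destroying the first, using the same 12 drives):

    Code:
    # pool layout 1: six mirrored pairs, aligned for 4K sectors
    zpool create -o ashift=12 testpool mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh mirror sdi sdj mirror sdk sdl
    # pool layout 2: a single 12-drive raidz2 vdev for comparison
    zpool create -o ashift=12 testpool raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl
    # larger records for big backup files (needs the large_blocks pool feature)
    zfs set recordsize=1M testpool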
     
    #20
