ZFS on Hardware-Raid

Stril · Dec 12, 2018

Hi!

Everywhere on internet forums, people say it's a "no-go" to use ZFS on hardware RAID, as it "likes/needs" to see the hardware, but:

There is no REAL explanation why!

I really like ZFS, but rebuild times are just awful... A hardware RAID rebuild is much faster. I am currently in a 120h rebuild for a 4TB drive in a mirror config...

Can you give me some input?
Is it really no "option" to use hardware-raid with ZFS?

Thank you and best wishes
Stril

EffrafaxOfWug · Dec 13, 2018

The point of ZFS really is that it's a one-stop-shop handling RAID, volume management, snapshots and files - and to do this well it should have direct control of all the discs involved in the filesystem. A hardware RAID card (or indeed putting ZFS on top of another volume management layer like LVM and/or mdadm) is an abstraction away from that. If something goes wrong with a layer beneath ZFS, then ZFS itself may not be able to cope with it. You can run it on a hardware RAID if you like, but then you'll lose the portability and rebuild speed.

Speaking of rebuilds... assuming you're talking about a regular two-disc RAID1 mirror and one or two vdevs, there's no way it should take anywhere near that long on healthy hardware. Rebuilds of mirrors should basically just be a sequential copy from old drive to new drive, and even then only of the files rather than the entire array, so it should be faster at rebuilding than regular RAIDs of the same size. This isn't likely to be a CPU bottleneck (since such an operation isn't typically computationally expensive). Are your discs getting thrashed on random IO or something? If you've got a dying drive in the array somewhere it'll likely also result in atrocious performance.

Stril · Dec 13, 2018

Hi!

@Resilver-Speed
I am using a pool with 10 drives - each 4TB --> 5 vdevs, 192GB memory, sync=disabled, dedup is on.

One drive crashed and has to be "resilvered".

Code:

zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zpool1  21,8T  6,63T  15,1T         -    66%    30%  3.58x  DEGRADED  -


zpool status zpool1
  pool: zpool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec 12 09:20:09 2018
    1,01T scanned out of 6,60T at 12,1M/s, 134h17m to go
    172G resilvered, 15,29% done
config:

        NAME                                                 STATE     READ WRITE CKSUM
        zpool1                                               DEGRADED     0     0     0
          mirror-0                                           ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c50084864751-lun-0    ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c500848ab931-lun-0    ONLINE       0     0     0
          mirror-1                                           ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c5008489b6e5-lun-0    ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c5008482f9f5-lun-0    ONLINE       0     0     0
          mirror-2                                           ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c50084864631-lun-0    ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c500847f3ef9-lun-0    ONLINE       0     0     0
          mirror-3                                           DEGRADED     0     0     0
            replacing-0                                      OFFLINE      0     0     0
              pci-0000:07:00.0-sas-0x5000c500847e7849-lun-0  OFFLINE      0     0     1
              wwn-0x5000c500a6a16c4b                         ONLINE       0     0     0  (resilvering)
            pci-0000:07:00.0-sas-0x5000c50084830899-lun-0    ONLINE       0     0     0
          mirror-4                                           ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c50084864639-lun-0    ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c500848646a5-lun-0    ONLINE       0     0     0
          mirror-5                                           ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c50084864791-lun-0    ONLINE       0     0     0
            pci-0000:07:00.0-sas-0x5000c5008485d06d-lun-0    ONLINE       0     0     0

errors: No known data errors

--> resilver seems to read ALL the data in the pool, not only that of the affected vdev.

This is VERY slow, so I thought about using hardware RAID.
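
As a rough cross-check of the "to go" figure in the status output above, the remaining time is just the unscanned amount divided by the current rate. This is a minimal sketch, assuming the T and M suffixes printed by zpool are binary units (TiB and MiB/s) and that the rate stays constant; `eta_hours` is a hypothetical helper name, not a ZFS tool.

```shell
# eta_hours TOTAL_TIB SCANNED_TIB RATE_MIB_S
# Prints the hours left if the scan keeps running at the given rate.
eta_hours() {
    # LC_ALL=C keeps the decimal point a dot even on comma-decimal locales.
    LC_ALL=C awk -v total="$1" -v scanned="$2" -v rate="$3" 'BEGIN {
        remaining_mib = (total - scanned) * 1024 * 1024   # TiB -> MiB
        printf "%.1f\n", remaining_mib / rate / 3600       # seconds -> hours
    }'
}

# Figures taken from the status output above: 6,60T total, 1,01T scanned, 12,1M/s
eta_hours 6.60 1.01 12.1
```

The result lands close to the 134h17m that zpool reports, which suggests the estimate is a plain linear extrapolation and will move whenever the rate changes.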

EffrafaxOfWug · Dec 13, 2018

Resilver/rebuild should only ever read the bits of the discs that contain the actual files; it's got no need to read empty/unallocated blocks since those don't need to be rebuilt. You're only using about a third of your entire pool so it should only need to read a third of the discs.

12MB/s seems outrageously slow (although I'm not as well versed in ZFS as many on this forum). Do you have iostat available? What's the load on the individual discs like? I assume there's no obvious thrashing of CPU/mem/swap? You've got dedupe enabled on at least 6TB of data so it might be worth checking your dedupe tables in the pool status. Check your ashift value as well esp. if your new drive has a different sector layout (e.g. 4kn vs. 512e) than the old ones.
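
One way to act on the dedupe suggestion: `zpool status -D zpool1` reports the number of dedup table (DDT) entries together with their on-disk and in-core size per entry. Multiplying entries by the in-core size gives a rough idea of how much RAM the table wants. The sketch below only does that arithmetic; `ddt_ram_gib` is a hypothetical helper and the sample numbers are made up for illustration, not taken from this pool.

```shell
# ddt_ram_gib ENTRIES BYTES_PER_ENTRY_IN_CORE
# Prints the approximate RAM needed to hold the whole dedup table, in GiB.
ddt_ram_gib() {
    LC_ALL=C awk -v entries="$1" -v bytes="$2" 'BEGIN {
        printf "%.1f\n", entries * bytes / (1024 ^ 3)
    }'
}

# Hypothetical example: 10 million entries at 320 bytes each in core
ddt_ram_gib 10000000 320
```

If the result is a large share of the RAM available for caching, dedup lookups are likely to hit the discs, which would fit a resilver that is bound by random IO.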

Stril · Dec 13, 2018

Hi!

IOSTAT does not show a lot of load:

Code:

iostat
Linux 3.16.0-4-amd64 (backuppc)         13.12.2018      _x86_64_        (24 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,89    0,03    1,55   13,06    0,00   84,46

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc              90,24       594,45       802,49  393880996  531731448
sdd              90,26       595,55       802,49  394610808  531731448
sdf              66,01       597,04       789,64  395600084  523216384
sde              65,89       596,48       789,64  395225568  523216384
sdh              77,11       598,50       801,58  396563760  531125904
sdj             126,82      1188,31       787,82  787374472  522011104
sdg              77,16       600,13       801,58  397644860  531125904
sdk              78,90       599,47       793,37  397209976  525684548
sdm              94,02       597,05       783,63  395604380  519230232
sdn              94,15       598,45       783,63  396532980  519230232
sdl              79,03       601,23       793,37  398376356  525684548
sda               1,32         6,38         4,24    4228203    2807809
sdb               1,08         1,16         4,24     769704    2807809
md0               1,53         7,54         4,08    4995631    2705894
dm-0              1,42         7,53         4,08    4990797    2705892
dm-1              0,00         0,00         0,00       1576          0
sdi              15,32         0,03       391,19      19016  259204347

sdi is the new "inserted" drive.

It just confuses me why over 6TB has to be resilvered...
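
A note on reading the numbers above: the first report printed by a plain `iostat` covers the time since boot, so it averages over the whole uptime rather than showing the load right now. Running it with an interval (for example `iostat -d 5 2`, or with `-y` on sysstat versions that support it) gives current figures. The sketch below ranks devices by read rate so that an outlier stands out; it assumes the six-column device layout and comma decimals shown above, and `rank_by_read` is a hypothetical helper name.

```shell
# rank_by_read: read iostat device lines on stdin and print
# "device kB_read/s", sorted with the busiest reader first.
rank_by_read() {
    LC_ALL=C awk '
        # Device rows start with a lowercase name and have six columns;
        # this skips the header, the avg-cpu rows and the banner line.
        NF == 6 && $1 ~ /^[a-z]/ {
            gsub(",", ".", $3)        # 1188,31 -> 1188.31
            printf "%s %s\n", $1, $3
        }' | LC_ALL=C sort -k2,2nr
}

# Three rows copied from the output above
printf '%s\n' \
    'sdc              90,24       594,45       802,49  393880996  531731448' \
    'sdj             126,82      1188,31       787,82  787374472  522011104' \
    'sdi              15,32         0,03       391,19      19016  259204347' \
    | rank_by_read
```

With the full output piped in, sdj comes out on top at roughly twice the read rate of the other pool members.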

gea · Dec 13, 2018

ZFS works well on hardware RAID, with three caveats:

1. ZFS uses a quite massive RAM-based write cache. To protect the write cache you can enable sync write. This logs every single committed write to an on-pool ZIL device. A commit from the ZIL must mean the data is on disk. A hardware RAID with its own cache cannot guarantee this to ZFS.

2. ZFS comes with data and metadata checksums. Every fault is detected by ZFS and repaired on the fly from ZFS redundancy/RAID. If you use hardware RAID, the array is a single disk from ZFS's point of view. This means ZFS cannot repair the fault, and the hardware RAID cannot either, as it is not aware of the problem.

3. Write hole problem
A hardware RAID updates disks sequentially. A crash during a write can mean a corrupt RAID and/or a corrupt filesystem. ZFS RAID is based on copy-on-write, which means that an atomic write (e.g. writing a RAID stripe, or data + metadata) is either done completely or discarded.
"Write hole" phenomenon in RAID5, RAID6, RAID1, and other arrays.

Regarding resilvering
ZFS resilvering is a low-priority process. It reads all data to check whether it needs to be copied to the new disk. With little data on the disk this is done within minutes. If the pool is full with a lot of small files, pool iops and RAM are the limiting factors.

A hardware RAID does not care about content. It simply duplicates the whole disk. In most cases this is slower than ZFS. Only on a quite full pool or with many small files is ZFS slower.

The caveats mean that you lose three essential ZFS features with a hardware RAID that is additionally slower in most conditions, so it is indeed a no-go.

Evan · Dec 13, 2018

At the enterprise level you will see a lot of ZFS with enterprise SAN backing, either as single disks (yes, I know you lose repair options) or as mirrors.

Some SUN systems run 100+ TB per volume like this. The world won’t end; it works fine. Keep in mind we are talking about true enterprise-grade storage on the back end. Not mid-range or anything else: high-end enterprise. So the whole topic of writes not being written is not up for discussion, because if that happens you have petabytes of other problems and corrupted databases.

Anyway, to your problem: I don’t know, but that’s definitely not right. Unless the host is totally 100% busy non-stop, ZFS should rebuild as fast as hardware RAID, or close enough to that.

Stril · Dec 13, 2018

Hi!

Thank you for your point - I will stay with HBAs.

@resilvering:
Is there any possibility to increase the resilver priority? The estimated time shown keeps increasing. I do not want to wait 140h to have a "protected" status...

zxv · Dec 13, 2018

On HP hardware, at least, there is an advantage to using a raid controller that has a battery backed write cache and configuring single disk (raid0) volumes for zfs.

It eliminates any performance decrease from the use of a ZIL journal, while providing power loss protection.

At least for HP controllers and arrays, it will be easier to locate drives by toggling LEDs using the RAID controller. I find that the HBAs often have one-off issues, which result in pulling the wrong drives.

It's possible to get SMART stats for each physical drive on the RAID controller.

That's my experience.....

gea · Dec 13, 2018

You can increase the priority.
For ZoL, google: increase resilver priority zol

Your main problem seems to be the 12 MB/s throughput. At that rate any resilver will be slow, and it should definitely be higher, even with an iops-bound load on these disks. I am not sure a higher priority will give you a massive boost.

From iostat:
It's not clear which disks belong to the pool, but the load is not even. This may indicate an unbalanced pool or a weak disk. Outside Oracle Solaris and dedup2, I would advise against using dedup, even with a lot of RAM.

Stril · Dec 13, 2018

Hi gea!

resilver is getting slower and slower... Yesterday, it was at 25 MB/s, now at 11,3 MB/s.

What I tried is to set zfs_scrub_delay=0 - without any effect.

I know, that dedup is hard to use, but it is a backup-target and I can achieve dedup-factor of >4...

@iostat:
All the disks belong to the pool, but sdj is in the same "mirror" group as sdi --> sdj should have more load, in my mind. The rest seems to be balanced (sda and sdb are boot drives).

sovking · Dec 13, 2018

Could dedup=on be slowing down the resilver? Dedup is usually discouraged: few benefits and a lot of trouble.

I also notice you are using ZFS on Linux (Linux 3.16.0-4-amd64). I don't know which specific variant, but that is not the best pairing. It is usually better to pair ZFS with Solaris/Illumos/OpenIndiana or FreeBSD/FreeNAS.

If you want the best of both worlds, you can put everything on ESXi + FreeNAS (or another OS) with the HBA passed through using VT-d, and then export the storage to your Linux environments.

Stril · Dec 13, 2018

Hi!

I cannot disable dedup for the live-data.

Do you have any comparison between ZoL and ZFS on Solaris?

Back to resilver:
Why does the resilver need to go through the whole pool and not only one vdev?

gea · Dec 13, 2018

If you look at ZFS as a filesystem, BSD, Illumos (the free Solaris fork) and Linux are quite similar. New features mostly appear first in Illumos, but they all try to keep the differences between Open-ZFS on each OS small.

When you look at the integration of ZFS and storage features into the OS, Solarish is superior. Sun developed OpenSolaris more or less around ZFS and storage. ZFS is the only option, not one among many. FreeBSD also has quite good integration, as it has used ZFS for some time. Additionally, a genuine ZFS v44 in Oracle Solaris is the fastest and most feature-rich storage server. Open-ZFS is not yet on par.

About resilver:
ZFS cares about data security above everything. This is why you cannot simply duplicate a faulted disk from the other mirror half, as this would mean a copy without checksum control. Sun decided a resilver must be based on checksums. This is why all data must be read to verify checksums. This is also why resilver time depends on pool iops, fill rate, file size and the RAM available to cache metadata and data.

Usually a whole-disk 1:1 copy is much slower than a resilver based on the real data on a disk.
OK, you have a problem with IO, but that's not the fault of ZFS; there must be another reason.

Stril · Dec 13, 2018

Hi!

about resilver:
Yes, there is IO on that system (nearly 24/7), but isn't there any possibility to prioritize the resilver process over the "regular" IO?

gea · Dec 13, 2018

By far, the fastest resilvering is sequential resilvering on a genuine Oracle Solaris ZFS
Sequential Resilvering

There are tuning options on Open-ZFS like zfs_scrub_delay to increase priority over regular IO.
This does not answer why your resilver IO is 12 MB/s (even a slow single disk should be 5x faster).

A resilver of a mirror disk on Open-ZFS should be in the range of hours, not days.
This is what I would expect on Illumos/OmniOS, which is what I mostly use.

Tuning is about increasing performance say 5-30%.
If your result is around 1/10 of a good result, your problem is not tuning.

EffrafaxOfWug · Dec 13, 2018

Assuming these are bog-standard 4TB drives, the random IO on the current workload looks relatively high (but of course that depends on the nature of the workload). What does `iostat -x`* look like? TPS in the region of 80-100 might indicate a high random IO load which would conceivably wreck your resilver times in such a fashion. I assume there aren't any SSDs, SLOG or L2ARC involved here? What's the nature of the data on the array, you say it's for backups but is this lots of big archive/VM files or lots of little files/hardlinks/diffs or a combination of the two? Do you have any blinkenlights on the drive trays to give you an indication of whether the drives might be being thrashed?

(Sorry, assumed earlier that there wasn't any current IO on the array in question).

* IIRC there's also an iostat subcommand within zpool which might give more consistent results but I'm not sure on the important differences.

zxv · Dec 13, 2018

This discusses tuning resilver of zfs on linux:
Resilvering raidz - why so incredibly slow? : zfs

To show the current values:
head /sys/module/zfs/parameters/*{resilver,inflight}*

To speed up resilver:
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

To return to default values:
echo 2 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 32 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 3000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

It's quite possible the tuning will slow down filesystem IO if they are competing for IOPS.
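
Since the defaults can differ between ZFS on Linux releases, and some of these parameters may not exist at all on other versions, it can be safer to record the current value before overwriting it instead of relying on a fixed list of defaults. The following is a minimal sketch of that idea; `set_param`, `restore_params` and the `ZFS_PARAM_DIR` variable are hypothetical names introduced here, and only the `/sys/module/zfs/parameters` path comes from the commands above.

```shell
# Directory holding the ZFS module parameters; can be overridden for a dry run.
ZFS_PARAM_DIR="${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}"

# set_param NAME VALUE
# Prints "NAME OLD_VALUE" so the old setting can be saved, then writes VALUE.
# Returns 1 without changing anything if the parameter file does not exist.
set_param() {
    param_file="$ZFS_PARAM_DIR/$1"
    if [ ! -f "$param_file" ]; then
        echo "skip $1: not present in this ZFS version" >&2
        return 1
    fi
    old=$(cat "$param_file")
    echo "$1 $old"
    echo "$2" > "$param_file"
}

# restore_params FILE
# Re-applies every "NAME OLD_VALUE" line previously printed by set_param.
restore_params() {
    while read -r name value; do
        set_param "$name" "$value" > /dev/null
    done < "$1"
}

# Example (as root), using the values from the post above:
#   set_param zfs_resilver_delay 0          >> /root/zfs-params.saved
#   set_param zfs_top_maxinflight 512       >> /root/zfs-params.saved
#   set_param zfs_resilver_min_time_ms 5000 >> /root/zfs-params.saved
#   ...after the resilver has finished...
#   restore_params /root/zfs-params.saved
```

Values written this way do not survive a reboot or a module reload, so the saved file is only needed to undo the change on a running system.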

Stril · Dec 13, 2018

Hi!

Thank you!
Last night, the resilver became faster. Resilver should be finished in 35h...

I am a bit worried about the resilver times, as I want to build a larger ZFS storage within the next weeks (24x10TB). If that storage were filled to 75%, a rebuild would have to scan about 80 TB.
What is your experience with these setups? Doesn't a resilver run for weeks?
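
To put the 80 TB figure into perspective, the scan time depends almost entirely on the sustained scan rate. A back-of-the-envelope sketch, assuming decimal units (1 TB = 1,000,000 MB), a constant rate, and that the whole 80 TB really has to be scanned; `scan_days` is a hypothetical helper and the rates other than 12 MB/s are illustrative, not measurements.

```shell
# scan_days SIZE_TB RATE_MB_S
# Prints how many days a scan of SIZE_TB takes at a constant RATE_MB_S.
scan_days() {
    LC_ALL=C awk -v tb="$1" -v rate="$2" 'BEGIN {
        seconds = tb * 1000 * 1000 / rate   # TB -> MB, then MB / (MB/s)
        printf "%.1f\n", seconds / 86400
    }'
}

# 12 MB/s is the rate seen in this thread; 100 and 500 are illustrative only.
for rate in 12 100 500; do
    echo "$rate MB/s: $(scan_days 80 "$rate") days"
done
```

At the 12 MB/s seen here, 80 TB would take about 77 days, while at 100 MB/s it drops to roughly 9 days, so the rate the pool can sustain under load matters far more than any priority setting.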

zxv · Dec 13, 2018

One option would be to benchmark them.
For instance, comparing two contrasting pools:
- one pool with six vdevs of mirrored pairs
- one pool with one vdev of 12 drives in raidz2
and see how they perform under your load.

Sounds like your resilver performance is highly dependent on the load, so testing under that load could shed some light. The mirrored pairs should scrub and resilver faster when there's no other filesystem IO going on just because they perform better; however, under a heavy load, they may show far less difference.

You might also consider ashift=12 for 10TB drives with 4k native sectors.
If the backup files are large, a larger recordsize will reduce the RAM consumed by metadata, and could potentially allow you to use an L2ARC.
Again, compare two pools with different values, and see what performs best.
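
For reference, ashift is the base-2 logarithm of the sector size the pool assumes, so 512-byte sectors correspond to ashift=9 and 4096-byte sectors to ashift=12. A small sketch of that relationship follows; `ashift_for` is a hypothetical helper, and the pool and device names in the comments are placeholders, not taken from this thread.

```shell
# ashift_for SECTOR_BYTES
# Prints log2 of the sector size (expects a power of two).
ashift_for() {
    size=$1
    shift_val=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        shift_val=$((shift_val + 1))
    done
    echo "$shift_val"
}

ashift_for 512     # legacy 512-byte sectors
ashift_for 4096    # 4k native sectors

# The physical sector size of each drive can be checked with:
#   lsblk -o NAME,PHY-SEC,LOG-SEC
# and ashift is chosen at pool creation, e.g. (placeholder names):
#   zpool create -o ashift=12 testpool mirror /dev/sdX /dev/sdY
```

The ashift of a vdev is fixed when the vdev is created, which is one more reason to settle it on a test pool first.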
