[ZFS] mirroring vs raidz performance considerations


3nodeproblem

Member
TL;DR: Given a system where CPU may be a constraining factor, should I expect better performance from mirrored vdevs or from a 3-disk raidz1?

---


After running a smaller GlusterFS cluster for some time, I am now upgrading hardware, and after a lot of reading I finally feel ready to do my first ZFS setup.

The system will consist of 3 glusterfs servers (replica 2 arbiter 1), with the following specs on each:

  • Ryzen 3600
  • 128 GB RAM
  • 8 SATA ports
  • 1 M.2 PCIe slot (to be populated with a 2TB NVMe drive, used as both system drive and L2ARC/SLOG/special device)
  • 1 PCIe Gen4 x16 slot (unused for now)
  • 2 10GbE NICs
  • Debian 11

My plan is to have 2 GlusterFS volumes, and this is what I'm thinking for the 2 storage servers (in GlusterFS the arbiter does not hold actual data):

  • media (larger files, mostly torrents and media for streaming; reads mostly sequential)
  • data (everything else for hosted services; docker images, logs, config, databases, git repos; varied loads with lots of random access)
    • 4-6 2.5" 2TB SSDs (currently deciding between WD Blue, Seagate Barracuda 120 and Crucial MX500, assuming I still want drives with their own DRAM cache)
    • L2ARC/SLOG/special: ~700GB partitions on NVMe

I am aware that the NVMe drive may become a bottleneck and that it's advisable to put the "special" vdev on redundant storage; if that turns out to be an issue, I could still plug a 2xNVMe card into each storage server's PCIe slot and add more NVMe drives at a later stage.
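
Roughly what I have in mind for wiring those NVMe partitions into a pool; pool name and partition numbers are just placeholders:

    # assuming a pool called "data" and three spare partitions on the NVMe
    zpool add data cache /dev/nvme0n1p4       # L2ARC (read cache)
    zpool add data log /dev/nvme0n1p5         # SLOG (sync write log)
    zpool add data special /dev/nvme0n1p6     # special vdev for metadata/small blocks
    # zpool will likely complain about the mismatched redundancy of the special
    # vdev and want -f, since losing it loses the pool (hence the note above)
    zfs set special_small_blocks=32K data     # optionally send small blocks to the special vdev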

Now, I am having trouble settling on the number of disks and vdev layouts before purchasing.

  • For the media volume, should I go with a 3x8TB raidz1 or a 2x12TB mirror, assuming a single-vdev zpool?
  • For the data volume, two striped mirror vdevs of 2x2TB each, or a single raidz1 vdev? (Rough zpool layouts for both are sketched below.)
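
To make the alternatives concrete, this is roughly what I am comparing (disk names below are placeholders):

    # media, option A: one raidz1 vdev of three 8TB drives
    zpool create media raidz1 sda sdb sdc
    # media, option B: one mirror vdev of two 12TB drives
    zpool create media mirror sda sdb

    # data, option A: two striped mirror vdevs of 2x2TB each
    zpool create data mirror sdd sde mirror sdf sdg
    # data, option B: one raidz1 vdev of three 2TB SSDs
    zpool create data raidz1 sdd sde sdf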

Since I will have offsite backups and GlusterFS provides additional redundancy, I am satisfied with any safety above zero, so the failure tradeoffs between the alternatives are not really a concern.

I have seen conflicting information on the performance characteristics of raidz1 versus mirrors. The obvious answer is that mirroring is less CPU-taxing because no parity has to be calculated, but how much weight should that really carry? Raidz1 feels like the sweet spot for storage utilization for the data vdev, and mirroring (2x12TB) for media, but will I pay a significant performance tax for either? This benchmark in particular confuses me: the author gets significantly better performance from raidz1 than from mirrors.

The Ryzen 3600 is a 6-core/12-thread CPU and these servers will be dedicated to storage, but considering the overheads of GlusterFS, encryption (recently added to stable ZFS on Linux), L2ARC/SLOG, scrubbing, and (optional, if it can be afforded) compression, I am unsure how to reason about it.
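
In case it matters for the discussion, these are the per-dataset knobs I expect to be tuning to trade CPU against IO (dataset names are just examples):

    zfs set compression=lz4 data              # lz4 is cheap CPU-wise and usually a net win
    zfs set atime=off data                    # skip the metadata update on every read
    zfs set recordsize=1M media/streaming     # large records for mostly-sequential media
    zfs set recordsize=16K data/databases     # smaller records for random database IO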
 

EffrafaxOfWug

Radioactive Member
The "killer" CPU-wise will be ZFS native encryption, not parity; the hit is larger than using LUKS, at least on ZoL. However, if it were me if I were using platter-based HDDs I would use mirrors because they're much easier to expand


Apples to oranges, but my Debian system can read from one of its LUKS arrays at ~1.8GB/s using about 120% of a core on a 3700X. Parity calculation barely registers under real-world usage and hits a maximum of about 30-50% of a core during rebuilds/resyncs. The 3600 is only marginally slower than the 3700X in single-thread performance, so it shouldn't have any problem with the ZFS side of things.

Personally, there's no way I'd run my boot/system drives without a mirror if I could avoid it, but that's just me; I've run into too many cases where I've lost a sole boot drive and spent too long restoring. Even better, removing one half of a RAID1 before a big OS upgrade gets you an almost-instant rollback if something goes wrong. I guess you're building two redundant nodes here, but it's still something that'd keep me up at night ;)

Does your media pool really need an L2ARC and a SLOG? Assuming it's primarily serving only a couple of clients at any one time, regular drives should be able to handle it without any appreciable performance problems (unless you're running like a hojillion torrents simultaneously), since you've already got your random IO hitting a different set of drives. If your data pool on the SSDs isn't using sync writes then I'd say it doesn't warrant an L2ARC or a SLOG either; if it is using sync writes, I think you'd still be better off skipping the L2ARC (reads from the SSDs should be hella fast and you needn't burn RAM and NVMe space on it), but a good SLOG like an Optane may be beneficial depending on your write requirements.

(Hopefully someone more au fait with ZFS can give you better advice here, though; I don't use it at home because it's very hard to expand, so take the above with a generous dose of sodium chloride.)

Regarding your link, I'd take those benches with a very large pinch of salt; dd is a very poor test of IO performance and has all sorts of limitations. Much better to use iometer (which also gives you much more detailed results). It's also a fairly old ZoL test, so I'd suggest running the same sort of tests yourself if no-one else has done better ones.
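
If you do end up rolling your own tests, something like fio (just as an alternative to iometer; paths and sizes below are only examples) gives you repeatable random and sequential numbers:

    # random 4k read/write mix, roughly the "data" workload
    fio --name=randrw --directory=/tank/test --rw=randrw --bs=4k \
        --ioengine=libaio --iodepth=32 --numjobs=4 --size=16G \
        --runtime=60 --time_based --group_reporting
    # sequential 1M reads, roughly the "media" workload
    fio --name=seqread --directory=/tank/test --rw=read --bs=1M \
        --ioengine=libaio --iodepth=8 --numjobs=1 --size=16G \
        --runtime=60 --time_based --group_reporting
    # note: with 128GB of RAM you'll want --size well above the ARC (or
    # primarycache=metadata on a scratch dataset), otherwise reads are mostly cache hits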

SSD-wise though I would steer well clear of "consumer" drives if at all possible; their performance is often highly inconsistent, especially under long-running loads (and a set of VMs/containers will be making constant small reads and writes). If it'll fit in your budget, at least consider some "enterprise-lite" drives like the Micron 5200/5300 or the Intel D3-S4510. Sadly these are both a lot more expensive than they were six months ago, at least in the UK.
 

3nodeproblem

Member
@EffrafaxOfWug Great food for thought! And good point on the benchmarks not necessarily being representative of the current state of things.

Regarding the need for L2ARC/SLOG: I do think it would still make a big difference for the media drives. I am currently running everything on a scaled-down setup with GlusterFS (no ZFS) straight on LVM/LUKS, with a single 8TB Micron 5210 ION SSD on each server, and the performance is pretty horrible (some/most of which might be explained by CPU context switching on the embedded-CPU servers and too much happening on the GbE network links, but still).

My thinking for the media volume is that even if media streaming is sequential in nature, the constant seeding has a good chance of interrupting the sequential accesses on the mechanical drives, causing a lot of seek thrashing (I hope that's the right term: the head moving back and forth and waiting on rotation). I'm thinking well-filled read and write caches should largely prevent that.


Either way, after thinking about it a bit more, I've narrowed it down:

  • media: mirrored 12TB drives
  • data: 3x2TB Seagate Barracuda 120 SSDs in a raidz1. In my country it's at the same price point as the other cheapest-but-still-not-cache-less 2TB SSDs (almost half the price of e.g. a Micron 5210), but with a 1170 TBW rating and a 5-year warranty. Not strictly apples to apples, but if I understand it right that's still roughly in the same range as the DWPD ratings of the 5210 ION (rough conversion below), and even if in practice it means the Seagates fail faster, I can still afford to keep a couple of spare drives on hand for a lower total cost and significantly higher write performance.
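
For the record, my back-of-the-envelope TBW-to-DWPD conversion for the Barracuda (1170 TBW, 2TB, 5-year warranty):

    echo "scale=2; 1170 / (2 * 365 * 5)" | bc    # ~0.32 drive writes per day

That 0.32 DWPD is the number I'm holding up against the 5210 ION's datasheet endurance figures.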

Reading the post and comments on ZFS native encryption, it's obvious that LUKS is the more performant choice today (even with the performance improvements in 0.8.4 mentioned in the comments). I will still try to go with native encryption, since performance should keep improving over time, and I hope it will let me enter my passphrase only once per boot rather than once per physical drive per boot ;)
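
What I'm hoping works is a single passphrase-protected encryption root per pool that the child datasets inherit, so one prompt unlocks everything at boot (dataset names are placeholders):

    # one encryption root; children inherit it and need no extra passphrase
    zfs create -o encryption=aes-256-gcm -o keyformat=passphrase data/enc
    zfs create data/enc/docker
    zfs create data/enc/git
    # at boot: one prompt per encryption root, then mount everything
    zfs load-key -a
    zfs mount -a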