EXT4 optimization (Journal, checksum, QEMU, LVM, MDADM)

MrCalvin

IT consultant, Denmark
Aug 22, 2016
56
14
8
48
Denmark
www.wit.dk
I've decided to use mdadm RAID1 (root) and RAID5 (data) on my servers, and my final choice of filesystem was EXT4 for both (I did consider btrfs, XFS and ZFS, but weighing all the pros/cons I ended up with EXT4).

I've started wondering about a few things in regards to performance optimizing:

Twice journal write:
If my QCOW2 file is located on an EXT4 filesystem on the host, and the filesystem inside the QCOW2 is also EXT4, one could say the journal is written twice. Would it be safe to disable the journal in the QCOW2 disk? Would I still be protected in case of a crash?
I assume using a Logical Volume (LVM) directly as the guest disk wouldn't give me any additional "enhancement" in relation to journaling and checksums, as LVM does neither, right?
But at least the journal wouldn't be written twice, and the other filesystem overhead on the host would be avoided.

EXT4 data checksums:
AFAIK EXT4 doesn't have data checksums (only metadata checksums), but if EXT4 is located on a RAID5 volume, would I get data checksums, similar to e.g. ZFS, because of the RAID5 data checksum?
If so, what about RAID1? Will that also give me data checksums?
 
  • Like
Reactions: gigatexal

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,265
428
83
If my QCOW2 file is located on an EXT4 filesystem on the host, and the filesystem inside the QCOW2 is also EXT4, one could say the journal is written twice. Would it be safe to disable the journal in the QCOW2 disk? Would I still be protected in case of a crash?
I don't have any direct experience with QCOW2 images myself, but no: the two journals are completely different, and you'll need to keep both enabled to keep things crash-consistent.

I assume using a Logical Volume (LVM) directly as the guest disk wouldn't give me any additional "enhancement" in relation to journaling and checksums, as LVM does neither, right?
Correct.

AFAIK EXT4 doesn't have data checksums (only metadata checksums), but if EXT4 is located on a RAID5 volume, would I get data checksums, similar to e.g. ZFS, because of the RAID5 data checksum?
If so, what about RAID1? Will that also give me data checksums?
mdadm doesn't do checksums. RAID5 only has parity, with no explicit recovery mechanism if things go wrong with the data itself, and the overlying LVM and/or filesystem layers have no idea what the blocks on the disc are doing. Because ZFS does the equivalent of the RAID, volume management and filesystem layers itself, it's able to use the same checksums for error detection and correction throughout.
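As a toy illustration of the distinction (plain Python with made-up block contents, nothing mdadm- or ZFS-specific): XOR parity over a stripe can flag that *something* is inconsistent, but can't say which block went bad, while a per-block checksum pinpoints it:

```python
import hashlib
from functools import reduce

# Three "data blocks" and their XOR parity, a stand-in for a RAID5 stripe.
blocks = [b"alpha---", b"bravo---", b"charlie-"]
parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Per-block checksums: the extra piece ZFS/btrfs keep and RAID5 does not.
sums = [hashlib.sha256(b).digest() for b in blocks]

# Silent corruption: one block changes, no I/O error is reported.
blocks[1] = b"bravo--X"

# Parity only says "this stripe is inconsistent"...
stripe_ok = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks) == parity
print("stripe consistent:", stripe_ok)    # False -- but which block is bad?

# ...while checksums identify the bad block, so redundancy can repair it.
bad = [i for i, b in enumerate(blocks) if hashlib.sha256(b).digest() != sums[i]]
print("blocks failing checksum:", bad)    # [1]
```

With only the parity, all you could do is rebuild a block you already *know* is missing (a failed disk); you can't arbitrate between N plausible-looking blocks.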

In reality metadata checksumming is a lot more useful than it sounds, at least as far as filesystem reliability (rather than data integrity) goes, but if you want an end-to-end checksummed filesystem your choices are basically ZFS or btrfs. Personally I think the chances of bit-rot have been somewhat overblown, so it doesn't rank highly on my list of filesystem must-haves yet; I've been checksumming the static data on my file server for nearly 15 years now and I'm yet to experience any spontaneous file corruption.
 
  • Like
Reactions: arglebargle

MrCalvin

IT consultant, Denmark
Thanks for clarifying :)
Yea... I've been working professionally in IT for 25 years, and never met corruption in a way where data checksums would have helped.
I believe hardware RAID, which the whole industry has been running on forever, doesn't have data checksums either. It hasn't been a problem so far :p
I did find a thread where a guy had corrupted data on a USB hdd connected to a router, probably a cheap $25 one. I don't think that's a comparable setup, haha
 

EffrafaxOfWug

Radioactive Member
Yeah, I'd love for ECC-all-the-way to become the norm eventually as it's a "nice to have" (and essentially "free" with modern x86 CPUs that include the crc32c instruction) but ZFS and btrfs aren't suitable for me at present. Maybe some day this century btrfs will become production-ready...! In the meantime ext4 is the best compromise between robustness and flexibility IMHO.

Hardware RAID cards will normally use ECC memory for their caches to protect against random bit-flips in memory, but yeah, it's the filesystem and not the RAID layer where the checksums need to be if you want recovery to work; that means you need filesystem support for it, and there's still a vanishingly small number of filesystems that have it. To add to your anecdote about the USB hard drive, the only times I've had unrecoverably corrupted data it has always been due to faulty hardware - although chances are a checksum-enabled filesystem (with regular checks) will at least tell you when integrity is in question. Although I think the chances of a non-business router treating USB storage as anything other than horrible are very low...!

As an aside, the HAMMER2 filesystem in Dragonfly BSD is very interesting feature-wise and worth having a play with if that's your bag.

As an even larger aside, one time I was in Denmark it was near christmas I had some utterly delicious roast pork with crackling, hopefully you're having something similarly delicious for easter! :)
 

MrCalvin

IT consultant, Denmark
Wow, that's an impressive feature list for HAMMER2, and they themselves say it's production-ready.
I've never really considered BSD as a "general" server OS (VM host, web, mail, SQL (even MS-SQL), dotnet core, samba etc.). I have an idea that there are rather limited services available, but I might be wrong. I know the firewall, OPNsense, that I run as a VM on my servers runs on BSD, but that's as close as I get to BSD.

Yea, pork with crackling is very Danish, glad you liked it :) But I think most go for lamb at Easter, myself included.
 

msg7086

Active Member
May 2, 2017
256
69
28
33
I had some file corruption with qcow2 that was beyond repair and ended up digging the data out of a local backup. I'll personally rely on direct LVM instead of ext4-on-qcow2-on-ext4.
 

gea

Well-Known Member
Dec 31, 2010
2,519
852
113
DE
When I read your comments, I wonder: why do you want to use ext4 + mdadm/RAID5 instead of ZFS + RAID-Zn?

Is there silent data corruption?
Yes, there is. I have seen it, e.g. in the form of corrupted images (half of the image black), or when ZFS informs me about checksum errors in a file (not too often, but it happens). Real data checksums on ZFS inform you about the problem. Ext4 cannot, as it has no data checksums. RAID5 cannot, as it only has parity over the RAID stripes, not checksums of the real data. And even when a RAID detects a problem, e.g. on a RAID1 where the two copies differ (mostly undetected on conventional RAID, as it does not compare both mirrors on reads and cannot validate what it has read), it cannot repair it: which copy is valid? With ZFS and redundancy, errors are detected and repaired on the fly (or during a pool-wide scrub), as ZFS knows, e.g. in the RAID1 example, which copy is valid.
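The RAID1 case can be sketched in a few lines of Python (a hypothetical two-copy "mirror", not real md or ZFS code): without a checksum the two copies merely disagree; with one, the valid copy is identified and can overwrite the bad one:

```python
import hashlib

# Two copies of a "block", as on a RAID1 mirror, plus a checksum stored
# separately in filesystem metadata (what ZFS adds and plain RAID1 lacks).
checksum = hashlib.sha256(b"important data").digest()
mirror = [b"important data", b"important data"]

# One side rots silently.
mirror[0] = b"importent data"

# A conventional RAID1 can at best notice the copies differ -- it has
# no basis for deciding which side to trust.
print("copies differ:", mirror[0] != mirror[1])   # True

# With the checksum, the valid copy is identified and rewritten over the
# bad one ("self-healing" on read, or during a scrub).
good = next(c for c in mirror if hashlib.sha256(c).digest() == checksum)
mirror = [good, good]
print("repaired:", mirror[0] == mirror[1] == b"important data")   # True
```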

How often?
Data corruption can occur due to bad drivers, cabling, backplanes etc. Beside that, it happens by chance on disk. Not very often, but the number of errors depends on how long the data sits on disk and how large the storage is. If the disk is quite small and you recheck soon, everything is OK. If it is a large archive storage and you check after some time, you will find errors (if you can detect them with checksums; otherwise you have errors but cannot detect them).

What happens on a crash during a write?
In this case you are affected by the write-hole phenomenon: http://www.raid-recovery-guide.com/raid5-write-hole.asp
When you write a file, you write first the data and then the metadata. A crash in between can corrupt your filesystem. When you write to a RAID 1/5/6, you must update the disks one by one. A crash in between gives you a corrupted RAID.
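A toy simulation of the write hole (hypothetical two-data-disk stripe in plain Python, not a model of any real RAID implementation): the data block lands on disk but the crash prevents the parity update, so a later rebuild from that stale parity quietly produces garbage:

```python
def xor(a, b):
    """XOR two equal-length byte strings, as RAID5 parity does per stripe."""
    return bytes(x ^ y for x, y in zip(a, b))

# A RAID5 stripe: two data blocks plus parity, all consistent.
d0, d1 = b"AAAAAAAA", b"BBBBBBBB"
parity = xor(d0, d1)

# Updating d0 takes two separate writes: the data block, then the
# recomputed parity. Simulate a crash after the first write only.
d0 = b"CCCCCCCC"      # new data hits the disk...
# (crash here -- the parity update never happened)

# After reboot the stripe is silently inconsistent: rebuilding d1 from
# d0 and the stale parity no longer yields b"BBBBBBBB".
rebuilt_d1 = xor(d0, parity)
print(rebuilt_d1 == b"BBBBBBBB")   # False -- the write hole
```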

ZFS has Copy-on-Write, which means an atomic write (data + metadata, or a whole RAID stripe) is either completed or discarded. No filesystem or RAID corruption on a crash. As a bonus you get read-only snapshots (protection against ransomware).
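The copy-on-write idea in miniature, as a rough Python sketch (an atomic-rename analogy, not how ZFS is actually implemented): never overwrite live data in place; write the new version elsewhere, then switch over in a single atomic step:

```python
import os
import tempfile

# Set up a file holding the "live" data.
path = os.path.join(tempfile.mkdtemp(), "data")
with open(path, "wb") as f:
    f.write(b"old version")

def cow_update(path, new_bytes):
    """Update a file copy-on-write style: write aside, then atomically swap.

    A crash at any point leaves either the complete old version or the
    complete new one, never a half-written mix -- the same guarantee ZFS
    gets by writing new blocks and then flipping a pointer in its tree.
    """
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(new_bytes)
        f.flush()
        os.fsync(f.fileno())   # new copy fully on stable storage first
    os.replace(tmp, path)      # atomic switch: readers see old or new

cow_update(path, b"new version")
print(open(path, "rb").read())
```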

ZFS has superior RAM-based read/write caches, with an optional SSD read-cache extension (L2ARC).

ZFS can protect the content of the RAM-based write cache with its ZIL/SLOG devices.
Without ZFS you need a hardware RAID with BBU/flash protection, and even then the write-hole problem remains.

What features are you missing on ZFS?
The main one may be RAID conversion. This is underway but not yet ready on ZFS.
 

Blinky 42

Active Member
Aug 6, 2015
565
201
43
44
PA, USA
I am also curious what prompted you to decide on ext4 over XFS or ZFS? For most use cases both are far more stable and higher-performance than ext4.
With hardware or software raid5, I would urge caution and solid backups if the physical volumes are large. It is quite easy to hit errors during a rebuild and end up losing the whole raid5 volume. I am much more a fan of raid6 (or 60) + hot spares if you have enough drive slots.
The next level of data security is ZFS, as gea mentioned, but implementing that level of smarts requires ZFS to be in control of more of the system directly, and it changes how you tune things compared to traditional filesystem-level tuning, where the filesystem parameters are the only tweakable items and everything above that is common in the kernel across filesystems.

This coming from someone who has been using XFS + LVM and hw or sw raid for a decade+ and several PB of production data.
 

EffrafaxOfWug

Radioactive Member
I am also curious what prompted you to decide on ext4 over XFS or ZFS? For most use cases both are far more stable and higher-performance than ext4.
What are the advantages of XFS over ext4? In my experience the perceived performance of XFS has come at the cost of reliability; XFS remains the only filesystem I've actually lost after a power failure, and in any benchmarks I've done on my workloads the performance difference between XFS and ext4 is negligible.

(The deal breaker for me is that XFS can't be shrunk, so I'll never use it apart from for throwaway/testing purposes, but I'm genuinely curious as to what the advantages are)
 

Blinky 42

Active Member
I never really need to shrink filesystems; the only case where that happens is if some project has "finished" or gone into a read-only-like state, and at that point I'll just carve out a new filesystem of the size needed, rsync things over and reclaim the old LV. It is more common that I'll set up an archive spot on a different file server that is the proper size, migrate the data there and free up storage on the primary host.

For my main 2 use cases of:
- Tens/hundreds of millions of small/medium files in one filesystem
- Large media files (1-20GB) + associated metadata files (a few each, combined < 10% of the main file size)
XFS has been the most stable and best-performing filesystem of the set. After getting repeatedly burned by ReiserFS (omg no!), JFS, EXT* and early versions of ZFS on Linux, we just use it everywhere now for data and temp space.
 
  • Like
Reactions: arglebargle

arglebargle

H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈
Jul 15, 2018
656
233
43
It sounds like you're trying really hard to invent your own ZFS with a collection of RAID5, LVM, EXT4, QCOW2 and more EXT4. I can't urge you strongly enough not to do this. You can piece something together that functions, sure, but when shit breaks (and something is going to break, it always does) you're going to have a hell of a time figuring out where in the stack it broke, how to fix it, and what else broke or got corrupted while you were fixing the first thing. There's also no way to detect or recover from data corruption without going to backups, verifying your entire array against the backups (discrepancy! which data is correct! roll a d20 to decide!) and then praying that you can figure out which data is correct.

You really, *really* don't want to do this unless you really, really like complicated disaster recovery. And if you like that then you should probably run Btrfs because it'll be a hell of a lot more flexible than what you're considering right now.

I would take a step or five back from this and look into Btrfs and ZFS again. I've blown up RAID arrays in creatively stupid ways for the last 30 years and what you want now are checksums, snapshots, and incremental send and receive so that you can make sane backups and be confident that the data on your drives (and in your backups!) is what you expect it to be.

LVM almost offers these features (snapshots are great!) but falls flat because you aren't assured that your snapshot will still exist if you run low on space, and you still don't have a way to verify data correctness.

At the very least you'll want to either run SnapRAID (or something similar) or just roll a recursive checksumming script that'll at least tell you which data is correct (even if you can't repair it) if you ever have to do full DR.
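A minimal version of such a recursive checksumming script might look like this in Python (hypothetical function names; a real deployment would persist the manifest to disk rather than keep it in memory):

```python
import hashlib
import os

def sha256_file(path, bufsize=1 << 20):
    """Hash one file in chunks so large media files don't blow up RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest

def verify(root, manifest):
    """Return the relative paths whose current hash no longer matches."""
    return sorted(p for p, digest in manifest.items()
                  if sha256_file(os.path.join(root, p)) != digest)
```

Run build_manifest() right after taking a backup and verify() before trusting the data during DR; a non-empty result tells you exactly which files to restore from backup rather than rolling that d20.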
 
  • Like
Reactions: fohdeesha