SW or HW RAID6?

fossxplorer · Sep 17, 2016

Reading "Software vs hardware RAID performance and cache usage" confuses me.
Today I'm running Linux md RAID10 with 4 x Hitachi 4TB NAS drives. As I only have 8TB of available space, I'm planning for the future, and RAID6 based on 5 x 6TB disks seems good to me space-wise.
I will then be able to get 18TB of available space.
This is for a server inside a DC, so there is a UPS etc. and I won't have too many power failures to deal with. What I'm thinking is that with Linux md RAID6, data will be cached in the OS cache, so it's protected solely by the mains and the UPS and nothing else (i.e. no battery).
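
As a quick check of the capacity figures above, here is a minimal sketch. It assumes a two-copy RAID10 layout and ignores filesystem and metadata overhead; the function name `usable_tb` is made up for illustration.

```python
def usable_tb(level: str, disks: int, disk_tb: float) -> float:
    """Usable capacity in TB, ignoring filesystem and metadata overhead."""
    if level == "raid10":
        # Assumes two copies of every block, so half the raw space is usable.
        return disks * disk_tb / 2
    if level == "raid5":
        # One disk's worth of space goes to parity.
        return (disks - 1) * disk_tb
    if level == "raid6":
        # Two disks' worth of space go to the two parity syndromes.
        return (disks - 2) * disk_tb
    raise ValueError(f"unsupported level: {level}")


current = usable_tb("raid10", 4, 4.0)  # today's 4 x 4TB RAID10
planned = usable_tb("raid6", 5, 6.0)   # planned 5 x 6TB RAID6
print(f"current: {current} TB, planned: {planned} TB")
```

This reproduces the 8TB and 18TB quoted in the post.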

So my question is: is it dangerous to go for Linux md RAID6? If HW RAID is the better solution, what kind of prices do these (used) BBU-backed controllers go for?
I was also told once at #centos that Linux md RAID isn't recommended with more than 8 drives, and ideally 4.

Thanks.

gea · Sep 17, 2016

With any Raid-6 solution, no matter whether software raid or hardware raid, a raid stripe set is written disk by disk sequentially. On a crash during a write you are in danger of a corrupted filesystem and/or raid array.

This is due to the write hole problem on Raid-6:
"Write hole" phenomenon in RAID5, RAID6, RAID1, and other arrays.

There are two common ways to handle this. One is a hardware raid controller + BBU, which can reduce the problem. The other solution is software raid with a CopyOnWrite filesystem, as there a write is done completely (data, metadata, raid stripe) or not at all. No partly updated filesystem or raid arrays - 100% crash resistant by design.
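
To make the write hole concrete, here is a toy sketch. It uses a single XOR parity block (RAID-5 style) for brevity; RAID-6 adds a second syndrome but has the same exposure when a stripe is only partly updated. The block contents and the helper name `xor_blocks` are invented for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# An update to d0 reaches its disk, but the crash hits before the matching
# parity write, so the parity left on disk still describes the old d0.
d0_new = b"CCCC"
stale_parity = parity

# Later the disk holding d1 fails and is rebuilt from d0_new and the stale parity.
rebuilt_d1 = xor_blocks(d0_new, stale_parity)
print("d1 rebuilt correctly:", rebuilt_d1 == d1)
```

The rebuilt block no longer matches the original d1, even though d1 itself was never written to. That silent damage to unrelated data is what a BBU-backed cache or an all-or-nothing CopyOnWrite update is meant to prevent.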

So the best solution would be not to use Raid-6 with ext4 but to use ZFS, with a more robust raid than Raid-6 and better overall data security due to checksums and CopyOnWrite. Nice extras are snaps, the advanced cache mechanism and the sync write options that are part of ZFS. With enough RAM and a modern CPU this is even faster than a hardware raid with less and slower RAM. Current CPUs are able to do realtime 3D, so the little raid calculation does not matter. Even with dozens of disks in a ZFS Raid-Z2/3, CPU load is not really relevant today.

4 GB RAM is ok for ZFS. Oracle, where ZFS is native with a 64bit Solaris, claims a minimum of 2GB for ZFS solutions with any pool size. More is faster, as it is used as read cache. Some ZFS systems have 64-512 GB RAM, which means that most or nearly all reads of current data are delivered from RAM. Optionally you can add an SSD to extend the read cache, but this is much slower than RAM-based caching.

whitey · Sep 17, 2016

I see no issues with this. If you're comfortable with/trust mdadm then stick with it, but I hope this is only for bulk media and not for running any VMs or I/O intensive workloads on it.

EDIT: Cough, concur with Gea above, I just didn't want to sound like a ZFS zealot but you all know my fragile mind is warped/I drank the koolaid LONG ago!

gea · Sep 17, 2016

A small anecdote from history

It was in 2001 when about a third of all German internet sites were offline for a week, with complete data loss on many of them. The reason was a crash/power loss in a Sun storage cluster, the inability to repair it via several fsck runs, each of which needed some days, and the lack of a current backup or of the time needed to restore from an older backup. Another internet provider was well advised to use hundreds of smaller independent servers instead of one large storage system to avoid large-storage problems.

In 2016 even a tablet or laptop may have more storage than the Sun cluster at that time so you must face similar problems with any storage. Filesystems from that time are simply not good enough for current storage capacities.

I have no confirmation, but Sun started developing ZFS around the same time to address exactly these problems, with a crash resistant filesystem and checksums to be able to verify data and repair it from redundancy. This disaster was an embarrassment for Sun. The principles behind ZFS can be found in many enterprise storage systems like NetApp. Besides the genuine one in Oracle Solaris, ZFS is luckily OpenSource and therefore available in free Solaris forks but also in BSD, Linux or OSX.

Strato-Panne: "Wir machen die reinste Hölle durch" - SPIEGEL ONLINE (in English: Strato outage: "We are going through pure hell")

You may translate it with Google, as there is no English version.

TuxDude · Sep 19, 2016

There's nothing dangerous about linux md-raid - it is very mature and well tested. While it will use system RAM for caching that is only for reads so there is no worry about data loss in a power outage. There's also nothing wrong with building arrays with far more than 8 drives - I was playing with a 70-drive array recently - though you will get recommendations on the maximum drives to have in a single parity group which tends to decrease as drives get larger. Using 6TB drives I would probably not go more than 5 in a raid-5 (I would probably not use raid-5 on 6TB disks actually, I'd go to raid-6), or 10 in a raid-6. That doesn't mean you can't have a 20-drive array of 6TB drives, just that you move to raid-60 and have two 10-disk raid-6's with your data striped between them.

gea · Sep 19, 2016

It is like a car that is about as old as the ext, HFS or NTFS filesystems, say 20 years.
There is nothing dangerous or wrong about them -
besides that all modern safety technologies are missing.

klree · Sep 20, 2016

Well... it talks about increasing CPU performance: a Core i5 M 520 (Westmere generation) has XOR performance of over 4 GB/s and RAID-6 syndrome performance of over 3 GB/s on a single execution core. It doesn't seem to touch on the rebuilding part, though. Even with the latest CPU running the XOR, I haven't seen an md rebuild finish in under 10 hours.

TuxDude · Sep 21, 2016

Don't even worry about the CPU usage of software raid - it is negligible. The basic XOR parity for raid-5 is so simple I can do the calculations in my head as fast as I can read the data, and while the extra parity for raid-6 does get complicated enough to actually call math it is still trivial for even an older CPU. Unless you're packing a pile of NVMe drives your CPU will be mostly idle waiting for data from disks to process.
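
For anyone curious what the "extra parity" math looks like, here is a sketch of the commonly described P+Q scheme: P is plain XOR, Q is a weighted sum in GF(2^8). It assumes generator 2 and reduction polynomial 0x11d, which is my understanding of what the Linux raid6 code uses. The function names are made up, and pure Python is nowhere near as fast as the kernel's SIMD routines.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) with reduction polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow2(exp: int) -> int:
    """Return the generator 2 raised to exp in the field."""
    value = 1
    for _ in range(exp):
        value = gf_mul(value, 2)
    return value


def gf_inv(a: int) -> int:
    """Brute-force multiplicative inverse; fine for a 256-element field."""
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)


def syndromes(blocks):
    """Return (P, Q) for a list of equally sized data blocks."""
    size = len(blocks[0])
    p, q = bytearray(size), bytearray(size)
    for idx, block in enumerate(blocks):
        coeff = gf_pow2(idx)
        for i, byte in enumerate(block):
            p[i] ^= byte                 # P: simple XOR parity
            q[i] ^= gf_mul(coeff, byte)  # Q: XOR of 2^idx * byte
    return bytes(p), bytes(q)


def recover_two(blocks, x, y, p, q):
    """Rebuild data blocks x and y (passed as None in blocks) from P and Q."""
    size = len(p)
    known = [b if b is not None else bytes(size) for b in blocks]
    pxy, qxy = syndromes(known)          # syndromes of the surviving data only
    gx, gy = gf_pow2(x), gf_pow2(y)
    inv = gf_inv(gx ^ gy)
    dx, dy = bytearray(size), bytearray(size)
    for i in range(size):
        p_delta = p[i] ^ pxy[i]          # equals Dx ^ Dy
        q_delta = q[i] ^ qxy[i]          # equals 2^x*Dx ^ 2^y*Dy
        dx[i] = gf_mul(gf_mul(gy, p_delta) ^ q_delta, inv)
        dy[i] = p_delta ^ dx[i]
    return bytes(dx), bytes(dy)


data = [b"alpha", b"bravo", b"delta"]
p, q = syndromes(data)
dx, dy = recover_two([None, b"bravo", None], 0, 2, p, q)
print(dx == data[0] and dy == data[2])
```

Losing one data disk only needs the XOR part; the field arithmetic is what lets a stripe survive two lost data blocks at once.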

Rebuilding an md array will generally run as fast as the disk being repaired can take the streaming writes - the other drives involved can usually do their streaming reads faster. If you're rebuilding an array that was made out of 4TB drives and lost one, the rebuild process will involve starting at block 1 on all of the drives, reading everything and writing the calculated repaired data to the new drive. It's going to involve moving around 4TB times the number of drives of data, but will do so in a way that matches the best possible workload for spinning disks.

Rebuilding a ZFS array is much different - with the raid being part of the filesystem, instead of starting at the beginning and processing the entire drive, ZFS will first figure out which files had data on the failed drive to work out exactly what data needs to be repaired. It will only process enough data to do those repairs and skip the rest - so a large array built with 4TB drives that only has a few GB of data stored on it will only end up processing a few GB instead of many TB. But it ends up jumping around the physical drives to do that work, causing the disks to see a random workload instead of a sequential one. Not that much of an issue for SSDs, but a spinning disk might drop from 130MB/s of sequential throughput down to 5MB/s or less with a random workload. If you want to hear bad stories about long rebuild times (ZFS calls it resilvering), go read about repairing large ZFS arrays.
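
A back-of-the-envelope sketch of that trade-off, using the 130MB/s and 5MB/s figures from the post as illustrative constants. The fill levels are hypothetical, and real rebuilds will not hold a constant rate.

```python
TB, GB, MB = 1e12, 1e9, 1e6


def hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate, in hours."""
    return size_bytes / rate_bytes_per_s / 3600


SEQ_RATE = 130 * MB   # sequential figure quoted in the post
RAND_RATE = 5 * MB    # random-workload figure quoted in the post
DISK = 4 * TB

# md-style rebuild: the whole replacement disk is written sequentially.
md_hours = hours(DISK, SEQ_RATE)

# ZFS-style resilver: only allocated data is touched, but at the random rate.
# The per-disk amounts below are hypothetical fill levels.
for used in (5 * GB, 500 * GB, 2 * TB):
    print(f"{used / GB:6.0f} GB used: resilver ~{hours(used, RAND_RATE):6.1f} h, "
          f"md rebuild ~{md_hours:.1f} h")

# Fill level per disk at which both approaches take the same time.
break_even_gb = DISK * RAND_RATE / SEQ_RATE / GB
print(f"break-even at about {break_even_gb:.0f} GB per disk")
```

With these numbers a full 4TB sequential rebuild takes roughly 8.5 hours in the best case, and the resilver only wins while each disk holds less than about 154 GB. Treat both rates as rough assumptions rather than measurements.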

EffrafaxOfWug · Sep 22, 2016

TuxDude said:
Don't even worry about the CPU usage of software raid - it is negligible.

Just to give some perspective on the numbers - mdadm actually uses SSE or AVX these days for accelerated parity calcs as well. When the raid5/6 module is called it does a quick performance test to see which methods are the fastest for that system; a quick bit of google-fu will show stuff posted from people's dmesg like this:

Code:

raid5: automatically using best checksumming function: generic_sse
    generic_sse: 11892.000 MB/sec
raid5: using function: generic_sse (11892.000 MB/sec)
raid6: int64x1   2644 MB/s
raid6: int64x2   3238 MB/s
raid6: int64x4   3011 MB/s
raid6: int64x8   2503 MB/s
raid6: sse2x1    5375 MB/s
raid6: sse2x2    5851 MB/s
raid6: sse2x4    9136 MB/s
raid6: using algorithm sse2x4 (9136 MB/s)
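
The selection step in that log can be mimicked with a few lines of Python. This is only a sketch that picks the highest raid6 benchmark figure from the sample text above; the kernel's actual selection logic lives in the raid6 module and may weigh more than raw throughput.

```python
import re

# Sample benchmark lines copied from the dmesg output above.
DMESG = """\
raid6: int64x1   2644 MB/s
raid6: int64x2   3238 MB/s
raid6: int64x4   3011 MB/s
raid6: int64x8   2503 MB/s
raid6: sse2x1    5375 MB/s
raid6: sse2x2    5851 MB/s
raid6: sse2x4    9136 MB/s
raid6: using algorithm sse2x4 (9136 MB/s)
"""


def fastest_raid6(log: str) -> tuple[str, int]:
    """Return (algorithm, MB/s) for the fastest raid6 benchmark line."""
    results = {}
    # Matches "raid6: <name> <number> MB/s"; the "using algorithm" line
    # does not fit this pattern and is skipped.
    for match in re.finditer(r"^raid6: (\S+)\s+(\d+) MB/s$", log, re.MULTILINE):
        results[match.group(1)] = int(match.group(2))
    name = max(results, key=results.get)
    return name, results[name]


print(fastest_raid6(DMESG))
```

For this sample the pick agrees with the "using algorithm" line that the kernel printed.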

Parity RAID rebuilds take an age because the whole array needs to be read in order to recalculate the parity, and you're typically going to be limited by the speed of stuff being written to the new disc; in my experience, this isn't quite a sequential workload so average rebuild speeds of 50-60MB/s are what I tend to get on my home arrays with ~100MB/s-class drives (whilst my RAID10 arrays rebuild in a sequential read from one drive to a sequential write to another and will thus almost always attain rebuilds at 100MB/s). As Tux points out, since ZFS does the RAID, volume management and filesystem, it only needs to rebuild the bits with actual data and not the entire array so it can theoretically be much faster than mdadm in that regard.

Veering more back in the direction of the topic, I wouldn't touch hardware RAID with a barge pole these days if I could avoid it; for home use I use nothing but softraid. HW raid cards are bloody expensive, and if they go wrong you will frequently find yourself forced to buy another bloody expensive card to get your data back. With generic softraid all you need is the built-in SATA ports and/or a relatively inexpensive HBA.

The big advantage of mdadm for me is that it's a real doddle to expand on the fly (one of my arrays at home was originally 4x1TB drives, then 6x2TB drives and is now 8x3TB drives); from my experiments with it, ZFS doesn't really let you do this. Sure you can replace drives with larger capacity ones and re[build|silver] the array, but you can't add extra devices to it on the VDEV level. Not really a problem for big piles of kit at work, but a major PITA for dinky little home NAS units IMHO; it's a technically superior option to mdadm, but significantly more expensive in terms of initial outlay and maintenance.

fossxplorer said:
This is for a server inside a DC, so there is a UPS etc. and I won't have too many power failures to deal with. What I'm thinking is that with Linux md RAID6, data will be cached in the OS cache, so it's protected solely by the mains and the UPS and nothing else (i.e. no battery).

The parameter in question here is the stripe_cache_size option used by mdadm, I think; it's basically a RAM cache used to synchronise writes across the array. It can be expanded from its fairly conservative size of 256 pages (page size typically being 4k) to keep more stuff in RAM. This normally has a pretty impressive effect on IO (especially random) since it allows HDDs to be used in a more sequential manner... but it also increases the risk of corruption in the event of power failure. Personally I've not had any problems from it being set to 8192, but it always pays to be cautious (and check your backups!).
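
To put numbers on that setting, here is a sketch of the RAM cost. It assumes the formula given in the kernel's md documentation as I remember it (page size x number of member disks x stripe_cache_size), 4k pages and the 5-disk array from the original question; the function name is invented.

```python
def stripe_cache_bytes(entries: int, member_disks: int, page_size: int = 4096) -> int:
    """RAM used by the md stripe cache: one page per member disk per entry."""
    return page_size * member_disks * entries


MIB = 1024 ** 2
default = stripe_cache_bytes(256, 5)   # default setting, 5-disk array
tuned = stripe_cache_bytes(8192, 5)    # the 8192 mentioned in the post

print(f"default: {default / MIB:.0f} MiB, tuned: {tuned / MIB:.0f} MiB")
```

On that assumption the default costs about 5 MiB and 8192 costs about 160 MiB for five disks. The value is normally exposed under /sys/block/mdX/md/stripe_cache_size, and everything held there is volatile, which is why the caution about power failure applies.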

Keljian · Sep 22, 2016

re ZFS: Raid 10 for Vpools is perfectly acceptable - as in adding disks in mirrored pairs in zfs. There's no reason that this can't be used at home (aside from cost). They can even be different sizes.

fractal · Sep 22, 2016

Keljian said:
re ZFS: Raid 10 for Vpools is perfectly acceptable - as in adding disks in mirrored pairs in zfs. There's no reason that this can't be used at home (aside from cost). They can even be different sizes.

I have read a number of articles on the subject and the prevailing advice is for everyone to run ZFS Raid 10 under every situation. It is great. Performance is wonderful. Expansion is wonderful. Life is wonderful... If you can afford buying twice as many drives as you need.

I started my MD Raid with 4 x 1 TB drives almost a decade ago. It slowly grew from 4 drives in a raid5 volume to 8 drives in a raid6 volume. All by adding one or two drives at a time as finances allowed.

I upgraded that to 4 x 4 TB drives a couple of years ago. I am now up to 6 x 4 TB drives. I just added the drives as I bought them and told mdadm to add them to the volume. A week or so later they were online. Yeah, it was that slow.

There are places for ZFS. ZFS 10 vdevs and buying drives 2 at a time works. RaidZ1 with 4 drive vdevs and buying 4 drives at a time is also doable with less overhead but a slightly higher expansion cost. But nothing beats mdadm for a low cost method of adding one drive at a time to expand your pool.

Now, whether it makes sense to have large parity arrays with modern size hard drives is another subject that has been beaten into the ground in so many other threads without resolution.

FWIW, I do both. I have two NASs where performance is even an "on the radar" factor, and those run zfs10. I have my main file store that is software raid5 with 6 x 4 tb 5400RPM drives, which is rsync'ed to 4 x 6 tb drives in a raidz1 volume in a different room. I don't think either choice is a "one size fits all".

As for the original question -- software raid vs hardware raid -- I think the answer is a resounding "software raid" for pretty much anyone short of enterprise. And even there the decision is situational.

i386 · Sep 23, 2016

Every time I read something about hw raid and the raid write hole, it's from people suggesting zfs. And then I have to think of this article: The Cult of ZFS | SMB IT Journal

ttabbal · Sep 23, 2016

HW RAID and the write hole is easy enough to solve with battery backed cache. It has its place. Particularly on platforms with poor SW RAID implementations, or none at all. Windows is fixing this with their new Storage Spaces code, which looks interesting. Even though I have no use for Windows on servers personally, plenty of people do and it's good to see more competition. My biggest complaint about HW RAID isn't the write hole, it's compatibility and resilver times. I can't swap controllers without a lot of research, even down to firmware versions, and be sure the array will be compatible. And since it doesn't know about the filesystem, it has to resilver the whole array, not just the used space. Not that some software setups don't have similar issues.

ZFS is great, and it's what I use for my server. It's not the end-all of filesystems, but it is very good. And I like that the open source nature of it means I can switch OS and hardware out and still use my existing array. The main features for me on my personal home systems are pretty simple. I want data security, checksumming, redundancy, and scrubbing. Not just of metadata, the actual data as well. I have had to repair file damage due to bitrot, not often, but it does happen. That requirement alone drops me to ZFS and BtrFS. ZFS is, in my opinion, more stable and reliable. In my current configuration, striped mirrors, I think BtrFS could work well. But it seems to still have problems with parity modes.

Where ZFS isn't as nice for home users is expansion. Particularly if you want to use parity modes. You can't add a drive or two to an existing raidz, you have to add a whole new raidz or upgrade each drive. Mirrors aren't bad, you can just add/upgrade 2 drives. You can't remove them though, even if you have the space. It's designed for enterprise setups that just don't need that flexibility. Linux mdadm and some other volume managers can do this sort of thing. So if you want that sort of flexibility, ZFS might not be the best choice. If you use those tools, make sure you have good backups. It's not likely, but any time you do something that major to a system, there is a possibility of having the whole thing fall over. Even big changes with ZFS. We should all be doing that anyway, but in home settings it's probably the most common thing to cut corners on.

While I would say ZFS has earned its reputation, I do think the cult of ZFS does exist, and I've occasionally attended meetings.

Chuntzu · Sep 24, 2016

Windows Server 2016 Storage Spaces and ReFS have greatly improved. It is very fast, and the networking stack is insane and requires no tweaking. Just plug in 10/25/40/56/100gbit and voila: local and remote connections that will tap out your cpu, and 20+ gigabytes per second transfers if you have the drives and ram.

Alex Skysilk · Sep 26, 2016

ZFS != Linux MD. The argument is somewhat disingenuous; ZFS deals with data both at block and LVM level, while Linux MD is strictly block level. Linux MD is not aware of the data on the file system, nor is it capable of dealing with any fault in it, leaving any such functionality to the file system.

To that end, ZFS volumes don't really "rebuild" in the RAID sense at all. Resilvering is more akin to an fsck, as it only deals with the actual files in the file system. This tight coupling of the file system with block storage mapping cannot be replicated with Linux MD. You can make the case that Linux MD plus a journalling file system is "good enough", but it's not really a comparable technique; moreover, with ZFS support now almost ubiquitous across all Unix-like OSs and CPU cycles/RAM being abundant even in entry level hardware, there is almost no use case where mdraid is preferable at all.
