mdadm - Production best practices


Ron Kelley

New Member
Jan 13, 2020
Greetings all,

Looking to put mdadm into production for the first time and am looking for some best practices. I have used it in the past for non-production work, but a production install absolutely requires rock-solid uptime, reliability, and performance.

Background:
I have 4x NFS servers in production - each with 6x 2TB Samsung SSD drives connected to LSI 9300 hardware RAID cards. These servers have been running flawlessly for the past 3-4 years. They are fast and easy to manage (easy to replace a failed drive, etc). The new servers will hold 3x 6.2TB NVMe drives, but unfortunately, they don't have an available PCIe slot for a RAID card. Thus, I am looking at software RAID this time.

As a test, I installed FreeNAS on the server and set up a RAID-Z array with auto-tune enabled. Unfortunately, no amount of tuning provided any decent performance (ashift=12, disabling compression/sync-writes/atime, tuning min/max active reads, etc.). The disk reads were capped around 650MB/sec and disk writes capped around 1.2GB/sec. I spent an entire day poring over the ZFS tuning options to no avail.

Purpose:
The purpose of the servers is to host ESXi vmdk files. Thus, XFS makes a great filesystem option in this case. My idea is to use mdadm RAID-5 and put an XFS filesystem on top. Export the volume via NFS and call it a day. The servers will have a dedicated (non-RAID) boot drive via a 64GB SATA-DOM.

The issue is day-2 (and beyond) maintenance. Since I have not run mdadm in production, I don't know how easy or hard it is for someone to walk into the data center and replace a failing drive. Or what happens if the server panics and reboots without attaching the RAID volume. Or how reliable mdadm is in general.


I am wondering if anyone could share their best practices for a real production setup with mdadm. For example, how often do you scrub the drives, what tuning parameters did you use, etc.

Thanks for any feedback.

-Ron
 

gea

Well-Known Member
Dec 31, 2010
Your main concern should be data security vs performance

Sun (now Oracle) developed ZFS not for superior performance but for superior data security. As a unique new feature, ZFS offers checksums for real-time data validation and auto-repair. This increases the amount of data written and lowers performance.

ZFS adds Copy-on-Write. This means that atomic writes like "update data + update metadata" are either done completely or discarded. Never again a corrupted filesystem, and as an add-on you get secure read-only snaps for versioning.

Even if a modern filesystem or a hardware RAID adapter with BBU offers decent protection against a crash or the RAID write-hole problem ("write hole" phenomenon in RAID5, RAID6, RAID1, and other arrays), this does not protect a guest filesystem like you need with VM storage. A hardware RAID with BBU is quite decent protection, but best of all is ZFS with sync write enabled. Optionally, with a slow pool you want an SLOG like an Intel Optane for performance.
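A minimal sketch of that setup (pool, dataset, and device names are only placeholders):
Code:
# force sync writes for the VM dataset
zfs set sync=always tank/vmstore

# add a fast device (e.g. an Optane) as a dedicated SLOG
zpool add tank log /dev/nvme0n1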

With mdadm you lose all the superior features of ZFS RAID. With ext4 or XFS you may get a slightly faster filesystem but with a much lower level of data protection.

FreeNAS is a quite common ZFS option. The fastest ZFS is Oracle Solaris with native ZFS (they invented NFS and ZFS). With Open-ZFS you may try OmniOS (a free Solaris fork, an enterprise-class OS with an LTS branch kept stable with regular security updates) as an alternative. It is quite often a little faster than FreeBSD/FreeNAS, offers newer ZFS features like encryption and special vdevs, and has lower CPU and RAM demands; see OmniOS Community Edition - even with a commercial support option.

If you want to go without ZFS, look for a hardware RAID with BBU/flash protection for SATA or 12G SAS SSDs. For an SSD/NVMe pool used as VM storage, use enterprise-class drives with power-loss protection in any case. With NVMe, software RAID is effectively your only option.
 
  • Like
Reactions: redeamon

Ron Kelley

New Member
Jan 13, 2020
Appreciate your feedback. I have been watching (and playing with) ZFS over the past 5 years and understand the benefits it provides compared to other storage technologies. That said, performance has always been the worst part about ZFS - regardless of tuning (and I have spent many hours trying to tune it). I have tried it on FreeNAS, OpenIndiana, ZFS on Linux, etc.

As I mentioned in the original post, these servers don't have any extra PCIe slots available for NVMe RAID cards. Thus, I need to go to a software-based setup.

ZFS aside, can you offer any best practices for mdadm in production?
 

gea

Well-Known Member
Dec 31, 2010
10 years ago, when I lost my mailserver with > 1M files on an NTFS RAID-5 with one of the best hardware RAID/BBU cards of that time, with no repair option other than a restore from backup (after 2 days of chkdsk and 3 days of restore), I decided to go 100% ZFS and nothing else.

btw
The only "raid" adapter that I know of is a BroadCom 9400. I have not tried as a raid adapter, only as a HBA.
 
  • Like
Reactions: redeamon

redeamon

Active Member
Jun 10, 2018
+1 for ZFS

If you need performance, just add tons of RAM; ZFS will utilize it for caching and does a good job of it.
 

Ron Kelley

New Member
Jan 13, 2020
-1 for ZFS

RANT
(!!! I told myself not to talk about ZFS !!!)

Adding tons of RAM to a server with 3x high-end NVMe drives just to get adequate performance is not a way to solve this problem. This seems to be a common "solution" for anyone with ZFS issues. "Oh, you just need to add <xyz> hardware to your server to make it work properly". This has been going on for years.

I started with ZFS about 5 years ago with a bunch of 2TB drives. Bad read-write performance. Advice from ZFS gurus "You need to add SSDs for log device, and you need 2x RAM for every TB of space". Fail. Added a dedicated SSD and a ton of RAM - read/write speeds marginally better. Many days of tuning down the drain. Back to HW RAID.

2 years ago, I tried again with SSDs. Again, bad read/write performance. Advice from ZFS gurus "You need to add NVMe drive for log device, and you need MAX RAM". Again. Fail. Many more days of tuning down the drain. Back to HW RAID.

This year, I tried yet again with NVMe - hoping the performance would get better. Again: fail. 3x NVMe drives that can each push well over 2GB/sec can barely manage 650MB/sec reads. No amount of ZFS tuning fixes the problem. No amount of RAM will solve this problem. Sure, I get snapshots, scrubbing, volume management, etc. That is great. But when the system provides less than 50% of the performance even after tuning, I am not willing to put in any more effort. Install mdadm - get well over 3GB/sec reads and writes using XFS (or BTRFS). Done.

ZFS is a great filesystem with lots of bells and whistles. Unfortunately, speed is not one of them...

/RANT
 
  • Like
Reactions: MrCalvin

redeamon

Active Member
Jun 10, 2018
That's pretty odd. I run NVMe drives on my ZFS and get over 2GB/sec over 40GbE. I haven't tested locally, but I would imagine it's probably near drive speed, as single-stream 40GbE is the limit.

The devil is always in the details it seems.
 

acquacow

Well-Known Member
Feb 15, 2017
I have an 8-HDD RAID-Z2 that does 750MB/sec reads and a 4-SSD RAID-Z1 that does 1GB/sec read/write just fine.

My config is out-of-the-box FreeNAS 11. The only thing I've done is enable SMB multichannel and make sure iperf showed full 10GbE speeds in both directions between clients and servers.
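Something along these lines, with the hostname as a placeholder:
Code:
# on the NAS
iperf -s

# on the client
iperf -c nas01 -t 30
# then swap roles to test the other direction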

That said, I'm a HUGE mdadm fan, since that's what we used to use at Fusion-io to build big flash arrays that would do 200GB/sec and such w/o too much issue.

-- Dave
 

Ron Kelley

New Member
Jan 13, 2020
Thanks Dave. Any performance tuning options you can recommend on the mdadm config? I have 3x P4610 NVMe drives and would like a single RAID-5 volume. The goal is to host VMware VMDK files. I just ran some raw FIO tests against the drives; each can push almost 3GB/sec reads/writes (128K block size with a queue depth of 64), so I know the hardware is capable.
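For reference, the raw tests were roughly this kind of fio job (the device path is just an example):
Code:
# sequential read, 128K blocks, QD 64, direct I/O against the raw device
fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=128k \
    --iodepth=64 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting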

Also, sorry about the RANT. I really, really want to use ZFS, but unfortunately, each time I try I always get stuck on performance. And it seems ZFS is currently struggling to get full bandwidth from NVMe drives without intimate knowledge of tuning (Fixing Slow NVMe Raid Performance on Epyc).

I have been hoping ZFS would give at least 50% or more aggregate performance for the entire volume, but regardless of how much I tune, I can only get about 30% per drive. The problem is memory related - not the amount of RAM installed but the code that performs memory copies for ZFS (search the forum.level1tech.com link for "Its the memory copies - my old nemesis again!").

Also, ZFS on Linux is currently experiencing some weird slow-down on reads (ref: Slow write performance with zfs 0.8 · Issue #8836 · zfsonlinux/zfs, https://github.com/zfsonlinux/zfs/issues/8836).
 

acquacow

Well-Known Member
Feb 15, 2017
There used to be a lot of hand tuning with nested raid configs, chunk-size, etc, but these days you can pretty much just throw a --level=5 at it and be fine.

It would be different if you had 20+ devices, but for 3 devices you're probably fine.

If using ext3/4, be sure to calculate your stripe-width when you format the FS. No need with xfs.
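As a hypothetical example, for a 3-device RAID-5 with a 512K chunk (2 data disks) and 4K blocks, the math would be: stride = 512K / 4K = 128, stripe-width = 128 × 2 data disks = 256:
Code:
# ext4 aligned to a hypothetical 3-disk RAID-5, 512K chunk, 4K block size
mkfs.ext4 -b 4096 -E stride=128,stripe-width=256 /dev/md0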

-- Dave
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
I'd echo acquacow in that you can generally just chuck mdadm in with default config and everything'll work, but as always the answer is "it depends".

For parity RAID on systems hooked up to a UPS, I almost always use the following udev rule to increase the kernel stripe cache from the default of 256 to 16384:
Code:
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="16384"
Since it basically means more stuff held in RAM being queued for writing to disc, it comes with a higher risk of data loss in the event of a crash/power cut etc but in my tests at least gives a noticeable improvement in throughput.
 

Ron Kelley

New Member
Jan 13, 2020
There used to be a lot of hand tuning with nested raid configs, chunk-size, etc, but these days you can pretty much just throw a --level=5 at it and be fine.

It would be different if you had 20+ devices, but for 3 devices you're probably fine.

If using ext3/4, be sure to calculate your stripe-width when you format the FS. No need with xfs.

-- Dave
Thanks again, Dave!
 

Ron Kelley

New Member
Jan 13, 2020
I'd echo acquacow in that you can generally just chuck mdadm in with default config and everything'll work, but as always the answer is "it depends".

For parity RAID on systems hooked up to a UPS, I almost always use the following udev rule to increase the kernel stripe cache from the default of 256 to 16384:
Code:
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="16384"
Since it basically means more stuff held in RAM being queued for writing to disc, it comes with a higher risk of data loss in the event of a crash/power cut etc but in my tests at least gives a noticeable improvement in throughput.

THIS! This is exactly the kind of feedback I am looking to get. I will certainly test this on my system today. Thus far, I have compiled the following commands to build the NVMe RAID set:

Code:
# Create RAID-5 /dev/md0 with a 1M chunk across the three NVMe drives
mdadm --create -c 1M --verbose /dev/md0 --level=5 -n 3 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

# Tune md0: more parity threads and a larger stripe cache
echo 16 > /sys/block/md0/md/group_thread_cnt
echo 16384 > /sys/block/md0/md/stripe_cache_size

# Raise the resync/rebuild speed limits (KB/s)
echo 5000000 > /proc/sys/dev/raid/speed_limit_min
echo 10000000 > /proc/sys/dev/raid/speed_limit_max

# Save the array definition so it assembles at boot
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
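The remaining steps I have penciled in - XFS aligned to the 1M chunk (2 data disks) and an NFS export - would look roughly like this (mount point and export network are just examples):
Code:
# XFS aligned to the array geometry: su = chunk size, sw = number of data disks
mkfs.xfs -d su=1m,sw=2 /dev/md0
mkdir -p /mnt/vmstore
mount /dev/md0 /mnt/vmstore

# NFS export for the ESXi hosts
echo '/mnt/vmstore 10.0.0.0/24(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra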
Thanks again for the great feedback!
 

MrCalvin

IT consultant, Denmark
Aug 22, 2016
Hi Ron
How is the status of your project?

It's my experience that mdadm RAID-5 only performs properly in conjunction with the host page cache. I don't know why that is; it's just my experience. And as such you can't really use synchronous I/O, which ESXi uses AFAIK (for good and bad). I'm not sure how your setup is and whether those synchronous I/Os from ESXi are passed all the way down to the physical disk devices, as I guess one could say they should be, or are "tricked" along the way to speed things up.
On the other hand, maybe that's not a problem when using NVMe; I have only been running mdadm RAID-5 on HDDs.
Always looking for storage tweaks myself - that's a never-ending job, right ;-)
 

MrCalvin

IT consultant, Denmark
Aug 22, 2016
And I would claim that mdadm is more robust than hardware-based RAID.
The RAID configuration is saved on each disk in so-called superblocks. You can actually just take your disks and move them to another server and your RAID will work (or at worst with very few commands). Try to do that with a hardware-based RAID! Even if the controllers are the same, you still have the issue with the configuration on the controller and perhaps even different firmware. Yuck!
And the controller adds another point of failure too, right? And it uses a lot of power: 8 watts idle doing nothing if you're lucky, and 16 watts under load. Not saying it doesn't have its advantages, but in my view they don't outweigh the above disadvantages.
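A minimal sketch of such a move, assuming the disks show up as /dev/sd[bcd] on the new server:
Code:
# inspect the superblocks on the moved disks
mdadm --examine /dev/sdb /dev/sdc /dev/sdd

# assemble the array from the members that are found
mdadm --assemble --scan

# then save the config so it assembles on future boots
mdadm --detail --scan >> /etc/mdadm/mdadm.conf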
 

i386

Well-Known Member
Mar 18, 2016
Try to do that with a hardware-based RAID! Even if the controllers are the same, you still have the issue with the configuration on the controller and perhaps even different firmware.
I moved my array from an Adaptec Series 6 controller to a Series 8 controller without problems :D
 

acquacow

Well-Known Member
Feb 15, 2017
Also, if you are going to have 4 NFS boxes, have you looked at running gluster at all?

I always prefer replication to in-box raid.