Raid5 grow time

pettinz · Jan 24, 2019

Hi,
I have a Rackmount with a LSI-9011-8i. Now I have 2x8TB drives (one of these full) and I'm planning to buy another 4 of these drives (WD80EFAX) to build a RAID-5 with mdadm. (Now the 2 drives are NOT in RAID)
If I create a RAID-5 with the new 4x8TB drives, how long will it take to grow, if I want to add a new drive to the array? Is it possible?
After creating the RAID-5 with the new 4x8TB drives and after copying data from the old 2 drives into the array, I will remain with the old 2x8TB unused, so I wonder if they can be added to the new raid.

dandanio · Jan 24, 2019

This is my braindump, hope it helps.
gist:fdc22d5a2c330b2773b5399649b519aa

I have been maintaining a raid mdadm raid and went through multiple (over 20) disk upgrades on it, in addition to changing raid levels etc. Never an issue, just careful planning (backups in place). Hope it helps. It is a mediawiki formatting but there is no pastebin with mediawiki formatting in it.

All you are asking is possible. It will take days to add ad replace 8TB disks. But with careful planning and backups, the risk is minimal.
If you have any q's, fire away.

pettinz · Jan 24, 2019

Thank you for your answer, I have some issues to read that file but I will.
Now I have some questions.

Do you suggest to make directly a RAID-6 on the initial setup of 4x8TB drives?
Before growing up the array, should I unmount the array? Or it can be online while growing up?
Can I add multiple drives or just one drive at a time?

Thank you.

dandanio · Jan 24, 2019

You do not need to unmount anything, mdadm can perform all changes "live".
I would recommend adding one drive at a time, it takes sufficiently long to do so.
RAID-5 vs RAID-6 is a debatable topic. I recommend doing RAID-6 (RAID Z2 on ZFS) anytime the storage is a primary source of data. So for example I would do RAID6 on primary, then RAID5 on secondaries. Cron'd scrub or checkarray is recommended.

pettinz · Jan 24, 2019

dandanio said:
You do not need to unmount anything, mdadm can perform all changes "live".

Performing changes "live" will not increase the time of growth ?

dandanio · Jan 24, 2019

pettinz said:
Performing changes "live" will not increase the time of growth ?

It will, but it won't be significant. And you can tweak rebuild time vs performance. If you are uncomfortable with that concept, just take it offline.

EffrafaxOfWug · Jan 24, 2019

First off - yes, don't risk your data with a RAID5 array comprising of discs that big. You want at least RAID6. IIRC mdadm supports turning a RAID5 directly into a RAID6 by adding a new drive and growing the array; for example to turn a 3-drive RAID5 into a 4-drive RAID6 the following should work:

Code:

mdadm --manage /dev/md16 --add /dev/sdd1
mdadm --grow /dev/md16 --raid-devices 4 --level 6

If you want to keep the array intact you'll need to add drives one at a time and wait for the resync/reshape to complete before taking the next step.

Which brings us back to your original question - it will take A Long Time, but how long depends on a lot of factors.

Firstly, creating a new array from four new drives - this is usually faster than a reshape and should proceed at <finger in the air> 80-100MB/s.

Secondly, yes once you're done moving data off the old drives, the old array can be destroyed and the drives recycled into the new array. I'm growing a RAID6 array from six to seven 6TB drives currently and it's not a speedy process, hitting <50MB/s currently - parity RAID involve a lot more random IO patterns than stripes or mirrors so the rebuild time is correspondingly larger:

Code:

md16 : active raid6 sdr1[6] sdl1[0] sdq1[5] sdp1[4] sdo1[3] sdn1[2] sdm1[1]
      23441561600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      [============>........]  reshape = 60.8% (3568739840/5860390400) finish=881.0min speed=43349K/sec
      bitmap: 2/44 pages [8KB], 65536KB chunk

I generally stick to RAID10 (default nr2 layout) for my "production" arrays since rebuild times are dramatically shorter - replacing a failed disc is basically a sequential copy from one disc to another so you can expect that to clock in at 100-200MB/s on modern large HDDs.

pettinz said:
Performing changes "live" will not increase the time of growth ?

It depends a lot on the workload the array is undergoing. My example above is on an array that's not being used; any heavy load on the array would lengthen the rebuild time (quite possibly considerably) due to even higher load on the discs. Another disadvantage of parity RAID is the comparatively poor performance during rebuilds.

pettinz · Jan 24, 2019

Code:

md16 : active raid6 sdr1[6] sdl1[0] sdq1[5] sdp1[4] sdo1[3] sdn1[2] sdm1[1]
      23441561600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      [============>........]  reshape = 60.8% (3568739840/5860390400) finish=881.0min speed=43349K/sec
      bitmap: 2/44 pages [8KB], 65536KB chunk

881.0min total or 881.0min from the 60.8% ?

dandanio · Jan 24, 2019

pettinz said:
881.0min total or 881.0min from the 60.8% ?

That's additional 881 minutes on top of what already passed.

pettinz · Jan 24, 2019

Ok now I have another question... It will take SO long, so

What happen if power fails or current blows?
Is there a way to continue the reshape after a "accidental" shutdown?

EffrafaxOfWug · Jan 24, 2019

Resync and reshape will pick up just fine after a controlled shutdown OK (indeed from any sort of shutdown that doesn't do any damage). There is a potential for data loss in the event of fire/flood/electric gremlins/hippies due to the way that parity RAID stripes are held in RAM, if you're unlucky these can scupper the partition table or superblocks or other important areas of the disc. mdadm's default setup minimises this risk by keeping these buffers small, but if you want the RAID array to be speedy you'll generally have increased this already. This is an unavoidable consequence of using system RAM as cache, you can mitigate it with a UPS but as you point out that won't save you from a crash or from grandma kicking out the power cord.

This sort of error is known as the write whole and this page gives a fair summation about it: "Write hole" phenomenon in RAID5, RAID6, RAID1, and other arrays. ZFS is the only filesystem I'm aware of immune to the write hole by design.

I've never run into these catastrophic errors myself (closest I've come was some heavy fscking needed post-crash) but it's certainly happened to other people. mdadm is generally pretty resilient but you never want to do any reshaping work without having a backup handy, just in case of a total loss of one of the arrays.

(As an aside, always spend your money on a) a UPS and b) more backups before you spend your money on RAID)

dandanio · Jan 24, 2019

pettinz said:
Ok now I have another question... It will take SO long, so

What happen if power fails or current blows?

Is there a way to continue the reshape after a "accidental" shutdown?

To answer both your questions:

Buy an UPS.
Install apcupsd.

mstone · Jan 24, 2019

you probably also want to increase the raid speed limit

EffrafaxOfWug · Jan 24, 2019

Well the good news is the remaining reshape of the array ended up finishing somewhat quicker than the ~13hrs.

Now I just need to wait for the resync operation to finish; this should be much faster than the reshape operation as the data isn't being moved around (although I'm dubious it'll acually finish in 40mins as RAID arrays tend to get slower as you get towards the end):

Code:

root@wug:~# mdadm --grow /dev/md16 --size=max
mdadm: component size of /dev/md16 has been set to 5860392960K
root@wug:~# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] 
md16 : active raid6 sdr1[6] sdl1[0] sdq1[5] sdp1[4] sdo1[3] sdn1[2] sdm1[1]
      29301964800 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      [>....................]  resync =  0.1% (8279252/5860392960) finish=35.3min speed=2759752K/sec
      bitmap: 0/44 pages [0KB], 65536KB chunk

As mstone alludes to I've increased the RAID rebuild speeds by tweaking proc/sysctl:

Code:

dev.raid.speed_limit_min = 50000
dev.raid.speed_limit_max = 5000000

...and as mentioned previously I increased the RAID stripe cache set in /sys/block/md32/md/stripe_cache_size from the default of 256 to 4096 with the following udev rule:

Code:

SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="4096"

msg7086 · Jan 29, 2019

Multiple 8TB drives with RAID-5 will destroy your [del]life[/del] data.

Search

Raid5 grow time

pettinz

Member

dandanio

Active Member

pettinz

Member

dandanio

Active Member

pettinz

Member

dandanio

Active Member

EffrafaxOfWug

Radioactive Member

pettinz

Member

dandanio

Active Member

pettinz

Member

EffrafaxOfWug

Radioactive Member

dandanio

Active Member

mstone

Active Member

EffrafaxOfWug

Radioactive Member

msg7086

Active Member