mdadm raid5 recovery

frawst

New Member
Mar 2, 2021
21
3
3
It's probably enough of a diversion to warrant its own thread, but this sort of thing is very useful in a recovery scenario as a way of gauging how scunnered your files might be.

I've been using md5deep/hashdeep for years for the same sort of thing (I'm not using a filesystem with full checksumming, so it's a poor man's solution to spotting bitrot); you essentially just run it on your directory tree and it'll make an MD5/SHA/whatever hash of all the files within. You can save that out to a file, then at a later date compare the stored hashes against the files' current hashes.

Create a list of MD5 hashes for the files under /stuff using 4 CPU threads and save to an audit file:
Code:
pushd /stuff && nice -n 19 ionice -c 3 hashdeep -c md5 -l -r -j 4 * > /var/lib/hashdeep/stuff_2021-03-10.hashdeep
Compare the current files with the previously generated hash list (audit mode):
Code:
nice -n 19 ionice -c 3 hashdeep -r -j 4 -c md5 -x -v -k /var/lib/hashdeep/stuff_2021-03-10.hashdeep /stuff
This'll output a list of all the files that have either changed or didn't exist when the audit file was made, so while it's very useful for static datasets (movies, photos, etc.) it's not ideal for rapidly changing ones.
Ohhkay, so it's basically just a mass checksum to compare changes. That's certainly useful! The number of cron jobs I'm going to have in place..
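For what it's worth, here's a rough sketch of how the audit could be turned into a cron job; the dataset path and audit directory are just the examples from the commands above, so adjust to taste:

```shell
#!/bin/sh
# Hypothetical weekly audit wrapper around hashdeep.
# DATASET and AUDIT_DIR are example paths, not anything canonical.
DATASET=/stuff
AUDIT_DIR=/var/lib/hashdeep
STAMP=$(date +%F)

mkdir -p "$AUDIT_DIR"
cd "$DATASET" || exit 1
# Same low-priority hashing as above, dated so old audits are kept around.
nice -n 19 ionice -c 3 hashdeep -c md5 -l -r -j 4 . \
    > "$AUDIT_DIR/stuff_$STAMP.hashdeep"
```

Drop that in /etc/cron.weekly (or a crontab entry) and you get a dated audit file to compare against later.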
 

Goose

New Member
Jan 16, 2019
14
2
3
How are you copying stuff? rsync? If so then I think you can use those same checksums and just compare.
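e.g. (purely a sketch, with made-up destination paths) a dry-run checksum pass after the copy finishes would list anything that differs without transferring it:

```shell
# -r recurse, -c compare by checksum rather than size/mtime,
# -n dry run, -i itemize what would change. Paths are examples.
rsync -rcni /stuff/ /mnt/backup/stuff/
```

No output means the two trees' contents match byte for byte (at the cost of reading everything on both sides).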
 

frawst

New Member
Mar 2, 2021
21
3
3
How are you copying stuff? rsync? If so then I think you can use those same checksums and just compare.
Yeah, I'm using rsync -av --progress. But I'm not sure what I'm supposed to be comparing. It's done around 6 TB in the past 24 hours. Since I'm copying real files it's going nice and slow, and there's still a good bit of "failed: structure needs cleaning (117)". I have it mounted read-only right now, just pulling away in multiple instances to different disks.
 

frawst

New Member
Mar 2, 2021
21
3
3
@Goose Would you rather go for two smaller RAIDZ2 groups, or one larger RAIDZ3 group? I will end up having twelve 8 TB disks at my disposal, plus some SSDs for ZIL and SLOG. The server has a 10 Gb NIC that I'd like to keep saturated, and I've realized that my initial plan of two groups of 6 disks in RAIDZ2 would be considerably slower than what I'm capable of pushing. My only concern for the future is expanding the pool in the crazy off-chance I need more storage. But I also don't fancy the idea of a large system where I can't pull down a disk at a time either. Thoughts?

Anyone else is welcome to chime in here. I'm trying to make an effort to understand this well enough to start off on a better foot this time around. I'm pretty sure I'm set on using TrueNAS / ZFS at this point.

Some food for thought: this NAS runs as a VM on one of my Proxmox nodes with the drives direct-mapped. I suppose I could also just set up ZFS storage on the Proxmox host itself, but I like the idea of a separate OS owning the drives and NAS config without resorting to crazy large virtual disks.
 

Goose

New Member
Jan 16, 2019
14
2
3
I'm running an 8 disk RAIDZ2 setup. I have 16 bays in my chassis, so ready to add another if/when I run out of space.

TBH if you're not interested in max IOPS I would just go one big RAIDZ2. I've never run triple parity. If I was that concerned I would simply have a hot spare.

I can't speak to speeds as I don't know what you're storing nor how many clients are accessing.

Same for the SLOG. Presumably you meant L2ARC rather than ZIL as the ZIL is either part of the array or is part of the SLOG.
Again though, I think you need to define your workload here as often a SLOG makes no difference as all writes are ASYNC.

My storage is direct on proxmox but I'm clearly perverting the dev's vision for proxmox by having docker running natively on the host as well.

If it's just bulk media then you probably want a recordsize of 1M (or larger if you don't care about portability), and set compression to LZ4 or maybe even ZSTD. That might get close to, if not actually, flooding 10G with a couple of clients.
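e.g. something like this, with a made-up pool/dataset name (ZSTD needs a reasonably recent OpenZFS):

```shell
# Create a dataset tuned for big sequential files; "tank/media" is an example.
zfs create -o recordsize=1M -o compression=lz4 tank/media
# Or retune an existing dataset (only affects newly written blocks):
zfs set recordsize=1M tank/media
zfs set compression=zstd tank/media
```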
 

frawst

New Member
Mar 2, 2021
21
3
3
I'm running an 8 disk RAIDZ2 setup. I have 16 bays in my chassis, so ready to add another if/when I run out of space.

TBH if you're not interested in max IOPS I would just go one big RAIDZ2. I've never run triple parity. If I was that concerned I would simply have a hot spare.

I can't speak to speeds as I don't know what you're storing nor how many clients are accessing.

Same for the SLOG. Presumably you meant L2ARC rather than ZIL as the ZIL is either part of the array or is part of the SLOG.
Again though, I think you need to define your workload here as often a SLOG makes no difference as all writes are ASYNC.

My storage is direct on proxmox but I'm clearly perverting the dev's vision for proxmox by having docker running natively on the host as well.

If it's just bulk media then you probably want a recordsize of 1M (or larger if you don't care about portability), and set compression to LZ4 or maybe even ZSTD. That might get close to, if not actually, flooding 10G with a couple of clients.

I like the idea of doing the RAIDZ2 and a potential hot spare. Having one at the ready sounds like a good compromise between speed and safety. As for the usage, size, etc.: the volume stores a mixture of data, primarily static media (movies, pictures, etc.), computer backups and VM disks. So while the vast majority of space usage is going to be large files, it's pretty well spread out. Usage is quite low; it's mostly me and the other Proxmox nodes reading from it, and anything else is just writing backups now and then.

Admittedly, as I've mentioned, there's a lot I don't know about ZFS; my brief research into the do's and don'ts mentioned these kinds of things for performance. I don't think a read cache would matter as much as a mirrored write cache. As for the power concern, I have two brand new CyberPower 1500 AVR UPSes on the way. I plan to link one of them to the NAS VM; TrueNAS has the ability to listen and react accordingly. I'm hoping this can help prevent file corruption in the future. I had one on it previously, but it got old enough that I had to take it down, and not long after I found myself in this mess.

As for total storage space, I think a single large RAIDZ2 with a hot spare will satisfy my needs. Worst case I can just make a whole new pool, no biggie. I've gotten about 20 TB off the volume so far, and I'm now at the point of dealing with the excess corruption in the XFS volume. Running xfs_repair a time or two made some inaccessible areas reachable, but it always seemed to end with a "segmentation fault". I'm running it again now with a max memory set and prefetching disabled, and it seems to be doing better this time around. I'll report back on that later. At this point I consider the recovery a decent win overall: I got what matters most, but not all my hoarded data (yet).
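For reference, the "max memory, no prefetch" invocation looks roughly like this (the device name and memory cap are examples for my setup, not gospel):

```shell
# -P disables inode/dir prefetching (can help on badly corrupted filesystems),
# -m caps xfs_repair's memory use in MB; both values here are illustrative.
xfs_repair -P -m 8192 /dev/md0
```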
 

Mashie

New Member
Jun 26, 2020
11
4
3
I have a 10 disk mdadm RAID6 setup. I partitioned all the drives beforehand with a 100 MB gap at the end, as recommended in case you end up mixing manufacturers whose physical sizes don't quite match. I run EXT4 on top and nothing else. It has been dead simple to expand the array over time: grow the array, then run resize2fs.

Code:
/dev/md0:
           Version : 1.2
     Creation Time : Sun Jun 30 22:27:54 2019
        Raid Level : raid6
        Array Size : 78129610752 (74510.20 GiB 80004.72 GB)
     Used Dev Size : 9766201344 (9313.78 GiB 10000.59 GB)
      Raid Devices : 10
     Total Devices : 10
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Mar 19 09:05:32 2021
             State : clean 
    Active Devices : 10
   Working Devices : 10
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : xxxx:0  (local to host xxxx)
              UUID : xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
            Events : 143099

    Number   Major   Minor   RaidDevice State
       0       8      113        0      active sync   /dev/sdh1
       1       8      161        1      active sync   /dev/sdk1
       2       8      145        2      active sync   /dev/sdj1
       3       8      129        3      active sync   /dev/sdi1
       4       8       81        4      active sync   /dev/sdf1
       6       8        1        5      active sync   /dev/sda1
       5       8       49        6      active sync   /dev/sdd1
       7       8       33        7      active sync   /dev/sdc1
       9       8       17        8      active sync   /dev/sdb1
       8       8       65        9      active sync   /dev/sde1
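The grow sequence is roughly this shape (device names and counts are examples, not my actual layout):

```shell
# Add the new partition as a spare, reshape RAID6 onto 11 devices,
# then grow EXT4 to fill the bigger array.
mdadm --add /dev/md0 /dev/sdl1
mdadm --grow /dev/md0 --raid-devices=11
# Wait for the reshape to finish (watch /proc/mdstat), then:
resize2fs /dev/md0
```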
 
  • Like
Reactions: Goose

Goose

New Member
Jan 16, 2019
14
2
3
I like the idea of doing the RAIDZ2 and a potential hot spare. Having one at the ready sounds like a good compromise between speed and safety. As for the usage, size, etc.: the volume stores a mixture of data, primarily static media (movies, pictures, etc.), computer backups and VM disks. So while the vast majority of space usage is going to be large files, it's pretty well spread out. Usage is quite low; it's mostly me and the other Proxmox nodes reading from it, and anything else is just writing backups now and then.

Admittedly, as I've mentioned, there's a lot I don't know about ZFS; my brief research into the do's and don'ts mentioned these kinds of things for performance. I don't think a read cache would matter as much as a mirrored write cache. As for the power concern, I have two brand new CyberPower 1500 AVR UPSes on the way. I plan to link one of them to the NAS VM; TrueNAS has the ability to listen and react accordingly. I'm hoping this can help prevent file corruption in the future. I had one on it previously, but it got old enough that I had to take it down, and not long after I found myself in this mess.

As for total storage space, I think a single large RAIDZ2 with a hot spare will satisfy my needs. Worst case I can just make a whole new pool, no biggie. I've gotten about 20 TB off the volume so far, and I'm now at the point of dealing with the excess corruption in the XFS volume. Running xfs_repair a time or two made some inaccessible areas reachable, but it always seemed to end with a "segmentation fault". I'm running it again now with a max memory set and prefetching disabled, and it seems to be doing better this time around. I'll report back on that later. At this point I consider the recovery a decent win overall: I got what matters most, but not all my hoarded data (yet).
Hey

I forgot about this thread.

I'd suggest you run VMs from a different volume. Maybe a largish SSD?

The ZIL/SLOG isn't a write cache. It's for SYNC writes rather than ASYNC ones. It's worth reading about this so you understand its purpose.
You may be better off with the ZFS special VDEV types; they'll certainly help metadata. This is a newish thing and I have no experience with it. Give it a google.
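From what I've read (untested, pool and device names are examples), adding one looks something like:

```shell
# The special VDEV holds pool metadata, so mirror it -- losing it loses the pool.
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
# Optionally push small blocks onto it too, per dataset:
zfs set special_small_blocks=32K tank/vms
```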

ZFS is copy-on-write, so theoretically it shouldn't suffer from issues around sudden power-off. This is due to its atomicity: a transaction record either completes or never becomes active. There isn't a journal to replay like on XFS, EXT4, NTFS, etc.

Remember that you can set the recordsize separately for each dataset, so maybe 1M for your large files, and do some research into whatever would be most beneficial for your other stuff.

You won't need to make a new pool: if you add another VDEV (RAIDZ2), the pool will incorporate it, and for a while, until the VDEVs are balanced, it will mostly write to the new drives.
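Adding the second VDEV is a one-liner, something like (disk names made up):

```shell
# Appends a second 6-wide RAIDZ2 VDEV to the pool; this can't be undone casually,
# so double-check the device list before running it.
zpool add tank raidz2 /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr
```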

The XFS repair is a problem; I had it using over 32 GB of RAM one time trying to sort some corruption on a large filesystem. I don't have any advice here, except that it may be worth having a swap file/partition in case you need more RAM.

You may also want to actually vet your data, i.e. open the files; if they're movies and images, check them for obvious corruption such as problems with the picture and seek issues.

For other files open them up and see if they look right.
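For video at least you can automate a first pass, e.g. (assuming ffmpeg is installed; the filename is an example):

```shell
# Decode the whole file and print only errors; no output means it decoded cleanly.
ffmpeg -v error -i suspect.mkv -f null -
```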

Hopefully some of that is useful to you.