ZFSonLinux (aka ZoL) now at version 0.7 - big update imo


sfbayzfs

I got home to find my rsync at 15TB onto the new array, but two drives had dropped out of the z2 array and it was still rsyncing :(

I suspect hardware rather than the new checksums, but we'll see - I'm replacing them with different drives in different bays. After a reboot they show up fine, but they did drop completely from the system, not just from the array. I think the couple of hundred "write errors" zpool status reported for each were from the drives being gone, not actual failed write commands; SMART looks good on them after the reboot.
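For anyone wanting to do the same sanity check, these are the commands I'm talking about - the pool name and device letter below are placeholders, not my actual layout:

```
# Per-device read/write/checksum error counters, plus any affected files
zpool status -v tank

# SMART health on the drive after the reboot (use whatever letter it came back as)
smartctl -a /dev/sdX

# If the errors really were just from the drive disappearing, reset the counters
zpool clear tank
```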

I have never used the backplane I had them in (the rear backplane of a 45-bay SC847 JBOD), but my cabling cascades through the rear backplane to the front one, which I have used and which has been fine.

The resilver in the rear backplane was picking up checksum errors on multiple drives, so I shut down, moved the drives to the front, and re-imported to restart the resilver. I now have two drive replacements going at once alongside the general resilver, and a few checksum errors are ticking up on random drives which were clean before. Hopefully the rear 21-bay backplane is at fault here. I had that problem with another enclosure before, which prompted me to upgrade all of my enclosures' backplane firmware in case that was the cause (it was not - that first bad backplane kept dropping drives). Good thing I have a couple of spares, but it has me a little worried about them.
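For reference, the shut-down-and-move workflow is roughly the following - the pool name and the by-id device names are placeholders, not my actual layout:

```
# Export (or cleanly shut down), physically move the drives, then re-import
zpool export tank
# ...power down, move drives to the front backplane, power back up...
zpool import -d /dev/disk/by-id tank

# Start both replacements; they resilver together in a single pass
zpool replace tank ata-OLD_DRIVE_SERIAL_1 ata-NEW_DRIVE_SERIAL_1
zpool replace tank ata-OLD_DRIVE_SERIAL_2 ata-NEW_DRIVE_SERIAL_2

# Watch progress and the per-drive error counters
zpool status -v tank
```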
 

sfbayzfs

The problem turned out to be two hardware issues at once: one drive was going bad, combined with a port on the 9206-16e card I was connecting to the JBOD with that was mostly working but had one very specific issue - it would drop a few writes to good drives (mostly certain drives), causing resilvers to restart every time ZFS went to write a corrected sector. I switched to another port on the same card, and it's happy with all of the drives now - the resilver finished, and two scrubs since have come back clean!
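Verifying after the port swap was just a matter of scrubbing and watching the counters stay at zero (pool name is a placeholder):

```
zpool scrub tank          # kick off a full scrub
zpool status tank         # check progress and that the error counters stay at zero
```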

ddrescue worked great for a sleight-of-hand replacement of the failing drive while I tracked down the other issue. The replacement showed up under its sd?? letter in the zpool during the resilver that finally worked, but after a reboot and re-import it appeared under its ata-manufacturer-model-serial name like the rest of my drives in the pool. I'm really glad a friend tested this technique a few years ago when I told him to replace his failing Seagate 3TB disaster drives with HGST 4TBs - he had over half a TB of unreadable sectors, but was able to ddrescue onto an HGST 4TB, and the subsequent resilver recreated the missing data with minimal risk. In my case the resilvers weren't getting very far because of the other issue, so ddrescue greatly reduced my risk of data loss while I worked on it.
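For anyone curious, the ddrescue step looks roughly like this - device names and the mapfile path are placeholders, and the two-pass option choice is just a sensible default rather than exactly what I ran:

```
# Pass 1: grab everything readable quickly, skipping slow scraping of bad areas
ddrescue -f -n /dev/sd_FAILING /dev/sd_REPLACEMENT /root/rescue.map

# Pass 2: go back and retry the bad areas a few times with direct disc access
ddrescue -f -d -r3 /dev/sd_FAILING /dev/sd_REPLACEMENT /root/rescue.map

# Pull the failing drive, leave the clone in its place, re-import by-id, and
# let a resilver/scrub rebuild whatever sectors ddrescue could not recover
zpool import -d /dev/disk/by-id tank
zpool scrub tank
```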

Even though I am running the latest firmware on my backplanes, two good drives did get dropped from the backplane when another drive started having write issues, so the warnings about SATA drives on SAS backplanes, especially under strange failure conditions, do seem to hold - though it might not have happened at all if the card port had not been problematic.

It's also really nice that all of the data except for one file is intact, even though half of it was written with no parity and one of the written stripes landed on a failing drive, after two drives had been dropped by the system from a z2 double-parity pool! It would be nice to be able to set a minimum parity level below which a zpool is forced to go read-only. When four drives at once dropped from a pool before while it was being written to, it went read-only, but this time it had just enough disks to keep writing data without parity and stayed writable, putting a lot of data potentially at risk - maybe I have missed a setting for that. This is still a great win for ZFS, though: the entire array would have been unrebuildable if I had been using hardware RAID and this had happened.
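As far as I know there's no built-in minimum-parity knob (the failmode pool property only covers total pool failure), so the closest workaround I can think of is a cron watchdog along these lines - pool name is a placeholder and this is only a sketch, not something I've battle-tested:

```
#!/bin/sh
# Hypothetical watchdog: if the pool is anything other than ONLINE, flip the
# top-level dataset read-only so no new data is written without full parity.
POOL=tank    # placeholder pool name

HEALTH=$(zpool list -H -o health "$POOL")
if [ "$HEALTH" != "ONLINE" ]; then
    zfs set readonly=on "$POOL"
    logger "zfs watchdog: $POOL is $HEALTH, set readonly=on"
fi
```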

So, in summary, the new skein checksums are working great on my new pool created under 0.7.1!
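For anyone who wants to try them on an existing 0.7.x system, the new checksums sit behind a pool feature flag and are then set per dataset - the pool/dataset names below are placeholders:

```
# New pools created under 0.7.x enable the feature by default; an upgraded
# pool may need it enabled explicitly first
zpool set feature@skein=enabled tank

# Newly written blocks on this dataset will use skein from here on
zfs set checksum=skein tank/data

# Confirm the setting
zfs get checksum tank/data
```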
 

sfbayzfs

Quick note for those running CentOS or other RHEL variants: a lot changed in the upgrade from 7.3 to 7.4, and even zfs-kmod didn't keep working. I had to uninstall all ZFS-related RPMs, including zfs-release, and then edit the zfs repo file to enable the kmod repo and disable the default one. After that I was able to reinstall kmod-zfs and kmod-spl (which pulled in the lib dependencies, of course) and everything worked fine.
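Concretely, the dance was something like the following - the repo IDs are the ones in the stock zfsonlinux.org zfs.repo, the package list is approximate (check rpm -qa on your own box), and the exact release RPM URL for your point release is on the ZoL download page:

```
# Remove the existing ZFS packages plus the release/repo RPM
yum remove zfs kmod-zfs kmod-spl spl libzfs2 libzpool2 libnvpair1 libuutil1 zfs-release

# Reinstall the release RPM for the current point release, then flip the repos:
# disable the default DKMS repo, enable the kABI-tracking kmod repo
yum install <zfs-release RPM for EL 7.4>
yum-config-manager --disable zfs --enable zfs-kmod
# (or edit /etc/yum.repos.d/zfs.repo by hand and flip the enabled= lines)

# Reinstall the kmod packages; the libraries come back in as dependencies
yum install kmod-zfs kmod-spl
```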

In retrospect, editing the repo file to point at the 7.4 repos would probably have fixed it, but it is cleaner to have the installed release RPM reflect the current OS version anyway.