Just a note on the reliability and survivability of ZFS, and on the great Napp-It tool that sits on top of it.
TL;DR... ZFS was recoverable even after 7 out of 16 drives had been kicked out of my array due to overheating.
In my garage at home I have several servers and disk shelves and other gear making up my home lab. This also includes a server and VMs that host media for playing to STBs and portable devices around the house.
At some point over the weekend some papers fell or blew in front of the media server and stuck to the front of the drive bays, blocking pretty much all the air flow into the front of the server. It was like that for at most an hour before I noticed it. By the time I found the problem 7 out of 16 drives had gone offline due to over-temps and ZFS kicked them out of the storage pool. The pool is 16 x 4TB SAS drives in a SuperMicro chassis and has been running for more than two years.
I shut it all down and let it cool off for a while. When I powered the system back up, the HBA saw all the drives with no error states on them. ESXi came up OK and I started my OmniOS/Napp-It VM, which has the HBA passed through to it. ZFS still showed the pool as degraded, but I had seen that before on other systems (though never with this many expelled drives). In the Napp-It GUI the drive list only showed 9 drives, and the 'Initialize' menu did not show any other drives available. This was worrying, because the drive controller saw all the drives. At the command line, the 'format' command only saw 9 drives as well. Since I did not have a current backup of all that data, I was really starting to sweat.
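For anyone who ends up in a similar spot, the checks look roughly like this on OmniOS (a sketch from memory, not an exact transcript; 'tank' stands in for the actual pool name):

    # List the disks the OS can see (this is where only 9 showed up for me)
    format

    # Check the pool state and which members ZFS thinks are missing
    zpool status -v tank

    # Rebuild stale /dev device links, forcing the OS to re-enumerate disks
    devfsadm -Cv

The devfsadm rescan is something worth trying in this situation; I can't say for sure it would have brought the missing disks back on the original VM.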
On my ESXi server I had installed the latest Napp-It 'OVA' to test, so I thought I would pass the HBA to it and see what it could see. Thankfully it saw all 16 drives and reported the pool as importable with no missing members. I configured Napp-It as best I could and then successfully imported my pool. After fixing some permission issues all is well, and the family is happy again, watching TV and movies as they please.
I am still not sure if I could have done anything to get the drives to show back up on the original machine. After importing the pool on the new machine I exported it again and tried to import it back into the original VM, but that VM was still ignoring the 7 disks it had kicked out, as if the OS didn't see them anymore.
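For reference, the import/export dance amounts to something like this (again just a sketch; 'tank' is a placeholder for the real pool name):

    # On the new Napp-It VM: list pools available for import, then force the
    # import since the pool was last in use on another system
    zpool import
    zpool import -f tank

    # Later, to hand the pool back: export it from the new VM...
    zpool export tank

    # ...and attempt the import again on the original VM
    # (which in my case still only saw 9 of the 16 disks)
    zpool import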
I know other RAID systems may also be able to recover from a situation like this, but ZFS has never let me down at home or at work. At work we have two 2 PB systems using ZFS as the back end for Lustre and have never lost a bit of data. I have had hardware RAID failures several times that forced me to recover from backups. I am a firm believer in software RAID, and especially ZFS.
Thanks to all those that have built and tested ZFS and also to @gea for the great interface to it!