Hi all, the last few months have been storage hell, with RAID issues on a number of servers I look after. Clearly I'm doing something wrong, and I'm hoping this is a safe place to ask for some advice.
Back story: I have looked after a dozen or so "small business" servers for over fifteen years with relatively few issues. This isn't my day job, just something I spend a few hours on every other week to help out various friends and family who have small or home businesses. Most of the servers are white boxes, with the exception of some older HP DL380s and ML350s. The first few white boxes were inherited and had RAID 5 running on Intel software RAID or LSI MegaRAID controllers. This was back when 200 GB enterprise drives were considered huge and RAID 5 with a hot spare actually worked.
Many, many years ago I had a RAID 5 array fail during a rebuild and educated myself on the nightmare that is RAID 5. Since then I've either reconfigured all of the arrays as RAID 1 or 10 or replaced the servers. I try to keep backups on some combination of rotated external hard drives, a NAS elsewhere on the premises, and a cloud backup service. In the last few years I've switched almost all of these servers to Hyper-V so I can run a Linux web server, a PBX, or pfSense alongside a Windows domain controller or file server on the same box.
Present issues: Last month I had a RAID 10 array fail during a rebuild. I've had a dozen hard drives die in RAID 10 arrays over the years and never thought a failed rebuild was possible. The motherboard was a Supermicro X10SRH-CLN4F, and I was using the onboard LSI/Avago/Broadcom 3008 in IR mode for RAID 10. The user reported that accessing files on one VM was slow, so I poked around on the server (including in MegaRAID Storage Manager) and didn't see any issues. I figured I'd apply all available driver and software updates, and after installing the latest storage drivers and the newest MegaRAID Storage Manager I rebooted the server and instantly began getting dozens of emails from MegaRAID reporting unrecoverable read errors. Somehow the 3008 had either stopped doing patrol reads or stopped reporting errors, and multiple drives in the RAID 10 had gone bad, leaving the array unrecoverable.
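Since then I've been trying to confirm the patrol read and consistency check settings on the remaining controllers from an admin PowerShell prompt. On a full MegaRAID card the StorCLI commands would look something like this (controller index 0 is just an example, and I'm honestly not sure how much of this the IR-mode 3008 firmware actually supports):

    storcli /c0 show patrolread              # current patrol read state and schedule
    storcli /c0 show cc                      # consistency check mode and schedule
    storcli /c0 set patrolread=on mode=auto  # re-enable automatic patrol reads
    storcli /c0 set cc=conc delay=168        # concurrent consistency checks; delay is in hours, so 168 = weekly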
I pulled two drives from a NAS to make a RAID 1 on the Intel RSTe controller and copied all the vhdx files over to the new array, with the exception of one that couldn't be copied due to unreadable data. I restored that one from backup and thought I was in the clear. I sent the failing RAID 10 drives back to WD under warranty and started testing the remaining ones with the Western Digital tool. They seem to be fine, but the tool can't read SMART data through the LSI controller in IR mode.
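In case it helps anyone else in the same spot: smartmontools can sometimes read SMART data behind LSI controllers when the vendor tools can't. This is what I've been trying; the megaraid device type is my guess for IR-mode firmware, so it may not apply here:

    smartctl --scan                     # list the device paths smartctl can see
    smartctl -a /dev/sda                # full SMART dump, if the drive is exposed directly
    smartctl -a -d megaraid,0 /dev/sda  # per-slot pass-through as used on MegaRAID cards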
I started manual consistency checks on three other servers that also use LSI 3008 controllers and found another array with consistency errors. It didn't fail a rebuild, but this really has me worried. Why is this controller not checking consistency on its own? I was planning to move over to the Intel RSTe controller, but now the temporary RAID 1 array on it seems stuck on "verifying and repairing," and its less-than-three-month-old NAS drives are reporting multiple bad blocks that can't be repaired.
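For reference, this is how I've been kicking off and watching the manual checks (StorCLI syntax again, and virtual drive 0 is just an example):

    storcli /c0/vall show    # list virtual drives and their states
    storcli /c0/v0 start cc  # start a consistency check on virtual drive 0
    storcli /c0/v0 show cc   # watch its progress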
Can I trust any of these onboard RAID solutions? Should I recommend buying PCIe RAID controllers for all of the servers I look after? I've read about Windows Storage Spaces; is that any safer? What is a "small business" single-Hyper-V-server guy supposed to use these days?
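For context, the Storage Spaces setup I've been reading about would be something like the PowerShell below: a pool across the data disks with a two-way mirror on top, instead of controller RAID 1. The pool and volume names are just placeholders, and I haven't actually run this yet:

    # find disks that are eligible for pooling
    $disks = Get-PhysicalDisk -CanPool $true

    # create a pool and a two-way mirror virtual disk on it
    New-StoragePool -FriendlyName "VMPool" -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks $disks
    New-VirtualDisk -StoragePoolFriendlyName "VMPool" -FriendlyName "VMMirror" -ResiliencySettingName Mirror -UseMaximumSize

    # bring it online and format it for the vhdx files
    Get-VirtualDisk -FriendlyName "VMMirror" | Get-Disk |
        Initialize-Disk -PassThru |
        New-Partition -UseMaximumSize -AssignDriveLetter |
        Format-Volume -FileSystem ReFS

Is something like that actually more trustworthy than the onboard controllers, or am I just trading one set of failure modes for another?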