RAID 6 triple drive failure, can I manually overcome a puncturing bad block? MR 9361-8i

Maery Fedorica

New Member
Jan 19, 2021
6
2
3
I was just sipping my coffee sitting at my home workstation and noticed I lost my main data hard drive that I use for a network drive. It is a 7 disk array of which 6 drives were active in hardware RAID 6, with battery backup and cache.

I signed in to the Broadcom LSI Storage Manager webpage and looked at the logs. It was not a normal issue, but a catastrophic one.

The best I can tell is that a single drive had a "lost sense," i.e. a drive failure where it stops responding, but it apparently cascaded to two other drives during recovery or they had independent, unlucky errors. This kind of failure seemed a bit unlikely, so I wonder if there was a power supply glitch or the RAID card failed during its automatic recovery and the errors were transient?

Before we get into it, my backups are so-so, and it's likely I will lose a small-ish amount of un-backed-up data, either due to timing between critical backups or because I did not consider the data vital enough to back up. I would like to recover the array due to recent file changes and convenience, of course. This VD holds about 6 TB in an array of 3 TB Seagates, of which 90 to 95% is backed up and around 0.1% may be recent critical data files. The un-backed-up data is my collection of replaceable things that I can re-download from the Interweb Tubes. I am using Windows 11. I have a controller card I bought off eBay and I modified it to have a fan that blows constantly on the controller chip. My last controller would get CRC errors occasionally, so I figured it wouldn't hurt to keep it cool.

Also, FYI, I had been running a monthly patrol read of the VD. The last patrol read was on the 30th, as in 7 days ago. They take about a day.

Around 9:03 pm, Device ID 16 had several timeouts. It's the failed drive. And then... the controller tried to bring on my Global Hot Spare (GHS), Device 17
9:03:04 ID 16, Unexpected sense, logical unit not ready [several like this including bus resets]
9:03:17 ID 16, Unrecoverable medium error during recovery
9:03:17 ID 16, Puncturing bad block 0x117cad820
9:03:17 ID 17, Puncturing bad block 0x117cad820
[That's not good! Other one simultaneously bad?]
9:03:36 ID 16, Timeout (again)
9:03:36 ID 16, Reset

[Several timeouts and unexpected senses from ID 16]
9:04:52 ID 17, Unrecoverable medium error during recovery
9:04:52 ID 16, Unrecoverable medium error during recovery
9:05:24 ID 16, Unexpected sense, Unrecovered read error
[that's a first]
9:05:24 ID 16, Unrecoverable medium error during recovery [again]
Huge gap of over 30 minutes of nothing being logged. I suppose the array was degraded and it was copying to GHS

9:44:26 ID 16, Unexpected sense, Unrecovered read error [again]
9:44:26 ID 16, Unrecoverable medium error during recovery [again]

The last two are repeated - more read errors
9:45:09 ID 15, Puncturing bad block Location 0xaeb8408 [Oh, no, what's going on to ID 15?]
9:45:09 ID 15, Puncturing bad block Location 0xaeb8409 [Two?]
9:45:09 ID 17, Puncturing bad block Location 0x6450340 [Hello ID 17?]
9:45:09 ID 18, Puncturing bad block Location 0x6450340 [Hello ID 18?]
9:45:09 ID 17, Puncturing bad block Location 0x6450341
9:45:09 ID 18, Puncturing bad block Location 0x6450341
9:45:09 ID 17, Puncturing bad block Location 0x6450342
9:45:09 ID 18, Puncturing bad block Location 0x6450342

[Several unexpected senses and Unrecoverable read errors from ID 16]
9:47:21 [Several more puncturing bad blocks on ID 17 and ID 18 0x6450350 to 0x6450359]
9:47:22 [Several more puncturing bad blocks on ID 17 and ID 18 0x645035a to 0x645035f]
[Several unexpected senses and Unrecoverable read errors from ID 16]

9:48:01 ID 16, Reset Type 3
9:48:05 ID 16, Failed. Drive Error Counter: 44
9:48:05 ID 16, Previous: Online, Current: Failed
9:48:05 Virtual Drive. State change on VD: 0 Previous: Degraded; Current: Offline;
9:48:05 Controller cache pinned for missing or offline VD: VD 0
9:48:05 VD is now OFFLINE VD 0
9:48:05 ID 15, Puncturing bad block Location 0xaeb84e8
[repeated for sequential blocks up to 0xaeb84f7]
9:48:08 Number of valid snapdump available is 1
[Several unexpected senses, resets, and then disk removed from ID 16]
9:48:31 ID 16, Previous: Failed, Current: UnConfigured Bad
9:50:37 ID 16, Disk: Inserted
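
To see how the punctured blocks are spread across the drives, the entries can be tallied per device ID. This is only a minimal Python sketch: it assumes the events are pasted as plain text in the abbreviated `time ID n, message` form used above (the controller's real event log is more verbose), and `tally_punctures` is a made-up helper name.

```python
import re
from collections import defaultdict

# Matches the abbreviated form used in this post, e.g.
# "9:45:09 ID 15, Puncturing bad block Location 0xaeb8408"
# "Location" is optional because the earliest entries omit it.
PUNCTURE = re.compile(
    r"^(?P<time>\d{1,2}:\d{2}:\d{2}) ID (?P<dev>\d+), "
    r"Puncturing bad block (?:Location )?(?P<lba>0x[0-9a-fA-F]+)"
)

def tally_punctures(log_text):
    """Return {device_id: sorted list of punctured block addresses}."""
    blocks = defaultdict(set)
    for line in log_text.splitlines():
        m = PUNCTURE.match(line.strip())
        if m:
            blocks[int(m.group("dev"))].add(int(m.group("lba"), 16))
    return {dev: sorted(lbas) for dev, lbas in blocks.items()}

# A few lines copied from the log above.
sample = """\
9:03:17 ID 16, Puncturing bad block 0x117cad820
9:03:17 ID 17, Puncturing bad block 0x117cad820
9:45:09 ID 15, Puncturing bad block Location 0xaeb8408
9:45:09 ID 15, Puncturing bad block Location 0xaeb8409
9:45:09 ID 17, Puncturing bad block Location 0x6450340
9:45:09 ID 18, Puncturing bad block Location 0x6450340
"""

for dev, lbas in sorted(tally_punctures(sample).items()):
    print(f"ID {dev}: {len(lbas)} punctured block(s): {[hex(b) for b in lbas]}")
```

Matching addresses on two drives at the same timestamp (as with ID 17 and ID 18) stand out immediately in this view.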


So in less than an hour, I had ID 16 go bad, the GHS ID 17 failed during recovery, and then several bad blocks found on ID 18 and ID 15. The current state of things is dire. I am faced with trying to correct it by forcing some of these failed drives back into the array. However, the current drive list shows three drives in the failed VD and three drives listed as "foreign." (see pic) I got the PC to boot out of the LSI "safe mode" by removing the disk cache, which is probably toast anyway.
Screenshot 2026-04-06 164351.png
In the VD: ID 17, ID 18, and ID 15. All of these have reported bad blocks
Foreign drives: ID 19, ID 13, and ID 14

Weirdly enough, the foreign drives were never mentioned in the logs. That array seems to be the best one. I wish I could switch over to the Foreign drives and see if any of the failed drives would add to its array?

So it would be: IDs 19, 13, 14, and either 18 or 15 or both. (I need four drives, of course, for the array to work.)
I don't trust ID 17, the GHS as of 9:00 pm yesterday. (see pic)

Screenshot 2026-04-06 164145.png
My question is this: can I remove the current array (I mean, it's populated with just the baddest ones) and then ask the controller to import the foreign array? I don't understand how the drives got labeled that way or if it matters. I am wondering if anyone has worked with punctured bad blocks and tried to remove them with megacli or megacli64, which I have installed.

Screenshot 2026-04-06 164724.png
 

Maery Fedorica

The amazing thing I just discovered is that my automatic backup had completed at... 9:03:13 pm. That's either unbelievable or very believable. The first drive error was actually at 9:02:59. No matter... the backup said it was completed successfully! Woo hoo.

Now that's the skin of your teeth!! My backups are fairly far apart. I think this one is set to be 71 hours apart, so the chances of it happening within seconds of a backup were pretty small. I am happy about that, of course, but I sure wish I could resurrect the virtual drive without a full rebuild... I don't like running a restore from external and starting over. I've only rarely had to do it, and this is really my first major catastrophic failure in decades. My main backup has only one gotcha: files over 100 GB are not included, which should primarily relate to ... other massive backups for other PCs and data.
 

Maery Fedorica

Here is a list of my foreign configs...

Code:
C:\Oracle Storage 12 Gbps SAS LSI MegaRAID 9361-8i\repairs>storcli64 /c0/fall show all
CLI Version = 007.3603.0000.0000 Oct 30, 2025
Operating system = Windows 11
Controller = 0
Status = Success
Description = Operation on foreign configuration Succeeded

Foreign Topology :
================
----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT      Size PDC  PI SED DS3  FSpace TR
----------------------------------------------------------------------------
 0 -   -   -        -   RAID6 Frgn  N  10.913 TB dsbl N  N   dflt N      N
 0 0   -   -        -   RAID6 Frgn  N  10.913 TB dsbl N  N   dflt N      N
 0 0   0   252:6    14  DRIVE Frgn  N   2.728 TB dsbl N  N   dflt -      N
 0 0   1   252:5    15  DRIVE Onln  N   2.728 TB dsbl N  N   dflt -      N
 0 0   2   -        -   DRIVE Msng  -   2.728 TB -    -  -   -    -      N
 0 0   3   252:1    17  DRIVE Onln  N   2.728 TB dsbl N  N   dflt -      N
 0 0   4   252:3    18  DRIVE Onln  N   2.728 TB dsbl N  N   dflt -      N
 0 0   5   252:2    19  DRIVE Frgn  N   2.728 TB dsbl N  N   dflt -      N
----------------------------------------------------------------------------

Foreign VD List :
===============
---------------------------------
DG VD      Size Type  Name
---------------------------------
 0  0 10.913 TB RAID6 CHEESE1_00
---------------------------------

NoVDs - Number of VD in Drive Group
DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive or RAID Type|Onln=Online|Rbld=Rebuild|Optl=Optimal
Dgrd=Degraded|Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready

Total foreign Drive Groups = 1
Total Foreign PDs = 3
Total Locked Foreign PDs = 0
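
To summarise the foreign topology table at a glance, here is a small Python sketch that counts the member states. It assumes the whitespace-separated column layout of this particular storcli output, and `drive_states` is an illustrative helper, not part of storcli. A state of Onln or Frgn only means the controller can see the drive; it says nothing about whether the data on it is consistent.

```python
from collections import Counter

def drive_states(topology_text):
    """Return a list of (row, device_id, state) for each DRIVE row."""
    rows = []
    for line in topology_text.splitlines():
        tok = line.split()
        # DRIVE rows look like: DG Arr Row EID:Slot DID DRIVE State ...
        if len(tok) > 6 and tok[5] == "DRIVE":
            rows.append((int(tok[2]), tok[4], tok[6]))
    return rows

# Rows copied from the "Foreign Topology" table above.
topology = """\
 0 -   -   -        -   RAID6 Frgn  N  10.913 TB dsbl N  N   dflt N      N
 0 0   -   -        -   RAID6 Frgn  N  10.913 TB dsbl N  N   dflt N      N
 0 0   0   252:6    14  DRIVE Frgn  N   2.728 TB dsbl N  N   dflt -      N
 0 0   1   252:5    15  DRIVE Onln  N   2.728 TB dsbl N  N   dflt -      N
 0 0   2   -        -   DRIVE Msng  -   2.728 TB -    -  -   -    -      N
 0 0   3   252:1    17  DRIVE Onln  N   2.728 TB dsbl N  N   dflt -      N
 0 0   4   252:3    18  DRIVE Onln  N   2.728 TB dsbl N  N   dflt -      N
 0 0   5   252:2    19  DRIVE Frgn  N   2.728 TB dsbl N  N   dflt -      N
"""

rows = drive_states(topology)
counts = Counter(state for _, _, state in rows)
PARITY_DRIVES = 2  # RAID 6 tolerates two missing members per span

for row, did, state in rows:
    print(f"row {row}: DID {did:>2} -> {state}")
print("state counts:", dict(counts))
print("missing members:", counts.get("Msng", 0), "of", PARITY_DRIVES, "tolerated")
```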
Here is the current VD config:
Code:
C:\Oracle Storage 12 Gbps SAS LSI MegaRAID 9361-8i\repairs>storcli64 /c0/v0 show all
CLI Version = 007.3603.0000.0000 Oct 30, 2025
Operating system = Windows 11
Controller = 0
Status = Success
Description = None


/c0/v0 :
======

--------------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC      Size Name
--------------------------------------------------------------------
0/0   RAID6 OfLn  RW     No      RAWBD -   OFF 10.913 TB CHEESE1_00
--------------------------------------------------------------------

VD=Virtual Drive| DG=Drive Group|Rec=Recovery
Cac=CacheCade|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|dflt=Default|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady
B=Blocked|Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency


PDs for VD 0 :
============

------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model              Sp Type
------------------------------------------------------------------------------
252:5    15 Onln   0 2.728 TB SATA HDD N   N  512B ST3000DM008-2DM166 U  -
252:1    17 Onln   0 2.728 TB SATA HDD N   N  512B ST3000DM008-2DM166 U  -
252:3    18 Onln   0 2.728 TB SATA HDD N   N  512B ST3000DM008-2DM166 U  -
------------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=PI Eligible
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


VD0 Properties :
==============
Strip Size = 256 KB
Number of Blocks = 23437492224
VD has Emulated PD = Yes
Span Depth = 1
Number of Drives Per Span = 6
Write Cache(initial setting) = WriteBack
Disk Cache Policy = Disabled
Encryption = None
Data Protection = Disabled
Active Operations = None
Exposed to OS = Yes
OS Drive Name = N/A
Creation Date = 27-08-2025
Creation Time = 05:09:16 AM
Emulation type = default
Cachebypass size = Cachebypass-64k
Cachebypass Mode = Cachebypass Intelligent
Is LD Ready for OS Requests = Yes
SCSI NAA Id = 600605b00bc1ee3030414f7c8bd2fa5d
Unmap Enabled = N/A
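
As a sanity check on these numbers, the reported size and the "I need four drives" figure can be reproduced with a little arithmetic. This is a rough Python sketch; it assumes the block count is in 512-byte sectors (which the SeSz column suggests) and that storcli's "TB" is really a binary unit (TiB).

```python
SECTOR_BYTES = 512          # SeSz column reports 512B (assumed to apply to the block count)
NUM_BLOCKS = 23437492224    # "Number of Blocks" from the VD0 properties
DRIVES_PER_SPAN = 6         # "Number of Drives Per Span"
PARITY_DRIVES = 2           # RAID 6 keeps two parity strips per stripe

vd_bytes = NUM_BLOCKS * SECTOR_BYTES
vd_tib = vd_bytes / 2**40
data_drives = DRIVES_PER_SPAN - PARITY_DRIVES
per_drive_tib = vd_tib / data_drives

print(f"VD size: {vd_tib:.3f} TiB across {data_drives} data drives")
print(f"Per-drive share: {per_drive_tib:.3f} TiB")
print(f"Minimum members needed to read the array: {data_drives}")
```

The result lands within rounding of the 10.913 TB and 2.728 TB figures in the tables, and shows why any four consistent members out of six are enough for RAID 6.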
 

DarkServant

Active Member
Apr 5, 2022
124
99
28
I don't know exactly what f&ç%d up your array.
There is a reason why so many moved away from hardware RAID to something like TrueNAS, or another non-Windows, ZFS-based solution, and to enterprise/datacenter-grade SSDs: cascading failures, write holes, etc.
The disks got bigger and bigger, but the uncorrectable bit error rate stayed the same, and speed did not increase at the same rate as capacity, until the cascading-failure scenario turned into a real problem.
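
The scaling argument can be put into numbers with a rough Python sketch. The rate used here, one unrecoverable error per 10^14 bits read, is an assumption (a figure commonly quoted on consumer drive spec sheets), not something stated in this thread, and the model treats errors as independent at a constant per-bit rate, which real drives do not strictly follow.

```python
import math

def p_at_least_one_ure(bytes_read, uer_per_bit):
    """Probability of one or more unrecoverable read errors, assuming
    independent errors at a constant per-bit rate (a simplification)."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p
    return -math.expm1(bits * math.log1p(-uer_per_bit))

ASSUMED_UER = 1e-14  # assumption: 1 error per 10^14 bits, not a figure from this thread

for tb in (0.5, 3, 12, 30):
    p = p_at_least_one_ure(tb * 1e12, ASSUMED_UER)
    print(f"{tb:>4} TB read end to end: P(>=1 URE) ~ {p:.1%}")
```

With a fixed per-bit rate, the chance of hitting at least one error during a full-drive read (which is what a rebuild does) grows with capacity. If real drives beat their spec sheet, the same function can be rerun with a smaller rate.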

I myself ran into so much trouble with hardware RAID, and once lost a ton of data (yes, RAID 6, but my fault... Jesus saves, but God backs up).
Back when RAID controllers were still really big cards, I had one with an array of red LEDs, and in non-busy states it imitated the front of the car in "Knight Rider" (Mylex or AMI MegaRAID? I don't know). But then the time of RAID-on-Chip arose, and the highly integrated chips got way too hot and had no adequate heatsinks or fans: build dirt-cheap and sell for a fortune (LSI times, now Broadcom). The probability of a failure was simply too high for the price point, and SSDs took over the stage rather fast.
Sometimes I think local hardware RAID is more of a liability thing than real protection nowadays.

Now a SATA SSD (SM883) based ZFS striped-mirror solution has run for over 5 years without big trouble and with no failures.
For those big HDDs (non-SMR "Ultrastar"...), only a triple ZFS mirror can save you from an array failure with high probability.
I myself use a datacenter/enterprise-grade SSD (yeah, no fancy packaging, expensive as hell, and reliable) in my workstation and back up from time to time; no more RAID. The failure rate of SSDs is something like 0.4% per year instead of the >2% of HDDs.
As for the professional Optane drives (P4800X, P5800X), I have never read about a failure, but only a very limited batch of those is in the hands of individuals.

Too much talk, no solution... :confused:
 

kapone

Well-Known Member
May 23, 2015
2,009
1,374
113
This is all BS. Hardware raid is just as reliable as ZFS "raid". You moved to SSDs instead of HDDs, so that removes some of the risk, but that doesn't mean hardware raid got any riskier. The vast, vast, VAST majority of enterprise systems STILL run on hardware raid, not ZFS. Just because home-labbers find the price of good raid cards "exorbitant", doesn't mean..

:rolleyes::rolleyes:
 

DarkServant

I once met a person who worked at a large financial company in Switzerland, and he said they use Sun Solaris and ZFS for critical stuff... that was more than ten years ago.
ZFS does not perform very well in terms of speed, and it has its drawbacks too. But how do current HW-RAID adapters manage something like bit rot, and what about the so-called write hole in write-back mode?
The post is about DAS, not NAS or SAN, which limits the options on a Windows platform; "Open"ZFS is more of a NAS thing.
Adding a RAID card adds another point of failure, and as I said, it is also a liability problem, and Broadcom wants to sell their stuff too. A server mostly comes with HW RAID already integrated.
My last HW-RAID adapter was an LSI 9266-8i, which overheated quickly (cheap, small aluminum heatsink). I had an additional "CacheCade" key installed and five WD VelociRaptor 600 GB drives in RAID 6, and the SSD cache was a RAID 10 of four Samsung 830 128 GB (consumer) drives. It failed miserably. And yeah, the controller alone cost well over $1000... it has been collecting dust in a drawer for years. Then I swapped to a single SM863a 1.92 TB, which was not cheap either; it has collected 50k hours with no trouble. Now I have two P5800X 1.6 TB (no DRAM cache, and only one ARM Cortex-R7 core in the controller running at 1.1 GHz) and an SM883 3.84 TB (reduced to 3 TB -> OP).

These days an expansion is out of my range (like a Micron 9550 MAX 12.8 TB U.2 or a Solidigm PS1030 12.8 TB), and four 32 GB sticks of DDR5 RDIMM even more so.

this looks more like the problem, but on another platform: vnx-unity-understanding-uncorrectable-sectors-and-parity-errors
 

twin_savage

Active Member
Jan 26, 2018
170
127
43
35
The disks got bigger and bigger, but the uncorrectable bit error-rate stayed the same
This isn't true in practice, most modern SAS disks will outperform their specsheet-stated UER by a couple orders of magnitude; it's the specsheets that have frozen in time. About 15 years ago HDDs really did perform fairly close to their stated UER, but the adoption of advanced format was an inflection point in them getting significantly better with respect to UER.

The vast, vast, VAST majority of enterprise systems STILL run on hardware raid, not ZFS. Just because home-labbers find the price of good raid cards "exorbitant", doesn't mean..
This is very true. I've been on projects where ZFS is explicitly banned from use, partially due to its freespace write hole, which will end up making you lose data if you don't have a pristine datacenter environment that can guarantee excellent power/dust/temperature conditions.

ZFS does not perform very well in terms of speed, and it has its drawbacks too. But how do current HW-RAID adapters manage something like bit rot, and what about the so-called write hole in write-back mode?
Hardware raid handles bitrot via patrol scrubs, just as ZFS would. Hardware raid handles write holes via super capacitors now as opposed to the lithium ion batteries that it used decades ago.
I want to point out that almost all software raid, including ZFS suffers from a write hole problem it cannot overcome due to lack of energy storage cache; Klennet's blog has some very interesting analysis and speculation on how/why ZFS will render user data inaccessible, it turns out that not all structures of ZFS are as protected as claimed.
 

DarkServant

This isn't true in practice, most modern SAS disks will outperform their specsheet-stated UER by a couple orders of magnitude; it's the specsheets that have frozen in time. About 15 years ago HDDs really did perform fairly close to their stated UER, but the adoption of advanced format was an inflection point in them getting significantly better with respect to UER.
Do you mean those 2.5" 10k/15k disks whose capacity maxed out at, I think, 2.4 TB, or the gigantic near-30 TB helium 3.5" disks?

This is very true. I've been on projects where ZFS is explicitly banned from use, partially due to its freespace write hole, which will end up making you lose data if you don't have a pristine datacenter environment that can guarantee excellent power/dust/temperature conditions.


Hardware raid handles bitrot via patrol scrubs, just as ZFS would. Hardware raid handles write holes via super capacitors now as opposed to the lithium ion batteries that it used decades ago.
I want to point out that almost all software raid, including ZFS suffers from a write hole problem it cannot overcome due to lack of energy storage cache; Klennet's blog has some very interesting analysis and speculation on how/why ZFS will render user data inaccessible, it turns out that not all structures of ZFS are as protected as claimed.
Interesting, never heard of that freespace write hole and the banning of ZFS, but I have not been in IT for a very long time.
I only believed that ZFS is the best affordable solution (if one uses a system with working/validated ECC DRAM and DC/enterprise SSDs with power-loss protection, or HDDs of the "Ultrastar" class...); the copy-on-write approach sounds reasonable.
I had some bad experiences with HW RAID, and I had several different controllers. The solution with supercaps and NAND flash was indeed a step forward in protecting the write cache, but the hardware quality suffered anyway. Yeah, I was angry that for the >$1000 price tag no appropriate cooling solution fit the BOM, so that the controller card can run without 60 dB case cooling. But that is a topic in itself: the degrading quality of hardware over the years/decades.
My experience is fading (like the rest of me), because I am not involved in any way with enterprise hardware outside my own home, and the $$$ is very limited. Anyway, I need to read this Klennet post.
 

twin_savage

Do you mean those 2.5" 10k/15k disks whose capacity maxed out at, I think, 2.4 TB, or the gigantic near-30 TB helium 3.5" disks?
it would have been at the time ~2TB drives were considered cutting edge, but basically it was when all the drives went from 512byte sectors with mildly okay ECC to 4k byte sectors with really good ECC based off of the LDPC algorithm.

Interesting, never heard of that freespace write hole and the banning of ZFS,
Well, the banning of ZFS was just on one specific contract I happened to be on, as opposed to something more industry-wide. The majority of the problem was that it was more of an edge deployment that saw some very rough conditions; things would go wrong fairly often, and we could never get people with the more specialized ZFS recovery knowledge out to the site to fix anything on reasonable time scales.

Yeah, I was angry that for the >$1000 price tag no appropriate cooling solution fit the BOM, so that the controller card can run without 60 dB case cooling.
This is still very much a problem today, especially with the Broadcom cards. Microchip seems to do a little better with their heatsinks, but still not great. Areca hardware raid cards almost always had built in active cooling fans which is appreciated.

Anyway, I need to read this Klennet post.
I think this was the post I was remembering when I wrote that comment:
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,886
2,219
113
Just because home-labbers find the price of good raid cards "exorbitant", doesn't mean..
curious what cards you're talking about? what are the "good" ones you mention, what about great ones?


Areca hardware raid cards almost always had built in active cooling fans which is appreciated.
is areca common/popular/rare in your experience for datacenter usage?
 

kapone

curious what cards you're talking about? what are the "good" ones you mention, what about great ones?
Any modern LSI/Adaptec from the last...oh say 10yrs ago, with onboard cache and a Supercap. If it can do tiered storage (i.e. SSD cache, or even a RAID array of SSD cache...), even better.

Used these can go for peanuts, but we're not talking used. I'm talking about speccing a new server and the decision to include a hardware raid card or not. The cost of the raid card is a rounding error.
 

kapone

This is still very much a problem today, especially with the Broadcom cards. Microchip seems to do a little better with their heatsinks, but still not great. Areca hardware raid cards almost always had built in active cooling fans which is appreciated.
See...RAID cards are meant for servers and being run in a server environment...i.e. a DC. It's when home labbing, media servers and the likes of data hoarding took off that people started running them in a home environment. And then they started complaining... :)

The heatsinks on any modern LSI/Adaptec etc are "adequate enough", IF they're running in a server environment. You can't have giant heatsinks on them from the factory, because they're meant to occupy a single slot, and that too within HHHL/FHFL/FHHL specs. There simply isn't room on the card for more heatsink without compromising the specs.
 

kapone

This is very true. I've been on projects where ZFS is explicitly banned from use, partially due to its freespace write hole, which will end up making you lose data if you don't have a pristine datacenter environment that can guarantee excellent power/dust/temperature conditions.
Yes, there's issues with ZFS and..well, if "write hole" is what we're calling it, that's fine. The issue is ZFS caching data in RAM, especially on writes. That 5s of buffer (by default) in RAM is ill suited for enterprise data strategies. The only alternative is to do "sync always", but then...the performance tanks.

So..you now have to add a fast "power protected" device as SLOG...wait where have I heard that before? Oh yes, a hardware RAID card. :)

I like ZFS, it has a lot to offer, but it does have its issues. So...you know...if you run ZFS on a hardware raid array...which is protected by a hardware raid card...and do sync always... :p

Blasphemy!
 

gea

Well-Known Member
Dec 31, 2010
3,646
1,444
113
DE
Modern filesystems like btrfs, ReFS and ZFS have brought us software RAID with two main improvements: Copy on Write, and checksums for every written data block, even in a software RAID, on every single disk.

Copy on Write means that no data block on disk is modified in place; it is always written anew. A crash during a write means that the former data block remains valid, not the partly written new one. The former data blocks are also the basis of data versioning based on snaps. A hardware RAID with cache protection can protect writes in a crash situation, but not at the same level and without snap versioning.
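
The Copy on Write idea can be illustrated with a toy Python model. This is only a sketch of the principle, not how ZFS, btrfs or ReFS are actually implemented; the class `CowStore` and its methods are invented for illustration.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}      # block_id -> bytes, append-only
        self.root = {}        # live mapping: name -> block_id
        self.snapshots = {}   # snapshot name -> an older root mapping
        self._next_id = 0

    def write(self, name, data, crash_before_commit=False):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data       # 1. new data goes to a new block
        if crash_before_commit:
            return                         # simulated crash: root is untouched
        new_root = dict(self.root)         # 2. build a new root mapping
        new_root[name] = block_id
        self.root = new_root               # 3. a single pointer swap commits

    def read(self, name, snapshot=None):
        root = self.snapshots[snapshot] if snapshot else self.root
        return self.blocks[root[name]]

    def snapshot(self, snap_name):
        # Root mappings are never mutated, so a snapshot is just a reference.
        self.snapshots[snap_name] = self.root


store = CowStore()
store.write("file", b"v1")
store.snapshot("monday")
store.write("file", b"v2-partial", crash_before_commit=True)
print(store.read("file"))   # the old block is still the valid one
store.write("file", b"v2")
print(store.read("file"), store.read("file", snapshot="monday"))
```

Because the old block is never touched, an interrupted write leaves the previous version intact, and keeping an old root around is all a snapshot needs.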

Filesystems like ZFS do RAM-based read/write caching, e.g. a 5 s write cache on Solaris. OpenZFS has no fixed 5 s cache but a more dynamic behaviour, though in the end it is also a few seconds. On a crash, the ZFS filesystem remains valid due to Copy on Write, but the last few seconds of writes may be lost. You can enable sync to protect these last few seconds so they are rewritten on the next boot.

Checksums mean that every data block (e.g. 1M) is verified during read and auto-repaired from redundancy on errors. This also protects against bit rot or a diverging data state, e.g. in a mirror, where the bad and good mirror halves can be identified. In a degraded RAID situation, this also means that only a file is marked bad, not the whole array, as you would see, for example, with a degraded RAID 5 on the next read error.
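
A toy Python model of a checksummed two-way mirror shows how the good copy is identified and the bad one repaired on read. It is a sketch of the principle only: SHA-256 is used as a convenient stand-in checksum, and `MirroredBlock` is an invented name, not a ZFS structure.

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).digest()

class MirroredBlock:
    """Toy two-way mirror with the checksum stored apart from the data."""

    def __init__(self, data):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)

    def read(self):
        good = None
        bad = []
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) == self.expected:
                if good is None:
                    good = bytes(copy)
            else:
                bad.append(i)          # this copy has silently changed
        if good is None:
            raise IOError("all copies fail the checksum: this block is lost")
        for i in bad:
            self.copies[i] = bytearray(good)   # self-heal from the good copy
        return good


block = MirroredBlock(b"important data")
block.copies[0][3] ^= 0x01             # simulate bit rot on one mirror half
print(block.read())                    # served from the intact copy
print(bytes(block.copies[0]))          # the damaged copy has been repaired
```

Without the stored checksum, a plain mirror can see that its two halves differ but cannot tell which one is correct.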

In the end, only a hardware RAID with cache protection is good, but modern software RAID like ZFS is superior regarding data security and features.
 

kapone

In the end, only a hardware RAID with cache protection is good, but modern software RAID like ZFS is superior regarding data security and features.
@gea - With all due respect..."RAID" is very much not a strong suit with ZFS. As a filesystem and volume manager it's excellent, and of course all the good things that come with it, like compression, data security, CoW etc etc.

But...managing disks is...not a ZFS strength. At all. As a trivial example, if a disk fails in a hardware raid card, you can immediately go to the GUI, right click on that disk to light up the LED and identify which one it is. Or use automation, and alerting. With ZFS...uh...need to remember arcane commands...

That would be fine in a homelab scenario, but certainly not in a DC, where you have to instruct remote hands to "Go replace the disk in the bay that is blinking red..."

RAIDz expansion is just now making its way into the codebase. After how many years/decades? RAID cards have been able to expand raid arrays for decades, all the while being online.
 

gea

Disk identification is not a ZFS feature, nor a filesystem feature at all, but it can be done with a management GUI and/or a disk controller card or HBA that supports disk identification. RAID-Z expansion (e.g. 6-disk Z2 to 7-disk Z2) is not really an enterprise need, but it has been in ZFS for over a year.

Hybrid raid pools from hd and flash are a big ZFS advantage. You can define per vdev whether Metadata, small files or all files should be on hd or flash. A zfs rewrite command can move data between hd and flash.

ZFS Draid with hundreds of disks can offer ultrashort rebuild time in case of a disk failure.

The upcoming ZFS AnyRaid is a breakthrough in Raid technology, not only for ZFS. It fully supports the complete disk capacity in a Raid 1 or Raid Zn config from disks of different size with vdev expansion and vdev shrink (add/remove disks on AnyRaid).

ZFS can check/verify data in a Raid in online state, no offline chkdsk/fs check needed that can last days

ZFS replication can sync Terabyte Raid highload pools down to a delay of a few seconds even with open files in current state.

Advantages of a hardware RAID that are not available or possible in ZFS software RAID are very rare. The main advantage of a hardware RAID may be a Windows system with an NTFS boot OS mirror. This is mainly due to the current lack of a stable bootable software RAID in Windows (the modern Windows software RAID, Storage Spaces, cannot boot), but an upcoming ReFS boot mirror with Copy on Write and checksums may change that too.

The overall package is what counts.
 

i386

Well-Known Member
Mar 18, 2016
4,901
1,928
113
37
Germany
is areca common/popular/rare in your experience for datacenter usage?
From what I've seen in forums and blogs: Broadcom (LSI as a brand/name is long gone...) -> Adaptec -> others. Broadcom and Adaptec are the dominant brands in the datacenter; Broadcom is used by every big server brand, and Adaptec is mostly used by specific HPE server models.
I've only seen Areca (or others) mentioned when people talk about custom-built servers and workstations.
From a hardware point of view they often look spectacular (even better than Broadcom and Adaptec stuff), but seem to have software issues (mostly driver support)*

* That's probably because the people only mention them when they have problems. The "happy" users without problem are less noticeable :D
RAIDz expansion is just now making its way into the codebase. After how many years/decades?
2017 Matt Ahrens had a proof of concept, 2023 it was "complete" and merged, but it seems to be "broken" (I didn't follow the progress in recent years: Issues · openzfs/zfs)
 

twin_savage

I've only seen Areca (or others) mentioned when people talk about custom-built servers and workstations.
From a hardware point of view they often look spectacular (even better than Broadcom and Adaptec stuff), but seem to have software issues (mostly driver support)*
I'd argue the opposite. As of late it's Broadcom that has been having a rash of software and driver issues, mostly linked to users wanting it to use ASPM or getting into trouble with forced UBM requirements on certain firmware.

Areca, on the other hand, is to my knowledge the first and only manufacturer to offer an NVMe-native mode for the individual drives or arrays it exposes (even the tri-mode HBAs, which are supposed to not have "smarts" in them, can't do this). That can be very important since it lets you use the OS's native NVMe driver instead of abstracting the drives/arrays as SCSI devices. Areca also went to the trouble of writing a firmware mode to support the Apple silicon host interface for their PCIe cards, which is a pretty edge case.
 

mattventura

Well-Known Member
Nov 9, 2022
774
432
63
Yes, there's issues with ZFS and..well, if "write hole" is what we're calling it, that's fine. The issue is ZFS caching data in RAM, especially on writes. That 5s of buffer (by default) in RAM is ill suited for enterprise data strategies. The only alternative is to do "sync always", but then...the performance tanks.

So..you now have to add a fast "power protected" device as SLOG...wait where have I heard that before? Oh yes, a hardware RAID card. :)

I like ZFS, it has a lot to offer, but it does have its issues. So...you know...if you run ZFS on a hardware raid array...which is protected by a hardware raid card...and do sync always... :p

Blasphemy!
That's not exactly how it works.
What matters isn't the exact data path of the write, it's when the write is reported as complete. For a sync write, the write is reported as complete when it is written to the ZIL (which can be part of the main storage drives, or the SLOG device if one exists). For async writes, it buffers into RAM, because the entire point of an async write is that you aren't waiting for it to complete. You can override this behavior by setting the sync property (sync=always forces everything to be treated as if it were synchronous, sync=disabled forces everything to be treated as if it were async), but you typically only need that to work around buggy software that uses sync vs async incorrectly.
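To make the sync vs async distinction concrete from the application side, here is a minimal Python sketch. The function names and file names are made up for illustration; the only real calls are `open`, `flush`, and `os.fsync`. The buffered version returns as soon as the data is in the OS cache, while the durable version returns only after `fsync()` says the data is committed, which on ZFS is the request that goes through the ZIL.

```python
import os
import tempfile


def write_async(path: str, data: bytes) -> None:
    """Buffered write: returns once the data is handed to the OS cache.

    Nothing here waits for the storage stack to commit the data.
    """
    with open(path, "wb") as f:
        f.write(data)


def write_sync(path: str, data: bytes) -> None:
    """Durable write: returns only after fsync() reports completion."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS/filesystem to commit to stable storage


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    # Hypothetical file names, purely for illustration.
    write_async(os.path.join(d, "scratch.tmp"), b"can be regenerated")
    write_sync(os.path.join(d, "journal.db"), b"must survive power loss")
```

Software that needs the second behavior but only does the first will lose recent writes on a power cut regardless of what sits underneath it.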

The SLOG doesn't need to be "power protected" any more than you'd need your main devices to be "power protected". The requirements are the same - the device just needs to only report a write as completed when it is actually committed. For example, with an NVMe drive, you need either a drive with PLP, or a drive without PLP that doesn't prematurely report writes as complete (i.e. a reputable brand, not some cheap no-name junk that tries to cheat on benchmarks). When comparing reputable drives, PLP is actually less of a data safety concern, and more so a performance enhancement, because it lets the drive report completion of the write before it is committed to flash.

If ZFS is losing data, it typically means one of a few things:
  • You're using async writes for something that should be sync (this will be a problem with any storage setup - ZFS, a plain filesystem with no RAID, or HW RAID)
  • Your drives are reporting writes as complete before they're actually committed (usually this means you are using junk or left drive-level write caching enabled)
  • The actual underlying drive messed up the data (e.g. lost power in the middle of a read-modify-write cycle where the physical block size doesn't match the logical block size, like 512e drives)

@gea
But...managing disks is...not a ZFS strength. At all. As a trivial example, if a disk fails in a hardware raid card, you can immediately go to the GUI, right click on that disk to light up the LED and identify which one it is. Or use automation, and alerting. With ZFS...uh...need to remember arcane commands...
Maybe this is just a difference of opinion but I don't think `ledctl failure=/dev/sdwhatever` (or just having a monitoring script doing it for you) is that big of an issue.

Hardware raid handles bitrot via patrol scrubs, just as ZFS would.
There's two things hardware RAID can do. The first is to have block-level checksums, where you format the drive as 520 or 528 bytes instead of 512 (or 4160/4224 instead of 4096) and the extra space is used for checksums. This lets the controller immediately detect a damaged block, and doesn't risk a write hole. However, this requires drives and a controller that support that. Typical patrol reads are only looking for unreadable sectors, and don't protect against bit rot, at least not unless they also have the extra checksum bytes. You can do a full consistency check to verify that the data on each drive matches (or that the parity matches the expected data in RAID 5/6), but unless you have a 3-way mirror or RAID 6, or block-level checksums, that suffers from the issue that there's no authoritative source for which drive is correct. Even if the FS or application has some way of detecting the bit rot, there's no way to tell the HW raid controller to try reading from the other drive instead.
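A toy Python sketch of why the extra bytes matter, assuming a 512-byte sector with an 8-byte check field (520 total). The CRC32 here is only a stand-in and not the real on-disk protection format, and all function names are invented for illustration. With a plain two-way mirror a comparison can only say the copies differ; with a per-sector checksum the reader can tell which copy is the good one.

```python
import zlib

SECTOR = 512  # data bytes per sector
CHECK = 8     # extra bytes per sector in a 520-byte format


def format_sector(data: bytes) -> bytes:
    """Append a check field to a data sector (CRC32 plus padding, as a stand-in)."""
    assert len(data) == SECTOR
    crc = zlib.crc32(data).to_bytes(4, "big")
    return data + crc + bytes(CHECK - 4)


def sector_ok(raw: bytes) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    data, stored = raw[:SECTOR], raw[SECTOR:SECTOR + 4]
    return zlib.crc32(data).to_bytes(4, "big") == stored


def read_mirror(copy_a: bytes, copy_b: bytes) -> bytes:
    """Return the data from the first mirror copy whose checksum verifies."""
    for raw in (copy_a, copy_b):
        if sector_ok(raw):
            return raw[:SECTOR]
    raise IOError("both copies failed their checksum")


def flip_bit(raw: bytes, offset: int) -> bytes:
    """Simulate bit rot by flipping one bit in the data area."""
    damaged = bytearray(raw)
    damaged[offset] ^= 0x01
    return bytes(damaged)


if __name__ == "__main__":
    good = format_sector(b"A" * SECTOR)
    rotted = flip_bit(good, 100)
    # Without checksums: we only learn that the two copies disagree.
    print("copies differ:", good[:SECTOR] != rotted[:SECTOR])
    # With checksums: the damaged copy is rejected and the good one is used.
    print("recovered ok:", read_mirror(rotted, good) == b"A" * SECTOR)
```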
 