I was just sipping my coffee at my home workstation when I noticed I had lost my main data drive, which I use as a network drive. It is a 7-disk array, of which 6 drives were active in hardware RAID 6 with battery backup and cache.
I signed in to the Broadcom LSI Storage Manager webpage and looked at the logs. It was not a normal issue, but a catastrophic one.
As best I can tell, a single drive had a "lost sense," i.e. a drive failure where it stops responding, but it apparently cascaded to two other drives during recovery, or they had independent, unlucky errors. This kind of failure seems a bit unlikely, so I wonder whether there was a power supply glitch, or whether the RAID card failed during its automatic recovery and the errors were transient.
Before we get into it: my backups are so-so, and it's likely I will lose a small-ish amount of un-backed-up data, either due to timing between critical backups or because I did not consider the data vital enough to back up. I would like to recover the array for the recent file changes and for convenience, of course. This VD holds about 6 TB in an array of 3 TB Seagates, of which 90 to 95% is backed up and around 0.1% may be recent critical data files. The rest of the un-backed-up data is my collection of replaceable things that I can re-download from the Interweb Tubes. I am using Windows 11. I have a controller card I bought off eBay, and I modified it to have a fan that blows constantly on the controller chip. My last controller would get CRC errors occasionally, so I figured it wouldn't hurt to keep it cool.
Also, FYI, I had been running a monthly patrol read of the VD. The last patrol read was on the 30th, as in 7 days ago. They take about a day.
Around 9:03 pm, Device ID 16 had several timeouts. It's the failed drive. And then... the controller tried to bring my Global Hot Spare (GHS), Device 17, online:
9:03:04 ID 16, Unexpected sense, logical unit not ready [several like this including bus resets]
9:03:17 ID 16, Unrecoverable medium error during recovery
9:03:17 ID 16, Puncturing bad block 0x117cad820
9:03:17 ID 17, Puncturing bad block 0x117cad820 [That's not good! Other one simultaneously bad?]
9:03:36 ID 16, Timeout (again)
9:03:36 ID 16, Reset
[Several timeouts and unexpected senses from ID 16]
9:04:52 ID 17, Unrecoverable medium error during recovery
9:04:52 ID 16, Unrecoverable medium error during recovery
9:05:24 ID 16, Unexpected sense, Unrecovered read error [that's a first]
9:05:24 ID 16, Unrecoverable medium error during recovery [again]
Then a huge gap of over 30 minutes with nothing logged. I suppose the array was degraded and the controller was copying to the GHS.
9:44:26 ID 16, Unexpected sense, Unrecovered read error [again]
9:44:26 ID 16, Unrecoverable medium error during recovery [again]
The last two are repeated - more read errors
9:45:09 ID 15, Puncturing bad block Location 0xaeb8408 [Oh, no, what's going on to ID 15?]
9:45:09 ID 15, Puncturing bad block Location 0xaeb8409 [Two?]
9:45:09 ID 17, Puncturing bad block Location 0x6450340 [Hello ID 17?]
9:45:09 ID 18, Puncturing bad block Location 0x6450340 [Hello ID 18?]
9:45:09 ID 17, Puncturing bad block Location 0x6450341
9:45:09 ID 18, Puncturing bad block Location 0x6450341
9:45:09 ID 17, Puncturing bad block Location 0x6450342
9:45:09 ID 18, Puncturing bad block Location 0x6450342
[Several unexpected senses and Unrecoverable read errors from ID 16]
9:47:21 [Several more puncturing bad blocks on ID 17 and ID 18 0x6450350 to 0x6450359]
9:47:22 [Several more puncturing bad blocks on ID 17 and ID 18 0x645035a to 0x645035f]
[Several unexpected senses and Unrecoverable read errors from ID 16]
9:48:01 ID 16, Reset Type 3
9:48:05 ID 16, Failed. Drive Error Counter: 44
9:48:05 ID 16, Previous: Online, Current: Failed
9:48:05 Virtual Drive. State change on VD: 0 Previous: Degraded; Current: Offline;
9:48:05 Controller cache pinned for missing or offline VD: VD 0
9:48:05 VD is now OFFLINE VD 0
9:48:05 ID 15, Puncturing bad block Location 0xaeb84e8 [repeated for sequential blocks up to 0xaeb84f7]
9:48:08 Number of valid snapdump available is 1
[Several unexpected senses, resets, and then disk removed from ID 16]
9:48:31 ID 16, Previous: Failed, Current: UnConfigured Bad
9:50:37 ID 16, Disk: Inserted
So in less than an hour, ID 16 went bad, the GHS (ID 17) failed during recovery, and then several bad blocks were found on ID 18 and ID 15. The current state of things is dire. I am faced with trying to correct it by forcing some of these failed drives back into the array. However, the current drive list shows three drives in the failed VD and three drives listed as "foreign" (see pic). I got the PC to boot out of the LSI "safe mode" by removing the disk cache, which is probably toast anyway.
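
To double-check that summary against the log, the puncture events can be tallied per device with a few lines of Python. This is only a sketch: the excerpt is a handful of lines copied from the log above, the regular expression is my own guess at the line format (it is not an official MegaRAID log grammar), and the function names are made up for illustration. It does not expand the bracketed "several more" ranges.

```python
import re
from collections import defaultdict

# A few puncture lines copied from the event log above (not the full log).
LOG_EXCERPT = """\
9:03:17 ID 16, Puncturing bad block 0x117cad820
9:03:17 ID 17, Puncturing bad block 0x117cad820
9:45:09 ID 15, Puncturing bad block Location 0xaeb8408
9:45:09 ID 15, Puncturing bad block Location 0xaeb8409
9:45:09 ID 17, Puncturing bad block Location 0x6450340
9:45:09 ID 18, Puncturing bad block Location 0x6450340
9:45:09 ID 17, Puncturing bad block Location 0x6450341
9:45:09 ID 18, Puncturing bad block Location 0x6450341
9:45:09 ID 17, Puncturing bad block Location 0x6450342
9:45:09 ID 18, Puncturing bad block Location 0x6450342
"""

# Assumed line shape: "<time> ID <n>, Puncturing bad block [Location ]<hex>"
PUNCTURE = re.compile(
    r"^\d+:\d+:\d+ ID (\d+), Puncturing bad block (?:Location )?(0x[0-9a-fA-F]+)"
)


def punctured_blocks(log_text):
    """Map device ID -> sorted list of punctured block addresses."""
    blocks = defaultdict(set)
    for line in log_text.splitlines():
        match = PUNCTURE.match(line.strip())
        if match:
            blocks[int(match.group(1))].add(int(match.group(2), 16))
    return {dev: sorted(addrs) for dev, addrs in blocks.items()}


def shared_blocks(blocks, dev_a, dev_b):
    """Block addresses punctured on both devices."""
    return sorted(set(blocks.get(dev_a, [])) & set(blocks.get(dev_b, [])))


if __name__ == "__main__":
    blocks = punctured_blocks(LOG_EXCERPT)
    for dev in sorted(blocks):
        print(f"ID {dev}: {len(blocks[dev])} punctured block(s)")
    print("ID 17 and ID 18 share:", [hex(b) for b in shared_blocks(blocks, 17, 18)])
```

In the excerpt, every address punctured on ID 18 is also punctured on ID 17, and ID 16 and ID 17 share 0x117cad820. I don't know whether that means the controller records a puncture on more than one drive per stripe or whether these are independent media errors.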

In the VD: ID 17, ID 18, and ID 15. All of these have reported bad blocks.
Foreign drives: ID 19, ID 13, and ID 14
Weirdly enough, the foreign drives were never mentioned in the logs. That set seems to be the healthiest one. I wish I could switch over to the foreign drives and see whether any of the failed drives could be added to that array.
So it would be: IDs 19, 13, 14, and either 18 or 15 or both. (I need four drives, of course, for the array to work.)
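
To spell out the arithmetic: a 6-member RAID 6 tolerates two missing members, so at least 4 of the original 6 must be readable. The sketch below just enumerates the drive sets I am considering. It assumes the three foreign drives are good and that the only candidates from the failed VD are IDs 18 and 15 (ID 16 is failed and I don't trust ID 17); the function and constant names are made up for illustration, and it says nothing about whether the controller will actually accept any of these sets.

```python
from itertools import combinations

TOTAL_MEMBERS = 6   # active drives in the RAID 6 virtual drive
PARITY_DRIVES = 2   # RAID 6 tolerates two missing members
MIN_DRIVES = TOTAL_MEMBERS - PARITY_DRIVES  # 4 drives needed to read the array


def candidate_sets(trusted, suspect, min_drives=MIN_DRIVES):
    """List drive sets made of all trusted drives plus enough suspect ones."""
    need = max(min_drives - len(trusted), 0)
    result = []
    for extra_count in range(need, len(suspect) + 1):
        for extra in combinations(suspect, extra_count):
            result.append(tuple(trusted) + extra)
    return result


FOREIGN = (19, 13, 14)  # listed as "foreign", never mentioned in the log
SUSPECT = (18, 15)      # in the failed VD, with punctured blocks

if __name__ == "__main__":
    for drives in candidate_sets(FOREIGN, SUSPECT):
        print(drives)
```

That gives three options: the foreign drives plus ID 18, plus ID 15, or plus both.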
I don't trust ID 17, the GHS as of 9:00 pm yesterday. (see pic)

My question is this: can I remove the current array (I mean, it's populated with just the baddest ones) and then ask the controller to import the foreign array? I don't understand how the drives got labeled that way or if it matters. I am also wondering if anyone has worked with punctured bad blocks and tried to remove them with megacli or megacli64, which I have installed.
