Identifying bad hardware - HDD/backplane/cables/HBA

Railgun
Active Member · Jul 28, 2018
I figured I'd post this in this subforum as the overall goal is to have a solid NAS setup, but it also relates to HDDs in and of themselves as well as HBAs. Feel free to move if required.

For completeness, here's what I'm working with for the purposes of testing:

- Seagate EXOS X12 12TB (SAS/SED version), ST12000NM0037, ET04 FW
- Broadcom 9305-24i (latest FW)

And this is the current NAS and the HW that this will eventually be going into (the relevant pieces anyway):

Chassis - Server Case SC-4824 (older version with 6Gbps backplane)
HBA - Same Broadcom as above
SW - ESXi 8 with the HBA passed through to a TrueNAS Core VM
Disks - 21x 6TB WD RED Plus in raidz1 + 1 hotspare
Cables - 10Gtek SFF-8643 - 8643

I'm using some old hardware from a company called Scalable Informatics. The whole setup is an old Supermicro 60-bay top-loading chassis, but the important part here is that I'm leveraging its old expander PCBs for testing these drives, as shown:

[Image: Expander2.jpg]

Now, these disks are somewhat known to have issues. So much so that Seagate replaced the entire complement of them in our environment. There are enough stories, threads, and websites about this line's reliability. But I'm going through testing them anyway to see if I can actually induce some issues.

I had them in my original NAS setup and TrueNAS was complaining left, right, and center about read and write errors as well as checksum errors. Enough that they only lasted a few weeks in that setup.

However, I'm now going through a full test with badblocks to see if there are any problems I can eke out.

I've done one pass and all disks thus far, save two, have passed.

However, I have several disks that have the following output during the course of this testing:


Apr 1 05:06:49 truenas mpr0: Controller reported scsi ioc terminated tgt 23 SMID 325 loginfo 31120303
Apr 1 05:06:49 truenas (da15:mpr0:0:23:0): WRITE(10). CDB: 2a 00 05 25 48 80 00 00 40 00
Apr 1 05:06:49 truenas (da15:mpr0:0:23:0): CAM status: CCB request completed with an error
Apr 1 05:06:49 truenas (da15:mpr0:0:23:0): Retrying command, 3 more tries remain

Apr 2 23:52:55 truenas mpr0: Controller reported scsi ioc terminated tgt 21 SMID 753 loginfo 31170000
Apr 2 23:52:55 truenas (da13:mpr0:0:21:0): READ(10). CDB: 28 00 02 db 50 80 00 00 40 00
Apr 2 23:52:55 truenas (da13:mpr0:0:21:0): CAM status: CCB request completed with an error
Apr 2 23:52:55 truenas (da13:mpr0:0:21:0): Error 5, Retries exhausted
Apr 2 23:52:56 truenas mpr0: Controller reported scsi ioc terminated tgt 21 SMID 1574 loginfo 31110e05
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): READ(10). CDB: 28 00 02 db 50 80 00 00 01 00
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): CAM status: CCB request completed with an error
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): Retrying command, 3 more tries remain
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): READ(10). CDB: 28 00 02 db 50 80 00 00 01 00
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): CAM status: SCSI Status Error
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): SCSI status: Check Condition
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): SCSI sense: UNIT ATTENTION asc:29,2 (SCSI bus reset occurred)
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): Field Replaceable Unit: 2
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): Retrying command (per sense data)
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): READ(10). CDB: 28 00 02 db 50 80 00 00 01 00
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): CAM status: SCSI Status Error
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): SCSI status: Check Condition
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): SCSI sense: NOT READY asc:4,1 (Logical unit is in process of becoming ready)
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): Polling device for readiness

The above happens a lot, but only on some disks, and across the two batches run so far (16 disks concurrently) many of the errors seem to hit the same port (that is to say, da15 in the first run and da15 again in the second). There are some other messages, but for now I think the above are the most relevant.
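To check whether the errors really cluster on one port, a per-device tally of the CAM error lines from the syslog helps. A minimal sketch, using an inline excerpt as a stand-in for the real log (on TrueNAS Core these messages typically land in /var/log/messages, but point it wherever yours go):

```shell
# Count "CAM status" error lines per da device.
# The heredoc below is a stand-in for the real syslog.
cat > /tmp/mpr_sample.log <<'EOF'
Apr 1 05:06:49 truenas (da15:mpr0:0:23:0): CAM status: CCB request completed with an error
Apr 2 23:52:55 truenas (da13:mpr0:0:21:0): CAM status: CCB request completed with an error
Apr 2 23:52:56 truenas (da13:mpr0:0:21:0): CAM status: SCSI Status Error
EOF

awk '/CAM status/ {
  if (match($0, /\(da[0-9]+/)) {
    dev = substr($0, RSTART + 1, RLENGTH - 1)  # strip the leading "("
    count[dev]++
  }
}
END { for (d in count) printf "%s %d\n", d, count[d] }' /tmp/mpr_sample.log \
  | sort > /tmp/cam_tally.txt

cat /tmp/cam_tally.txt
```

With the sample above this prints one line per device with its error count (da13 twice, da15 once), which makes any port clustering obvious at a glance.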

In all but two disks, they pass the first pass with no issues, even with the logs above. Those two have had the following:

root@truenas[~]# badblocks -b 4096 -wsv /dev/da13
Checking for bad blocks in read-write mode
From block 0 to 2929721343
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
Reading and comparing: 29909825done, 19:44:29 elapsed. (0/0/0 errors)
% done, 24:55:17 elapsed. (1/0/0 errors)
1.89% done, 24:55:36 elapsed. (1/0/0 errors)

That reads to me as just a single read error.
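For what it's worth, badblocks reports that triple as read/write/comparison errors, so (1/0/0 errors) is one read error and nothing else. A quick way to pull the counts out of a saved transcript (the sample line below stands in for the real output):

```shell
# Extract the (read/write/comparison) error counts from a badblocks transcript.
cat > /tmp/bb_sample.txt <<'EOF'
1.89% done, 24:55:36 elapsed. (1/0/0 errors)
EOF

sed -n 's/.*(\([0-9]*\)\/\([0-9]*\)\/\([0-9]*\) errors).*/read=\1 write=\2 compare=\3/p' \
  /tmp/bb_sample.txt > /tmp/bb_counts.txt
cat /tmp/bb_counts.txt
```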

I'm re-testing all the disks that had read or write errors in different slots, and I've replaced one of the expander boards to rule it out. Next will be the cables, which are Silverstone units.

I'm trying to weed out a few scenarios.

First, if I can confirm the errors follow a common port, is it a cable issue or an expander issue? The expander side I can work around, as I have 15 of those boards; for cables, I have two spares that I'm not using for this testing.

Second, since I'm running 16 disks concurrently, could there be an HBA loading problem? As you all know, badblocks is 100% write, then 100% read, across all disks. Somewhere in the middle, the workload shifts gradually from all-write to all-read as the disks don't finish their write passes at the same time. That's when I think the other errors pop up, but so far this has happened sometime in the middle of the night, so I can't be 100% sure there's some odd transition issue.

Third, once I get through the initial batches, I'll re-run the ones that popped errors and see if the errors follow the disk or the port/position. My current batch is running only eight disks, two per HBA port/expander. I'm only 90 minutes in and thus far no issues, so we'll see.
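To keep the re-runs consistent, the per-disk invocations can be generated from a list. A sketch in dry-run form (it echoes rather than executes, and the device list here is just illustrative):

```shell
# Emit one destructive badblocks invocation per device under re-test.
# DISKS is illustrative; substitute the actual da devices that popped errors.
DISKS="da13 da15"
for d in $DISKS; do
  # -b 4096: 4 KiB blocks; -w: destructive write test; -s/-v: progress/verbose
  # -o: write any discovered bad-block list to a per-disk file
  echo "badblocks -b 4096 -wsv -o /tmp/badblocks.$d.list /dev/$d"
done > /tmp/bb_cmds.txt

cat /tmp/bb_cmds.txt
```

Dropping the echo (and running each under something like tmux) would kick off the actual tests; the per-disk output files make it easy to compare runs when a disk moves to a different port.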

Once I have a clean batch of 24 disks, I will set it up, albeit in different HW, as it will exist in my current server and let it bake for a bit, throwing some writes its way and see how it behaves.


So, two questions born of the above:

- For those more familiar with these logs than I am: how do they read to you? I can provide more logs if required.

- I'm somewhat wary about the quality of these cables. I'm usually firmly in the "a cable is a cable" camp, but in this case I'm thinking something better is required. Does anyone have a recommendation for better cables?

EDIT:

So, 24 hours into my third run, with only eight drives going this time around, and not a single issue. Somewhat anecdotally, I'd seen references to these cards possibly having issues at higher loads. At best, I was pushing ~3.6 GB/s at 100% write or read (~230 MB/s per drive), which is less than half of what the bus and card can do.

[Screenshot: Screenshot 2023-04-07 at 10.02.13 AM.png]

And over time this drops as the test progresses. Now, while I am on an older mobo for this testing (ASRock H87M Pro4), I am using its PCIe 3.0 x16 slot for the HBA, so no apparent bottleneck there.
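Sanity-checking the ~3.6 GB/s figure, assuming it came from the earlier 16-drive batches:

```shell
# Rough aggregate-throughput check: 16 drives at ~230 MB/s each.
drives=16
per_drive_mb_s=230
total=$((drives * per_drive_mb_s))
echo "${total} MB/s aggregate"   # 3680 MB/s, i.e. ~3.6 GB/s
```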
 