Edit / Conclusion:
These errors were being logged intermittently, for writes only, against drives in a SAS expander JBOD. The cause was a mostly good, but partly bad, port on an LSI SAS9206-16e card running IT mode P20.00.07 firmware:
mpt2sas_cm1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
The problem was noticed due to resilvers restarting constantly and small numbers of checksum errors racking up against drives in a zpool while resilvering.
The exact defect of the card is unknown, and the card mostly works fine. This seems to happen to certain drives more than others, even when they are moved to other backplanes, which may have something to do with how the card assigns SAS IDs to SATA drives. It appears to be some sort of communication drop while sending write data to the drives, so the data never makes it to them. This is all on Supermicro SAS2 expander backplanes running the latest firmware, but it could happen with other expanders, and probably with directly connected drives as well.
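If you want to check whether you are seeing the same error, a small script can pull these lines out of your kernel log. This is a minimal sketch that assumes your log lines look like the one quoted above. The `parse_log_info` name and the bit layout used for the cross-check (code in bits 16-23, sub_code in the low 16 bits) are my assumptions; the layout is consistent with the quoted line but is not taken from LSI documentation.

```python
import re

# Matches the log line format quoted above. The prefix (e.g. mpt2sas_cm1)
# identifies which controller instance logged the event.
LOG_INFO_RE = re.compile(
    r"(?P<ioc>mpt\dsas_cm\d+): log_info\((?P<raw>0x[0-9a-fA-F]{8})\): "
    r"originator\((?P<originator>\w+)\), "
    r"code\((?P<code>0x[0-9a-fA-F]+)\), "
    r"sub_code\((?P<sub_code>0x[0-9a-fA-F]+)\)"
)


def parse_log_info(line):
    """Return the fields of one log_info line, or None if it does not match."""
    m = LOG_INFO_RE.search(line)
    if m is None:
        return None
    raw = int(m.group("raw"), 16)
    return {
        "ioc": m.group("ioc"),
        "raw": raw,
        "originator": m.group("originator"),
        "code": int(m.group("code"), 16),
        "sub_code": int(m.group("sub_code"), 16),
        # Assumed layout: code in bits 16-23, sub_code in bits 0-15.
        "code_from_raw": (raw >> 16) & 0xFF,
        "sub_code_from_raw": raw & 0xFFFF,
    }


line = ("mpt2sas_cm1: log_info(0x31120303): originator(PL), "
        "code(0x12), sub_code(0x0303)")
info = parse_log_info(line)
print(info["ioc"], hex(info["code"]), hex(info["sub_code"]))
```

Feeding it the output of `dmesg` line by line and counting hits per controller prefix shows which card is logging the errors.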
Switching to a different port on the same card solved the issue, but replacing the card is recommended. Then run a zpool scrub, and ensure you do not see these errors again.
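To check after the scrub that no drive is still collecting checksum errors, something like the following can read the CKSUM column of `zpool status` output. It is only a sketch: the function name and the sample text are made up for illustration, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM table with plain integer counts.

```python
def devices_with_cksum_errors(status_text):
    """Return {device_name: cksum_count} for rows with a non-zero CKSUM."""
    in_config = False
    bad = {}
    for line in status_text.splitlines():
        parts = line.split()
        if not in_config:
            # The device table starts right after the column header row.
            in_config = parts[:2] == ["NAME", "STATE"]
            continue
        if not parts:
            continue
        if parts[0] == "errors:":
            break  # end of the device table
        # Rows look like: NAME STATE READ WRITE CKSUM [optional note]
        if len(parts) >= 5 and parts[4].isdigit():
            cksum = int(parts[4])
            if cksum > 0:
                bad[parts[0]] = cksum
    return bad


# Hypothetical sample output, not taken from the pool described in this post.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  raidz3-0  ONLINE       0     0     0
\t    sda     ONLINE       0     0     2
\t    sdb     ONLINE       0     0     0

errors: No known data errors
"""

print(devices_with_cksum_errors(SAMPLE))
```

In practice you would pass it the text captured from `zpool status` for your pool instead of the sample string.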
######
Quick version / TLDR:
I know others had problems with mis-flashed firmware from the Supermicro SC847 JBOD deal that came through about a year ago, but has anyone else had any other issues with SC847 JBOD units, especially with the rear 21-bay backplanes? If so, did you get your units in the cheap deal, or not? I have just had my second rear backplane issue with an SC847 JBOD unit; both units were probably part of the deal (I'm not positive, though).
Details:
About two years ago I got several SC847 JBOD chassis, mostly at higher prices before that cheap deal came along, but I think I got 3 units from the deal as well. I'm trying to figure out whether there are general issues with the rear backplane, or just with some units that were part of that deal.
I never kept track of which of my JBODs were from the deal, but I have now had problems with the rear single SAS2 expander 21-bay backplane in two separate SC847 JBOD units: dropping drives (often, but not always, during heavy I/O) or bays going bad / flaky, even with the latest firmware. My drives are all SATA, not SAS, so this is a known risk, but in my first instance it was definitely the particular backplane at fault.
I have never had issues with the front backplanes in these, only the rear backplanes. My first failure was in a JBOD which had been fine almost fully loaded (36 of 45 drives, moved from a 36-bay SC847) until a year or so ago, when it dropped 4 drives from a z3 zpool while I was copying to it, and the pool went read-only. While resilvering after a reboot, random drives would rack up checksum errors, and one bay in particular seemed to continually drop whatever drive was in it from that point onwards. After changing cables and controllers, I eventually swapped the drives into another chassis, where the pool was able to resilver without many checksum errors. I then scrubbed it, and it has been fine ever since. None of the drives turned out to be bad once they were in a different chassis. I haven't had time to unrack the original chassis and replace the rear backplane since.
Yesterday, when I was finally copying from that chassis to a newly created array on the rear backplane of a third SC847 JBOD (I had been using another zpool exclusively on the front backplane of this one, off and on for a year or so, without any issues), it dropped 2 drives during the copy. After rebooting, the drives were seen again, so I tried a zpool replace of them onto 2 other drives I had added in different bays. That got stuck racking up a few checksum errors on random drives, like the other unit had done, and also kept restarting its resilver every few minutes. I powered off and moved the drives to the known good front backplane, but the resilver kept restarting itself there as well, though with fewer checksum errors (only 2 on one other drive). I zpool detached the drives I was trying to replace onto, and let it try to just resilver overnight. In the morning, it was clear that the resilver had kept restarting, and one drive is now racking up reallocations in SMART.
The failing drive could have been the culprit of the whole failure this time. I have heard tales that a single SATA drive starting to fail can cause lockups and drops on other drives, or even on the whole backplane, but this is the first time I have potentially seen it myself. However, the drives generally seem to be behaving better in the front backplane than they were in the rear backplane, which leads me to suspect a backplane issue (again).
I am now running long SMART tests on all of the drives, and my next step will be to pull the drive with reallocations and ddrescue it onto a known good drive, and then try the resilver with the replacement drive instead and see if that completes instead of continuously restarting.
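For spotting which drive is reallocating sectors, a helper along these lines can read the raw Reallocated_Sector_Ct value out of `smartctl -A` output. This is a sketch under assumptions: the function name and sample text are illustrative, and it expects the usual ten-column ATA attribute table with the raw value in the last column.

```python
def reallocated_sectors(smart_attr_text):
    """Return the raw Reallocated_Sector_Ct value, or None if not found."""
    for line in smart_attr_text.splitlines():
        parts = line.split()
        # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if (len(parts) >= 10
                and parts[1] == "Reallocated_Sector_Ct"
                and parts[9].isdigit()):
            return int(parts[9])
    return None


# Hypothetical sample rows, not the actual output from the drives in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18432
"""

print(reallocated_sectors(SAMPLE))
```

Running it against every drive in the pool, before and after a resilver attempt, shows whether the count on a given drive is still climbing.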
Cabling note:
When the first incident happened, each backplane was cabled separately to the controller with a single cable. During my troubleshooting of that incident, I switched all of my JBODs to a single cable from the controller to the rear backplane, daisy-chained internally to the front backplane. (Two of the other 3 external ports are cabled to the front backplane, so that if I have enough controllers I can instead dual-cable to the front and daisy-chain to the rear without opening the units up.) Effectively, the entire time this second problematic JBOD has been running drives on the good front backplane, they were daisy-chained through the rear backplane, not directly attached to the controller.
Temperatures:
I initially suspected the SAS expander chip on the rear backplane was overheating, but my spot checks show drive temps well under 40C, so the expander chip should not be too hot either. If I have time, I may load the rear expander with drives, write to it a lot, then quickly power it off, open it up, and check the SAS expander heatsink temp to rule this out, but it is now low on my suspects list.