Culprit: slightly bad LSI card - was: SC847 45-bay JBODs rear backplane issues - anyone else?

Discussion in 'Chassis and Enclosures' started by sfbayzfs, Aug 22, 2017.

  1. sfbayzfs

    sfbayzfs Active Member

    Joined:
    May 6, 2015
    Messages:
    245
    Likes Received:
    102
    Edit / Conclusion:

    These errors were being logged intermittently, and for writes only, against drives in a SAS expander JBOD. The cause was a mostly good but partly bad port on an LSI SAS9206-16e card running IT mode P20.00.07 firmware:
    mpt2sas_cm1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)

    The problem was noticed due to resilvers restarting constantly and small numbers of checksum errors racking up against drives in a zpool while resilvering.

    The exact defect of the card is unknown, and the card mostly works fine. The errors hit certain drives more than others, even when those drives are moved to other backplanes, which may have something to do with how the card assigns SAS addresses to SATA drives. It appears to be some sort of communication drop while sending write data to the drives: the write is reported as failed and the data never makes it to them. This was all on Supermicro SAS2 expander backplanes running the latest firmware, but it could presumably happen with other expanders, and probably with directly connected drives as well.

    Switching to a different port on the same card solved the issue, but replacing the card is recommended. Then run a zpool scrub, and ensure you do not see these errors again.
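    If you suspect this failure mode, the error signature can be pulled out of the kernel log mechanically. A minimal sketch - the sample line below is the one quoted above; in practice pipe real `dmesg` output through the same sed:

```shell
# Extract the log_info code from an mpt2sas error line. The sample line
# is the error signature quoted above; feed real `dmesg` output through
# the same sed expression to scan a live system.
sample='mpt2sas_cm1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)'
code=$(printf '%s\n' "$sample" | sed -n 's/.*log_info(\(0x[0-9a-fA-F]*\)).*/\1/p')
echo "$code"
```

    The extracted value 0x31120303 matches the originator/code/sub_code fields the driver prints alongside it.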

    ######

    Quick version / TLDR:

    I know others had problems with mis-flashed firmware from the Supermicro SC847 JBOD deal that came through about a year ago, but has anyone else had any other issues with SC847 JBOD units, especially with the rear 21-bay backplanes? If so, did you get your units in the cheap deal, or not? I have just had my second rear backplane issue with a SC847 JBOD unit, both of them were probably part of the deal (I'm not positive though.)

    Details:

    About two years or so ago I got several SC847 JBOD chassis, most at full price before that cheap deal came by, but I think I got 3 units from the deal as well. I'm trying to figure out whether there are general issues with the rear backplane, or just with some units which were part of that deal.

    I never kept track of which of my JBODs were from the deal or not, but now I have had problems with the rear single SAS2 expander 21-bay backplane in two separate SC847 JBOD units dropping drives (often, but not always during heavy I/O) or bays going bad / flakey even with the latest firmware. My drives are all SATA, not SAS, so this is known to be a risk, but in my first instance, it was definitely the particular backplane at fault.

    I have never had issues with the front backplanes in these, only the rear backplanes. My first failure was in a JBOD which had been fine almost fully loaded (36 of 45 drives moved from a 36-bay SC847) until a year ago or so, when it dropped 4 drives from a z3 zpool while I was copying to it, and the pool went read-only. Resilvering after rebooting, random drives would rack up checksum errors, and one bay in particular seemed to continually drop whatever drive was in it from that point onwards. After changing cables and controllers, I eventually swapped the drives into another chassis, where the pool resilvered with few checksum errors and scrubbed cleanly, and it has been fine ever since. None of the drives was actually bad once they were in a different chassis. I haven't had time to unrack that unit and replace the rear backplane since.

    Yesterday I was finally copying from that chassis to a newly created array on the rear backplane of a third SC847 JBOD (I had been using another zpool on the front backplane of this one, off and on for a year or so, without any issues), when it dropped 2 drives during the copy. After rebooting, the drives were seen again, but when I tried a zpool replace of them onto 2 other drives I had added in different bays, it got stuck racking up a few checksum errors on random drives like the other unit had done, and also got stuck restarting its resilver every few minutes. I powered off and moved the drives to the known good front backplane, but the resilver kept restarting itself there as well, although with fewer checksum errors (only 2, on one other drive). I zpool detached the drives I was trying to replace onto, and let it try to just resilver overnight. In the morning, it became clear that the resilver had been restarting all along, and now one drive has been racking up reallocations in SMART.

    The failing drive could have been the culprit of the whole failure this time - I have heard tales that a single failing SATA drive can cause lockups and drops on other drives or even the whole backplane, but this is the first time I have potentially seen it myself. However, the drives generally seem to be behaving better in the front backplane than they were in the rear backplane, which leads me to suspect a backplane issue (again).

    I am now running long SMART tests on all of the drives, and my next step will be to pull the drive with reallocations, ddrescue it onto a known good drive, and then retry the resilver with the replacement drive to see if it completes rather than continuously restarting.
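    For reference, a rough sketch of the two-pass clone I have in mind, assuming GNU ddrescue - the device paths and map-file name are placeholders, not my actual drives, and the commands are only echoed here since they overwrite the destination:

```shell
# Rough sketch of the clone step, assuming GNU ddrescue.
# SRC/DST/MAP are placeholders - substitute real /dev/disk/by-id paths.
SRC=/dev/disk/by-id/ata-FAILING_DRIVE_SERIAL
DST=/dev/disk/by-id/ata-GOOD_SPARE_SERIAL
MAP=rescue.map

# Build the command lines first so they can be reviewed before anything
# touches the disks; uncomment the last line to actually run them.
PASS1="ddrescue -f -n $SRC $DST $MAP"    # quick pass, skip slow scraping
PASS2="ddrescue -f -r3 $SRC $DST $MAP"   # retry unreadable areas 3 times
printf '%s\n%s\n' "$PASS1" "$PASS2"
# eval "$PASS1" && eval "$PASS2"
```

    The map file lets the second pass retry only the areas the first pass could not read.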

    Cabling note:

    When the first incident happened, each backplane was cabled separately to the controller. During my troubleshooting of that, I switched all of my JBODs to a single cable from the controller to the rear backplane, daisy-chained internally to the front backplane (with two of the other 3 external ports cabled to the front backplane, in case I ever have enough controllers to dual-cable the front and daisy-chain to the rear instead, without opening the units up). Effectively, the entire time this second problematic JBOD has been running drives on the good front backplane, they were daisy-chained from the rear backplane, not directly attached to the controller.

    Temperatures:

    I initially suspected the SAS expander chip on the rear backplane was overheating, but spot checking my drive temps, they are well under 40C, so the expander chip should not be too hot either. If I have time, I may load the rear expander with drives, write to it a lot, and then quickly power it off, open it up and check the SAS expander heatsink temp to rule this out, but it is now low on my suspects list.
     
    #1
    Last edited: Aug 26, 2017
  2. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,679
    Likes Received:
    410
    Are those the chassis with the fan mods?
     
    #2
  3. sfbayzfs

    sfbayzfs Active Member

    Edit - for the drives in my scrollback, drive temps were 28-29C with a min of 23 and max of 28 or 29 depending on the drive. These were mostly the replacement drives I was trying to resilver onto, so they might not have warmed up as much, but the others can't have been more than a couple of degrees warmer.

    Yes, these are the units running 4x 0.35A fans - I was adding the temperature notes as you posted your reply - I need to go through scrollback, but I'm pretty sure my drive temps were in the 31-35 C range when this happened, definitely below 40C.
     
    #3
    Last edited: Aug 22, 2017
  4. sfbayzfs

    sfbayzfs Active Member

    Update - at this point, I think it was 2 drives at fault - one was one of the two which were dropped, the other was a different drive.

    Steps taken so far for anyone interested:
    1. SMART long read tests came back clean on all drives except for one, which got up to over 200 remapped sectors and lots of pending as well! It had only 60 power on hours too, but is a few months out of HGST warranty :{
    2. I put a known good drive in a free bay and ddrescued the drive with the remappings onto it - I used /dev/disk/by-id/ata-* names to reference the drives, since drive letters can change. ddrescue reported something under 4MB of unreadable sectors across 9 areas at the end.
    3. I pulled the failing drive, rebooted, and zpool imported - the pool started resilvering with the /dev/sd?? name of the replacement drive instead of the old serial number it was replacing, and did not restart every minute or two!
    4. After an hour or so, it became apparent that the resilver had restarted :(
    5. Checking dmesg, I saw write errors from the LSI card driver for one drive, which I tracked down to one of the two drives that were originally dropped from the system causing the problem in the first place.
    6. Checked SMART for the drive showing errors - it had just passed a long SMART read test and showed no SMART errors or other issues, except one really old ICRC error logged against it. The dmesg errors were all for different sectors, and they were write errors, but they never got logged in the SMART error log - not sure why.
    7. Moving the drive to a different slot and restarting the resilver showed errors on that drive again, so it's the drive, still no new errors logged against it in SMART.
    8. I started a ddrescue of this problem drive onto another known good drive, and I'll try again with the two replacements.
    ...The saga continues, but ddrescue rocks for replacing failing drives in a zpool - the resilver and a scrub will clean up any actually bad data, but you're starting with a lot of good data, reducing your risk level.
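    One thing that would have helped earlier: catching the resilver restarts, which don't get logged anywhere. A hedged sketch of the idea - poll `zpool status` and flag when the "scanned" figure goes backwards. The two status lines below are made-up samples, and this naive comparison assumes both figures use the same unit:

```shell
# Detect a resilver restart by comparing the "scanned" figure between
# two polls of `zpool status`. The two lines below are made-up samples;
# in practice capture them from the real pool a few minutes apart.
prev='5.10T scanned out of 14.5T at 200M/s, 13h40m to go'
curr='0.02T scanned out of 14.5T at 180M/s, 23h30m to go'
p=$(printf '%s\n' "$prev" | awk '{print $1+0}')   # numeric part, unit stripped
c=$(printf '%s\n' "$curr" | awk '{print $1+0}')
restarted=$(awk -v p="$p" -v c="$c" 'BEGIN {print (c < p) ? "yes" : "no"}')
echo "resilver restarted: $restarted"
```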

    A couple of years ago, a friend applied the same ddrescue procedure migrating from failing Seagate 3TB disaster drives to HGST 4TBs with great success.

    Anyway, the rear backplane might actually be just fine, but I'm not going to be able to test it thoroughly for at least a couple of days.
     
    #4
  5. sfbayzfs

    sfbayzfs Active Member

    I did a lot more troubleshooting and drive replacement on the pool, but long story short: moving the chassis connection to a different controller port, with a different cable, solved the remaining strange problems - resilvers restarting even after the ddrescued replacement for the drive with bad sectors was swapped in, and small numbers of checksum errors racking up on half of the drives in the pool during the aborted resilvers. I have scrubbed twice and no more errors have appeared, so I am going to clear the pool, replace the one damaged file, and call it good.

    Even though the pool had been written to for hours with no parity (since 2 drives had dropped from a Z2 pool), in the end only 1 file was actually damaged - the one which was probably being copied when the drive with actual bad sectors froze for a bit doing reallocations and caused the other 2 drives to be dropped!

    Interesting points and remaining mysteries about the dmesg write errors from the mpt2sas driver:
    1. The dmesg write errors for certain drives persisted for those same actual drives, even when they were moved to different slots on the same backplane, and through reboots, making it look like the drives had a communication issue.
    2. Those dmesg write errors seem to have never made it to the drives themselves - they did not log in SMART
    3. I am pretty sure the errors were correlated with the unlogged self-restarts of the resilver process
    4. From a zfs perspective, during the resilvers, checksum errors were racked up against the drives with these driver write errors, but also other drives - I suspect the zfs rewrites of the checksum error blocks on other drives were successful, but not to these specific drives.
    5. One of the two dropped drives in the rear backplane was the drive which had the largest number of these write errors in the front backplane
    6. The dmesg errors were only for writes, never reads, and seem to have been a controller-to-specific-drive communication issue, where occasionally some packets sent to the specific "problem" drives were dropped somewhere - probably the card's fault, maybe the expander backplane's, and much less likely the cable's.
    7. All of my JBODs are connected through a single SAS9206-16e card, (running IT mode 20.00.07 firmware) which is two LSI SAS2308 controllers on one card - the suspicious port is the top port, and the port I switched to is one of the bottom two ports, and I think the split of the 2 controllers on the card is top 2 ports on one controller, bottom two ports on the other controller. All 3 other ports seem to have been fine so far, but I haven't been doing super heavy writes to them - the second port has the zpool I was copying from attached to it, and that has had no errors.
    8. There seems to be absolutely nothing wrong with any of the drives except the one with the actual reallocated sectors - aside from the intermittent write errors they show when attached to a certain port of a certain card.
    I think the mpt2sas driver write errors in dmesg must have something to do with the drive identity information which the controller uses for its low-level SAS id allocation, since the errors persist for certain drives even when they are moved to other bays and even other backplanes. The really annoying thing is that this intermittent case is really hard to test for, other than by watching dmesg.

    The errors were so specifically tied to certain drives that I tried ddrescue copying them to other drives, and the copies were clean according to ddrescue - but I got suspicious when one of the targets, a known good, just re-wiped and SMART-tested drive, also started racking up the errors.

    I never found the same error pattern googling around - mostly a different problem with drives which had been put to sleep not waking up in time - but I did eventually find someone with similar errors whose errors all went away when they replaced the card.

    The frustrating thing is that the card seems fine, and you can write tons of data to drives which all works perfectly until it doesn't, so it is not easy at all to test for this case, or validate that any controller is actually working properly.

    I am going to trade the entire card for a spare eventually, and do some tests on the rear backplane with the drives which had write issues on this controller to see if they crop up, but I suspect even if another controller has a bad port, these drives will seem fine on it, and different drives will be the unlucky ones on the other controller, unless the drive SAS addressing algorithm has nothing to do with the card SAS address, and only properties of the drive.

    Here is an example of the full dmesg error block, since I didn't post it before:
    Code:
    [147053.091973] mpt2sas_cm1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
    [147053.091977] mpt2sas_cm1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
    [147053.091990] sd 1:0:17:0: [sdbc] FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
    [147053.092008] sd 1:0:17:0: [sdbc] FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
    [147053.092010] sd 1:0:17:0: [sdbc] CDB: Write(16) 8a 00 00 00 00 01 0f 18 9a 30 00 00 01 00 00 00
    [147053.092012] blk_update_request: I/O error, dev sdbc, sector 4548237872
    [147053.092018] sd 1:0:17:0: [sdbc] CDB: Write(16) 8a 00 00 00 00 01 0f 18 9b 30 00 00 01 00 00 00
    [147053.092022] blk_update_request: I/O error, dev sdbc, sector 4548238128
    [147053.841746] sd 1:0:17:0: CDB: Test Unit Ready 00 00 00 00 00 00
    [147053.841753] mpt2sas_cm1: sas_address(0x50030480019920df), phy(31)
    [147053.841756] mpt2sas_cm1: enclosure_logical_id(0x50030480019920ff),slot(19)
    [147053.841758] mpt2sas_cm1: handle(0x001c), ioc_status(success)(0x0000), smid(52)
    [147053.841760] mpt2sas_cm1: request_len(0), underflow(0), resid(-131072)
    [147053.841762] mpt2sas_cm1: tag(65535), transfer_count(131072), sc->result(0x00000000)
    [147053.841764] mpt2sas_cm1: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
    [147053.841766] mpt2sas_cm1: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
    
    Just grepping dmesg for this pattern should catch them:
    blk_update_request: I/O error, dev

    so:
    dmesg | grep 'blk_update_request: I/O error, dev'
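    To go one step further and see which devices are accumulating these, the same log lines can be tallied per device. A small sketch, fed here from the two I/O-error lines quoted in the dmesg block above - pipe real `dmesg` output in instead:

```shell
# Count I/O-error lines per device. The heredoc holds the two sample
# lines from the dmesg block above; replace it with real dmesg output.
counts=$(awk '/blk_update_request: I\/O error, dev/ {gsub(",","",$6); print $6}' <<'EOF'
[147053.092012] blk_update_request: I/O error, dev sdbc, sector 4548237872
[147053.092022] blk_update_request: I/O error, dev sdbc, sector 4548238128
EOF
)
printf '%s\n' "$counts" | sort | uniq -c
```

    A device with a rapidly climbing count is the one to move to another port or chassis first.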

    When I get a chance, I will set up a 21-drive zpool including the drives I had the dmesg errors with on this controller, and try the pool on both this enclosure's rear backplane and the other one I was positive was bad before (but which may not actually be bad), with both the known bad controller port and other controller/port combos, and see if I can find a further pattern - the backplanes could both be fine, although I'm pretty sure that for the other one I did try other controller ports and the same bay was always flakey. It may take a while to get to that particular experiment, but I want to know what is going on!
     
    #5
    Last edited: Aug 26, 2017
  6. sfbayzfs

    sfbayzfs Active Member

    Ahhah, going through my /var/log/messages, I see more of these errors when I was moving some files around on a different zpool in the front backplane on the same JBOD a month ago - it happened on all drives involved, anywhere from 1-10 times per drive.

    The common error signature is:
    mpt2sas_cm1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)

    Changing the subject accordingly!
     
    #6
  7. sfbayzfs

    sfbayzfs Active Member

    Update:

    Another Hitachi 4TB drive in that first VDEV racked up some pending sectors, so I replaced it. I connected a freshly tested drive to a free bay in the JBOD and did a:
    zpool replace MYPOOL OLDDRIVE NEWDRIVE
    The resilver process finished on the first try, and there were no dmesg errors from the mpt2sas driver. Once the resilver got going, it was predicted to take just over 48 hours, but it completed in only 18, probably because nearly all of the data could be copied directly from the problematic drive and didn't have to be reconstructed - both drives had solid indicator lights during the process.

    Disturbingly, checksum errors racked up against every other drive in the VDEV during the resilver process (the other 3 VDEVs were clean, of course, since they were not involved). The checksum errors appeared fairly early in the resilver and the counts then stayed constant, so I think they were only in the first batch of data written to the drives. I was pretty sure I had done a scrub or two after the cable/card port switch that fixed things initially, but maybe not, since I was in a hurry to get copying again, being way behind schedule. The highest checksum error count against any drive after the resilver was only 14, but it should have been completely clean, hmmm...

    Overall, both problem drives were almost new drives which I bought used, and had done at least a full wipe successfully a few months before they were deployed and went effectively bad. The first one had not had a long SMART test run on it, and had just over 50 power on hours when I deployed it. The second one that developed pending sectors had definitely gone through a full badblocks 4-pass and a long SMART read test a while ago.

    The first one is up to 277 reallocated sectors and 8 pending, and the latest badblocks run against it yielded 1/0/1926 errors!

    The second one is in the middle of the second write-pattern test of its first badblocks run (no errors so far), the 8 SMART pending sectors have all cleared, and there are no reallocations in SMART so far either.
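    For anyone following along, the two SMART attributes that mattered here can be pulled out of `smartctl -A` output mechanically. A sketch using a sample excerpt modeled on the counts above (277 reallocated, 8 pending) - attribute table formatting varies a little between smartctl versions:

```shell
# Pull the reallocated and pending sector counts out of smartctl -A
# style output. The excerpt below is a sample modeled on the counts
# quoted above; pipe real `smartctl -A /dev/disk/by-id/...` output in
# for live drives.
smart_sample='  5 Reallocated_Sector_Ct   0x0033   095   095   005    Pre-fail  Always       -       277
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'
realloc=$(printf '%s\n' "$smart_sample" | awk '/Reallocated_Sector_Ct/ {print $NF}')
pending=$(printf '%s\n' "$smart_sample" | awk '/Current_Pending_Sector/ {print $NF}')
echo "reallocated=$realloc pending=$pending"
```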
     
    #7
  8. matt52

    matt52 New Member

    Joined:
    Sep 24, 2017
    Messages:
    8
    Likes Received:
    0
    My experience with the 847 family is that the 2 backplanes do not play nice cascaded and should each have their own 4x uplink to a controller. I think there is also a major heat problem with the expander chip. When the server was cold it would run fine; once the temps got uncomfortable, the rear one would get all wonky. Run the case in a freezer (50F) and it was a hell of a lot more reliable. I took to drilling holes in the top and sides to let datacenter air mix into the middle and bring the temps down. Worked well enough.
     
    #8
