Safe to flash Supermicro SAS3 Backplane Firmware?


windycat

New Member
Apr 2, 2024
Long story short: Is updating SAS backplane firmware likely to solve intermittent SAS failures that the HBA firmware shows as EVENT_SAS_DEVICE_STATUS_CHANGE, and how safe is updating backplane firmware?

I have a couple of machines in Supermicro 846BE1C-R1K23B chassis with the single-expander backplane (product page: SC846BE1C-R1K23B | 4U | Chassis | Products | Supermicro).

They all have MegaRAID 9361-8i RAID controllers running in JBOD mode, presenting the drives directly to the OS. They've been working in Ceph for years, but there have been intermittent transient disk resets underneath that I've been ignoring. These show up in dmesg as sd 0:0:41:0: Power-on or device reset occurred, and in the RAID controller's firmware logs as:

Code:
04/02/24 16:58:02: C0:iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
04/02/24 16:58:02: C0: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=5000c500adcc8fb5 TaskTag=x0167 ASC=x00 ASCQ=x00 IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0f - TASK_ABORT_INTERNAL complete
Ceph sees the disk disappear and then reappear, quickly enough that Ceph doesn't boot the OSD from the pool. The resets happen rarely during regular read operations, at moderate frequency during heavy writing, and constantly when Ceph is scrubbing. As the data on the array has grown and Ceph is scrubbing more, this has turned into a real problem.
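To tell whether the resets cluster on a few drives or are spread across the whole backplane, the firmware log lines can be tallied per SAS address. This is a minimal sketch that assumes the log has been saved as text and that every event line follows the DM_HandleDevStatusChgEvent format shown above; the function name `count_status_changes` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: event lines look like the DM_HandleDevStatusChgEvent line in the
# excerpt above. Other firmware versions may format them differently.
EVENT_RE = re.compile(
    r"DM_HandleDevStatusChgEvent: devHandle=(x[0-9a-fA-F]+) SASAdd=([0-9a-fA-F]{16})"
)

def count_status_changes(log_text):
    """Return a Counter mapping SAS address -> number of status-change events."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            counts[match.group(2).lower()] += 1
    return counts

event_sample = (
    "04/02/24 16:58:02: C0:iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE\n"
    "04/02/24 16:58:02: C0: DM_HandleDevStatusChgEvent: devHandle=x000a "
    "SASAdd=5000c500adcc8fb5 TaskTag=x0167 ASC=x00 ASCQ=x00 IOCLogInfo x00000000 "
    "IOCStatus x0000 ReasonCode x0f - TASK_ABORT_INTERNAL complete\n"
)

# Print addresses from most to least affected.
for address, count in count_status_changes(event_sample).most_common():
    print(address, count)
```

If one or two addresses dominate the tally, the drive or its slot is the more likely suspect; an even spread points more toward something shared, such as the controller or the backplane.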

I've updated the firmware on the MegaRAID 9361-8i to the latest version, but that did not help. My next thought is backplane firmware, but I also tried replacing one of the 9361s with an HBA 9300, and that fixed the problem.

When I look at the backplane firmware with Supermicro's CLIXTL, I see:

Code:
UNIT SPECIFIC INFORMATION:
    SAS ADDRESS    - 5003048020FA81FF
    ENCLOSURE ID   - 5003048020FA81FF

ENCLOSURE INFORMATION:
    PLATFORM NAME  - SMC846ELSAS3P       
    SERIAL NUMBER  -                         
    VENDOR ID      - SMC     
    PRODUCT ID     - SC846-P         

VERSION INFORMATION:
    FLASH REGION 0 - 66.16.11.00
    FLASH REGION 1 - 66.16.11.00
    FLASH REGION 2 - 66.16.11.00
    FLASH REGION 3 - 16.11

DEVICE INFORMATION:
    DEVICE NAME    - /dev/sg0
    BMC IP         - NULL
Thoughts? Do I just swap out hardware here and replace working 9361s with 9300s, or is it worth trying to update the backplane firmware? I reached out to Supermicro because the backplane firmware is not publicly available, but I would really like to avoid a situation like https://forums.servethehome.com/ind...sas3-backplane-firmware-update-problem.27149/ or Recovering the Firmware on a Supermicro BPN-SAS3-846EL1 Backplane.
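With several chassis in play, it can help to record each backplane's flash-region versions before deciding anything about flashing. The sketch below only parses the VERSION INFORMATION block from saved tool output like the listing above; it does not call the tool itself, and the helper names `flash_regions` and `differing_regions` are hypothetical.

```python
import re

# Assumption: version lines look like "FLASH REGION 0 - 66.16.11.00",
# as in the listing above.
REGION_RE = re.compile(r"FLASH REGION (\d+)\s*-\s*(\S+)")

def flash_regions(tool_output):
    """Return {region number: version string} parsed from saved tool output."""
    return {int(m.group(1)): m.group(2) for m in REGION_RE.finditer(tool_output)}

def differing_regions(a, b):
    """Return the region numbers whose versions differ between two backplanes."""
    return sorted(r for r in set(a) | set(b) if a.get(r) != b.get(r))

version_sample = """\
VERSION INFORMATION:
    FLASH REGION 0 - 66.16.11.00
    FLASH REGION 1 - 66.16.11.00
    FLASH REGION 2 - 66.16.11.00
    FLASH REGION 3 - 16.11
"""

this_backplane = flash_regions(version_sample)
print(this_backplane)
# Comparing a backplane against itself reports no differences.
print(differing_regions(this_backplane, this_backplane))
```

If a chassis that misbehaves and one that does not report the same versions, that is a hint (not proof) that the backplane firmware is not the variable that matters.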
 

mrpasc

Well-Known Member
Jan 8, 2022
Munich, Germany
I would recommend swapping the 9361s for real 9300 HBAs. They have become very cheap on the used market, and you might be able to sell your existing 9361s for the same price, so it would be a 1:1 swap.
Even if a 9361 appears to work with Ceph (or ZFS) when set to JBOD mode, you will probably suffer from the reduced queue depth and other shenanigans with that kind of adapter.
 

azev

Well-Known Member
Jan 18, 2013
I've also bricked a Supermicro backplane by flashing it in the past. If you insist on trying to flash, you should open a ticket with Supermicro and ask for guidance.
 

windycat

New Member
Apr 2, 2024
Thank you for the guidance on this; I will not be updating the backplanes. I found the real culprit after looking at RAID controller firmware logs right after boot, via storcli /c0 show termlog. This is almost certainly the problem and I intend to strap some fans on them. They are in machines with what I thought was sufficient front-back airflow, but no ducts specifically for the storage controller:

Code:
04/11/24 18:36:43: C0:Max Temp is 110 Deg C on Channel 4
04/11/24 18:36:43: C0:Measured chip temperature at Channel 0 is 105
04/11/24 18:36:43: C0:Measured chip temperature at Channel 1 is 107
04/11/24 18:36:43: C0:Measured chip temperature at Channel 2 is 106
04/11/24 18:36:43: C0:Measured chip temperature at Channel 3 is 106
04/11/24 18:36:43: C0:Measured chip temperature at Channel 4 is 110
04/11/24 18:36:43: C0:LdDcmdRaidMapCompleteExt: Completing FW_RAID_MAP cmd
04/11/24 18:36:43: C0:EVT#1005227-04/11/24 18:36:43: 506=Controller temperature threshold exceeded. This may indicate inadequate system cooling. Switching to low performance mode
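For checking several controllers the same way, the per-channel temperatures can be pulled out of saved termlog text. This is a rough sketch that assumes the "Measured chip temperature at Channel N is T" wording shown above; the helper names and the 100 °C limit used in the example are illustrative assumptions, not a vendor specification.

```python
import re

# Assumption: temperature lines match the termlog wording shown above.
TEMP_RE = re.compile(r"Measured chip temperature at Channel (\d+) is (\d+)")

def channel_temps(termlog_text):
    """Return {channel: temperature in deg C}; later readings overwrite earlier ones."""
    temps = {}
    for line in termlog_text.splitlines():
        match = TEMP_RE.search(line)
        if match:
            temps[int(match.group(1))] = int(match.group(2))
    return temps

def over_limit(temps, limit_c):
    """Return only the channels at or above limit_c (a caller-chosen threshold)."""
    return {ch: t for ch, t in temps.items() if t >= limit_c}

termlog_sample = """\
04/11/24 18:36:43: C0:Measured chip temperature at Channel 0 is 105
04/11/24 18:36:43: C0:Measured chip temperature at Channel 1 is 107
04/11/24 18:36:43: C0:Measured chip temperature at Channel 2 is 106
04/11/24 18:36:43: C0:Measured chip temperature at Channel 3 is 106
04/11/24 18:36:43: C0:Measured chip temperature at Channel 4 is 110
"""

temps = channel_temps(termlog_sample)
print("hottest channel reading:", max(temps.values()))
print("channels at or above 100 C:", sorted(over_limit(temps, 100)))
```

Running the same check before and after adding fans gives a simple before/after comparison of whether the extra airflow actually helps.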
 

nabsltd

Well-Known Member
Jan 26, 2022
windycat said:
I found the real culprit after looking at RAID controller firmware logs right after boot, via storcli /c0 show termlog. This is almost certainly the problem and I intend to strap some fans on them. They are in machines with what I thought was sufficient front-back airflow, but no ducts specifically for the storage controller:

I have a very similar layout (2U box, 3x 80mm fan wall, no ducts), and my LSI controller sits at around 60°C and reaches 75°C under load. You might want to see whether a known-good new card runs at lower temperatures. If it does, check the thermal compound under the old card's heatsink.

Also, in another box, I have both a 9361-8i (RAID mode) and a 9300-8e (IT mode), with a fan blowing across both the heatsinks. The temperatures are within a few degrees of each other when idle or loaded, so I don't think a true HBA would be any better.