Long story short: Is updating SAS backplane firmware likely to solve intermittent SAS failures that the HBA firmware shows as EVENT_SAS_DEVICE_STATUS_CHANGE, and how safe is updating backplane firmware?
I have a couple of machines in Supermicro SC846BE1C-R1K23B chassis, with the single-expander backplane.
They all have MegaRAID 9361-8i RAID controllers running in JBOD mode, directly presenting the drives to the OS. They've been working in Ceph for years, but there have been intermittent transient disk resets underneath it that I've been ignoring. These show up in dmesg as sd 0:0:41:0: Power-on or device reset occurred, and show up in the RAID controller's firmware logs as:
Code:
04/02/24 16:58:02: C0:iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
04/02/24 16:58:02: C0: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=5000c500adcc8fb5 TaskTag=x0167 ASC=x00 ASCQ=x00 IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0f - TASK_ABORT_INTERNAL complete
Ceph sees the disk disappear, then re-appear, and it's fast enough that it doesn't boot the OSD from the pool. These happen rarely during regular read operations, at moderate frequency during heavy writing, and constantly when Ceph is scrubbing. As the data on the array has grown and Ceph is scrubbing more, this has turned into a real problem.
I've updated the firmware on the MegaRAID 9361-8i to the latest version, but that did not help. My next thought is backplane firmware, but I also tried replacing one of the 9361s with an HBA 9300, and that fixed the problem.
When I look at the backplane firmware with Supermicro's CLITXL, I see:
Code:
UNIT SPECIFIC INFORMATION:
SAS ADDRESS - 5003048020FA81FF
ENCLOSURE ID - 5003048020FA81FF
ENCLOSURE INFORMATION:
PLATFORM NAME - SMC846ELSAS3P
SERIAL NUMBER -
VENDOR ID - SMC
PRODUCT ID - SC846-P
VERSION INFORMATION:
FLASH REGION 0 - 66.16.11.00
FLASH REGION 1 - 66.16.11.00
FLASH REGION 2 - 66.16.11.00
FLASH REGION 3 - 16.11
DEVICE INFORMATION:
DEVICE NAME - /dev/sg0
BMC IP - NULL
Thoughts? Do I just swap out hardware here and replace working 9361s with 9300s, or is it worth trying to update backplane firmware? I reached out to Supermicro because the backplane firmware is not publicly available, but I would really like to avoid a situation like https://forums.servethehome.com/ind...sas3-backplane-firmware-update-problem.27149/ or Recovering the Firmware on a Supermicro BPN-SAS3-846EL1 Backplane.
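For anyone wanting to check whether the resets cluster on particular drives before choosing between swapping controllers and flashing the backplane, here is a minimal Python sketch that tallies the DM_HandleDevStatusChgEvent lines per SAS address and reason code. It is not part of the troubleshooting described in this post. The regex is written against the single sample log line quoted here, and the names tally_resets, EVENT_RE and SAMPLE are made up for illustration, so other firmware versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern written against the one sample line quoted in this post.
# Assumption: other firmware versions may format the line differently.
EVENT_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}): C\d+: "
    r"DM_HandleDevStatusChgEvent: .*?SASAdd=(?P<sas>[0-9a-fA-F]+)"
    r".*?ReasonCode x(?P<reason>[0-9a-fA-F]+)"
)


def tally_resets(lines):
    """Count device status-change events per (SAS address, reason code)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line.strip())
        if match:
            key = (match.group("sas").lower(), match.group("reason").lower())
            counts[key] += 1
    return counts


# The two firmware log lines from this post, used as sample input.
SAMPLE = [
    "04/02/24 16:58:02: C0:iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE",
    "04/02/24 16:58:02: C0: DM_HandleDevStatusChgEvent: devHandle=x000a "
    "SASAdd=5000c500adcc8fb5 TaskTag=x0167 ASC=x00 ASCQ=x00 "
    "IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0f - "
    "TASK_ABORT_INTERNAL complete",
]

if __name__ == "__main__":
    # Print the busiest (address, reason) pairs first.
    for (sas, reason), count in tally_resets(SAMPLE).most_common():
        print(f"{sas} reason=x{reason} count={count}")
```

Feeding it the full firmware log instead of SAMPLE would show whether a handful of drives or slots account for most of the events, or whether they are spread evenly across the expander.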