OmniOS deadlock with drive failures


noons

New Member
Jun 16, 2015
We have a few systems running OmniOS r151012 with 30 drives arranged as 5-drive raidz1 vdevs behind an LSI 9207-8i HBA on p18 firmware. These systems are about 8 months old, and about a month ago we experienced our first true drive failure. That failure resulted in a full zpool deadlock, and I unfortunately made the mistake of rebooting the server, which then caused the system to fail to boot until the drive was removed. Oddly enough, even the HBA BIOS screen would fully lock up past the main window.

The second failure happened over the weekend on a different system (exact same configuration). This time the drive was marked bad, the pool resilvered, and everything was humming along perfectly. Prior to pulling the drive for replacement I ran sas2ircu to verify which slot the drive was in; that's when things went bad. sas2ircu just wouldn't return anything, and even a kill -9 wouldn't close the process. Eventually I couldn't even ssh to the server. I jumped on the remote console and ran prstat, and that ended up locking up that shell as well. Our remote applications were unable to read/write files, so I believe we encountered another deadlock. I was forced to reboot and again got stuck booting due to this bad disk.

Anyone have any ideas that may help prevent this? What's odd is that even the HBA goes nuts with these bad disks, so I question whether this is truly an OmniOS issue or some sort of firmware bug on the drives/HBA.

Does anyone have experience with OmniOS commercial support or consulting? With a server full of Seagate drives I am expecting this to happen rather frequently...

Below are the errors that occurred during the first failure.

Disconnected command timeout for target 57 w5000c5007XXXXXXX.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Log info 0x31140000 received for target 57 w5000c5007XXXXXXX.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
/scsi_vhci/disk@g5000c5007XXXXXXX (sd8): Command Timeout on path mpt_sas2/disk@w5000c5007XXXXXXX,0
scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
scsi: [ID 107833 kern.notice] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Timeout of 60 seconds expired with 1 commands on target 57 lun 0.
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Disconnected command timeout for target 57 w5000c5007XXXXXXX.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Log info 0x31140000 received for target 57 w5000c5007XXXXXXX.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
/scsi_vhci/disk@g5000c5007XXXXXXX (sd8): Command Timeout on path mpt_sas2/disk@w5000c5007XXXXXXX,0
scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
scsi: [ID 107833 kern.notice] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Timeout of 60 seconds expired with 1 commands on target 57 lun 0.
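For reference, the slot lookup was along these lines (rough sketch only; the controller index 0 and the WWN fragment are placeholders, adjust for your own setup):

Code:
# list the controllers sas2ircu can see
sas2ircu LIST
# dump controller 0, which includes enclosure/slot and SAS address (WWN) for each drive
sas2ircu 0 DISPLAY | grep -i -A 10 "5000c5007"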
 

gea

Well-Known Member
Dec 31, 2010
Depending on the problem, the HBA firmware and the mpt_sas driver, a failed disk can block the whole controller.
You may update OmniOS or try another firmware (there is a report at hardforum that downgrading from P19 to p16/p17 firmware reduces the problem).

If you use SATA disks behind an expander, that can be another cause of a deadlock on disk problems;
follow the discussion around [OmniOS-discuss] iSCSI target hang, no way to restart but server reboot,
with the comment from OmniTI.
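As a possible mitigation (a sketch only, not a tested recommendation): sd_io_time is the illumos sd driver command timeout, default 60 seconds, which matches the "Timeout of 60 seconds expired" messages above. Lowering it can get a dying disk faulted sooner instead of stalling I/O for long periods, e.g. in /etc/system:

Code:
# /etc/system - reduce the sd command timeout from the default 60s (example value, reboot required)
set sd:sd_io_time = 10

Afterwards you can check what has been faulted with fmadm faulty and zpool status -x.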
 

noons

New Member
Jun 16, 2015
Correction: I am running p18, but I can give p17 a shot. Also, these are all SAS drives on a dual-expander SAS backplane.

Thanks!
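For the record, I'll confirm the current firmware/BIOS level on each HBA before flashing down, roughly like this with LSI's sas2flash utility (sketch, assuming adapter index 0):

Code:
# list all LSI SAS2 adapters with their firmware and BIOS versions
sas2flash -listall
# show detail for adapter 0
sas2flash -c 0 -list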