We have a few systems running OmniOS r151012 with 30 drives in 5-drive raidz1 vdevs behind an LSI HBA 9207-8i p18. These systems are about 8 months old, and about a month ago we experienced our first true drive failure. That failure resulted in a full zpool deadlock, and I unfortunately made the mistake of rebooting the server, which then caused the system to fail at startup until the drive was removed. Oddly enough, even the HBA BIOS screen would lock up completely past the main window.
The second failure happened over the weekend on a different system (exact same configuration). This time the drive was marked bad, the pool resilvered, and everything was humming along perfectly. Prior to pulling the drive for replacement I ran sas2ircu to verify which slot the drive was in, and that's when things went bad. sas2ircu just wouldn't return anything, and even a kill -9 wasn't closing the application. Eventually I couldn't even ssh to the server. I jumped on the remote console and ran prstat, and that ended up locking up that shell as well. Our remote applications were unable to read or write files, so I believe we encountered another deadlock. I was forced to reboot and again got stuck booting because of this bad disk.
Does anyone have any ideas that may help prevent this? What's odd is that even the HBA goes nuts with these bad disks, so I question whether this is truly an OmniOS issue or some sort of firmware bug on the drives/HBA.
Does anyone have experience with OmniOS commercial support or consulting support? With a server full of Seagate drives, I am expecting this to happen rather frequently...
Below are the errors that occurred during the first failure.
Disconnected command timeout for target 57 w5000c5007XXXXXXX.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Log info 0x31140000 received for target 57 w5000c5007XXXXXXX.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
/scsi_vhci/disk@g5000c5007XXXXXXX (sd8): Command Timeout on path mpt_sas2/disk@w5000c5007XXXXXXX,0
scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
scsi: [ID 107833 kern.notice] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Timeout of 60 seconds expired with 1 commands on target 57 lun 0.
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Disconnected command timeout for target 57 w5000c5007XXXXXXX.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Log info 0x31140000 received for target 57 w5000c5007XXXXXXX.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
/scsi_vhci/disk@g5000c5007XXXXXXX (sd8): Command Timeout on path mpt_sas2/disk@w5000c5007XXXXXXX,0
scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
scsi: [ID 107833 kern.notice] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Timeout of 60 seconds expired with 1 commands on target 57 lun 0.
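To show at a glance which disk is behind messages like these, here is a minimal Python sketch that counts "Disconnected command timeout" entries per target. The function name and the regular expression are my own, written only against the log format pasted above, so treat them as assumptions rather than anything the mpt_sas driver guarantees.

```python
import re
from collections import Counter

# Matches the mpt_sas timeout message as it appears in the log above.
# The WWN is masked with X in this post, so X is allowed in the pattern.
TIMEOUT_RE = re.compile(
    r"Disconnected command timeout for target (\d+) (w[0-9a-fA-FX]+)"
)

def count_timeouts(log_text):
    """Return a Counter keyed by (target number, WWN) for timeout messages."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        target, wwn = match.groups()
        counts[(int(target), wwn)] += 1
    return counts

# Excerpt copied from the errors above.
sample = """\
Disconnected command timeout for target 57 w5000c5007XXXXXXX.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Log info 0x31140000 received for target 57 w5000c5007XXXXXXX.
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e08@3/pci15d9,691@0 (mpt_sas0):
Disconnected command timeout for target 57 w5000c5007XXXXXXX.
"""

if __name__ == "__main__":
    for (target, wwn), n in count_timeouts(sample).most_common():
        print(f"target {target} ({wwn}): {n} timeouts")
```

On a live system the same function could be fed the contents of the system log (I am assuming the usual /var/adm/messages location) instead of the pasted sample.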