disk problem crashed system


joisey04

My configuration:

Server: Supermicro X9SCL+-F, Xeon E3-1240 V2, 16GB RAM

USB stick: ESXi
2 x 160GB 2.5" disks: a 20GB VM datastore on each; OmniOS is installed on one and mirrored to the other
2 x 320GB 2.5" disks: on a separate disk controller, used by OmniOS as a ZFS mirror with one file system that stores all the VM images, mounted in ESXi via NFS
2 x 3TB 3.5" disks: datastore, used by various applications

Several server VMs are installed on ESXi, all of them with their datastore on the NFS-mounted volume.

This morning none of the ESXi machines were running except the OmniOS VM; all the others showed "not available". I was able to get a console into OmniOS, but got no answer from the ZFS file systems. (I tried "ls /data", where /data sits on the 3TB ZFS pool, and never got a response.)
I rebooted OmniOS and then found out that one of the 320GB disks, which hold the VM datastore, was unavailable. Prior to the reboot I had no access to the web interface.

My question now is:
Shouldn't this setup prevent exactly this kind of failure?
And is there a log file somewhere where I can find out what went wrong?
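For reference, on an OmniOS box the usual places to look would be the system log and the fault manager. Something along these lines should show what the kernel and FMA recorded around the time of the hang (these are the stock illumos paths and commands, nothing specific to this setup):

# kernel and driver messages (this is typically where mpt_sas warnings end up)
tail -n 200 /var/adm/messages

# fault management: error telemetry and any diagnosed faults
fmdump -eV | less
fmadm faulty

# pool health; -x only prints pools that are not healthy
zpool status -xv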
 

joisey04

This is what the log says:

Jul 23 00:11:20 storage scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
Jul 23 00:11:24 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:11:24 storage mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4a IOCLogInfo=0x0 target=13
Jul 23 00:11:24 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:11:24 storage mptsas_ioc_task_management failed try to reset ioc to recovery!
Jul 23 00:11:25 storage scsi: [ID 365881 kern.info] /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:11:25 storage mpt0 Firmware version v14.0.0.0 (?)
Jul 23 00:11:25 storage scsi: [ID 365881 kern.info] /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:11:25 storage mpt0: IOC Operational.
Jul 23 00:13:02 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:13:02 storage config header request timeout
Jul 23 00:13:02 storage scsi: [ID 365881 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:13:02 storage NULL command for address reply in slot 2
Jul 23 00:13:22 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,1976@10 (mpt0):
Jul 23 00:13:22 storage Disconnected command timeout for Target 1
Jul 23 00:14:02 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:14:02 storage config header request timeout
Jul 23 00:14:32 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,1976@10 (mpt0):
Jul 23 00:14:32 storage Disconnected command timeout for Target 1
Jul 23 00:14:32 storage vmxnet3s: [ID 654879 kern.notice] vmxnet3s:0: getcapab(0x200000) -> no
Jul 23 00:14:32 storage last message repeated 3 times
Jul 23 00:15:02 storage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:15:02 storage config header request timeout
Jul 23 00:15:02 storage scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Jul 23 00:15:02 storage sd3: path mpt_sas1/disk@w5000c50013af4644,0, reset 1 failed
Jul 23 00:15:02 storage scsi: [ID 365881 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:15:02 storage NULL command for address reply in slot 6
Jul 23 00:15:02 storage scsi: [ID 365881 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci1000,3040@0 (mpt_sas0):
Jul 23 00:15:02 storage NULL command for address reply in slot 8
Jul 23 00:15:02 storage scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
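As a side note, the failed reset names the disk by its SAS WWN (w5000c50013af4644). Under scsi_vhci that normally shows up as a device called c?t5000C50013AF4644d0, so something like this should tell which physical disk it actually is (device names here are only examples):

# list all disks with their cXtWWNd0 names and sizes, without entering the format menu
echo | format

# confirm which pool and vdev that device belongs to
zpool status -v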
 

gea

A faulted disk can always result in a service interruption, on any system.

I would look at the ESXi logfile. If there is a disconnect on timeout, it may be that ZFS waited on a write longer than ESXi was willing to wait, resulting in an ESXi disconnect followed by a disk failure and pool degradation afterwards. This can happen when a disk is not completely dead or disconnected but responds with errors or blocks communication at the controller or driver level, or on any other hardware or driver problem.

The only way to minimize such failures is to use enterprise disks with a reduced failure rate.
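For what it's worth, the ESXi side of that timeout is governed by the NFS heartbeat settings; with the defaults a datastore is typically declared unavailable after roughly two minutes of missed heartbeats. They can be inspected from the ESXi shell like this (the option names are the standard ESXi advanced settings):

# current NFS heartbeat settings (frequency, timeout, allowed failures)
esxcli system settings advanced list -o /NFS/HeartbeatFrequency
esxcli system settings advanced list -o /NFS/HeartbeatTimeout
esxcli system settings advanced list -o /NFS/HeartbeatMaxFailures

# NFS datastores and whether ESXi currently considers them accessible
esxcli storage nfs list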
 

joisey04

I have absolutely no problem with faulty disks; on the contrary, I expect them.
That's why I'm setting up a system like this.
What I don't expect is that the system stops working after the alleged corruption of one disk in a mirror.

What I would like to find out is why all my VMs stopped working after one disk reported sick.

Where do I look in ESXi for logs?
Any help appreciated.

The disk, by the way, is not faulted. I started the system up again this evening and currently everything is working. And that scares me even more, since this is a test machine. If everything goes well, we want to put it into production.
 

gea

The ESXi log can be viewed in vSphere.
About the disk: I would pull it and use the manufacturer's tool to do a low-level check.
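If SSH is enabled on the ESXi host, the same logs can also be read directly from the shell; for an NFS datastore dropping out, the interesting ones would be something like:

# storage and NFS/APD events end up in the vmkernel log
grep -i nfs /var/log/vmkernel.log | tail -n 50

# higher-level observations (datastore lost/restored messages)
tail -n 100 /var/log/vobd.log

# hostd log, for what the management agents saw
tail -n 100 /var/log/hostd.log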

The problem is mostly on writes. ZFS waits until the disk commits a write or the driver reports an error. On bad blocks the disk tries to repair the bad block itself, and this can take too long for ESXi. The intention is to avoid unnecessary disk errors that would result in a pool degradation.
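A disk that is quietly retrying bad sectors like that usually shows up in the error counters and service times long before ZFS marks it faulted, so on the OmniOS side something like this can help spot it early (plain illumos iostat, nothing napp-it specific):

# per-device soft/hard/transport error counters plus vendor and serial number
iostat -En

# live view: a disk stuck in error recovery shows very high asvc_t / %b
iostat -xn 5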

Read also
https://www.illumos.org/issues/1553

In the last two years I have had similar problems twice (datastore timeout -> NFS datastore offline due to disk errors short of a complete disk failure).
 

TechTrend

What I don't expect is that the system stops working after the alleged corruption of one disk in a mirror.
Did you find a way to avoid this? I had a similar incident this week on a ZFS storage server. After adding two disks to expand capacity, the server stopped serving iSCSI LUNs. The "mptsas_check_task_mgt: Task 0x3 failed" message came up shortly after adding the drives. The disks were not part of any ZFS pool yet. The motherboard is a Supermicro X9DRD-7LN4F-JBOD with the BPN-SAS2-846EL1 expander. The onboard LSI 2308 is running the P19 IT firmware provided by Supermicro. The operating system is OmniOS r151014 with napp-it 16.02f, running as a VM under ESXi 6.0u2.
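In case it helps with the debugging: since the new disks were not in any pool yet, this looks more like an HBA/expander level problem than a ZFS one, so I would start with what the controller itself reports. Assuming LSI's sas2ircu utility is available (a separate download from LSI/Avago, not part of OmniOS), something like this shows the firmware level, the expander and every attached drive:

# controllers visible to sas2ircu (assumes the LSI/Avago sas2ircu tool is installed)
sas2ircu LIST

# firmware/BIOS versions, enclosure (expander) info and all attached devices on controller 0
sas2ircu 0 DISPLAY

# FMA telemetry from around the time the task management errors appeared
fmdump -eV | less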