OmniOS + NappIT VM Appliance: Major Fault/Very high CPU


abstractalgebra

Active Member
Dec 3, 2013
MA, USA
OmniOS + NappIT VM: Major Fault (kernel.panic)/High CPU/Hard + Soft Smart errors

I'm concerned since I have very high CPU usage (50-75%) without much SMB or NFS access. SMB seems fast, but NFS really drags. For example, I got over 200 MB/s over SMB, yet only about 20 MB/s copying the same large file from SMB to a Win7 VM (on that same zpool, via an ESXi NFS datastore). Everything is on 6x 3TB WD Red NAS drives in RAID-Z2, plus a Kingston V300 120GB L2ARC. I'm moving some VMs to a separate 256GB SSD in ESXi.

I also see the two faults with SEVERITY=Major in the logs below. The linked knowledge article does not help, so I am uncertain how to troubleshoot. Any suggestions?
Running the napp-it 14a appliance on ESXi 5.5 with the default 2 vCPUs and 8GB RAM.
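To narrow down what is actually burning CPU and whether the pool is saturated during the slow NFS copy, something like this should help (the pool name Red-RaidZ2 is as reported by iostat further below; everything else is stock OmniOS tooling):

# per-thread microstate accounting: shows which processes/threads are on CPU
prstat -mLc 5 5

# per-vdev pool activity while the slow NFS copy is running
zpool iostat -v Red-RaidZ2 5

# server-side NFS operation counters
nfsstat -s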

fmadm faulty

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 28 18:37:22 2380911a-891b-461b-dedc-c2b798451062 SUNOS-8000-KL Major

Host : napp-it-14a
Platform : VMware-Virtual-Platform Chassis_id : VMware-56-4d-e1-e1-51-f8-cf-cc-2f-df-7c-42-1d-4c-79-fb
Product_sn :

Fault class : defect.sunos.kernel.panic
Affects : sw:///:path=/var/crash/unknown/.2380911a-891b-461b-dedc-c2b798451062
faulted but still in service
Problem in : sw:///:path=/var/crash/unknown/.2380911a-891b-461b-dedc-c2b798451062
faulted but still in service

Description : The system has rebooted after a kernel panic. Refer to
SUNOS-8000-KL for more information.

Response : The failed system image was dumped to the dump device. If
savecore is enabled (see dumpadm(1M)) a copy of the dump will be
written to the savecore directory /var/crash/unknown.

Impact : There may be some performance impact while the panic is copied to
the savecore directory. Disk space usage by panics can be
substantial.

Action : If savecore is not enabled then please take steps to preserve the
crash image.
Use 'fmdump -Vp -u 2380911a-891b-461b-dedc-c2b798451062' to view
more panic detail. Please refer to the knowledge article for
additional information.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 21 20:14:42 1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f SUNOS-8000-KL Major

Host : napp-it-14a
Platform : VMware-Virtual-Platform Chassis_id : VMware-56-4d-e1-e1-51-f8-cf-cc-2f-df-7c-42-1d-4c-79-fb
Product_sn :

Fault class : defect.sunos.kernel.panic
Affects : sw:///:path=/var/crash/unknown/.1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f
faulted but still in service
Problem in : sw:///:path=/var/crash/unknown/.1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f
faulted but still in service

Description : The system has rebooted after a kernel panic. Refer to
SUNOS-8000-KL for more information.

Response : The failed system image was dumped to the dump device. If
savecore is enabled (see dumpadm(1M)) a copy of the dump will be
written to the savecore directory /var/crash/unknown.

Impact : There may be some performance impact while the panic is copied to
the savecore directory. Disk space usage by panics can be
substantial.

Action : If savecore is not enabled then please take steps to preserve the
crash image.
Use 'fmdump -Vp -u 1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f' to view
more panic detail. Please refer to the knowledge article for
additional information.
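To actually look at what panicked, the steps below should work (the dump file numbers like vmdump.0 are placeholders; use whatever files actually sit in /var/crash/unknown):

# confirm the dump device and savecore directory
dumpadm

# show the FMA panic detail referenced above
fmdump -Vp -u 2380911a-891b-461b-dedc-c2b798451062

# if only a compressed vmdump.N exists, expand it, then open it in mdb
cd /var/crash/unknown
savecore -vf vmdump.0
mdb unix.0 vmcore.0
> ::status      # panic string
> ::stack       # panic stack trace
> ::msgbuf      # kernel messages leading up to the panic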



Stat: fmstat






module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.0 0 0 0 0 0 0
disk-lights 0 0 0.0 0.1 0 0 0 0 28b 0
disk-transport 0 0 0.0 18.7 0 0 0 0 32b 0
eft 0 0 0.0 0.0 0 0 0 0 1.3M 0
ext-event-transport 3 0 0.0 6.4 0 0 0 0 46b 0
fabric-xlate 0 0 0.0 0.0 0 0 0 0 0 0
fmd-self-diagnosis 14 0 0.0 0.6 0 0 0 0 0 0
io-retire 0 0 0.0 0.0 0 0 0 0 0 0
sensor-transport 0 0 0.0 0.5 0 0 0 0 32b 0
ses-log-transport 0 0 0.0 0.2 0 0 0 0 40b 0
software-diagnosis 0 0 0.0 0.0 0 0 0 0 316b 0
software-response 0 0 0.0 0.0 0 0 0 0 2.3K 2.0K
sysevent-transport 0 0 0.0 389.2 0 0 0 0 0 0
syslog-msgs 0 0 0.0 0.0 0 0 0 0 0 0
zfs-diagnosis 15 0 0.0 0.7 0 0 0 0 0 0
zfs-retire 15 0 0.0 1.2 0 0 0 0 168b 0



Important: fmdump -I






TIME CLASS
Jan 11 20:12:19.2634 ireport.os.sunos.panic.savecore_failure
Jan 11 20:12:45.7120 resource.sysevent.EC_datalink.ESC_datalink_phys_add
Jan 11 20:19:17.9062 ireport.os.sunos.panic.savecore_failure
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_static_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_static_end
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_send_targets_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_send_targets_end
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_slp_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_slp_end
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_isns_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_isns_end
....(cut)
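The ireport.os.sunos.panic.savecore_failure entries suggest the crash dump never actually made it to /var/crash, so it is worth checking that savecore is enabled and the dump zvol is big enough (rpool/dump is only the default name; adjust if yours differs):

dumpadm                          # shows dump device, savecore directory, and whether savecore runs on boot
zfs list -o name,volsize rpool/dump
dumpadm -y                       # re-enable automatic savecore on reboot if it was disabled
zfs set volsize=16G rpool/dump   # example size only; grow the dump zvol if it is too small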
Stats
Disk statistics via iostat -xn 1 2 (only the second sample shown)

device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 fd0
0.0 67.0 0.0 461.6 0.0 0.0 0.6 0.1 0 0 rpool
0.0 69.0 0.0 461.6 0.0 0.0 0.0 0.1 0 0 c2t0d0
7358.9 0.0 380107.0 0.0 71.8 10.7 9.8 1.5 100 100 Red-RaidZ2
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0
1264.2 0.0 63299.8 0.0 0.0 1.1 0.0 0.9 0 54 c4t50014EE0AE1EB0EEd0
1478.2 0.0 63395.8 0.0 0.0 1.3 0.0 0.9 0 64 c4t50014EE0AE1BD935d0
1033.1 0.0 64035.9 0.0 0.0 1.8 0.0 1.7 0 66 c4t50014EE0AE1EB6CAd0
1120.1 0.0 62603.8 0.0 0.0 1.8 0.0 1.6 0 71 c4t50014EE65838EB7Ad0
1358.2 0.0 64256.0 0.0 0.0 1.3 0.0 0.9 0 57 c4t50014EE25D8E9FACd0
1105.1 0.0 62515.8 0.0 0.0 2.8 0.0 2.5 0 81 c4t50014EE2097D3AD6d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t50026B773C00642Cd0

Important here is the wait value: the number of I/O operations queued and waiting to be serviced.
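To watch this over time without idle devices cluttering the output, something like:

# -z suppresses lines for devices with no activity; 5-second samples
iostat -xnz 5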

I/O situation via fsstat -F 1 2 (only the second sample shown)
name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
0 0 0 0 0 0 0 0 0 0 0 ufs
0 0 0 0 0 0 0 0 0 0 0 nfs
0 0 0 157 0 330 0 14 1.19K 1 32 zfs
0 0 0 8 0 0 0 0 0 0 0 lofs
2 0 0 14 0 21 0 0 0 2 149 tmpfs
0 0 0 2 0 0 0 0 0 0 0 mntfs
0 0 0 0 0 0 0 0 0 0 0 nfs3
0 0 0 0 0 0 0 0 0 0 0 nfs4
0 0 0 0 0 0 0 0 0 0 0 autofs
 

abstractalgebra

Active Member
Dec 3, 2013
MA, USA
Oh, this seems bad. Is it pointing to my L2ARC being bad, or one of the spinning disks?
The system log has lots of these errors (scrub is just finishing).

Every time I reload smartinfo the error counts increase... (outside the 40s buffer?)
Scrub completed without errors and no repairs. EDIT: Memtest86+ 5.01 passed a full pass.
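To see whether the controller-level error counters or the drives' own SMART attributes are really climbing, this is what I'd check (assuming smartmontools is available; the -d sat,12 option is sometimes needed for SATA disks behind a SAS HBA, and the device paths below are taken from the iostat output above, possibly needing a slice suffix such as s0):

# per-device soft/hard/transport error counters as the kernel sees them
iostat -En

# full SMART data from one of the WD Reds and from the V300 L2ARC device
smartctl -a -d sat,12 /dev/rdsk/c4t50014EE65838EB7Ad0
smartctl -a -d sat,12 /dev/rdsk/c4t50026B773C00642Cd0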



Jan 30 02:20:49 napp-it-14a Error for Command: Error Level: Recovered
Jan 30 02:20:49 napp-it-14a scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Jan 30 02:20:49 napp-it-14a scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: WD-WCC4
Jan 30 02:20:49 napp-it-14a scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Jan 30 02:20:49 napp-it-14a scsi: [ID 107833 kern.notice] ASC: 0x0 (), ASCQ: 0x1d, FRU: 0x0
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g50014ee65838eb7a (sd5):
Jan 30 02:20:50 napp-it-14a Error for Command: Error Level: Recovered
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: WD-WMC1
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] ASC: 0x0 (), ASCQ: 0x1d, FRU: 0x0
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g50026b773c00642c (sd8):
Jan 30 02:20:50 napp-it-14a Error for Command: Error Level: Recovered
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 50026B773C00
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Jan 30 02:20:50 napp-it-14a scsi: [ID 107833 kern.notice] ASC: 0x0 (), ASCQ: 0x1d, FRU: 0x0
 

gea

Well-Known Member
Dec 31, 2010
DE
You can ignore the soft errors from iostat, as that counter increments on every SMART check.
I would remove the cache SSD for another check. The hard errors on it are only messages, but see what happens without it.

Are you using e1000 vnics or vmxnet3?
Try the other one, but prefer vmxnet3.

Have you enabled sync (always or standard/default) on your filesystem?
Disable sync and retry.
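For example (the dataset name below is only a placeholder for whatever filesystem backs the ESXi NFS datastore):

# check the current sync setting
zfs get sync Red-RaidZ2/nfs_datastore

# disable sync for a test; NFS write speed from ESXi should jump if sync writes are the bottleneck
zfs set sync=disabled Red-RaidZ2/nfs_datastore

# set it back afterwards; sync=disabled risks losing in-flight VM writes on a power failure
zfs set sync=standard Red-RaidZ2/nfs_datastore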