OmniOS + NappIT VM: Major Fault (kernel.panic)/High CPU/Hard + Soft Smart errors
I'm concerned because I see very high CPU usage (50-75%) without much SMB or NFS access. SMB seems fast but NFS really drags: I got over 200 MB/s via SMB, but only 20 MB/s copying the same large file from SMB to a Win7 VM (on the same zpool, via an ESXi NFS datastore). Everything is on 6x 3TB WD Red NAS drives in RAID-Z2 plus a Kingston V300 120GB L2ARC. I'm moving some VMs to a separate 256GB SSD in ESXi.
I also see the two faults with SEVERITY=Major in the logs below. The knowledge-article link does not help, so I'm not sure how to troubleshoot these. Any suggestions?
Running napp-it 14a on ESXi 5.5 with the default 2 vCPUs and 8 GB RAM.
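To narrow down where the CPU is going before digging into the fault logs, something like this should help (standard illumos tooling, available on OmniOS by default):

```shell
# Show the top CPU consumers once; prstat is the illumos equivalent of top.
prstat -s cpu -n 10 1 1

# If no user process accounts for the load, check whether the time is
# being spent in the kernel: mpstat shows per-CPU %usr vs %sys,
# sampled every 5 seconds, three samples.
mpstat 5 3
```

High %sys with no obvious user-level consumer would point at kernel work (e.g. ZFS I/O or the interrupt load of the virtual HBA) rather than a runaway process.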
fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 28 18:37:22 2380911a-891b-461b-dedc-c2b798451062 SUNOS-8000-KL Major
Host : napp-it-14a
Platform : VMware-Virtual-Platform Chassis_id : VMware-56-4d-e1-e1-51-f8-cf-cc-2f-df-7c-42-1d-4c-79-fb
Product_sn :
Fault class : defect.sunos.kernel.panic
Affects : sw:///path=/var/crash/unknown/.2380911a-891b-461b-dedc-c2b798451062
faulted but still in service
Problem in : sw:///path=/var/crash/unknown/.2380911a-891b-461b-dedc-c2b798451062
faulted but still in service
Description : The system has rebooted after a kernel panic. Refer to
SUNOS-8000-KL for more information.
Response : The failed system image was dumped to the dump device. If
savecore is enabled (see dumpadm(1M)) a copy of the dump will be
written to the savecore directory /var/crash/unknown.
Impact : There may be some performance impact while the panic is copied to
the savecore directory. Disk space usage by panics can be
substantial.
Action : If savecore is not enabled then please take steps to preserve the
crash image.
Use 'fmdump -Vp -u 2380911a-891b-461b-dedc-c2b798451062' to view
more panic detail. Please refer to the knowledge article for
additional information.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 21 20:14:42 1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f SUNOS-8000-KL Major
Host : napp-it-14a
Platform : VMware-Virtual-Platform Chassis_id : VMware-56-4d-e1-e1-51-f8-cf-cc-2f-df-7c-42-1d-4c-79-fb
Product_sn :
Fault class : defect.sunos.kernel.panic
Affects : sw:///path=/var/crash/unknown/.1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f
faulted but still in service
Problem in : sw:///path=/var/crash/unknown/.1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f
faulted but still in service
Description : The system has rebooted after a kernel panic. Refer to
SUNOS-8000-KL for more information.
Response : The failed system image was dumped to the dump device. If
savecore is enabled (see dumpadm(1M)) a copy of the dump will be
written to the savecore directory /var/crash/unknown.
Impact : There may be some performance impact while the panic is copied to
the savecore directory. Disk space usage by panics can be
substantial.
Action : If savecore is not enabled then please take steps to preserve the
crash image.
Use 'fmdump -Vp -u 1178e515-dd0b-ed91-bdd8-ab0a28eb5f4f' to view
more panic detail. Please refer to the knowledge article for
additional information.
Stat: fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.0 0 0 0 0 0 0
disk-lights 0 0 0.0 0.1 0 0 0 0 28b 0
disk-transport 0 0 0.0 18.7 0 0 0 0 32b 0
eft 0 0 0.0 0.0 0 0 0 0 1.3M 0
ext-event-transport 3 0 0.0 6.4 0 0 0 0 46b 0
fabric-xlate 0 0 0.0 0.0 0 0 0 0 0 0
fmd-self-diagnosis 14 0 0.0 0.6 0 0 0 0 0 0
io-retire 0 0 0.0 0.0 0 0 0 0 0 0
sensor-transport 0 0 0.0 0.5 0 0 0 0 32b 0
ses-log-transport 0 0 0.0 0.2 0 0 0 0 40b 0
software-diagnosis 0 0 0.0 0.0 0 0 0 0 316b 0
software-response 0 0 0.0 0.0 0 0 0 0 2.3K 2.0K
sysevent-transport 0 0 0.0 389.2 0 0 0 0 0 0
syslog-msgs 0 0 0.0 0.0 0 0 0 0 0 0
zfs-diagnosis 15 0 0.0 0.7 0 0 0 0 0 0
zfs-retire 15 0 0.0 1.2 0 0 0 0 168b 0
Important: fmdump -I
TIME CLASS
Jan 11 20:12:19.2634 ireport.os.sunos.panic.savecore_failure
Jan 11 20:12:45.7120 resource.sysevent.EC_datalink.ESC_datalink_phys_add
Jan 11 20:19:17.9062 ireport.os.sunos.panic.savecore_failure
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_static_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_static_end
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_send_targets_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_send_targets_end
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_slp_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_slp_end
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_isns_start
Jan 11 20:22:59.3006 resource.sysevent.EC_iSCSI.ESC_isns_end
....(cut)
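Note the `ireport.os.sunos.panic.savecore_failure` entries above: they suggest the crash dumps from the two panics were never successfully written to /var/crash, which would explain why there is little to analyze. A sketch of how to check and fix that (paths are the defaults; adjust for your layout):

```shell
# Show the current dump configuration: dump device, savecore
# directory, and whether savecore runs on boot.
dumpadm

# If savecore is disabled, enable it and point it at a directory
# with enough free space for a kernel dump.
dumpadm -y -s /var/crash

# Try to extract any dump still sitting on the dump device.
savecore -v
```

With a saved dump in place, `fmdump -Vp -u <event-id>` and `mdb` on the vmcore file can show the actual panic stack, which is what you need to troubleshoot a `defect.sunos.kernel.panic`.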
Disk statistics via iostat -xn 1 2 (second sample shown only; the first iostat sample is a since-boot average)
device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 fd0
0.0 67.0 0.0 461.6 0.0 0.0 0.6 0.1 0 0 rpool
0.0 69.0 0.0 461.6 0.0 0.0 0.0 0.1 0 0 c2t0d0
7358.9 0.0 380107.0 0.0 71.8 10.7 9.8 1.5 100 100 Red-RaidZ2
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0
1264.2 0.0 63299.8 0.0 0.0 1.1 0.0 0.9 0 54 c4t50014EE0AE1EB0EEd0
1478.2 0.0 63395.8 0.0 0.0 1.3 0.0 0.9 0 64 c4t50014EE0AE1BD935d0
1033.1 0.0 64035.9 0.0 0.0 1.8 0.0 1.7 0 66 c4t50014EE0AE1EB6CAd0
1120.1 0.0 62603.8 0.0 0.0 1.8 0.0 1.6 0 71 c4t50014EE65838EB7Ad0
1358.2 0.0 64256.0 0.0 0.0 1.3 0.0 0.9 0 57 c4t50014EE25D8E9FACd0
1105.1 0.0 62515.8 0.0 0.0 2.8 0.0 2.5 0 81 c4t50014EE2097D3AD6d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t50026B773C00642Cd0
The important column is wait: the number of I/O operations queued waiting to be serviced. Here the pool shows wait 71.8 at 100 %w/%b while the individual disks sit at 54-81 %b, i.e. the RAID-Z2 vdev is saturated during this read burst.
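To watch this at the pool level while reproducing the slow NFS copy, a minimal sketch (pool name taken from the iostat output above):

```shell
# Per-vdev bandwidth and operation counts, refreshed every 5 seconds;
# shows whether one member disk lags the rest of the RAID-Z2 vdev.
zpool iostat -v Red-RaidZ2 5

# Per-device service times in parallel; a single disk with a much
# higher asvc_t than its peers (like the 2.5 ms one above) can drag
# the whole vdev down, since RAID-Z waits for the slowest member.
iostat -xn 5
```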
I/O situation via fsstat -F 1 2 (second sample shown only)
name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
0 0 0 0 0 0 0 0 0 0 0 ufs
0 0 0 0 0 0 0 0 0 0 0 nfs
0 0 0 157 0 330 0 14 1.19K 1 32 zfs
0 0 0 8 0 0 0 0 0 0 0 lofs
2 0 0 14 0 21 0 0 0 2 149 tmpfs
0 0 0 2 0 0 0 0 0 0 0 mntfs
0 0 0 0 0 0 0 0 0 0 0 nfs3
0 0 0 0 0 0 0 0 0 0 0 nfs4
0 0 0 0 0 0 0 0 0 0 0 autofs
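One well-known explanation for the 200 MB/s SMB vs 20 MB/s NFS gap: ESXi issues every NFS datastore write as synchronous, so without a dedicated SLOG device each write waits on the RAID-Z2 disks (the L2ARC only helps reads). A way to confirm this, assuming the NFS export lives on the Red-RaidZ2 pool; the `sync=disabled` step is for diagnosis only, since it sacrifices crash consistency for the duration of the test:

```shell
# Check the current sync-write policy on the exported filesystem.
zfs get sync Red-RaidZ2

# Diagnostic run only: disable sync writes and repeat the NFS copy.
zfs set sync=disabled Red-RaidZ2

# ...run the copy test, then restore the safe default:
zfs set sync=standard Red-RaidZ2
```

If throughput jumps to SMB-like levels with sync disabled, the fix is a proper power-loss-protected SLOG SSD rather than leaving sync off.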