Help... lockups

dragonme

Active Member
Apr 12, 2016
282
25
28
Gea...

I am getting some napp-it lockups... looks like disk freezes.

I'm running an ESXi all-in-one.

Last time, I was in an SSH session into napp-it, checking pool status and metrics with iostat, when one pool locked up: one disk in the pool, and the pool, went to %b 100 and %w 100.
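For reference, this is roughly what I mean by checking with iostat (a sketch, not my exact session):

    # overall pool health and any errors
    zpool status -v

    # per-disk extended stats, refreshed every 5s; %w is wait, %b is busy
    iostat -xn 5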

I had to shut down all the VMs... likely corrupted some stuff.

So I shut everything down, including napp-it, and tried to reboot napp-it. I don't think that brought the disk back.

So I rebooted the host and started the VMs back up, and things looked OK for about an hour.

Then I noticed in the SSH session that the napp-it rpool went to the same 100%.

SSH failed, the console was frozen, and the web page was down. The VM seems frozen, but the SMB/CIFS shares are still being served out; I just can't log into the VM in any way.

HELP
 

dragonme

Active Member
Apr 12, 2016
282
25
28
I have napp-it booting off an SSD for ESXi. Can it be rebooted with VMs live and running? I am finding conflicting reports...

The VMs are stored on a disk-based pool served by napp-it via NFS to ESXi.

Thanks.

Otherwise I will retry shutting all the VMs down and rebooting napp-it. If that doesn't work, I'll have to reboot the host again... ugh...
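If it helps, from the ESXi side you can at least see whether the NFS datastore is still mounted while napp-it is rebooting (a sketch; run via SSH on the ESXi host):

    # list NFS datastores and whether each is currently accessible
    esxcli storage nfs list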
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
dragonme said:
I have napp-it booting off an SSD for ESXi. Can it be rebooted with VMs live and running? I am finding conflicting reports...

The VMs are stored on a disk-based pool served by napp-it via NFS to ESXi.

Thanks.

Otherwise I will retry shutting all the VMs down and rebooting napp-it. If that doesn't work, I'll have to reboot the host again... ugh...
You certainly cannot reboot the AIO napp-it VM while VMs are running and expect them to be up, running and in a good state after the NAS/filer goes down. They may look booted, but they very likely hit SCSI timeouts, and some may go into a read-only state if they stay booted (I saw this on Linux VMs back when this used to happen to me on OmniOS).
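If you suspect a Linux guest went read-only after the outage, a quick check from inside the guest looks something like this (a sketch; messages and device names will differ):

    # look for SCSI timeout / I/O error messages from the freeze
    dmesg | grep -iE 'i/o error|timed out'

    # check whether root is mounted read-only ('ro' in the options)
    grep ' / ' /proc/mounts

    # a live remount sometimes recovers it; otherwise reboot the guest
    mount -o remount,rw /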
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
If possible, try to shut down your VMs and wait up to 60s (the default disk timeout on OmniOS).
If this is not possible because the NFS datastore is blocking, power them all off.

The reason is probably a bad disk (the one with 100% b/w), if this is the only disk with such a load. Usually a bad disk is removed after a timeout (default 60s; I would reduce that, e.g. to 15s; see the sketch below). Sometimes the disk is not removed but blocks the HBA or expander. In such a case, optionally remove all data disks prior to boot and check the logs to identify the bad disk. Boot without this disk and optionally replace it with a new one.
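A sketch of reducing the disk timeout, assuming the disks sit behind the common sd driver (verify the tunable against your OmniOS release before relying on it):

    # persistent: set the sd driver I/O timeout to 15s, takes effect after reboot
    echo 'set sd:sd_io_time=15' >> /etc/system

    # or patch the running kernel with mdb (0t15 = decimal 15)
    echo 'sd_io_time/W 0t15' | mdb -kw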
 

dragonme

Active Member
Apr 12, 2016
282
25
28
@gea

Thanks, but it was 2 different disks in 2 different pools on 2 different controllers, so not likely a disk issue...

The first lockup was a spinning hard drive in an attached pool.

The second time it was the boot SSD for the napp-it VM.
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
Other possible reasons are then more general and hardware-related, like a problem with the NIC/network, PSU, RAM, CPU etc. Any hints in System > Logs or System > Faults?

Especially with an expander, but also without one, a single bad disk can block a whole system. If you can identify a disk causing trouble, remove/replace it and retry, or check/move the cabling.

The default disk timeout is 60s. Is the system reacting again after 60s? Optionally reduce the timeout, e.g. to 15s.
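If the napp-it menus show nothing, the same information is available at the CLI via the illumos fault manager and the per-disk error counters (a sketch):

    # anything the fault manager has flagged
    fmadm faulty

    # detailed error event log
    fmdump -eV | less

    # soft/hard/transport error counters per device
    iostat -En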
 

dragonme

Active Member
Apr 12, 2016
282
25
28
@gea
thanks for the replies...

Still getting lockups. The underlying pools continue, for the most part, to be served via NFS to ESXi and via SMB, but the web page is inaccessible and I can't SSH into the VM.

I have shut down the entire host, rebooted napp-it and then all the VMs... it locked up again.

This latest time I just rebooted the napp-it VM and, as advertised, the minimal NFS downtime was handled by ESXi and the VMs continued to run, no problem.

I tried looking at the logs, although Omni/Solaris is not my strong point, but I only saw a ton of spam from sendmail, which I don't have set up.

Any pointers at tracking this down are appreciated.

I did upgrade to 11f not long ago, and 11f showed much lower idle CPU usage, but perhaps there's an issue there?

The one or two times I was in the napp-it VM when it locked up, it showed a disk lockup: a different disk on a different controller each time. Once it was a data pool, the other time napp-it's own SSD.

thanks
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
1. Sendmail requires a fully qualified domain name in /etc/inet/hosts, or it spams the console.
You can also disable sendmail in menu Services > Sendmail, as there is no need for it (I would); see the sketch below.

2. Then check System > Log or System > Faults for disk-related errors.
I know the logs are full of messages not related to the problem, but after a freeze the logs are the first place to look for causes.

You can also check the ESXi logs for ESXi-related problems.
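Both options from the shell look roughly like this (a sketch; the IP and hostnames are made up, and the SMF service names should be verified with svcs):

    # give the host a fully qualified name (format: IP  FQDN  shortname)
    echo '192.168.1.10  filer.example.lan  filer' >> /etc/inet/hosts

    # or simply disable the sendmail services
    svcadm disable svc:/network/smtp:sendmail
    svcadm disable svc:/network/sendmail-client:default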
 

dragonme

Active Member
Apr 12, 2016
282
25
28
@gea

Thanks... I am not yet a terminal ninja...

I usually use OS X, which has a nice 'Console' app for searching and organizing logs.

I will shut down sendmail since I don't have an FQDN and am not using it.

I will try to parse the napp-it and ESXi logs manually...
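In case it helps anyone else, manual log digging on the two sides looks roughly like this (a sketch; the paths are the defaults, and the grep patterns are just a starting point):

    # OmniOS / napp-it side: kernel and driver messages
    grep -iE 'scsi|sata|timeout|retry|offline' /var/adm/messages | tail -50

    # ESXi side (SSH to the host): storage and NFS messages
    grep -iE 'nfs|apd|timeout' /var/log/vmkernel.log | tail -50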