Troubleshooting OmniOS silent hang/crash


danwood82

Member
Feb 23, 2013
After playing around with a test setup for a week, I've gotten reasonably familiar with OmniOS and napp-it, but I've still got a recurring problem that I can't seem to work out.

Occasionally, and seemingly inevitably, the whole machine simply hangs. napp-it stops responding, shares disappear, and the computer itself stops sending anything to the screen (the monitor goes into standby, receiving no signal). I can do nothing but a hard reset or hard power-off to get it running again.

Once I get it going again, I can't find any mention of an error anywhere (although I'm pretty clueless about where to look).

The nearest I've found to something suspicious is in the System>Faults log, for fmdump -I. The first line logged after the machine is started up again is always:
"Jan 11 00:31:51.9433 ireport.os.sunos.panic.savecore_failure"

I thought this might be indicating something unpleasant, but then I've just discovered it writes that line even after I've correctly powered down the machine.
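For anyone wanting to do the same check on saved log output, here is a minimal sketch (in Python, my choice) that pulls the timestamp and event class out of lines shaped like the one quoted above and lists the panic-related ones. The function names `parse_fmdump` and `panic_events` are made up for illustration, the `TIME CLASS` header in the sample is an assumption about the output layout, and the only real data is the single line quoted in this post.

```python
import re

# Matches lines shaped like the one quoted in the post, with or without quotes:
#   Jan 11 00:31:51.9433 ireport.os.sunos.panic.savecore_failure
LINE_RE = re.compile(
    r'^"?(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+)'
    r'\s+(?P<cls>[\w.\-]+)"?$'
)


def parse_fmdump(text):
    """Return (timestamp, event_class) pairs for lines that look like events."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((match.group("stamp"), match.group("cls")))
    return events


def panic_events(events):
    """Keep only events whose class name contains a '.panic.' component."""
    return [event for event in events if ".panic." in event[1]]


# Sample input: the header line is an assumed layout; the event line is from the post.
sample = """TIME CLASS
Jan 11 00:31:51.9433 ireport.os.sunos.panic.savecore_failure
"""

for stamp, cls in panic_events(parse_fmdump(sample)):
    print(stamp, cls)
```

Since the same event also shows up after a clean shutdown, a match here is only a place to start looking, not proof of a panic.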


I've got equipment turning up next week to build my server proper, and I'm really hoping this is just an odd phantom problem with the Dell T7600 workstation I'm testing on, and that it will vanish once I build the server. So far I've tested with different HBAs and different zpool drives. The only things in common have been the machine itself and an old Intel 40GB SSD I've been using as a boot drive. The SSD has perfect SMART data and doesn't seem to have any kind of fault in itself, so I'm presuming it's not that (although anything is possible, I suppose).

Does anyone have any ideas what might be causing this, or can think of an effective way to diagnose the problem?
 

gea

Well-Known Member
Dec 31, 2010
DE
A freeze or kernel panic indicates a problem in core hardware components or drivers.
I would first do a RAM check, or remove half of the RAM and then retry with the other half.

You may also try another boot disk, and use SATA only (or an HBA only) for your pool.
Reload the OS and burn a new boot CD, try a setup on a USB stick, or boot from an HBA (in case of a SATA problem).
Check the cabling and replace it if in doubt. Check the CPU cooling.

Is this a Solaris-only problem? Compare with another OS like Windows: if only Solaris hangs, suspect a driver problem;
otherwise it may be a power supply problem.
 

mrkrad

Well-Known Member
Oct 13, 2012
Make sure you have disabled nearly all of the power-saving options. This is a sore point even on the newest Intel chipset platforms. Set everything to MAX/MAX/MAX, and disable C1E and C-states to run the system at maximum performance. Don't let anything hibernate or sleep; it will screw the pooch should timeouts occur.