Napp-it AIO ESXi freeze issue ... help requested!


dragonme

Active Member
Apr 12, 2016
Every couple of weeks my ESXi 6 server goes offline, but in a strange way.

The setup: ESXi boots off USB on an Intel 5520 motherboard; napp-it and one other VM boot off an SSD on SATA.

The rest of the VMs boot off storage provided by napp-it via passthrough of an LSI card in IT mode.

Standard setup, right?

So the failure mode looks something like this:

All VMs are in a wait state. They show as running but have no storage; I can't ping or SSH into any of them, even the ones on SATA storage.

The only VM I get any info from is the (frozen) console screen of the napp-it VM as it is going down: warnings about a PCI device being busy too long, or scsi_status=22, or met_process_inter invalid scsi status.

Running esxcfg-scsidevs -c on the host shows all storage except for the LSI card.

I can SSH into the host's ESXi shell and reach the ESXi web console; everything else hangs with no reply.
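
From the ESXi shell you can also check whether the host still enumerates the HBA at all. A generic sketch, nothing specific to this host is assumed (device names will differ, and a passed-through card is not expected to appear as a host-claimed adapter anyway):

# storage adapters the host itself has claimed (a passed-through HBA normally won't be listed)
esxcfg-scsidevs -a

# PCI devices the host enumerates; the LSI card should still show up here even when passed through
lspci | grep -i lsi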

So what is going on here? Is the LSI card going offline for some reason?

The only solution is a hard shutdown of the server and bringing it back up.

If the LSI card is at fault, why is the VM that is on the SATA drive, and has nothing to do with napp-it, freezing as well?

Thanks in advance!!
 

gea

Well-Known Member
Dec 31, 2010
If the LSI freezes or is not responding, the whole OmniOS freezes; it does not matter that OmniOS is running from SATA.

What you can check (example commands below):
- Menu System > Logs and System > Faults for more info
- Check the firmware of the LSI; LSI 2008-based cards are buggy with firmware 20.0.0.0 - 20.0.0.4
- Watch iostat for an increasing number of soft/hard/transfer errors on a single disk, which can indicate a disk problem
- Run a short SMART check on all disks to check their health
- Optionally decrease the disk timeout from the ZFS default of 60s to a lower value like 15s (System > Tuning)
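
A minimal sketch of how those checks could be run from the OmniOS console; smartctl is assumed to be available (napp-it installs it), and c1t0d0 is just a placeholder device name:

# any faults logged by the fault manager?
fmadm faulty

# per-disk soft/hard/transfer error counters (cumulative since boot)
iostat -En

# short SMART self-test plus health summary for one disk (repeat per disk;
# SATA disks behind an LSI HBA may need "-d sat")
smartctl -t short /dev/rdsk/c1t0d0s0
smartctl -a /dev/rdsk/c1t0d0s0

# the sd driver I/O timeout tunable behind the System > Tuning setting
# (assumption: napp-it writes it to /etc/system); takes effect after a reboot
echo "set sd:sd_io_time=15" >> /etc/system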
 

zuni11

New Member
Mar 25, 2018
Same error with an LSI 9300-8i (firmware 15.00.02.00) on the latest AIO.
Does the 9300 have the same firmware bug?
 

gea

Well-Known Member
Dec 31, 2010
There is no known firmware bug with the LSI 9300.
The problem must be disk, cable, backplane, PSU, mainboard, or ESXi/OS related (search in this order).
 

dragonme

Active Member
Apr 12, 2016
I have napp-it and Observium (which also runs my UPS monitor) installed on a Patriot SSD for boot.

Every couple of days I see ESXi log a huge increase in latency on that disk (or at least I think that is the disk that starts the cascade):

Performance has deteriorated; I/O latency increased from an average of 4xxx microseconds to 10xxx+ microseconds.

Most times it resolves, with further messages stating that latency returned to normal, but every couple of weeks it seems to cascade further, with latencies around 20xxx, and that is when it gives up.


That SSD is on the SATA controller, and I also have some SATA disks RDM'd into napp-it for a pool, while the VM pool for the rest of the VMs is on Intel SSDs on the passed-through LSI.

What I feel might be happening is that ESXi times out the SSD on the SATA port that napp-it is running from, cascading into a failure of the pools run by napp-it?

The question is why the latency on that SSD, which is doing almost no work, goes through the roof. napp-it reads/writes next to nothing, and Observium is only logging a couple of devices and not writing much at all either.
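
One way to watch that latency live from the ESXi shell is esxtop's disk views; a generic sketch, nothing host-specific assumed:

# interactive: press 'd' for adapters or 'u' for devices, then compare
# DAVG/cmd (device latency) with KAVG/cmd (kernel/queueing latency)
esxtop

# batch mode: capture 60 samples at 5-second intervals for later analysis
esxtop -b -d 5 -n 60 > /tmp/esxtop-latency.csv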
 

rune-san

Member
Feb 7, 2014
I've found client SSDs to be a real problem with a lot of enterprise stuff. Remember ESXi does no buffering whatsoever, so every couple of writes relies on waiting for a buffer flush.

My assumption is the same as yours that the SSD is choking. I had this problem when I had consumer SSDs on an Expander Backplane. Even once I got rid of that, I started dropping SSDs like crazy even with direct attach. They'd just suddenly shoot their latency way up, and they'd stop responding. Some would turn out really dead, others would revive after performing a secure erase on them.

My point is that without TRIM and UNMAP support, my guess is that your budget grade SSD is locking up and not flushing its writes. I recommend taking it out of the equation.
 

dragonme

Active Member
Apr 12, 2016
@rune-san

very true and thanks for the comments.

Neither of the 2 VMs on that SSD is a big writer, but since that drive is also on the ICH SATA controller, along with 3 drives that I RDM into napp-it, that might compound the issue.

I have also been reading about an ESXi compatibility issue with interrupt remapping:
"vHBAs and other PCI devices may stop responding in ESXi 6.0.x, ESXi 5.x and ESXi/ESX 4.1 when using Interrupt Remapping" (VMware KB 1030265)

But I don't see the kinds of error messages specified in that article.
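
For reference, the workaround in that KB is to disable interrupt remapping in the VMkernel. A sketch of how that is typically checked and toggled from the ESXi shell; verify against the KB for your exact ESXi build before applying, since a host reboot is required:

# show the current setting (FALSE means interrupt remapping is still enabled)
esxcli system settings kernel list -o iovDisableIR

# disable interrupt remapping, then reboot the host
esxcli system settings kernel set --setting=iovDisableIR --value=TRUE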

I am really no expert in ESXi and napp-it logs and don't know if I have looked in the right places to collect all the data.

My game plan is to move these 2 VMs to an Intel 3500 80 GB SSD, or a pair of them, on an Intel RAID expander mezzanine card that I picked up for the S5520HC. It gives 4 ports; I can use that for the napp-it boot drive and thus pass the SATA controller to napp-it for storage instead of RDM'ing individual drives. That should clean things up a bit.
 

dragonme

Active Member
Apr 12, 2016
The other issue that probably does me no favors is having the VCSA VM running off the napp-it storage instead of on native VMFS with napp-it and Observium.

Because if/when this 'device busy for too long' bug hits, ALL napp-it storage comes to a standstill and the VMs on it are suspended, along with VCSA, so that can't be helping.
 

dragonme

Active Member
Apr 12, 2016
I think I have this narrowed down, although I didn't take good notes.

I have a disk shelf attached to the ESXi host through an external cable to the LSI card, which is in turn passed through to the napp-it VM.

What I had done when running ZFS on OS X was to power on the shelf, do a backup, export the backup pool, then shut down the shelf. It was simple.

However, now that I am running napp-it as a VM on ESXi, it does not seem to like this behavior.

I 'believe' that a couple of months ago I found some steps to remove devices from Oracle Solaris/OmniOS, so after I exported the pool I ran a couple of commands that removed the now-'dead' drives from the powered-down disk shelf.

If that is all I did, I think it worked. I hadn't given it much thought, but I had about 3 months of uptime until I just did another backup, and now I am back to crashes every couple of days, since I didn't write down how I did the device removals.


Does any of this make sense to you Solaris experts?
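
For what it's worth, a typical sequence on OmniOS for this kind of shelf power-down would look roughly like the lines below; this is a generic sketch (the pool name 'backup' is a placeholder), not necessarily the exact commands used back then:

# export the backup pool before powering the shelf off
zpool export backup

# after the shelf is powered down: remove dangling /dev links for the vanished disks
devfsadm -Cv

# verify what the controller still reports as attached
cfgadm -al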
 

gea

Well-Known Member
Dec 31, 2010
Not really.
A JBOD with an expander should not give problems every few weeks. The only known problem may be SATA disks behind the expander: a single troublesome SATA disk can block/reset the expander. With SAS disks such risks are much lower due to the more advanced protocol.

Can the problems be related to other factors like temperature? Can you replace cables?

Removed/attached disks are detected automatically with a SAS controller. They only remain visible in iostat because it keeps a history since bootup.