ESXi 6.0 NFS with OmniOS Unstable - consistent APD on VM power off

socra · Oct 1, 2015

Yeah, it could be the way I move from appliance to appliance with the export/import of pools that is failing me, since it's working for other people (maybe ESXi gets pissed, but I dunno why).

At least I now know that OmniOS NFS then becomes unavailable to everything, not just VMware.

I hadn't seen this behavior on my new server with the 1.5c appliance (but I wasn't really paying attention to it then). I think I still have that VM around, so maybe I'll export/import my pool from the 1.5d to the 1.5c and see what happens.
One thing is for sure: NOT hardware related!

My "spidey sense" keeps telling me that I got the issue on my new server after installing the 1.5d appliance and exporting/importing my 1.5c pool.

@whitey how do you update your Omni appliance? (Do you update, or reinstall and do an import/export of the pools?)

whitey · Oct 1, 2015

I have used this process to take me from the 008 to the 010 release of OmniOS. Honestly, if I find a rock solid config I typically leave it alone and maybe update once or twice a year. YMMV.

I've had graceful export and not-so-graceful force import scenarios play out over the years. :-D Never had an issue.

Upgrade_r151008_r151010
Upgrade_to_r151014

socra · Oct 6, 2015

About 4 days ago I went back to the VM that I initially created the volume with (nappit-15c), I think OmniOS 5.11 omnios-7648372, July 2015.
Haven't had any APD since then.
So dunno what's up with that... Maybe it's the export/import, or maybe ESXi doesn't like the fact that I created my new appliance with the same IP addresses as the old appliance.
Going to re-deploy the 1.5d VM again and not edit anything (so 1 adapter with e1000 and 1 with vmxnet3, instead of deleting the NICs and using 2 x vmxnet3), then export/import and see what happens.
If that fails, maybe delete my volumes and re-create them with 1.5d to see if that has anything to do with it.
Still haven't given up...

whitey · Oct 6, 2015

nostradamus99 said:
About 4 days ago I went back to the VM that I initially created the volume with (nappit-15c), I think OmniOS 5.11 omnios-7648372, July 2015.
Haven't had any APD since then.
So dunno what's up with that... Maybe it's the export/import, or maybe ESXi doesn't like the fact that I created my new appliance with the same IP addresses as the old appliance.
Going to re-deploy the 1.5d VM again and not edit anything (so 1 adapter with e1000 and 1 with vmxnet3, instead of deleting the NICs and using 2 x vmxnet3), then export/import and see what happens.
If that fails, maybe delete my volumes and re-create them with 1.5d to see if that has anything to do with it.
Still haven't given up...

You're a BEAST! Check this garbage out, NFS datastores went offline shortly after a vSphere VDP hammering. Seen this a time or two before, really depressing that the articles are so conflicting on whether or not to use e1000/vmxnet3 (I use vmxnet3 now) and virt eth devices seem to suck so bad under *nix.

SMH :-(

socra · Oct 7, 2015

whitey said:
You're a BEAST! Check this garbage out, NFS datastores went offline shortly after a vSphere VDP hammering. Seen this a time or two before, really depressing that the articles are so conflicting on whether or not to use e1000/vmxnet3 (I use vmxnet3 now) and virt eth devices seem to suck so bad under *nix.

SMH :-(

If you log in to your ESXi host, is there any info in /var/log/vobd.log?
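
A quick way to answer that is to filter the log for APD and NFS entries. This is only a minimal sketch: the function name `apd_events` is made up for illustration, and matching on the strings "apd" and "nfs" is an assumption about how the relevant entries are worded, not a documented format.

```shell
# Print lines that mention APD or NFS from a vobd log.
# Defaults to the path mentioned above; pass another file as the first argument.
# Assumption: the interesting entries contain "apd" or "nfs" (case-insensitive).
apd_events() {
    grep -iE 'apd|nfs' "${1:-/var/log/vobd.log}"
}
```

Run `apd_events` in an SSH session on the host, or `apd_events /path/to/copy` against a copied log, and compare the timestamps with the moment the VM was powered off.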

I'm also grasping at air here. The appliance from Gea uses 2 NICs (1 e1000 + 1 vmxnet3). The appliance with the stable connection so far is using the vmxnet3 for the NFS traffic and the e1000 for CIFS (my Veeam replication comes in through the e1000).

When I deployed the 1.5d appliance I deleted both NICs and selected 2 vmxnet3 adapters because, well, it looks silly using 1 e1000 and 1 vmxnet3 adapter at the same time, no?

gea · Oct 7, 2015

nostradamus99 said:
When I deployed the 1.5d appliance I deleted both NICs and selected 2 vmxnet3 adapters because, well, it looks silly using 1 e1000 and 1 vmxnet3 adapter at the same time, no?

I added both types by default because on previous ESXi versions sometimes one was more stable than the other. It is useful to try both when there are problems. Mostly e1000 is more stable, while vmxnet3 is faster with a lower CPU load. The initial ESXi 5.5 release was an example where e1000 was completely unstable.

TechIsCool · Oct 9, 2015

@nostradamus99 Did you see my post about bad memory and APD? I had a bad stick of RAM that was faulting, with ECC catching it. Once I removed the RAM, the APD stopped happening. Have you checked your BIOS to confirm no errors occur?

socra · Oct 9, 2015

TechIsCool said:
@nostradamus99 Did you see my post about bad memory and APD? I had a bad stick of RAM that was faulting, with ECC catching it. Once I removed the RAM, the APD stopped happening. Have you checked your BIOS to confirm no errors occur?

No, not yet, but it's a good tip. I didn't know a BIOS could find a bad RAM stick...
I don't think it's the problem though, because my new server has had the same issue, but I will check to make sure (not running ECC memory on my current server).

TechIsCool · Oct 9, 2015

ECC has a log if you don't have ECC...

@gea I thought it was required to have ECC for ZFS to actually not fail silently

gea · Oct 10, 2015

TechIsCool said:
ECC has a log if you don't have ECC...

@gea I thought it was required to have ECC for ZFS to actually not fail silently

Without ECC you have a chance of undetectable corrupted writes. Even ZFS cannot protect against this, as it can happen prior to adding checksums. But this is the case for every computer system, OS or filesystem; it is nothing special to ZFS.

But it is unlikely that RAM problems affect only NFS with ESXi as the client, without problems or crashes elsewhere. A single bit flip may cause a single crash, but not a reproducible crash.

socra · Oct 25, 2015

Small update:
Still testing the napp-it 1.5d appliance on my microserver (also updated OmniOS with pkg update to include the latest NFS fixes).
No issues since I cleanly removed the M1015 from the older appliance, added it to the new 1.5d and then rebooted, even though I have done this before on my production machine. (Have not done any messing with the NICs so far.)
I did notice that Gea's newer appliance now has 2 vCPUs instead of the 1 in earlier builds.

On my production machine:
Something else I noticed earlier that is still present is the annoyance with the NICs. OmniOS doesn't like it when you change the NICs that the napp-it appliance ships with (1x E1000 + 1x vmxnet3). So Gea, maybe 2x vmxnet3 for a 1.5e appliance?

I changed the first NIC from e1000 to vmxnet3, which gave me problems assigning an IP. So I deleted both NICs and added two vmxnet3 adapters. OmniOS (or ESXi) screws up and reverses vmxnet3s0 and vmxnet3s1 (I read about this on hardforum too, I think).
So my CIFS NIC is now not getting an IP.
I was able to check this by comparing MAC addresses: in ESXi I mapped the first NIC to CIFS and the 2nd to NFS, but in OmniOS it was the other way around.
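
The MAC comparison described above can be sketched as follows. On OmniOS, `dladm show-phys -m` lists each link with its MAC address, and the vSphere client shows the MAC of each virtual adapter. The helpers below are hypothetical and rest on the assumption that the OmniOS side may print octets without leading zeros (for example `0:c:29:...`), so both sides are padded to the same form before comparing.

```shell
# Normalize a colon-separated MAC address to lowercase, zero-padded octets
# so values copied from OmniOS and from ESXi can be compared as strings.
norm_mac() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split the address into octets
    IFS=$old_ifs
    out=""
    for octet in "$@"; do
        out="${out}${out:+:}$(printf '%02x' "0x${octet}")"
    done
    echo "$out"
}

# Succeeds when two MAC addresses refer to the same adapter.
same_mac() {
    [ "$(norm_mac "$1")" = "$(norm_mac "$2")" ]
}
```

For example, `same_mac 0:c:29:a:b:1f 00:0C:29:0A:0B:1F && echo "same adapter"` (addresses made up) tells you which OmniOS link belongs to which port group.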

Craziness continued after messing with the NICs: when I used the vSphere client to shut down the VM I got a kernel dump.

Luckily I created a snapshot before I started messing with it..so reverted and went ahead:

The way around this was to leave the e1000 NIC and configure it as management, map the 2nd NIC to CIFS, then add a 3rd NIC and place it in the NFS port group. Now OmniOS matches the NICs nicely (should be easy to reproduce to check it out and compare results).
The VM has been shut down about 5-6 times with no issues.

The other issue was upgrading VMware Tools (if you install the latest 6.x patches, VMware Tools is outdated).
After messing with the NICs this also gave me kernel dumps when I tried to upgrade the tools.
When I uninstalled VMware Tools first, the installation went smoother.
(On my microserver, where I did not mess with the NICs, I was able to upgrade VMware Tools with no problem.)

Planning to go and replace my prod OI vm with OmniOS this week so we'll see what happens

*EDITED*
A lot because I've been at it too long and my English was failing me..

TechIsCool · Oct 25, 2015

VMware Tools is now a separate download from ESXi. This happened in September.
VMware Tools 10.0.0 Released

socra · Oct 25, 2015

Yes, I heard. We'll have to wait and see if this is a smart move on the part of VMware... could be, must be, should be, hopefully will be...

acmcool · Oct 27, 2015

I am having issues as well. Every time I power off, the VM just boots into maintenance mode. Not sure what to do...

socra · Oct 27, 2015

What do you mean issues..? the same or different..please elaborate on what you're seeing or have done so far...what are you running, how did you install OmniOS etc..

acmcool · Oct 27, 2015

I used the napp-it all-in-one appliance to install. I am on ESXi 6.
When I do a reboot/shutdown through napp-it, the BE gets corrupted and it goes to maintenance mode.
I tried svcadm clear system/boot-archive,
but it does not help.
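
For reference, the usual sequence around that command can be sketched like this. It is an assumption that the standard illumos tools `svcs`, `bootadm` and `svcadm` are the right ones for this case; the `run` wrapper, the `fix_boot_archive` name and the `DRY_RUN` variable are made up for this example, and nothing is executed unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Inspect, rebuild, then clear the boot-archive service.
fix_boot_archive() {
    run svcs -xv system/boot-archive      # explain why the service is in maintenance
    run bootadm update-archive            # rebuild the boot archive
    run svcadm clear system/boot-archive  # let SMF retry the service
}
```

Calling `fix_boot_archive` prints the three commands; `DRY_RUN=0 fix_boot_archive` runs them from the maintenance-mode shell.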

gea · Oct 27, 2015

I would suspect a broken ESXi or OmniOS installation.
Reinstall OmniOS via the OVA template.

If this does not help, try a new ESXi setup

socra · Oct 28, 2015

Did you use the NICs (e1000 + vmxnet3) configured with the appliance, or did you change them (added/removed NICs)?
When you redeploy the appliance, don't change the NICs, and then try to shut down the VM using the "Shut Down Guest OS" button.

socra · Oct 30, 2015

Just went ahead and did the following:

Exported pools
Shut down OI VM
Unmounted NFS datastore
Shut down all VMs
Removed M1015 from OI
Gave the M1015 to the OmniOS appliance
Shut down ESXi host
Started ESXi host
Started OmniOS
Emptied the E1000 IP configuration because it was making OmniOS listen on the wrong NIC (couldn't ping my CIFS server)
Rebooted the appliance; all good, OmniOS was now pingable (left the E1000 attached to the VM, just not connected; seen too many weird things when shuffling the NICs)
Imported pools
Created new NFS datastore
So far so good. Installed a CentOS test VM on it, which went well. Keeping fingers crossed.
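
The storage-related steps in that list map roughly onto the commands below. This is a sketch, not necessarily how it was done here: the function name `migration_plan`, the pool name, share path, IP address and datastore name are all placeholders, and the passthrough and power steps are done in the vSphere client, so they are left out. The function only prints the commands, labelled with the machine they belong on.

```shell
# Print the command-line equivalent of the export/import steps.
# Nothing is executed; every argument is a placeholder you must replace.
migration_plan() {
    pool=$1; share=$2; nas_ip=$3; datastore=$4
    echo "old-vm# zpool export ${pool}"
    echo "esxi#   esxcli storage nfs remove --volume-name=${datastore}"
    echo "new-vm# zpool import ${pool}"
    echo "esxi#   esxcli storage nfs add --host=${nas_ip} --share=${share} --volume-name=${datastore}"
    echo "esxi#   esxcli storage nfs list"
}
```

For example, `migration_plan tank /tank/vmstore 192.0.2.10 nfs_ds` prints five lines you can review before typing anything on the real hosts.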

I did see something strange with a failed disk that the alert log didn't mention, but I posted that on hardforum because it isn't related to the original issue: OpenSolaris derived ZFS NAS/ SAN (Nexenta*, OpenIndiana, Solaris Express) - Page 356 - [H]ard|Forum

acmcool · Oct 31, 2015

I figured out my issue: the SSD I used for VMware OS storage just died today, so it had been corrupting OmniOS until today.
