ESXi 6.0 NFS with OmniOS Unstable - consistent APD on VM power off

socra

Member
Feb 4, 2011
79
2
8
Yeah, it could be that the way I move from appliance to appliance, with the export/import of pools, is what's failing me, since it's working for other people. (Maybe ESXi gets pissed, but I don't know why.)

At least I now know that when this happens, OmniOS NFS is unavailable to everything, not just VMware.

I hadn't seen this behavior on my new server with the 1.5c appliance (but I wasn't really paying attention to it then). I think I still have that VM around; maybe I'll export/import my pool from the 1.5d to the 1.5c and see what happens.
One thing's for sure: it's NOT hardware related!

My "spidey sense" keeps telling me that I got the issue on my new server after installing the 1.5d appliance and exporting/importing my 1.5c pool.

@whitey how do you update your Omni appliance? (Do you update in place, or reinstall and do an import/export of the pools?)
 

whitey

Moderator
Jun 30, 2014
2,770
865
113
38
I have used this process to take me from the 008 to the 010 release of OmniOS. Honestly, if I find a rock-solid config I typically leave it alone and maybe update once or twice a year. YMMV.

I've had graceful export and not-so-graceful force import scenarios play out over the years. :-D Never had an issue.

Upgrade_r151008_r151010
Upgrade_to_r151014
 

socra

About 4 days ago I went back to the VM that I initially created the volume with (nappit-15c), I think OmniOS 5.11 omnios-7648372, July 2015.
I haven't had any APD since then.
So I don't know what's up with that. Maybe it's the export/import, or maybe ESXi doesn't like the fact that I created my new appliance with the same IP addresses as the old appliance.
I'm going to re-deploy the 1.5d VM and not edit anything (so 1 adapter with e1000 and 1 with vmxnet3, instead of deleting the NICs and using 2 x vmxnet3), then export/import and see what happens.
If that fails, maybe I'll delete my volumes and re-create them with 1.5d to see if that has anything to do with it.
Still haven't given up...
 

whitey

You're a BEAST! Check this garbage out: NFS datastores went offline shortly after a vSphere VDP hammering. I've seen this a time or two before. It's really depressing that the articles are so conflicting on whether to use e1000 or vmxnet3 (I use vmxnet3 now), and that virtual Ethernet devices seem to suck so bad under *nix.

SMH :-(
 

Attachments

socra

If you log in to your ESXi host, is there any info in /var/log/vobd.log?
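For anyone wanting to check the same thing, here is a minimal Python sketch that filters a copy of vobd.log for storage-related events. The marker substrings (`storage.apd`, `nfs.server`) are assumptions based on event names such as esx.problem.storage.apd.start; they are not an exhaustive list, so adjust them to whatever your log actually contains. The function names are made up for illustration.

```python
from typing import Iterable, List, Tuple

# Assumed substrings to look for; not an authoritative list of event names.
MARKERS: Tuple[str, ...] = ("storage.apd", "nfs.server")


def storage_events(lines: Iterable[str], markers: Tuple[str, ...] = MARKERS) -> List[str]:
    """Return the lines that contain any of the markers (case-insensitive)."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            hits.append(line.rstrip("\n"))
    return hits


def scan_file(path: str) -> List[str]:
    """Scan a log file copied off the host, e.g. a local copy of vobd.log."""
    with open(path, errors="replace") as handle:
        return storage_events(handle)
```

Calling `scan_file("vobd.log")` on a copy of the log returns the matching lines in order, which makes it easier to compare the timestamps with the moment the VM was powered off.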

I'm also grasping at air here. The appliance from Gea uses 2 NICs (1 e1000 + 1 vmxnet3). The appliance with the stable connection so far is using the vmxnet3 for the NFS traffic and the e1000 for CIFS. (My Veeam replication comes in through the e1000.)

When I deployed the 1.5d appliance I deleted both NICs and selected 2 vmxnet3 adapters because, well, it looks silly using 1 e1000 and 1 vmxnet3 adapter at the same time, no?
 

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
I added both types by default because on previous ESXi versions sometimes one was more stable than the other. It is useful to try both when there are problems. Mostly e1000 is more stable, while vmxnet3 is faster with a lower CPU load. The initial ESXi 5.5 release was an example where e1000 was completely unstable.
 

socra

@nostradamus99 Did you see my post about bad memory and APD? I had a bad stick of RAM that was faulting, with ECC catching it. Once I removed the RAM, the APD stopped happening. Have you checked your BIOS to confirm no errors occur?
No, not yet, but it's a good tip. I didn't know a BIOS could find a bad RAM stick.
I don't think it's the problem though, because my new server has had the same issue, but I will check to make sure. (I'm not running ECC memory on my current server.)
 

gea

ECC has a log if you don't have ECC...

@gea I thought it was required to have ECC for ZFS to actually not fail silently
Without ECC you have a chance of undetectable corrupted writes. Even ZFS cannot protect against this, as the corruption can happen before the checksums are added. But this is the case for every computer system, OS or filesystem; it is nothing special to ZFS.
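The ordering argument can be illustrated with a small Python sketch. It does not model ZFS itself: SHA-256 stands in for whatever checksum the filesystem uses, and the helper names are made up for this example. A bit flip that happens before the checksum is computed is checksummed along with the data and verifies cleanly; a flip that happens afterwards is caught.

```python
import hashlib
from typing import Tuple


def write_block(data: bytes) -> Tuple[bytes, str]:
    """Store data together with a checksum computed at write time."""
    return data, hashlib.sha256(data).hexdigest()


def verify_block(data: bytes, checksum: str) -> bool:
    """Recompute the checksum on read and compare with the stored one."""
    return hashlib.sha256(data).hexdigest() == checksum


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate a single-bit memory error."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


original = b"important payload"

# Case 1: the bit flips in RAM before the checksum is computed.
stored, checksum = write_block(flip_bit(original, 3))
corrupt_before_passes = verify_block(stored, checksum)  # verification succeeds
data_is_wrong = stored != original                      # yet the data is bad

# Case 2: the bit flips after the checksum was computed.
stored, checksum = write_block(original)
corrupt_after_passes = verify_block(flip_bit(stored, 3), checksum)  # detected
```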

But it is unlikely that RAM problems would affect only NFS with ESXi as the client, without problems or crashes elsewhere. A single bit flip may cause a single crash, but not a reproducible crash.
 

socra

Small update:
Still testing my microserver napp-it appliance 1.5d (also updated OmniOS with pkg update to include the latest NFS fixes).
No issues since I cleanly removed the M1015 from the older appliance, added it to the new 1.5d and then rebooted, even though I have done this before on my production machine. (I have not done any messing with the NICs so far.)
I did notice that Gea's newer appliance now has 2 vCPUs instead of the 1 in earlier builds.

On my production machine:
Something I noticed earlier that is still present is the annoyance with the NICs. OmniOS doesn't like it when you change the NICs that come installed with the napp-it appliance (1x e1000 + 1x vmxnet3; so Gea, maybe 2x vmxnet3 for a 1.5e appliance? :) )

I changed the first NIC from e1000 to vmxnet3, which gave me problems assigning an IP. So I deleted both NICs and added two vmxnet3 adapters. OmniOS (or ESXi) screws up and reverses vmxnet3s0 and vmxnet3s1 (I think I also read about this somewhere on hardforum).
So my CIFS NIC is now not getting an IP.
I was able to check this by comparing the MAC addresses shown in OmniOS: in ESXi I mapped the first NIC to CIFS and the 2nd to NFS, which in OmniOS was the other way around :(
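The MAC comparison can be sketched in Python. Everything below is hypothetical illustration data: the MAC addresses are invented, and the two dictionaries would have to be filled in by hand from the VM settings in ESXi and from the guest (for example from `dladm show-phys -m`). The normalization step is there because the guest may print octets without leading zeros.

```python
from typing import Dict


def normalize_mac(mac: str) -> str:
    """Lower-case and zero-pad each octet, e.g. '0:C:29:A:B:1' -> '00:0c:29:0a:0b:01'."""
    return ":".join(part.zfill(2) for part in mac.strip().lower().split(":"))


def link_to_portgroup(esxi_nics: Dict[str, str], guest_links: Dict[str, str]) -> Dict[str, str]:
    """Map guest link name -> ESXi portgroup by matching MAC addresses."""
    by_mac = {normalize_mac(mac): portgroup for mac, portgroup in esxi_nics.items()}
    return {
        link: by_mac.get(normalize_mac(mac), "unknown")
        for link, mac in guest_links.items()
    }


# Hypothetical values, copied by hand from each side.
esxi_nics = {
    "00:0C:29:AA:BB:01": "CIFS",  # first adapter in the VM settings
    "00:0C:29:AA:BB:02": "NFS",   # second adapter in the VM settings
}
guest_links = {
    "vmxnet3s0": "0:c:29:aa:bb:2",
    "vmxnet3s1": "0:c:29:aa:bb:1",
}

mapping = link_to_portgroup(esxi_nics, guest_links)
```

With this sample data `mapping` shows vmxnet3s0 sitting on the NFS portgroup and vmxnet3s1 on CIFS, i.e. the reversed order described above.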

The craziness continued after messing with the NICs: when I used the vSphere client to shut down the VM I got a kernel dump.

Luckily I created a snapshot before I started messing with it, so I reverted and went ahead:

The way around this was to leave the e1000 NIC and configure it as management, map the 2nd NIC to CIFS, and add a 3rd NIC placed in the NFS portgroup. Now OmniOS matches the NICs nicely. (This should be easy to reproduce to check it out and compare results.)
The VM has been shut down about 5-6 times with no issues.

The other issue was upgrading the VMware Tools. (If you install the latest 6.x patches, VMware Tools is outdated.)
After messing with the NICs, this also gave me kernel dumps when I tried to upgrade the tools.
When I uninstalled the VMware Tools first, the installation went smoother.
(On my microserver, where I did not mess with the NICs, I was able to upgrade the VMware Tools with no problem.)

Planning to go and replace my prod OI VM with OmniOS this week, so we'll see what happens.



*EDITED*
A lot because I've been at it too long and my English was failing me..
 

socra

Yes, I heard. We'll have to wait and see if this is a smart move on the part of VMware: could be, must be, should be, hopefully will be...
 

socra

What do you mean by issues? The same or different? Please elaborate on what you're seeing or have done so far: what are you running, how did you install OmniOS, etc.
 

acmcool

Banned
Jun 23, 2015
611
76
28
36
Woodbury,MN
I used the napp-it all-in-one appliance to install. I am on ESXi 6.
When I do a reboot/shutdown through napp-it, the BE gets corrupted and it goes into maintenance mode.
I tried svcadm clear system/boot-archive,
but it does not help.
 

gea

I would suspect a broken ESXi or OmniOS installation.
Reinstall OmniOS via the OVA template.

If this does not help, try a new ESXi setup.
 

socra

Did you use the NICs (e1000 + vmxnet3) configured with the appliance, or did you change the NICs (added/removed NICs)?
When you redeploy the appliance, don't change the NICs, and then try to shut down the VM using the shutdown guest OS button.
 

socra

Just went ahead and did the following:

  1. Exported pools
  2. Shut down the OI VM
  3. Unmounted the NFS datastore
  4. Shut down all VMs
  5. Removed the M1015 from OI
  6. Gave the M1015 to the OmniOS appliance
  7. Shut down the ESXi host
  8. Started the ESXi host
  9. Started OmniOS
  10. Emptied the e1000 IP configuration because it was making OmniOS listen on the wrong NIC (couldn't ping my CIFS server)
  11. Rebooted the appliance; all good, OmniOS was now pingable. (Left the e1000 attached to the VM, just not connected. I've seen too many weird things when shuffling the NICs.)
  12. Imported pools
  13. Created a new NFS datastore
  14. So far so good. Installed a CentOS test VM on it, which went well. Keeping fingers crossed.
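For reference, the storage-related steps above could be written down as a checklist with one possible command-line equivalent per step. This is only a sketch under assumptions: the post does not say which tools were used, and the pool name, datastore name, host address and share path below are placeholders. The Python function just builds the list; it does not run anything.

```python
from typing import List, Optional, Tuple

# (description, command) pairs; command is None for steps done by hand.
Step = Tuple[str, Optional[str]]


def migration_plan(pools: List[str], datastore: str, nfs_host: str, share: str) -> List[Step]:
    """Build an ordered checklist for moving pools between appliances."""
    plan: List[Step] = []
    for pool in pools:
        plan.append((f"export pool {pool} on the old appliance", f"zpool export {pool}"))
    plan.append((
        "unmount the NFS datastore on the ESXi host",
        f"esxcli storage nfs remove --volume-name={datastore}",
    ))
    plan.append(("move the HBA to the new appliance and restart the host", None))
    for pool in pools:
        plan.append((f"import pool {pool} on the new appliance", f"zpool import {pool}"))
    plan.append((
        "re-create the NFS datastore on the ESXi host",
        f"esxcli storage nfs add --host={nfs_host} --share={share} --volume-name={datastore}",
    ))
    return plan


# Placeholder values for illustration only.
plan = migration_plan(["tank"], "nfs_ds", "192.168.1.10", "/tank/vm")
for number, (description, command) in enumerate(plan, start=1):
    print(f"{number}. {description}: {command or '(manual step)'}")
```

Keeping the manual steps in the same list makes it harder to skip the datastore unmount before the HBA is moved.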

I did see something strange with a failed disk that the alert log didn't mention, but I posted that on hardforum because it isn't related to the original issue: OpenSolaris derived ZFS NAS/ SAN (Nexenta*, OpenIndiana, Solaris Express) - Page 356 - [H]ard|Forum
 