ESXi 6.0 NFS with OmniOS Unstable - consistent APD on VM power off

Thread starter uto
Start date Jun 21, 2015

uto

New Member

#1

Hi,
Has anyone encountered stability issues with ESXi 6.0 and a datastore on OmniOS + napp-it where, in one very specific situation, the NFS datastore drops into the APD (All Paths Down) state? Everything else is fine: powering up a VM, restarting the guest, even a high-IOPS load test. The trigger is powering off any VM that has CBT turned on.

It was very difficult to troubleshoot. At first I thought it was my HBA (an LSI), then I suspected my SSD pool (a combo of Samsung 850 + Crucial MX200), and finally I narrowed it down (albeit not with 100% certainty) to the ESXi 6.0 NFS code being incompatible with OmniOS's NFS (and potentially other variants) under very specific conditions.

The symptom shown in the vmkernel.log is as follows. The following is shown immediately after issuing a Stop-VM command:

2015-06-22T00:35:30.788Z cpu0:65098)CBT: 1468: Disconnecting the cbt device 1e01de-cbt with filehandle 1966558
2015-06-22T00:35:45.158Z cpu1:32860)StorageApdHandler: 1204: APD start for 0x4304dc32efa0 [12860b4d-5de423fb]
2015-06-22T00:35:45.158Z cpu0:32907)StorageApdHandler: 421: APD start event for 0x4304dc32efa0 [12860b4d-5de423fb]
2015-06-22T00:35:45.158Z cpu0:32907)StorageApdHandlerEv: 110: Device or filesystem with identifier [12860b4d-5de423fb] has entered the All Paths Down state.
2015-06-22T00:36:02.158Z cpu0:32819)WARNING: SwapExtend: vm 65098: 426: Failed to truncate swapfile /vmfs/volumes/12860b4d-5de423fb/wlc1/vmx-wlc1-1912640390-1.vswp to 0 bytes: I/O error
2015-06-22T00:36:04.158Z cpu1:33213)NFSLock: 612: Stop accessing fd 0x4301b5496bd0 3
2015-06-22T00:36:04.158Z cpu1:33213)NFSLock: 612: Stop accessing fd 0x4301b549fff0 3
201
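A minimal sketch (not from the original post) for measuring how long after the CBT disconnect the first APD start shows up in a vmkernel.log excerpt. The function names are hypothetical, and the matched substrings are taken from the two log lines quoted above.

```python
from datetime import datetime

def parse_ts(line):
    # vmkernel.log lines begin with a UTC timestamp such as 2015-06-22T00:35:30.788Z
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%fZ")

def seconds_from_cbt_to_apd(lines):
    """Seconds between the last CBT disconnect and the next APD start, or None."""
    cbt_time = None
    for line in lines:
        if "CBT:" in line and "Disconnecting the cbt device" in line:
            cbt_time = parse_ts(line)
        elif "StorageApdHandler" in line and "APD start" in line and cbt_time is not None:
            return (parse_ts(line) - cbt_time).total_seconds()
    return None

sample = [
    "2015-06-22T00:35:30.788Z cpu0:65098)CBT: 1468: Disconnecting the cbt device 1e01de-cbt with filehandle 1966558",
    "2015-06-22T00:35:45.158Z cpu1:32860)StorageApdHandler: 1204: APD start for 0x4304dc32efa0 [12860b4d-5de423fb]",
]
gap = seconds_from_cbt_to_apd(sample)
print(gap)
```

On the excerpt above the gap comes out at a little over 14 seconds, which fits the description that APD follows directly after the power-off.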

Has anyone else encountered this?

regards,
Uto

gea

Well-Known Member

#2

can you compare

- e1000 vnic vs vmxnet3
- another lsi firmware (mainly with p20)
- OmniOS stable

- ESXi 5.5U2

disable jumbo frames or trunking
try NetApp Knowledgebase - How to troubleshoot NFS APD (All-Paths-Down) issues on VMware ESXi

whitey

Moderator

#3

Dunno if I have too much to add, but for reference I have the following config and it is ROCK solid (minus napp-it for now)

OmniOS v11 r151010
vmxnet3 vnic's
LSI 2008 ctrls VT-d passthrough to OmniOS (IT mode v19 FW)
Vmware tools installed
Jumbo frames turned up end-to-end (I have 10G network so pursued this)
ESXi 6.0 build 2494585

Now I will admit I have had a kernel panic or two over several years that brought holy hell upon me to get it fixed. No data loss at all but a major PITA to fix. First time it must have taken me hours, second time just 30 mins or so w/ lessons learned. Something along the lines of Omni giving up the ghost on NFS and having to bring it back to life via svcadm/svcs cmds, sometimes hunting down the offending .lck file on the NFS volume, and finally doing some nasty esxcli storage nfs bla blah unmount and remount cmds as the GUI just would NOT let me get them mounted in again, even after HEAVY bashing over head. Had to do it a couple times and have it documented somewhere if anyone needs this tidbit of saving grace if you happen to run into this.

Last edited: Jul 16, 2015

leftside

New Member

#4

I am having the same issue. I decided to use OmniOS instead of OpenIndiana, but I'm going to install OpenIndiana 151a8 to see if I have the same issue.

whitey

Moderator

#5

leftside said:
I am having the same issue. I decided to use OmniOS instead of OpenIndiana, but I'm going to install OpenIndiana 151a8 to see if I have the same issue.

I ain't going anywhere from OmniOS...maybe SmartOS/Joyent but no OI for me. They really seem to lag in Illumos tree merges/feature flags. This happens so rarely and I know how to fix it now that I don't even sweat it anymore. Although I must say I had a Solaris 11 GA ZFS hybrid array up for some 1300+ days running about 300 VDI sessions that never had one hiccup. I was impressed.

Another DR ZFS array of mine at the co-lo. 1K day uptime club boi!!!

root@sumodr01:/tank/data/surveillancecam# uptime
9:08pm up 1009 day(s), 20:42, 8 users, load average: 0.02, 0.01, 0.01

Last edited: Jul 16, 2015

Reactions: Magnus919

socra

Member

#6

@whitey,
Can you please share your tidbit documentation on how to bring your host back to life in case of kernel panics?

Facing horrors right now... I want to move from my rock-solid OI 151a7 with napp-it to OmniOS, but so far no luck with my ESXi 5.5 U2 machine. Using an IBM M1015 with P19 IT firmware.
Every time after I try to start a VM I get an APD error:

Code:

2015-07-30T21:28:09.387Z cpu0:32780)StorageApdHandler: 421: Device or filesystem with identifier [10ec8dec-c323e318] has exited the All Paths Down state.
2015-07-30T21:28:09.387Z cpu0:32780)StorageApdHandler: 912: APD Exit for ident [10ec8dec-c323e318]!
2015-07-30T21:28:20.003Z cpu0:33843)World: 14302: VC opID 4E57F36D-0000057C maps to vmkernel opID 1291dd9d
2015-07-30T21:28:25.817Z cpu0:32790)StorageApdHandler: 265: APD Timer started for ident [10ec8dec-c323e318]
2015-07-30T21:28:25.817Z cpu0:32790)StorageApdHandler: 414: Device or filesystem with identifier [10ec8dec-c323e318] has entered the All Paths Down state.
2015-07-30T21:28:25.817Z cpu0:32790)StorageApdHandler: 856: APD Start for ident [10ec8dec-c323e318]!
2015-07-30T21:28:39.842Z cpu3:33841)WARNING: NFS: 2149: Failed to get attributes (I/O error)
2015-07-30T21:28:39.842Z cpu3:33841)NFS: 2212: Failed to get object 40 10ec8dec c323e318 f620e570 a4ac1b08 4000a 0 148 4000a 14800000000 0 0 0 :I/O error
2015-07-30T21:29:09.883Z cpu3:33841)WARNING: NFS: 2149: Failed to get attributes (I/O error)
2015-07-30T21:29:09.883Z cpu3:33841)NFS: 2212: Failed to get object 40 10ec8dec c323e318 f620e570 a4ac1b08 4000a 0 148 4000a 14800000000 0 0 0 :I/O error

I'm running the napp-it 1.5b appliance from Gea, and also tried upgrading that OmniOS appliance to the July 27 update (7648372), but no matter what I try, it doesn't work under ESXi 5.5 U2 patch 3, while OI runs without a hitch.
Going to catch a few Zzzz's and then try to reinstall my host with ESXi 6.0b (using a different USB stick so I can hop between 5.5 and 6.0).

Last edited: Jul 31, 2015

gea

Well-Known Member

#7

I have a couple of machines running with ESXi 5.5 u2 build 1331820 with u2 update
without any problem. Can you rule out a simple problem like a permission problem?

- enable NFS with defaults
- reset ACL recursively to everyone@=modify

you can use napp-it to reset acl in menu filesystems >> folder acl
below your acl listings, you find: reset acl's

In my own setup I have a testmachine with ESXi 6.0 GA
I have not yet had a basic NFS problem (but only under low load).
VMware ESXi Release and Build Number History | Virten.net

if this does not help, you may ask at omnios irc
omnios IRC logs [July 31 - 2015]

You need an irc client like
the addon chatzilla for firefox to chat

or the omnios mailing list
OmniOS-discuss Info Page

Last edited: Jul 31, 2015

socra

Member

#8

Hi Gea!
I'm running ESXi 5.5 at version 2143827.
This is what I did:
- Exported pools,
- Shutdown OI
- Disconnected the LSI from OI
- Connected the LSI to OmniOS VM
- Gave OmniOS VM the same name + ip addresses as OI (1 for CIFS, 1 for NFS)
- Imported the pools.
As soon as I did that, the shares also came back (NFS and CIFS).
I went into the datastore browser of ESXi to make sure that I could create a folder.
Will try to reset the NFS acl now...but getting a bit anxious with exporting and importing my pools constantly.

Last edited: Jul 31, 2015

gea

Well-Known Member

#9

You may need to

- unmount the NFS datastore in ESXi storage settings
- re-add the NFS datastore
- import the VMs (ESXi file browser, right-click the .vmx file)

Importing/exporting a pool between OI and OmniOS is uncritical as long as you do not activate new features in OmniOS.

socra

Member

#10

Tried it again..still no luck...
- Exported pools,
- Disconnected the IBM M1015 (LSI) from OI
- Connected the IBM M1015 (LSI) to OmniOS VM
- Gave OmniOS VM the same name + ip addresses as OI (1 for CIFS, 1 for NFS)
- Imported the pools.
- Removed VM's from inventory
- Unmounted NFS datastore from ESXi,
- Unshared and re-shared the NFS datastore from OmniOS
- Tried to start a VM after adding the VMX from the Datastore, this time I received an error:

Code:

Failed to start the virtual machine.
Cannot open the disk '/vmfs/volumes/fda1d3b3-ffd055e2/TESTVM/TESTVM.vmdk' or one of the snapshot disks it depends on.
Insufficient permission to access file

- Reset the ACL from napp-it to everyone@=modify
- Added a VMX from the datastore.
- Tried to start the VM: no error about permissions but NFS connection dies:

vmkernel.log (don't know which log I should look at in OmniOS)

Code:

2015-07-30T21:27:45.578Z cpu3:51327243)StorageApdHandler: 337: APD Timeout for ident [10ec8dec-c323e318]!
2015-07-30T21:27:45.700Z cpu2:33845)World: 14302: VC opID 4E57F36D-0000056B maps to vmkernel opID ec960be7
2015-07-30T21:28:00.002Z cpu1:3737403)World: 14302: VC opID 4E57F36D-0000056E maps to vmkernel opID ad5e3857
2015-07-30T21:28:09.387Z cpu0:32780)NFS: 325: Restored connection to the server 192.168.20.2 mount point /vol00/nfs-ds00, mounted as 10ec8dec-c323e318-0000-000000000000 ("nfs-ds00")
2015-07-30T21:28:09.387Z cpu0:32780)StorageApdHandler: 421: Device or filesystem with identifier [10ec8dec-c323e318] has exited the All Paths Down state.
2015-07-30T21:28:09.387Z cpu0:32780)StorageApdHandler: 912: APD Exit for ident [10ec8dec-c323e318]!
2015-07-30T21:28:20.003Z cpu0:33843)World: 14302: VC opID 4E57F36D-0000057C maps to vmkernel opID 1291dd9d
2015-07-30T21:28:25.817Z cpu0:32790)StorageApdHandler: 265: APD Timer started for ident [10ec8dec-c323e318]
2015-07-30T21:28:25.817Z cpu0:32790)StorageApdHandler: 414: Device or filesystem with identifier [10ec8dec-c323e318] has entered the All Paths Down state.
2015-07-30T21:28:25.817Z cpu0:32790)StorageApdHandler: 856: APD Start for ident [10ec8dec-c323e318]!
2015-07-30T21:28:39.842Z cpu3:33841)WARNING: NFS: 2149: Failed to get attributes (I/O error)
2015-07-30T21:28:39.842Z cpu3:33841)NFS: 2212: Failed to get object 40 10ec8dec c323e318 f620e570 a4ac1b08 4000a 0 148 4000a 14800000000 0 0 0 :I/O error
2015-07-30T21:29:09.883Z cpu3:33841)WARNING: NFS: 2149: Failed to get attributes (I/O error)
2015-07-30T21:29:09.883Z cpu3:33841)NFS: 2212: Failed to get object 40 10ec8dec c323e318 f620e570 a4ac1b08 4000a 0 148 4000a 14800000000 0 0 0 :I/O error
2015-07-30T21:29:37.392Z cpu0:32779)StorageApdHandler: 296: APD Timer killed for ident [10ec8dec-c323e318]
2015-07-30T21:29:37.392Z cpu0:32779)StorageApdHandler: 421: Device or filesystem with identifier [10ec8dec-c323e318] has exited the All Paths Down state.
2015-07-30T21:29:37.392Z cpu0:32779)StorageApdHandler: 912: APD Exit for ident [10ec8dec-c323e318]!
2015-07-30T21:29:37.707Z cpu1:3737403)World: 14302: VC opID 4E57F36D-00000585 maps to vmkernel opID e38a0e09
2015-07-30T21:29:37.877Z cpu2:51331982)NetPort: 1632: disabled port 0x6000021
2015-07-30T21:29:37.878Z cpu2:51331982)VSCSI: 6440: handle 13505(vscsi0:0):Destroying Device for world 51331985 (pendCom 0)
2015-07-30T21:29:37.878Z cpu2:51331982)VSCSI: 6440: handle 13506(vscsi0:1):Destroying Device for world 51331985 (pendCom 0)
2015-07-30T21:29:37.879Z cpu2:51331982)VSCSI: 6440: handle 13504(vscsi4:0):Destroying Device for world 51331985 (pendCom 0)
2015-07-30T21:29:37.897Z cpu1:32818)Net: 3354: disconnected client from port 0x6000021
2015-07-30T21:29:37.925Z cpu0:32816)WARNING: SwapExtend: vm 51331985: 403: Failed to truncate swapfile /vmfs/volumes/10ec8dec-c323e318/TESTVM/TESTVM-57bd918a.vswp to 0 bytes: No connection
2015-07-30T21:29:40.004Z cpu1:33843)World: 14302: VC opID hostd-a6a9 maps to vmkernel opID dda3aae5
2015-07-30T21:29:40.207Z cpu2:33352)NFSLock: 570: Start accessing fd 0x4108be70b8a8 again
2015-07-30T21:29:40.207Z cpu3:51327243)NFSLock: 570: Start accessing fd 0x4108be69a9c8 again

- Exported the pools again,
- Stopped Omni OS VM
- Disconnected the IBM M1015 (LSI) from OmniOS
- Connected the IBM M1015 (LSI) to OI VM
- Imported the pools.
- Add VM from inventory, start it...everything works again

(also no need to remove datastore from esxi etc)

@gea,
Do your 5.5 test machines also have an IBM M1015 (LSI) adapter to test the all-in-one setup?
I tried another napp-it 1.5b test VM and added a small NFS datastore to it (using a 5 GB vmdk). When I add this to my ESXi machine as a new NFS datastore, I can add and run a VM from there without issue.

As soon as I try it with one of my VM's running from my imported pools (behind LSI M1015) I get APD issues...

OI doesn't want me to leave it for OmniOS it seems...very frustrating

Last edited: Jul 31, 2015

gea

Well-Known Member

#11

There is no relation between HBA and filesystem (I use the IBM M1015 as well). But there is also no obvious relation between the Illumos release that created a filesystem and the NFS service.

If there is no problem with NFS on a newly created pool, you may copy a VM over to a new pool and import the VM to check this.

socra

Member

#12

Hmm, do you mean adding a separate vmdk as a pool to the omnios vm...?
I have two 3-way mirrors on my M1015 Adapter: (vol00 is for my VM's, vol02 is CIFS)

Code:

 pool: vol00
state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
    still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
    pool will no longer be accessible on software that does not support feature
    flags.
  scan: scrub repaired 0 in 6h30m with 0 errors on Tue Jul 28 09:45:45 2015
config:

    NAME                       STATE     READ WRITE CKSUM     CAP            Product
    vol00                      ONLINE       0     0     0
      mirror-0                 ONLINE       0     0     0
        c6t5000  ONLINE       0     0     0     1 TB           ST1000DM003-1CH1
        c6t50024  ONLINE       0     0     0     1 TB           SAMSUNG HD103SJ
        c6t5000  ONLINE       0     0     0     1 TB           ST1000DM003-1CH1

errors: No known data errors

  pool: vol02
state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
    still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
    pool will no longer be accessible on software that does not support feature
    flags.
  scan: scrub repaired 0 in 5h30m with 0 errors on Thu Jul 30 08:30:44 2015
config:

    NAME                       STATE     READ WRITE CKSUM     CAP            Product
    vol02                      ONLINE       0     0     0
      mirror-0                 ONLINE       0     0     0
        c6t5000C5004014  ONLINE       0     0     0     2 TB           ST2000DM001-9YN1
        c6t5000C500401E  ONLINE       0     0     0     2 TB           ST2000DM001-9YN1
        c6t5000C500504  ONLINE       0     0     0     2 TB           ST2000DM001-1CH1

The OI and Omni OS VM's are running from a local SSD..(about 119GB free)

So I don't have any disks to create new pool:
Create Pool

no disks available!

Also found some more errors in the esxi logs during the NFS disconnects:
/var/log/vobd.log

Code:

2015-07-31T12:26:51.314Z: No correlator for vob.vmfs.nfs.server.disconnect
2015-07-31T12:27:23.162Z: [APDCorrelator] 19351682950720us: [vob.storage.apd.timeout] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2015-07-31T12:27:23.162Z: [APDCorrelator] 19352111849042us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2015-07-31T12:27:47.643Z: [APDCorrelator] 19351707431839us: [vob.storage.apd.exit] Device or filesystem with identifier [fda1d3b3-ffd055e2] has exited the All Paths Down state.
2015-07-31T12:27:47.643Z: [APDCorrelator] 19352136330656us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [fda1d3b3-ffd055e2] has exited the All Paths Down state.
2015-07-31T12:27:47.644Z: No correlator for vob.vmfs.nfs.server.restored
2015-07-31T12:27:47.644Z: [vmfsCorrelator] 19352136330995us: [esx.problem.vmfs.nfs.server.restored] 192.168.20.2 vol00/nfs-ds00 fda1d3b3-ffd055e2-0000-000000000000 nfs-ds00
2015-07-31T12:28:02.417Z: [APDCorrelator] 19351722204866us: [vob.storage.apd.start] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down state.
2015-07-31T12:28:02.417Z: [APDCorrelator] 19352151103979us: [esx.problem.storage.apd.start] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down state.
2015-07-31T12:29:21.416Z: [APDCorrelator] 19351801202190us: [vob.storage.apd.exit] Device or filesystem with identifier [fda1d3b3-ffd055e2] has exited the All Paths Down state.
2015-07-31T12:29:21.416Z: [APDCorrelator] 19352230103082us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [fda1d3b3-ffd055e2] has exited the All Paths Down state.
2015-07-31T12:30:57.417Z: [APDCorrelator] 19351897201489us: [vob.storage.apd.start] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down state.
2015-07-31T12:30:57.417Z: [APDCorrelator] 19352326104524us: [esx.problem.storage.apd.start] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down state.
2015-07-31T12:32:45.421Z: No correlator for vob.vmfs.nfs.server.disconnect
2015-07-31T12:32:45.421Z: [vmfsCorrelator] 19352434108599us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.20.2 vol00/nfs-ds00 fda1d3b3-ffd055e2-0000-000000000000 nfs-ds00
2015-07-31T12:33:17.422Z: [APDCorrelator] 19352037202789us: [vob.storage.apd.timeout] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2015-07-31T12:33:17.422Z: [APDCorrelator] 19352466109028us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2015-07-31T12:37:23.278Z: [APDCorrelator] 19352283053449us: [vob.storage.apd.exit] Device or filesystem with identifier [fda1d3b3-ffd055e2] has exited the All Paths Down state.
2015-07-31T12:37:23.278Z: [APDCorrelator] 19352711965334us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [fda1d3b3-ffd055e2] has exited the All Paths Down state.
2015-07-31T12:37:23.278Z: [vmfsCorrelator] 19352711965339us: [esx.problem.vmfs.nfs.server.restored] 192.168.20.2 vol00/nfs-ds00 fda1d3b3-ffd055e2-0000-000000000000 nfs-ds00
2015-07-31T12:37:23.278Z: No correlator for vob.vmfs.nfs.server.restored
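To see how long each outage lasted, the start and exit events in a vobd.log excerpt like the one above can be paired up. This is only a rough sketch: the function name `apd_windows` is made up, and it deliberately looks at the `vob.storage.apd.*` events only, on the assumption that the `esx.*` lines are duplicates of the same event.

```python
import re
from datetime import datetime

# Matches only the vob.* APD start/exit events; esx.* duplicates are ignored.
EVENT = re.compile(
    r"^(\S+?)Z: \[APDCorrelator\] \d+us: \[vob\.storage\.apd\.(start|exit)\]"
)

def apd_windows(lines):
    """Return a list of (start, exit, seconds) for each start/exit pair."""
    windows, started = [], None
    for line in lines:
        m = EVENT.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if m.group(2) == "start":
            started = ts
        elif started is not None:
            windows.append((started, ts, (ts - started).total_seconds()))
            started = None
    return windows

sample = [
    "2015-07-31T12:28:02.417Z: [APDCorrelator] 19351722204866us: [vob.storage.apd.start] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down state.",
    "2015-07-31T12:28:02.417Z: [APDCorrelator] 19352151103979us: [esx.problem.storage.apd.start] Device or filesystem with identifier [fda1d3b3-ffd055e2] has entered the All Paths Down state.",
    "2015-07-31T12:29:21.416Z: [APDCorrelator] 19351801202190us: [vob.storage.apd.exit] Device or filesystem with identifier [fda1d3b3-ffd055e2] has exited the All Paths Down state.",
]
windows = apd_windows(sample)
for start, end, seconds in windows:
    print(start.time(), "->", end.time(), f"{seconds:.3f}s")
```

For the three sample lines this yields a single window of roughly 79 seconds, which is below the 140-second APD timeout mentioned in the log.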

Last edited: Jul 31, 2015

socra

Member

#13

I wish VMware had the same article for napp-it instead of NetApp

VMware KB: NFS connectivity issues on NetApp NFS filers on ESXi 5.x/6.0

I could try lowering the queue depth, but that doesn't explain why I can use the same napp-it appliance with a small 5 GB local NFS datastore.
Also, I'm not sure I've seen any reports where lowering this actually helped.

Just checked my settings and my current NFS queue depth is:

esxcfg-advcfg -g /NFS/MaxQueueDepth
Value of MaxQueueDepth is 4294967295
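A quick check of that number: it is exactly the largest value an unsigned 32-bit integer can hold. Reading it as "no limit configured" rather than as a real queue depth is an assumption, not something the command output itself states.

```python
value = 4294967295  # as reported by esxcfg-advcfg above

# Compare against the maximum of an unsigned 32-bit integer.
is_uint32_max = value == 2**32 - 1
print(is_uint32_max)
print(hex(value))
```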

Although.......according to Netapp this issue is not related to NetApp only..:
Heads Up! NetApp NFS Disconnects - CormacHogan.com

Last edited: Jul 31, 2015

socra

Member

#14

Tried to edit NFS queue depth to 64 ..no luck.

- Exported the pools again,
- Stopped OI VM
- Disconnected the IBM M1015 (LSI) from OI
- Connected the IBM M1015 (LSI) to OmniOS
- Removed all the VM's from inventory
- Edited the NFS queue depth to 64
- Rebooted the host (200+ days uptime gone)
- Imported the pools.
- Added the Datastore (same ip as the OI vm)
- Add VM from inventory, started it...All Paths Down:

Code:

vmkernel.log:
2015-08-01T06:44:29.611Z cpu2:37151)WARNING: NetDVS: 547: portAlias is NULL
2015-08-01T06:44:29.611Z cpu2:37151)Net: 2312: connected XPVM001 eth0 to VM Network, portID 0x2000006
2015-08-01T06:44:29.631Z cpu3:35213)Config: 346: "SIOControlFlag2" = 0, Old Value: 1, (Status: 0x0)
2015-08-01T06:44:29.643Z cpu1:35213)World: 14302: VC opID 4571B39B-000000B0 maps to vmkernel opID 8e4ab9ac
2015-08-01T06:44:37.081Z cpu2:37151)NetPort: 1426: enabled port 0x2000006 with mac 00:50:56:8d:3e:22
2015-08-01T06:44:40.004Z cpu1:35213)World: 14302: VC opID hostd-2e49 maps to vmkernel opID bc355319
2015-08-01T06:44:54.494Z cpu1:33841)World: 14302: VC opID 4571B39B-000000C4 maps to vmkernel opID 86543c76
2015-08-01T06:44:58.688Z cpu2:33845)StorageApdHandler: 265: APD Timer started for ident [10ec8dec-c323e318]
2015-08-01T06:44:58.688Z cpu2:33845)StorageApdHandler: 414: Device or filesystem with identifier [10ec8dec-c323e318] has entered the All Paths Down state.
2015-08-01T06:44:58.688Z cpu2:33845)StorageApdHandler: 856: APD Start for ident [10ec8dec-c323e318]!
2015-08-01T06:45:00.003Z cpu0:33844)World: 14302: VC opID 4571B39B-000000C9 maps to vmkernel opID eb27bcf9
2015-08-01T06:45:01.365Z cpu2:33843)World: 14302: VC opID hostd-f9e1 maps to vmkernel opID d4088771
2015-08-01T06:45:10.708Z cpu1:32780)NFSLock: 610: Stop accessing fd 0x4108be6df3f8  3
2015-08-01T06:45:10.708Z cpu1:32780)NFSLock: 610: Stop accessing fd 0x4108be6dd8c8  3
2015-08-01T06:45:10.709Z cpu1:32780)NFSLock: 610: Stop accessing fd 0x4108be6de578  3
2015-08-01T06:45:10.709Z cpu1:32780)NFSLock: 610: Stop accessing fd 0x4108be704108  3

Code:

vobd.log:
2015-08-01T06:41:04.642Z: [vmfsCorrelator] 539329275us: [esx.problem.vmfs.nfs.server.restored] 192.168.20.2 /vol00/nfs-ds00 10ec8dec-c323e318-0000-000000000000 nfs-ds00
2015-08-01T06:44:58.688Z: [APDCorrelator] 773360249us: [vob.storage.apd.start] Device or filesystem with identifier [10ec8dec-c323e318] has entered the All Paths Down state.
2015-08-01T06:44:58.689Z: [APDCorrelator] 773375883us: [esx.problem.storage.apd.start] Device or filesystem with identifier [10ec8dec-c323e318] has entered the All Paths Down state.
2015-08-01T06:46:05.606Z: [APDCorrelator] 840276519us: [vob.storage.apd.exit] Device or filesystem with identifier [10ec8dec-c323e318] has exited the All Paths Down state.
2015-08-01T06:46:05.606Z: [APDCorrelator] 840293616us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [10ec8dec-c323e318] has exited the All Paths Down state.

Even when I do a simple "dir listing", the NFS path dies:
cd /vmfs/volumes
ls

Code:

vobd.log:
2015-08-01T06:59:23.026Z: [APDCorrelator] 1637678249us: [vob.storage.apd.start] Device or filesystem with identifier [10ec8dec-c323e318] has entered the All Paths Down state.
2015-08-01T06:59:23.026Z: [APDCorrelator] 1637713115us: [esx.problem.storage.apd.start] Device or filesystem with identifier [10ec8dec-c323e318] has entered the All Paths Down state.

Reverted the NFS changes to original, rebooted the host and went back to OI, everything OK again ! I checked the OI NIC config and noticed that it's configured "wrong" because I don't use jumbo frames (so OI is working fine with this config while OmniOS with MTU 1500 barfs when even doing a ls command) :
OI:
vmxnet3s1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 9000 index 5
inet 192.168.20.2 netmask ffffff00 broadcast 192.168.20.255
ether 0:50:56:ac:a:ab
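Since the MTU here turned out to be different from what was expected, a small sketch for pulling the MTU per interface out of pasted `ifconfig` output may help when comparing the OI and OmniOS VMs. The function name `interface_mtus` and the expected value of 1500 are illustrative assumptions; the sample text is the output quoted above.

```python
import re

def interface_mtus(ifconfig_text):
    """Map interface name -> MTU from illumos-style ifconfig output."""
    pattern = re.compile(r"^(\S+): flags=\S+ mtu (\d+)", re.MULTILINE)
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(ifconfig_text)}

sample = """vmxnet3s1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 9000 index 5
inet 192.168.20.2 netmask ffffff00 broadcast 192.168.20.255
ether 0:50:56:ac:a:ab
"""

mtus = interface_mtus(sample)
expected = 1500  # assumed target MTU when jumbo frames are not in use
mismatched = {name: mtu for name, mtu in mtus.items() if mtu != expected}
print(mismatched)
```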

gea

Well-Known Member

#15

I would always start with a default config (mtu=1500) on OmniOS and ESXi - optionally with an e1000 vnic

socra

Member

#16

I used your appliance with default MTU=1500. Did not try e1000, only VMXNet3 (because I also use it with OI)
With OI I never set the MTU=9000, don't know why that is..but it works that is the strange part.

If there is no problem with NFS on a newly created pool, you may copy a VM over to a new pool and import the VM to check this.

Hmm, do you mean adding a separate vmdk as a pool to the omnios vm...?
I have two 3-way mirrors on my M1015 Adapter: (vol00 is for my VM's, vol02 is CIFS)

The OI and Omni OS VM's are running from a local SSD..(about 119GB free)

So I don't have any disks to create new pool:
Create Pool

no disks available!

Last edited: Aug 1, 2015

TechIsCool

Active Member

#17

Alright when it throws an APD can you still ping across the network adapter from host to VM and from VM to VM? Also you have not said if you have another network adapter in the VM that could be taking priority. Also do you have the NFS set via DNS or via IP Address?

socra

Member

#18

Hi
Well, there is not much to ping... the APD starts as soon as I try to start a VM which is located on the NFS share presented through the imported pool on OmniOS.
I have 2 NICS in OmniOS, 1 for CIFS/Management and 1 for NFS.
The nic I use for NFS is internal (virtual 10G) I've always used the IP address to connect to the NFS server..
I don't know if I checked whether the NFS ip address for OmniOS is still reachable when the APD problems start.

Last edited: Aug 2, 2015

whitey

Moderator

#19

OK, sorry for the delay, let me gather up my notes and I will post back later today what resolves this for me.

Let's start w/ the KISS principle as Gea suggested. Start w/ a SINGLE e1000 vnic for NFS, ENSURE that it is not sending jumbo packets, and also that the switch interface it is hooked to is configured for non-jumbo 1500 MTU. Then look over your vmkernel ports (vmk1/2/3, whatever your dedicated vmkernel for NFS access is), check that they are configured for std frame sizes (1500) and that the vSwitch is also at 1500. Then vmkping between the hosts' vmkernel interfaces and back to the array interfaces. If this all works, then it would seem to be an OS-specific config/mis-config, possibly w/ permissions. I sometimes take a very heavy-handed approach to NFS permissions w/ Illumos and just blow them wide open, as I stub/non-route my NFS/stg VLANs, so no risk there.

zfs create poolname/nfs
zfs set sharenfs=on poolname/nfs
share -F nfs -o rw /poolname/nfs
chmod -R 777 /poolname/nfs

Start there, I will gather my 'ohh hell no' notes up and see if I have any other tidbits of knowledge to share or get you ironed out.
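For the vmkping check, a common way to confirm the MTU end to end is to send a ping with the don't-fragment bit set and the largest payload that still fits in one frame. A minimal sketch of the arithmetic, assuming plain IPv4 without options (20-byte IP header plus 8-byte ICMP header); the target address is the NFS server IP seen in the logs earlier in the thread, and the helper name is made up:

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header

def max_ping_payload(mtu):
    """Largest ICMP payload that fits in a single unfragmented frame."""
    return mtu - IPV4_HEADER - ICMP_HEADER

# Print the vmkping invocation for standard and jumbo frames.
# -d sets the don't-fragment bit, -s sets the payload size.
for mtu in (1500, 9000):
    print(f"vmkping -d -s {max_ping_payload(mtu)} 192.168.20.2  # MTU {mtu}")
```

If the 1472-byte ping works but an 8972-byte one fails, some hop on the path is not passing jumbo frames.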

socra

Member

#20

Thanks Whitey,
Have to head off for work tomorrow, will probably be back on the grid on Wednesday.
Was also wondering if maybe the pool version of OI and OmniOS might be biting me (OmniOS defaults to 5000?):
ZFS pool version dilemma with napp-it upgrade, data lost? - [H]ard|Forum

vol00 and vol02 currently under OI:
