ESXi 6.0 NFS with OmniOS Unstable - consistent APD on VM power off

TechIsCool · Aug 3, 2015

I don't have issues with my older pools. But that's just my experience @gea will know more.

whitey · Aug 3, 2015

nostradamus99 said:
Thanks Whitey,
Have to head off for work tomorrow will probably be back on the grid on wednesday..
Was also wondering if maybe the pool version of OI and OmniOS might be biting me: (OmniOS defaults to 5000?)
ZFS pool version dilemma with napp-it upgrade, data lost? - [H]ard|Forum

vol00 and vol02 currently under OI:

Pretty sure Illumos just uses 'feature flags' for versioning. I have never had issues taking an older pool version and importing it into a newer ZFS OS that supports a higher pool version. I DID however have issues taking a ver 31/33 pool (Solaris GA) into a system that only supported ver 28 (OI, if memory serves me correctly). I had to zfs snapshot and send/recv between pool versions instead to get my data in/out of whatever system I had up-revved the zpool on. You can choose either to upgrade to the newly supported feature sets or to stay at the pool's current capabilities, so I honestly don't think this is your issue in this case.

For reference on a couple of my pools I have:

status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
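
For anyone following along, a minimal sketch of how the pool version and feature flags could be checked before and after an import. The pool name vol00 comes from the quoted post; the run helper and the DRY_RUN variable are made up for this example, and by default the script only prints the commands instead of running them.

```shell
# Dry run by default: set DRY_RUN=0 on a real ZFS host to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Hypothetical helper: echo the command in dry-run mode, otherwise run it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_pool_version() {
    pool="$1"
    # Legacy pools report a number (e.g. 28); feature-flag pools report "-".
    run zpool get version "$pool"
    # Includes every feature@... property (disabled, enabled or active).
    run zpool get all "$pool"
    # With no arguments this only lists pools that could be upgraded; it changes nothing.
    run zpool upgrade
}

check_pool_version vol00
```

Nothing here upgrades the pool, so it should be safe to run on both the OI and the OmniOS side and compare the results.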

socra · Aug 3, 2015

Hmm..What I can't grasp is that I have another testvm (same gea appliance, same settings as my "to-be-production-vm", vmxnet3) when I attach a 5 GB VMDK to this VM and present it as an NFS share to my ESXi, I can create and run a VM without problems..
Of course I created a new pool on this VMDK so don't know why this works and my imported pool gives me APD errors when I try to start a VM.

TechIsCool · Aug 3, 2015

To me this is starting to feel more and more like a permissions issue. Either that or something is broken internally in the pool.

How much data is on the pool, and do you have enough disks to shuffle the data?

socra · Aug 3, 2015

The APD occurs even when doing an ls command from the esxi shell in /vmfs/volumes, does that also point to permissions?

I have 2 pools, each a 3-way mirror: 3 drives of 1 TB (NFS) and 3 drives of 2 TB (CIFS), passed through an M1015, so currently no drives available and unsure how to shuttle data safely between the two.

cperalt1 · Aug 4, 2015

If it is a 3 way ZFS mirror then you can use the zpool split command.

Adventures in ZFS: Splitting a Zpool | IT From All Angles

nostradamus99 said:
The APD occurs even when doing an ls command from the esxi shell in /vmfs/volumes, does that also point to permissions?

I have 2 pools, each a 3-way mirror: 3 drives of 1 TB (NFS) and 3 drives of 2 TB (CIFS), passed through an M1015, so currently no drives available and unsure how to shuttle data safely between the two.

TechIsCool · Aug 4, 2015

cperalt1 said:
If it is a 3 way ZFS mirror then you can use the zpool split command.

Adventures in ZFS: Splitting a Zpool | IT From All Angles

Not sure if I would do that because it sounds like there is no space for backups either.

@nostradamus99 What is your total usage for the ESXi datastore?

Also can you zfs send all your 1TB data to your 2TB drives?

cperalt1 · Aug 4, 2015

One scenario would be to take the 3-way mirror and split off one drive. At that point you can create a new single-disk pool on that drive and do a zfs send from the now two-drive pool to it. After this is done, "disconnect" the two-drive pool and see if the same issues appear; that way you keep your data safe without using any new drives. If it solves the issues, you can either add a new 1TB drive to the single-drive pool to make it a mirror, or gamble: detach a drive from the two-drive mirror and attach it to the single-drive pool, which triggers a resilver. Wait for that to complete before doing the same with the last drive, and you have a 3-drive mirror again.
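
A rough sketch of the first half of that shuffle, for reference. The pool names (tank, fresh), the device name (c2t2d0) and the snapshot name are placeholders, not taken from the thread, and the function only prints the commands so they can be reviewed before anything is typed on the real system.

```shell
plan_shuffle() {
    old="$1"; new="$2"; disk="$3"; snap="$4"
    # 1. Split one disk out of the 3-way mirror into its own (exported) pool.
    echo "zpool split ${old} ${old}_split ${disk}"
    # 2. Create a brand-new pool on that disk; -f because the disk still
    #    carries the labels of the split-off pool.
    echo "zpool create -f ${new} ${disk}"
    # 3. Copy everything from the remaining two-disk mirror to the new pool.
    echo "zfs snapshot -r ${old}@${snap}"
    echo "zfs send -R ${old}@${snap} | zfs receive -F -d ${new}"
    # 4. "Disconnect" the old pool, then test the NFS share from the new one.
    echo "zpool export ${old}"
}

plan_shuffle tank fresh c2t2d0 shuffle
```

Step 2 destroys the split-off copy, so from that point until the send finishes the data only exists on the two-disk mirror. Growing the new pool back into a mirror afterwards would be a zpool attach per disk, waiting for each resilver to complete.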

TechIsCool said:
Not sure if I would do that because it sounds like there is no space for backups either.

@nostradamus99 What is your total usage for ESXi datastore.

Also can you zfs send all your 1TB data to your 2TB drives?

TechIsCool · Aug 4, 2015

@cperalt1 The only problem I see with this is that when you break a 3-way mirror into a 2-way, don't you lose your protection disk, so if you have a failure you're out of luck? 3TB drives can be had for $60 and I would almost buy one just for backup purposes. ZFS send the data to that disk, then mess with what you need to do.

cperalt1 · Aug 4, 2015

I agree, picking up another drive to do the zfs send and receive would be the best. The OP stated that it was a 3-way mirror: 3 copies of the same data. I think you might be thinking of RAID-Z, in which case you would never use zpool split, as you would then be in a degraded state. A common use of a 3-way mirror would be, as an example, a monthly backup to take off site: attach a drive to make a 3-way mirror, resilver, zpool split, then take the third drive to off-site storage. Now you have a backup that you can zpool import on the same or another system.
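
The monthly off-site rotation described above could look roughly like this. Pool and device names are placeholders, and the function only prints the plan: the split must not be run until zpool status shows the resilver has finished.

```shell
plan_offsite() {
    pool="$1"; existing="$2"; extra="$3"; offsite="$4"
    # Attach the extra disk to an existing mirror member: 2-way becomes 3-way.
    echo "zpool attach ${pool} ${existing} ${extra}"
    # Re-check this until the resilver is reported as completed.
    echo "zpool status ${pool}"
    # Split the extra disk off as its own pool; it is left exported.
    echo "zpool split ${pool} ${offsite} ${extra}"
    # Later, on the same or another system, to read the backup:
    echo "zpool import ${offsite}"
}

plan_offsite tank c2t0d0 c2t3d0 offsite
```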

socra · Aug 5, 2015

I was reading my notes over and I don't think it's a permissions issue. When I imported the pool I used the reset ACL from Napp-IT...I was able to create and remove folders from the VMWare datastore. (using vsphere-client)

I'll try the following this evening:
Add a 5 GB VMDK to my OI installation. Create a pool on it and then export this pool.
Then I'll connect this VMDK to my OmniOS VM and Import it then try to create vm and start it.
If I get the same problem then I know it's not the combination with my LSI adapter.

@TechIsCool @cperalt1
What would be your idea for moving my data if I buy a 3TB drive?
Remember, all my disks are connected to my LSI M1015, which I can only connect to 1 VM at a time using VT-d passthrough.

cperalt1 · Aug 5, 2015

If you have a free port on the M1015 then you can just attach the new drive, create a new pool on it, snapshot your mirror, zfs send/receive to the new pool, and now you have a backup. If you don't have a free port then you can zpool split (it is a 3-way mirror and not RAID-Z?), physically replace one of the drives, create a pool on the new drive and zfs send/receive.
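
As a sketch of the free-port case: the names below (tank, backup, c3t0d0, the snapshot label) are assumptions for illustration only, and the function just prints the commands in order.

```shell
plan_backup() {
    src="$1"; dst="$2"; disk="$3"; snap="$4"
    # New single-disk pool on the freshly attached drive.
    echo "zpool create ${dst} ${disk}"
    # Recursive snapshot so every filesystem in the mirror is captured together.
    echo "zfs snapshot -r ${src}@${snap}"
    # -R sends all descendant filesystems with their properties and snapshots;
    # -d drops the source pool name so datasets land under the new pool.
    echo "zfs send -R ${src}@${snap} | zfs receive -F -d ${dst}"
    # Check that the snapshots arrived on the backup pool.
    echo "zfs list -r -t snapshot ${dst}"
}

plan_backup tank backup c3t0d0 pre-omnios
```

The source pool is only read from here, so this can be done before any split or detach is attempted.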

TechIsCool · Aug 5, 2015

Yup, what he said. Do you have an extra SATA HDD of any size lying around? 2.5 or 3.5 would work. If you do, I would just start with a simple pool with no backup and see if you get the APD when you mount that pool from ESXi. If you don't, you most likely need to play the shuffle game like we are talking about.

socra · Aug 5, 2015

I really have two 3 way mirrors... (maybe overkill but I haven't regretted it yet)

I do have a 320 GB SATA drive lying around that I can use. Dodged a large bullet just now... I tried to attach it live to my system because the LSI is hot-pluggable. I used a molex-to-SATA converter because I'm out of SATA power plugs... after attaching it, the power connector came loose and my system rebooted.

Now my server won't turn on if I have something connected to one of the molex connectors on that molex rail (the only connectors I have left).

I also had a 2.5 inch 160GB drive laying around..tried to connect it but server wouldn't start..disconnected the drive and server started immediately... OMG I freaked out..
So dunno if I can test anymore..looks like the molex rail on my PSU is fried... %$*#&$# damn..wish I never started this exercise...went from OmniOS testing to now having to find another PSU possibly..so mad at myself..

cperalt1 · Aug 5, 2015

In my opinion a 3-way mirror is not overkill for important data plus it should give better random reads. Also in a case like this it can let you do a shuffle with more security than a regular mirror.

whitey · Aug 5, 2015

TechIsCool said:
To me this is starting to feel more and more like a permissions issue. Either that or something is broken internally in the pool.

How much data is on the pool and do you have enough disks to shuffle the data.

Hence my suggestion to blow permissions wide open, I think Gea touched on this as well but OP still had issues. Very strange indeed.

TechIsCool · Aug 5, 2015

@whitey He did say he reset them but this is a really weird issue. I have never experienced APD myself but just extreme lag in my other thread.

Theoretically you could have had a bad converter cable, not a bad rail. I would try an old-school HDD if you have one lying around. If not, an older CD drive just plugged in for testing. Shut the server down first, then boot it back up so you don't crash it.

whitey · Aug 5, 2015

TechIsCool said:
@whitey He did say he reset them but this is a really weird issue. I have never experienced APD myself but just extreme lag in my other thread.

Theoretically you could have had a bad converter cable not a bad rail. I would try a oldschool HDD if you have one laying around. If not a older CD drive just plugged in for testing. Shut the server down first then boot it back up so you don't crash it.

LOL, I have a hard time keeping track of everyone on here, were you the poor soul who had horrific throughput/response until reseating your M1015? I suggested that as well on a WAY off chance. Yeah APD I can lookup the VMware KB to remember how to intelligently TS it.

EDIT: NM, that M1015 re-seat was JimPhreak

EDIT: Some light APD reading :-D

VMware KB: Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x

vSphere 5.0 Storage Features Part 8 - Handling the All Paths Down (APD) condition | VMware vSphere Blog - VMware Blogs

NetApp Knowledgebase - How to troubleshoot NFS APD (All-Paths-Down) issues on VMware ESXi

VMware KB: Troubleshooting connectivity issues to an NFS datastore on ESX and ESXi hosts

socra · Aug 5, 2015

I did try to connect an older CD drive with molex straight to the PSU, without the converter. The system would not power on as soon as I connected something to the molex connector after my debacle yesterday (so it could be some protection in the SeaSonic PSU itself that is getting triggered).
I'll try again tonight when I have more time. If it doesn't work I'll RMA my PSU, which has a 5 year warranty.
I also have another PSU laying around I can use to temporarily connect everything.

@whitey, thanks for the links. I have read most of them and will read the others as well.
Will post updates on progress

TechIsCool · Aug 6, 2015

@whitey Nope but I was the one having the crazy issues with Latency due to 840 Pros and Trash Collection.

@nostradamus99 Most power supplies have a PTC, which is basically a resettable fuse, so either that is having an issue or the rail is broken. I would ship the PSU back so it does not break anything else. Without knowing anything more, your motherboard might be having issues too. Could be part of the issue really. Hardware does funny things to software sometimes.
