Extreme latencies after upgrade to Nexenta CE 3.1.5


Andyreas

Member
Jul 26, 2013
Sweden
Hi all

I wrote the following on Nexenta's forum but had no luck there, so I started thinking: what awesome forum could I post this on that has knowledgeable people writing there...

-----
After a 30-hour work run trying to sort the problem, I've run into a dead end and need some help moving forward.

Yesterday I turned off my 25 VMs on my HP DL380 G7 server. One of the VMs is Nexenta with a passthrough LSI card running to an external Supermicro storage chassis. In it I have a RAIDZ2 array with two Intel enterprise SSDs as ZIL, an Intel 540 as read cache and a bunch of 2TB SATA drives. This has been running for over a year with excellent performance, but after my upgrade yesterday from 3.1.3.5 to 3.1.5 I am getting extremely sluggish performance. As soon as the workload starts to hit the NFS-mounted share from Nexenta I see latencies of 1-10 seconds instead of the normal few milliseconds.

I've tried reverting to 3.1.3.5, I've tried reinstalling Nexenta from the ground up, and I've tried upgrading ESXi to 5.5 and then back down to 5.1 again, with no apparent difference. I don't have a clue where to go from here. What logs should I look at to try to find the problem that started all this? I can see the latencies spike up in the Vserver charts, but that is about it.

I would be extremely thankful if someone could help me find a way forward in this mess.

UPDATE:
Some more info: after mapping the share via both CIFS and FTP, I realised the writes are very fast, saturating the gigabit network without any problem, but reads run at 200 KB/s and latencies climb into the 10-second range. I tried disconnecting the read cache (Intel 540 240GB), but that made no difference.

//Andreas
 

nle

Member
Oct 24, 2012
Could one (extreme?) solution be to abandon Nexenta and go for e.g. OmniOS with napp-it? It could be faster to just set up a new install and import the pool.
 

dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
Since reads are the problem, I like your idea of killing the read cache. Hopefully you "unconfigured" the cache and didn't just pull the plug. I would probably have killed both the cache and the ZIL in an attempt to get to a minimal configuration, and rebooted to flush any RAM caches.

But since that didn't work, and given that reverting your software change did not fix the issue, I would start to hunt for hardware problems. Swap out the cables first, and the card if you have a spare. If that doesn't work, think backplane and then drive problems. Yuck.
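
For reference, on an illumos-based ZFS box this kind of minimal configuration is usually reached with zpool remove; a rough sketch, where the pool name "tank" and the device names are only placeholders:

zpool status tank (list the pool layout first to get the exact cache/log device names)
zpool remove tank c2t3d0 (drop the L2ARC/cache device)
zpool remove tank mirror-1 (drop the separate log device; a mirrored ZIL shows up under "logs" as e.g. mirror-1)

Reboot afterwards so the ARC in RAM starts cold as well.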
 

ant

New Member
Jul 16, 2013
I've seen similar things when messing with jumbo frames. Did the upgrade enable jumbo frames, and was it still enabled after you downgraded? Or did you enable it yourself around the same time, either on Nexenta or on the client VMs (most likely Nexenta, since it seems most/all VMs have the issue)? Unless everything in the network (ESXi, all VMs, switches, anything else on the network) has jumbo frames enabled, it can cause headaches like this.

I've no experience with Nexenta, so I don't know if it is an appliance with just a GUI or if it has a user-accessible command line. Does it let you do local file operations within Nexenta itself? Even just a local file copy should let you rule out the network as the issue.
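
A quick way to do that locally, assuming the pool is mounted somewhere like /volumes/tank (the paths and sizes here are only examples):

dd if=/dev/zero of=/volumes/tank/nfs01/ddtest bs=1024k count=8192 (write an 8 GB test file; writes are reportedly fine anyway)
dd if=/volumes/tank/nfs01/ddtest of=/dev/null bs=1024k (read it back locally; use a file larger than RAM so the read can't be served from the ARC)

If the local read also crawls at a few hundred KB/s, the network is off the hook.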

If it is a jumbo frame issue - either disable it everywhere - or enable it everywhere.
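
Checking the MTU end to end only takes a minute; the link name and IP below are placeholders:

dladm show-linkprop -p mtu e1000g0 (on the Nexenta/illumos side; 1500 means jumbo frames are off)
esxcfg-vswitch -l (on the ESXi host; lists the vSwitches with their MTU)
vmkping -d -s 8972 192.168.0.10 (from the ESXi shell; a jumbo-sized, don't-fragment ping to the storage VM)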

If it is not a network issue, could it perhaps be a hardware failure that coincidentally happened around the time you upgraded? I've never had a drive fail with ZFS, but I expect that if one failed, or didn't completely fail but had interface issues or disk errors, it might cause slow reads. Though I would expect slow writes too, and you don't have that.

I assume Nexenta lets you run standard zfs and zpool commands on the command line or via the GUI. There should definitely be alarms or logs you can access.

Some things to try:

zpool status (this might indicate disk errors)
zpool iostat 2 (do this while doing data transfers - won't tell you much except pool throughput)

I won't suggest a zpool scrub at this stage as you have read issues and that might slow everything down for ages.

If Nexenta lets you run other Unix-type commands, these might help:

iostat -x 2 (do this while doing data transfers - lets you see throughput on each drive - handy to see if one is behaving differently to the others)
iostat -xne 2 (as above)
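
In the same spirit, and assuming Nexenta exposes the usual illumos tooling, the per-device error counters and the fault manager are worth a look:

iostat -En (cumulative soft/hard/transport error counts per drive since boot)
fmadm faulty (anything the fault manager has already diagnosed)
fmdump -eV (the raw error telemetry, if you want the gory details)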

Good luck, and I hope it's an easy fix.

Ant
 

Andyreas

Member
Jul 26, 2013
Sweden
Thanks all for the feedback and help!

@nle, actually that might be a good solution. I was just worried OmniOS wouldn't see the volume and be able to import it. I did reinstall Nexenta on a new VM hard drive from the ground up (damn, that isn't the fastest software to install) and imported the volume, but that didn't help.

@dba, I thought I could pretty much just pull the read-cache drive, but I never did; I unmounted it the correct way. I never restarted the server after that, though, so I will absolutely try that, and then do the same with the ARC if that doesn't help.

@ant, thanks a lot. Nexenta has both a command-line interface (NMC) and a GUI; I am mostly familiar with the GUI but will try those commands. I never touched the jumbo frames, still at 1500, but I double-checked since I had a feeling that could be the problem. No success there though.

Thanks again for the feedback! You are about 300% (I know, not correct math) better than the Nexenta forum.
 

gea

Well-Known Member
Dec 31, 2010
DE
I have not used NexentaStor for years, so I cannot comment on problems in the newest NexentaStor.

But you should not change several things at once when troubleshooting, like the NexentaStor release AND the ESXi release together; you are then not able to decide whether it is a Nexenta or an ESXi problem. If you only update Nexenta and the problems appear afterwards, I would expect the new Nexenta to be the problem. If you go back to the former release and the problem persists, it is a hardware problem that happened coincidentally at the same time. In such a case I would go to a minimal config without ZIL/cache devices and recheck.

Another option is to create a test pool, check it on Nexenta, and compare to another Solarish OS like OmniOS/OI/Solaris. You can import the pools on all of them (the only difference on Nexenta is a different default ZFS mountpoint). You may compare with my preconfigured napp-it appliance; it is a copy-and-run appliance (no installation needed, just download and import to ESXi).
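
For the comparison, a throwaway pool on a spare disk gives a quick like-for-like number on both systems; the device and pool names are only examples:

zpool create testpool c3t5d0 (spare disk, will be wiped)
dd if=/dev/zero of=/testpool/bench bs=1024k count=8192 (sequential write)
dd if=/testpool/bench of=/dev/null bs=1024k (sequential read; use a file larger than RAM or the ARC answers instead of the disks)
zpool destroy testpool (clean up afterwards)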
 

Andyreas

Member
Jul 26, 2013
Sweden
Hey again.

@gea, thanks a lot for the info, and I guess I learnt that lesson: only upgrade one thing at a time.

I installed the VM you set up, gea, as suggested above. No real trouble getting it up and running, and I imported the volume, but then I ran into trouble: the existing shares/folders where all my VMs live couldn't be found, just a lot of error messages saying "No such directory is found". I tried to create a new one instead, but then there was a chmod error, so I am now reverting to the Nexenta install and hoping I haven't screwed anything up.

Thanks again for the help!

My only problem is that since the read speed is so slow on the Nexenta side, I just want to get hold of my VM hard drives; then I can dump that setup altogether. But I can't figure out a fast enough way to export them.
 

gea

Well-Known Member
Dec 31, 2010
DE
NexentaStor mounts a pool under /volume, whereas all other systems mount it under / by default.
If you want to move pools from/to NexentaStor, you must modify the ZFS mountpoint property.

There are no other problems moving pools besides that
(unless you use newer Illumos features that are not available in NexentaStor 3).

PS
Have you imported the Nexenta pool with the napp-it pool import?
That should adjust the mountpoint; otherwise you need a zfs set mountpoint after the import.
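
A minimal sketch of that adjustment, assuming a pool named tank:

zfs get mountpoint tank (check where the imported pool wants to mount)
zfs set mountpoint=/tank tank (the usual default on OmniOS/OI; NexentaStor puts it under its /volume prefix instead)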
 

Andyreas

Member
Jul 26, 2013
Sweden
Hey gea

Actually, I imported with the napp-it import tool but still got that problem. I've fiddled around with the ACLs a lot in Nexenta to get stuff working though, so perhaps that is part of the problem. Hopefully the C6100 I'm ordering tomorrow will arrive in the next 10 days or so, and I'm going to redo this whole setup on OmniOS then. Wish I knew what I know now before I learnt it ;-) But on a happier note: I did a complete reinstall of ESXi, back down to 5.1 again, and that, with some extra fiddling, solved the problem.
 

gea

Well-Known Member
Dec 31, 2010
DE

I had stability problems when using e1000 on a default OmniOS VM under ESXi 5.5 as well.
For my appliance, it works either when using the VMXNET3 vNIC or when adding the following lines to
/kernel/drv/e1000g.conf (reboot required):

#Disable TCP segmentation offload (napp-it all-in-one)
tx_hcksum_enable=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;
lso_enable=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;

Maybe similar to this problem:
VMware KB: Possible data corruption after a Windows 2012 virtual machine network transfer
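
If editing e1000g.conf is not wanted, the VMXNET3 route gea mentions is configured on the VM side instead; to my knowledge the relevant .vmx entry (for the first network adapter, set while the VM is powered off) looks like this:

ethernet0.virtualDev = "vmxnet3" (switches the first vNIC from e1000 to vmxnet3; the guest then needs a vmxnet3 driver, typically via VMware Tools)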