Shared storage ideas...vSphere 6.5


SlickNetAaron

Member
Apr 30, 2016
50
13
8
43
I think you found the log that tells you the problem, and you confirmed it with the iSCSI decoder ring, but yer not looking in the right place to get the details... I don't think this is a network issue at all, IMHO. I could be dead wrong!

VMware Knowledge Base

vSphere Documentation Center

So my thought is that there is some soft limit to the LUN utilization configured... somewhere... I'm not familiar with where or how that is set, but yer definitely hitting it. If your storage says "OUT OF SPACE" the datastore/LUN goes offline and VMware has to pause all machines on the datastore to prevent data loss.

Also, the size of the ZFS file system/zvol does not necessarily equal the size of the datastore. You can easily increase the size on the ZFS side, but you can't use the space until you expand the datastore in VMware.

What does the datastore capacity and utilization look like on the host?
 

BSDguy

Member
Sep 22, 2014
168
7
18
53
I think you found the log that tells you the problem, and you confirmed it with the iSCSI decoder ring, but yer not looking in the right place to get the details... I don't think this is a network issue at all, IMHO. I could be dead wrong!

VMware Knowledge Base

vSphere Documentation Center

So my thought is that there is some soft limit to the LUN utilization configured... somewhere... I'm not familiar with where or how that is set, but yer definitely hitting it. If your storage says "OUT OF SPACE" the datastore/LUN goes offline and VMware has to pause all machines on the datastore to prevent data loss.

Also, the size of the ZFS file system/zvol does not necessarily equal the size of the datastore. You can easily increase the size on the ZFS side, but you can't use the space until you expand the datastore in VMware.

What does the datastore capacity and utilization look like on the host?
Thanks for the links.

First, an update! On Wednesday, I removed jumbo frames from ALL of the storage network. On the vDS, the VMkernel ports and the FreeNAS storage NICs I set the MTU to 1500, and so far ALL of these errors have gone away:

Code:
WARNING: 192.168.61.3 (iqn.1998-01.com.vmware:esxi2-4d9a7f4c): no ping reply (NOP-Out) after 5 seconds; dropping connection
It's still early days but my storage hasn't disconnected since making this change. That doesn't mean it won't but at least all of the above errors (of which there were many) are no longer showing in the logs.
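For anyone wanting to double-check the same thing, the MTU can be verified end to end from the ESXi shell with something like the following (the vmk number and the FreeNAS IP are placeholders, adjust for your own setup):

Code:
# MTU on each VMkernel interface
esxcli network ip interface list

# MTU configured on the distributed switch
esxcli network vswitch dvs vmware list

# Ping the FreeNAS storage IP with don't-fragment set; 1472 bytes is the
# largest payload that fits in a standard 1500-byte frame
vmkping -I vmk1 -d -s 1472 <freenas_storage_ip>
The same test with -s 8972 is the usual way to prove a 9000-byte path actually works before trusting jumbo frames again.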

So moving on, I still have MANY warnings about disk space utilisation. This is odd, as the one volume I have is 4TB in size in FreeNAS and shows up as a 4TB datastore on vSphere, but I am only using about 700GB (or 17%), yet this LUN is still complaining about "Space utilization on thin-provisioned device".

There are thresholds you can set in FreeNAS globally and on each extent, but changing these hasn't helped. I've set the values really high, really low, 50% and zero, and the logs still keep getting spammed with "Space utilization on thin-provisioned device (naa.xx) exceeded configured threshold".

Is there anything else I can check re these disk space utilisation warnings?
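For what it's worth, two things that can be cross-checked here are whether ESXi really reports the LUN as thin-provisioned and how much space the backing zvol has actually consumed on the pool. A rough sketch, with placeholder device/pool/zvol names:

Code:
# On the ESXi host: the device details include a "Thin Provisioning Status" field
esxcli storage core device list -d <naa_device_id>

# On the FreeNAS box: compare the zvol's logical size with what it has really used
zfs list -o name,volsize,used,avail,refreservation <pool>/<zvol>
zpool list <pool>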
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
I use ConnectX-3 EN cards in point-to-point mode between a vSphere 6.5 host and a CentOS 7 host serving NFS. Rock solid...
Concur, I NEVER have issues w/ NFS-backed (FreeNAS in my case) storage served up to vSphere, and performance is great to boot!
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
Thanks for the links.

First, an update! On Wednesday, I removed jumbo frames from ALL of the storage network. On the vDS, the VMkernel ports and the FreeNAS storage NICs I set the MTU to 1500, and so far ALL of these errors have gone away:

Code:
WARNING: 192.168.61.3 (iqn.1998-01.com.vmware:esxi2-4d9a7f4c): no ping reply (NOP-Out) after 5 seconds; dropping connection
It's still early days but my storage hasn't disconnected since making this change. That doesn't mean it won't but at least all of the above errors (of which there were many) are no longer showing in the logs.

So moving on, I still have MANY warnings about disk space utilisation. This is odd, as the one volume I have is 4TB in size in FreeNAS and shows up as a 4TB datastore on vSphere, but I am only using about 700GB (or 17%), yet this LUN is still complaining about "Space utilization on thin-provisioned device".

There are thresholds you can set in FreeNAS globally and on each extent, but changing these hasn't helped. I've set the values really high, really low, 50% and zero, and the logs still keep getting spammed with "Space utilization on thin-provisioned device (naa.xx) exceeded configured threshold".

Is there anything else I can check re these disk space utilisation warnings?
Tee hehe, now you see why a LOT of us despise iSCSI, loads of precious time pissed away all for not a whole lot of bang for buck 'juice for squeeze' gained :-D
 

BSDguy

Member
Sep 22, 2014
168
7
18
53
Tee hehe, now you see why a LOT of us despise iSCSI, loads of precious time pissed away all for not a whole lot of bang for buck 'juice for squeeze' gained :-D
Glad everyone's enjoying my pain ;-)

I'm not ready to throw in the towel just yet and move over to NFS haha

This entire troubleshooting exercise has taught me loads, and I'm just wondering if all the issues I had with StarWind Virtual SAN were due to the jumbo frames too.

Will be interesting to see how the next few days/weeks go. Been keeping my eye on the logs and (touch wood and all that) it's looking better.
 
  • Like
Reactions: whitey

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
Imma' bout to drop some knowledge on ya...think I know what the issue very likely is...cliffhanger/teaser...beg for it lol j/k, gimme a few to get to my computer so I don't beat my nubs to death on my phone
 
  • Like
Reactions: Monoman

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
Ok bud, here's what I bet 'dollars to donuts' the issue you have encountered is...dunno why in the HELL I didn't think of this before...

You're using a direct-connect NIC-to-NIC setup w/ DACs...guess what, at the data rates you're pushing w/ jumbo in the mix I BET those NICs (or any NICs for that matter) have a damn tough time keeping up w/ that throughput/pps...sh|t packet buffers. A switch provides VERY specialized ASICs, and most if not all have either dedicated buffers assigned per port or dynamic queues depending on the vendor, but they all hit that ASIC, which has sufficient memory/packet buffers to deal w/ the rate of throughput/packets per second that iSCSI demands. That's GOTTA be it. Some switches even suffer this issue when they are not up to the task. AKA, all switches are NOT created equal.

There has been raging debate on what classifies a 'good switch for IP SAN duties' and some can hack it...some cannot. For light IP SAN duties, a switch w/ say 6MB of shared buffers (3MB ingress/3MB egress) WILL struggle when you pound it with steady-state IP SAN traffic; a switch w/ say 12-36MB (or greater) of memory dedicated to packet buffering will do MUCH better. Mystery solved :-D

Here's some light reading to 'enlighten' you.

packet buffers
Router/Switch Buffer Size Issues
Help with switch Buffers please - Hewlett Packard Enterprise Community (the callout that an HP ProCurve 3500yl is a GOOD iSCSI switch/highly recommended is an eye opener)
https://people.ucsc.edu/~warner/Bufs/HP2920.pdf (search for 'packet buffer' keyword in this doc, of particular interest is this lil' callout 'Is designed with the latest ProVision ASIC, providing very low latency, increased packet buffering, and adaptive power consumption')
https://people.ucsc.edu/~warner/Bufs/ex4600-buffer.pdf (One last good one for Juniper anyways discussing how they deal w/ the issue - 'buffer pools')
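If the NICs in a direct-connect setup really are getting overrun, it ought to show up in their drop/error counters on each host. A quick way to eyeball that (the vmnic number is a placeholder):

Code:
# List the physical NICs on the host
esxcli network nic list

# Per-NIC counters; watch the receive drop and error counts under load
esxcli network nic stats get -n vmnicX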
 

BSDguy

Member
Sep 22, 2014
168
7
18
53
Well thanks for typing up such a great reply!

This is really interesting because I had an issue with iSCSI performance when I tried to use my Cisco SG300-28 switch for storage traffic. The performance was terrible (there's a post on this forum about it). That's why I direct-connected all my storage back in the day when I was still using 1Gb (to bypass the switch for all storage traffic). This solved all the performance issues. Later on I bought 10Gb NICs for storage only, but I always thought that the issue with buffer sizes applied to switches only for some reason and never considered that it could affect the NICs too. Having said that, most of the time the lab is idle and I only push it to its limit when benchmarking and/or doing disk IO tasks like Citrix MCS or installing Windows Updates on all VMs at the same time.

So the $64k question is, will direct connect with the NICs I currently have be up to the task with jumbo frames OFF??

Appreciate the links and will check them out.
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
8Mb (1MB) if I interpret that correctly...GARBAGE for IP SAN duties unfortunately.

Cisco 300 Series Managed Switches Data Sheet

Only time will tell. Did those disconnects seem to happen more prevalently when you were pounding IP SAN/iSCSI usage, or just randomly as well? If it was random it may just take a while for whatever slim NIC queue to fill up and keel over. Regardless, I think w/out a switch that is up to the task you will always run this risk, probably less at standard frames (1500 bytes instead of jumbo 9K), but it's a guessing game.
 

BSDguy

Member
Sep 22, 2014
168
7
18
53
8Mb (1MB) if I interpret that correctly...GARBAGE for IP SAN duties unfortunately.

Cisco 300 Series Managed Switches Data Sheet
Yip, learned that the hard way!

When I still had jumbo frames enabled the disconnects, funnily enough, didn't happen at all while running some benchmarking tests. It didn't happen when I manually ran a Veeam full backup job either. I wasn't looking in the logs at that stage, but after the crash I noticed that there were hundreds of these:

Code:
WARNING: 192.168.61.3 (iqn.1998-01.com.vmware:esxi2-4d9a7f4c): no ping reply (NOP-Out) after 5 seconds; dropping connection
I just checked the log again while typing up this reply and there are no more of these warnings being logged.
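Side note on that warning: it appears to come from the FreeNAS target (ctld) when the ESXi initiator doesn't answer the target's NOP ping in time. Assuming the stock FreeBSD ctl target that newer FreeNAS builds use, the 5-second figure should correspond to a sysctl that can at least be inspected on the FreeNAS side:

Code:
# NOP ping timeout the target waits before dropping a connection
# (defaults to 5 seconds on stock FreeBSD ctl)
sysctl kern.cam.ctl.iscsi.ping_timeout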

The disconnect happened shortly after the Veeam backups started early in the morning, which I thought was odd. When I ran a test Veeam backup manually it did a full backup, which transfers about 300-400GB of data to the backup server, but when the job ran on schedule it ran an incremental, which only transferred about 40GB (or less) to the backup server, which is tiny.

Time will tell, I guess, what happens next. I have been eyeballing 10Gb switches but none seem practical for home use in my lounge.

I just hope that setting the MTU to 1500 for the storage network, and the fact that there are no more "dropping connection" warnings in the logs, means things might run OK....
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
You sir need a Juniper EX3300 I do believe...pretty quiet environmentally and a pretty robust lil' switch w/ 4 ports of 10GbE built in, so you KNOW they accounted for appropriate packet buffer sizes. Can be had for $300 if you hunt...$400 immediately available on eBay.

2 cents: GL w/ the 1500-byte frame setup on direct-connect...maybe that will hack it for your use case for the time being.

EDIT: THIS is the guy I got my initial EX3300 from, for $550-600 I believe a couple of yrs back; of course I am on an EX4300 now w/ 40GbE ports and have no issues there either.

Juniper EX3300 Juniper EX3300-24T Juniper Networks EX3300-24T | eBay
 

BSDguy

Member
Sep 22, 2014
168
7
18
53
What I should have asked in my previous post is:

If I have the disconnect happen again, what NICs should I consider for storage traffic if continuing to use direct connect?

PS: Thanks for Juniper mention!
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
No idea, I have not looked into specific (Mellanox, Intel, Chelsio, etc.) NIC specs WRT packet buffer sizes. Research it up and educate me :-D
 

gmitch64

New Member
Jan 21, 2017
10
1
3
59
Tee hehe, now you see why a LOT of us despise iSCSI, loads of precious time pissed away all for not a whole lot of bang for buck 'juice for squeeze' gained :-D

That's why I use Fibre Channel at home as well as at work. :)
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
Thanks for the comedic value/laughter @gmitch64

Zoning/Masking views/VSANs, N/F/E ports, WWNN/WWPN, etc...no thanks but if that's your cup of tea (or what you are experienced in) nothing wrong w/ that 'different strokes for different folks' :-D
 
  • Like
Reactions: audio catalyst

BSDguy

Member
Sep 22, 2014
168
7
18
53
So, touch wood and all that, I haven't had any disconnects since my last post, so I'm assuming all my pain was due to jumbo frames....
 

BSDguy

Member
Sep 22, 2014
168
7
18
53
Now that my storage seems to be stable, I would like to rethink how I configure the ZFS volumes on my FreeNAS server.

Next week some new SSDs will arrive so I'll have the following:

  • 4 x Samsung SM863 480GB SSDs
  • 4 x Samsung SM863 960GB SSDs
  • 2 x Samsung Pro 840/850 512GB SSDs
I was thinking of using a striped mirror for the 960GB drives, RAIDZ1 for the 480GB drives and a mirror for the 512GB drives. The 960GB volume will have most of my VMs running on it, the 480GB volume will be used for templates, images and the content library, and the 512GB volume will be spare.
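For illustration, the layout described above would look roughly like this at the zpool level (pool and disk names are placeholders; FreeNAS would normally build these through the GUI using gptid labels rather than raw da devices):

Code:
# 4 x 960GB as a striped mirror (two 2-way mirrors) - main VM pool
zpool create vmpool mirror da0 da1 mirror da2 da3

# 4 x 480GB as RAIDZ1 - templates, images, content library
zpool create mediapool raidz1 da4 da5 da6 da7

# 2 x 512GB Pro drives as a plain mirror - spare
zpool create scratchpool mirror da8 da9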

Is this a good approach for ZFS and VMs?
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
I don't think the Pros will make a good data pool - despite being 'Pro' drives I am not sure they will sustain prolonged write activity.

Given that the other drives are the same family you could also just combine them into a big mirror - worst case you'd loose performance advantage of the 960 vs 480 drives but you'd get an 8 drive pool instead so read might be greatly improved ...