ZFS dedup vs datastore

ehfortin · Feb 4, 2016

Hi,

I've done lot of testing lately with ZFS and dedup and I got a result I was not expecting. First, to my knowledge, when you want to access ZFS through iSCSI, you have to create a volume in your pool (tank/myvol). When you define this volume, you have to specify the space needed. The way ZFS handle the size is by reserving the actual space from the pool without any consideration for dedup (which is normal as it doesn't know what kind of dedup ratio it could get). That also means that you can't put a number that is larger then the actual available space in the pool.

That said, I was expecting the volume to return the unused space to VMware which would report the free space accurately. That's not what is happening. Let say you have a 100GB volume as your datastore. If you create 3 clone of the same 30 GB VM in that datastore with dedup active, you were supposed to use 90 GB and that's what the datastore will report. However, when you look at your volume or pool from ZFS point of view, it will say that you are using about 30 GB so the pool benefit from the space but the volume seems to be as if no dedup was done from the datastore point of view. So, if you try to clone a 4th image, VMware will end with an error as it will say that there is no space left as only 10GB was remaining.

When sharing the same pool over NFS, the free space seems to be accurate. My guess is that VMware understand that NFS is a shared space so as it doesn't control it at 100%, it actually ask to the NFS share to report free space and VMware just display it as the NFS server is authoritative on this information. It is actually quite interesting as the "Capacity" is getting bigger and bigger then the pool was originally because, I assume, VMware is calculating the stuff it installed on it and add it to the free space remaining. That gives funny result when you clone the same VM 10 times on the same 50GB datastore

Is there any way to have "expandable" iSCSI/FC lun as a datastore that would permit to migrate more VMDK then what would normally be allowed without dedup?

As a side information, I did the same test with Starwind and a virtual disk is used the same way as a volume from ZFS would. There is a difference as we can create a virtual disk that is bigger then the physical pool. That cost a lot of memory as it will reserve the proper memory to handle the dedup of that volume upfront but it's kind of weird doing so because you have to guesstimate the capacity you will get from the physical space.

I guess the way Starwind is handling it is similar to thin-provisioning where you can create a volume bigger then the physical space. That can work but that's not easy to handle. Thinking about that, I just checked and ZFS allow the creation of a "sparse" volume that can be bigger then the pool. That would allow to do the same as Starwind. However, is that the only/best way of handling datastore when dedup is used?

Thank.

gea · Feb 5, 2016

First of all, I do not use ESXi + iSCSI.
NFS is as fast, offers a much easier handling and you must not care about a fixed volumesize. NFS works always thinprovisioned and its size can grow up to poolsize without any settings unless you do not restrict with quotas or reservations.

You should also take out dedupe of any considerations. ZFS dedupe is realtime dedupe. In the moment you write data to a pool, any data is checked against a dedup table to decide if you need a reference or must write data. This only affects real fillrate of a pool after writes.

Then it comes to thin provisioning. When you create a zvol or a filebased logical unit on Solarish, you can enable thin provisioning what means that you create a virtual disk of a certain size that do not block that size but only the data that is effectively used with the option to grow to that size. I am not certain about the ESXi behaviour but I have heard from other setups that it can happen that the whole size is reserved by the client even when created as thin provisioned. As a client like ESXi does not see the datapool behind a LUN it cannot give any informations about. This is different to NFS.

So it comes to the question, why do you want to use iSCSI over NFS on ESXi?
And, do you know about the RAM needs of dedup?

whitey · Feb 5, 2016

I had a NFS v.s. iSCSI rant in this thread if interested showing the litany of advantages.

Sanity check on all flash ZFS build. Solaris 11.2, SuperMicro chassis, Samsung 850 Pro 1TB SSDs...

gigatexal · Feb 5, 2016

gea said:
First of all, I do not use ESXi + iSCSI.
NFS is as fast, offers a much easier handling and you must not care about a fixed volumesize. NFS works always thinprovisioned and its size can grow up to poolsize without any settings unless you do not restrict with quotas or reservations.

You should also take out dedupe of any considerations. ZFS dedupe is realtime dedupe. In the moment you write data to a pool, any data is checked against a dedup table to decide if you need a reference or must write data. This only affects real fillrate of a pool after writes.

Then it comes to thin provisioning. When you create a zvol or a filebased logical unit on Solarish, you can enable thin provisioning what means that you create a virtual disk of a certain size that do not block that size but only the data that is effectively used with the option to grow to that size. I am not certain about the ESXi behaviour but I have heard from other setups that it can happen that the whole size is reserved by the client even when created as thin provisioned. As a client like ESXi does not see the datapool behind a LUN it cannot give any informations about. This is different to NFS.

So it comes to the question, why do you want to use iSCSI over NFS on ESXi?
And, do you know about the RAM needs of dedup?

Hmm. @gea are you saying not use dedup ever even if the underlying hardware is fast enough (speedy cpu) and has enough ram (in my case 128GB, datapool about 1TB)?

gea · Feb 5, 2016

Pro dedup
With nearly identical data example cloned VMs dedup can dramatically reduce disk space and increase sequential performance due reduced disk access.

Con dedup.
Depending on amount of duplicated data, the dedup table can grow up to several GB per TB of data. As the table must be processed on every read/write you must ensure that you can hold it in RAM or the system can become unusable slow. Processing the dedup table also increase latency. This means that you need a lot of RAM to reduce needed disk capacity while you mostly want the RAM to increase performance when used as readcache.

If you have a high dedup rate on a quite small and expensive high performance pool, then the dedup advantages can be more relevant than the disadvantages. On larger or cheaper pools the disadvantages are a problem.

With 128GB RAM and 1TB data, the RAM need is irrelevant so it should give you an advantage.

gigatexal · Feb 5, 2016

thanks for the feedback, my use case for dedup will be similar to the ops in that I will be dedup'ing windows VMs.

ehfortin · Feb 5, 2016

gea said:
So it comes to the question, why do you want to use iSCSI over NFS on ESXi?
And, do you know about the RAM needs of dedup?

I'm certainly with you that NFS is a lot easier to manage and has numerous other advantage. However, from my first batch of performance testing on the exact same setup, same Solaris 11.3 VM, I got better performance from iSCSI on my VMware cluster. If I can get the same performance, I'll most likely trash iSCSI.

I know about the RAM need of dedup. However, putting about 50 VM in about 400 GB of enterprise SSD as some value. For that usage, the RAM needed is relatively small and, if ZFS is as good as popular storage unit that do dedup, the read cache are staying "deduped" as well so you can put a lot in cache anyway. My server can hold 32 GB (lot less then what some have in hand) but... after keeping 8-10 GB for some "services VM", it will still give about 20-24 GB of RAM for OS + caching which is as much if not more then lot of production unit that are in the field that are running up to 500 drives and are connected to 10-20 hosts with a few hundreds of VM.

Now, I could use a "post process" dedup which would give me the same thing without incurring RAM usage but outside of Windows 2012, I don't know a good implementation of this and I don't really like the fact that I would have to force manual dedup while loading the datastore as I don't have enough SSD space to accommodate everything at once.

Will read your performance review again and will try to replicate it in iSCSI and NFS to see if I'm getting the same level of performance. Is there any performance enhancement that can be applied for NFS? If I understand correctly, NFS is forcing sync write. iSCSI is most likely not doing it but I seems to remember that ESX is enforcing sync write on everything. So it should not be a factor. Is that correct?

Thank you for your comments.

ehfortin · Feb 5, 2016

whitey said:
I had a NFS v.s. iSCSI rant in this thread if interested showing the litany of advantages.

Sanity check on all flash ZFS build. Solaris 11.2, SuperMicro chassis, Samsung 850 Pro 1TB SSDs...

I can contribute to your list of advantages

The kind of behavior I saw with my datastore is pratically enough to drive me back to NFS. I'll now have to figure why performance was not as good doing the same test on the same SSD disk from a VM that access a datastore over either NFS or iSCSI.

ehfortin · Feb 5, 2016

gigatexal said:
Hmm. @gea are you saying not use dedup ever even if the underlying hardware is fast enough (speedy cpu) and has enough ram (in my case 128GB, datapool about 1TB)?

Kind of jealous... Nice server. Don't want to run one like this around the family however. It is most likely noisy

gigatexal · Feb 5, 2016

nope it's in the tv room not noisy. i used a huge case and tower coolers

gea · Feb 6, 2016

ehfortin said:
Is there any performance enhancement that can be applied for NFS? If I understand correctly, NFS is forcing sync write. iSCSI is most likely not doing it but I seems to remember that ESX is enforcing sync write on everything. So it should not be a factor. Is that correct?
Thank you for your comments.

With default settings, ESXi forces sync write over NFS.
The corresponding setting with iSCSI is writeback cache. Set it in the LU settings to disable for the same behaviour (and performance effects). If you want to compare NFS with iSCSI, use the corresponding setting.

You can also tune some NFS settings (see my HowTo's or my napp-it tuning panel)
The defaults are the same for years and not optimal for faster networks or a lot of RAM.

Another item is the recordsize/blocksize.
ZFS use 128k per default on filesystems and 8k on volumes.
If you store foreign filesystems or databases on ZFS one or the other setting may be suboptimal.
You can set either when creating a filesystem or creating a volume.

more
http://www.vmware.com/files/pdf/techpaper/VMware-NFS-Best-Practices-WP-EN-New.pdf

ehfortin · Feb 6, 2016

gea said:
Another item is the recordsize/blocksize.
ZFS use 128k per default on filesystems and 8k on volumes.
If you store foreign filesystems or databases on ZFS one or the other setting may be suboptimal.
You can set either when creating a filesystem or creating a volume.

I read the document you suggested and in there, they are saying that a vendor is proposing to use 32K for sequential stuff and 4K-8K for random. I understand that ZFS is using 128K on filesystems and my assumption is that it "concatenate" various write to create a bigger write to disk in order to increase performance compared to doing each write independently. So, if that's the case and as I have multiple VMDK on the same NFS datastore, my hypothesis would be to stick on 8K for the guest filesystem which would align with the VMDK write size and ZFS would do it's stuff of concatenating the blocks to create it's 128K filesystem block for actual write. In my case, currently, all my NFS datastore use 4K (as per vmkfstools -P -v10 /vmfs/volumes/datastore_name). Apparently it would be best as well using small block when SSD is the underlying storage because if you create larger block, even if only a few KB are modified, you write a large chunk which is not optimal for SSD. Now, that may not change much if ZFS is doing 128K write anyway.

Does this make sense?

gea · Feb 6, 2016

The defaults are best for average use.
You can create a filesystem with a smaller recordsize ex 8k and compare ESXi/NFS performances.
It may give an improvement with many VMs

T_Minus · Feb 6, 2016

BTW: Oracle BLOG has MySQL InnoDB tuning advice too for matching record size and other misc settings, worth a read if doing DB stuff.

darklight · Feb 8, 2016

What I find quite awesome is that Starwind does in-line deduplication on the fly:

Data Deduplication and Compression – StarWind Virtual SAN™ - Starwindsoftware.com

And obviously consumes some RAM to do it. Both newer ZFS builds act the same way, however, the new Starwind is capable of doing hashes in flash memory which dramatically offloads RAM usage.

EDIT: Typo

gea · Feb 8, 2016

ZFS does this with the L2Arc but as RAM is far faster with a far lower latency than SSDs, this is only a solution if you want realtime dedup and can allow a much lower performance than with dedup=off.

Search

ZFS dedup vs datastore

ehfortin

Member

gea

Well-Known Member

whitey

Moderator

gigatexal

I'm here to learn

gea

Well-Known Member

gigatexal

I'm here to learn

ehfortin

Member

ehfortin

Member

ehfortin

Member

gigatexal

I'm here to learn

gea

Well-Known Member

ehfortin

Member

gea

Well-Known Member

T_Minus

Build. Break. Fix. Repeat

darklight

New Member

gea

Well-Known Member