ZFS planning for virtualization and datastores for VMware ESXi


levilevi

New Member
Dec 27, 2019
Hi Guys!

I'm in the middle of planning ZFS-based storage for my ESXi cluster.

The storage hardware is this:

Supermicro X10DRU-i+
2x Intel Xeon E5-2430L v3
32GB 2133MHz memory running at 1866MHz
4x 10G NIC
2x Samsung PM1725a 800GB U.2 (plan to buy 2 more)
1x Samsung PM1735 6.4TB PCIe 4.0
7x 6TB Seagate SAS HDDs -> these are currently in a TrueNAS box in RAIDZ2
1x 4TB for NVR

The software I would like to use is OSNexus QuantaStor 6.

My original plan is to serve NFS to my hosts via a 1x 40G connection per host to the server, but I think I'll be able to use NVMe-oF as well.
For test purposes I can have a 2x 10G connection to the hosts for now.

I would like to use the SSD part for virtualization with a mixed workload of web servers, database servers, simple VMs, and some files as well.
The HDD part will be used for file storage and some backups of the VMs, and yes, the little 4TB one is only for NVR purposes.

What are your recommendations for this use case? I mean configuration, zpool settings, etc. for speed and reliability, and maybe more hardware such as additional RAM...

Thank you!
 

ano

Well-Known Member
Nov 7, 2022
iSCSI comes to mind.

Other than that, QS is super fast and easy to work with.
 

BackupProphet

Well-Known Member
Jul 2, 2014
Stavanger, Norway
olavgg.com
I am currently building a data architecture with ZFS and NFS over RDMA. I have a petabyte-sized pool that is shared with 4 nodes running ClickHouse. The query speed is unbelievable, and maxing out 40GbE was surprisingly easy. My next step is trying 100GbE with Mellanox ConnectX-5.

The advantage of using NFS over RDMA instead of NVMe-oF is that you can use the special_small_blocks property together with a special metadata vdev to write smaller blocks to SSDs instead of slow spinning hard drives. ZVOLs don't support it yet; there is a pull request being worked on. NVMe-oF will of course have better performance with RDMA, and ZVOLs have slightly better performance.
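As a minimal sketch of that setup (pool, dataset and device names are placeholders, not from this thread), the special vdev is added as a mirror and small blocks are routed to it per dataset:

```sh
# Add a mirrored special (metadata) vdev to an existing HDD pool:
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Route blocks of 64K and smaller for this dataset to the special vdev
# (keep the value below the dataset recordsize, or everything lands on the SSDs):
zfs set special_small_blocks=64K tank/files
```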

RDMA is the key here. Anything else, iSCSI or plain NFS, is rarely faster than 500MB/s with ClickHouse; once I enable RDMA, I easily read 3-4GB/s.
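For reference, a rough sketch of enabling NFS over RDMA on Linux (export path, mount point and module handling are assumptions and may differ by distribution and kernel):

```sh
# On the storage server: let the kernel NFS server listen on the RDMA port (20049 by convention)
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# On the client: load the RDMA transport and mount with the rdma protocol
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 server:/tank/clickhouse /mnt/clickhouse
```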
 

levilevi

New Member
Dec 27, 2019
Hi!

Thanks for the answers. So what do you recommend for the ZFS part? The caching, memory, layout, etc...
 

gea

Well-Known Member
Dec 31, 2010
DE
The ZFS read/write cache is RAM, where the last/most read ZFS data blocks are cached (not files).
Regarding memory, use 32GB+ (on Linux a little more than on Solaris-based systems).
The fastest layout is 2/3-way mirrors (or RAID-10).
Enable sync without a dedicated SLOG when using fast NVMe.
Avoid encryption with sync enabled.
Use NVMe with power-loss protection and high steady 4K write IOPS / low latency.
Use a lower recordsize like 16-64K for VM storage (see the sketch below).
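A minimal sketch of that kind of layout (pool, dataset and device names are placeholders, not a tested recommendation):

```sh
# Striped mirrors (RAID-10 equivalent) from four NVMe drives:
zpool create nvmepool mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1

# VM datastore with a smaller recordsize, compression and no atime updates:
zfs create -o recordsize=32K -o compression=lz4 -o atime=off nvmepool/vmstore

# Force sync writes without a separate SLOG; the fast NVMe pool holds the ZIL itself:
zfs set sync=always nvmepool/vmstore
```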

While iSCSI is similar in performance to NFS with the same sync/writeback setting,
I would always prefer NFS due to its simplicity.

When you take ZFS snaps of online VMs, there is a danger of a corrupt VM in the snap.
To avoid this, either snap VMs in the off state or create ESXi quiesced or hot memory snaps prior to a ZFS snap (see the sketch below).
You can destroy them after the ZFS snap. After a rollback/restore, go back to this safe ESXi snap.
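A rough sketch of that sequence (the VM id, snapshot names and dataset are placeholders; the vim-cmd calls run on the ESXi host, the zfs command on the storage server):

```sh
# On the ESXi host: find the VM id, then take a memory (or quiesced) snapshot
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/snapshot.create 42 pre-zfs-snap "" 1 0

# On the storage server: take the ZFS snapshot of the datastore dataset
zfs snapshot nvmepool/vmstore@daily

# Back on the ESXi host: remove the ESXi snapshot again
vim-cmd vmsvc/snapshot.removeall 42
```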
 

gea

Well-Known Member
Dec 31, 2010
DE
Indeed, a snap is like a sudden powerloss.

If this happens during sequential atomic writes (write data + update metadata) within a VM, ZFS can do nothing to protect the VM filesystem against data corruption. Copy-on-write can only protect ZFS itself on a power loss during writes, not guest filesystems or files.

The risk is not huge, but it is there at a statistical rate.
If a guest loses power, the danger is lower, as this happens under the control of the guest, while a guest has no control over snap contents.
 

BackupProphet

Well-Known Member
Jul 2, 2014
Stavanger, Norway
olavgg.com
ZFS will not commit any partial write. Any application that doesn't use sync writes to flush will not end up in a corrupted state. That means when the VM reboots it will not find corrupted data, maybe just missing data. But that doesn't matter when the write happened async, as the application never kept track of the write process and can simply restart the operation from scratch.

For example, when you move a file from one disk to another, you will have a partial write of the file, but the source file is still available and will not be deleted before the move is complete. Metadata is written to a journal, and if the journal contains an operation that is only partially written, it will not be applied and the filesystem state will still be OK.
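The boundary between sync and async behavior can also be controlled on the ZFS side via the sync property; a quick sketch with a placeholder dataset name:

```sh
zfs get sync tank/vmstore            # standard: honor the application's sync/flush requests
zfs set sync=always tank/vmstore     # treat every write as synchronous (safest, slowest)
zfs set sync=disabled tank/vmstore   # never wait for stable storage; recent async data can be lost on power loss
```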