How to maximize Optane for ZFS in ESXi


ipreferpie

Member
May 27, 2018
51
1
8
I’m planning a new build and consolidating the hardware I have into a new ZFS server. I currently have 1x M.2 905P Optane 380GB and 1x U.2 905P Optane 960GB that I bought separately when I found good deals. However, since the two are mismatched in size and I’m new to ZFS, I’m wondering if anyone with experience can offer advice on how to maximize the benefit of these Optane SSDs.

My specs are the following:
CPU: dual Xeon 8176 ES
Memory: 384GB on the ESXi 6.7u3 host, with 256-280GB allocated to the Solarish ZFS VM
Storage: 6-10 mirrored pairs of SATA HDDs allocated to the ZFS pool
Network: 10GbE to all my servers

ZFS use case: mainly to offer faster storage than UnRAID. I’ll keep UnRAID for media and backups, but ZFS will serve iSCSI targets for ESXi. On it I’ll run around 3-6 VMs (Windows, Linux), host databases for things like Nextcloud, Plex & Lightroom, and keep a Veeam LTO8 tape cache.

I’m interested in using the Optane drives mostly for SLOG and special vdevs, and perhaps L2ARC if needed. Is it possible to organize the Optane drives by creating VMware datastores and handing them to the ZFS VM as separate partitions, like this:
SLOG: mirrored 40GB + 40GB from each Optane drive
special vdev (metadata + small blocks + dedupe): mirrored 340GB + 340GB from each Optane drive
L2ARC: 580GB from U.2 drive if needed

I know that using the same disk for different tasks imposes a penalty, and ideally I should have dedicated drives for each vdev type, but I’m wondering if this is viable (rough sketch of what I mean below). Many thanks in advance!
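To make it concrete, the split would look roughly like this in zpool terms (the pool name and device names are just placeholders, each "device" would really be a vdisk carved from a VMware datastore on one of the Optanes, and a ZFS version with special-vdev support is assumed):

# 40GB slice from each Optane as a mirrored SLOG
zpool add tank log mirror optane-m2-slog optane-u2-slog

# 340GB slice from each Optane as a mirrored special vdev (metadata / small blocks / dedup table)
zpool add tank special mirror optane-m2-special optane-u2-special
zfs set special_small_blocks=64K tank      # example threshold for routing small blocks

# remaining ~580GB of the U.2 as L2ARC, if it turns out to be needed
zpool add tank cache optane-u2-l2arc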
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
ZFS allows any of the options you mention, and with Optane even vdisks used as L2ARC, SLOG or special vdevs give superior performance.

Probably I would go a different way and make "keep it simple" the main goal.

- create a pool from an Optane mirror (380 GB) and put all VMs onto it (no data), with sync enabled. That size should be enough for VMs. When you get the chance at a second 960GB Optane, replace the 380GB one for the full capacity. I would use NFS to access the pool from ESXi: much simpler, similarly fast, and you can additionally expose ZFS snaps as Windows "previous versions" via SMB (command sketch at the end of this post)

- use the disk pool for data, with sync disabled. Pool layout according to your needs (multi raid-10 or Z2). Access the data pool via SMB, NFS or S3; use iSCSI only when a "local" disk is needed (possibly for Lightroom)

- with such a large amount of memory, L2ARC is pointless. Give the storage VM upwards of 64 GB RAM; 256 GB for a storage VM can make sense, but only with very many users and small files, for example a mailserver. RAM on ZFS is used as ARC, a read cache with a read-last/read-most strategy that caches only random I/O, not large sequential files. You can check arcstat, but without many users I doubt you will use that much.

- what is the use case for Nextcloud?
If you mainly want Internet access for a few users, Amazon-S3-compatible sharing of a ZFS filesystem to the Internet via minIO is much simpler, faster and more secure (only a single binary vs. the hundreds of files of a Nextcloud install). S3 is also a perfect target for Veeam and other backup tools. For client sync you can use one of the S3 sync tools.
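A minimal sketch of that two-pool layout on a Solarish storage VM (pool names, dataset names and the cXtYdZ device names are placeholders; sync=always is used here to force synchronous writes on the VM filesystem):

# VM pool on the Optane mirror, synchronous writes, shared to ESXi via NFS
zpool create optane mirror c1t0d0 c1t1d0
zfs create optane/vms
zfs set sync=always optane/vms
zfs set sharenfs=on optane/vms

# data pool on the SATA mirrors, async writes, shared via SMB
zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0 mirror c2t4d0 c2t5d0
zfs create tank/data
zfs set sync=disabled tank/data
zfs set sharesmb=on tank/data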
 
  • Like
Reactions: ipreferpie

sth

Active Member
Oct 29, 2015
379
91
28
Hi gea, your post made me curious: does ZFS cache minIO-served blocks as well?
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
From the ZFS point of view, minIO is just a service reading and writing data.
No different from other services.
 

ipreferpie

Member
May 27, 2018
51
1
8
Thanks for the detailed response. I never thought about the option of just hosting the VMs on the Optane drives. But then, I was hoping to 1) utilize the extra 580GB on the U.2 Optane (maybe I'm being greedy), and 2) extend the benefits of the Optane drives to my SATA storage pool, which will contain the databases.

I think I'll do NFS instead of iSCSI based on your suggestion in that case.

Very cool suggestion about Amazon S3 -- never considered that before. I'll do some research on that.

But I'm curious: can the same Optane disk be split into SLOG and special vdev (metadata, small files, dedupe) partitions without losing the benefit of Optane's high IOPS? Would love to hear your thoughts on this.

Much appreciated!
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
You can create a 380 GB mirror. It is possible to create partitions on the 960 GB disk, but I would probably add the Optanes to ESXi and use vdisks, as this is less complicated and nearly as fast.

Optane is superior to other NVMe. It can write a single data block directly: no need to read a page, erase it and write the new data together with the old, and no need for garbage collection or trim. Add to this the super low latency of around 10us and 500k write IOPS, and it is absolutely no problem to have multiple concurrent read/write processes.

btw
If you want to add special vdevs, you need a mirror (vdev lost = pool lost) and you must take care to force the same ashift as the pool. By default a pool is often ashift=12, while an NVMe often defaults to ashift=9. There is a bug in current Open-ZFS that allows a special vdev with a different ashift to be added at creation but prevents it from being removed later.
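A quick sketch of the ashift check and the forced add (the pool name tank and the two special devices are placeholders; zpool add -o ashift=... assumes a reasonably recent Open-ZFS):

# check what ashift the existing vdevs use
zdb -C tank | grep ashift

# force the pool's ashift when adding the special mirror; a mismatched ashift can block a later vdev removal
zpool add -o ashift=12 tank special mirror optane-a optane-b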

about minIO
 
  • Like
Reactions: ipreferpie

ipreferpie

Member
May 27, 2018
51
1
8
Yes, I was thinking of going the route of ESXi vdisks for the flexibility. I've heard good experiences from other people, without much performance sacrifice.

And great to know that it's not too crazy to house SLOG and special vdevs on the same Optane disks (as mirrors). I'll give that a try to see if it's viable and will come back with results.

And thanks for the reminder about ashift. I'm looking to do ashift=13 across the whole pool in that case.

Several more questions if that's ok:
- I was thinking of doing 6x 10TB drives in mirrored pairs, but I've heard that an Optane special vdev can give an order-of-magnitude speedup for VMs and databases in some situations. Would I lose a lot of IOPS if I did 6x 10TB RAIDZ2 instead?
- I've heard dedupe is usually not recommended, but since I'm using the pool for VMs, have quite a lot of RAM, and will put the special vdev on Optane, would this be a viable case?
- Does napp-it support ZSTD compression now, and what are your thoughts on it?

Thanks again for the great support
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
Raid-10 (2-way mirrors) vs Z2

In general, sequential performance is often similar, but IOPS are not.
A single Z2 vdev always has the read/write IOPS of a single disk, say 300.

In a mirror/Raid-10 pool of n mirror vdevs, read IOPS scale with roughly 2n and write IOPS with n.

Ex:
A 10-disk Z2 of 1TB disks has ~300 read/write IOPS and a capacity of 8 TB.
The same 10 disks as 5 mirror vdevs give ~3000 read IOPS and ~1500 write IOPS, but only 5 TB capacity.
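As a command sketch, the two layouts for the same ten disks (disk names d0..d9 are placeholders):

# 10-disk raidz2: capacity of 8 disks, but roughly the IOPS of a single disk
zpool create tank raidz2 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9

# the same 10 disks as 5 mirror vdevs: 5 disks of capacity, ~5x write and ~10x read IOPS
zpool create tank mirror d0 d1 mirror d2 d3 mirror d4 d5 mirror d6 d7 mirror d8 d9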

The main problem of dedup is the huge RAM need of up to 5 GB per TB of dedup data, in addition to the RAM you want for caching. Never try to import such a pool on a system with little RAM.

With a special vdev for dedup you can ignore the RAM aspect, as the dedup table lives there. It requires a very fast vdev sized at a minimum of around 5 GB per TB of dedup data. With your amount of RAM you may ignore this entirely. Keep the achievable dedup rate in mind: with a low dedup rate (<5-10), skip dedup and add a disk or two instead.
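You can estimate the achievable dedup rate before committing to it; a sketch, assuming the data already sits on a pool called tank:

# simulate dedup over the existing data; prints a DDT histogram and an estimated ratio
zdb -S tank
# if the reported dedup ratio is low, skip dedup and buy the extra disk instead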

A mirror of special vdevs built from two partitions of the same physical disk is useless (disk lost = pool lost).

ZSTD
Napp-it supports the features of the underlying OS.
ZSTD compression is still beta. If it becomes final and proves better than LZ4, it will be available on any OS with Open-ZFS: Introduce ZSTD compression to ZFS by c0d3z3r0 · Pull Request #10278 · openzfs/zfs
 
  • Like
Reactions: ipreferpie

ipreferpie

Member
May 27, 2018
51
1
8
Thanks again for the excellent info, gea. I've settled on a mirror config with no dedupe. As for ZSTD, do you know whether I could convert from LZ4 to it in the future, once it's ready?
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,182
113
DE
Every compressor (and dedup) works on newly written data only. A replication of the filesystem to a destination that has a different compressor set will convert the data to the new compression setting.
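A minimal sketch of such a conversion by replication, assuming an Open-ZFS release where ZSTD is available and using placeholder pool/dataset names (tank, tank/vms):

# setting the property alone only affects blocks written after the change
zfs set compression=zstd tank

# replicate the old filesystem; the received copy is written fresh and inherits
# compression=zstd from the destination parent, so the existing data is recompressed
zfs snapshot tank/vms@convert
zfs send tank/vms@convert | zfs receive tank/vms_zstd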