need zpool set up review/guidance


mokurum

New Member
Dec 20, 2021
25
6
3
Hi all,
In the last few weeks, I’ve been learning about “homelab” and how to make one from scratch.
I've been using Linux daily and on clusters for years, but "using" is the keyword. I've built all my PCs over the last 20 years or so, yet I have never built a machine with the specs below or set up anything like this.
Just looking for a quick review and suggestions on my notes below.
This machine will be the backbone of an attempt to start a business with 2 friends and achieve a lifelong dream of boomer-free financial independence.

Cheers!

PS: this seems to be the correct place to post this, if I missed the obvious sub forum please let me know.

OS/purpose: Proxmox 8.0 to host 1 main ubuntu-server VM, 1 main Win11 Pro VM, 1 TrueNAS SCALE VM to hold the storage vdevs, and maybe some task-specific VMs and cool containers.
3 users will use this server as a workstation (both on ubuntu-server and Win11).
I will try to use it as my "home" PC as well; I will try PCIe passthrough for my current PC's GPU and test gaming on the Win11 VM.

Questions: is it better to create the zpool under Proxmox or under a storage VM like TrueNAS? Does this question even make sense?

Filesystem: ZFS, sync=disabled, no SLOG device, no L2ARC device
Redundancy requirement: nothing more than mirror or RAID-Z1; I will rely on scheduled backups/snapshots of everything. The convenience of a mirror is more important than $/GB. Planning to add a separate file storage server with raidz2 in the near future.
Storage hardware: already bought, but open to suggestions if it will make a tangible difference. However, I cannot add more storage without using a PCIe card; all bays and slots are full otherwise (building in a Fractal Torrent). To be honest, I bought the storage hardware to max out the mobo and the Torrent with near-max GB. List of drives below.
Rest of the hardware (bought from zac1 on STH):
ASRock Rack ROME2D16-2T Server Motherboard Dual Socket Dual 10G
2 x AMD EPYC 7742 64c/128t CPU 2.25GHz (3.4Ghz Turbo)
256GB RAM: 16 x 16GB DDR4 3200 ECC

zpool will look like this:

special vdev: 2 x 2TB NVMe SSDs in mirror
vdev: 2 x 4TB SATA SSDs in mirror
vdev: 2 x 4TB SATA SSDs in mirror
vdev: 2 x 16TB SATA HDDs in mirror

TOTAL POOL: ~24TB usable - will be kept below 90% full = 21-22TB (enough)

I have not built and booted the machine yet, so I don't know the actual drive names. Assuming the names below for now (planning to switch to /dev/disk/by-id references later, per: How to change the drive reference in a zfs pool from /dev/sdX to /dev/disk/by-id (ata-XXXXXX)):


- NVMe Drives: `/dev/nvme0n1`, `/dev/nvme1n1`
- SATA SSDs: `/dev/sda`, `/dev/sdb`, `/dev/sdc`, `/dev/sdd`
- SATA HDDs: `/dev/sde`, `/dev/sdf`

Bash:
# note: sync is a dataset property, so it needs -O (not -o) at pool creation
zpool create -f -o ashift=12 -o autoexpand=on -o autoreplace=on -O sync=disabled tank \
    mirror sda sdb mirror sdc sdd mirror sde sdf
zpool add tank special mirror nvme0n1 nvme1n1
zfs create -o recordsize=32k -o compression=on tank/ssd_dataset1
zfs create -o recordsize=64k -o compression=on tank/ssd_dataset2
zfs create -o recordsize=128k -o compression=on tank/hdd_dataset1
zfs create -o recordsize=1M -o compression=on tank/hdd_dataset2
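And a quick sanity check afterwards, just the standard status/get commands:

Bash:
zpool status tank                               # confirm the mirror + special vdev layout
zfs get -r sync,compression,recordsize tank     # confirm the dataset properties took effect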
Will I be on the right track when I finish building the machine?

The nature of what we do is applied science, R&D/innovation, numerical simulations of the atmosphere, climate, ocean circulation, storm surge, wave/hydrodynamics, CFD, running a bunch of MPI-compiled Fortran code with Slurm, etc. We sometimes deal with large datasets (global climate model datasets, AIS vessel datasets, large ASCII/binary model inputs and outputs, etc.) which are accessed via Python/Fortran code or 3rd-party software as needed. Sometimes we generate 100k .csv files of 20 KB each for no good reason, or 500 high-res PNG files to make an animation MP4. Pretty mixed usage.
 
Last edited:

gea

Well-Known Member
Dec 31, 2010
3,163
1,195
113
DE
Everything ok so far

Some remarks:
You can set everything at the console with the zfs and zpool commands. ZFS-wise it does not matter whether you do it at the console or in a storage VM. I am not sure about current TrueNAS; in the past it was not suggested to use CLI commands outside the web GUI (in my Solaris-based storage VM I allow this).

For daily use it is much easier, and less error-prone, to use a full-featured web GUI than CLI commands.
I have used ZFS for more than 15 years but would not want to work without a web GUI.

Smaller recordsizes may be better for VM storage (less I/O amplification for small writes) but affect ZFS efficiency negatively.
I would not go below 32k without a reason and a performance test.
For media files 1M is good. For a general-use filer the default 128k is OK.

Disabling sync for databases or VM storage holding guest filesystems is a high risk. A crash during a write may lead to a corrupted guest filesystem and/or data loss of confirmed writes. Not a good idea. While an SLOG is not needed for a fast NVMe pool, you should enable sync for VM filesystems.

For a general-use filesystem, e.g. an SMB share, sync can be disabled for better performance.
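As a rough sketch, that split can be set per dataset (the dataset names below are only examples):

Bash:
zfs set sync=standard tank/vm_dataset     # VM/guest filesystems: honor sync write requests
zfs set sync=disabled tank/smb_share      # general-use filer share: favor throughput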

Keep the configuration as minimalistic as possible, especially for the VM task and the storage task, to allow quick recovery and high uptime. I would not add services to the Debian part or the TrueNAS part; every problem there (security or bug) affects the whole thing. Virtualize everything besides the core VM and core storage services.

Switch disk names from controller/cabling-based names like sda to a naming that gives the real device name by ID. You can switch via pool export + import with the wanted reference: How to change the drive reference in a zfs pool from /dev/sdX to /dev/disk/by-id (ata-XXXXXX)
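In its simplest form that is just (pool name "tank" assumed):

Bash:
zpool export tank
zpool import -d /dev/disk/by-id tank    # re-import using persistent by-id device names
zpool status tank                       # members now show ata-/nvme-... identifiers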
 
Last edited:
  • Like
Reactions: Stephan

Stephan

Well-Known Member
Apr 21, 2017
945
714
93
Germany
To add: ashift=13 depending on NVMe flash page size; check the datasheets. For SQL databases, try recordsize=16k. For VMs and e.g. QCOW2 images, try to align device page size, ZFS recordsize, and QCOW2 cluster size for up to triple the performance. Dig through ZFS – JRS Systems: the blog, especially ZVOL vs QCOW2 with KVM – JRS Systems: the blog. You might also want to work through Workload Tuning — OpenZFS documentation if you haven't done so yet. Google init_on_alloc=0 init_on_free=0, because from kernel 5.3 on this is a big performance hit and you may not even need or want it: it wipes memory on alloc and free, and that is expensive for ZFS. There's more hardcore tweaking possible, like switching from interrupts to polling, since you have fast NVMe and enough cores; here polling might be cheaper than the interrupt overhead, at the expense of power consumption. And finally, good friend Wendell has pretty much done what you need in terms of recommendations: Fixing Slow NVMe Raid Performance on Epyc
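For illustration, the alignment could look something like this (the recordsize/cluster_size values, dataset name and path are placeholders; pick them per the JRS posts and your workload):

Bash:
# hypothetical example: match the QCOW2 cluster size to the dataset recordsize
zfs create -o recordsize=64k -o compression=lz4 tank/vmimages
qemu-img create -f qcow2 -o cluster_size=64k /tank/vmimages/win11.qcow2 200G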

Edit: Also, when the machine is built, try some benchmarking with fio. See if the machine flies or something is still off. Expect a couple of days of experimenting.
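A starting point could be as simple as (directory and sizes are placeholders):

Bash:
# hypothetical first pass; point --directory at a scratch dataset on the pool
fio --name=seqwrite --directory=/tank/bench --rw=write --bs=1M --size=4g \
    --numjobs=4 --ioengine=libaio --end_fsync=1 --group_reporting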
 

gea

Well-Known Member
Dec 31, 2010
3,163
1,195
113
DE
One remark:
Special vdevs are removable, but only when all vdevs in the pool have the same ashift (suggested 12). In general I prefer simplicity over special tunings. ZFS defaults are very good, and only in special use cases does a tuning make a difference where you say wow, I can feel it.
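If you ever need to check or undo it, it is roughly this (the top-level vdev name mirror-3 is only an example taken from zpool status output):

Bash:
zdb -C tank | grep ashift      # each top-level vdev should report ashift: 12
zpool remove tank mirror-3     # evacuate and detach the special mirror if ashifts match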
 

mokurum

New Member
Dec 20, 2021
25
6
3
To add: ashift=13 depending on NVMe flash page size; check the datasheets. For SQL databases, try recordsize=16k. For VMs and e.g. QCOW2 images, try to align device page size, ZFS recordsize, and QCOW2 cluster size for up to triple the performance. Dig through ZFS – JRS Systems: the blog, especially ZVOL vs QCOW2 with KVM – JRS Systems: the blog. You might also want to work through Workload Tuning — OpenZFS documentation if you haven't done so yet. Google init_on_alloc=0 init_on_free=0, because from kernel 5.3 on this is a big performance hit and you may not even need or want it: it wipes memory on alloc and free, and that is expensive for ZFS. There's more hardcore tweaking possible, like switching from interrupts to polling, since you have fast NVMe and enough cores; here polling might be cheaper than the interrupt overhead, at the expense of power consumption. And finally, good friend Wendell has pretty much done what you need in terms of recommendations: Fixing Slow NVMe Raid Performance on Epyc

Edit: Also, when the machine is built, try some benchmarking with fio. See if the machine flies or something is still off. Expect a couple of days of experimenting.
Thanks for the morning read and the links!
qcow2, very intriguing (everything is, since all of this is new to me), especially with regard to "They are simpler to provision, allow for easy snapshots, and are less likely to cause crashes if the underlying storage becomes full."
Will definitely skim through the OpenZFS documentation. Appreciate sharing the link.

I want to keep it simple: full-on ZFS and mostly defaults for now. It will be easier to debug things on my first try at this.
Believe me, I can't wait to build and play with this thing. Still waiting on cables and stuff.

Re: the blog post on ZVOL vs QCOW2,
any insight into what the result would be if the ZFS datasets were also tuned?
Is that blog post from 2018 still relevant today?

Cheers!

1. **Synchronous Write Performance:** The benchmark compared the performance of ZVOLs (ZFS volumes) and QCOW2 files for synchronous writes. It was found that QCOW2 files on datasets handled synchronous writes more reliably than ZVOLs. With ZVOLs, there can be issues where the guest system may not properly honor synchronous write requests, potentially leading to data integrity issues or unexpected behavior. QCOW2 files, on the other hand, ensure that synchronous writes are consistently completed before proceeding.

2. **Tuning QCOW2 for Better Performance:** The benchmark also highlighted that tuning the underlying cluster size of QCOW2 files can improve performance. By aligning the cluster size with the ZFS record size and the underlying hardware block size, the performance of QCOW2 files can be enhanced. This tuning resulted in significantly better performance compared to both the default QCOW2 configuration and ZVOLs. The fio invocation used for the tuned case in the benchmark was:

Bash:
# QCOW2 -o cluster_size=8K, --ioengine=sync
root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=sync --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
3. **Manageability and Safety:** QCOW2 files were favored for their ease of management and safety features. They are simpler to provision, allow for easy snapshots, and are less likely to cause crashes if the underlying storage becomes full. ZVOLs, on the other hand, can lead to crashes within the guest system if the storage becomes full.

In summary, the updated breakdown emphasizes that QCOW2 files on datasets provide more reliable handling of synchronous writes, offer improved performance when properly tuned, and are easier and safer to manage.
 

mokurum

New Member
Dec 20, 2021
25
6
3
Everything ok so far
So good to hear this; thanks for your detailed feedback and remarks, it's super helpful for me. Here are my thoughts on your points:

Console vs Web GUI: I'm totally on board with using the web GUI for daily operations. I just used the CLI in my posts because it's a bit easier to follow in text.

Record Sizes: Thanks for the insight on record sizes. My numbers were made up for the sake of discussion. I'll run tests on our Linux workstations to figure out what the real-world file size distribution looks like, maybe with something like the sketch below.
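A rough one-liner along these lines is what I have in mind (the path is a placeholder):

Bash:
# bucket file sizes into powers of two and count how many files land in each bucket
find /data/projects -type f -printf '%s\n' | \
    awk '{ b = 2^int(log($1 + 1)/log(2)); count[b]++ } END { for (s in count) print s, count[s] }' | \
    sort -n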

Minimal Config: Totally with you on keeping things simple. As a newcomer to virtualization and to filesystems other than ext4/NTFS, I'm more comfortable sticking with ZFS and (mostly) defaults for now.

Drive Names: That's a really useful tip. Added to first post.

Disabling sync for databases or VM storage holding guest filesystems is a high risk. A crash during a write may lead to a corrupted guest filesystem and/or data loss of confirmed writes. Not a good idea. While an SLOG is not needed for a fast NVMe pool, you should enable sync for VM filesystems.
For a general-use filesystem, e.g. an SMB share, sync can be disabled for better performance.
I knew this was coming :) I want to provide more context on how I plan to use this machine, just to make sure the enable-sync recommendation still applies.

Our work mostly involves number crunching in one way or another. We don't use a database, and if we ever did, it could live on a separate file server. This machine is more of a compute node that produces output which later turns into a report-like deliverable: basically a bunch of variously sized files and a bunch of repos/models to work on them. The datasets we deal with are pretty big and mostly interacted with through scripts or software. We keep our inputs and code backed up in cold storage and on GitHub. The server will also be connected to a UPS for safe shutdowns if the power goes out. Plus, I'm around to manage any hiccups in real time.

Here's the way I see it: I am currently doing similar work on 4 x 28-CPU Intel Xeon E5-2680 v4 @ 2.40GHz machines, just 4 separate boxes connected via 1GbE.
None of them runs ZFS; they are bare-metal Ubuntu installs on ext4. All I do is take one weekly rsync backup of each (one local / one cold), and we have never had an issue. I'm not saying it's good practice, just trying to make sure I extract as much performance from the hardware as possible within my accepted risk/redundancy criteria.

In our setup, we will have mirrored drives, hourly VM snapshots during work hours, full system backups at the end of the day, and weekly backups.

Again, thanks for your thoughts. They've given me a lot to consider. I just want to make sure I am not missing anything regarding sync. I am really failing to think of a scenario in which the server won't be back up and running within 1-2 days (and that's OK for us, even if it happens once a year).

Given all this, does enabling sync outweigh the performance gains from disabling it? Could you give me a real-life scenario I am not covered for?

Maybe something like this is also possible (rough create command sketched after the list):
vdev1 (special vdev): 2 x 2TB NVMe SSDs in mirror
vdev2: 4 x 4TB SATA SSDs in raidz1 (sync on here?)
vdev3: 2 x 16TB SATA HDDs in mirror

TOTAL POOL: ~28TB usable - will be kept below 90% full = ~25TB
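As a rough sketch (same assumed device names as in the first post), that alternate layout would be created roughly like:

Bash:
# hypothetical alternate layout; note a raidz1 top-level vdev blocks later special-vdev removal
zpool create -o ashift=12 tank \
    raidz1 sda sdb sdc sdd \
    mirror sde sdf \
    special mirror nvme0n1 nvme1n1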
 
Last edited:

gea

Well-Known Member
Dec 31, 2010
3,163
1,195
113
DE
The reason for sync writes is atomic writes.
These are write sequences that must be completed fully or discarded. One example is writing a data block, which is only valid if the affected metadata is also updated. Another example is a write to a RAID array: as there is no true parallelism, the writes go disk by disk. Any crash within an atomic write sequence can result in corrupt data, a corrupt filesystem, or a corrupt RAID, depending on the moment of the crash.

ZFS itself is immune to such problems due to Copy on Write, which means atomic writes are either completed or discarded. This is why you do not need sync for filer use. Only in the special case of a small file that is already completely in the RAM-based write cache but not completely on disk does sync have value for a filer.

Uncompleted atomic writes within a VM guest filesystem happen when the server crashes during the atomic write sequence. There is no way for ZFS to avoid this besides sync, which completes the missing writes on the next reboot. With fast SSDs, and especially NVMe, the performance degradation should be minimal or acceptable. For disk-based pools you need an SLOG for acceptable performance.
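(For reference, adding a mirrored SLOG later is a one-liner; the device paths below are placeholders.)

Bash:
zpool add tank log mirror /dev/disk/by-id/nvme-SLOGDISK1 /dev/disk/by-id/nvme-SLOGDISK2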

You may say the power and OS are stable, so the risk is minimal, especially when combined with a UPS, but a risk remains that you could avoid. Backup is not the solution, as you have no guarantee of valid data.

See some performance numbers I have done at https://www.napp-it.org/doc/downloads/optane_slog_pool_performane.pdf or, with a newer AMD system, at https://www.napp-it.org/doc/downloads/epyc_performance.pdf
 
Last edited: