DIY tower 10-core Xeon for ZFS storage and KVM

Tinkerer

Member
Sep 5, 2020
45
14
8
Build’s Name: Paradise
Operating System/ Storage Platform: Fedora 33 Server built up from minimal install to keep it lean and clean
CPU: Xeon Silver 4210R / 10-cores with HT (20 threads)
Motherboard: Asus Z11PA-U12/10G-2S
Chassis: Antec P100 (yes, it's old, but top-notch cooling and quiet)
Drives: 8x Hitachi Ultrastar (HGST HUH728080AL)
RAM: 96 GB as 6x 16GB ECC registered (Kingston KSM29RD8/16HDR)
Add-in Cards: 4x PCI-E x4 M.2 NVMe cards with Samsung MZVLB256HAHQ 256 GB NVMe drives
Power Supply: Antec CP-850
Other Bits: AMD R5 230 video card (not required but I had it lying around)

Usage Profile: Back-ups, storage, kvm/libvirt hypervisor

Other information…
  • Dedicated KVM/ILO network interface, remote console, remote storage with iso or disk passthrough from the client, HTML5 support, sensor readouts, remote logs. This alone makes the board worth it ;-).
  • Multiple NICs configured, multiple subnets and VLANs.
  • ZFS pool of 4 striped mirrors over 8x spinning rust, plus a ZFS 4x raid0 equivalent over the NVMe drives for fast scratch storage
  • Mobile client backups: Syncthing, plus a Syncthing relay server to give something back to the Syncthing community
  • Laptop/PC client end-to-end encrypted backups to a zfs backup dataset
  • Cloud storage for offsite backups: end-to-end encrypted to some cloud storage provider
  • Fully automated deployment of new VMs using ansible, for fast deployment of test environments using the libvirt module and j2-templated kickstart scripts
  • SMB and NFS shares for clients to use central storage

All datasets and LVM volumes (ZFS and OS) are encrypted, except the NVMe pool. New datasets are automatically set up with encryption and randomly generated keys, which are immediately and securely backed up and synced off-box. I learned that the hard way ;).
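For anyone wondering what "encryption with randomly generated keys" could look like in practice, here is a minimal sketch. It is not the actual automation used on this box: the dataset name, the key directory and the helper name are hypothetical, and the cipher choice is an assumption. It only builds the `zfs create` command line and a fresh key; it does not run anything, write the key file, or handle the off-box backup.

```python
import secrets
from pathlib import PurePosixPath


def new_dataset_command(dataset: str, key_dir: PurePosixPath):
    """Return a fresh raw key and the `zfs create` argv that would use it.

    Hypothetical helper: the caller still has to write the key to the
    key file and back it up off-box before creating the dataset.
    """
    key = secrets.token_bytes(32)  # keyformat=raw expects a 32-byte key
    key_path = key_dir / (dataset.replace("/", "_") + ".key")
    argv = [
        "zfs", "create",
        "-o", "encryption=aes-256-gcm",  # assumed cipher, not from the post
        "-o", "keyformat=raw",
        "-o", f"keylocation=file://{key_path}",
        dataset,
    ]
    return key, argv


# Hypothetical dataset name and key directory.
key, argv = new_dataset_command("tank/backup/laptop", PurePosixPath("/root/zfs-keys"))
print(" ".join(argv))
```

The important part is the order of operations: generate the key, get a copy off the machine, and only then start putting data on the dataset.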

Average power consumption is roughly 140 W, which would amount to about 250 euros a year in electricity costs if run 24/7. Needless to say, I turn it off when I don't use it #becauseIcare
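The yearly figure can be checked with a quick back-of-the-envelope calculation. The electricity price of 0.20 euros per kWh is an assumption picked for illustration; it is not stated in the post, so plug in your own tariff.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def yearly_cost_eur(avg_watts: float, eur_per_kwh: float) -> float:
    """Yearly electricity cost for a constant average draw, running 24/7."""
    kwh_per_year = avg_watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * eur_per_kwh


kwh = 140 * HOURS_PER_YEAR / 1000    # 1226.4 kWh per year at 140 W
cost = yearly_cost_eur(140, 0.20)    # assumed price: 0.20 EUR/kWh
print(f"{kwh:.1f} kWh/year, roughly {cost:.0f} EUR/year")
```

At that assumed price, 140 W around the clock comes out at about 245 euros, which is in line with the 250 euros quoted.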
 

gea

Well-Known Member
Dec 31, 2010
2,832
992
113
DE
Only a few remarks

If you use the ZFS storage for VMs you should enable sync, otherwise you are in danger of a corrupted guest filesystem on a crash during write. If you enable sync on the disk pool, your write performance may go down from maybe 1000 MB/s to 50 MB/s. If you add encryption you may end up at 30 MB/s. The best solution would be an Slog, e.g. an Intel Optane 4801X

See my current thread with benchmarks of the same hgst disks HE8 with sync/encryption and a Xeon Silver system vs Epyc, https://forums.servethehome.com/ind...4110-vs-amd-epyc-7302-on-a-sm-h12ssl-c.31008/

The Samsungs are M.2 laptop/desktop drives. In a server environment with a steady write load, write performance may go down dramatically after some time. The iops from the specs are then like the max speed of a bicycle (under free fall conditions). Also, I have only seen reliability information in the specs, not the more important endurance information (how many TB can be written overall or per day). In any case I would at least use a Z1, not Raid-0

For a local cloud I use minIO (Amazon S3 compatible local cloud) or rclone that allows encrypted client transfer of files to any cloud, http://www.napp-it.org/doc/downloads/cloudsync.pdf (this howto is for Solarish but these tools are available on any X system)

End-to-end encryption from a PC to storage requires client-side encryption. ZFS offers server-side encryption, and even encrypted backups via replication of locked filesystems.
 

Tinkerer

Member
Sep 5, 2020
45
14
8
Thanks for the comments and suggestions!

I'll comment below.

gea said:
Only a few remarks

If you use the ZFS storage for VMs you should enable sync, otherwise you are in danger of a corrupted guest filesystem on a crash during write. If you enable sync on the disk pool, your write performance may go down from maybe 1000 MB/s to 50 MB/s. If you add encryption you may end up at 30 MB/s. The best solution would be an Slog, e.g. an Intel Optane 4801X

See my current thread with benchmarks of the same hgst disks HE8 with sync/encryption and a Xeon Silver system vs Epyc, https://forums.servethehome.com/ind...4110-vs-amd-epyc-7302-on-a-sm-h12ssl-c.31008/

The VMs are on their own dataset with slightly different settings than other datasets. However, sync is not enabled, for the simple reason that it's all for test and development and I couldn't care less about VM corruption. I'll redeploy if that happens.

Respectfully, your benchmarks mean very little to me. First of all, it looks like you're on BSD; I am on Linux. BSD used GELI on top of zfs, unless you very recently upgraded to zfs 2.0, which now shares the codebase with the openzfs project and gives you native zfs encryption (I actually upgraded to 2.0 just now; I waited a couple of days to see if the internet would implode). But nowhere do I see you specifically mention the zfs version or the purpose of your comparison other than upgrading hardware? Maybe I missed that. You compare last year's results with new results, but nowhere do I see a fio command line with parameters, so what exactly did you test? Was the fio version the same? Results are known to differ between some versions.

Again, I mean no disrespect, but it seems you're just comparing numbers without actually knowing what and how you were testing?

gea said:
The Samsungs are M.2 laptop/desktop drives. In a server environment with a steady write load, write performance may go down dramatically after some time. The iops from the specs are then like the max speed of a bicycle (under free fall conditions). Also, I have only seen reliability information in the specs, not the more important endurance information (how many TB can be written overall or per day). In any case I would at least use a Z1, not Raid-0

It's a home server and the NVMe pool is for scratch data only, meaning data that has no importance whatsoever. I had those NVMe drives lying around doing nothing (replaced from other machines), so all I needed were a few dollars for expansion cards, which come really cheap. That pool isn't encrypted either; it's purely a backend for faster iops storage. Some VMs have a disk there with their databases on a custom recordsize dataset and again, I don't care about sync writes. I'll redeploy / restore if it corrupts (and honestly, it literally never has in the > 6 years I have been running zfs on Linux).

I guess I weigh my options a little differently. I've set up my pool with 4 striped mirrors instead of raidz2 because it's so much faster with random iops, which I believe benefits me more than sequential throughput (other people's mileage may vary). See, I don't care if something is 200 MB/s or 1200 MB/s sequential, but I do care (infinitely more so) if I can get 4K random iops up by a factor of 4. I believe more than 60% of generic real-world workloads will be random and not sequential, and most of that will be reads, too. Specific cases of course might differ, and then a different configuration will benefit performance. Making the NVMe pool a z1 will totally destroy random iops and doesn't help my use case.

The only really important thing is my backups. Hence there are multiple levels of backup, and the important stuff also goes off-box and off-site. VMs are mostly test environments to test ansible code or run homelabs for study prior to exams. None of the VMs do anything of any importance whatsoever.
 

gea

Well-Known Member
Dec 31, 2010
2,832
992
113
DE
Benchmarks were done now with the same OS, filebench series and pool, first on hardware I bought last year and then with the pool connected to very new hardware, to compare hardware-based improvements. Tests were done on OmniOS, the open-source fork of Oracle Solaris, where ZFS comes from and is native. This is not FreeBSD and GELI.

Unlike FreeBSD, OmniOS has used the same ZFS encryption as Linux since early summer last year, so I would expect similar results between them regarding ZFS sync and encryption performance. In the past ZFS and SMB on Solarish was pretty much the fastest, but OS-based differences are not in the range of 100%, as was the case between the two hardware platforms.
 

Tinkerer

Member
Sep 5, 2020
45
14
8
Ah, OK. So at least the tests were done the exact same way.

I may need to look into OmniOS one of these days. Just for fun.
 

gea

Well-Known Member
Dec 31, 2010
2,832
992
113
DE
You use a slightly faster Xeon and nearly the same disk pool, so expected ZFS performance values on any OS should be slightly better than my Xeon values and may give a first impression of what you can expect.

btw.
You accept the performance disadvantage of ZFS against ext4, probably due to the superior data security. I would not disable sync on VM storage, as the main problem is that you do not get informed immediately of data corruption on guests. Better to avoid such problems. The Samsungs are so cheap that there is no need to avoid raid redundancy, e.g. via a mirror of two NVMe, even for a scratch disk. For VM storage on SSD, care about SSDs with powerloss protection. Your system is not cheap and mostly superior to many current servers in production use.

About mirror vs raid-z
A single disk has around 100 raw iops. If you build a Raid-Z, the overall iops is like that of a single disk. If you build mirrors, read iops is 2 x number of mirrors and write iops is 1 x number of mirrors, each multiplied by the iops of a single disk.

With 8 disks in a 4x mirror setup, your pool will end up with 400 write iops and 800 read iops. A single desktop SSD may end up at 5-10k iops, a datacenter SSD/NVMe at 80k iops and, best of all, an Intel Optane (from the 800p up) at 500k iops on steady write.
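The rule of thumb above is easy to put into a few lines. This is only a rough model of the numbers in this post (about 100 raw iops per spinning disk, 2-way mirrors); the function names are made up for illustration and real pools will deviate with caching, queue depth and fragmentation.

```python
def mirror_pool_iops(n_mirrors: int, disk_iops: int = 100, disks_per_mirror: int = 2):
    """Estimated (read, write) iops for a pool of striped mirrors."""
    read = n_mirrors * disks_per_mirror * disk_iops  # every disk can serve reads
    write = n_mirrors * disk_iops                    # each mirror writes like one disk
    return read, write


def raidz_pool_iops(n_vdevs: int = 1, disk_iops: int = 100):
    """Estimated (read, write) iops for Raid-Z: each vdev behaves like one disk."""
    iops = n_vdevs * disk_iops
    return iops, iops


print(mirror_pool_iops(4))  # (800, 400) for 8 disks as 4 mirrors
print(raidz_pool_iops(1))   # (100, 100) for the same 8 disks as one Raid-Z vdev
```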

This is the reason for the low sync write performance, say 50 MB/s, of even a good disk pool. Sync write is where you need low latency and high steady write iops at 4k and qd1.
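To make "4k and qd1" concrete, here is one possible fio invocation for that kind of test, assembled as an argument list. This is a sketch only and not how the benchmarks in the linked thread were run (those used filebench); the target path, file size and runtime are placeholder assumptions.

```python
def fio_sync_4k_qd1(target_file: str, runtime_s: int = 60):
    """Build a fio argv for 4k random sync writes at queue depth 1."""
    return [
        "fio",
        "--name=sync-4k-qd1",
        f"--filename={target_file}",
        "--rw=randwrite",
        "--bs=4k",
        "--iodepth=1",
        "--numjobs=1",
        "--ioengine=psync",
        "--sync=1",            # open the file with O_SYNC so every write is synchronous
        "--size=1g",           # placeholder test file size
        f"--runtime={runtime_s}",
        "--time_based",
    ]


# Hypothetical test file on the dataset you want to measure.
print(" ".join(fio_sync_4k_qd1("/tank/vm/fio.test")))
```

Running the same command against the pool with and without an Slog, and noting the fio version, makes results comparable between runs.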
 

Tinkerer

Member
Sep 5, 2020
45
14
8
Thanks, I appreciate your feedback.

I think I am aware of most of that. I have considered Optane; it's just that I don't want to spend more money without being sure what I'll get back. I'm not interested in more performance that will end up being mostly academic and that I won't notice in day-to-day operations.

I played around with those NVMe drives as zil/slog as well as l2arc (and both), and while performance improvements could be measured in some operations, they weren't noticeable from a user experience point of view. I did "feel" like the disk pool was quieter, with less mechanical rattling from the heads, if you know what I mean?

I realize Optane is not the same, but at the same time I doubt I'll benefit from anything other than better specs on a sheet of paper.

I remain sceptical. But Christmas is coming, so who knows, maybe I'll put one on my wish list ;).

Edit: funny I just checked on which datasets I had sync disabled but they are all default. I'd swear I had disabled it on some datasets but I guess I didn't :).
 

gea

Well-Known Member
Dec 31, 2010
2,832
992
113
DE
I am quite sure you would "feel" the difference between 50 MB/s (sync write, like a USB2 disk) and 1000 MB/s (non-sync, like an NVMe) write performance.

btw
You can force a sync setting with always or disabled. Sync=standard (the default) means that the writing application can decide. In normal filer use, e.g. SMB, this means sync disabled. ESXi over NFS will enable sync to avoid corrupted VMs.
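The three settings boil down to a small decision table. The sketch below is only a model of the behaviour described here, not ZFS code; note that the property value for the default behaviour is spelled `standard`, and the function name is made up.

```python
def write_is_sync(sync_property: str, app_requested_sync: bool) -> bool:
    """Model whether a write is treated as a sync write for a given sync= value."""
    if sync_property == "always":
        return True                 # every write is forced to be sync
    if sync_property == "disabled":
        return False                # sync requests from applications are ignored
    if sync_property == "standard":
        return app_requested_sync   # the writing application decides
    raise ValueError(f"unknown sync setting: {sync_property!r}")


# A plain SMB file copy does not ask for sync; ESXi over NFS does.
print(write_is_sync("standard", app_requested_sync=False))  # False
print(write_is_sync("standard", app_requested_sync=True))   # True
print(write_is_sync("disabled", app_requested_sync=True))   # False
```

The last line is the risky case for VM storage: the guest believes its data is on stable storage when it is not.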

Slog is not a write cache but a logging device to protect the RAM cache. Without sync enabled, an Slog makes no difference.

L2Arc is useless with enough RAM. It is also useless for sequential reads beside a minimal effect when you enable read ahead.
 

Tinkerer

Member
Sep 5, 2020
45
14
8
So what would you suggest for my system, which Optane would you get and how would you configure it?
 

gea

Well-Known Member
Dec 31, 2010
2,832
992
113
DE
ZFS is very flexible and allows many optimizations.
A typical setup for a "production" system where you want to use the disk-based pool for VMs would be an Optane 4801x-100 that you add as an Slog, or any other SSD/NVMe with powerloss protection. This depends on the wanted sync write performance (disk-based pool without Slog around 50 MB/s, with a good flash-based NVMe around 200-300 MB/s, with an Optane Slog up to 500 MB/s)

The main alternative is to use the disk pool for filer and backup (without sync) and a second, faster pool for VMs, either traditional flash or Optane NVMe, e.g. as a simple mirror. Then you only need to care about powerloss protection. Only the Optane DC 48xx has guaranteed powerloss protection, but any Optane from the 800p up is considered uncritical. The endurance of the 800p is lower compared to the 90x and the datacenter 48xx. If you find a used NVMe DC 750, 3600 or 3700, they are also very good. Just enable sync then, without an additional Slog, which you need only for disk-based pools.
 