napp-it ZFS server on OmniOS/Solaris: news, tips and tricks

gea

Well-Known Member
Dec 31, 2010
2,873
1,016
113
DE
File and share permissions, new option in current napp-it dev

You can use SAMBA on Solarish operating systems, but in most cases you will prefer the multithreaded, kernel-based SMB server that is integrated in Solaris. Reasons are the easier setup, the perfect integration of Windows NTFS-like permissions with inheritance, Windows SIDs as security reference (AD permissions stay intact on a backup), local Windows-compatible SMB groups and ZFS snaps that work out of the box as Windows Previous Versions.

You use fine-grained file/folder-based ACL permissions to restrict access or to allow creation of new files and folders, with the option to apply settings to the current folder only or to inherit them to deeper ones as well.

acl.png


Additionally you can set share ACLs. They restrict access globally, in addition to and independent of the file/folder-based ACLs. You use them to set global defaults like not allowing ACL modifications, or to restrict access temporarily to read-only for normal users (root always has full access). Share ACLs are settings on the share control file /pool/filesystem/.zfs/shares/filesystem. This file is created when you enable a share and deleted when you disable it, which is why the settings are not persistent. When you re-enable a share you are always back at the default setting everyone=full access.

Starting with the current November napp-it 22.dev, share ACLs are preserved as ZFS properties and can be restored when activating a share. You can also restrict access at this point to read-only or modify only.

shareacl.png
 
All in One System (ESXi incl. free version + virtualized OmniOS ZFS SAN appliance): autoboot of storage VM + other VMs on NFS

When I brought up the AiO idea more than 10 years ago, autoboot of VMs on a delayed NFS storage was trouble-free. You just set a delay for the VMs on NFS to allow OmniOS to boot up and provide its NFS share. With a current ESXi this simple setup no longer works, as it seems that ESXi checks the availability of a VM before the bootup delay.

Workaround:
- Create an empty dummy VM on the local datastore
- Autostart this dummy VM after the OmniOS storage VM, with a delay long enough for OmniOS to boot up and for ESXi to auto-reconnect NFS, e.g. 300s
- Autostart the other VMs from NFS on ZFS after the dummy VM.

autostart.png

 
My must have tools for daily work with OmniOS storage servers

1. System Management

1.1 WinSCP (free)

This is a file manager and editor for Linux/Unix text files on Windows.

winscp.png

1.2 PuTTY (free)

This is a remote console for Windows. You can copy/paste commands via a right mouse click.
I often use it in combination with Midnight Commander, a console file browser, to copy/move data locally.

putty_mc.png


2. Disk Management

2.1 Partedmagic (13 USD to get newest versions for 3 months, use it then without time limit)


This is a small commercial Linux distribution. Use Rufus to make a bootable USB stick from the iso.
The most important features are "secure erase" of any SSD/NVMe and partition management of disks.

partedmagic.png



2.2 Hirens Boot CD (free)

This is a Windows 10 PE distribution with many tools for Windows and disk maintenance. What I use quite often for my OmniOS ZFS servers is WD Data Lifeguard, for intensive disk tests to repair disks with troubles or to get the final confirmation that a disk is bad. Use Rufus to create a bootable USB stick from the iso.

hirens.png


2.3 Clonezilla (free)

This is a small Linux distribution to create/restore images, e.g. from/to an NFS/SMB filer, or to clone disks locally. Use Rufus to create a bootable USB stick from the iso.

clonezilla.png

2.4 Rufus (free)

This is a Windows tool to create bootable USB sticks from an iso image. You can also create a bootable FreeDOS USB stick, e.g. to flash firmware for mainboards or HBAs, or to run hdat2.


2.5 USB imager (free)

This is a Windows tool to backup/restore/clone USB images.

usb_image_tool.png


2.6 hdat2 (free)

This is a disk management tool for MS-DOS or FreeDOS. I use it mainly to create or manage host protected areas (HPA) on SSDs after a secure erase. An HPA reduces the usable size of an SSD (a smaller disk is reported to the OS). Such an overprovisioning via HPA improves the performance and reliability of SSDs. Create a FreeDOS USB stick and copy hdat2.exe onto it.


3. Development and scripting

3.1 ESXi (free edition).


This is the commercially leading virtualisation environment and in most cases the fastest one. The commercial version adds HA features, centralized management of several servers and storage management via vSAN. With ESXi free you can add an OmniOS storage VM to get sophisticated storage features, and a Linux distribution if you want Docker etc. Register at VMware to download the free ESXi.

esxi8free.png


3.2 SuperMicro ipmiview

A tool to discover IPMI devices and to display and manage the console via IPMI, even for older Java-based Supermicro boards.
To save settings and password, allow everyone modify on the folder C:\Program Files\SUPERMICRO\IPMIView

ipmiview.png


3.3 DZsoft Perl Editor (free)

This was a commercial product and is pretty much the best Perl scripting editor (Windows). It is now free to use (see DzSoft's Order Page) and gives you perfect control of subroutines and variables.

Perl is the "mother" of many scripting languages. The main advantages of Perl are that it is always there on Linux/Unix (no installation required) and that it is perfect for system maintenance and scripting due to its sophisticated string manipulation/regex features and its solid locking features for text files in multiuser use. I use Perl for system scripting (much easier and more comfortable than shell scripting) and even for the web gui itself as a cgi application.


Setup kernelbased SMB server on OmniOS or Solaris with ACL

If you are a Windows admin you may feel comfortable with the kernel-based Solaris SMB server.
If you are a Linux/Unix admin with SAMBA knowledge you need to know the following:

1. Enable SMB shares
Go to the napp-it menu ZFS Filesystems and click on "off" in the row of a filesystem under SMB.

2. Edit special SMB server settings
Go to the napp-it menu Services > SMB > properties.

3. ACL
The Solaris SMB server always and only uses NFSv4 ACLs, with Windows SIDs as reference instead of Unix uids.
NFSv4 ACLs are nearly identical to Windows NTFS ACLs regarding features and inheritance.

4. Avoid using classic Unix permissions like 755.
They are translated into an ACL setting with the same permissions as explicit folder ACLs. All inheritance settings are deleted, as they are not available with classic Unix permissions.

acl_options.png

5. ACL inheritance, a killer feature
ACL inheritance allows you to set basic ACLs on a top-level folder that are inherited by subfolders. You can stop inheritance on a subfolder and replace it with explicit ACLs for this folder, and for deeper ones when inheritance is enabled.

inherit.png

6. Share level ACL
ACLs are file or folder properties. If you want to restrict access additionally or temporarily, you can use share-level ACLs. These are ACLs on a control file /pool/filesystem/.zfs/shares/filesystem. This file is created when you enable a share and deleted when you disable it, so this setting is not persistent. With napp-it, share ACLs are stored as ZFS properties and the last settings can be restored when you re-enable a share.

smbshare.png
 
Save energy on your napp-it backupserver

Energy costs have multiplied since last year. This is really a problem for a backup server that is up 24/7 when you only want to back up your storage server once or a few times a day, especially as ongoing incremental ZFS replications finish within minutes.

A money-saving solution is to power up the backup server remotely via IPMI, sync the filesystems via ZFS replication and power off the backup server when the replications are finished. To simplify and automate this, I have created a script for a napp-it 'other job' on your storage server.

script.png

How to setup

The script is attached and in current napp-it 22.dev, see
/var/web-gui/_log/jobs/other_job_examples/remote_up_down_ipmi.pl


1. Edit this script

(e.g. with the free DzSoft Perl Editor, see DzSoft's Order Page) and enter settings like IPMI ip, user and password.

2. Add a napp-it 'other job' on your storageserver with the following action:
perl /var/web-gui/_log/jobs/other_job_examples/remote_up_down_ipmi.pl


Ignore the return values of the script or set "error" as the return value for errors.
Enable auto, set the job active and execute it according to your needs, e.g. every day at 23:00.

Whenever the job is executed, the backup server is powered up and the replications are done, followed by a soft power down.

create_other_job.png


3. Add replication jobs on your backup server, activate the jobs and set auto=every minute
with the timetable every / every / every / once (this means: start once at bootup)

Whenever the backupserver is powered up, replications are executed.

replicate_jobs.png


To test the script start it from console ex via Putty
perl /var/web-gui/_log/jobs/other_job_examples/remote_up_down_ipmi.pl

console.png
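
The power up/down part of such a job can be sketched in a few lines. This is not the attached Perl script, only a minimal illustration in Python that assumes the ipmitool command line client is installed on the storage server. The function name, the IP address and the credentials are hypothetical placeholders.

```python
from typing import List


def ipmi_power_command(host: str, user: str, password: str, action: str) -> List[str]:
    """Build an ipmitool command line for a chassis power action.

    action is one of "on", "soft" (ACPI soft shutdown) or "status".
    """
    allowed = {"on", "soft", "status"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-P", password,
        "chassis", "power", action,
    ]


if __name__ == "__main__":
    # Hypothetical values: use the IPMI address and login of your backup server.
    cmd = ipmi_power_command("192.168.1.50", "ADMIN", "secret", "on")
    # Only print the command here; run it with subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

A job would call the "on" command first, wait until the replications on the backup server are done and then send "soft".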


Without ipmi, use a Tasmota plug to power up the backupserver

(or a simple mechanical power timer plug)
 

How to plan, setup, deploy or recover a Solaris based Unix OS

I prefer Solaris-based Unix operating systems due to their stability and maintainability, their minimalism and resource efficiency, their self-sufficiency and their fast and easy recovery.


Stability:

Oracle Solaris 11.4 with native ZFS is a commercial Unix with support until at least 2034. A subscription is mandatory and gives access to support, newer service releases and bugfixes. For noncommercial and development use you can download the free beta Solaris 11.4 cbe. On top of the minimalistic console setup, you can install a desktop GUI. In my tests Solaris with native ZFS was often the fastest ZFS storage server (see Oracle Solaris - Wikipedia).

The Solaris forks around Illumos (e.g. OmniOS, OpenIndiana, SmartOS, Tribblix) are open source with Open-ZFS, the free ZFS based on the open-sourced ZFS of the former Sun OpenSolaris. ZFS features in Illumos are mostly in sync with the other Open-ZFS systems (BSD, OSX, Linux and Windows), see illumos.


My main platform for a napp-it server is OmniOS. It is an ultra-minimalistic and very stable, console-managed server distribution that comes as a bloody release (ongoing Illumos, similar to OpenIndiana, the de facto successor of the former OpenSolaris with an additional desktop option), a stable release every 6 months and a long-term stable release every two years. A reason for its robustness and stability is that there are dedicated software repositories for the current bloody, for every stable and for every long-term stable release. This means you can be sure that the regular, often biweekly security and bugfix updates via 'pkg update' do not add unwanted new software versions, features or add-ons. For a release update you just switch repositories and run 'pkg update'. Additionally there are two OmniOS repositories: core with the essentials and extra with additional services like databases or S3 sharing (see IPS Repositories). More applications for Illumos are in other repositories, e.g. SFE (Quick links to IPS Repositories | SFE - Software Packages for Solaris, OpenIndiana and OmniOS) or pkgin (Index of /packages/SmartOS/). OmniOS gives free access to all releases and fixes but offers a commercial support option for the OS. You can also donate as a patron (see Commercial Support).


omnios.png


Setup:

To set up the server, download the OS installer (iso or usb image), create a bootable cd/dvd or USB stick (see the tools post above) and boot from it. Select a system disk for the setup (keep rpool as name). During or after the initial setup, enable dhcp for your nic.

When the OS setup is done (around a minute with OmniOS), add napp-it (web gui and setup of storage-related services). Log in as root and enter:
wget -O - www.napp-it.org/nappit | perl

The napp-it storage server setup takes a few minutes. When it is ready, it shows the ip for the web gui, e.g. http://192.168.1.10:81. Before using it, update pkg and the OS to the newest state via:

pkg update pkg && pkg update
and reboot when requested.

Then re-enter your root password to allow SMB access as root:
passwd root

That's it, you are ready to set up your storage server via browser, e.g. http://192.168.1.10:81
On problems you can rerun the wget installer. It always gives you napp-it free, which you can then update.


Minimalism, resource efficiency and self-sufficiency:

OmniOS is one of the most minimalistic options for a full-featured ZFS server. The OmniOS boot iso is smaller than 300 MB and fits on a CD or a 1GB usb stick. It's hard to find another enterprise-class distribution with all ZFS storage services included (FC/iSCSI, NFS, SMB, SSH, www etc) that is as fast, as compact and as quick to install and set up. A basic setup with all services occupies around 3GB on disk and includes Comstar (the enterprise stack for FC/IB/iSCSI), NFS and the kernel-based SMB server, all developed by Sun and part of the OS. If you check kernel RAM usage, you are at a little more than 2GB RAM for the OS. This low RAM need is perfect for an AiO system with ESXi but also for a barebone setup, as it leaves more of the available RAM for ZFS read/write caching and performance instead of OS or management.

iso.png

rpoolsize.png

memory.png

While the initial space consumption of rpool is ultra low, allow enough space for swap and dump, for boot environments (bootable snaps of former OS states) and for the additional space needed by some OS updates with their boot environments. On ESXi I currently use a 35 GB boot disk. For a barebone setup you should use a small SSD of 50GB or more (100GB with a larger amount of RAM, as swap and dump size depend on RAM). While Solaris RAM needs are quite low, you want RAM for ZFS performance. With a minimal 3GB RAM setup all ZFS reads/writes must go directly to disk. Without fast NVMe disks this is very slow. You want RAM for data/metadata read/write caching, so prefer to have 8GB or more. On my AiO ESXi dev system I have assigned 12GB to OmniOS, with most of it used for read/write caching.

arc.png


Deployment under ESXi

For ESXi you can download one of my free and fully configured templates and deploy it within ESXi. You can also create a VM manually (30GB disk, 4GB RAM or more, HBA in passthrough mode, fast vmxnet3 vnic, as Open VMware tools are installed per default). When you use passthrough devices you must reserve all assigned RAM exclusively for the VM. (You should always reserve/restrict RAM for a ZFS storage VM, as ZFS wants to use all available RAM for caching, which leaves no RAM for other VMs.)

After an individual setup you can save your own template. Use the VMware ovftool to create a single file template.

esxisettings.png


Recovery

The first step on problems is to boot into a boot environment (BE) with a former OS state. Be aware that all settings that you make then exist only in this boot environment. To use them persistently, make the current BE the default one for reboots.

To be prepared for a damaged boot disk, back up settings, e.g. via a Job > Backup. On problems, reinstall the OS to a new disk (or re-deploy an ESXi template), import the datapool and optionally restore napp-it and user settings, either manually (see folders /var/web-gui/_log/*, pool/backup_appliance and users with their uid) or via napp-it, e.g. Comstar > full HA backup/restore or a backup job and User > restore.

To re-register VMs in ESXi, use the datastore browser and click on the .vmx file. On the next bootup, answer the question whether you moved or copied the VM.

For a more complex setup (avoid this, VM server and storage server should be as minimalistic as possible) you can create an ongoing backup for disaster recovery. This is a replication job with your current BE as source and a datapool as destination that you can run e.g. daily. To recover, do a minimalistic OS setup, restore the BE and boot into it.

disasterbackup.png


Monitor fillrate regularly
If rpool is full (due to a crash dump, monitor or audit logs, or many boot environments) you can only boot into maintenance mode.
Try to log in as root and destroy an older boot environment:

beadm list
beadm destroy oneoflisted

or boot from a different disk, import rpool and destroy some files or boot environments, or the dump or swap zvol
or add the boot disk to another system and import rpool there
or boot the Solaris OS installer with the shell option and import rpool
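
A regular fill rate check can be scripted so that you get a warning before rpool is full. The following is only a sketch in Python, not a napp-it function. It assumes the output of `zpool list -H -o name,capacity` (one pool per line, capacity as a percent value); the function names and the 80% limit are arbitrary choices for illustration.

```python
import subprocess
from typing import Dict, List


def parse_zpool_capacity(output: str) -> Dict[str, int]:
    """Parse `zpool list -H -o name,capacity` output into {pool: percent used}."""
    result = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        name, cap = fields
        cap = cap.rstrip("%")
        if cap.isdigit():  # skip pools that report "-" instead of a value
            result[name] = int(cap)
    return result


def pools_above(capacities: Dict[str, int], limit_percent: int = 80) -> List[str]:
    """Return the names of all pools at or above the given fill rate."""
    return sorted(name for name, cap in capacities.items() if cap >= limit_percent)


def current_capacities() -> Dict[str, int]:
    """Ask the local system for the fill rate of all imported pools."""
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,capacity"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_zpool_capacity(out)


if __name__ == "__main__":
    # Sample data instead of a live call, so the sketch runs anywhere.
    sample = "rpool\t91%\ntank\t42%\n"
    for pool in pools_above(parse_zpool_capacity(sample)):
        print(f"warning: pool {pool} is filling up")
```

On a real server you would pass `current_capacities()` to `pools_above()` from a cron or napp-it 'other job'.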

To delete data from a 100% full ZFS pool
How To Delete Files on a ZFS Filesystem that is 100% Full – The Geek Diary
 
OmniOS (Solaris) server and Apple OSX clients

OmniOS with napp-it is a fast and easy server for Apple clients. You can access OmniOS via iSCSI, NFS or SMB (and rsync or S3) after a simple on/off ZFS filesystem setting.

filesystems.png


1.) iSCSI

iSCSI is a method to provide a zvol (a ZFS dataset treated as a block device) as a network target. Such a target can be connected by one Mac at a time (no concurrent use by several Macs), used there much like a removable local USB disk and formatted with any supported OSX filesystem, e.g. APFS. OSX does not include the initiator software needed to connect network targets. Unlike on Windows, you need a 3rd party initiator for OSX; google for "initiator osx" for available options.

To provide an iSCSI target, use the napp-it menu "ZFS Filesystems" and activate an iSCSI target in the row of a filesystem under iSCSI, with a settable size. This creates a zvol below this filesystem with a name that includes the guid, and a target with all needed settings. You can replicate such a zvol like other ZFS filesystems. To re-enable such a replicated zvol as a target, simply set iSCSI to on again on the destination server. Such a sharing "as a filesystem property" with a simple on/off setting, similar to NFS and SMB, simplifies target handling with Comstar, the enterprise-class FC/iSCSI stack in Solaris-based operating systems (see COMSTAR and iSCSI Technology (Overview) - Oracle Solaris Administration: Devices and File Systems).

iscsi.png


2.) NFS (Network FileSystem)

NFS (originally developed by Sun, like ZFS) is a method to share a regular ZFS filesystem with one or more hosts. With NFS v3 it is a very simple and fast method to connect a ZFS filesystem via ip. On Solaris-based operating systems, NFS is part of the OS and a ZFS filesystem property. To activate NFS sharing, use the napp-it menu "ZFS Filesystems" and set NFS to on or off for a filesystem under NFS.

As NFS v3 does not provide authentication (user login with name/password) or authorisation (access restrictions based on verified users), NFS is a sharing method for trusted networks where simplicity and performance are the main concern. Often you use NFS for VM or video storage. A (fakeable) minimal access restriction can be set based either on the client ip or on the Unix uid of the creator of a file. The uid of a created file depends on the client OS. It is either the uid of the client user or "nobody". If you want to access files via NFS from several clients, prefer a fully open permission setting on the files (e.g. allow everyone@ at least modify permissions, optionally force such a setting recursively in menu ZFS Filesystems under Folder ACL). To avoid permission modifications from clients, set aclmode (ZFS property) to restricted.

If you want to restrict access based on the client ip when enabling a share, enter for example rw=@192.168.1.0/24,root=@192.168.1.0/24 instead of the simple "on". The root= option is similar to no_root_squash on Linux and allows root access to all files on a share.
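
The ip-based share option is easy to mistype, so here is a small sketch in Python that builds it from a subnet and validates the subnet first. The function names are hypothetical and tank/data is only the example filesystem used in this post; the resulting `zfs set sharenfs=...` call is what you would otherwise enter by hand.

```python
import ipaddress
from typing import List


def sharenfs_value(subnet: str, allow_root: bool = True) -> str:
    """Build a sharenfs option string that restricts access to one subnet."""
    # Raises ValueError for typos or for host bits set (e.g. 192.168.1.10/24).
    network = ipaddress.ip_network(subnet, strict=True)
    value = f"rw=@{network}"
    if allow_root:
        value += f",root=@{network}"
    return value


def zfs_sharenfs_command(filesystem: str, subnet: str) -> List[str]:
    """Return the zfs command line that applies the restricted share setting."""
    return ["zfs", "set", f"sharenfs={sharenfs_value(subnet)}", filesystem]


if __name__ == "__main__":
    print(" ".join(zfs_sharenfs_command("tank/data", "192.168.1.0/24")))
```

Leaving out the root= part (allow_root=False) keeps the ip restriction but no longer gives a client root full access to all files.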

nfs.png

To connect an NFS share from OSX:

Click on "Go" in finder.
Click on "connect to server"
Enter the following: "NFS://<device name or IP>/poolname/filesystemname"
ex: NFS://192.168.2.1/tank/data

If SMB is an alternative, prefer SMB over NFS. It is about as fast and offers access control. If you enable NFS+SMB for a filesystem and want to restrict SMB access, set Share ACL in menu ZFS Filesystems.


3. SMB

On OmniOS you can provide SMB shares via SAMBA or via the multithreaded, kernel-based SMB server that is part of the Solaris operating systems. Mostly you use the kernel-based one due to its ease of use/zero config behaviour, its Windows NTFS-like ACLs with inheritance and its use of Windows SIDs (superior to standard Unix permissions based on Unix uids). To activate SMB sharing, use the napp-it menu "ZFS Filesystems" and set SMB to on or off. As additional sharing options you can enable guest access or access-based enumeration (ABE), or set share ACLs.

smb.png

To connect an SMB share from OSX:

Click on "Go" in finder.
Click on "connect to server"
Enter the following: "SMB://<device name or IP>/sharename"
ex: SMB://192.168.2.1/data


3.1 SMB and Bonjour/mDNS

This is the Apple method to detect network shares, printers or Timemachine backup devices. OmniOS/ napp-it enables mDNS in menu Services > Bonjour and Autostart.

Select:
-enable Bonjour
-enable advertising of SMB shares (shows OmniOS with a nice Xserve icon)
-enable Timemachine support (the SMB share is automatically offered in the Timemachine settings)
-select the SMB share that you want to advertise as Timemachine backup device

Timemachine support needs the OmniOS SMB APL extensions (kernel-based SMB server).

You can disable the OmniOS APL extensions (kernel-based Illumos SMB server) in a config file, e.g. apl, in
/etc/system.d (reboot required). Some Apple SMB features like Timemachine will then not work.

e.g. file /etc/system.d/apl

* disable APL file extension, ex on locking problems in Avid Media Composer
set smbsrv:smb2_aapl_use_file_ids=1

* or disable all APL extensions
set smbsrv:smb2_aapl_extensions=0

bonjour.png


3.2 ZFS Snaps and OSX

On Windows you have direct and zero config access to ZFS snaps via "Windows Previous Versions". On OSX there is no similar way, e.g. via Timemachine, but as ZFS snaps and shares are strictly a ZFS filesystem property you do not need to configure anything for snap access. The direct way to access ZFS snaps on OSX is:

Click on "Go" in finder.
Click on "connect to server"
Enter the following: "SMB://<device name or IP>/sharename/.zfs/snapshot"
ex: SMB://192.168.2.1/data/.zfs/snapshot

You can then access and search all ZFS snapshots via Finder (ZFS snaps are readonly)
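
Once the snapshot folder is connected, every snapshot is a plain read-only directory, so older versions of a file can also be looked up with a script. The following is a minimal sketch in Python; the mount point /Volumes/snapshot and the file name are hypothetical and depend on how Finder mounted the folder on your Mac.

```python
from pathlib import Path
from typing import List


def list_snapshots(snapshot_dir: Path) -> List[str]:
    """Return the names of all snapshots (one subfolder per ZFS snap)."""
    return sorted(p.name for p in snapshot_dir.iterdir() if p.is_dir())


def find_versions(snapshot_dir: Path, relative_file: str) -> List[Path]:
    """Return every copy of one file found in the snapshots, sorted by snap name."""
    candidates = (snapshot_dir / snap / relative_file for snap in list_snapshots(snapshot_dir))
    return [path for path in candidates if path.is_file()]


if __name__ == "__main__":
    # Hypothetical mount point of SMB://192.168.2.1/data/.zfs/snapshot
    snapshot_dir = Path("/Volumes/snapshot")
    if snapshot_dir.is_dir():
        for version in find_versions(snapshot_dir, "documents/report.txt"):
            print(version)
```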

snaps.png


3.3 ZFS filesystem as Timemachine device (Time Capsule)

If you have advertised a filesystem as Timemachine device in 3.1, you can select it in the Timemachine settings. The other option is to connect any SMB share. When connected, this share is also available as a Timemachine target.

SMB Settings (Service > SMB > properties)
min_protocol=2.1
oplock_enable=true
signing_enabled=false
signing_required=false

ZFS filesystem settings for Timemachine ex timecapsule
sync disabled
nbmand on
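
If you prefer the console over the napp-it menus, the same settings can be written down as commands. This is only a sketch in Python that prints the command lines. It assumes that the SMB properties from the menu can be set with `sharectl set -p ... smb` and the filesystem properties with `zfs set`; the pool name tank is a hypothetical placeholder and timecapsule is the example filesystem from above.

```python
from typing import List


def timemachine_commands(filesystem: str) -> List[List[str]]:
    """Return the command lines for the SMB and ZFS settings listed above."""
    smb_properties = {
        "min_protocol": "2.1",
        "oplock_enable": "true",
        "signing_enabled": "false",
        "signing_required": "false",
    }
    zfs_properties = {"sync": "disabled", "nbmand": "on"}

    commands = [["sharectl", "set", "-p", f"{key}={value}", "smb"]
                for key, value in smb_properties.items()]
    commands += [["zfs", "set", f"{key}={value}", filesystem]
                 for key, value in zfs_properties.items()]
    return commands


if __name__ == "__main__":
    # Print only, so nothing is changed on the system by accident.
    for command in timemachine_commands("tank/timecapsule"):
        print(" ".join(command))
```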

tm.png


3.4 NFS4 ACL and OSX

The Solaris kernelbased SMB server uses only and always Windows ntfs alike file/ folder ACL with inheritance, Share ACL, Windows compatible local SMB groups and Windows SID as security reference. None of them are known under OSX. This means that while OSX must obey all ACL settings, you cannot view or edit them on OSX. You must either set them on napp-it or from a Windows machine.

acl.PNG

3.5 Tuning and performance

For best performance ex on a 10G+ network, increase ip and NFS buffers/servers in napp-it menu System > Tuning. Another option is to use Jumboframes (must be supported and enabled on any host, client or switch). In general the multithreaded OmniOS kernelbased SMB server with Open-ZFS is very fast and perfectly integrated into ZFS but not as fast as Oracle Solaris 11.4 with native ZFS.


4.0 OSX problems

On some problems, try a different setting but switch back if it does not solve the problem.

NFS and SMB client or server problems are more common on OSX than on Linux or Windows. One of the reasons is that Apple does not care much about standards from outside the Apple world. (SMB is basically the sharing method of the Windows world. Other servers like SAMBA or Solaris SMB follow Windows first and then optionally care about Apple.) Apple removes or adds features, no matter whether non-Apple equipment then only works with tweaks or no longer works at all.

First check the SMB server settings in menu Service > SMB properties.
oplock_enable=true
signing_enabled=false
signing_required=false

Then check ZFS properties of a shared filesystem
set nbmand to on
set aclinherit and aclmode to pass-through (menu ZFS filesystems > Folder ACL)

On remaining OSX problems like Timemachine not working, slow access, sudden disconnects or missing files on a share, the next tip is to update OSX and OmniOS to the newest available and supported release. On OSX, update at least to 10.15, which is under support until autumn 2022. From 2023 on, prefer at least OSX v11.


If an update does not solve a problem:
- google with the OSX release, e.g. "OSX 11 SMB problem", as each OSX release has its own problems and solutions,
and try the suggested solutions

- try to connect via SMB1 (slower than SMB2/3) via
Finder > Go > connect to server: cifs://ip/share e.g. cifs://192.168.2.1/data
- search or ask for help on the mailing lists illumos-discuss or omnios-discuss (Topicbox)


5. More manuals, see https://www.napp-it.org/manuals/index_en.html
 
Use case and performance considerations for an OmniOS/OpenIndiana/Solaris based ZFS server
This is what I am asked quite often

If you simply want the best performance, durability and security, order a server with a very new CPU with a frequency > 3GHz and 6 cores or more, 256 GB RAM and a huge flash-only storage with 2 x 12G multipath SAS (10dwpd) or NVMe in a multi mirror setup, with datacenter quality powerloss protection to protect data on a power loss during writes or background garbage collection. Do not forget to order twice, as you need a backup at a second location, at least for a disaster like fire, theft or ransomware.

Maybe you can follow this simple suggestion, but mostly you are looking for a compromise between price, performance and capacity under a given use scenario. Be aware that when you define two of the three parameters, the third is a result of your choice, e.g. low price + high capacity = low performance.

When your main concern is a well-balanced, workable solution, you should not start with a price restriction but with your use case and the performance needed for it (low, medium, high, extreme). With a few users and mainly office documents, your performance need is low. Even a small server with a 1.5 GHz dualcore CPU, 4-8 GB RAM and a mirror of two SSDs or HDs can be good enough. Add some external USB disks for a rolling daily backup and you are ready.

If you are a media firm with many users who want to edit multitrack 4k video from ZFS storage, you need an extreme solution regarding pool performance (> 1GB/s sequential read/write), network (multiple 10G) and capacity according to your needs. Maybe you come to the conclusion that you prefer a local NVMe for hot data and a medium-class, disk-based storage for shared file access and versioning only. Do not forget to add a disaster backup solution.

After you have defined the performance class/use case (low, medium, high, extreme), select needed components.


CPU
For lower performance needs and 1G networks, you can skip this. Even a cheap dual/quadcore CPU is good enough. If your performance need is high or extreme, with a high throughput in a 10G network, or when you need encryption, ZFS is quite CPU hungry, as you can see in https://www.napp-it.org/doc/downloads/epyc_performance.pdf. If you have the choice, prefer higher frequency over more cores. If you need sync write (VM storage or databases), avoid encryption, as encrypted small sync writes are always very slow, and add an Slog for disk-based pools.


RAM
Solaris-based ZFS systems are very resource efficient due to the deep integration of iSCSI, NFS and SMB into the Solaris kernel, which was developed around ZFS from the beginning. You need less than 3 GB for a 64bit Solaris-based OS itself to be stable with any pool size. Use at least 4-8 GB RAM to allow some caching for low to medium needs with only a few users.

memory.png

As ZFS uses most of the RAM (unless it is dynamically demanded by other processes) for ultrafast read/write caching to improve performance, you may want to add more RAM. Per default Open-ZFS uses 10% of RAM for write caching. As a rule of thumb you should collect all small writes < 128k in the RAM-based write cache, as smaller writes are slower or very slow. As you can only use half of the write cache until the content must be written to disk, you want at least 256k write cache, which you can have with 4 GB RAM in a single user scenario. This RAM need for write caching scales with the number of users that write concurrently, so add around 0.5 GB RAM per active concurrent user.
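
The rule of thumb can be put into numbers. The sketch below is only my reading of the paragraph above, written in Python: 10% of RAM as Open-ZFS write cache, 4 GB as the base for a single writing user and 0.5 GB for every further concurrently writing user. The function names are hypothetical and the result is a starting point, not a guarantee.

```python
def write_cache_gb(ram_gb: float, fraction: float = 0.10) -> float:
    """Size of the RAM-based write cache at the Open-ZFS default of 10% of RAM."""
    return ram_gb * fraction


def ram_for_writers(concurrent_writers: int,
                    base_gb: float = 4.0,
                    per_writer_gb: float = 0.5) -> float:
    """Rule-of-thumb RAM: base for one writer plus 0.5 GB per additional writer."""
    if concurrent_writers < 1:
        raise ValueError("need at least one writing user")
    return base_gb + per_writer_gb * (concurrent_writers - 1)


if __name__ == "__main__":
    for writers in (1, 5, 20):
        ram = ram_for_writers(writers)
        print(f"{writers} writers: {ram} GB RAM, {write_cache_gb(ram):.2f} GB write cache")
```

RAM wanted for read caching comes on top of this value.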

Oracle Solaris with native ZFS works differently. The RAM-based write cache holds the last 5s of writes, which can consume up to 1/8 of total RAM. In general this often leads to similar RAM needs as OI/OmniOS with Open-ZFS. On a faster 10G network with a max write rate of 1 GB/s this means 8GB RAM min + the RAM wanted for read caching.

Most of the remaining RAM is used for ultrafast RAM-based read caching (Arc). The read cache works only for small io, with a read last/read most optimization. Large files are not cached at all. Cache hits are therefore for metadata and small random io. Check napp-it menu System > Basic Statistic > Arc after some time of storage usage. Unless you have a use scenario with many users, many small files and a high volatility (e.g. a larger mail server), the cache hit rate should be > 80% and the metadata hit rate > 90%. If your results are lower you should add more RAM or use high performance storage like NVMe, where caching is not so important.
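
The cache hit rate can also be calculated from the raw Arc counters. This is a minimal sketch in Python, not what the napp-it menu does internally. It assumes the output format of `kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses` (counter name and value per line); the function names are hypothetical and the 80% limit is the target from the paragraph above.

```python
from typing import Dict


def parse_kstat(output: str) -> Dict[str, int]:
    """Parse `kstat -p` lines like 'zfs:0:arcstats:hits<TAB>900' into {name: value}."""
    stats = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            # Keep only the last part of the name, e.g. "hits"
            stats[parts[0].rsplit(":", 1)[-1]] = int(parts[1])
    return stats


def hit_rate_percent(hits: int, misses: int) -> float:
    """Share of read requests that were served from the Arc."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


if __name__ == "__main__":
    # Sample counters instead of a live kstat call.
    sample = "zfs:0:arcstats:hits\t900\nzfs:0:arcstats:misses\t100\n"
    stats = parse_kstat(sample)
    rate = hit_rate_percent(stats["hits"], stats["misses"])
    verdict = "ok" if rate > 80 else "consider more RAM or faster storage"
    print(f"Arc hit rate {rate:.1f}%: {verdict}")
```

The counters are cumulative since boot, so check them after the server has been in normal use for a while.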

If you read about 1GB RAM per TB storage, forget it. It is a myth unless you activate RAM-based realtime dedup (not recommended at all; when dedup is needed, use fast NVMe as a special vdev mirror for dedup). The needed RAM size depends on the number of users and files and on the wanted cache hit rate, not on pool size.

arc.png

L2Arc
L2Arc is an SSD or, at best, an NVMe that can be used to extend the RAM-based Arc. L2Arc is not as fast as RAM but can increase the cache size when more RAM is not an option, or when the server is rebooted more often, as L2Arc is persistent. As L2Arc needs RAM for its own management, do not use more than say 5x RAM as L2Arc. Additionally you can enable read ahead on L2Arc, which may improve sequential reads a little (add "set zfs:l2arc_noprefetch=0" to /etc/system or use napp-it System > Tuning).


Disk types
RAM can help a lot to improve ZFS performance with the help of read/write caching. For larger sequential writes and reads or for many small io, it is only raw storage performance that counts. If you look at the specs of disks, the two most important values are the sequential transfer rate for large transfers and the iops that count when you read or write small datablocks.

Mechanical disks
On mechanical disks you find values of around 200-300 MB/s max sequential transfer rate and around 100 iops. As a Copy on Write filesystem like ZFS is not optimized for a single user/single datastream load, it spreads data quite evenly over the pool for the best multiuser/multithread performance. It is therefore affected by fragmentation, with many smaller datablocks spread over the whole pool, where performance is limited more by iops than by sequential values. On average use you will often see no more than 100-150 MB/s per disk. When you enable sync write on a single mechanical disk, write performance is not better than say 10 MB/s due to the low iops rating.

Desktop Sata SSD
can achieve around 500 MB/s (6G Sata) and a few thousand iops. Often iops values from specs are only valid for a short time until performance drops down to a fraction on steady writes.

Enterprise SSDs
can hold their performance and offer powerloss protection (PLP). Without PLP, the last writes are not safe on a power outage during a write, and neither is data already on disk that is being moved by background operations like firmware-based garbage collection (which keeps SSD performance high).

Enterprise SSDs are often available as 6G Sata or 2 x 12G multipath SAS. When you have an SAS HBA, prefer 12G SAS models due to the higher performance (up to 4x faster than 6G Sata). SAS is also full duplex while Sata is only half duplex, and it has a more robust signalling with up to 10m cable length (Sata 1m). The best SAS SSDs can achieve up to 2 GB/s transfer rate and over 300k iops on steady 4k writes. SAS is also a way to easily build a storage with more than 100 hotplug disks with the help of SAS expanders.

NVMes are the fastest option for storage. The best, like Intel Optane 5800x, are rated at 1.6M iops and 6.4 GB/s transfer rate. In general, desktop NVMe lack powerloss protection and cannot hold their write iops on steady writes, so prefer datacenter models with PLP. While NVMe are ultrafast, it is not as easy to use many of them as each wants a 4x pci lane connection (pci-e card, M.2 or oculink/U.2 connector). For a larger capacity, SAS storage is often nearly as fast and easier to implement, especially when hotplug is wanted. NVMe is perfect for a second smaller high performance pool for databases/VMs, or to tune a ZFS pool with an Slog for faster sync write on disk based pools, a persistent L2Arc or a special vdev mirror.


ZFS Pool Layout

ZFS groups disks to a vdev and stripes several vdevs to a pool to improve performance or reliability. While a ZFS pool from a single disk vdev without redundancy performs as described above, a vdev from several disks can behave better.

Raid-0 pool (ZFS always stripes data over vdevs in a raid-0)
You can create a pool from a single disk (this is a basic vdev) or a mirror/raid-Z vdev and add more vdevs to create a raid-0 configuration. In theory, overall read/write performance is number of vdevs x performance of a single vdev, as each must only process 1/n of the data. Real world performance does not reach a factor n; with two vdevs expect more like 1.5 to 1.8 x, depending on disks or disk caches, and the gain per vdev decreases with more vdevs. Keep this in mind when you want to decide if ZFS performance is "as expected".

A pool from a single n-way mirror vdev
You can mirror two or more disks to create a mirror vdev. Mostly you mirror to improve data security, as write performance of an n-way mirror is equal to a single disk (a write is done when it is on all disks). As ZFS can read from all disks simultaneously, read performance and read iops scale with n. When a single disk is rated at 100 MB/s and 100 iops, a 3way mirror can give up to 300 MB/s and 300 iops. If you run a napp-it Pool > Benchmark with a singlestream read benchmark vs a fivestream one, you can see the effect. In a 3way mirror any two disks can fail without a dataloss.

A pool from multiple n-way mirror vdevs
Some years ago a ZFS pool from many striped mirror vdevs was the preferred method for faster pools. Nowadays I would use mirrors only when one mirror is enough or when an easy extension to a later Raid-10 setup, ex from 4 disks, is planned. If you really need performance, use SSD/NVMe as they are by far superior.

A pool from a single Z1 vdev
A Z1 vdev is good to combine up to say 4 disks. Such a 4 disk Z1 vdev gives the capacity of 3 disks. One disk of the vdev is allowed to fail without a dataloss. Unlike other raid types like raid-5, a read error in a degraded Z1 does not mean a lost pool but only one file, affected by the read error, that is reported as damaged. This is why Z1 is much better than raid-5 and named differently. Sequential read/write performance of such a vdev is similar to a 3 disk raid-0 but iops is only like a single disk (all heads must be in position prior to an io).

A pool from a single Z2 vdev
A Z2 vdev is good to combine say 5-10 disks. A 7 disk Z2 vdev gives the capacity of 5 disks. Any two disks of the vdev are allowed to fail without a dataloss. Unlike other raid types like raid-6, a read error in a fully degraded Z2 does not mean a lost pool but only one file, affected by the read error, that is reported as damaged. This is why Z2 is much better than raid-6 and named differently. Sequential read/write performance of such a vdev is similar to a 5 disk raid-0 but iops is only like a single disk (all heads must be in position prior to an io).

A pool from a single Z3 vdev
A Z3 vdev is good to combine say 11-20 disks. A 13 disk Z3 vdev gives the capacity of 10 disks. Any three disks of the vdev are allowed to fail without a dataloss. There is no equivalent to Z3 in traditional raid. Sequential read/write performance of such a vdev is similar to a 10 disk raid-0 but iops is only like a single disk (all heads must be in position prior to an io).

A pool from multiple raid Z[1-3] vdevs
Such a pool stripes the vdevs, which means sequential performance and iops scale with the number of vdevs (not linearly, with the same diminishing gain as in a raid-0 with more disks).
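The rules of thumb for a single vdev from the sections above (capacity, sequential performance, iops, allowed disk failures) can be put into a small Python sketch. The names VdevEstimate, mirror and raidz are made up for this example, and the factors are the "from math" upper limits relative to one disk, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VdevEstimate:
    capacity_disks: int  # usable capacity, in units of one disk
    seq_read_x: int      # sequential read, as a multiple of one disk
    seq_write_x: int     # sequential write, as a multiple of one disk
    read_iops_x: int     # read iops, as a multiple of one disk
    may_fail: int        # disks that can fail without dataloss


def mirror(disks: int) -> VdevEstimate:
    """n-way mirror: reads scale with n, writes behave like a single disk."""
    if disks < 2:
        raise ValueError("a mirror needs at least 2 disks")
    return VdevEstimate(1, disks, 1, disks, disks - 1)


def raidz(disks: int, parity: int) -> VdevEstimate:
    """Z1/Z2/Z3: sequential like a raid-0 of the data disks, iops like one disk."""
    if parity not in (1, 2, 3) or disks <= parity:
        raise ValueError("need parity 1..3 and more disks than parity")
    data = disks - parity
    return VdevEstimate(data, data, data, 1, parity)


for name, est in [("3way mirror", mirror(3)), ("4 disk Z1", raidz(4, 1)),
                  ("7 disk Z2", raidz(7, 2)), ("13 disk Z3", raidz(13, 3))]:
    print(name, est)
```

For a pool from several vdevs, multiply the values by the number of vdevs and expect less than that in the real world.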

Many small disks vs less larger disks
Many small disks can be faster but are more power hungry, and as performance improvement is not linear and failure rate scales with the number of parts, I would always prefer fewer but larger disks. The same is true for the number of vdevs. Prefer a pool from fewer vdevs. If you have a pool of say 100 disks and an annual failure rate of 5%, you have 5 bad disks per year. If you assume a resilver time of 5 days per disk, you can expect 3-4 weeks per year where a resilver is running with a noticeable performance degradation.
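The numbers in this example are simple arithmetic. A minimal Python sketch to play with your own disk count, failure rate and resilver time (the function name is made up for this example):

```python
def resilver_load(disks, afr_percent, resilver_days_per_disk):
    """Expected bad disks per year and the time spent resilvering them."""
    failures = disks * afr_percent / 100
    days = failures * resilver_days_per_disk
    return failures, days, days / 7


failures, days, weeks = resilver_load(100, 5, 5)
print(f"{failures:.0f} bad disks/year, {days:.0f} days = {weeks:.1f} weeks of resilvering")
```

This assumes that resilvers do not overlap and that every resilver takes the same time.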

Special vdev
Some high end storages offer tiering where active or performance sensitive files can be placed on a faster part of an array. ZFS does not offer traditional tiering but you can place critical data on a faster vdev of a ZFS pool based on their physical size (small io), type (dedup or metadata) or the recsize setting of a filesystem. The main advantage is that you do not need to copy files around, so this is often a superior approach, as mostly the really slow data is data with a small physical file or blocksize. As a lost vdev means a lost pool, always use special vdevs in an n-way mirror. Use the same ashift as all other vdevs (mostly ashift=12 for 4k physical disks) to allow a special vdev remove.

To use a special vdev, use menu Pools > Extend, select a mirror (best a fast SSD/NVMe mirror with PLP) with type=special. Allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks. This means you can force all data of a certain filesystem to the special vdev when you set the ZFS property "special_small_blocks" ex special_small_blocks=128K for a filesystem with a recsize setting smaller or equal. In such a case all small io and some critical filesystems are on the faster vdev others on the regular pool. If you add another vdev mirror load is distributed over both vdevs. If a special vdev is too full, data is stored on the other slower vdevs.
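The placement rule described above can be written down as a simplified Python sketch. This is only a model of the description in this post, not the ZFS allocator itself; the function and parameter names are made up, and "special is too full" is reduced to a simple flag.

```python
KIB = 1024


def goes_to_special(block_bytes, is_metadata, is_dedup_table,
                    special_small_blocks, special_has_space=True):
    """True when a block would be placed on the special vdev (simplified)."""
    if not special_has_space:
        return False  # data falls back to the other, slower vdevs
    if is_metadata or is_dedup_table:
        return True   # default content of the special class
    # user data only when small blocks are enabled and the block is small enough
    return special_small_blocks > 0 and block_bytes <= special_small_blocks


# Filesystem with special_small_blocks=128K
print(goes_to_special(64 * KIB, False, False, 128 * KIB))    # small data block
print(goes_to_special(1024 * KIB, False, False, 128 * KIB))  # 1M data block
print(goes_to_special(16 * KIB, True, False, 0))             # metadata
```

With recsize=128K and special_small_blocks=128K no data block of that filesystem is larger than the limit, so all of its data ends up on the special vdev.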

Slog
With ZFS all writes always go to the RAM-based write cache (there may be a direct io option in a future ZFS) and are written as a fast large transfer with a delay. On a crash during write, the content of the write cache is lost (up to several MB). Filesystems on VM storage or databases may get corrupted. If you cannot allow such a dataloss, you can enable sync write for a filesystem. This forces every write commit immediately to a ZIL area of the pool or to a fast dedicated Slog device that can be much faster than the pool ZIL area, and additionally, in a second step, as a regular cached write. Every bit that you write is written twice, once directly and once collected in the write cache. This can never be as fast as a regular write via the write cache. So an Slog is not a performance option but a security option when you want acceptable sync write performance. The Slog is never read except after a power outage to redo missing writes on next reboot, similar to the BBU protection of hardware raid.
Add an Slog only when you need sync write and buy the best that you can afford regarding low latency, high endurance and 4k write iops. The Slog can be quite small (min 10GB). Widely used are the Intel datacenter Optane.

Tuning
Beside the above "physical" options you have a few tuning options. For faster 10G+ networks you can increase tcp buffers or NFS settings in menu System > Tuning. Another option is Jumboframes that you can set in menu System > Network Eth ex to a "payload" of 9000. Do not forget to set all switches to highest possible mtu value or at least to 9126 (to include ip headers)

Another setting is ZFS recsize. For VM storage with filesystems on it I would set it to 32K or 64K (not lower as ZFS becomes inefficient then). For media data a higher value of 512K or 1M may be faster.

more, https://www.napp-it.org/doc/downloads/napp-it_build_examples.pdf
 

gea

How to run a Solaris based ZFS server
Set and forget is a bad idea


Prior to ZFS the main risk for your data was a crash during write. In such a case you had incomplete atomic writes resulting in corrupted data, corrupted filesystems or corrupted raid arrays. Examples are when data is written but metadata is not updated, when one disk in a mirror is written but not the other one, when a raid 5/6 stripe is not written to all disks in the array, or a combination of them. ZFS avoids such problems by design due to Copy on Write. This means that atomic write actions are done completely or discarded, with the former data state remaining valid.

Another main risk was silent data corruption (statistically by chance, the error rate depends on time and pool size) or data corruption in the data chain memory -> disk driver -> cabling -> disk bay -> disk. ZFS can detect both problems and repair them on the fly on reads from redundancy, due to real time checksums on data and metadata. You can scrub a pool regularly to find and repair errors and to get informed early if the error rate suddenly increases due to hardware problems.

Be happy that ZFS is not affected by these main risks of older filesystems, but there are remaining risks in daily use that you must care about.

A data disaster case
"No backup, no pity"

A data disaster is an event that cannot be covered by the ZFS security and redundancy concept, like theft, fire, overvoltage or hardware problems. For a data disaster case you need a disaster data backup. This means that you must have an up to date copy of all important data at a distant place, ex via removable disks or a remote backup server that is not affected by a problem on the main server.

A bootdisk disaster case

If a bootdisk fails with a more complicated setup (with additional services beside the Solaris included core services iSCSI, NFS or SMB, like cloud services, databases, mail or webserver), a re-install can be quite complicated and very time consuming. I would avoid a complicated setup. If VM server and storage are as minimalistic as possible, a simple and fast reinstall is enough to be up again. Other services should be virtualized, ex under a minimalistic ESXi. But when you already have a complicated setup, you should prepare an easy and fast, up to date (ex twice a day) system disaster recovery option. You can use ZFS replication to create bootable copies of your rpool with the configured operating system and your extras. Create a replication job with the current bootenvironment as source and the datapool as destination. Run this job once or several times per day to have a current system backup. For a disaster recovery onto a new disk, you need a minimal napp-it setup; then restore the fully up to date bootenvironment from datapool to rpool via a replication job, activate this bootenvironment and reboot. Done in less than 30 minutes.

Restore deleted, modified or Ransomware encrypted data from snap versioning
"No snap, no pity"

Copy on Write allows data versioning based on ZFS snaps without delay or initial space consumption as such a snap is a freeze of the former datablocks already on disk. Create one or more snapshot jobs to preserve readonly versions of your current data ex hold=364 for a job that runs each day at midnight. This gives access to your data state for any day of last year via Windows "previous versions" or snapfolder in an SMB share.

You should use a ZFS replication job to keep filesystems in sync with a removable or remote backup pool. For replications you can use a different keep/hold snap policy, ex keep 4 snaps for last hour, keep 24 snaps for last day or keep 364 snaps for last year.

Bugs or security holes
Keep your OS up to date!!


ZFS on a Solaris based server is very safe, as ZFS is there not just another filesystem among many others but the only one. Solaris has been developed together with ZFS since 2005. It is built around and on top of ZFS, with iSCSI and the kernel-based NFS and SMB services as part of the OS.

Solaris or OmniOS is an Enterprise Unix OS. Bugs are very rare as ZFS is very mature there, but new features can introduce new bugs. Discovered and known security holes should be closed as fast as possible as they would allow an attacker to compromise data remotely, ex due to Open-SSH bugs. This means that your OS release must be up to date and under support or maintenance.

Solaris 11.4 with native ZFS

If you use Oracle Solaris commercially, a support contract is mandatory and at least available until 2034. For noncommercial or development use, Solaris 11.4 CBE is freely available as a beta of the current Solaris. If you use the CBE edition, check the repository setting, as you only have access to the release repository but not the paid support one. If the support repo is set, change it via:

pkg set-publisher -G'*' -g http://pkg.oracle.com/solaris/release/ solaris
After this a "pkg update" gives you the newest Solaris 11.4 CBE

OmniOS based on Open-ZFS and features

If you use the OpenSource OmniOS based on the Solaris fork Illumos, you have a dedicated repository per stable OS release. Open-ZFS features are included in Illumos after an additional audit and release process.

A "pkg update" gives you the newest OS state of the currently installed release.

To update to a newer release you must sign out from current repository and select the repository for the newer release, see http://www.napp-it.org/doc/downloads/setup_napp-it_os.pdf

After this a "pkg update" gives you the next OS release with the former release as a bootenvironment to go back when needed.

While OmniOS is open source, you can donate as a Patron for further development or acquire a commercial support contract (Commercial Support) that offers help from the OmniOS developers.

Transactional data or VMs on ZFS

While Copy on Write can guarantee a valid ZFS filesystem after a crash or a sudden poweroff, databases or VMs with foreign filesystems are only safe when you enable sync write (a ZFS property). With sync enabled, all writes that are collected in the RAM-based write cache (lost on a crash) are also logged immediately to the pool (ZIL) or a faster Slog device. Missing atomic writes due to a crash that affect data consistency are redone on next reboot.


Bad disks

You should regularly check disk smart values and the number of checksum and iostat errors as they can indicate disk problems early. Expect 2-10% bad disks per year. A hotspare disk added to a pool can replace a bad disk automatically. Use only one or a few hotspare disks, as a flaky pool (ex due to a loose contact) can activate all hotspares with a confusing pool state. Run a ZFS scrub from time to time and set a mail alert job.

To be prepared for bad disks, create a visual map of your disk enclosure in menu Disks > Appliance Map or at least write down a list of WWN + serial number of each disk and the disk bay where the disk is in.

map.png

map2.png
On napp-it free use the eval period to create and printout a map.


Flash/SSD

SSDs (beside Intel Optane that works more like persistent memory) have introduced a new problem not known with mechanical disks. The smallest unit of an SSD is a page, which is composed of several memory cells and is usually 4 KB in size. Several pages on the SSD are summarized to a block. A block is the smallest unit of access on a SSD with up to several Megabytes in size. If you make a change to a 4KB file, for example, the entire block that 4K file sits within must be read, deleted and rewritten.

Such an update of a current block is slow and should be avoided via trim that enables an operating system to inform a NAND flash solid-state drive (SSD) which data blocks it can erase because they are no longer in use to allow a write without the prior erase/update cycle. The use of TRIM can improve the performance of writing data to SSDs and contribute to longer SSD life.

You can enable continuous trim for SSDs on OmniOS as a ZFS property. As autotrim affects performance negatively, you can optionally trim on demand, ex during the night. As there are data corruptions reported on some SSDs (mainly cheaper non-datacenter ones), you should evaluate trim prior to use.

Beside trim there is SSD garbage collection to improve SSD performance. This is an automated background process by which the firmware of a solid-state drive proactively eliminates the need for whole block erasures prior to every write operation. Garbage collection is essential because SSDs, which are initially extremely fast, become slower over time. The purpose of garbage collection is to increase efficiency by keeping as many empty blocks as possible so that, when the SSD has to write data, it doesn't have to wait for a block to be erased. As garbage collection moves data around in the background, data corruption on a power outage is possible without powerloss protection.

Powerloss protection of SSDs

When data is written to SSDs, it often lands in a faster cache area that is lost on a crash during write. This makes sync write useless, where it is essential that a successful write commit guarantees that data is safe on disk. Garbage collection moves data around in the background. A crash at that moment and you have a risk of dataloss/data corruption even with ZFS. While the risk may not be huge, it can only be avoided with datacenter SSDs that offer a powerloss protection that is more than a marketing joke.

So when you buy SSDs, prefer datacenter SSDs/NVMe with powerloss protection. You can also use a UPS to avoid a sudden power down. From my experience in Germany with a very stable power grid, I have seen as many outages due to UPS failures as due to power grid failures. A UPS needs maintenance and new batteries say every two years. At best, use a twin powered server with one psu connected to the UPS.

ECC RAM

Even without a disaster or lost data in unprotected caches, there is one remaining piece of hardware that can cause data corruption or dataloss without notice, and this is RAM. An undetected RAM error can cause ZFS to write bad data with correct checksums to disk. The risk is by chance over time and depends on RAM size. The bigger the RAM, the greater the risk. You should use ECC RAM that detects RAM errors and repairs them on the fly due to RAM redundancy (similar to raid redundancy with disks). If you want ZFS because it is the perfect solution for data security, then use it with ECC RAM.

Sometimes you hear about a "ZFS scrub to death" phenomenon. This means that ongoing read errors due to RAM errors without ECC cause false checksum repairs, resulting in a damaged pool. This is a myth. ZFS does not "repair to death". Above a certain error rate the result is a disk going offline due to too many read errors. If the number of RAM errors increases, a kernel panic is the most probable result. Additionally the Solaris fault management daemon fmd will take a whole device (ex HBA) out of service on ongoing problems.
 

gea

Filesystem monitoring in napp-it via fswatch

Filesystem monitoring is a new feature in the newest napp-it 23.dev (jan 29) in menu Service > Audit and Fswatch.
It logs events like file create, modify or delete. You can use it to monitor activities, to create alerts
when many files are modified in a short time (under development), ex due to a Ransomware attack, or to sync
modified files on demand, ex between a cloud/s3 and an SMB share based on snaps, as there is no working
file locking between them.

service.png

You can enable monitoring in the above service menu

Menu forms:
Report              alerts (under development)
Watched folderlist  1..3 watched folders, ex path1 path2 (blank separated)
                    use "path 1" with blanks in path

Include             Path must contain this regex to be logged
Exclude             Events are excluded when regex is in path

Options             default is -tnr
                    t = print timestamp
                    n = print events with a numeric trigger
                        (alternative is x to print cleartext)
                    r = recursive
                    d = log only directory access, not files
                        (reduces load)

Eventlist           log only these events, ex
                    --event 8 --event 68


Tip:
Do not log large filesystems recursively, log only needed folders.
With many files it can take some time until events are logged.


Logs

Logfiles are in folder /var/web-gui/_log/monitor/
There is one logfile per day of last 31 days ex fswatch.01.log .. fswatch.31.log
Older logs are overwritten. If you want to preserve them, use an "other job".


You can show and filter logs in menu Service > Audit > Fswatch Log
ex show only events on .jpg files

filter.png
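If you want to check logs outside the menu, the naming scheme above (one file per day of month) and a regex filter are easy to script. A minimal Python sketch; the helper names are made up and the sample lines are invented for the example, they are not the real fswatch log format.

```python
import re
from datetime import date

LOG_DIR = "/var/web-gui/_log/monitor"  # log folder as named above


def logfile_for(day: date) -> str:
    """One logfile per day of month: fswatch.01.log .. fswatch.31.log."""
    return f"{LOG_DIR}/fswatch.{day.day:02d}.log"


def filter_events(lines, pattern):
    """Keep only lines that match the regex (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if rx.search(line)]


# Invented sample lines, only to show the filter
sample = [
    "2023-01-29 10:00:01 /tank/data/a.jpg 8",
    "2023-01-29 10:00:02 /tank/data/b.txt 68",
    "2023-01-29 10:00:03 /tank/data/C.JPG 8",
]
print(logfile_for(date(2023, 1, 29)))
print(filter_events(sample, r"\.jpg\b"))
```

On the server you would read the lines from the file returned by logfile_for instead of the sample list.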


Debug and check behaviour

Open a console as root (ex via Putty) and start monitoring manually to print events at console.
You can copy/paste commands in Putty via a mouse right-click.

perl /var/web-gui/data/napp-it/zfsos/_lib/illumos/agents/agent_stat.pl fswatch

I have included fswatch for Illumos (OmniOS, OpenIndiana) and Solaris.
On Solaris I had to modify the sources and I am not sure about stability, see "Using fswatch to detect new files (Solaris)", Issue #228 in emcrisostomo/fswatch
 