Evolution from Storage Spaces

nonyhaha

Member
Nov 18, 2018
30
6
8
Hello everyone,

After my last update @ https://forums.servethehome.com/index.php?threads/the-evolution-of-a-home-server.25766/#post-265974 , I got a new server: an HP DL380p G8, which can fit 12 LFF drives. At the moment I am running 4x4TB plus 2x240GB SSDs for cache in a storage pool under Storage Spaces: 4 columns, 3 stripes, 1 parity, 100GB WBC, with data integrity check and repair activated over PowerShell. All fine and dandy, but I keep nerding around reddit and various YouTube tutorials, and I keep wondering whether I should jump the fence to another OS for NAS purposes.

I will run ESXi on it. On top of this I will need to run: Home Assistant, MotionEye, Emby server, a torrent client, and the NAS.
At the moment a Windows Server 2019 VM is taking care of the NAS (Storage Spaces), the Emby server (with a passed-through Quadro P400) and the torrent client.

Should I look into other operating systems?
I will have: 8x4TB (1 will be a cold spare), 2 SSDs for cache, and 1 NVMe for the OS and the virtual machines' OSes.
I would want:
- 1 disk parity, 6 remaining disks for data, 2 SSDs for cache.
- preferably instant redundancy calculation when writing data (I know SnapRAID needs a scheduled sync for this).
- protection against stupid things like data degradation/bit rot. Storage Spaces with ReFS can do this, if you enable it yourself from the CLI.
- automatic scrub and repair spread over time, not all at once, as I will have well over 10TB of data in the beginning.
- I do not necessarily need the possibility to add disks to the array one by one in the future.
- I can't say I am fond of the SnapRAID method of writing data entirely to one disk; even if it has its advantages, it would cripple write speeds, and my internet connection handles well over 35 MB/s. At this moment, to overcome Storage Spaces' parity write performance problems when transferring large files, I ended up creating the cache/journaling partition. I also use the rest of the space on the SSDs as a RAID 0 stripe, also under Storage Spaces, as the Emby transcoding location.

I do not know much about Unraid + ZFS. It looks like ZFS also supports auto scrub and healing? Am I correct? OMV looks pretty easy to use, with its SnapRAID and mergerfs/unionfs.
What do you think I should do?

Thanks in advance.
I also posted on reddit, but I still think a forum discussion will evolve more easily in the right direction, as everybody is following a single thread of discussion.
https://www.reddit.com/r/DataHoarder/comments/iqzq44
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
ZFS has it all:

- realtime bit rot protection (with checksums on data and metadata);
detected errors are automatically repaired on read or during a pool scrub
- realtime raid protection without the write hole problem of raid 1/5/6
- no corrupted filesystem on a crash during write (due to Copy on Write),
with unlimited readonly snaps (= Windows previous versions) and ransomware protection
- protection of the RAM write cache with sync write and an Slog
(protects VMs against data corruption)

ZFS comes in two flavours: either native ZFS with Oracle Solaris (still the fastest ZFS storage server) or Open-ZFS, either with a Solaris fork like OmniOS, where the ZFS and SMB integration is still best, or FreeBSD and Linux, where ZFS is now also available. Feature-wise the Open-ZFS platforms are nearly identical, with FreeBSD at the moment a year behind, but this will only be temporary.

You can try a ZFS storage VM, for example my free, ready-to-use storage server template with OmniOS that I have offered for more than 10 years now, or you can create your own based on FreeBSD, Linux or Solaris.

From your 8 disks I would create a ZFS RAID-Z2, where 2 disks are allowed to fail without data loss. This is superior to single disk parity with a spare disk. The read/write cache on ZFS is mainly RAM, so the amount of RAM is performance sensitive. You can extend the read cache with a (slower than RAM) SSD; with enough RAM this is quite worthless.
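As a concrete sketch of such a pool (not from the thread; `tank` and the illumos-style `cXtYdZ` device names are placeholders that will differ on your system):

```shell
# Create an 8-disk RAID-Z2 pool: any 2 disks may fail without data loss.
zpool create tank raidz2 \
  c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

# Verify the layout and health of the new pool.
zpool status tank
```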

You can use an SSD/NVMe as an Slog to protect the RAM-based write cache. This makes sense for disk-based VM storage. If you put the VMs on NVMe/SSD, skip this and just enable sync.

I would create a disk-based pool for filer and backup duties, with the VMs on a second pool made from SSD or NVMe. Optionally you can use the SSDs as a special vdev mirror to hold metadata and small I/O.
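A hedged sketch of that two-pool layout; pool and device names are assumptions, not from the thread, and the special vdev requires an Open-ZFS version with allocation-class support:

```shell
# Second pool from two SSDs, mirrored, for the VMs.
zpool create vmpool mirror c3t0d0 c3t1d0

# Alternatively, attach the two SSDs to the disk pool as a special vdev
# mirror so metadata (and optionally small blocks) land on flash:
zpool add tank special mirror c3t0d0 c3t1d0
zfs set special_small_blocks=64K tank    # files up to 64K also go to the SSDs
```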

hardware
- use enough RAM (gives performance; best is 8GB+ for storage)
- use an HBA to connect the disks; the gold standard are LSI/Broadcom HBAs, e.g. the 9300

see my howto
 
Last edited:
  • Like
Reactions: nonyhaha

nonyhaha

Thank you very much. I really like how you explained this. I will start researching this and I hope I will make it to the end. I will try to build an 8 disk pool with 2 disk redundancy.
Thank you again.

So I am reading through the manual, which is very nicely put together.
I am just on page 5, where I read about the best way to use a flash cache, which would be the Intel Optane 9 series, but later on we find that it is no longer recommended for L2ARC (I presume because there is no more PLP protection in the lower priced range).
As I understand from ZFS L2ARC, the L2ARC is used for random reads from the pool, and it can be added through a vdisk from ESXi (could I use a part of the PCIe NVMe used for the datastore? I will have a UPS to back up the server during power shortages).
The ZIL, on the other hand, is just like a write-back cache. Although, I read on FreeNAS - What is ZIL & L2ARC - 45 Drives Technical Information Wiki that it is only used to store writes of up to 64KB. So I am not sure it will be of any good for my torrent client writing to the pool, am I right?

All this being said, is there a way to cache higher capacity writes also? Can this be done with an SSD cache?

Also, can napp-in-one build a second array, a RAID 0 stripe of the 2 SSDs, so I can write the Emby cache to that partition? Or would the ZIL + L2ARC mentioned before be enough for it?

Many many many thanks!

L.E. At this moment I am thinking:
ESXi 7
napp-it VM with a 20GB vdisk from the datastore NVMe, a passed-through P420i in IT mode with 8x4TB and 1-2 SSDs - one for the ZIL? or 2 for a RAID 0 stripe? Would 32GB DDR3 1333 be enough? This will be mostly a media storage server, but it will also include photos. As for processing power, what should I use, passmark-score-wise/core-count-wise?

L.E. 2
I see you keep coming back to OmniOS, and that napp-it can share the resulting pool via SMB.
I just want to be sure: I should be able to share this via SMB to my Win Server machine, and use that machine to manage the per-user shares, correct? :)
 

gea

some remarks:

L2ARC is a read cache only and extends the RAM-based ARC.
You need no PLP, as on a failure reads simply skip the L2ARC. Since the RAM cache (ARC) and the SSD cache (L2ARC) only cache random reads, with a lot of RAM you will see (nearly) no improvement from an L2ARC. Mainly you want an L2ARC in low-RAM situations or with a huge amount of random reads, e.g. a mailserver with hundreds of users. A minimal extra improvement can be achieved by enabling read ahead, which is only possible on the L2ARC.
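For reference, adding and removing an L2ARC is a single command each (pool and device names are assumed), and removal is always safe because the cache holds no unique data:

```shell
zpool add tank cache c4t0d0     # add an SSD as L2ARC (read cache only)
zpool status tank               # the device appears under a "cache" section
zpool remove tank c4t0d0        # safe at any time; no data lives only here
```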

Partitioning an NVMe for pool, L2ARC and Slog?
Possible, but I would strongly advise against it in most cases. Keeping it simple is the important factor for trouble-free storage.

Slog
is not a write cache, as it is not in the datapath application > storage. It is a protector for the RAM-based write cache. Think of it like the BBU protection of a hardware raid controller. It logs all writes from commit to commit and is only read after a crash during write, to redo the missing writes on the next bootup. Its size does not need to exceed around 20 GB (2 x 5s of writes on Solaris; 10% of RAM up to 4GB on Open-ZFS).

For a filer you do not need secure sync writes. ZFS will in any case be stable after any sort of crash during write; no corrupted filesystems, and no chkdsk needed any longer. Only VMs (a single file from the ZFS point of view) and transactional databases may become corrupted by a crash during write without sync enabled.
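Since sync is a per-filesystem ZFS property, this policy can be sketched roughly like this (the dataset names are hypothetical):

```shell
# Filer data: default behaviour, sync only when the client requests it.
zfs set sync=standard tank/media

# VM images or databases: force sync writes so the RAM write cache
# is protected (fast only with an Slog or an all-flash pool).
zfs set sync=always tank/vms
```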

The write cache on ZFS is RAM and only RAM. You want it in order to avoid small random writes: it collects them so that only large sequential writes hit the pool, where on a pool of 6 data disks performance is > 600 MB/s, while small random writes may be at only 50 MB/s or less. For sync writes this pool may land at 20 MB/s.

A larger SSD cache would not help to improve anything, as sequential pool performance is already much higher than that of a single SSD. Only single-disk-based systems like SnapRAID see an improvement with a write cache SSD, and maybe Windows, but that is more due to a weak implementation of RAM-based caching.

VM size
My storage server template is around 40 GB. If you install manually I would use a similar size. OmniOS does not require this much, as it is the most minimalistic ZFS server available, but you need the space for a system upgrade, where the new OS must be stored, and for boot environments. These are the last states of the OS prior to an upgrade; on problems you can then boot into one of the former OS states.

SMB
All Solarish-based servers come with a multithreaded kernel/ZFS-based SMB server embedded in the OS, enabled as a ZFS property: just set SMB sharing = on and you are done. The Solarish SMB server offers a unique level of easiness, Windows SMB sharing compatibility with always-working ZFS snaps = Windows previous versions, ntfs-alike ACLs, and local Windows groups with Windows SID security ids as extended ZFS attributes. This is unique for a Unix system (Solarish-based operating systems only) and a huge advantage over SAMBA, the alternative SMB server available on any *nix system.
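On a Solarish system that really is the whole procedure; roughly (the filesystem name is assumed):

```shell
# Create a filesystem and turn on the kernel SMB share in one step.
zfs create -o sharesmb=on tank/media

# Optionally give the share a friendly name.
zfs set sharesmb=name=media tank/media
```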

Users are created within napp-it. If you use the same password as on Windows, you can connect directly without entering login/pw. ACL permissions are usually set from Windows (Properties > Security, connected as user root).

Additionally you can share with other protocols. The most important are NFS (e.g. as ESXi storage), iSCSI when you want block-based sharing, and Amazon S3. The last is important for backups or Internet/cloud sharing.
 

nonyhaha

Many thanks again. I will proceed with this project when I get back home from the covid hospital.
I feel great about this project :) Thanks!

I will use the NVMe SSD for all VMs on the ESXi datastore. The ZFS RAID-Z2 pool will hold only the NAS data. I'll see what I will use the remaining 2 SSDs for.
 

ari2asem

Active Member
Dec 26, 2018
607
105
43
The Netherlands, Groningen
my own experience with napp-it:
terribly slow web GUI, used with a Microserver Gen8, 16GB memory, 1230 CPU.

I was just playing around with napp-it, without any data drive installed in the Microserver, only a 256GB SSD for the OS.
 

nonyhaha

my own experience with napp-it:
terribly slow web GUI, used with a Microserver Gen8, 16GB memory, 1230 CPU.

I was just playing around with napp-it, without any data drive installed in the Microserver, only a 256GB SSD for the OS.
Thank you. I'll have to get home, install it, and see. Even if the GUI is slow as hell, if the data transfers to the storage pool are good enough for me, that is enough. I will use it exclusively as a NAS; everything else will run in other ESXi hosts. ATM I have 2x E5-2630L V2 for all my VMs. I am thinking of upgrading to E5-2650L V2.
 

gea

The GUI performance of napp-it free depends on the server performance. If you click on the menu Disks, Filesystems or Snaps, it calls the corresponding zfs or system commands to display the menu. A fast server means a faster web GUI; on a slow server you have to wait until e.g. a zfs list command finishes before a menu with the current realtime status can be displayed. The other way around, server performance is in no way affected by the GUI - no difference to when you disable napp-it. Napp-it uses mini_httpd as its webserver, which is unique in its low size - only a 50k binary - with ultra-low memory and cpu needs of its own, much less than e.g. an Apache webserver.

Napp-it Pro adds asynchronous background agents to improve GUI performance, but still reads in the current state of ZFS to stay fully compatible with cli commands. With current hardware and RAM, the delay without background acceleration is minimal, in the area of 1-2 seconds. With a JBOD of 90 disks, hundreds of filesystems, 10000 snaps, or very old hardware with 2-4 GB RAM this may be different, but that is not a typical current home use scenario.

The alternative would be a GUI database of server states, which would give GUI results that differ from the cli commands. Some other GUI applications work this way to be fast. Not my way.
 

nonyhaha

Hello @gea ,
So it looks like I am in a pickle. Napp-it does not recognize the HP P420i, apparently because of something from here: Google Translate / from the original "OmniOS + napp-it で ZFS のファイルサーバを作る" (Building a ZFS file server with OmniOS + napp-it) -> go to the P420i part, with OpenIndiana. Also here: Bug #1441: cpqary3 driver is old and causes panic in oi_151 - OpenIndiana Distribution - illumos

Do you think I should continue down this road?

What would be the easiest way to add support?

Many thanks!

L.E. I also set up a Win Server 2019 machine; this one does not even start with the IT-mode HP P420i passed through. After installing Windows I attached the PCI device, tried to start it, and it stays stuck at the Windows loading screen. :(

L.E.2 Do you think I could get away with using RDM to forward the 8 HDDs to the napp-it VM?
 

gea

The HP P420 is a raid 5/50/6/60 controller, absolutely not suitable for software raid and ZFS. Sell it to a Windows guy who may be happy with it and buy a cheap used LSI HBA for half the price (best is an LSI/Broadcom 9300; 2307- and 2008-based HBAs are ok). If you can choose when you buy, prefer an HBA with IT firmware; IR (raid 1/10) is ok, firmware with raid 5 will not work.

RDM is fine when you use an LSI SAS HBA from the list above. RDM can give problems with Sata. RDM on raid adapters - it depends.
 

nonyhaha

@gea again thanks for replying. The P420i is an onboard controller. It has a simple command that puts it into IT mode, so that should not be an issue. The issue is that I have many problems passing it through. I am in a pickle because the DL380p has 3 PCIe slots and I wanted to have: 1. the GPU; 2. the NVMe SSD I just mounted on a PCIe adapter for the datastore; 3. a USB 3.0 PCIe adapter, passed through to the Windows machine, to be able to back up stuff to an external drive from time to time.

I see a lot of people advising against using ZFS on RDM drives.

I think I have hit a hard brick wall, and I am very tired at this hour - midnight here - and I do not know yet what to do.

I keep reading a lot of posts all around and literally can't find a reason why people are against RDM. As far as I can see, since ESXi 5.1 there are no known issues with passing only the drives to the VMs.

Does anybody know a reason why this is going on for ages?
 

gea

RDM is mainly a problem with a Sata disk controller, as that is not a supported option from VMware; it may work or not. RDM over a SAS controller, even with Sata disks, is a supported option and I see no reason not to use it - besides the fact that you need to pass disk by disk, whereas with HBA passthrough you need only one action for all disks.

As RDM additionally uses the ESXi disk driver, there may be slightly lower performance than with controller pass-through, where the VM has direct hardware access, but I would not worry about that.
 

nonyhaha

Well, this is absolutely no issue for me. I have a SAS controller and Sata disks, so I might try this.
 

nonyhaha

Hello again gea.
Everything is working great; I am very happy with the performance compared to Windows.

I have a quick question. Today I added an automatic scrub to the system, and when I checked the pool status during the scrub, I saw the following:

  pool: prima
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.


Should I do that upgrade or not? What would be the implications of leaving it like this / of upgrading it?

I have a smb share on the pool.

Have a nice day!
 

nonyhaha

Hello @gea .

So now I have run into my first biggie. After a power surge, my server restarted. All fine and dandy, except for my ZFS pool, which, for some reason, hangs if I try to access it. It works for the first file I access, but after that it becomes totally unresponsive.

The napp-it interface also cannot display anything about the pool.
If I remove one disk from the pool and start it like that, it works? This is very weird.

I can log in over SSH, and when I run zpool status it says it is scrubbing, but it is stuck at a certain point and nothing happens.

What could I do to repair this?
If I restart napp-it, I have access to the tools for a short period of time.

L.E.
I just disabled the SMB share right after boot, and now it seems that the scrub is running correctly. :) I'll be back with good news, I hope.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,113
5,133
113
@gea we lost this post in the backup restore this morning. If you want to repost it and tag me, I can remove this one. Here is what was lost

1. Pool version
Newer pool versions offer new features. This is why you may want to update.

Only two problems
- an older OS cannot import a newer pool (sometimes readonly works)
- very new features may contain bugs, so sometimes you want to wait a little before enabling the newest features

2. System hangs
If your pool hangs, e.g. during pool import or disk listing, napp-it hangs as well, as it wants to read in the disks and filesystems.
I would remove all data disks, then check the logs (napp-it menu System > Logs) for reasons.

If a single disk is the problem:
- try without it, or remove all disks, and
- re-insert the first disk; check whether a console command like format shows the disk (end with ctrl-c)
- add the next disk, retry. If it hangs on a single disk, continue with the others. You may end up with a degraded but working pool.

A sharing service like SMB should not be the cause of such problems, but it is the first place to show them.
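On the console, that disk-by-disk check could look roughly like this (illumos commands; a sketch, with `prima` being the pool name from this thread):

```shell
# List detected disks non-interactively; a hang here points at the
# disk that was just inserted.
format < /dev/null

# Per-device error counters can also expose a sick disk.
iostat -En

# Show importable pools and their state before actually importing.
zpool import
```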
 

nonyhaha

Thank you @Patrick and @gea

The scrub ended, but I was not careful enough to see that it was scrubbing only 7 out of 8 disks.

My zpool status after a pool export/import is:

  pool: prima
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub in progress since Wed Nov 11 06:59:57 2020
        327G scanned out of 15.5T at 338M/s, 13h4m to go
        0 repaired, 2.06% done
config:

        NAME                     STATE     READ WRITE CKSUM
        prima                    DEGRADED     0     0     0
          raidz2-0               DEGRADED     0     0     0
            c2t0d0               ONLINE       0     0     0
            c2t1d0               ONLINE       0     0     0
            7237068410135738485  FAULTED      0     0     0  was /dev/dsk/c2t3d0s0
            c2t5d0               ONLINE       0     0     0
            c2t2d0               ONLINE       0     0     0
            c2t4d0               ONLINE       0     0     0
            c2t6d0               ONLINE       0     0     0
            c2t8d0               ONLINE       0     0     0

The problem is that if I try to replace the faulted disk with the disk that is missing, I get this:

root@napp-it030:~# zpool replace -f prima 7237068410135738485 c2t3d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c2t3d0s0 is part of active ZFS pool prima. Please see zpool(1M).

The "missing" disk was already formatted on Windows as NTFS, so I do not know how to add this drive to the pool to replace the missing one.

Also, every time I start the VM, napp-it starts a pool scrub; is there any way to stop it?

Maybe someone can help me?
Many thanks!