Advice Needed on ZFS Array Design


Drooh

New Member
I'm reworking a few things in the structure of my startup's data storage and R&D lab. Since progress on the super-high-performance data-serving side of the R&D lab has somewhat stalled, I'm taking a quick break to tackle what should be a much easier challenge: my primary means of backup, data integrity, and failover in the event of a catastrophe.

For this challenge, I'm looking for expertise on how to design around and utilize what I have left over from the R&D purchases, recent server configs, etc. I have the gear available and would like to see some suggestions along with the benefits of each design/configuration.

-Super-high-performance data access is a work in progress in the R&D lab. For the time being, what I have works for current needs.

-The mission-critical data is replicated on a mirrored system, so there is a safeguard in place against failure of the R&D lab system.

-The mission-critical data is also stored in triplicate on a backup array, which is replicated off-site.

I'm in a decent spot, but I would like a tertiary system housed on-site, underground; if catastrophe meets catastrophe, I don't want to be at the mercy of the time it would take to recover from the off-site replication. I've also encountered significant data corruption before; I recovered quickly only because those were small data sets, so cloud recovery was quick and non-problematic.

Currently, I am missing protection from corruption: while snapshots are in place for some critical data, I need to establish snapshots for the entire system. That's why I've chosen ZFS.

ZFS is still relatively new to me, and I've brought myself to about 50% of what I need to know, but I want to test out a few configurations and determine how best to design the hardware. I can worry about methodology once the system is built.



Keys
-This array doesn't need to be highly scalable, but within the next year I'll need to be able to scale on the order of petabytes. If the system can scale, great, it can be a long-term option; if it can't, the funds will be there for a managed solution if necessary. For now, I need something to complete the protection scheme between now and the time I need to scale up, at which point I can weigh the cost of scaling this system against a managed solution.

-The system needs to hold 100TB of cold data, 10-12TB of warm data, and roughly 3-4TB of hot data.

-I don't need the same kind of performance that I need for R&D. I do, however, need throughput of at least 800 MB/s; 1.0-1.5 GB/s over two 10GbE connections would be ideal, but I can live with saturating a single 10GbE connection.

Materials for use
-6x 12TB WD Gold
-12x 6TB WD Gold
-12x 1TB Intel 545 SSDs
-2 (maybe 3) 2TB Intel NVMe drives
-2x LSI 9300 HBAs
-1x LSI 9207 HBA
-1x LSI 9201 HBA
-2x Chelsio T580 NICs
-The server is a Supermicro with a Xeon Scalable Bronze CPU and 128GB of RAM; I have extra RAM, so I could expand it, but that's likely not needed.

Maybe ZFS is the right way to go, maybe not. Maybe I don't need to use all of this in one system. I'm open to design thoughts, as long as I can meet the key requirements.

With all the possible configs, I don't know which way to go.
 

gea

Well-Known Member
If you have static (cold), warm and hot data, there are two possible approaches. One is tiering, which means that hot data is automatically or manually moved to a faster part of an array. ZFS does not support tiering; it has chosen another solution to the hot/cold data problem with its advanced RAM and SSD/NVMe based caches, as these work more dynamically even under heavy load. You can combine this with different pools (faster ones like SSD/NVMe and slower ones like disks).

The read caches in ZFS (ARC and L2ARC) buffer ZFS datablocks (not files) on a read-last/read-most strategy. This prefers metadata and small datablocks, as they are the most performance-sensitive parts of the data. On a system with that much RAM you can expect a RAM cache hit rate of over 80% for such data, and RAM is always the fastest storage tier, much faster than any NVMe.

The ARC does not cache sequential data, as otherwise, for example, reading a video would immediately flush the cache. If you add an L2ARC (SSD or NVMe) that extends the RAM cache, you can enable read-ahead, which can increase sequential performance. But sequential performance is rarely a problem, as it scales with the number of data disks; you can easily achieve 10G performance and more even with a disk-based pool.
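
For illustration, a minimal sketch of adding an L2ARC and enabling read-ahead caching for it, assuming an OpenZFS-on-Linux system, a pool named "tank" and placeholder device names (other platforms expose the same tunable as a sysctl or system setting):

Code:
# add an NVMe (or SSD) partition as L2ARC to the pool "tank"
zpool add tank cache /dev/disk/by-id/nvme-example-part1
# by default prefetched (sequential) reads are not stored in L2ARC;
# setting l2arc_noprefetch=0 allows them, i.e. enables read-ahead caching
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
# make the setting persistent across reboots
echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf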

For writes, ZFS uses a RAM-based write cache (the default on Open-ZFS is 10% of RAM, max 4GB; on Solaris it is 5 seconds of writes) to turn small, slow random writes into large, fast sequential writes. If you need secure write behaviour with no data loss when a crash during a write would otherwise lose the cache contents, you can add an Slog log device to protect the cache.
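
A rough sketch of adding an Slog, again with placeholder pool and device names; note that the Slog is only used for synchronous writes, so it mainly matters for NFS, databases/VMs, or datasets where you force sync:

Code:
# mirror two NVMe partitions as a log device for the pool "tank"
zpool add tank log mirror /dev/disk/by-id/nvme-a-part1 /dev/disk/by-id/nvme-b-part1
# optionally force every write through the log for crash-safe behaviour (costs performance)
# "tank/important" is a hypothetical dataset name
zfs set sync=always tank/important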

So with ZFS you have the option to use one large pool. The fastest layout regarding IOPS is a pool of many mirrors: each mirror vdev increases pool IOPS and sequential performance. If you create a pool from RAID-Z vdevs, sequential performance is similar, but each vdev has only the IOPS of a single disk.
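
As an illustration of the two layouts (pool and disk names are placeholders, and the two commands are alternatives, not meant to be run together):

Code:
# pool of three mirror vdevs: IOPS and sequential performance scale with the vdev count
zpool create fastpool mirror da0 da1 mirror da2 da3 mirror da4 da5
# pool of one RAID-Z2 vdev: better capacity efficiency, but roughly one disk's worth of IOPS
zpool create bigpool raidz2 da0 da1 da2 da3 da4 da5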

Given your disks I would create two or three pools.

My suggestion:

Pool 1: disks / cold data
Create a pool from 3 x RAID-Z2 vdevs:
vdev 1: 6 x 12TB disks
vdev 2/3: 6 x 6TB disks each

Pool capacity is 96TB usable.
Options to extend: add more vdevs, or replace the 6TB disks with 12TB disks.
Improve performance: add an NVMe as L2ARC (max ~5 x RAM) and enable read-ahead (may help or not, depends on workload). You can check ARC stats to decide whether you even need an L2ARC; see the sketch below.
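
A minimal sketch of the creation command for this layout, assuming the pool is named "cold" and using placeholder FreeBSD-style device names (da0-da5 standing in for the 12TB disks, da6-da17 for the 6TB disks); in practice use stable by-id or GPT labels:

Code:
# one pool with three RAID-Z2 vdevs
zpool create cold \
  raidz2 da0 da1 da2 da3 da4 da5 \
  raidz2 da6 da7 da8 da9 da10 da11 \
  raidz2 da12 da13 da14 da15 da16 da17
# check ARC hit rates before deciding whether an L2ARC is needed
# (arc_summary on OpenZFS; FreeBSD-based systems like XigmaNaS also expose
#  the same counters via the kstat.zfs.misc.arcstats sysctls)
arc_summary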

An LSI 9207 or 9201 is good enough for disks.
With SATA disks I would avoid an expander and prefer one disk per HBA port, as a bad SATA disk can block an expander, which makes finding the faulty disk neither easy nor fast. SAS disks are less critical in that respect.

Pool 2: medium performance / warm data
Create a pool from the 12 Intel SSDs; the layout depends on workload.
If you create a pool from one Z2 vdev, your usable capacity is 10TB. Use the 9300 HBAs for the SSDs.
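
Sketch for the SSD pool, same placeholder naming, with all 12 SSDs in one RAID-Z2 vdev:

Code:
# 12 x 1TB SSDs in a single RAID-Z2 vdev -> roughly 10TB usable
zpool create warm raidz2 da18 da19 da20 da21 da22 da23 da24 da25 da26 da27 da28 da29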

Pool 3: ultra-high performance for special needs
Create a Z1 pool from your 3 NVMe drives; about 4TB usable.
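
And for the NVMe pool (nvd0-nvd2 are placeholder FreeBSD NVMe device names):

Code:
# 3 x 2TB NVMe in RAID-Z1 -> roughly 4TB usable
zpool create hot raidz1 nvd0 nvd1 nvd2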
 

Drooh

New Member
Fantastic post. Thanks for taking the time to write that. You covered 90% of what I really needed to know.

My only assumption is that I'd have to move data from pool to pool manually.

I tried creating a test pool with a funny result. Something went wrong between volume creation and the mount point: I created a test share, which I found at /mnt/testshare, but then a folder named testshare appeared within /mnt/testshare, and the result was an error that there is no space to write the test file. I'm working on figuring out where I went wrong, haha. I'm testing on XigmaNaS for the time being.
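
In case it helps anyone hitting the same thing: one common cause of this symptom is a dataset that isn't actually mounted, so writes land in the empty mountpoint directory on the small root filesystem instead of on the pool. A rough way to check, with the dataset name below being only a guess based on the description above:

Code:
# show which datasets exist, where they should mount, and whether they are mounted
zfs list -o name,mountpoint,mounted
# inspect the suspect dataset (hypothetical name)
zfs get mountpoint testpool/testshare
# mount everything that isn't mounted; if a plain folder already occupies the
# mountpoint, it may need to be emptied or removed first
zfs mount -a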