Newbie trying to understand best practices with ZoL on RHEL

SRussell · Dec 17, 2019

My current hardware setup:
SuperMicro SC216
SuperMicro 847
19x 1.6TB SATA Intel 3500
20x 10TB SATA HGST

Current setup:
Synology 12 bay NAS
6x 10TB SATA HGST BRTFS Raid6
18TB of use

My plan is to move all current storage to a RAIDzN setup on the Intel drives. I will also add numerous VMs to the JBOF. I expect to stay within 22TB of space on the JBOF. My expectation is consuming around 50-80TB on the HGST JBOD.

Now the confusion has started. I have read so many articles and posts about 'best practices' that I feel I have entered analysis paralysis.

1. How do you begin to determine drive setups based on drive profiles? How do account for drives when you are looking at 2/4/8/10/14 and potentially larger drives? At what point in drive size do you say we need to move to RAIDz3 or run RAID60?
2. What is a safe setup for platters and in what quantity? e.g. Is an 8x drive RAIDz2 the standard or does that change based on drive size, cache, and what if a drive is SMR or not SMR... how do you begin to factor all of that?
3. Does the triad of performance, capacity and integrity change when you move to flash storage? Does it change again if using enterprise flash?
4. Does the use of a hot spare change with drive types? My guess would be there is not an issue leaving a hot spare flash drive but I do not think it would be a good practice to leave a platter SATA drive in a hot standby mode.

SRussell · Dec 18, 2019

I read the ZFS document from iXSystems: Introduction to ZFS

It seems like 8 drive in Raidz2 is pretty good for performance and a 9 drive Raidz3 for better integrity.
They do advocate having a hot spare available. I am not sure I fundamentally agree with this but hard to argue opinion vs their business model.
Could not find any data relating to SMR but drive sizes over 6TB should always be mirrored or raids2/3.
Could not find any data that directly addresses flash storage.

i386 · Dec 19, 2019

No answers yet? Though/funny questions

I had to google some of these things (eg:"drive profiles", "triad of performance, capacity and integrity")

SRussell said:
Raidz2 is pretty good for performance

Raidz(1/2/3) is not for performance. For performance you will need to SAME (Stripe And Mirror Everything aka raid 10).

SRussell said:
3. Does the triad of performance, capacity and integrity change when you move to flash storage? Does it change again if using enterprise flash?

I'm not sure if I understand your question but I try to guess.
Not all best practices for hdds apply to ssds. For example raid 5 with large hdds is not okay, but with ssds it kinda is okay.
With hdds rebuilding an array happens at low throughput and you put the array under extreme stress (remember hdds are mechanical devices). This increases the risk that another hdd fails/times out/what ever and corrupts the entire array. Ssds on the other side have higher throughput in general and can handle the random access far better.

Terry Wallace · Dec 19, 2019

Read this first
ZFS RAID - 45 Drives Technical Information Wiki

Then read ALL of this
https://calomel.org/zfs_raid_speed_capacity.html

Then if you still have questions let chat

Those are 2 very useful resources.

Rand__ · Dec 19, 2019

So you have a twofold question Flash and platters - I will focus on the Flash part as VM storage for now

I have been playing around with storage for a very long time now trying to reach certain performance levels (as you might have seen in my various threads

).
From my current perspective I would recommend to start backwards (or actually its not backwards from a planning point of view, but backwards from a building point of view)

So first step is to define your requirements more closely
1. space needed, including 40%/50% utilization rule for ZFS (can decide if you need RaidZx or can do mirrors)
2. Then performance needed, including considering number of concurrent users/vms/processes (resulting in threads and queue depths for testing) (can decide if you need mirrors or can do RaidZx)
3. Whats your write to read ratio (or patterns)

If you have that then you can decide upon pool layouts keeping in mind that
1. ZFS does not scale too well
from @Terry Wallace s Calomel link

Code:

2x 256GB  raid0 striped   464 gigabytes ( w= 933MB/s , rw=457MB/s , r=1020MB/s )
24x 256GB raid0 striped   5.5 terabytes ( w=1620MB/s , rw=796MB/s , r=2043MB/s )

12 times the drives, double the performance?

2. you will have losses for networking, potentially sync write issues (slog?)

And if you have all that then you can determine actual possible pool setup and then see if you need more drives, faster drives or if you are good

After that you should build a test pool and try whether your expected performance is reachable in real life...

Edit:
to be fair, the scaling issues primarily manifest in low thread situations, as soon as you scale vertically with more threads (users) or deeper queues the problem is partially mitigated. I assume saturation will scale inversely to amount of vertical scaling (i.e. 16 users will be able to saturate significantly more drives than a single user).
Also this seems to be dependent on individual drive performance - higher perf drives (nvme) hit the saturation point earlier.

Rain · Jan 3, 2020

Rand__ said:
1. ZFS does not scale too well
from @Terry Wallace s Calomel link

Code:

2x 256GB raid0 striped 464 gigabytes ( w= 933MB/s , rw=457MB/s , r=1020MB/s ) 24x 256GB raid0 striped 5.5 terabytes ( w=1620MB/s , rw=796MB/s , r=2043MB/s )

12 times the drives, double the performance?

The chassis they claim the tests were run on was a SuperMicro 846 with the BPN-SAS2-846EL1 backplane connected to an LSI HBA with a single 8087 cable. 24 SATA SSDs hanging off a 24 drive SAS2 expander with only one connection to the HBA simply isn't going to produce the best results. Their ~2GB/s reads are probably knocking on this limitation.

Also, they mention

TRIM is not used and not needed due to ZFS's copy on write drive leveling.

In most workloads with deleted/re-written data, TRIM greatly effects performance. See the pull-request that added TRIM support to ZFSonLinux and the benchmarks performed: Add TRIM support by behlendorf · Pull Request #8419 · zfsonlinux/zfs . If I'm not mistaken, TRIM support was added to FreeBSD's ZFS before ZFSonLinux.

While a good read, the Calomel link shouldn't be taken as gospel. As with most benchmarks, you should expect similar performance with the same configuration on the same hardware. If they had instead used the BPN-SAS-846A (direct-attach, expander-less) backplane and more HBAs their sequential numbers would be much better.

Rand__ · Jan 3, 2020

I totally agree

However I was pointing out that one as it was mentioned and due to the fact that scaling is far from ideal on faster drives so one should not take it for granted when designing a system

Rain · Jan 3, 2020

I'm sure we agree. I was merely pointing out that Calomel's scaling issues are because of their hardware configuration, not ZFS. No RAID systems (hardware or software) scale perfectly, but ZFS's scales quite well when disk bandwidth is properly accounted for and the hardware is selected appropriately.

Notice the near linear scaling in the 1, 2, 3, and 4 drive configurations in comparable reads (writes are obviously affected by RAID level and the lack of TRIM). After 4 drives, they max out the 4x SAS2 links from the expander to the HBA:

Code:

1x 256GB  a single drive  232 gigabytes ( w= 441MB/s , rw=224MB/s , r= 506MB/s )
2x 256GB  raid1 mirror    232 gigabytes ( w= 430MB/s , rw=300MB/s , r= 990MB/s )
3x 256GB  raid5, raidz1   466 gigabytes ( w= 751MB/s , rw=485MB/s , r=1427MB/s )
4x 256GB  raid6, raidz2   462 gigabytes ( w= 565MB/s , rw=442MB/s , r=1925MB/s )

If they were not hitting the limit of the SAS2 expander and/or its connection to the single HBA, this scaling would continue far past 4 drives.

------

More on subject to @SRussell's questions, I generally think RAIDZ2 is plenty for configurations with 10 drives or less in a single VDEV regardless of today's large drive sizes. Any more and I'd feel better with RAIDZ3. That said, with proper backups in place, even RAIDZ2 is probably fine in many-drive configurations unless you really think you need the additional uptime RAIDZ3 would provide in a "drastic" failure situation.

I also prefer to spread data out across multiple pools (if the data is organizationally separable) instead of using one large bulk-storage pool with many VDEVs. For example, ignoring performance constraints, if I have Data Set A and Data Set B, both unrelated to each other and equally large in size, I'd prefer to have Data Set A on one pool and Data Set B on another instead of both of them on a single larger pool. Mainly because if a single VDEV in a multi-VDEV pool fails you've lost everything.

For a performance-focused pool (say, serving many VMs), striped mirrors generally the way to go if you can afford it. Otherwise, small striped RAIDZ1/RAIDZ2 VDEVs works great as well. Really depends on the exact workload you're trying to fulfill.

On the subject of hot spares, unless the machine is truly out of reach for extended periods of time, I don't think they're needed. I can always walk a datacenter tech, coworker, family member, ect through physically swapping a drive out and handle the rebuild remotely. Being able to swap in a cold spare in a timely manner is just as good, if not better, than wasting space/energy with hot spares in most applications.

Rand__ · Jan 3, 2020

Well I' have not the best experience with scaling despite adequatly sized hw (-> Pool performance scaling at 1J QD1) but this is not a matter for this thread - happy to discuss that further though, either over there or in a new thread here

SRussell · Jan 3, 2020

On the subject of hot spares said:
Would you use hot spares with non-flash drives?

Rand__ · Jan 3, 2020

I would say that depends on your availability requirements...
If you have the pool at a remote location with expensive access then sure, keep a hot spare.
If its in the basement ... keep a drive in a safe place instead or buy when needed.

SRussell · Jan 3, 2020

Rand__ said:
I would say that depends on your availability requirements...
If you have the pool at a remote location with expensive access then sure, keep a hot spare.
If its in the basement ... keep a drive in a safe place instead or buy when needed.

Do you believe a rotational drive kept as a hot spare would degrade the drive the longer it is kept powered on?

Rand__ · Jan 3, 2020

It uses power and generates heat. Not sure it would really age if its not really used, but unless you really want to see no reduction in resilience I dont think its necessary.

pricklypunter · Jan 4, 2020

SRussell said:
Do you believe a rotational drive kept as a hot spare would degrade the drive the longer it is kept powered on?

Yes, I would say it will. The trouble is, you won't know about it, unless it either fails in a manner that is signaled to the system or when you actually have to use the damned thing and discover it has developed bad sectors or a sticky head while it was just sitting there spinning, being subjected to the same environmental conditions and generating heat

Search

Newbie trying to understand best practices with ZoL on RHEL

SRussell

Active Member

SRussell

Active Member

i386

Well-Known Member

Terry Wallace

PsyOps SysOp

Rand__

Well-Known Member

Rain

Active Member

Rand__

Well-Known Member

Rain

Active Member

Rand__

Well-Known Member

SRussell

Active Member

Rand__

Well-Known Member

SRussell

Active Member

Rand__

Well-Known Member

pricklypunter

Well-Known Member