ZFS Advice for new Setup


gea

Well-Known Member
Dec 31, 2010
Question(s) relating to ZFS using 'whole disk' -vs- 'partitions'.
..

Instead of a two whole-disk mirror, what if you partitioned each of the SSDs into, let's say, two partitions,
and then created a RAID1+0 set, where you mirror the 1st two partitions and the 2nd two partitions, and then stripe across them,
thus providing more vdevs to the underlying scheduler, and potentially using more internal drive-controller threads?

I know for HDDs this is considered a no-no ..
ZFS lets you do that, and in the case of an Slog or L2Arc there may be a single reason for it (saving money), but in general you only complicate things without any advantage; more or less you are just asking for trouble - especially if you partition the data disks of a pool. When problems occur, there is a good chance you will do something stupid.

A good SSD under steady load can give you up to 500 MB/s of writes and 40k IOPS at smaller blocksizes. If you partition the same disk and put a steady load on both partitions, you are limited by the same overall SSD values, which means the performance is divided between the two partitions; the total stays the same.

If you use a setup with many HBAs, there may be a reason to assign disks in a way that allows a single HBA to fail while the pool stays online (Sun did that as well). From a performance point of view it does not matter.

Mostly you add complexity without a serious reason.
The most important goal must be 'keep it simple', as that is what helps when you run into trouble.
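For illustration only, here is roughly what the two layouts look like at the zpool level (device and partition names are hypothetical placeholders) - the whole-disk mirror being the simple, recommended variant:

Code:
# simple, recommended: one mirror vdev built from whole disks
zpool create tank mirror c1t0d0 c1t1d0

# the partitioned variant from the question: two mirror vdevs built from two
# partitions per SSD - more vdevs on paper, but both vdevs still share the
# same two physical devices, so there is no extra hardware to gain performance from
zpool create tank mirror c1t0d0s0 c1t1d0s0 mirror c1t0d0s1 c1t1d0s1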

btw
You should use the S3700 for the Slog.
Do not use the terms raid-5/6 for ZFS software raid.
ZFS is more advanced; the correct terms are raid-Z1 and Z2.
(You can put ZFS on top of a hardware raid-5/6, but you will lose ZFS's self-repair options then.)
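For reference, a minimal sketch of the two vdev types in zpool terms (disk names are placeholders):

Code:
# raid-Z1: single parity, survives the loss of one disk per vdev
zpool create tank raidz1 disk1 disk2 disk3 disk4 disk5
# raid-Z2: double parity, survives the loss of two disks per vdev
zpool create tank raidz2 disk1 disk2 disk3 disk4 disk5 disk6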
 

humbleThC

Member
Nov 7, 2016
Got it. Seems like it's frowned upon unless you have a specific reason (workload/benchmarks) that justifies otherwise.
That being said, I might be balancing on the line of 'will it matter'.

i.e. In a scenario where you don't have enough SSDs to out-perform the underlying disk pool, and thus the SSD (cache/logs) becomes a bottleneck.
My what-if thought is: if (2) SSDs would out-perform the HDD pool, but (1) won't, then using partitions and trying to RAID1/0 two SSDs would potentially out-perform the HDD pool, and thus be effective [although slightly complicating the design].

My assumption is pool v19+ has the ability to dynamically add/remove cache/log devices, right? So I could originally create the L2ARC / ZIL using RAID1/0 striped partitions, and in the future make up the performance difference by purchasing additional SSDs. When I do have enough physical SSDs to handle the performance, I'll no longer partition them and go back to mirrors only.
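(If I have the commands right, that add/remove cycle would look roughly like this - pool and device names are just placeholders:

Code:
# add a mirrored log (Slog) and two cache (L2ARC) devices to an existing pool
zpool add tank log mirror ssd0p1 ssd1p1
zpool add tank cache ssd0p2 ssd1p2
# log and cache vdevs can be removed again at any time
zpool remove tank ssd0p2 ssd1p2
zpool remove tank mirror-1     # the log mirror's vdev name as shown by 'zpool status'

so switching back to whole-disk devices later should be painless.)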

The save-money aspect is kind of worth considering, since my original (4) SSD purchase didn't pan out :) These (4) new SSDs were already unbudgeted, and if I find out I need (4) Intel S3700s to out-perform (10) Hitachi HDDs because (2) in RAID1 only yield 500 MB/s, I've got a problem. I really want 1 GB/s+ sustained read/write out of the pool (assuming large-block sequential: copying ISOs/1080p movies, etc.).

In my previous Windows 2016 Storage Spaces tests, I obviously wasn't getting ZFS POSIX compliance, and my consumer-grade SSDs acting as dedicated journal volumes were cheating with system RAM for write acks. So I'm aware this is apples and oranges, but I was able to get about 1.2 GB/s sustained read/write out of the (2) existing Samsung Evo 850s, before I expanded them to (4). Benchmarks after the SSD expansion yielded no additional performance [part of the reason I wanted to switch to ZFS, for the ability to customize the underlying disk pool and cache mechanisms].

So if I find out I need (4) Intel S3700's in RAID1 to get back to where I was in Windows with (2) Samsung Evos, then I'm going to be a sad panda.
But if I can partition my SSDs and cheat for a while, I'll be able to contain my panda tears.

I will have sufficient time to play with multiple configurations, and run extensive benchmarks, before I have to finish up and 'go live'. All my important data is currently backed up on a Netgear ReadyNAS 6x Disk RAID5 appliance, and the rest is trickled on to Seagate 8TB USB drives. [which will be my backup devices in the future]

--EDIT--
Ahh, I think I just read something about the way the ZFS IO scheduler leverages the drive's NCQ for parallel IO, using a single cache-flush command to validate multiple IOs at once. So my idea of partitioning the SSDs to try and leverage additional threads of the SSD controller might be moot.
 

humbleThC

Member
Nov 7, 2016
Sooo the Zatoc 240GB 'PE' SSDs @ $107ea were stuck in Amazon 'preparing for shipping', and I was able to cancel the order \o/

And the Intel S3700-200s were stuck in eBay's 'preparing for shipping', and I filed a cancellation request.

I found these bad boys instead...
Intel SSD DC S3710 Enterprise 400GB 2.5" 6GB/S SATA SSD SSDSC2BA400G4

S3710-400GBs for $125 ea.... so I picked up (4) to replace the Zatoc/S3700-200s.

Now I'm thinking:
(2) S3710-400GB's for L2ARC
(2) S3710-400GB's for ZIL.

Unless it makes sense to partition each of the SSDs like 80%/20%, and use all (4) for both functions.
 

gea

Well-Known Member
Dec 31, 2010
You should now think about your real use case and needs.
If you need a filer and backup capacity, you do not need an Slog (no sync writes) or an L2Arc (given sufficient RAM).

If you need high-performance storage for ESXi, you will always miss that with a regular spindle-based pool.
If you instead build a smaller high-performance pool from the 4 x S3700 without Slog or L2Arc (they are pointless there),
you will get 800 GB (raid-10) or 1.2 TB (Z1).

This means a two-pool solution: one optimized for capacity and a smaller one for performance.
 

humbleThC

Member
Nov 7, 2016
You should now think about your real use case and needs.
If you need a filer and backup capacity, you do not need an Slog (no sync writes) or an L2Arc (given sufficient RAM).

If you need high-performance storage for ESXi, you will always miss that with a regular spindle-based pool.
If you instead build a smaller high-performance pool from the 4 x S3700 without Slog or L2Arc (they are pointless there),
you will get 800 GB (raid-10) or 1.2 TB (Z1).

This means a two-pool solution: one optimized for capacity and a smaller one for performance.
True, which is what I was going to do with the (4) Samsung Evo 850s => 480 GB (raid-10)
- This wouldn't satisfy my entire ESX capacity needs, but it's something to play around with.

But I see what you are saying: I have to make the same choice again with the (4) Intel S3710-400s.
Either use them (perhaps ineffectively) as L2ARC/ZIL for the HDD pool, or carve them out into another flash-only pool, again w/o L2ARC or ZIL.
- Yielding a 2nd flash-only pool of 800GB (raid-10) or 1.2TB (Z1)
- Combined with the Samsung flash pool, that still only scratches my ESX capacity needs.
- I'm hoping for 5-10TB of flex-space for deploying labs (out of the 30TB usable), but maybe no more than 10% in play at any time (500GB-1TB).

So I'm still focusing on hopefully 'accelerating' the HDD pool, with enough of a cache warm-up period to get near-SSD speeds most of the time when I'm working inside a specific subset of the lab storage. That way I can share my 'capacity' pool between home NAS use and LAB use.

I have a feeling that the L2ARC is going to be 95% dedicated to the LAB anyway, since my home NAS archive/backup is rarely ever touched, except for the incoming data.

So for me it's now just a question of whether or not to partition the Intel S3710s for both L2ARC and ZIL.

My thought being that 400GB in RAID1 on a pair of SSDs is overkill for the ZIL, and limited to the performance of a single drive.
Let's say I want 100% of RAM covered for the ZIL; that's 80GB.
I could partition each of the (4) Intel SSDs into 40GB / 360GB.

Use 80 GB (raid-10) across the (4) 40GB partitions for ZIL.
Use 1440 GB (raid-0) striped across the (4) 360GB partitions for L2ARC.
Or a 160/1360 split.

This way I'm using all (4) SSDs for L2ARC bandwidth, and the same for the ZIL, while maintaining ZIL redundancy. And each SSD is on its own LSI adapter.
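If it helps make the idea concrete, the layout above would look roughly like this (partition names are hypothetical, and note the advice elsewhere in the thread about not sharing SSDs between Slog and L2ARC):

Code:
# Slog: two mirrors of the small partitions (ZFS stripes writes across the two log vdevs)
zpool add tank log mirror ssd0p1 ssd1p1 mirror ssd2p1 ssd3p1
# L2ARC: cache devices are simply listed, never mirrored; reads are spread across them
zpool add tank cache ssd0p2 ssd1p2 ssd2p2 ssd3p2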

REF:
Frequently Asked Questions About Flash Memory (SSDs) and ZFS
Can I use the same SSD both as a ZIL and as an L2ARC?
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
Another vote for not splitting/sharing SSD.

Have you looked at the performance hit an SSD takes when doing a mixed workload (i.e. reads & writes)? It drastically reduces drive performance, even on enterprise drives. Also, you don't need and won't use an 80GB SLOG unless you tune other settings.
 

humbleThC

Member
Nov 7, 2016
Another vote for not splitting/sharing SSD.

Have you looked at the performance hit an SSD takes when doing a mixed workload (i.e. reads & writes)? It drastically reduces drive performance, even on enterprise drives. Also, you don't need and won't use an 80GB SLOG unless you tune other settings.
My thought on the SLOG is that (4) Intel S3710s are going to be faster at acknowledging writes than the (10) Hitachi HDDs. So 80GB (enough to cover the entire RAM space) is more than sufficient. From what I read, you want about half your RAM, or even less - 8GB ZeusRAM drives, for example. So I realize that a certain portion of the 80GB SLOG will be free all of the time.

Now, does that perhaps take away from the speed of the L2ARC shared on the same SSDs? Sure. But are these SSDs fast enough and reliable enough to make the HDDs faster? I sure hope so :) That's why I'm interested in spreading the load across all 4 all of the time.
 


gea

Well-Known Member
Dec 31, 2010
I am under the impression that your optimisation strategy would be helpful on Windows Storage Spaces but not on ZFS.

The Slog must log all writes between two flushes of the normal RAM-based write cache; as a rule of thumb count 5 s plus some time for writing, say 7 s. If your network can deliver 1 GByte/s (10G network), this means your Slog does not need to be larger than about 7 GByte - the reason why one of the best of all Slog devices, the ZeusRAM, has only 8 GByte (battery-buffered DRAM). The Slog is not a write cache and will not improve writes onto your spindle pool! Every write to the Slog is done in addition to the write to your pool. Think of it more as a lower delay on sync writes when they go to an Slog instead of the ZIL area on the pool itself.

Apart from using a hardware raid, the only option is to mirror an Slog - not for performance, but so that a device can fail without data loss on a simultaneous crash, or without a performance degradation otherwise. You must also consider that a raid-0 of two devices helps sequential performance but not latency or IOPS, so it is quite pointless for an Slog.

For an L2Arc you also do not use a raid-10-like construct; you just add the devices, and the load is distributed over them. You should also consider that with that much RAM you would probably never see any advantage from an L2Arc, as you can nearly always deliver cache results from the much faster RAM-based Arc. If you want an L2Arc to cache sequential workloads without being too slow compared to your pool's sequential performance, you want a single NVMe instead, which can deliver up to several GB/s.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
Sizing should be based on the rate of incoming writes * 5 seconds. So if you have a single 1Gig connection, call it 110 MB/s, 5 seconds would be 550 MB; you want to have enough for 2 transaction groups, so 1.1 GB would be sufficient.

You can change how often it flushes, thus changing the 'size' it will grow to, but you'll never use 80GB unless you tune it wrong.
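To put rough numbers on that rule of thumb (the 5-second figure is the default transaction-group interval; the exact tunable name and location depend on the platform):

Code:
# SLOG sizing ~= incoming write rate (MB/s) x seconds between flushes x 2 txgs in flight
echo $(( 110 * 5 * 2 ))     # single 1Gig link    -> ~1,100 MB, call it ~1.1 GB
echo $(( 1000 * 5 * 2 ))    # 10G link at ~1 GB/s -> ~10,000 MB, ~10 GB - still nowhere near 80 GB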

Your L2ARC won't suffer nearly as much as the SLOG writes will from using the drives for both; writes suffer much more than reads in a mixed workload.

EDIT: And it looks like @gea posted seconds before with similar info :)
 

whitey

Moderator
Jun 30, 2014
Sooo the Zatoc 240GB 'PE' SSDs @ $107ea were stuck in Amazon 'preparing for shipping', and I was able to cancel the order \o/

And the Intel S3700-200s were stuck in eBay's 'preparing for shipping', and I filed a cancellation request.

I found these bad boys instead...
Intel SSD DC S3710 Enterprise 400GB 2.5" 6GB/S SATA SSD SSDSC2BA400G4

S3710-400GBs for $125 ea.... so I picked up (4) to replace the Zatoc/S3700-200s.

Now I'm thinking:
(2) S3710-400GB's for L2ARC
(2) S3710-400GB's for ZIL.

Unless it makes sense to partition each of the SSDs like 80%/20%, and use all (4) for both functions.
Please do not waste those precious S3700 400gb'ers on SLOG or L2ARC (or even the 200GB ones). A SLOG will rarely use even 8GB of them; the 100GB versions had about half the performance profile of the 200s, and anything bigger than 200GB in the s3700 line performs in line with each other. The sweet spot would be one 100-200GB s3700 for SLOG (no need to mirror IMHO), and one 480-600GB S3610 for L2ARC if you HAVE to use L2ARC (i.e. not enough RAM committed to the ZFS filer).

This is the sweet spot, trust me my man!

EDIT: Sorry, my bad - I see others have already hinted at this and you agree that it may not be the best use case for those sweet sweet thangs :-D Sometimes I get caught up right when I read something, without finishing the thread, and I'm like 'NOOOOOOOOOOO, don't do it!!!' and my fingers get to frantically typing before I finish the thread/other comments that many times have already hit it on the head or covered the same concept/suggestion.

~Take care buddy!
 

whitey

Moderator
Jun 30, 2014
You should now think about your real use case and needs.
If you need a filer and backup capacity, you do not need an Slog (no sync writes) or an L2Arc (given sufficient RAM).

If you need high-performance storage for ESXi, you will always miss that with a regular spindle-based pool.
If you instead build a smaller high-performance pool from the 4 x S3700 without Slog or L2Arc (they are pointless there),
you will get 800 GB (raid-10) or 1.2 TB (Z1).

This means a two-pool solution: one optimized for capacity and a smaller one for performance.
Concur - pretty much what I 'cast my vote for' a page or two ago: a performance pool plus a capacity pool. I hope he heeds our advice :-D

@humbleThC , four 400gb s3700's in a raid-10 (striped mirror) pool will kick ass and take names (for SATA SSDs); no need for SLOG/L2ARC IMHO on SSD/AFA pools. I'd still hunt down a 100-200 GB drive for SLOG for that spinner pool and 'maybe' a 480-600gb s3610 for L2ARC if RAM is limited (under 32GB).
 

humbleThC

Member
Nov 7, 2016
I'm sorry I'm not understanding this at the level and speed I had hoped... But I am ever grateful for everyone's continued support!

So is part of my misunderstanding due to the 80GB of DDR3 ECC RAM on the ZFS server?
i.e. Because of my back-end disks (10x HDD) and front-end network (40Gb IB - theoretical 16Gb cap), I'll 'never' really be able to saturate the entire RAM region, such that my (4) new Intel SSDs would assist?

That's partly very impressive if true, regarding the efficiency of RAM destaging to disk. But it's also a little sad, because ZFS was my solution for being able to use SSDs to cache active data and serve it faster than the underlying HDDs during L1ARC cache misses.

Either that or my math/understanding of the physical hardware capabilities are going wrong somewhere.

(rough #s for the sake of argument, assuming the drives are spread evenly across controllers and PCIe lanes so those aren't a bottleneck)
(1) HDD = 100 MB/s seq read/write, large-block async; 15 MB/s random read/write, small-block sync
(1) SSD = 500 MB/s seq read/write, large-block async; 300 MB/s random read/write, small-block sync

In the case of using the HDDs in RAIDz1 (4+1) * 2 vdevs, I'm looking at approximately (2) HDDs' worth of performance on the capacity pool.
If true, it looks like (1) SSD for SLOG would suffice, and using (2-4) SSDs for SLOG would only yield incremental gains (strictly in terms of returning the write ack to the network client).

My other thought that I'm having a hard time letting go of is... if I'm throwing (4) of these Intel SSDs at the HDD pool, even if I split them for both ZIL and L2ARC, they should outperform the HDD pool 99% of the time when the L2ARC gets a hit that the L1ARC didn't. And with 1TB+ of L2ARC, that's more than the entire active dataset I'll ever really touch.

Part of my misunderstanding is that I'm a 'block storage engineer' in the SAN arena, and I'm used to array-level features like FAST Cache from EMC and NetApp PAM cards, and I incorrectly assumed that L2ARC was pretty much the same thing. Or, if it is very similar, I'm failing to understand why (4) proper Intel SSDs aren't fast enough to be useful.

I do 100% get that the SLOG/ZIL is not a write cache, and accordingly does not double as a read cache. But everything I read about the ZIL says putting SSDs in front of HDDs for small random sync writes is useful. I'm struggling to understand, now with the proper Intel SSDs, why that doesn't apply to my 10-disk array (which could grow to 20-25 disks in the future) - especially since by default the ZIL is carved out of the same HDD pool, and the spindles have to fight each other without a dedicated SLOG.

Or are you simply trying to say:
1) use 10x HDDs for capacity pool, and only use it for large sequential reads/writes , in which case no SSDs would be beneficial.
-- Because even though the SSD-based ZIL captured the write, the HDD pool also has to write the block out to disk before the ACK is returned?

2) use 4x SSDs for a performance only pool, and carve all small random sync writes here, and only here.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
Most any enterprise SSD used as a SLOG on a HDD pool will increase random performance; no one is disagreeing.

@whitey is saying an 'all SSD' pool would be faster than adding an SSD SLOG to a HDD pool, so having 'storage' on HDDs and high performance on another pool is the most cost-effective and performance-minded way to go. BUT you can still add a 200GB S3700 SLOG to your storage pool to speed it up.

And people are saying no L2ARC because you likely have enough RAM (ARC) to satisfy your needs, and adding L2ARC consumes RAM and lowers the available ARC.

RAM is dirt cheap, so instead of adding L2ARC spend the $$ on more RAM until you've reached capacity. You can easily get up to 256GB+ of DDR3 very affordably now.
 

gea

Well-Known Member
Dec 31, 2010
Or are you simply trying to say:
1) use 10x HDDs for capacity pool, and only use it for large sequential reads/writes , in which case no SSDs would be beneficial.
-- Because even though the SSD-based ZIL captured the write, the HDD pool also has to write the block out to disk before the ACK is returned?

2) use 4x SSDs for a performance only pool, and carve all small random sync writes here, and only here.
Basically, yes this is the essence.

While you can use an Slog to improve writes on a disk-based pool,
an SSD-only pool with enterprise SSDs (without an extra Slog/L2Arc) will be far faster, especially on small random writes. On a regular filer, e.g. SMB, sync write is not used, so an Slog is not used at all.

ZFS uses most of the free RAM as read cache. Unless you have many users and a workload with massive random reads from a huge pool, nearly all reads are served from RAM - no need for the slower L2Arc.

The extra L2Arc extends this RAM cache with a (slower) SSD, but it also reduces RAM, as the L2Arc needs RAM for its index. So with your workload the L2Arc may even reduce overall read-cache performance and efficiency. This may be different if you want to read-cache sequential workloads like multi-user video streams.

Most important: the read cache can and will improve access to small random data and metadata. The L2Arc can additionally cache sequential data, or it helps when regular RAM is too small for a good cache hit rate.
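One way to sanity-check this before assigning any L2Arc is to watch the ARC hit rate under a realistic workload - for example with arcstat, assuming it is available on your platform (the raw counters also exist via kstat on illumos/Solaris and under /proc/spl/kstat/zfs/arcstats on Linux):

Code:
# ARC size, hit% and miss rate, sampled every 5 seconds
arcstat 5
# raw counters, illumos/Solaris style
kstat -p zfs:0:arcstats:hits
kstat -p zfs:0:arcstats:misses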
 

whitey

Moderator
Jun 30, 2014
I feel that SSD-accelerated spinner pools are OK for light to mid-use VMs and I/O (you can run a bunch of 'lighter' VMs off it if your 800GB striped-mirror ZFS performance pool is running low on space), but otherwise just use it as a media/bulk storage pool that can sustain a bit of hammering (with at least an SLOG dev; you can forego a read cache if you allocate enough vRAM to the AIO storage appliance). That way, if/when you do need to float something over to it, it won't be 'd|ck in the dirt' if your performance pool is taxed for capacity.

You can always throw 2 more 400gb s3700's at that performance pool to 'grow as you go'. I have 8 HUSSL4040ASS600 400gb SLC SSDs in a striped-mirror (ZFS raid-10) config and run 50+ VMs off of it regularly - anything from light to fairly heavy I/O machines - and it doesn't skip a beat.
 

humbleThC

Member
Nov 7, 2016
Well, I should have my new LSI SAS adapter and 8x 2.5" hot-swap trays to reposition the SSDs by Tuesday. So I'll be able to create the HDD pool without any SSDs and run as many internal benchmarks as possible, as well as external SMB/NFS/iSCSI-over-IB benchmarks, and see what it does.

Then, when the Intel SSDs get here, I'll be able to test them in a pool by themselves vs. added to the capacity pool in various fashions.

As far as RAM: currently at 80GB with all 12 slots full (a mixture of 8x8GB + 4x4GB).
Pricing out refurbished RAM, 12x16GB = 192GB (the max the mobo supports) is about $1000 USD.
So I feel like the best-case scenario is I replace the 4x4GB with 4x8GB and max out at 96GB (if I bother at all).

And I'm limited by my PCIe v2.0 x8 bandwidth (x16 slots, but only x8 lanes active per slot).
So I'm not really thinking NVMe/PCIe storage to break that 1-2 GB/s sustained bandwidth.

I was, however, hoping to throw 2-4 fast SSDs on separate PCIe adapters at it, to kinda best-effort it :)
And pretend those 10x HDDs were fast enough to push 40Gb IB.
 

humbleThC

Member
Nov 7, 2016
Minor update, still waiting on hardware...
Ordered a 5th LSI 9211-8i last minute...

I'm already splitting the SSDs up, one of each type per existing controller.
But for the 10x HDDs, I wanted 5x controllers, so that when I build the RAIDz1 vdevs I can rotate the members evenly across controllers. That way, if any controller goes down, I only lose 1 member of each RAIDz1 group.
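As a sketch (device names are placeholders for however the disks enumerate per HBA), two 4+1 raidz1 vdevs with one disk from each of the five controllers would look something like:

Code:
# c1..c5 = the five LSI controllers; each raidz1 vdev gets one disk per HBA
zpool create tank \
  raidz1 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
  raidz1 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0

With that layout, a single failed HBA takes one disk out of each raidz1 vdev and the pool stays online, just degraded.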

Intel SSDs should be here tomorrow, but now I have to wait on the 5th LSI controller before I can finalize the HDD pool.
Should be able to test the Intel only pool tomorrow evening though.

New/Final Storage Layout
[screenshot attached: storage layout]
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
:confused: Not really sure why you would post and ask questions and then ignore people who've been working with ZFS for many, many years and who even sell their own ZFS management software.

You've been told how to do it 'right' for performance and you've ignored the advice.

The Samsung 850 EVO (or even PRO) are not performance drives, especially not in a ZFS environment where there's no TRIM and you must rely on the enterprise-grade garbage collection/management found on enterprise SSDs.
 

humbleThC

Member
Nov 7, 2016
:confused: Not really sure why you would post and ask questions and then ignore people who've been working with ZFS for many, many years and who even sell their own ZFS management software.

You've been told how to do it 'right' for performance and you've ignored the advice.

The Samsung 850 EVO (or even PRO) are not performance drives, especially not in a ZFS environment where there's no TRIM and you must rely on the enterprise-grade garbage collection/management found on enterprise SSDs.
Umm, I think I have been listening pretty well - just not always understanding, hence the continued questions.
But if you were fully up to speed: I no longer plan to use the Samsung Evos for anything other than a RAID1/0 SSD pool. And I took the recommendation and purchased (4) Intel S3710s as the right tool for the job.

And I'm pretty sure I'll keep the Intels in their own separate RAID1/0 pool as well (per the advice here, even though I purchased them for ZIL/L2ARC).

The only thing I said I'd do is benchmark the HDD pool with the Intels in front of it, just to see the results with and without SSDs on the HDD pool (again, using the right SSDs).
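Since log and cache devices can be added and removed on the fly, that with/without comparison should be easy to script - roughly (pool and device names are placeholders):

Code:
# baseline: benchmark the bare HDD pool first, then bolt the Intels on and re-run
zpool add tank log mirror intel0 intel1
zpool add tank cache intel2 intel3
# ... run the same benchmarks ...
zpool remove tank intel2 intel3
zpool remove tank mirror-1    # the log mirror's vdev name as shown by 'zpool status'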
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
I'm going by your most recent layout update. That's all.

If you're using hardware RAID, why are the Samsung drives in a vdev at all then?