ZFS Thoughts: Best Inexpensive ZIL?

Kaishi

New Member
Mar 7, 2011
ZFS Thoughts: Best Inexpensive SLOG?

This is a crosspost from [H]ardforum. I respect ServeTheHome, but because the community is small, I'm reaching out elsewhere as well.

Introduction
I'm looking for some feedback on my upcoming ZFS build, but I'll get to that later. I have a very good understanding of the mechanics of ZFS, and the general requirements for each subsystem therein. My concern here is specifically the ZFS Intent Log, or ZIL.

Background on the ZIL
The ZIL stores intended changes before they are committed to disk in the zpool. Writes to the ZIL act somewhat like a cache: random writes logged in the ZIL can be committed to the zpool as contiguous writes. By default, the ZIL is interleaved with the rest of the storage space in the zpool, but you can assign a specific device to host it. The device used to host the ZIL is known as the SLOG (separate log device). Ultimately, the IOPS of the SLOG bound the zpool's synchronous write performance: changes must first be made to the ZIL, then copied to the zpool, then committed. If the SLOG can't keep up with the zpool, aggregate performance will decrease. If performance decreases enough, ZFS will (if I understand correctly) abandon the dedicated SLOG.
NOTE: Should the ZIL be lost, ZFS may have difficulty recovering the zpool. For this reason, SLOGs should be mirrored if possible. EDIT: Apparently, this behavior has been fixed in v28, so mirroring may not be as critical.
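As a sketch of what this looks like in practice (the pool name "tank" and the device names are hypothetical), a dedicated, mirrored SLOG is attached to an existing pool with `zpool add`:

```shell
# Hypothetical pool ("tank") and device names; substitute your own.
# Attach a mirrored log (SLOG) vdev to an existing pool:
zpool add tank log mirror c1t5d0 c1t6d0

# The log vdev then appears under a "logs" section in the pool layout:
zpool status tank
```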

So, which devices make for the best SLOG? Conventional thinking points to three classes of devices, in descending order of maximum IOPS:

1) NVRAM: battery-backed DRAM devices. The IOPS here are off the charts, especially if the device uses the PCIe bus directly, rather than SATA.

2) SLC SSD: enterprise-grade SLC NAND flash devices. The IOPS are much lower than NVRAM's, but still much better than traditional HDDs'. A concern with these is the onboard DRAM cache: any writes to the ZIL still pending in that cache will be lost in the event of an unexpected loss of power. Ideally, the cache should be backed by a supercapacitor or dedicated battery; 3rd-gen Intel SLC SSDs should meet this requirement.

3) 10k+ RPM SAS: enterprise-grade HDDs. These drives have the highest IOPS of conventional HDD designs, as they prioritize IOPS and low seek times over raw throughput (throughput can be aggregated across drives, but IOPS and seeks occur per drive). They are faster than consumer-grade HDDs, but still about two orders of magnitude slower than SLC SSDs.

Comparisons
Deciding between these three technologies generally boils down to budgetary constraints: NVRAM devices are exceedingly expensive ($30+ per GB), and somewhat rare. SLC SSDs cost $10-20 per GB. SAS drives are usually around $3 per GB.

NVRAM devices are also the most volatile: any power outage lasting more than a few hours will jeopardize the contents of the ZIL, the loss of which can corrupt the entire zpool. However, they are not worn by write cycles. All SSDs have a limit to how many writes they can sustain, and while they employ different strategies to minimize writes (compression, caching, over-provisioning, TRIM), it is writes themselves that eventually push these devices into the recycling pile. Since the ZIL exists specifically to absorb synchronous writes, SSDs can be rapidly consumed by the very role we assign them.

While SAS drives are prone to read errors, lost DRAM cache, and insufficient IOPS, they are able to sustain loss-of-power indefinitely, and a near-infinite number of write operations. They just aren't quite fast enough to make an ideal SLOG.

My Thoughts
Interleaving the ZIL with the zpool data does reduce zpool performance: HDD heads must seek from data to ZIL, back to data, then back to ZIL, just to perform a simple change. The easiest solution is to move the ZIL to a dedicated HDD in the zpool, which eliminates the seek bottleneck but also caps ZIL throughput and IOPS at what a single drive can deliver. Using a special device for the ZIL role makes a lot of sense, but enterprise-grade solutions like NVRAM are not cost-effective outside of Fortune 500 companies. SSDs would need to be replaced regularly (6-18 month lifespan), at a cost of at least $350 per drive, or a minimum of $700 to maintain a mirror.

Proposal
Two 2.5" 10-15k RPM SAS drives in a ZFS mirror will have greater IOPS than a normal 3.5" 5400-7200 RPM commodity storage drive. If WD VelociRaptors were used, a pair of the 150GB model would be sufficient for capacity, and could bring the total cost down below $200.

The Questions
Would the 2-drive SAS configuration I proposed above provide sufficient IOPS for ZFS to make good use of that vdev as a SLOG? Would it offer a significant performance improvement over the interleaved ZIL?

If I have any misunderstanding of ZFS mechanics, please correct me. I want to understand this technology inside and out.
 

Kaishi

New Member
Mar 7, 2011
My ZFS build

In the OP, I mentioned that I had been preparing a ZFS build list. I have now made my final decisions and ordered the components. Here's the list:

Chassis: NORCO RPC-4224 with SFF-8087 cables (one reverse-breakout)
Motherboard: Supermicro H8DG6-F with integrated IPMI and LSI-based SAS controller, terminating in SFF-8087 ports.
CPU: 2x AMD Opteron 6128 (Magny-Cours, octacore, 2.0 GHz)
RAM: 8x 4GB ECC DDR3-1333
Disks: 10x Samsung Spinpoint F4 HD204UI 2TB
PSU: Corsair AX-750 with necessary molex splitter and EPS12v splitter​

My plan is to add a SAS expander in the future, as well as an external SAS controller and an SSD (or two) to serve as the SLOG. I intend to use double-sided tape to attach the SSDs to the inside of the chassis walls, rather than using any of my precious 3.5" hotswap bays.

The last thing I need to decide on is a 2-port Ethernet adapter to aggregate with the 2 onboard ports (802.3ad link aggregation; my other server uses it, and my switches support it). Additionally, I'm debating building an iSCSI fabric of some kind, using a dedicated VLAN or maybe a whole switch.

Regarding the drives, I intend to create five 2-drive mirrored vdevs in a single zpool, with the intention of adding additional vdevs as necessary. I did buy an extra drive to use as a hot spare, but I may instead put it in an external chassis for the time being and buy another one later.
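For concreteness, the layout described above (2-drive mirrors in one pool, plus a hot spare) would be created along these lines; all device names here are hypothetical placeholders:

```shell
# Hypothetical device names; substitute your own.
zpool create tank \
  mirror c0t0d0 c0t1d0 \
  mirror c0t2d0 c0t3d0 \
  mirror c0t4d0 c0t5d0 \
  mirror c0t6d0 c0t7d0 \
  mirror c1t0d0 c1t1d0 \
  spare c1t2d0

# Growing the pool later is just another vdev:
zpool add tank mirror c1t3d0 c1t4d0
```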
 

ACCS

New Member
Apr 6, 2011
San Diego
www.accs.com
There's one thing that I think you're missing here. The purpose of the ZIL is to be a write journal. The OS writes FS-critical information to the ZIL, then puts it in the write queue for the file-system. When the I/O to the file-system completes, the transaction is marked as complete in the ZIL.

Under normal circumstances, the ZIL is never read. This is to be expected, as the ZIL exists to recover from unusual circumstances (a crash or power outage). In the event of a power outage (including PSU failure and UPS failure), the device hosting the ZIL has to be able to flush its cache, or there's a potential for FS structure corruption. Most disks and SSDs can't do this. I've heard of one SSD that can (I don't recall the name, but it's one that Sun used in some of their systems for this purpose), and I've heard that some of the new Intel SSDs will also (I haven't verified).

I'd look for a device that has the ability to flush the cache on a power outage. Without that ability, just use the internal ZIL.
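For what it's worth, falling back to the interleaved ZIL is straightforward even after trying a dedicated device: since zpool version 19, a log vdev can be removed again (pool and device names here are hypothetical):

```shell
# Remove a previously added log device; ZFS reverts to the interleaved ZIL.
zpool remove tank c1t5d0
```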
 

ACCS

New Member
Apr 6, 2011
Chassis: NORCO RPC-4224 with SFF-8087 cables (one reverse-breakout)
Motherboard: Supermicro H8DG6-F with integrated IPMI and LSI-based SAS controller, terminating in SFF-8087 ports.
If the MB has an SFF-8087 connector, then you need a cable with an SFF-8087 connector on the source end and the appropriate connector for the Norco backplane on the other end. This CANNOT be a reverse breakout cable, as that cable goes from individual SAS ports on the MB to an SFF-8087 connector on the backplane. It would most likely be either an SFF-8087 -> SFF-8087 cable or an SFF-8087 -> SAS (standard breakout) cable.

The breakout cables are directional.
 

MACscr

Member
May 4, 2011
So did you end up going with no cache or ZIL drives to start out? I notice you never mentioned cache (L2ARC) drives at all. Also, how do we determine how much storage is needed for the ZIL anyway? Is that much processing power really needed? And you didn't mention NICs; I would highly recommend Intel Pro NICs.

EDIT: I'm blind, it is ECC memory.
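On the sizing question: a common rule of thumb is that the ZIL only needs to hold the data that arrives between transaction-group commits, so the required capacity is roughly the maximum ingest rate times a couple of commit intervals. A sketch under that assumption, using the 2x 1 GbE aggregation from the build above (the interval is an assumption, not a measured value):

```shell
# SLOG sizing rule of thumb: max ingest rate x ~2 transaction-group intervals.
links=2           # aggregated 1 GbE ports (from the build above)
mb_per_sec=125    # ~1 Gb/s expressed in MB/s
interval=10       # roughly two 5-second transaction groups
echo "$((links * mb_per_sec * interval)) MB"   # prints "2500 MB"
```

In other words, a few GB is plenty; even the smallest power-safe device on the market would not be capacity-limited in this role.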
 