I got this Optane drive on this deal from Newegg, but still haven't decided what I'm going to do with it. There's nothing insanely compelling about a single Optane drive, and I knew this going into the purchase; 3D XPoint was just too shiny not to get my grubby hands on some.
But as far as I know (and I hope my knowledge is outdated here), it's not advisable to give a ZFS pool a single point of failure. So even though Optane can produce great results as a special (metadata/small-file) vdev, SLOG, or L2ARC, having only one means my pool goes poof if the Optane dies. Only the special vdev carries that risk, though: a dead SLOG costs at most the last few seconds of sync writes, and a dead L2ARC is just lost cache.
Optane is clearly mostly wasted on anything else I can think of, like:
- gaming desktop OS disk
- Linux OS disk
- applications that stream data (HDDs in a regular ZFS setup handle that fine)
I do wonder though: what if I periodically make a block-for-block clone of the Optane and run it as a lone special vdev in ZFS? If the Optane dies, I have a backup I can save my pool with!
OpenAI o3 tells me this, and I'm inclined to believe it:
Why a single‑drive special vdev is risky
- All metadata (and any data blocks below special_small_blocks) lives only on the special vdev class.
- ZFS keeps the same redundancy within that class. A single device ⇒ no redundancy.
- If the device is lost the pool usually fails to import; a stale clone lacks the last transaction groups and still leaves you rebuilding by hand.
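For reference, the mirrored alternative looks something like this. This is a sketch, not my actual commands: the pool name, disk layout, and device names are all placeholders, and `special_small_blocks` is a per-dataset property.

```shell
# Hypothetical layout: six-disk raidz2 plus a mirrored special vdev,
# so metadata survives the loss of either fast device.
zpool create tank raidz2 sda sdb sdc sdd sde sdf \
    special mirror nvme0n1 nvme1n1

# Route blocks of 64K and under (plus all metadata) to the special vdev.
# Keep this below recordsize, or everything lands on the special class.
zfs set special_small_blocks=64K tank
```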
In light of this (and other deeper research with o3, which had some great advice for me, like "Either way, keep normal ZFS snapshots and an off-box zfs send backup; the fastest SSD in the world won't save you from fat-finger deletes"), my path forward looks fairly clear. I could:
- mirror the Optane with a NAND NVMe drive for the special vdev, benefiting from Optane read latency while being stuck with NVMe write latency (ZFS can serve reads from either side of a mirror, but a write only completes once both sides have it). This presumably shapes the performance of fetching metadata and small files. Latencies look like ~10µs for Optane versus 70-120µs for NVMe reads, with NVMe writes up to ~300µs.
- mirror two fast NVMes for the special vdev. Still a massive speedup compared to rust, just technically slower than the above. Not likely something I'll do, mainly because I barely have a pair of Gen 4 SSDs, and any I do have I probably wouldn't want to dedicate to special vdev duty.
- run a single Optane special vdev and lean on regular pool backups, transitioning to a mirrored special vdev once a second Optane arrives (which may not happen for a long time; street price seems to be $300-500 for 960GB 905Ps, which goes a long way toward justifying my unjustified purchase). I could also mirror it with some NVMe in the meantime, at a slight write-performance cost.
With all of these (except maybe the last; it seems silly to also put a SLOG load on a single Optane that's trusted with all the metadata, likely increasing its failure rate), the Optane can practically be carved into more partitions and deployed as SLOG and L2ARC where necessary, gaining performance in the relevant workloads. Even L2ARC feels like a waste, but probably not if the rootfs lives on this ZFS setup.
Also, overall I'm likely overemphasizing the importance of the special metadata vdev. I think (but am not confident) that Optane for SLOG is the bread and butter of people justifying going out to get Optane. What's nice is that only a small amount of storage is needed, since the SLOG is never read until after a crash and only has to hold a few seconds of writes. Carving out 16GB for SLOG on a 960GB Optane leaves the vast bulk of the space, with that sweet low latency, free for other uses.
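A quick sanity check on the 16GB figure: a SLOG only has to hold the sync writes that land between transaction group commits (~5 seconds by default in OpenZFS), so even a generous ingest estimate stays small. The rates below are assumptions I picked for illustration, not measurements:

```shell
# Rough SLOG sizing: ingest rate x txg interval x safety factor.
ingest_gbs=1.25   # ~10GbE line rate, an assumed ceiling
txg_seconds=5     # OpenZFS default zfs_txg_timeout
headroom=2        # keep two txgs' worth in flight
awk -v r="$ingest_gbs" -v t="$txg_seconds" -v h="$headroom" \
    'BEGIN { printf "SLOG needed: ~%.1f GB\n", r * t * h }'
# prints: SLOG needed: ~12.5 GB
```

So 16GB already has margin even at full 10GbE ingest; 32GB is pure paranoia.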
I guess I've got to think some more about what makes sense, but I should probably start looking for a really fast, small, cheap Gen 4 NVMe SSD (250GB should be good for my pool, which will be around 56TB *primarily* storing large files, though I'm hoping to make it general-purpose fast storage with all these fancy things I'm talking about) to mirror with my Optane for special metadata use. Then I can partition the 960GB Optane: 250GB for the special vdev mirrored with that SSD, then 16GB (32GB if we're being really conservative) of the remaining 710GB for SLOG, and the remaining 694 or 678GB for L2ARC, or further partitioned into some kind of scratch space for any relevant applications. (update: looks like I may want a smaller L2ARC to start, since L2ARC also has some RAM overhead) Looks like I've got myself a detailed plan...
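On that L2ARC RAM overhead: every record cached in L2ARC keeps a small header resident in ARC (on the order of 70 bytes in current OpenZFS; treat the exact figure as an assumption). At 128K records on a large-file pool it's negligible, but small records change the math:

```shell
# RAM held by L2ARC headers = (L2ARC size / record size) x header size.
l2arc_headers_mb() {
    local l2_gb=$1 rec_kb=$2 hdr_bytes=${3:-70}   # 70B/header is an assumption
    awk -v s="$l2_gb" -v r="$rec_kb" -v h="$hdr_bytes" \
        'BEGIN { printf "%.0f MB\n", s * 1024 * 1024 / r * h / 1048576 }'
}
l2arc_headers_mb 200 128   # 128K records -> prints "109 MB"
l2arc_headers_mb 200 16    # 16K records  -> prints "875 MB"
```

With 128GB of RAM, ~109MB is nothing; it only gets interesting if the L2ARC fills with small blocks.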
The aforementioned "really fast but can be small" SSD could very well be a 280GB 905P or 900P, and then I'd have all my bases covered. That's going to mean over $300 of Optane in the rig though, so it had better speed something up...
I'll probably keep an eye out for cheap Optane deals, and in the meantime:
- take a 250GB partition from some Gen 4 TLC NVMe and mirror it with a 250GB partition of the Optane as a special metadata vdev for my main pool
- split the Optane into 3 more partitions:
  - 16 or 32GB for SLOG
  - 200GB for L2ARC (the system has 128GB of DDR4, supposedly ECC but I haven't really confirmed, running a 5950X on a Dark Hero board)
  - the remaining 460+GB for bcache
This gives half a TB of high-performance Optane caching for writes to the pool (via bcache), the fastest possible metadata/small-file reads, faster-than-usual sync writes thanks to the SLOG, and significantly enhanced read caching. The split between bcache and L2ARC can be adjusted later.
The remaining caveat, non-fastest-possible metadata/small-file writes, can be addressed by installing a second Optane and making the special vdev a full Optane mirror. I did check that special vdev duty shouldn't be troublesome for the NVMe in terms of endurance.
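For anyone repeating that endurance check, the back-of-envelope version is just the drive's rated DWPD times the capacity in use. The 0.3 DWPD figure below is an assumed ballpark for a consumer TLC drive; check the actual spec sheet:

```shell
# Daily write budget = rated DWPD x capacity dedicated to the special vdev.
dwpd=0.3       # assumed rating for a consumer TLC NVMe
part_gb=250
awk -v d="$dwpd" -v s="$part_gb" \
    'BEGIN { printf "budget: %.0f GB/day of metadata+small-file writes\n", d * s }'
# prints: budget: 75 GB/day of metadata+small-file writes
```

A mostly-large-file home pool won't churn anywhere near 75GB of metadata a day, which matches my conclusion above.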
Pretty cool, but I've also got to admit all this complexity gives cheap servers with oodles of RAM some extra appeal. 50 or more GB/s of DRAM bandwidth is a whole other level compared to the <10GB/s from Optane. That's part of why Optane died: servers just load up on DRAM.
AI is really useful for brainstorming homelab configuration.