ZIL question

josh

Active Member
Oct 21, 2013
435
107
43
Hey guys,

I have a Z2 array of 6x14TB EasyStores and they do about 800MB/s of write as an array.

I have some MZ1LV960HCJH NVMe drives sitting around. Benchmarked, they each do about the same write throughput as the Z2 array as a whole.

Is there any purpose in adding them as ZIL drives?

Thanks
 

gea

Well-Known Member
Dec 31, 2010
2,538
856
113
DE
Set sync to always, redo the performance test, and don't be surprised if you end up with less than 50 MB/s. Then add the Samsung as an Slog and you may be at 300 MB/s. Add an Optane Slog (the best of all NVMe Slogs) and you may be at maybe 450 MB/s.

You may also add the NVMe as a special vdev mirror to additionally improve small I/O, metadata, and dedup performance, or to force a performance-sensitive filesystem onto it for all its operations.
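These suggestions translate into a few commands. A minimal sketch, assuming a pool named `tank` and that the Samsung drives appear as `/dev/nvme0n1` and `/dev/nvme1n1` (all names here are placeholders; the `special_small_blocks` value is an example, not a recommendation):

```shell
# Force sync semantics for the benchmark comparison:
zfs set sync=always tank

# Add one NVMe as an Slog (log vdev):
zpool add tank log /dev/nvme0n1

# Or add both as a mirrored special vdev for metadata/small blocks:
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Route all blocks of one performance-sensitive filesystem to the
# special vdev: setting special_small_blocks up to the recordsize
# sends every block of that dataset to the special vdev.
zfs set special_small_blocks=128K tank/fastfs
```

Note that `zpool add` is permanent for special vdevs on raidz pools, so it is worth testing the layout on a scratch pool first.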
 

josh

I have the Samsungs lying around unused. Not really interested in spending for a marginally faster drive for SLOG when these would already be a performance increase.

I already have a tier of HGST SSDs for OS/DB level filesystem. Not sure there's benefit in having another tier above that. Just looking to maybe improve the HDD tier for now.

When you say a 50 MB/s perf test, you are still talking about the entire Z2 array as a whole and not the individual drives, right? What parameters should I use on the test to get that value? 4G?

Btw, 960 GB seems pretty overkill for a ZIL. My intention is to overprovision the Slog, but can I set aside some of it for L2ARC?
 

Rand__

Well-Known Member
Mar 6, 2014
4,644
922
113
With sync activated your spinner pool will drop significantly in performance.
It's fine to test with the same parameters as you did for your 800 MB/s value, just to see the difference.

For a non-Optane drive it's not recommended to split the device into L2ARC and Slog.
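One way to run that comparison, assuming a dataset mounted at `/tank`; the 4G size and 1M block size mirror the parameters of the original 800 MB/s test and are only an example (disable compression on the test dataset, or use random data instead of `/dev/zero`, or the numbers will be inflated):

```shell
# Async (default) baseline:
zfs set sync=standard tank
dd if=/dev/zero of=/tank/bench.bin bs=1M count=4096 conv=fdatasync

# Force every write through the ZIL and repeat:
zfs set sync=always tank
dd if=/dev/zero of=/tank/bench.bin bs=1M count=4096 conv=fdatasync

rm /tank/bench.bin
```

`conv=fdatasync` flushes at the end of the async run so both timings include the data actually reaching disk.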
 

josh

Thanks. I've done more research and I think I understand why it looked good before: with sync off, the system was not writing the entire transaction to disk before reporting completion. Is mirroring the Slog a waste of a 1TB drive, or should I just put the second SSD to L2ARC?
Also, how do I calculate the size of the Slog to be overprovisioned? Which value is representative of ZIL_MAX_WRITE_SPEED: the write speed of the Slog or of the pool?
 

gea

Well-Known Member
Dec 31, 2010
2,538
856
113
DE
On Solaris with native ZFS the Slog should hold around 2 x 5 s of writes; on a 10G network this means around 10 GB. On Open-ZFS the size is related to RAM: the default write cache is 10% of RAM, max 4 GB. When the RAM write cache is full it is written to the pool, and it is saved to the Slog to protect the cache in case of a crash until the next reboot. Add the same amount again as a buffer for the next writes, which gives around 10 GB as a minimum. With traditional flash, where you cannot write single bytes like with Optane, use a larger Slog, say 20-50 GB, and keep the rest empty (on a new or securely erased SSD/NVMe).

What matters for Slog performance is low latency and high steady write IOPS at queue depth 1. Powerloss protection is required.

An Slog mirror can protect the RAM cache against an Slog failure combined with a simultaneous crash, which is not very likely. More important is that there is no performance degradation on a failed Slog, as logging then falls back to the pool itself (the onboard ZIL).
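The "small partition, rest empty" approach can be sketched as follows, assuming the drive is `/dev/nvme0n1`, is new or secure-erased, and holds no data (device names and the 50 GB size are examples):

```shell
# Wipe any old partition table:
sgdisk --zap-all /dev/nvme0n1

# Create a single 50 GB partition and leave the remaining ~910 GB
# unpartitioned as overprovisioning headroom:
sgdisk -n 1:0:+50G /dev/nvme0n1

# Attach the slice as an Slog:
zpool add tank log /dev/nvme0n1p1

# Mirrored variant with a second, identically partitioned drive:
# zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
```

Leaving the rest of the drive untouched only helps if the firmware can treat those never-written blocks as spare area, hence gea's point about starting from a secure-erased drive.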
 

josh

Thank you. I currently have 4 of these drives on a Hyper M2 and I was thinking of mirroring 2 as a scratch drive for dataset processing, 1 for L2ARC and 1 for SLOG. Since I only need about 50GB of the drive for SLOG, I believe the 960GB drive will never wear out when overprovisioned. Am I correct to think this way?
 

gea

If you only use 5% of an SSD, wear-out should not be a problem with a server SSD, even when it must process the whole write load (on the pool this load is distributed over the disks in a raid-z).
 

josh

If the drive is rated for 1.3DWPD and you fill 5% of the drive every 5 seconds (pre-flush), won't it wear out really fast?
 

Rand__

My first thought was nah, that can't be an issue, but when I quickly calculated it, it looked different.

If you use at most 5% of the drive, you basically have 20 times the rated DWPD if wear leveling takes place, so 26 DWPD in this example.
That's 26 × 960 GB = 24,960 GB per day; divided by 50 GB per flush that's ~499 writes, and one every 5 seconds gives us ~41 minutes :O

The 1.3 DWPD rating is usually for 3 or 5 years, so that's 41 min × (3|5) × 365, i.e. the total endurance is gone in 31/52 days...

Of course, nobody actually writing 10 GB/s to the ZIL 24/7 would consider a 960 GB 1.3 DWPD drive a suitable candidate ;)
Nor would a 960 GB 1.3 DWPD drive actually be able to deliver a sustained write rate of 10 GB/s (or at least none that I know of).
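The arithmetic above can be reproduced in a few lines of shell. The 960 GB / 1.3 DWPD / 50 GB-per-flush figures are the hypothetical numbers from this thread, not a recommendation, and the "26 DWPD" line is the optimistic reading that josh disputes in the next post:

```shell
# Hypothetical figures from the thread: 960 GB drive rated 1.3 DWPD,
# ~50 GB (5%) written per flush, one flush every 5 seconds.
CAP_GB=960
FLUSH_GB=50
FLUSH_SEC=5

# Optimistic "20x DWPD" view: 26 effective DWPD against full capacity.
DAILY_GB=$(( 26 * CAP_GB ))                # 24960 GB/day
FLUSHES=$(( DAILY_GB / FLUSH_GB ))         # ~499 flushes
MINUTES=$(( FLUSHES * FLUSH_SEC / 60 ))    # ~41 minutes
echo "daily budget gone after ~${MINUTES} minutes"

# Stricter view: the NAND budget is fixed at rating x capacity,
# i.e. 1.3 DWPD x 960 GB = 1248 GB/day, however little LBA space
# you actually expose.
FIXED_GB=$(( 13 * CAP_GB / 10 ))
echo "fixed budget: ${FIXED_GB} GB/day"
```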
 

josh

If you use at most 5% of the drive, you basically have 20 times the rated DWPD if wear leveling takes place, so 26 DWPD in this example.
That's 26 × 960 GB = 24,960 GB per day; divided by 50 GB per flush that's ~499 writes, and one every 5 seconds gives us ~41 minutes :O
I don't think it works this way... you should only get 26 × 5% × 960 GB.
 

gea

I would calculate it like this:
If you enable sync on a disk pool with such a good flash Slog, you cannot expect more than, say, 300 MB/s of constant write rate; with an Optane maybe 500 MB/s. This is also what a 10G network offers without special tuning.

If you write at this rate constantly, you get the following per day:
300 MB/s × 60 (min) × 60 (hour) × 24 (day) = around 25,000 GB/day

This is roughly the value you can expect with 95% overprovisioning on this new/securely erased flash disk (the drive firmware should then hopefully distribute writes over the whole flash).

Of course, writing constantly 24/7 at this rate comes near the limits; you should never allow this in a production system.
An Intel DC P4801X would then be more suitable, more durable - and faster.
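gea's per-day figure works out as follows (300 MB/s is his assumed sustained sync write rate, not a measured value, and 1 GB = 1000 MB here):

```shell
RATE_MB_S=300
GB_PER_DAY=$(( RATE_MB_S * 60 * 60 * 24 / 1000 ))
echo "${GB_PER_DAY} GB/day"   # 25920, i.e. "around 25,000 GB/day"
```

Against a 960 GB drive that is roughly 27 full drive writes per day, which is why the endurance discussion above matters at all.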
 

josh

I've just thought about what you mentioned in more detail: if I'm not expected to get more than 300 MB/s on the Slog, what's the purpose of using an NVMe? Couldn't I just plug in an additional HGST 400GB with 7.3PB write endurance and not worry about failure for years?
 

azev

Active Member
Jan 18, 2013
740
212
43
Most NVMe max throughput numbers are measured at high queue depths. The same NVMe will post much smaller numbers when you set a queue depth of 1, which is why gea said you'll most likely only get around 300 MB/s with a consumer-grade NVMe.
As for your question, I think it's a good idea to do it; it's set it and forget it, instead of worrying about the Slog wearing out :)
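A queue-depth-1 sync-write test with fio approximates the Slog write pattern far better than high-queue-depth sequential benchmarks. A sketch, with the target path as a placeholder (this writes to the target, so do not point it at a device holding data):

```shell
fio --name=slog-qd1 --filename=/tank/fio.test --size=1G \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=sync --fsync=1 --runtime=30 --time_based
```

`--fsync=1` forces a flush after every write, so the reported latency and IOPS reflect what a single-threaded sync workload would actually see.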