ZIL question

josh

Active Member
Oct 21, 2013
435
107
43
Hey guys,

I have a Z2 array of 6x14TB EasyStores and they do about 800MB/s of write as an array.

I have some MZ1LV960HCJH NVMe drives sitting around. Benchmarked, they each do about the same write throughput as the Z2 array as a whole.

Is there any purpose in adding them as ZIL drives?

Thanks
 

gea

Well-Known Member
Dec 31, 2010
2,538
856
113
DE
Set sync to always, redo the performance test, and don't be surprised if you end up with less than 50 MB/s. Then add the Samsung as an Slog and you may be at 300 MB/s. Add an Optane Slog (the best of all NVMe Slogs) and you may be at maybe 450 MB/s.

You may also add the NVMe as a special vdev mirror to additionally improve small I/O, metadata, and dedup performance, or to force a performance-sensitive filesystem onto it for all its operations.
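These suggestions translate into a few commands. A minimal sketch, assuming a pool named `tank` and that the Samsung drives appear as `/dev/nvme0n1` and `/dev/nvme1n1` (all names here are placeholders; the `special_small_blocks` value is an example, not a recommendation):

```shell
# Force sync semantics for the benchmark comparison:
zfs set sync=always tank

# Add one NVMe as an Slog (log vdev):
zpool add tank log /dev/nvme0n1

# Or add both as a mirrored special vdev for metadata/small blocks:
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Route all blocks of one performance-sensitive filesystem to the
# special vdev: setting special_small_blocks up to the recordsize
# sends every block of that dataset to the special vdev.
zfs set special_small_blocks=128K tank/fastfs
```

Note that `zpool add` is permanent for special vdevs on raidz pools, so it is worth testing the layout on a scratch pool first.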
 

josh

I have the Samsungs lying around unused. Not really interested in spending for a marginally faster drive for SLOG when these would already be a performance increase.

I already have a tier of HGST SSDs for OS/DB level filesystem. Not sure there's benefit in having another tier above that. Just looking to maybe improve the HDD tier for now.

When you say a 50 MB/s perf test, you are still talking about the entire Z2 array as a whole and not the individual drives, right? What parameters should I use on the test to get that value? 4G?

Btw, 960 GB seems pretty overkill for a ZIL. My intention is to overprovision the Slog, but can I set aside some of it for L2ARC?
 

Rand__

Well-Known Member
Mar 6, 2014
4,644
922
113
With sync activated your spinner pool will drop significantly in performance.
It's fine to test with the same parameters as you did for your 800 MB/s value, just to see the difference.

For a non-Optane drive it's not recommended to split the device into L2ARC and Slog.
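One way to run that comparison, assuming a dataset mounted at `/tank`; the 4G size and 1M block size mirror the parameters of the original 800 MB/s test and are only an example (disable compression on the test dataset, or use random data instead of `/dev/zero`, or the numbers will be inflated):

```shell
# Async (default) baseline:
zfs set sync=standard tank
dd if=/dev/zero of=/tank/bench.bin bs=1M count=4096 conv=fdatasync

# Force every write through the ZIL and repeat:
zfs set sync=always tank
dd if=/dev/zero of=/tank/bench.bin bs=1M count=4096 conv=fdatasync

rm /tank/bench.bin
```

`conv=fdatasync` flushes at the end of the async run so both timings include the data actually reaching disk.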
 

josh

Thanks. I've done more research and I think I understand why it looked good before: with sync off, the system was not writing the entire transaction to disk before reporting completion. Is mirroring the Slog a waste of a 1TB drive, or should I just put the second SSD to L2ARC?
Also, how do I calculate the size of the Slog to be overprovisioned? Which value is representative of ZIL_MAX_WRITE_SPEED: the write speed of the Slog or of the pool?
 

gea

Well-Known Member
Dec 31, 2010
2,538
856
113
DE
On Solaris with native ZFS the Slog should hold around 2 x 5 s of writes; on a 10G network this means around 10 GB. On Open-ZFS the size is related to RAM: the default write cache is 10% of RAM, max 4 GB. When the RAM write cache is full it is written to the pool, and it is saved to the Slog to protect the cache in case of a crash until the next reboot. Add the same amount again as a buffer for the next writes, which gives around 10 GB as a minimum. With traditional flash, where you cannot write single bytes like with Optane, use a larger Slog, say 20-50 GB, and keep the rest empty (on a new or securely erased SSD/NVMe).

What matters for Slog performance is low latency and high steady write IOPS at queue depth 1. Powerloss protection is required.

An Slog mirror can protect the RAM cache against an Slog failure combined with a simultaneous crash, which is not very likely. More important is that there is no performance degradation on a failed Slog, as logging then falls back to the pool itself (the onboard ZIL).
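The "small partition, rest empty" approach can be sketched as follows, assuming the drive is `/dev/nvme0n1`, is new or secure-erased, and holds no data (device names and the 50 GB size are examples):

```shell
# Wipe any old partition table:
sgdisk --zap-all /dev/nvme0n1

# Create a single 50 GB partition and leave the remaining ~910 GB
# unpartitioned as overprovisioning headroom:
sgdisk -n 1:0:+50G /dev/nvme0n1

# Attach the slice as an Slog:
zpool add tank log /dev/nvme0n1p1

# Mirrored variant with a second, identically partitioned drive:
# zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
```

Leaving the rest of the drive untouched only helps if the firmware can treat those never-written blocks as spare area, hence gea's point about starting from a secure-erased drive.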
 

josh

Thank you. I currently have 4 of these drives on a Hyper M2 and I was thinking of mirroring 2 as a scratch drive for dataset processing, 1 for L2ARC and 1 for SLOG. Since I only need about 50GB of the drive for SLOG, I believe the 960GB drive will never wear out when overprovisioned. Am I correct to think this way?
 

gea

If you only use 5% of an SSD, wear-out should not be a problem with a server SSD, even when it must process the whole write load (on the pool this load is distributed over the disks in a raid-z).
 

josh

If the drive is rated for 1.3DWPD and you fill 5% of the drive every 5 seconds (pre-flush), won't it wear out really fast?
 

Rand__

My first thought was nah, that can't be an issue, but when I quickly calculated it, it looked different.

If you use at most 5% of the drive, you basically have 20 times the rated DWPD if wear leveling takes place, so 26 DWPD in this example.
That's 26 × 960 GB = 24,960 GB per day; divided by 50 GB per flush that's ~499 writes, and one every 5 seconds gives us ~41 minutes :O

The 1.3 DWPD rating is usually for 3 or 5 years, so that's 41 min × (3|5) × 365, i.e. the total endurance is gone in 31/52 days...

Of course, nobody actually writing 10 GB/s to the ZIL 24/7 would consider a 960 GB 1.3 DWPD drive a suitable candidate ;)
Nor would a 960 GB 1.3 DWPD drive actually be able to deliver a sustained write rate of 10 GB/s (or at least none that I know of).
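The arithmetic above can be reproduced in a few lines of shell. The 960 GB / 1.3 DWPD / 50 GB-per-flush figures are the hypothetical numbers from this thread, not a recommendation, and the "26 DWPD" line is the optimistic reading that josh disputes in the next post:

```shell
# Hypothetical figures from the thread: 960 GB drive rated 1.3 DWPD,
# ~50 GB (5%) written per flush, one flush every 5 seconds.
CAP_GB=960
FLUSH_GB=50
FLUSH_SEC=5

# Optimistic "20x DWPD" view: 26 effective DWPD against full capacity.
DAILY_GB=$(( 26 * CAP_GB ))                # 24960 GB/day
FLUSHES=$(( DAILY_GB / FLUSH_GB ))         # ~499 flushes
MINUTES=$(( FLUSHES * FLUSH_SEC / 60 ))    # ~41 minutes
echo "daily budget gone after ~${MINUTES} minutes"

# Stricter view: the NAND budget is fixed at rating x capacity,
# i.e. 1.3 DWPD x 960 GB = 1248 GB/day, however little LBA space
# you actually expose.
FIXED_GB=$(( 13 * CAP_GB / 10 ))
echo "fixed budget: ${FIXED_GB} GB/day"
```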
 

josh

If you use at most 5% of the drive, you basically have 20 times the rated DWPD if wear leveling takes place, so 26 DWPD in this example.
That's 26 × 960 GB = 24,960 GB per day; divided by 50 GB per flush that's ~499 writes, and one every 5 seconds gives us ~41 minutes :O
I don't think it works this way... you should only get 26 × 5% × 960 GB.
 

gea

I would calculate it like this:
If you enable sync on a disk pool with such a good flash Slog, you cannot expect more than, say, 300 MB/s of constant write rate; with an Optane maybe 500 MB/s. This is also what a 10G network offers without special tuning.

If you write at this rate constantly, you get the following per day:
300 MB/s × 60 (min) × 60 (hour) × 24 (day) = around 25,000 GB/day

This is roughly the value you can expect with 95% overprovisioning on this new/securely erased flash disk (the drive firmware should then hopefully distribute writes over the whole flash).

Of course, writing constantly 24/7 at this rate comes near the limits; you should never allow this in a production system.
An Intel DC P4801X would then be more suitable, more durable - and faster.
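gea's per-day figure works out as follows (300 MB/s is his assumed sustained sync write rate, not a measured value, and 1 GB = 1000 MB here):

```shell
RATE_MB_S=300
GB_PER_DAY=$(( RATE_MB_S * 60 * 60 * 24 / 1000 ))
echo "${GB_PER_DAY} GB/day"   # 25920, i.e. "around 25,000 GB/day"
```

Against a 960 GB drive that is roughly 27 full drive writes per day, which is why the endurance discussion above matters at all.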
 

josh

I've just thought about what you mentioned in more detail: if I'm not expected to get more than 300 MB/s on the Slog, what's the purpose of using an NVMe? Couldn't I just plug in an additional HGST 400GB with 7.3PB write endurance and not worry about failure for years?
 

azev

Active Member
Jan 18, 2013
740
212
43
Most NVMe max throughput numbers are measured at high queue depths. The same NVMe will post much smaller numbers when you set a queue depth of 1, which is why gea said you'll most likely only get around 300 MB/s with a consumer-grade NVMe.
As for your question, I think it's a good idea to do it; it's set it and forget it, instead of worrying about the Slog wearing out :)
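A queue-depth-1 sync-write test with fio approximates the Slog write pattern far better than high-queue-depth sequential benchmarks. A sketch, with the target path as a placeholder (this writes to the target, so do not point it at a device holding data):

```shell
fio --name=slog-qd1 --filename=/tank/fio.test --size=1G \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=sync --fsync=1 --runtime=30 --time_based
```

`--fsync=1` forces a flush after every write, so the reported latency and IOPS reflect what a single-threaded sync workload would actually see.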