ZFS Elephant In The Room (all NVMe array)

Carlin Chamberlain

New Member
Hi STH community! Our company is growing rapidly, so I recently ordered our first 'proper' server (we've been using NAS-hosted VMs previously). I settled on the Supermicro AS-1115-SV-WTNRT with 10 x 3.4 Samsung PM9A3 enterprise drives (1.3 DWPD) and a RAID0 M.2 boot drive.

I'm a fairly experienced computer user (years of development, database management, system administration, etc.) but am struggling to find good info about tuning ZFS for all-NVMe storage arrays. Worryingly, I keep coming across posts where people complain they're getting very poor speeds with ZFS, regardless of tuning :-(

Another problem - possibly even worse - is the absolutely appalling write amplification that ZFS suffers from (see the thorough testing by user Dunuin here). Write amplification of over 50x for ZFS RAID10! I don't want to wear out my NZ$6000 storage array 50x faster than necessary!

Yes, I'm a ZFS noob, but I want to climb this hill and am looking for help finding the best path forward. My questions are (and please add anything else you think is relevant):
1. Should I just use mdadm instead? (The server will run VMs, probably under XCP-ng.)
2. With a 5-year warranty, am I worrying too much about rampant write amplification? Should I just ignore it?
3. Running a journaling file system on top of ZFS seems to be a major cause of the amplification. Can I disable journaling, considering the SSDs have power-loss capacitors, the server has dual power supplies, and a robust UPS will be in place? I've read conflicting answers on this, so I'm not sure what's true - see the sketch after this list.
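
For reference, and purely as context for question 3, my understanding is that disabling the journal on an ext4 guest filesystem would look something like the sketch below (device paths are hypothetical examples, and whether this is wise is exactly what I'm asking):

Code:
# Hypothetical guest device - shown only to illustrate what "disable journaling" would mean.
# Create a new ext4 filesystem without a journal:
mkfs.ext4 -O ^has_journal /dev/vda1

# Or strip the journal from an existing, unmounted ext4 filesystem:
tune2fs -O ^has_journal /dev/vda1

# Verify: 'has_journal' should no longer appear in the feature list
tune2fs -l /dev/vda1 | grep -i features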

Apologies if this has been asked before, but I'm really struggling to find info specific to all-NVMe drive pools, and examples of NVMe setups that don't have severe issues.

Very much appreciate any thoughts.

Thanks,
Carlin
 

CyklonDX

Well-Known Member
Most ZFS setups where people have problems are on TrueNAS or Proxmox, behind poor controllers, and/or inside VMs.
The latest versions of ZFS address some of the limitations in how writes are made (you're no longer stuck with the speed of a single drive per vdev).

1. I'm not going to recommend mdadm if you are looking for data safety.
2. Building a flash array like that makes little sense with ZFS in the first place; ZFS was meant to be used with slow HDDs. Your worries are on point... the best way to keep it in check is to enable dedup and to have a decent log device and L2ARC, as well as a decent chunk of memory assigned to the pool. Also follow best practices for ZFS setup to minimize unneeded writes (flushing settings) - rough sketch below.
3. You can always disable that.
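
A rough sketch of what point 2 could look like on the command line - pool and device names here are only examples, not a recommendation for this specific build:

Code:
# Example pool name 'tank'; device paths are placeholders.
# Dedicated log (SLOG) and L2ARC cache devices:
zpool add tank log /dev/nvme10n1
zpool add tank cache /dev/nvme11n1

# Dedup, as suggested above (trades RAM/CPU for not rewriting duplicate data):
zfs set dedup=on tank

# Give ARC a decent chunk of RAM, e.g. cap it at 64 GiB (value is in bytes):
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max

# Relax flushing: stretch the transaction-group timeout from the default 5s to 30s:
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout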
 

SnJ9MX

Active Member
And what is your recordsize vs. your average actual write size?

Also, I hope you meant RAID1 for the boot drive.
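
For reference, checking and adjusting recordsize looks roughly like the following - pool/dataset names are examples, and volblocksize for zvols can only be set at creation time:

Code:
# Current values (a dataset for file-based VM images, a zvol for block-based ones):
zfs get recordsize tank/vmstore
zfs get volblocksize tank/vm-disk0

# Match recordsize to the typical write size of the workload, e.g. 16K for a DB-heavy VM:
zfs set recordsize=16K tank/vmstore

# volblocksize has to be chosen when the zvol is created:
zfs create -V 100G -o volblocksize=16K tank/vm-disk1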
 

Carlin Chamberlain

New Member
Hi Matt, the system is still on its way, so I'm trying to put together a few options to test when it arrives. An ashift of 12 seems to be the sweet spot for my hardware config. Will post my findings once it arrives.

Perhaps I'm overthinking this, but from all the reading I've done, including the very exhaustive testing of many different configs matching my use case (Linux VMs hosted on all-flash ZFS) found here, I'm wondering whether XFS might be the way to go instead.

Kind regards,
Carlin
 

CyklonDX

Well-Known Member
ashift=12 is meant for 4Kn drives.

If your SSDs/NVMe drives are 512e/512 (they all come formatted that way), you should be using ashift=9 (2^9 = 512).

If you want ashift=12, you should first reformat the NVMe disks to 4Kn.
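
If you do go that route, checking and changing the LBA format with nvme-cli looks roughly like this - the --lbaf index below is only an example, so pick whichever format your drive reports as 4096 bytes:

Code:
# List the LBA formats the namespace supports and which one is in use:
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"

# Switch to a 4096-byte format (DESTROYS ALL DATA on the namespace):
nvme format /dev/nvme0n1 --lbaf=1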
 

vincococka

Member
Carlin Chamberlain said: [quote of the post above about ashift 12 and possibly going with XFS]
XFS is a good choice for VMs and DBs.
Also consider using plain/vanilla block devices - less software in the path to the data means fewer problems.
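
A minimal sketch of that approach, with placeholder device and mount-point names:

Code:
# Put XFS directly on a whole NVMe namespace and mount it for VM images / DB files:
mkfs.xfs /dev/nvme1n1
mkdir -p /var/lib/vmstore
mount -o noatime /dev/nvme1n1 /var/lib/vmstore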
 

ano

Well-Known Member
You should be seeing around 20 GB/s at 128K 100% random writes with a decent Genoa CPU; the drives are so fast you could even pull off RAIDZ2 or RAIDZ3 instead of "RAID10".

We have been running ZFS solutions like that with customers for years and years - don't worry about write amplification.

Average lifetime left for 960GB and 1.92TB systems with 1 DWPD drives, after 5 years of use for all their storage, seems to average out at around 98% remaining, so don't worry.

You have bought a very good server - H13 platform, Genoa - with very good drives.
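
Once the pool is built, a fio job along these lines (target path, sizes and job counts are just examples) is an easy way to sanity-check the 128K random-write numbers:

Code:
# 128K 100% random writes against a directory on the pool (the directory must already exist);
# adjust paths, sizes and job counts to taste.
fio --name=randwrite-128k --directory=/tank/fio-test \
    --rw=randwrite --bs=128k --size=10G \
    --ioengine=libaio --iodepth=32 --numjobs=8 \
    --runtime=60 --time_based --group_reporting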
 

louie1961

Active Member
What is your use case for ZFS? Are you using it in your server operating system, or are you using it in TrueNAS, or something else?
 

CyklonDX

Well-Known Member
Is there any real information on ZFS write amplification?
With ashift=12 I could see write amplification of up to 8x for 512-byte-sector disks.

As each sector in ZFS will be 4096 bytes in size, if it's being modified it will have to use 8 real sectors on the disk. The endurance of the disk will likely be cut by 8x too - probably one of the reasons NVMe/SSD logical sectors are 512 bytes. Pool and drive sector sizes can be compared as shown below.
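
For anyone checking their own setup, the pool's ashift and the sector sizes the drive reports can be compared like this (pool and device names are examples):

Code:
# ashift actually used by the pool:
zpool get ashift tank

# logical vs physical sector size the kernel sees for the drive:
cat /sys/block/nvme0n1/queue/logical_block_size
cat /sys/block/nvme0n1/queue/physical_block_size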
 

Carlin Chamberlain

New Member
Appreciate all the comments - lots to consider while I wait for delivery of the system, and some good starting tips on potential options to begin with. The server will be regularly backed up to a RAID10 NAS, but even so, uptime matters, so it sounds like mdadm is out.

That forum post I linked to above is the most comprehensive analysis/testing of many different settings and configurations with regard to write amplification that I have seen. I was quite shocked at what I read.

With flash storage so cheap now, it looks like there will still be a place for the hardware RAID controller for years to come - once the new generation that can cope with the required speeds becomes reasonably priced.

Perhaps there's still a place for mechanical drives yet. Will post my findings once the system arrives.

thanks again,
Carlin
 

CyklonDX

Well-Known Member
In terms of mechanical drives with NVMe as L2ARC, here are some of my tests at home. (Only for writes - the NVMe doesn't do much there; it's more of a system stream-capacity test.)



Code:
One with just SSD's

root@KVMHost:~# zpool get all zfs
NAME  PROPERTY                       VALUE                          SOURCE
zfs   size                           1.81T                          -
zfs   capacity                       73%                            -
zfs   altroot                        -                              default
zfs   health                         ONLINE                         -
zfs   guid                           5831615307471499831            -
zfs   version                        -                              default
zfs   bootfs                         -                              default
zfs   delegation                     on                             default
zfs   autoreplace                    off                            default
zfs   cachefile                      -                              default
zfs   failmode                       wait                           default
zfs   listsnapshots                  off                            default
zfs   autoexpand                     off                            default
zfs   dedupratio                     1.00x                          -
zfs   free                           497G                           -
zfs   allocated                      1.33T                          -
zfs   readonly                       off                            -
zfs   ashift                         0                              default
zfs   comment                        -                              default
zfs   expandsize                     -                              -
zfs   freeing                        0                              -
zfs   fragmentation                  27%                            -
zfs   leaked                         0                              -
zfs   multihost                      off                            default
zfs   checkpoint                     -                              -
zfs   load_guid                      8202998207846407600            -
zfs   autotrim                       on                             local
zfs   compatibility                  off                            default
zfs   feature@async_destroy          enabled                        local
zfs   feature@empty_bpobj            active                         local
zfs   feature@lz4_compress           active                         local
zfs   feature@multi_vdev_crash_dump  enabled                        local
zfs   feature@spacemap_histogram     active                         local
zfs   feature@enabled_txg            active                         local
zfs   feature@hole_birth             active                         local
zfs   feature@extensible_dataset     active                         local
zfs   feature@embedded_data          active                         local
zfs   feature@bookmarks              enabled                        local
zfs   feature@filesystem_limits      enabled                        local
zfs   feature@large_blocks           enabled                        local
zfs   feature@large_dnode            enabled                        local
zfs   feature@sha512                 enabled                        local
zfs   feature@skein                  enabled                        local
zfs   feature@edonr                  enabled                        local
zfs   feature@userobj_accounting     active                         local
zfs   feature@encryption             enabled                        local
zfs   feature@project_quota          active                         local
zfs   feature@device_removal         enabled                        local
zfs   feature@obsolete_counts        enabled                        local
zfs   feature@zpool_checkpoint       enabled                        local
zfs   feature@spacemap_v2            active                         local
zfs   feature@allocation_classes     enabled                        local
zfs   feature@resilver_defer         enabled                        local
zfs   feature@bookmark_v2            enabled                        local
zfs   feature@redaction_bookmarks    enabled                        local
zfs   feature@redacted_datasets      enabled                        local
zfs   feature@bookmark_written       enabled                        local
zfs   feature@log_spacemap           active                         local
zfs   feature@livelist               enabled                        local
zfs   feature@device_rebuild         enabled                        local
zfs   feature@zstd_compress          enabled                        local
zfs   feature@draid                  enabled                        local

(one with l2arc nvme)
root@KVMHost:~# zpool get all zfs2
NAME  PROPERTY                       VALUE                          SOURCE
zfs2  size                           29.1T                          -
zfs2  capacity                       82%                            -
zfs2  altroot                        -                              default
zfs2  health                         ONLINE                         -
zfs2  guid                           2215484702743167789            -
zfs2  version                        -                              default
zfs2  bootfs                         -                              default
zfs2  delegation                     on                             default
zfs2  autoreplace                    off                            default
zfs2  cachefile                      -                              default
zfs2  failmode                       wait                           default
zfs2  listsnapshots                  off                            default
zfs2  autoexpand                     off                            default
zfs2  dedupratio                     1.00x                          -
zfs2  free                           5.13T                          -
zfs2  allocated                      24.0T                          -
zfs2  readonly                       off                            -
zfs2  ashift                         12                             local
zfs2  comment                        -                              default
zfs2  expandsize                     -                              -
zfs2  freeing                        0                              -
zfs2  fragmentation                  3%                             -
zfs2  leaked                         0                              -
zfs2  multihost                      off                            default
zfs2  checkpoint                     -                              -
zfs2  load_guid                      4881414342528979059            -
zfs2  autotrim                       off                            default
zfs2  compatibility                  off                            default
zfs2  feature@async_destroy          enabled                        local
zfs2  feature@empty_bpobj            enabled                        local
zfs2  feature@lz4_compress           active                         local
zfs2  feature@multi_vdev_crash_dump  enabled                        local
zfs2  feature@spacemap_histogram     active                         local
zfs2  feature@enabled_txg            active                         local
zfs2  feature@hole_birth             active                         local
zfs2  feature@extensible_dataset     active                         local
zfs2  feature@embedded_data          active                         local
zfs2  feature@bookmarks              enabled                        local
zfs2  feature@filesystem_limits      enabled                        local
zfs2  feature@large_blocks           enabled                        local
zfs2  feature@large_dnode            enabled                        local
zfs2  feature@sha512                 enabled                        local
zfs2  feature@skein                  enabled                        local
zfs2  feature@edonr                  enabled                        local
zfs2  feature@userobj_accounting     active                         local
zfs2  feature@encryption             enabled                        local
zfs2  feature@project_quota          active                         local
zfs2  feature@device_removal         enabled                        local
zfs2  feature@obsolete_counts        enabled                        local
zfs2  feature@zpool_checkpoint       enabled                        local
zfs2  feature@spacemap_v2            active                         local
zfs2  feature@allocation_classes     enabled                        local
zfs2  feature@resilver_defer         enabled                        local
zfs2  feature@bookmark_v2            enabled                        local
zfs2  feature@redaction_bookmarks    enabled                        local
zfs2  feature@redacted_datasets      enabled                        local
zfs2  feature@bookmark_written       enabled                        local
zfs2  feature@log_spacemap           active                         local
zfs2  feature@livelist               enabled                        local
zfs2  feature@device_rebuild         enabled                        local
zfs2  feature@zstd_compress          enabled                        local
zfs2  feature@draid                  enabled                        local

(one with just hdd's)
root@KVMHost:~# zpool get all zfs3
NAME  PROPERTY                       VALUE                          SOURCE
zfs3  size                           58.2T                          -
zfs3  capacity                       44%                            -
zfs3  altroot                        -                              default
zfs3  health                         ONLINE                         -
zfs3  guid                           17895618389695106753           -
zfs3  version                        -                              default
zfs3  bootfs                         -                              default
zfs3  delegation                     on                             default
zfs3  autoreplace                    off                            default
zfs3  cachefile                      -                              default
zfs3  failmode                       wait                           default
zfs3  listsnapshots                  off                            default
zfs3  autoexpand                     on                             local
zfs3  dedupratio                     1.00x                          -
zfs3  free                           32.3T                          -
zfs3  allocated                      25.9T                          -
zfs3  readonly                       off                            -
zfs3  ashift                         0                              default
zfs3  comment                        -                              default
zfs3  expandsize                     -                              -
zfs3  freeing                        0                              -
zfs3  fragmentation                  0%                             -
zfs3  leaked                         0                              -
zfs3  multihost                      off                            default
zfs3  checkpoint                     -                              -
zfs3  load_guid                      16163058075340890727           -
zfs3  autotrim                       off                            default
zfs3  compatibility                  off                            default
zfs3  feature@async_destroy          enabled                        local
zfs3  feature@empty_bpobj            enabled                        local
zfs3  feature@lz4_compress           active                         local
zfs3  feature@multi_vdev_crash_dump  enabled                        local
zfs3  feature@spacemap_histogram     active                         local
zfs3  feature@enabled_txg            active                         local
zfs3  feature@hole_birth             active                         local
zfs3  feature@extensible_dataset     active                         local
zfs3  feature@embedded_data          active                         local
zfs3  feature@bookmarks              enabled                        local
zfs3  feature@filesystem_limits      enabled                        local
zfs3  feature@large_blocks           enabled                        local
zfs3  feature@large_dnode            enabled                        local
zfs3  feature@sha512                 enabled                        local
zfs3  feature@skein                  enabled                        local
zfs3  feature@edonr                  enabled                        local
zfs3  feature@userobj_accounting     active                         local
zfs3  feature@encryption             enabled                        local
zfs3  feature@project_quota          active                         local
zfs3  feature@device_removal         enabled                        local
zfs3  feature@obsolete_counts        enabled                        local
zfs3  feature@zpool_checkpoint       enabled                        local
zfs3  feature@spacemap_v2            active                         local
zfs3  feature@allocation_classes     enabled                        local
zfs3  feature@resilver_defer         enabled                        local
zfs3  feature@bookmark_v2            enabled                        local
zfs3  feature@redaction_bookmarks    enabled                        local
zfs3  feature@redacted_datasets      enabled                        local
zfs3  feature@bookmark_written       enabled                        local
zfs3  feature@log_spacemap           active                         local
zfs3  feature@livelist               enabled                        local
zfs3  feature@device_rebuild         enabled                        local
zfs3  feature@zstd_compress          enabled                        local
zfs3  feature@draid                  enabled                        local
 

i386

Well-Known Member
As each sector in ZFS will be 4096 bytes in size, if it's being modified it will have to use 8 real sectors on the disk.
SSDs don't have sectors; they read/write in pages, which can be 4, 8 or (in newer flash) 16 KB.
Deleting data in SSDs happens in blocks: 1 block = 128 (or in newer flash) 256 pages.
 

ericloewe

Active Member
Write amplification of over 50x for ZFS RAID10! I don't want to wear out my NZ$6000 storage array 50x faster than necessary!
Well, the tested case is not super meaningful. 1 MB/s is not a lot, so you're doing a lot of "overhead" work for very little data: updating the tree up to the uberblock on every TXG, flushing out every TXG well before it fills up, etc. You wouldn't see the same ratio if you were doing 100 MB/s; it would almost certainly be a lot lower.
 

mattventura

Active Member
Don't forget to enable compression of any sort so that larger writes don't have to round up to the next whole record size multiple.
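
Something like the following, with example dataset names - lz4 is cheap enough to leave on by default:

Code:
zfs set compression=lz4 tank/vmstore
# sanity check how much it's actually saving:
zfs get compressratio tank/vmstore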
 

CyklonDX

Well-Known Member
SSDs don't have sectors; they read/write in pages, which can be 4, 8 or (in newer flash) 16 KB.
Deleting data in SSDs happens in blocks: 1 block = 128 (or in newer flash) 256 pages.
Thanks - I wasn't aware.
Hmm, I need to investigate this behaviour. At the hardware level that may be what's going on, but I think that software-wise ZFS will still treat the drive as having logical sectors and submit writes accordingly; the hardware may "wait" to update a whole block (a block being 128-256 pages), but the write ZFS commits will still only contain 4096/512 bytes of real data. I digress, though - I need to investigate this for myself.
 

CyklonDX

Well-Known Member
So... I ran some tests on older Hynix Gold 1TB SSDs with ZFS at ashift=12
(I had 12 of those, with around 15-20TB written on each).

The result: the disks want to die. While in reality I only wrote around 300-400TB to each, they reported 600+ TB written, and NAND writes exploded.
[attached screenshot: drive wear / write statistics]

// So in my opinion ZFS, at the very least, isn't treating SSDs as if they worked in pages/blocks.
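
For anyone wanting to repeat this kind of check: host-side writes and the drive's own wear estimate can be read as below (device name is an example; NAND-level write counters usually need vendor-specific tooling on top of this):

Code:
# 'data_units_written' is in units of 512,000 bytes; 'percentage_used' is the wear estimate.
nvme smart-log /dev/nvme0n1 | grep -Ei "data_units_written|percentage_used"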
 

pimposh

hardware pimp
Could you re-do this test on another 12 drives? A single run isn't reliable :p
Now, seriously speaking, why is the reported temperature so low?