NVMe drives for KVM on ZFS


andrewbedia

Active Member
Jan 11, 2013
Looking to explore upgrading from 4x SM953 480GB in two ZFS mirrors (RAID10). The performance is 'okay' but not particularly great for 4-5 VMs. The platform is 2x Xeon Silver 4216 with 256 GiB RAM. Adding an L2ARC on an Intel S3700 (not ideal, I'm aware) definitely helped, but performance is still crappy. I've already tuned the VMs to virtio, no cache, and IO threads.
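For reference, that tuning maps to roughly the following at the QEMU level. Paths, IDs, and memory/CPU values below are placeholders rather than my exact command; it's just to show the virtio + cache=none + IO thread combination.

Code:
# illustrative invocation, disk-related flags only
qemu-system-x86_64 -machine q35,accel=kvm -m 8192 -smp 4 \
  -object iothread,id=io0 \
  -drive file=/tank/vms/win10.qcow2,if=none,id=drive0,format=qcow2,cache=none,aio=threads \
  -device virtio-blk-pci,drive=drive0,iothread=io0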

What NVMe drives are people using on ZFS in similar situations? Good options seem to be the Seagate FireCuda 520/530, WD Black SN850, Kingston KC3000, and Samsung 970 Evo Plus/980 Pro. Trying not to break the bank -- looking for a happy middle ground between price and performance. I want to move to 2x 1TB/960GB in just a single mirror.
 

ericloewe

Active Member
Apr 24, 2017
Might this be a sync-write problem? Your post is light on details about what exactly is slow, so I'm just throwing that one out there.
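If you want to rule it out quickly, something along these lines (pool/dataset names are placeholders for whatever backs the VMs):

Code:
# is sync forced/standard on the dataset holding the VM images?
zfs get sync,logbias pool/dataset

# watch sync vs. async queue latency while the workload runs
zpool iostat -l pool 5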
 

andrewbedia

Active Member
Jan 11, 2013
Good example: reads when loading my Chia client are really bad. It spikes to 100% load at ~25 MB/s while reading in the blockchain database (a single large file).
 

ericloewe

Active Member
Apr 24, 2017
25 MB/s is absolutely pitiful, and reads mean it's not sync writes. So, crazy fragmentation? How full is the pool?
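Both are visible straight from zpool, e.g. (substitute your pool name):

Code:
zpool list tank                        # FRAG and CAP columns
zpool get fragmentation,capacity tank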
 

andrewbedia

Active Member
Jan 11, 2013
fragmentation -> 50% (huh, interesting)
- side observation -- it seems the guest (Windows 10) has been defragmenting on a schedule, which is very odd. Turned that off.

usage -> 309G used, 551G available

this is qcow2 on a filesystem, not a volume (zvol).

I'll try evacuating the pool and copying the data back to drop the fragmentation, and report back.
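While the data is off the pool I'm also going to check the dataset recordsize against the qcow2 cluster size. That's an assumption on my part, not something I've confirmed, but qcow2 defaults to 64K clusters and recordsize only applies to newly written blocks, so now would be the time. Paths/dataset names below are illustrative:

Code:
# what cluster size the images were actually created with
qemu-img info /fastStor/vms/win10.qcow2 | grep cluster_size

# set a matching recordsize before copying the images back
zfs set recordsize=64K fastStor/vms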
 

andrewbedia

Active Member
Jan 11, 2013
It seems I need to recreate the pool anyway. Not sure why ZFS auto-picked ashift=9 for modern NVMe drives... very strange, but the pool was created with a much older ZFS release. I've evacuated everything and the fragmentation would not drop below 36%, so I'm not sure about the wisdom of copy out, destroy, copy back as a fix for fragmentation problems. I'll report back on what I find. Maybe I won't buy drives after all.

Code:
# zpool list fastStor
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
fastStor   888G   175M   888G        -         -    36%     0%  1.00x    ONLINE  -


zdb -C excerpt:

Code:
fastStor:
    version: 5000
    name: 'fastStor'
    state: 0
    txg: 116159707
    pool_guid: 3421763655878042696
    errata: 0
    hostname: 'darkstar'
    com.delphix:has_per_vdev_zaps
    hole_array[0]: 2
    vdev_children: 3
    vdev_tree:
        type: 'root'
        id: 0
        guid: 3421763655878042696
        children[0]:
            type: 'mirror'
            id: 0
            guid: 6575987116052560175
            metaslab_array: 259
            metaslab_shift: 32
            ashift: 9
            asize: 480088948736
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 8273057805271260820
                path: '/dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400359-part1'
                devid: 'nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400359-part1'
                phys_path: 'pci-0000:83:00.0-nvme-1'
                whole_disk: 1
                DTL: 2948
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 14054244996729488724
                path: '/dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400163-part1'
                devid: 'nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400163-part1'
                phys_path: 'pci-0000:02:00.0-nvme-1'
                whole_disk: 1
                DTL: 2947
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
        children[1]:
            type: 'mirror'
            id: 1
            guid: 15275351900947899816
            metaslab_array: 256
            metaslab_shift: 32
            ashift: 9
            asize: 480088948736
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 132
            children[0]:
                type: 'disk'
                id: 0
                guid: 15309268343683182218
                path: '/dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400354-part1'
                devid: 'nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400354-part1'
                phys_path: 'pci-0000:81:00.0-nvme-1'
                whole_disk: 1
                DTL: 2946
                create_txg: 4
                com.delphix:vdev_zap_leaf: 133
            children[1]:
                type: 'disk'
                id: 1
                guid: 5989656386901741783
                path: '/dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400353-part1'
                devid: 'nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400353-part1'
                phys_path: 'pci-0000:82:00.0-nvme-1'
                whole_disk: 1
                DTL: 2945
                create_txg: 4
                com.delphix:vdev_zap_leaf: 134
        children[2]:
            type: 'hole'
            id: 2
            guid: 0
            whole_disk: 0
            metaslab_array: 0
            metaslab_shift: 0
            ashift: 0
            asize: 0
            is_log: 0
            is_hole: 1
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
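For completeness, the recreate is going to look roughly like this -- ashift (12 or 13) still to be decided, and obviously only once everything is confirmed evacuated:

Code:
zpool destroy fastStor
zpool create -o ashift=12 fastStor \
  mirror /dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400359 /dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400163 \
  mirror /dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400354 /dev/disk/by-id/nvme-SAMSUNG_MZ1WV480HCGL-000MV_S1Y0NYAG400353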
 

andrewbedia

Active Member
Jan 11, 2013
Update: re-created the pool with ashift=13 (was ashift=9), no L2ARC, and set compression to zstd-fast-10 (was lz4). I'm getting something like 28-40 MB/s reading in the database. Still pretty disappointed, so I'm looking for suggested tweaks or other drives that can handle being hammered like this and perform better.
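Next step on my list is a host-side fio run to take the guest out of the picture -- something along these lines (just a sketch, directory/block size/runtime are whatever seems representative, not a rigorous benchmark):

Code:
fio --name=vm-randread --directory=/fastStor/vms \
  --rw=randread --bs=16k --size=4G \
  --ioengine=psync --numjobs=4 --group_reporting \
  --runtime=60 --time_based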
 

andrewbedia

Active Member
Jan 11, 2013
698
247
43
Just want to bump this again. I don't think ZFS tuning is the problem at this point, but I'm open to suggestions and/or might post that specific question in the Linux category separately.

Going back to the original ask, what NVMe drives are people using on ZFS for virtual machines? I can't be the only one?
 

TRACKER

Active Member
Jan 14, 2019
I use a 4x M.2 -> PCIe 3.0 x16 card with a PCIe switch chip from AliExpress :) No bifurcation required.
With all four drives populated, sequential speeds are around 10-12 GB/s.
 

andrewbedia

Active Member
Jan 11, 2013
Hey TRACKER, I already have a setup with 4x M.2. The question isn't "how to hook them up" but "which drives to buy". Sequential throughput is largely irrelevant, since this is about virtual machine performance. Check the original post for more context.

PS: I'm the author of the thread on multi-NVMe without bifurcation.
 
