Finding my ZFS bottleneck

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by altano, Apr 5, 2018.

  1. altano

    altano Member

    Joined:
    Sep 3, 2011
    Messages:
    64
    Likes Received:
    13
    Sequential read speeds on my ZFS pool (4 vdevs, each a 2 x 6TB mirror) are not where I'd like them to be, at 80-130 MB/s when moving large files around. My two main use cases for the pool are hosting VMs and moving large video files around. The VMs are performing fine, but I'd like faster video file copy speeds.

    DD Test (local on OmniOS):
    • BS=32M
    • Count=6250
    • Test file: 204.8 GB (>2 x memory)
    • Napp-it realtime monitoring off
    • Pool primary cache off
    • Pool secondary cache off
    • Compression off (not specifically called out by napp-it, but it seemed like a good idea given that napp-it's dd test reads from /dev/zero)
    Write: `dd if=/dev/zero of=/tank/dd.tst bs=32768000 count=6250`
    Read: `dd if=/tank/dd.tst of=/dev/null bs=32768000`
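    (Sanity check on those parameters; the test file has to be well past the 96GB of RAM so the ARC can't serve the reads:)

    ```shell
    # total dd test-file size = bs * count
    bs=32768000
    count=6250
    echo $((bs * count))   # 204800000000 bytes = 204.8 GB, > 2 x 96GB RAM
    ```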

    Results = 563 MB/s write, 208 MB/s read

    In reality I have never moved a large file around (intra-pool) at faster than 130 MB/s.

    iostat snapshot of what the drives look like during the write test:
    Code:
                                  capacity     operations    bandwidth
    pool                       alloc   free   read  write   read  write
    -------------------------  -----  -----  -----  -----  -----  -----
    rpool                      2.10G  37.7G      0      0      0      0
      c2t0d0                   2.10G  37.7G      0      0      0      0
    -------------------------  -----  -----  -----  -----  -----  -----
    tank                       11.5T  10.3T    120  3.52K  5.54M   450M
      mirror                   3.62T  1.82T     42    788   664K  98.4M
        c0t50014EE20FB334A6d0      -      -     26    788   165K  98.4M
        c0t50014EE059345DBCd0      -      -     15    788   499K  98.4M
      mirror                   3.70T  1.74T     34    870  1.98M   109M
        c0t50014EE265087D35d0      -      -     11    911   479K   114M
        c0t50014EE2B73FBE85d0      -      -     22    870  1.51M   109M
      mirror                   3.71T  1.73T     40    845  2.55M   106M
        c0t50014EE2B7437D10d0      -      -     20    884  1.01M   110M
        c0t50014EE20FB32A77d0      -      -     19    836  1.53M   104M
      mirror                    430G  5.02T      3  1.07K   369K   137M
        c0t50014EE2650872C3d0      -      -      1  1.06K   179K   136M
        c0t50014EE2B7E365A7d0      -      -      1  1.10K   191K   140M
    -------------------------  -----  -----  -----  -----  -----  -----
    Expectation: Pool write = 4 x single HD write
    Reality: All 4 vdevs are writing in the same ballpark as the rated sequential write speed of a single drive: ~100 MB/s for the three vdevs with less free space, ~140 MB/s for the fairly empty one. This looks good to me.
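    (As a quick check, the per-vdev write rates above do sum to the pool figure:)

    ```shell
    # per-vdev write bandwidth (MB/s) from the iostat snapshot above
    awk 'BEGIN { print 98.4 + 109 + 106 + 137 }'   # prints 450.4, matching the pool's ~450M
    ```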

    Here's a snapshot during the read:

    Code:
                                  capacity     operations    bandwidth
    pool                       alloc   free   read  write   read  write
    -------------------------  -----  -----  -----  -----  -----  -----
    rpool                      2.10G  37.7G      0      0      0      0
      c2t0d0                   2.10G  37.7G      0      0      0      0
    -------------------------  -----  -----  -----  -----  -----  -----
    tank                       11.6T  10.1T  1.94K     14   192M   121K
      mirror                   3.66T  1.78T    415      2  38.4M  23.4K
        c0t50014EE20FB334A6d0      -      -    278      2  21.5M  23.4K
        c0t50014EE059345DBCd0      -      -    136      2  17.0M  23.4K
      mirror                   3.74T  1.70T    353      3  44.2M  31.2K
        c0t50014EE265087D35d0      -      -    172      3  21.6M  31.2K
        c0t50014EE2B73FBE85d0      -      -    181      3  22.7M  31.2K
      mirror                   3.75T  1.68T    479      3  49.7M  31.2K
        c0t50014EE2B7437D10d0      -      -    202      3  25.3M  31.2K
        c0t50014EE20FB32A77d0      -      -    276      3  24.3M  31.2K
      mirror                    479G  4.97T    736      3  59.9M  35.1K
        c0t50014EE2650872C3d0      -      -    250      3  27.1M  35.1K
        c0t50014EE2B7E365A7d0      -      -    486      3  32.9M  35.1K
    -------------------------  -----  -----  -----  -----  -----  -----
    Expectation: Pool read = 8 x single HD read, reading evenly from all hard drives if data is striped evenly.
    Reality: Data is distributed evenly but pool read = ~1.5x single HD read.
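    (Same check on the read snapshot: the vdev rates sum cleanly to the pool total, so nothing is being lost in aggregation; the individual disks are just slow, at ~17-33 MB/s each:)

    ```shell
    # per-vdev read bandwidth (MB/s) from the iostat snapshot above
    awk 'BEGIN { print 38.4 + 44.2 + 49.7 + 59.9 }'   # prints 192.2, matching the pool's ~192M
    ```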

    Why am I getting such evenly-distributed-but-slow read speeds across the board?

    My hardware:
    • ZFS pool on OmniOS made up of 4 vdevs, each a 2 x 6TB (WD60EFRX) mirror. One of these vdevs didn't exist when I wrote all the data to the pool (due to data migration constraints).
    • OmniOS is an ESXi guest.
    • All SATA drives are plugged directly into my motherboard's (X10SDV-7TP4F) onboard controller, which is passed through to OmniOS.
    • OmniOS has 96GB of RAM.
    • I've experimented with hosting an ESXi virtual disk on my Optane 900p, mounted in OmniOS, for a ZIL, readcache, or both. Doesn't seem to have any effect on performance for my workloads (when measuring simple file copies or running benchmark software).
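    (For reference, attaching and removing those Optane-backed devices is just the following; c2t1d0 and c2t2d0 are how the virtual disks show up in my setup:)

    ```shell
    # add the Optane-backed virtual disks as SLOG and L2ARC
    zpool add tank log c2t1d0
    zpool add tank cache c2t2d0
    # and remove them again to test without
    zpool remove tank c2t1d0
    zpool remove tank c2t2d0
    ```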
    zpool stats

    Code:
    NAME   PROPERTY   VALUE   SOURCE
     tank   size   21.8T   -
     tank   capacity   52%   -
     tank   altroot   -   default
     tank   health   ONLINE   -
     tank   guid   4460167909185718889   default
     tank   version   -   default
     tank   bootfs   -   default
     tank   delegation   on   default
     tank   autoreplace   off   default
     tank   cachefile   -   default
     tank   failmode   wait   default
     tank   listsnapshots   off   default
     tank   autoexpand   off   default
     tank   dedupditto   0   default
     tank   dedupratio   1.00x   -
     tank   free   10.3T   -
     tank   allocated   11.4T   -
     tank   readonly   off   -
     tank   comment   -   default
     tank   expandsize   -   -
     tank   freeing   0   default
     tank   fragmentation   1%   -
     tank   leaked   0   default
     tank   bootsize   -   default
     tank   feature@async_destroy   enabled   local
     tank   feature@empty_bpobj   active   local
     tank   feature@lz4_compress   active   local
     tank   feature@multi_vdev_crash_dump   enabled   local
     tank   feature@spacemap_histogram   active   local
     tank   feature@enabled_txg   active   local
     tank   feature@hole_birth   active   local
     tank   feature@extensible_dataset   enabled   local
     tank   feature@embedded_data   active   local
     tank   feature@bookmarks   enabled   local
     tank   feature@filesystem_limits   enabled   local
     tank   feature@large_blocks   enabled   local
     tank   feature@sha512   enabled   local
     tank   feature@skein   enabled   local
     tank   feature@edonr   enabled   local
    
    (Compression is normally on but disabled for testing with DD)

    Code:
    Pool   Version   Pool GUID             Vdevs
    tank   5000      4460167909185718889   5

    Vdev             Ashift   Asize     Vdev GUID
    vdev 1: mirror   12       6.00 TB   2243640248582335401
    vdev 2: mirror   12       6.00 TB   3977744385084940597
    vdev 3: mirror   12       6.00 TB   3289320665279914143
    vdev 4: hole     0        0         0
    vdev 5: mirror   12       6.00 TB   6527926759137914625
    
    SMB copy on a Windows VM on the same ESXi host:

    Here's another snapshot of what `zpool iostat -v 1` looks like during an SMB copy across filesystems going at ~70MB/s, indicating that the 4 vdevs are being utilized evenly, each transferring at ~17MB/s:

    Code:
                                  capacity     operations    bandwidth
    pool                       alloc   free   read  write   read  write
    -------------------------  -----  -----  -----  -----  -----  -----
    rpool                      2.10G  37.7G      0      0      0      0
      c2t0d0                   2.10G  37.7G      0      0      0      0
    -------------------------  -----  -----  -----  -----  -----  -----
    tank                       11.4T  10.3T    660  1.35K  69.3M   126M
      mirror                   3.61T  1.82T    162    260  15.4M  30.5M
        c0t50014EE20FB334A6d0      -      -     89    255  7.21M  30.5M
        c0t50014EE059345DBCd0      -      -     72    254  8.18M  30.5M
      mirror                   3.69T  1.75T    151    361  15.7M  30.6M
        c0t50014EE265087D35d0      -      -     72    306  6.61M  30.6M
        c0t50014EE2B73FBE85d0      -      -     79    308  9.08M  30.6M
      mirror                   3.70T  1.73T    172    375  18.2M  32.5M
        c0t50014EE2B7437D10d0      -      -     97    321  9.75M  32.5M
        c0t50014EE20FB32A77d0      -      -     74    319  8.45M  32.5M
      mirror                    422G  5.03T    174    386  20.0M  32.7M
        c0t50014EE2650872C3d0      -      -     86    313  9.92M  32.7M
        c0t50014EE2B7E365A7d0      -      -     87    314  10.1M  32.7M
    logs                           -      -      -      -      -      -
      c2t1d0                    772K  15.9G      0      0      0      0
    cache                          -      -      -      -      -      -
      c2t2d0                   29.9G  93.8M      0      0      0      0
    -------------------------  -----  -----  -----  -----  -----  -----
    How can I find out what is bottlenecking my read speeds?
     
    #1
    Last edited: Apr 5, 2018
    MiniKnight likes this.
  2. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    so to be clear..

    'all' hard drives are hanging off the onboard SATA controller?
    1. Are the SATA drives being passed to napp-it as RDMs, or are you passing the whole SATA controller through to napp-it?
    2. Do these drives currently hold your other VMs? Is there 'ANY' other work being done by the pool, i.e. any other NFS or SMB connections to it while you are testing, regardless of whether any data is being moved?

    On an all-in-one, where is napp-it running from? Usually it's a SATA drive, and you pass an LSI controller through for the napp-it pools.

    My guess is that you are running napp-it on a SATA drive and passing RDM SATA drives individually to napp-it.

    I have an all-in-one set up the same way: a single SATA SSD for napp-it to boot from before it provides 2 pools, one on an LSI controller passed through to napp-it for VMs (more performant), and 3 RDM drives passed from the SATA controller for a large 'DATA' pool.

    I too see strange read/write behavior on the SATA pool; uneven writes across the 3-drive stripe are common. I figured I was either maxing out the SATA bandwidth, or the interrupts were being throttled by ESXi to spread resources and run the hypervisor efficiently.

    I did experiment with marking the napp-it VM as 'latency sensitive', giving it higher priority. That setting makes all of the napp-it memory not just reserved but in use, so it kicks alarms until you disable the memory warning. But on an external pool of 3x 5-drive raidz vdevs, scrubbing goes from 900 MB/s to over 1000 MB/s with the enhanced latency settings, since they give the napp-it VM priority for resources. I never tested the improvement on the SATA pool, however.

    My guess is that if you have 'everything' running off the SATA controller, you are probably running into interrupt throttling or some other resource constraint.

    What kind of CPU use are you seeing? Bring up esxtop on your host and you can see real-time latency, busy, wait, etc. for the drives and controllers.
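    On the OmniOS side you can watch the same thing with plain iostat, per-disk service time and busy%:

    ```shell
    # extended per-disk stats, logical device names, 1-second interval
    # asvc_t = average service time (ms), %b = percent busy
    # high %b with low kr/s would mean the disks are seek-bound, not idle
    iostat -xn 1
    ```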

    Also (I have not messed with this myself yet), SATA gives you a queue depth of 32, and some have suggested that reducing the queue depth can improve performance on SATA drives. I have not needed to do it for my purposes.

    Also, you are constraining your stripe width with 32M blocks, I 'assume' for better VM performance? But I think that optimization only really matters for iSCSI; I would let ZFS's variable-stripe algorithm decide what is best during writes.
     
    #2
  3. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    1,843
    Likes Received:
    611
    Can you add the results of menu Pools > Benchmark with
    readcache = all vs readcache=none

    (insert here as code to keep result readable)
     
    #3
  4. altano

    altano Member

    Thanks for the responses, everyone.

    All SATA drives are connected to my motherboard's onboard SAS controller, which is passed through directly to the OmniOS VM. There are no RDM drives. The drives hold two VMs, but they are completely idle. I am verifying that there is near-zero drive activity with `zpool iostat` before beginning any benchmarks.

    I am using a Supermicro X10SDV-7TP4F motherboard, which has a Xeon D-1537 SoC CPU.

    I've made no such decisions intentionally. Isn't "stripe width" a RAID-Z-only concept? Are you referring to the 32M blocksize option I passed to dd?

    Since your benchmarks capture both sync=always and sync=disabled, I included runs with and without a ZIL/SLOG as well:

    cache=all
    Code:
    Bennchmark filesystem: /tank/_Pool_Benchmark
    begin test 1 write loop ..
    begin test 2 write loop ..
    begin test 3 ..randomwrite.f ..
    begin test 3sync ..randomwrite.f ..
    begin test 4 ..singlestreamwrite.f ..
    begin test 4sync ..singlestreamwrite.f ..
    
    begin dd write test 5 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    
    5000000000 bytes transferred in 22.764258 secs (219642561 bytes/sec)
    
    begin dd write test 6 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    
    5000000000 bytes transferred in 5.452006 secs (917093663 bytes/sec)
    
    set sync=disabled
    begin test 7 randomread.f ..
    begin test 8 randomrw.f ..
    begin test 9 singlestreamread.f ..
    begin dd read test 10 ..
    
    pool: tank
    
       NAME                       STATE     READ WRITE CKSUM
       tank                       ONLINE       0     0     0
         mirror-0                 ONLINE       0     0     0
           c0t50014EE20FB334A6d0  ONLINE       0     0     0
           c0t50014EE059345DBCd0  ONLINE       0     0     0
         mirror-1                 ONLINE       0     0     0
           c0t50014EE265087D35d0  ONLINE       0     0     0
           c0t50014EE2B73FBE85d0  ONLINE       0     0     0
         mirror-2                 ONLINE       0     0     0
           c0t50014EE2B7437D10d0  ONLINE       0     0     0
           c0t50014EE20FB32A77d0  ONLINE       0     0     0
         mirror-4                 ONLINE       0     0     0
           c0t50014EE2650872C3d0  ONLINE       0     0     0
           c0t50014EE2B7E365A7d0  ONLINE       0     0     0
    
    hostname                        holodeck3  Memory size: 98304 Megabytes
    pool                            tank (recsize=128k, compr=off, readcache=all)
    slog                            -
    remark                          readcache=all
    
    simple 10 s write loop          sync=always                     sync=disabled                  
    data per write                  8KB                             8KB                            
    throughput                      296 KB/s                        1.7 MB/s                      
    
    Fb3 randomwrite.f               sync=always                     sync=disabled                  
                                    547 ops                         28165 ops
                                    109.388 ops/s                   5632.661 ops/s
                                    6663us cpu/op                   292us cpu/op
                                    9.1ms latency                   0.2ms latency
                                    0.8 MB/s                        44.0 MB/s
    
    Fb4 singlestreamwrite.f         sync=always                     sync=disabled                  
                                    229 ops                         4902 ops
                                    45.796 ops/s                    980.366 ops/s
                                    17456us cpu/op                  1838us cpu/op
                                    21.7ms latency                  1.0ms latency
                                    45.6 MB/s                       980.2 MB/s
    
    dd results                      sync=always                     sync=disabled                  
    data                            5GB                             5GB                            
    dd sequential                   219.6 MB/s                      917.1 MB/s                     
    ________________________________________________________________________________________
     
    read fb 7-9 + dd (opt)          randomread.f     randomrw.f     singlestreamr dd
    pri/sec cache=all               174.6 MB/s       258.2 MB/s     2.2 GB/s      1.9 GB/s       
    ________________________________________________________________________________________
    

    cache=none

    Code:
    Bennchmark filesystem: /tank/_Pool_Benchmark
    begin test 1 write loop ..
    begin test 2 write loop ..
    begin test 3 ..randomwrite.f ..
    begin test 3sync ..randomwrite.f ..
    begin test 4 ..singlestreamwrite.f ..
    begin test 4sync ..singlestreamwrite.f ..
    
    begin dd write test 5 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 22.062105 secs (226632950 bytes/sec)
    
    begin dd write test 6 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 5.384915 secs (928519697 bytes/sec)
    
    set sync=disabled
    begin test 7 randomread.f ..
    begin test 8 randomrw.f ..
    begin test 9 singlestreamread.f ..
    begin dd read test 10 ..
    
    pool: tank
    
       NAME                       STATE     READ WRITE CKSUM
       tank                       ONLINE       0     0     0
         mirror-0                 ONLINE       0     0     0
           c0t50014EE20FB334A6d0  ONLINE       0     0     0
           c0t50014EE059345DBCd0  ONLINE       0     0     0
         mirror-1                 ONLINE       0     0     0
           c0t50014EE265087D35d0  ONLINE       0     0     0
           c0t50014EE2B73FBE85d0  ONLINE       0     0     0
         mirror-2                 ONLINE       0     0     0
           c0t50014EE2B7437D10d0  ONLINE       0     0     0
           c0t50014EE20FB32A77d0  ONLINE       0     0     0
         mirror-4                 ONLINE       0     0     0
           c0t50014EE2650872C3d0  ONLINE       0     0     0
           c0t50014EE2B7E365A7d0  ONLINE       0     0     0
    
    
    hostname                        holodeck3  Memory size: 98304 Megabytes
    pool                            tank (recsize=128k, compr=off, readcache=none)
    slog                            -
    remark                          cache=off
    
    simple 10 s write loop          sync=always                     sync=disabled                   
    data per write                  8KB                             8KB                             
    throughput                      312 KB/s                        1.9 MB/s                       
    
    Fb3 randomwrite.f               sync=always                     sync=disabled                   
                                    201 ops                         163 ops
                                    40.197 ops/s                    32.599 ops/s
                                    5712us cpu/op                   52378us cpu/op
                                    24.8ms latency                  30.6ms latency
                                    0.2 MB/s                        0.2 MB/s
    
    Fb4 singlestreamwrite.f         sync=always                     sync=disabled                   
                                    193 ops                         4701 ops
                                    38.598 ops/s                    940.171 ops/s
                                    6053us cpu/op                   1408us cpu/op
                                    25.8ms latency                  1.1ms latency
                                    38.4 MB/s                       940.0 MB/s
    
    dd results                      sync=always                     sync=disabled                   
    data                            5GB                             5GB                             
    dd sequential                   226.6 MB/s                      928.5 MB/s                     
    ________________________________________________________________________________________
    read fb 7-9 + dd (opt)          randomread.f     randomrw.f     singlestreamr dd
    pri/sec cache=none              0.2 MB/s         0.6 MB/s       15.4 MB/s     215.5 MB/s     
    ________________________________________________________________________________________
    
    cache=all + 16GB Optane 900p-backed virtual ESXi drive as ZIL/SLOG
    Code:
    Bennchmark filesystem: /tank/_Pool_Benchmark
    begin test 1 write loop ..
    begin test 2 write loop ..
    begin test 3 ..randomwrite.f ..
    begin test 3sync ..randomwrite.f ..
    begin test 4 ..singlestreamwrite.f ..
    begin test 4sync ..singlestreamwrite.f ..
    
    begin dd write test 5 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 6.217289 secs (804209033 bytes/sec)
    
    begin dd write test 6 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 5.124140 secs (975773546 bytes/sec)
    
    set sync=disabled
    begin test 7 randomread.f ..
    begin test 8 randomrw.f ..
    begin test 9 singlestreamread.f ..
    begin dd read test 10 ..
    
    pool: tank
    
       NAME                       STATE     READ WRITE CKSUM
       tank                       ONLINE       0     0     0
         mirror-0                 ONLINE       0     0     0
           c0t50014EE20FB334A6d0  ONLINE       0     0     0
           c0t50014EE059345DBCd0  ONLINE       0     0     0
         mirror-1                 ONLINE       0     0     0
           c0t50014EE265087D35d0  ONLINE       0     0     0
           c0t50014EE2B73FBE85d0  ONLINE       0     0     0
         mirror-2                 ONLINE       0     0     0
           c0t50014EE2B7437D10d0  ONLINE       0     0     0
           c0t50014EE20FB32A77d0  ONLINE       0     0     0
         mirror-4                 ONLINE       0     0     0
           c0t50014EE2650872C3d0  ONLINE       0     0     0
           c0t50014EE2B7E365A7d0  ONLINE       0     0     0
       logs
         c2t1d0                   ONLINE       0     0     0
    
    
    hostname                        holodeck3  Memory size: 98304 Megabytes
    pool                            tank (recsize=128k, compr=off, readcache=all)
    slog                            Virtual disk 17.2 GB
    remark                          cache=all, zil=16GB optane
    
    simple 10 s write loop          sync=always                     sync=disabled                   
    data per write                  8KB                             8KB                             
    throughput                      1.5 MB/s                        1.9 MB/s                       
    
    Fb3 randomwrite.f               sync=always                     sync=disabled                   
                                   24878 ops                       29975 ops
                                   4975.427 ops/s                  5994.810 ops/s
                                   342us cpu/op                    225us cpu/op
                                   0.2ms latency                   0.2ms latency
                                    38.8 MB/s                       46.8 MB/s
    
    Fb4 singlestreamwrite.f         sync=always                     sync=disabled                   
                                   4505 ops                        5000 ops
                                   900.974 ops/s                   999.971 ops/s
                                   2869us cpu/op                   1284us cpu/op
                                   1.1ms latency                   1.0ms latency
                                    900.8 MB/s                      999.8 MB/s
    
    dd results                      sync=always                     sync=disabled                   
    data                            5GB                             5GB                             
    dd sequential                   804.2 MB/s                      975.8 MB/s                     
    ________________________________________________________________________________________
     
    read fb 7-9 + dd (opt)          randomread.f     randomrw.f     singlestreamr dd
    pri/sec cache=all               193.2 MB/s       287.0 MB/s     2.6 GB/s      2.2 GB/s       
    ________________________________________________________________________________________
    
    cache=none + 16GB Optane 900p-backed virtual ESXi drive as ZIL/SLOG
    Code:
    Bennchmark filesystem: /tank/_Pool_Benchmark
    begin test 1 write loop ..
    begin test 2 write loop ..
    begin test 3 ..randomwrite.f ..
    begin test 3sync ..randomwrite.f ..
    begin test 4 ..singlestreamwrite.f ..
    begin test 4sync ..singlestreamwrite.f ..
    
    begin dd write test 5 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 6.413377 secs (779620439 bytes/sec)
    
    begin dd write test 6 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 5.100575 secs (980281575 bytes/sec)
    
    set sync=disabled
    begin test 7 randomread.f ..
    begin test 8 randomrw.f ..
    begin test 9 singlestreamread.f ..
    begin dd read test 10 ..
    
    pool: tank
    
       NAME                       STATE     READ WRITE CKSUM
       tank                       ONLINE       0     0     0
         mirror-0                 ONLINE       0     0     0
           c0t50014EE20FB334A6d0  ONLINE       0     0     0
           c0t50014EE059345DBCd0  ONLINE       0     0     0
         mirror-1                 ONLINE       0     0     0
           c0t50014EE265087D35d0  ONLINE       0     0     0
           c0t50014EE2B73FBE85d0  ONLINE       0     0     0
         mirror-2                 ONLINE       0     0     0
           c0t50014EE2B7437D10d0  ONLINE       0     0     0
           c0t50014EE20FB32A77d0  ONLINE       0     0     0
         mirror-4                 ONLINE       0     0     0
           c0t50014EE2650872C3d0  ONLINE       0     0     0
           c0t50014EE2B7E365A7d0  ONLINE       0     0     0
       logs
         c2t1d0                   ONLINE       0     0     0
    
    
    hostname                        holodeck3  Memory size: 98304 Megabytes
    pool                            tank (recsize=128k, compr=off, readcache=none)
    slog                            Virtual disk 17.2 GB
    remark                          cache=none, zil=16GB optane
    
    simple 10 s write loop          sync=always                     sync=disabled                   
    data per write                  8KB                             8KB                             
    throughput                      1.5 MB/s                        1.9 MB/s                       
    
    Fb3 randomwrite.f               sync=always                     sync=disabled                   
                                   146 ops                         169 ops
                                   29.198 ops/s                    33.798 ops/s
                                   29027us cpu/op                  27799us cpu/op
                                   34.1ms latency                  29.5ms latency
                                    0.2 MB/s                        0.2 MB/s
    
    Fb4 singlestreamwrite.f         sync=always                     sync=disabled                   
                                   4575 ops                        4981 ops
                                   914.954 ops/s                   996.165 ops/s
                                   2849us cpu/op                   1443us cpu/op
                                   1.1ms latency                   1.0ms latency
                                    914.8 MB/s                      996.0 MB/s
    
    dd results                      sync=always                     sync=disabled                   
    data                            5GB                             5GB                             
    dd sequential                   779.6 MB/s                      980.3 MB/s                     
    ________________________________________________________________________________________
     
    read fb 7-9 + dd (opt)          randomread.f     randomrw.f     singlestreamr dd
    pri/sec cache=none              0.2 MB/s         0.4 MB/s       30.8 MB/s     242.7 MB/s     
    ________________________________________________________________________________________
    
       
    
    Curiously, it looks like the Optane ZIL/SLOG HUGELY increased the performance on the dd sequential test with sync=always. I don't know why I wasn't seeing this before. I repeated my initial dd test with the ZIL/SLOG still in place:

    Write: `dd if=/dev/zero of=/tank/dd.tst bs=32768000 count=6250`
    Read: `dd if=/tank/dd.tst of=/dev/null bs=32768000`

    Results = 456 MB/s write, 500 MB/s read

    Well, that's a big read speed improvement. Hmmmm. I don't know why I wasn't seeing this before.
     
    #4
    Patrick likes this.
  5. altano

    altano Member

    For completeness, here's cache=all with a 16GB Optane 900p-backed ZIL/SLOG and a 100GB Optane 900p-backed readcache/L2ARC:

    Code:
    Bennchmark filesystem: /tank/_Pool_Benchmark
    begin test 1 write loop ..
    begin test 2 write loop ..
    begin test 3 ..randomwrite.f ..
    begin test 3sync ..randomwrite.f ..
    begin test 4 ..singlestreamwrite.f ..
    begin test 4sync ..singlestreamwrite.f ..
    
    begin dd write test 5 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 7.296929 secs (685219740 bytes/sec)
    
    begin dd write test 6 .. time dd if=/dev/zero of=/tank/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
    5000000000 bytes transferred in 5.307482 secs (942066238 bytes/sec)
    
    set sync=disabled
    begin test 7 randomread.f ..
    begin test 8 randomrw.f ..
    begin test 9 singlestreamread.f ..
    begin dd read test 10 ..
    
    pool: tank
    
       NAME                       STATE     READ WRITE CKSUM
       tank                       ONLINE       0     0     0
         mirror-0                 ONLINE       0     0     0
           c0t50014EE20FB334A6d0  ONLINE       0     0     0
           c0t50014EE059345DBCd0  ONLINE       0     0     0
         mirror-1                 ONLINE       0     0     0
           c0t50014EE265087D35d0  ONLINE       0     0     0
           c0t50014EE2B73FBE85d0  ONLINE       0     0     0
         mirror-2                 ONLINE       0     0     0
           c0t50014EE2B7437D10d0  ONLINE       0     0     0
           c0t50014EE20FB32A77d0  ONLINE       0     0     0
         mirror-4                 ONLINE       0     0     0
           c0t50014EE2650872C3d0  ONLINE       0     0     0
           c0t50014EE2B7E365A7d0  ONLINE       0     0     0
       logs
         c2t1d0                   ONLINE       0     0     0
       cache
         c2t2d0                   ONLINE       0     0     0
    
    
    hostname                        holodeck3  Memory size: 98304 Megabytes
    pool                            tank (recsize=128k, compr=off, readcache=all)
    slog                            Virtual disk 17.2 GB
    remark                          cache=all, zil=16GB optane, l2arc=100GB optane
    
    simple 10 s write loop          sync=always                     sync=disabled                   
    data per write                  8KB                             8KB                             
    throughput                      1.4 MB/s                        1.8 MB/s                       
    
    Fb3 randomwrite.f               sync=always                     sync=disabled                   
                                   26304 ops                       28940 ops
                                   5260.628 ops/s                  5787.819 ops/s
                                   338us cpu/op                    206us cpu/op
                                   0.2ms latency                   0.2ms latency
                                    41.0 MB/s                       45.2 MB/s
    
    Fb4 singlestreamwrite.f         sync=always                     sync=disabled                   
                                   4408 ops                        4767 ops
                                   881.571 ops/s                   953.361 ops/s
                                   3118us cpu/op                   1831us cpu/op
                                   1.1ms latency                   1.0ms latency
                                    881.4 MB/s                      953.2 MB/s
    
    dd results                      sync=always                     sync=disabled                   
    data                            5GB                             5GB                             
    dd sequential                   685.2 MB/s                      942.1 MB/s                     
    ________________________________________________________________________________________
     
    read fb 7-9 + dd (opt)          randomread.f     randomrw.f     singlestreamr dd
    pri/sec cache=all               195.0 MB/s       63.0 MB/s      2.5 GB/s      2.1 GB/s       
    ________________________________________________________________________________________
    
    
     
    #5
    Patrick likes this.
  6. altano

    altano Member

    Joined:
    Sep 3, 2011
    Messages:
    64
    Likes Received:
    13
    Unfortunately the ZIL/SLOG and the readcache/L2ARC may have a huge impact on benchmarks, but I'm still seeing 100MB/s local intrapool file copies.
     
    #6
  7. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    1,843
    Likes Received:
    611
    First, a few remarks, as your initial question was about copying large files and benchmark results:

    1. If you copy files within a pool, you have concurrent reads and writes to the same pool, so you will not get results similar to a pure read or write benchmark.

    2. As your question was about copying large files on a machine with a huge amount of RAM, you will most probably see no effect from the L2ARC. I expect quite similar results with and without it, so it is best to disable the L2ARC to reduce the number of variables.

    3. Sync write is a high-end security feature for crash-resistant writes, where you must ensure that every committed write is on stable storage at least after the next reboot, similar to a hardware raid with cache + BBU/flash protection. (An Slog is NOT a write cache; it is protection for the ZFS RAM-based write cache.) For regular filer/copy use, you do not use or need this feature. If you really need it, e.g. for transactional databases or VM storage, you will discover that a mechanical disk is badly suited for sync writes due to its high latency and only about 50-100 iops per single disk. This is why you want to outsource sync write logging (to protect the RAM-based write cache on a crash and redo the writes on next reboot) to something like an Optane 900P with 500k iops and 10us latency. For your use case you can ignore the sync write benchmark values and the question of whether to add an Slog or not.

    4. A RAM-based read cache is essential for ZFS. This is due to the Copy-on-Write behaviour, with its high fragmentation, and the raid behaviour of ZFS, which spreads data quite evenly over a pool. This means you will never see purely sequential behaviour; your effective read/write performance is limited by the iops performance of the pool. As even writes must read a lot of metadata, disabling the read cache affects writes as well. Disabling the read cache is only useful to simulate a filer with 1GB of RAM (quite senseless) or to compare against settings where you want to exclude RAM effects; see http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf chapter 3.4 vs 3.5, where I have done such a comparison with a 4 x raid-0 disk pool (quite similar to your 4-vdev pool). If you only want to know whether performance is good enough or as expected, always enable the read cache.
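    The cache and sync behaviour described above is controlled per dataset with standard ZFS properties. A minimal sketch, assuming a pool named `tank` as in this thread (run as root; shown here as a reference for repeating the tests, not as required tuning):

```shell
# Enable the ARC (RAM read cache) for both data and metadata --
# the recommended setting when you just want to know if performance is OK:
zfs set primarycache=all tank

# The L2ARC is controlled the same way via secondarycache:
zfs set secondarycache=all tank

# Sync-write behaviour: 'standard' honors application fsync/O_SYNC,
# 'always' forces every write through the ZIL/Slog, 'disabled' skips it:
zfs set sync=standard tank

# Verify the current values:
zfs get primarycache,secondarycache,sync tank
```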

    For your use case:
    - you can use an L2ARC, but do not expect it to be relevant
    - you can use an Slog or not; look only at the async write values, which are not affected by an Slog.
    Compare http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf chapter 5 (4 disks raid-0) with your results.

    - for a basic calculation:
    theoretical max continuous sequential write performance of your pool is n x vdevs (200 MB/s per disk) = 800 MB/s, more with cache effects
    theoretical max continuous sequential read performance of your pool is 2n x vdevs (200 MB/s per disk) = 1600 MB/s, more with cache effects

    max write iops is n x vdevs (100 per disk): 400 iops, more with cache effects
    max read iops is 2n x vdevs (100 per disk): 800 iops, more with cache effects
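    This rule-of-thumb arithmetic for a pool of two-way mirror vdevs can be written out as a small sketch (the 200 MB/s and 100 iops per-disk figures are gea's assumptions for 7200 rpm disks, not measured values):

```shell
# Rule-of-thumb streaming limits for a pool of N two-way mirror vdevs.
# A write must hit every side of a mirror, so each vdev contributes one
# disk's worth of write throughput; reads can be served by either side,
# so reads roughly double.
vdevs=4
mb_per_disk=200     # assumed sequential MB/s per 7200rpm disk
iops_per_disk=100   # assumed iops per disk

write_mb=$((vdevs * mb_per_disk))         # 4 x 200 = 800 MB/s
read_mb=$((2 * vdevs * mb_per_disk))      # 2 x 4 x 200 = 1600 MB/s
write_iops=$((vdevs * iops_per_disk))     # 400 iops
read_iops=$((2 * vdevs * iops_per_disk))  # 800 iops

echo "theoretical write: ${write_mb} MB/s, ${write_iops} iops"
echo "theoretical read:  ${read_mb} MB/s, ${read_iops} iops"
```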

    Due to fragmentation effects, the performance of a pool degrades with fill rate, so benchmark empty pools for comparisons.

    If you have disks similar to mine (HGST HE8 6TB, 7200 rpm), you should expect similar values on writes and slightly better on reads, as you use raid-10 instead of my raid-0. If you use slower disks, reduce expectations accordingly (say by 30%).

    Basic values for your workload are singlestreamwrite/singlestreamread (readcache=all):
    my results: 1.1 GB/s write and 2.6 GB/s read (my pdf chapter 3.5)
    your results: 950 MB/s write and 2.5 GB/s read

    So this result is as expected.

    The real question now is: how did you copy the files and determine real performance? With a pure time + cp command (local copy), e.g. `time cp /x /y`, where you can calculate the transfer rate from time and amount of data, or remotely, e.g. via SMB and Windows?

    In the second case, you are probably limited by other effects. If you mainly want good sequential performance via SMB, use the AJA video benchmark for further testing. AJA tests first write, then read (no concurrent read/write to the same pool like a pure copy). In a 1G network (with, say, 110 MB/s max performance), a concurrent read + write to the same pool can give only 110 MB/s for read + write combined, which means your effective copy performance must land between 50 and 80 MB/s. (A move between filesystems = a copy.)
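    One way to put a number on the local-copy case: time the cp and divide bytes by seconds. A sketch with made-up example figures (the path and the 41.2 s are hypothetical; on Solaris/OmniOS `ptime` works like `time`):

```shell
# Hypothetical example: time a local copy of a large file, then
# compute MB/s from file size and elapsed wall-clock time, e.g.
#   time cp /tank/video/big.mkv /tank/scratch/big.mkv
# Suppose 'real' came back as 41.2s for a 5368709120-byte (5 GB) file:
bytes=5368709120
secs=41.2
awk -v b="$bytes" -v s="$secs" 'BEGIN { printf "%.0f MB/s\n", b / s / 1000000 }'
```

    For a 5 GB file in 41.2 s that works out to roughly 130 MB/s, which matches the real-world numbers reported in this thread.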
     
    #7
    Last edited: Apr 8, 2018
    nle likes this.
  8. altano

    altano Member

    Joined:
    Sep 3, 2011
    Messages:
    64
    Likes Received:
    13
    Thanks for the thorough reply. Everything you're saying makes sense.

    For those following along, I created this visualization of my benchmark results to make it clear that (a) everything gea has been saying about sync=always and the Optane is definitely correct and (b) I see no reason to add an L2ARC if you have enough RAM:

    [IMG]

    I've tried two things: `time cp /x /y` on the local Solaris machine AND copying from one folder to another in Windows over SMB (a Windows VM on the same ESXi host, so no Ethernet bottleneck). If I could even hit HALF the speed I'm getting in the dd benchmark (400MB/s of 800MB/s) I'd be happy but I can't even break 200MB/s in any real-world test. Is there any reason a local file copy between filesystems would be *SO* much slower than my read/write speeds in benchmarks?

    I'll check out the AJA benchmark and add it to my list as soon as I have some more free time.
     
    #8
    gea likes this.
  9. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    @altano

    I have already told you... my thoughts...

    you are looking at napp-it -- I think it's more to do with your hardware/esxi environment.

    I have already told you how I have gotten napp-it to be more performant -- noticeably -- without log devices, just by changing the napp-it VM's latency and resource settings..

    I have already asked you to take a look at your storage controllers and disks under esxtop ... have you done it?

    if you don't want to look at your esxi environment ... export your pools from the napp-it VM.. download and install napp-it to go on a USB stick and boot into napp-it BARE METAL ... run the same benchmarks and graph the difference between your tests.. bare metal vs esxi environment, and rule out system resource constraints and your esxi settings..

    further.. it could be driver issues with esxi especially if you are not using hardware approved by esxi at anyplace in your system

    you are saying that all your pool drives are connected to a motherboard controller and not a separate HBA? how that controller is given PCIe lanes and interrupts can make a difference as well.. are the motherboard BIOS and the mfg's onboard LSI controller firmware at the latest versions?

    etc .. etc...
     
    #9
  10. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    also .. trying to follow your tests..

    you are saying that dd local on napp-it to your pools is good

    but a copy from one pool to another via smb (even though it's a virtual net) is poor?

    still sounds resource-bound to me... even though the network ops are virtual.. they are still treated like a network operation, which means more system interrupts are used.. and yet more driver issues might come into play, as now you have to look at network issues too..

    what kind of iperf numbers are you seeing between your VMs, and from VMs to AND from the napp-it VM?
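    Answering the iperf question might look like this (a sketch; iperf must be installed on both ends, and the hostname and the output line below are made-up examples):

```shell
# On the napp-it/OmniOS VM -- run an iperf server:
#   iperf -s
# On the Windows VM (or any VM on the same vSwitch) -- run the client:
#   iperf -c napp-it-vm -t 10
# A typical client report line looks like this:
line="[  3]  0.0-10.0 sec  10.9 GBytes  9.38 Gbits/sec"
# Pull out the bandwidth figure (last two fields) to log or graph it:
bw=$(echo "$line" | awk '{ print $(NF-1), $NF }')
echo "measured: $bw"
```

    With vmxnet3 adapters on the same host, numbers well above 1 Gbit/s are typical; results pinned near 1 Gbit/s would hint at an e1000 adapter or a CPU/interrupt bottleneck.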

    what kind of interrupt activity are you seeing while saturating your virtual network?

    again... there are only so many interrupts your host has available.. and if your physical network, virtual network, and your HBA are all on the motherboard.. it has to slice up the finite resources of the motherboard to handle them all.
     
    #10
  11. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    "Unfortunately the ZIL/SLOG and the readcache/L2ARC may have a huge impact on benchmarks, but I'm still seeing 100MB/s local intrapool file copies."

    how are you doing the copies.. locally in napp-it using a terminal command, or ONLY over an SMB network from a windows VM?

    What kind of numbers are you seeing on the windows network adapter during this copy... I don't think this would be a server-side copy, so napp-it would have to read the data, send it over the virtual network to windows, then back to napp-it? you can verify that by watching network activity on the windows VM

    what resource utilization .. cpu, interrupts (reported inside the VMs and host) etc are being reported during the copy

    which vmware tools network adapter are you using, e1000 or vmxnet3? etc..
     
    #11
  12. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    according to your motherboard's manual.. your broadcom 2116 sas/sata 16-port controller is given 4 PCIe 3.0 lanes.. but it's on the same SoC as every other device on the board.. so that SoC is likely working hard.

    have you checked your board's rev number and its compatibility with the version of esxi you are using?

    I have never heard of the broadcom 2116.. and you are passing that to omnios/napp-it.. does it have a good driver?

    based on the raw dd performance of your drives locally.. but having issues with pool copies that appear to come from other VMs.. I still think this is either a network adapter issue.. or the combination of (virtual) network overhead and its associated interrupts affecting the SoC controller and limiting overall bandwidth..
     
    #12
  13. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    oh .. and now I see also that it's intra-pool.. not between pools on different disks.. so that is a LOT of simultaneous read/write, not just a dd write.

    also, you are using sata disks.. what is the queue depth for those disks.. 32? that might be an issue too. I remember an article stating that lowering the queue depths in certain circumstances, depending on NCQ and compatibility, can really fubar ZFS performance. but I am not going to research that.
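    For anyone wanting to check the queue-depth angle: on OmniOS/illumos the vdev queue limits are kernel tunables you can at least inspect (a sketch; run as root, and note that tunable names differ between ZFS versions, so verify against your release before changing anything):

```shell
# Dump the kernel's current ZFS tunables, including the per-vdev
# queue limits (zfs_vdev_*_max_active on newer illumos ZFS):
echo "::zfs_params" | mdb -k

# A tunable could then be lowered for testing via /etc/system, e.g.:
#   set zfs:zfs_vdev_max_active = 16
# (name and effect depend on your ZFS version -- treat as illustrative)
```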
     
    #13
  14. cw823

    cw823 Member

    Joined:
    Jan 14, 2014
    Messages:
    159
    Likes Received:
    24
    I’ve always seen similar speeds when doing intra-pool transfers, which is much of the reason I migrated to VSAN for my VMs, as opposed to NFS shares from my main array.


    Sent from my iPhone using Tapatalk
     
    #14
  15. MiniKnight

    MiniKnight Well-Known Member

    Joined:
    Mar 30, 2012
    Messages:
    2,722
    Likes Received:
    764
    Great thread.
     
    #15
  16. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    1,843
    Likes Received:
    611
    You do not choose ZFS for the best possible disk performance but for the best possible data security. A filesystem with Copy-on-Write and checksums (unlike default btrfs and ReFS, on data, not only metadata) must read/write more data and suffers from higher fragmentation, which makes it slower on disk access. The real advantage of ZFS is that it is fast despite all the extra security: with enough RAM it is often faster than older systems, and with a fast Slog it is fast even with secure sync writes.

    Additionally, ZFS offers a raid technique without the write-hole problems of traditional raid 1/10/5/6 arrays, where a crash during a write can always lead to a corrupt raid or a damaged filesystem.
     
    #16
  17. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    @cw823

    What kind of performance numbers do you see from a single-host VSAN?

    Oh... yeah.. that's right.. you need 3 hosts minimum and each has to have SSD storage + spinners.. a little bit of overkill for a home file server or media server.. keeping a 3-host cluster up 24/7, and VSAN requires a license that costs more per year than all my hardware cost to buy .. no thanks
     
    #17
  18. dragonme

    dragonme Member

    Joined:
    Apr 12, 2016
    Messages:
    235
    Likes Received:
    24
    @gea

    Until he starts posting hardware monitoring numbers from his BMC and esxtop .. we are never going to get an idea where his constraints are.


    Personally, I think his little Xeon and mobo SoC is doing all it can.. he is running a heavy read/write copy on a pool while esxi is doing sync writes for active VMs at the same time, all to the same pool...
     
    #18
  19. cw823

    cw823 Member

    Joined:
    Jan 14, 2014
    Messages:
    159
    Likes Received:
    24
    Agreed. Just offering that I was seeing the same constraints in what I believe was a similar setup.
     
    #19