Slow read speed for large files

RonanR

Member
Jul 27, 2018
30
0
6
Hi,
I'm wondering why I get very slow read speeds for large files (basically files larger than my RAM), compared to top-notch async write speeds.
Here is my setup:
Up-to-date OmniOS 151028
Supermicro X10SRL-F board with a Xeon 1620 v3
32GB of DDR4 ECC
Controller LSI 3008
12 HGST DC HC510 SATA drives (10TB each) configured in one RAID-Z2 pool with recordsize set to 256K

Basically, as soon as I read a file which cannot fit in the ARC cache, I get very slow read speeds, whereas my async write speed is very good.
On an AJA speed test over a 10Gb connection with a 16GB test file, I get 950MB/s write and 900MB/s read.
If I set the file size to 64GB, I still get 950MB/s write but only around 100MB/s read.
I also get the same result when copying large files through Windows Explorer.

It's the exact same effect if I disable the cache when doing benchmarks:
Benchmark filesystem: /hdd12z2/_Pool_Benchmark
Read: filebench+dd, Write: filebench_sequential, date: 11.23.2018

time dd if=/dev/zero of=/hdd12z2/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
5000000000 bytes transferred in 3.431949 secs (1456898318 bytes/sec)

hostname XST24BA Memory size: 32661 Megabytes
pool hdd12z2
(recsize=256k, compr=off, readcache=none)
slog -
remark



Fb3                          sync=always         sync=disabled

Fb4 singlestreamwrite.f      sync=always         sync=disabled
                             246 ops             7426 ops
                             49.197 ops/s        1483.963 ops/s
                             10802us cpu/op      2042us cpu/op
                             20.2ms latency      0.7ms latency
                             49.0 MB/s           1483.8 MB/s
____________________________________________________________________________
read fb 7-9 + dd (opt)       randomread.f   randomrw.f   singlestreamr   dd
pri/sec cache=none           0.4 MB/s       0.8 MB/s     81.2 MB/s       119.9 MB/s
____________________________________________________________________________

If I set the recordsize to 1M, I get a little over 200MB/s in read, but that's still far less than expected.

Benchmark filesystem: /hdd12z2/_Pool_Benchmark
Read: filebench+dd, Write: filebench_sequential, date: 11.23.2018

time dd if=/dev/zero of=/hdd12z2/_Pool_Benchmark/syncwrite.tst bs=500000000 count=10
5000000000 bytes transferred in 1.931137 secs (2589148332 bytes/sec)

hostname XST24BA Memory size: 32661 Megabytes
pool hdd12z2
(recsize=1M, compr=off, readcache=none)
slog -
remark



Fb3                          sync=always         sync=disabled

Fb4 singlestreamwrite.f      sync=always         sync=disabled
                             212 ops             9971 ops
                             42.397 ops/s        1994.163 ops/s
                             12158us cpu/op      1552us cpu/op
                             23.4ms latency      0.5ms latency
                             42.2 MB/s           1994.0 MB/s
__________________________________________________________________________

read fb 7-9 + dd (opt)       randomread.f   randomrw.f   singlestreamr   dd
pri/sec cache=none           0.4 MB/s       0.8 MB/s     200.0 MB/s      243.9 MB/s
__________________________________________________________________________

While I can somewhat understand the result when the cache is disabled, why do I get the same result when the cache is enabled, but only with files bigger than my RAM size?
What's stranger to me is that reads are slow even at the very beginning of the copy/AJA read test, as if the cache were never used.
I used to work with XFS shares over SMB, using a dedicated LSI MegaRAID card with a BBU, and I never saw such a disparity between read and write performance at any file size.
I'm quite new to ZFS, so I'm trying to understand its strengths (and there are plenty!) and limits.
 

gea

Well-Known Member
Dec 31, 2010
2,489
838
113
DE
First, a few remarks:

ZFS is a highly secure filesystem with data/metadata checksums and Copy on Write. Both are needed for the security level ZFS wants to achieve: the first produces more data to process, the second a higher fragmentation. To overcome these limitations, ZFS has superior cache strategies.

So the first thing is: you need the read caches. Disable them only when you want to check some details, not for regular use or regular benchmarks. Even on writes you need the read cache to read metadata.
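For reference, the read cache gea refers to can be toggled per dataset via the primarycache/secondarycache properties (a sketch; "tank/bench" is a placeholder dataset name — the benchmarks in this thread use the "readcache=none" equivalent):

```shell
# primarycache controls what the ARC may store for this dataset,
# secondarycache controls the L2ARC. "none" is only for testing.
zfs set primarycache=none tank/bench    # bypass the ARC entirely
zfs set primarycache=all tank/bench     # restore the default
```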

The read cache also does not store whole files; otherwise it would fill up with a single file. It stores ZFS datablocks based on a read-most/read-last strategy. It does not help with large sequential files but is intended for small random reads. For large sequential files it caches only metadata. On a pool with higher fragmentation this can lead to lower read performance than write performance, as reading sequential files is IOPS-limited.

Only an L2ARC (an extension of the read cache onto SSD or NVMe) can enable read-ahead that improves reading of sequential data.

On writes, all data is always cached in RAM first. The size of this write cache is 10% of RAM, max 4GB, by default. Its purpose is to turn many small, slow random writes into a single fast sequential write. This allows high write performance even on a slower pool, often even higher than on reads, where you lack this level of cache support.
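For a rough sense of scale (plain arithmetic from the 10%-of-RAM/4GB-cap rule described above, nothing ZFS-specific), the default write cache on a 32GB box works out to about 3.2GB:

```shell
# Default ZFS write cache (dirty data) limit per the rule above:
# 10% of RAM, capped at 4 GB (values in bytes; bash arithmetic).
ram=$((32 * 1024 * 1024 * 1024))   # the 32 GB server in this thread
cap=$((4 * 1024 * 1024 * 1024))
tenth=$((ram / 10))
if [ "$tenth" -lt "$cap" ]; then limit=$tenth; else limit=$cap; fi
echo "$limit"   # 3435973836 bytes, about 3.2 GB
```

This is roughly consistent with the zfs_dirty_data_max value (0xcc0a7e66, about 3.4 GB for 32661 MB of RAM) shown in the tunables dump later in the thread.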

If you want secure/crash-safe write behaviour (like with the BBU on a hardware RAID) you can enable/force sync writes. In that case every single write commit is additionally logged to the on-pool ZIL or to an extra Slog device. With sync enabled, a low-IOPS pool is slow unless you add a fast Slog, e.g. an Intel Optane from the 800P upwards.
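The sync behaviour described above is controlled per dataset; a sketch (pool, dataset and device names are placeholders):

```shell
# Force sync logging for every write on a dataset (BBU-like safety):
zfs set sync=always tank/projects

# Back to the default (applications decide what is sync):
zfs set sync=standard tank/projects

# Add a dedicated Slog device so forced sync writes stay fast
# (c2t1d0 is a placeholder, e.g. an Optane 800P):
zpool add tank log c2t1d0
```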

I have also seen 900 MB/s writes and 100 MB/s reads in some cases. But such huge differences were always related to a bad cable, bad settings or driver problems. On Windows, for example, you should use the newest drivers and disable interrupt throttling. Optionally try another cable. With large files, 800 MB/s writes and 600 MB/s reads are values I have often seen with disk pools.
 

RonanR

Member
Jul 27, 2018
30
0
6
Hi Gea,

Thanks for your explanation.
As my server is going to be used for reading large video files, I will add an NVMe for the L2ARC cache to improve sequential reads. Correct me if I'm wrong, but if I use a SATA SSD, my reads will be limited by the SSD speed, so around 500MB/s, right?

Regarding the ARC, I understand its importance. I just find it very strange that I only get slow read performance with files bigger than my RAM size, as if the ARC were never used in that case. I didn't see the "classic" behavior of fast reads while the data is in the cache and slow reads as soon as it isn't cached anymore.

For the secure write behavior, I don't really need it: if the system crashes while I'm copying a media file, the file will be corrupted anyway.
The only case where it could potentially help is if the system crashes while saving a project, but for that case I prepared a separate pool with sync writes enabled.

I don't think I have a bad cable/settings/drivers problem, as I get really good performance (950MB/s write and 900MB/s read) with smaller files.
 

gea

Well-Known Member
Dec 31, 2010
2,489
838
113
DE
With enough RAM the improvement from an L2ARC can be quite low, as the ARC already delivers the data. But in your case you should use an Intel Optane if you want an L2ARC (800P or better).

You must enable read-ahead manually, and the size of the L2ARC should be between 5x and 10x RAM.
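On OmniOS/illumos, the read-ahead into L2ARC that gea mentions corresponds, to my understanding, to the l2arc_noprefetch tunable (it is the setting RonanR applies later in this thread); a sketch:

```shell
# Append to /etc/system (illumos/OmniOS kernel tunables; reboot required).
# 0 = allow prefetched/sequential reads to be stored in the L2ARC.
echo 'set zfs:l2arc_noprefetch=0' >> /etc/system
```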

btw
It's not normal for read performance to drop from 800 MB/s to 100 MB/s without a cable, driver or settings problem. Your pool should be able to deliver 800 MB/s in any case.

What you can check is iostat of the disks under load. All disks should behave similarly. A single weak disk could explain this as well.

If you need better performance for random reads of large files, use a pool of striped mirrors (RAID 10 style).
 
Last edited:

RonanR

Member
Jul 27, 2018
30
0
6
Can you tell me how to enable read-ahead manually? I did some searching but wasn't able to find a proper answer.

Since last Friday I've done a lot of tests, and there really is a problem with files larger than the RAM size.
Here is what I've done:
First, I checked whether one bad disk was slowing everything down: that's not the case, all disks are used the same way when I look at iostat.
I then tested with the recordsize set to 512K; in that case I got 980MB/s write and around 230MB/s read.
With the recordsize set to 1M, I got 1020MB/s write and 350MB/s read.

I removed 16GB of memory to validate that the problem is linked to RAM size, and indeed, I now get the same slow read performance with a 16GB file (with 32GB of memory, I got around 950MB/s write and 900MB/s read using a 256K recordsize and the same 16GB file).

I did the test on two different systems, using two different 10Gb cards in my server, although iperf tests didn't reveal any flaw with the cards (tested both ways, as client and as server).
I ordered an NVMe SSD so I can try an L2ARC, but I still find this very strange.
 

RonanR

Member
Jul 27, 2018
30
0
6
Ok, now it makes sense, thanks.
I will post my test results as soon as I get my NVMe SSD.
 

carcass

New Member
Aug 14, 2017
13
2
3
38
Try tuning your prefetcher -- start with "echo 67108864 > /sys/module/zfs/parameters/zfetch_max_distance"
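The /sys path above is the ZFS on Linux interface; on OmniOS (as used in this thread) the same tunable would, to my understanding, be set via /etc/system, or live via mdb (a sketch):

```shell
# illumos/OmniOS equivalent of the ZoL tunable above (reboot required);
# 0x4000000 bytes = 64 MB maximum prefetch distance per stream.
echo 'set zfs:zfetch_max_distance=0x4000000' >> /etc/system

# Or patch the running kernel without a reboot (mdb write mode):
echo 'zfetch_max_distance/Z 0x4000000' | mdb -kw
```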
 

RonanR

Member
Jul 27, 2018
30
0
6
Ok, I got a SanDisk Extreme Pro 1TB NVMe and did some tests, using the AJA disk test with a 64GB file size.
I properly set zfs:l2arc_noprefetch=0

Without the prefetcher tuning suggested by carcass:
Standalone NVMe drive:
Whether with a 128k, 256k, 512k or 1M recordsize, I got roughly the same bandwidth:
1016 MB/s write and 810 MB/s read.
Read performance seemed to be capped by the network, so I have to do some network tuning to see if I can get the same bandwidth on read and write.

2x8 10TB HDD in RAID-Z2, no L2ARC (write / read, in MB/s):
128k : 948 / 120
256k : 950 / 144
512k : 957 / 188
1m : 950 / 522

This time with my NVMe added as L2ARC:
128k : 931 / 110
256k : 959 / 135
512k : 958 / 200
1m : 918 / 505

Adding an L2ARC NVMe cache didn't change anything in my case.

With carcass's tuning suggestion (zfetch_max_distance set to 64MB, i.e. 0x4000000 in my case, as reported by echo "::zfs_params" | mdb -k):
Without L2ARC
128k : 950/ 110
256k : 965 / 145
512k : 960 / 200
1m : 970 / 525

With L2ARC
128k : 975 / 120
256k : 970 / 145
512k : 970 / 205
1m : 975 / 562

I got a little more write bandwidth with this setting, but reads stayed the same. The L2ARC NVMe seemed to have a small effect on read performance, but the difference between with and without is close to none.

Any suggestion?
 

carcass

New Member
Aug 14, 2017
13
2
3
38
L2ARC would have zero impact on your streaming read performance unless the data has already been read.
What is the ashift value for this pool (zdb -C <pool_name> | grep ashift)?
When you read that large file, what is the actual IO size (zpool iostat -rp 1 in ZoL)?
Maybe you can also post all your current ZFS tunables, along with zfs get all <pool_name>?
 

RonanR

Member
Jul 27, 2018
30
0
6
For L2ARC, that's what I thought.
The ashift value is 12 (I have 4Kn HDDs).

Here are the iostat results (zpool iostat -v on OmniOSce). Tests done without L2ARC, but with zfetch_max_distance set to 64MB.

With a 1M recordsize:
In write:
Code:
                             capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
hdd2x8z2                   6.61T   139T      0  1.30K      0   888M
  raidz2                   3.31T  69.4T      0    739      0   449M
    c0t5000CCA273C5FF84d0      -      -      0    485      0  70.5M
    c0t5000CCA273CE0A68d0      -      -      0    488      0  71.8M
    c0t5000CCA273CE9FE7d0      -      -      0    486      0  71.9M
    c0t5000CCA273CEB7CDd0      -      -      0    525      0  77.8M
    c0t5000CCA273CED387d0      -      -      0    483      0  71.0M
    c0t5000CCA273CED4B5d0      -      -      0    481      0  71.4M
    c0t5000CCA273CEE080d0      -      -      0    471      0  68.9M
    c0t5000CCA273D1BBB2d0      -      -      0    488      0  71.2M
  raidz2                   3.30T  69.5T      0    588      0   439M
    c0t5000CCA273CF0DBFd0      -      -      0    484      0  74.2M
    c0t5000CCA273CF0DCCd0      -      -      0    483      0  74.4M
    c0t5000CCA273CF381Ed0      -      -      0    471      0  72.2M
    c0t5000CCA273CF4C17d0      -      -      0    477      0  73.7M
    c0t5000CCA273CF9BF4d0      -      -      0    472      0  73.2M
    c0t5000CCA273D07A91d0      -      -      0    488      0  76.0M
    c0t5000CCA273D186AAd0      -      -      0    483      0  75.3M
    c0t5000CCA273D1BB78d0      -      -      0    486      0  75.0M
-------------------------  -----  -----  -----  -----  -----  -----
In read:
Code:
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
hdd2x8z2                   6.68T   139T    492      0   492M      0
  raidz2                   3.35T  69.4T    238      0   239M      0
    c0t5000CCA273C5FF84d0      -      -    184      0  30.7M      0
    c0t5000CCA273CE0A68d0      -      -    180      0  30.0M      0
    c0t5000CCA273CE9FE7d0      -      -    184      0  30.8M      0
    c0t5000CCA273CEB7CDd0      -      -    178      0  29.8M      0
    c0t5000CCA273CED387d0      -      -    186      0  31.1M      0
    c0t5000CCA273CED4B5d0      -      -    187      0  31.3M      0
    c0t5000CCA273CEE080d0      -      -    188      0  31.4M      0
    c0t5000CCA273D1BBB2d0      -      -    186      0  31.1M      0
  raidz2                   3.33T  69.4T    253      0   253M      0
    c0t5000CCA273CF0DBFd0      -      -    189      0  31.6M      0
    c0t5000CCA273CF0DCCd0      -      -    191      0  32.0M      0
    c0t5000CCA273CF381Ed0      -      -    193      0  32.2M      0
    c0t5000CCA273CF4C17d0      -      -    195      0  32.6M      0
    c0t5000CCA273CF9BF4d0      -      -    190      0  31.7M      0
    c0t5000CCA273D07A91d0      -      -    189      0  31.6M      0
    c0t5000CCA273D186AAd0      -      -    194      0  32.4M      0
    c0t5000CCA273D1BB78d0      -      -    194      0  32.4M      0
-------------------------  -----  -----  -----  -----  -----  -----
Small write operations occur every 4 to 5 seconds:
Code:
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
hdd2x8z2                   6.68T   139T    401    268   402M  2.37M
  raidz2                   3.35T  69.4T    220     94   221M   830K
    c0t5000CCA273C5FF84d0      -      -    162     30  27.1M   237K
    c0t5000CCA273CE0A68d0      -      -    166     26  27.8M   208K
    c0t5000CCA273CE9FE7d0      -      -    162     23  27.0M   191K
    c0t5000CCA273CEB7CDd0      -      -    168     24  28.0M   208K
    c0t5000CCA273CED387d0      -      -    157     27  26.2M   212K
    c0t5000CCA273CED4B5d0      -      -    159     27  26.6M   210K
    c0t5000CCA273CEE080d0      -      -    157     31  26.3M   233K
    c0t5000CCA273D1BBB2d0      -      -    160     28  26.7M   228K
  raidz2                   3.33T  69.4T    181    173   181M  1.56M
    c0t5000CCA273CF0DBFd0      -      -    135     54  22.6M   440K
    c0t5000CCA273CF0DCCd0      -      -    132     50  22.0M   426K
    c0t5000CCA273CF381Ed0      -      -    130     47  21.8M   391K
    c0t5000CCA273CF4C17d0      -      -    129     45  21.6M   387K
    c0t5000CCA273CF9BF4d0      -      -    134     42  22.4M   381K
    c0t5000CCA273D07A91d0      -      -    133     43  22.3M   375K
    c0t5000CCA273D186AAd0      -      -    130     44  21.7M   377K
    c0t5000CCA273D1BB78d0      -      -    130     48  21.7M   408K
-------------------------  -----  -----  -----  -----  -----  -----


With a 256k recordsize:
In write:
Code:
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
hdd2x8z2                   6.63T   139T      0  4.39K      0   989M
  raidz2                   3.32T  69.4T      0  2.27K      0   511M
    c0t5000CCA273C5FF84d0      -      -      0    862      0  87.0M
    c0t5000CCA273CE0A68d0      -      -      0    882      0  86.9M
    c0t5000CCA273CE9FE7d0      -      -      0    881      0  86.9M
    c0t5000CCA273CEB7CDd0      -      -      0    864      0  86.5M
    c0t5000CCA273CED387d0      -      -      0    892      0  87.3M
    c0t5000CCA273CED4B5d0      -      -      0    873      0  87.2M
    c0t5000CCA273CEE080d0      -      -      0    879      0  86.8M
    c0t5000CCA273D1BBB2d0      -      -      0    887      0  86.8M
  raidz2                   3.30T  69.4T      0  2.12K      0   478M
    c0t5000CCA273CF0DBFd0      -      -      0    825      0  81.1M
    c0t5000CCA273CF0DCCd0      -      -      0    802      0  80.9M
    c0t5000CCA273CF381Ed0      -      -      0    809      0  81.4M
    c0t5000CCA273CF4C17d0      -      -      0    813      0  81.6M
    c0t5000CCA273CF9BF4d0      -      -      0    802      0  82.3M
    c0t5000CCA273D07A91d0      -      -      0    803      0  82.2M
    c0t5000CCA273D186AAd0      -      -      0    813      0  82.0M
    c0t5000CCA273D1BB78d0      -      -      0    817      0  81.7M
-------------------------  -----  -----  -----  -----  -----  -----
In read:
Code:
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
hdd2x8z2                   6.68T   139T    528      0   132M      0
  raidz2                   3.35T  69.4T    273      0  68.4M      0
    c0t5000CCA273C5FF84d0      -      -    127      0  8.36M      0
    c0t5000CCA273CE0A68d0      -      -    120      0  8.15M      0
    c0t5000CCA273CE9FE7d0      -      -    117      0  8.16M      0
    c0t5000CCA273CEB7CDd0      -      -    117      0  8.15M      0
    c0t5000CCA273CED387d0      -      -    128      0  8.67M      0
    c0t5000CCA273CED4B5d0      -      -    122      0  8.09M      0
    c0t5000CCA273CEE080d0      -      -    120      0  8.07M      0
    c0t5000CCA273D1BBB2d0      -      -    122      0  8.25M      0
  raidz2                   3.33T  69.4T    254      0  63.7M      0
    c0t5000CCA273CF0DBFd0      -      -    123      0  7.95M      0
    c0t5000CCA273CF0DCCd0      -      -    126      0  7.95M      0
    c0t5000CCA273CF381Ed0      -      -    128      0  7.97M      0
    c0t5000CCA273CF4C17d0      -      -    119      0  7.98M      0
    c0t5000CCA273CF9BF4d0      -      -    128      0  8.02M      0
    c0t5000CCA273D07A91d0      -      -    132      0  7.99M      0
    c0t5000CCA273D186AAd0      -      -    123      0  7.96M      0
    c0t5000CCA273D1BB78d0      -      -    119      0  7.99M      0
-------------------------  -----  -----  -----  -----  -----  -----
Once every 2 to 3 seconds, there are also small write operations:
Code:
                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
hdd2x8z2                   6.68T   139T    555    173   139M   742K
  raidz2                   3.35T  69.4T    277     59  69.3M   254K
    c0t5000CCA273C5FF84d0      -      -    188     18  8.80M   119K
    c0t5000CCA273CE0A68d0      -      -    190     18  8.91M   113K
    c0t5000CCA273CE9FE7d0      -      -    170     16  8.65M   119K
    c0t5000CCA273CEB7CDd0      -      -    185     14  8.84M   103K
    c0t5000CCA273CED387d0      -      -    165     16  8.78M   105K
    c0t5000CCA273CED4B5d0      -      -    174     14  8.89M  91.8K
    c0t5000CCA273CEE080d0      -      -    191     14  8.84M  86.4K
    c0t5000CCA273D1BBB2d0      -      -    185     16  8.79M  94.5K
  raidz2                   3.33T  69.4T    278    114  69.7M   488K
    c0t5000CCA273CF0DBFd0      -      -    148     27  8.82M   197K
    c0t5000CCA273CF0DCCd0      -      -    203     29  9.03M   197K
    c0t5000CCA273CF381Ed0      -      -    188     29  8.86M   200K
    c0t5000CCA273CF4C17d0      -      -    172     24  9.09M   181K
    c0t5000CCA273CF9BF4d0      -      -    192     25  9.09M   184K
    c0t5000CCA273D07A91d0      -      -    192     30  9.06M   189K
    c0t5000CCA273D186AAd0      -      -    189     27  8.99M   178K
    c0t5000CCA273D1BB78d0      -      -    174     29  8.98M   178K
-------------------------  -----  -----  -----  -----  -----  -----
Actually, I have specified only two ZFS tunables:
zfs:l2arc_noprefetch = 0
zfs:zfetch_max_distance = 0x4000000

Everything else is at its default value:
Code:
arc_lotsfree_percent = 0xa
arc_pages_pp_reserve = 0x40
arc_reduce_dnlc_percent = 0x3
arc_swapfs_reserve = 0x40
arc_zio_arena_free_shift = 0x2
dbuf_cache_hiwater_pct = 0xa
dbuf_cache_lowater_pct = 0xa
dbuf_cache_max_bytes = 0x3d434780
mdb: variable dbuf_cache_max_shift not found: unknown symbol name
ddt_zap_indirect_blockshift = 0xc
ddt_zap_leaf_blockshift = 0xc
ditto_same_vdev_distance_shift = 0x3
dmu_find_threads = 0x0
dmu_rescan_dnode_threshold = 0x20000
dsl_scan_delay_completion = 0x0
fzap_default_block_shift = 0xe
l2arc_feed_again = 0x1
l2arc_feed_min_ms = 0xc8
l2arc_feed_secs = 0x1
l2arc_headroom = 0x2
l2arc_headroom_boost = 0xc8
l2arc_noprefetch = 0x0
l2arc_norw = 0x1
l2arc_write_boost = 0x800000
l2arc_write_max = 0x800000
metaslab_aliquot = 0x80000
metaslab_bias_enabled = 0x1
metaslab_debug_load = 0x0
metaslab_debug_unload = 0x0
metaslab_df_alloc_threshold = 0x20000
metaslab_df_free_pct = 0x4
metaslab_fragmentation_factor_enabled = 0x1
metaslab_force_ganging = 0x1000001
metaslab_lba_weighting_enabled = 0x1
metaslab_load_pct = 0x32
metaslab_min_alloc_size = 0x2000000
metaslab_ndf_clump_shift = 0x4
metaslab_preload_enabled = 0x1
metaslab_preload_limit = 0x3
metaslab_trace_enabled = 0x1
metaslab_trace_max_entries = 0x1388
metaslab_unload_delay = 0x8
mdb: variable metaslabs_per_vdev not found: unknown symbol name
mdb: variable reference_history not found: unknown symbol name
mdb: variable reference_tracking_enable not found: unknown symbol name
send_holes_without_birth_time = 0x1
spa_asize_inflation = 0x18
spa_load_verify_data = 0x1
spa_load_verify_maxinflight = 0x2710
spa_load_verify_metadata = 0x1
spa_max_replication_override = 0x3
spa_min_slop = 0x8000000
spa_mode_global = 0x3
spa_slop_shift = 0x5
mdb: variable space_map_blksz not found: unknown symbol name
vdev_mirror_shift = 0x15
zfetch_max_distance = 0x4000000
zfs_abd_chunk_size = 0x1000
zfs_abd_scatter_enabled = 0x1
zfs_arc_average_blocksize = 0x2000
zfs_arc_evict_batch_limit = 0xa
zfs_arc_grow_retry = 0x0
zfs_arc_max = 0x0
zfs_arc_meta_limit = 0x0
zfs_arc_meta_min = 0x0
zfs_arc_min = 0x0
zfs_arc_p_min_shift = 0x0
zfs_arc_shrink_shift = 0x0
zfs_async_block_max_blocks = 0xffffffffffffffff
zfs_ccw_retry_interval = 0x12c
zfs_commit_timeout_pct = 0x5
zfs_compressed_arc_enabled = 0x1
zfs_condense_indirect_commit_entry_delay_ticks = 0x0
zfs_condense_indirect_vdevs_enable = 0x1
zfs_condense_max_obsolete_bytes = 0x40000000
zfs_condense_min_mapping_bytes = 0x20000
zfs_condense_pct = 0xc8
zfs_dbgmsg_maxsize = 0x400000
zfs_deadman_checktime_ms = 0x1388
zfs_deadman_enabled = 0x1
zfs_deadman_synctime_ms = 0xf4240
zfs_dedup_prefetch = 0x1
zfs_default_bs = 0x9
zfs_default_ibs = 0x11
zfs_delay_max_ns = 0x5f5e100
zfs_delay_min_dirty_percent = 0x3c
zfs_delay_resolution_ns = 0x186a0
zfs_delay_scale = 0x7a120
zfs_dirty_data_max = 0xcc0a7e66
zfs_dirty_data_max_max = 0x100000000
zfs_dirty_data_max_percent = 0xa
mdb: variable zfs_dirty_data_sync not found: unknown symbol name
zfs_flags = 0x0
zfs_free_bpobj_enabled = 0x1
zfs_free_leak_on_eio = 0x0
zfs_free_min_time_ms = 0x3e8
zfs_fsync_sync_cnt = 0x4
zfs_immediate_write_sz = 0x8000
zfs_indirect_condense_obsolete_pct = 0x19
zfs_lua_check_instrlimit_interval = 0x64
zfs_lua_max_instrlimit = 0x5f5e100
zfs_lua_max_memlimit = 0x6400000
zfs_max_recordsize = 0x100000
zfs_mdcomp_disable = 0x0
zfs_metaslab_condense_block_threshold = 0x4
zfs_metaslab_fragmentation_threshold = 0x46
zfs_metaslab_segment_weight_enabled = 0x1
zfs_metaslab_switch_threshold = 0x2
zfs_mg_fragmentation_threshold = 0x55
zfs_mg_noalloc_threshold = 0x0
zfs_multilist_num_sublists = 0x0
zfs_no_scrub_io = 0x0
zfs_no_scrub_prefetch = 0x0
zfs_nocacheflush = 0x0
zfs_nopwrite_enabled = 0x1
zfs_object_remap_one_indirect_delay_ticks = 0x0
zfs_obsolete_min_time_ms = 0x1f4
zfs_pd_bytes_max = 0x3200000
zfs_per_txg_dirty_frees_percent = 0x1e
zfs_prefetch_disable = 0x0
zfs_read_chunk_size = 0x100000
zfs_recover = 0x0
zfs_recv_queue_length = 0x1000000
zfs_redundant_metadata_most_ditto_level = 0x2
zfs_remap_blkptr_enable = 0x1
zfs_remove_max_copy_bytes = 0x4000000
zfs_remove_max_segment = 0x100000
zfs_resilver_delay = 0x2
zfs_resilver_min_time_ms = 0xbb8
zfs_scan_idle = 0x32
zfs_scan_min_time_ms = 0x3e8
zfs_scrub_delay = 0x4
zfs_scrub_limit = 0xa
zfs_send_corrupt_data = 0x0
zfs_send_queue_length = 0x1000000
zfs_send_set_freerecords_bit = 0x1
zfs_sync_pass_deferred_free = 0x2
zfs_sync_pass_dont_compress = 0x5
zfs_sync_pass_rewrite = 0x2
zfs_sync_taskq_batch_pct = 0x4b
zfs_top_maxinflight = 0x20
zfs_txg_timeout = 0x5
zfs_vdev_aggregation_limit = 0x20000
zfs_vdev_async_read_max_active = 0x3
zfs_vdev_async_read_min_active = 0x1
zfs_vdev_async_write_active_max_dirty_percent = 0x3c
zfs_vdev_async_write_active_min_dirty_percent = 0x1e
zfs_vdev_async_write_max_active = 0xa
zfs_vdev_async_write_min_active = 0x1
zfs_vdev_cache_bshift = 0x10
zfs_vdev_cache_max = 0x4000
zfs_vdev_cache_size = 0x0
zfs_vdev_max_active = 0x3e8
zfs_vdev_queue_depth_pct = 0x3e8
zfs_vdev_read_gap_limit = 0x8000
zfs_vdev_removal_max_active = 0x2
zfs_vdev_removal_min_active = 0x1
zfs_vdev_scrub_max_active = 0x2
zfs_vdev_scrub_min_active = 0x1
zfs_vdev_sync_read_max_active = 0xa
zfs_vdev_sync_read_min_active = 0xa
zfs_vdev_sync_write_max_active = 0xa
zfs_vdev_sync_write_min_active = 0xa
zfs_vdev_write_gap_limit = 0x1000
zfs_write_implies_delete_child = 0x1
zfs_zil_clean_taskq_maxalloc = 0x100000
zfs_zil_clean_taskq_minalloc = 0x400
zfs_zil_clean_taskq_nthr_pct = 0x64
zil_replay_disable = 0x0
zil_slog_bulk = 0xc0000
zio_buf_debug_limit = 0x0
zio_dva_throttle_enabled = 0x1
zio_injection_enabled = 0x0
zvol_immediate_write_sz = 0x8000
zvol_maxphys = 0x1000000
zvol_unmap_enabled = 0x1
zvol_unmap_sync_enabled = 0x0
zfs_max_dataset_nesting = 0x32

Here is the output of zfs get all mypool:
Code:
NAME      PROPERTY              VALUE                  SOURCE
hdd2x8z2  type                  filesystem             -
hdd2x8z2  creation              Wed Nov 28 17:26 2018  -
hdd2x8z2  used                  4.69T                  -
hdd2x8z2  available             95.5T                  -
hdd2x8z2  referenced            230K                   -
hdd2x8z2  compressratio         1.00x                  -
hdd2x8z2  mounted               yes                    -
hdd2x8z2  quota                 none                   default
hdd2x8z2  reservation           none                   default
hdd2x8z2  recordsize            1M                     local
hdd2x8z2  mountpoint            /hdd2x8z2              default
hdd2x8z2  sharenfs              off                    default
hdd2x8z2  checksum              on                     default
hdd2x8z2  compression           off                    default
hdd2x8z2  atime                 on                     default
hdd2x8z2  devices               on                     default
hdd2x8z2  exec                  on                     default
hdd2x8z2  setuid                on                     default
hdd2x8z2  readonly              off                    default
hdd2x8z2  zoned                 off                    default
hdd2x8z2  snapdir               hidden                 default
hdd2x8z2  aclmode               passthrough            local
hdd2x8z2  aclinherit            passthrough            local
hdd2x8z2  createtxg             1                      -
hdd2x8z2  canmount              on                     default
hdd2x8z2  xattr                 on                     default
hdd2x8z2  copies                1                      default
hdd2x8z2  version               5                      -
hdd2x8z2  utf8only              off                    -
hdd2x8z2  normalization         none                   -
hdd2x8z2  casesensitivity       sensitive              -
hdd2x8z2  vscan                 off                    default
hdd2x8z2  nbmand                off                    default
hdd2x8z2  sharesmb              off                    default
hdd2x8z2  refquota              none                   default
hdd2x8z2  refreservation        none                   default
hdd2x8z2  guid                  13382713124067928909   -
hdd2x8z2  primarycache          all                    default
hdd2x8z2  secondarycache        all                    default
hdd2x8z2  usedbysnapshots       0                      -
hdd2x8z2  usedbydataset         230K                   -
hdd2x8z2  usedbychildren        4.69T                  -
hdd2x8z2  usedbyrefreservation  0                      -
hdd2x8z2  logbias               latency                default
hdd2x8z2  dedup                 off                    default
hdd2x8z2  mlslabel              none                   default
hdd2x8z2  sync                  disabled               local
hdd2x8z2  refcompressratio      1.00x                  -
hdd2x8z2  written               230K                   -
hdd2x8z2  logicalused           4.94T                  -
hdd2x8z2  logicalreferenced     45K                    -
hdd2x8z2  filesystem_limit      none                   default
hdd2x8z2  snapshot_limit        none                   default
hdd2x8z2  filesystem_count      none                   default
hdd2x8z2  snapshot_count        none                   default
hdd2x8z2  redundant_metadata    all                    default
 
Last edited:

RonanR

Member
Jul 27, 2018
30
0
6
And here is the output of zfs get all mypool/myzfsvolume (which is one of the volumes I used for these tests):
Code:
NAME            PROPERTY              VALUE                  SOURCE
hdd2x8z2/test2  type                  filesystem             -
hdd2x8z2/test2  creation              Fri Dec 14 16:14 2018  -
hdd2x8z2/test2  used                  188K                   -
hdd2x8z2/test2  available             95.5T                  -
hdd2x8z2/test2  referenced            188K                   -
hdd2x8z2/test2  compressratio         1.00x                  -
hdd2x8z2/test2  mounted               yes                    -
hdd2x8z2/test2  quota                 none                   default
hdd2x8z2/test2  reservation           none                   default
hdd2x8z2/test2  recordsize            1M                     local
hdd2x8z2/test2  mountpoint            /hdd2x8z2/test2        default
hdd2x8z2/test2  sharenfs              off                    default
hdd2x8z2/test2  checksum              on                     default
hdd2x8z2/test2  compression           off                    default
hdd2x8z2/test2  atime                 off                    local
hdd2x8z2/test2  devices               on                     default
hdd2x8z2/test2  exec                  on                     default
hdd2x8z2/test2  setuid                on                     default
hdd2x8z2/test2  readonly              off                    default
hdd2x8z2/test2  zoned                 off                    default
hdd2x8z2/test2  snapdir               hidden                 local
hdd2x8z2/test2  aclmode               passthrough            local
hdd2x8z2/test2  aclinherit            passthrough            local
hdd2x8z2/test2  createtxg             27546                  -
hdd2x8z2/test2  canmount              on                     default
hdd2x8z2/test2  xattr                 on                     default
hdd2x8z2/test2  copies                1                      default
hdd2x8z2/test2  version               5                      -
hdd2x8z2/test2  utf8only              on                     -
hdd2x8z2/test2  normalization         formD                  -
hdd2x8z2/test2  casesensitivity       insensitive            -
hdd2x8z2/test2  vscan                 off                    default
hdd2x8z2/test2  nbmand                on                     local
hdd2x8z2/test2  sharesmb              name=test2             local
hdd2x8z2/test2  refquota              none                   default
hdd2x8z2/test2  refreservation        none                   default
hdd2x8z2/test2  guid                  10297845018154907042   -
hdd2x8z2/test2  primarycache          all                    default
hdd2x8z2/test2  secondarycache        all                    default
hdd2x8z2/test2  usedbysnapshots       0                      -
hdd2x8z2/test2  usedbydataset         188K                   -
hdd2x8z2/test2  usedbychildren        0                      -
hdd2x8z2/test2  usedbyrefreservation  0                      -
hdd2x8z2/test2  logbias               latency                default
hdd2x8z2/test2  dedup                 off                    default
hdd2x8z2/test2  mlslabel              none                   default
hdd2x8z2/test2  sync                  disabled               inherited from hdd2x8z2
hdd2x8z2/test2  refcompressratio      1.00x                  -
hdd2x8z2/test2  written               188K                   -
hdd2x8z2/test2  logicalused           36.5K                  -
hdd2x8z2/test2  logicalreferenced     36.5K                  -
hdd2x8z2/test2  filesystem_limit      none                   default
hdd2x8z2/test2  snapshot_limit        none                   default
hdd2x8z2/test2  filesystem_count      none                   default
hdd2x8z2/test2  snapshot_count        none                   default
hdd2x8z2/test2  redundant_metadata    all                    default
 

carcass

New Member
Aug 14, 2017
13
2
3
38
Are you getting the same result when reading that file locally (with dd) with empty ARC?
 

RonanR

Member
Jul 27, 2018
30
0
6
Internally with dd, I see differences, but not this big.
Here is what I got for a 256k recordsize:
write:
time sh -c "dd if=/dev/zero of=/hdd2x8z2/test2/dd-256k-256 bs=256k count=160000"
160000+0 records in
160000+0 records out
41943040000 bytes transferred in 36.283510 secs (1155980780 bytes/sec)

read
time sh -c "dd if=/hdd2x8z2/test2/dd-256k-256 of=/dev/zero bs=256k"
160000+0 records in
160000+0 records out
41943040000 bytes transferred in 63.418068 secs (661373669 bytes/sec)

And for a 1M recordsize:
write
time sh -c "dd if=/dev/zero of=/hdd2x8z2/test2/dd-1m bs=1M count=40000"
40000+0 records in
40000+0 records out
41943040000 bytes transferred in 31.749132 secs (1321076742 bytes/sec)

read
time sh -c "dd if=/hdd2x8z2/test2/dd-1m of=/dev/zero bs=1M"
40000+0 records in
40000+0 records out
41943040000 bytes transferred in 38.201376 secs (1097945805 bytes/sec)
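To make the comparison explicit (plain arithmetic on the dd output above, nothing ZFS-specific):

```shell
# Convert dd's "N bytes transferred in S secs" report to MB/s
# (awk for the floating-point division; figures from the runs above).
awk 'BEGIN { printf "256k read: %.0f MB/s\n", 41943040000 / 63.418068 / 1e6 }'
awk 'BEGIN { printf "1M   read: %.0f MB/s\n", 41943040000 / 38.201376 / 1e6 }'
```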
 

carcass

New Member
Aug 14, 2017
13
2
3
38
So, assuming you cleared the ARC between the write and the read, you got 660MB/s read with a 256k recordsize and 1.1GB/s read with a 1M recordsize?
 

RonanR

Member
Jul 27, 2018
30
0
6
Yes, that's right. As I don't know how to properly clear the ARC, I rebooted before each read test to be sure.
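For what it's worth, a common way to start a read test with a cold cache without a full reboot is to export and re-import the pool (a sketch; "hdd2x8z2" is this thread's pool, and this only works while nothing is using the pool):

```shell
# Exporting a pool evicts its blocks from the ARC;
# re-importing it starts with a cold cache for that pool.
zpool export hdd2x8z2
zpool import hdd2x8z2
```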
 

RonanR

Member
Jul 27, 2018
30
0
6
Thanks for your time. I already tried following Gea's guide and applied the network tuning parameters, without success. I will play with the network parameters on both my server and my client, and also try another 10Gb network card.