How many here run ZFS on Linux and get good fsync performance?

zxv

The more I C, the less I see.
Sep 10, 2017
155
55
28
I took your suggestion and did some research this weekend. Conclusion: ZoL has much better performance on CentOS, but it is still slow.

For comparison, on the same hardware with default settings, ZoL 0.7.9, benchmarked with pg_test_fsync:

Ubuntu 2200 iops
Debian 2000 iops
CentOS 8000 iops

FreeBSD 16000 iops

Ubuntu + XFS 34000 iops
Ubuntu + EXT4 32000 iops
Ubuntu + BcacheFS 14000 iops
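For anyone wanting to reproduce these fsync numbers, a minimal pg_test_fsync invocation looks like the following (the exact command wasn't given in the post; the path and duration here are illustrative):

```shell
# pg_test_fsync ships with PostgreSQL; it reports ops/sec per sync method
# (fdatasync, fsync, open_datasync, ...) against a small test file.
# -s: seconds per test
# -f: put the test file on the filesystem under test (path is an example)
pg_test_fsync -s 5 -f /tank/test/pg_fsync_test.tmp
```

Compare the ops/sec lines for the same sync method across filesystems.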
I'm still searching for a better iSCSI backing store.

Here are some fio benchmarks of zvols on CentOS vs Ubuntu.
These show 4k sequential, 4k random, and 4k mixed random read/write IOPS,
first for a 10GB RAM block device,
and then for a zvol on that same device.
I repeated both on CentOS and Ubuntu.

On CentOS, I rolled back to ZFS v0.6.5.11 due to reports of performance regressions in later versions.

Random 4k write IOPS:
CentOS ramdisk: 2054k zvol: 32.3k
Ubuntu ramdisk: 2535k zvol: 26.2k

CentOS is definitely faster, but still has a ceiling of around 30k random write IOPS.

I'm beginning to think ZFS is too much of a bottleneck.
It looks like no matter how fast my storage is, I can't get more than 30k random write IOPS from a Zvol.

I plan to look at FreeBSD and illumos; however, neither one will do iSER targets.

Code:
centos 7.6 ZFS v0.6.5.11-1 10GB RAM block device
  4k.read:      IOPS=2705k, BW=10.3GiB/s (11.1GB/s)(160GiB/15504msec)
  4k.write:     IOPS=2114k, BW=8256MiB/s (8657MB/s)(160GiB/19845msec)
  4k.randread:  IOPS=2691k, BW=10.3GiB/s (11.0GB/s)(160GiB/15589msec)
  4k.randwrite: IOPS=2054k, BW=8024MiB/s (8414MB/s)(157GiB/20002msec)
  4k.70%read:   IOPS=1545k, BW=6035MiB/s (6328MB/s)(112GiB/19004msec)
  4k.30%write:  IOPS=662k, BW=2586MiB/s (2712MB/s)(47.0GiB/19004msec)
centos 7.6 ZFS v0.6.5.11-1 zvol on 10GB RAM block device
  4k.read:      IOPS=289k, BW=1130MiB/s (1185MB/s)(22.1GiB/20002msec)
  4k.write:     IOPS=145k, BW=565MiB/s (593MB/s)(11.0GiB/20001msec)
  4k.randread:  IOPS=295k, BW=1153MiB/s (1209MB/s)(22.5GiB/20002msec)
  4k.randwrite: IOPS=32.3k, BW=126MiB/s (132MB/s)(2523MiB/20020msec)
  4k.70%read:   IOPS=52.5k, BW=205MiB/s (215MB/s)(4102MiB/20002msec)
  4k.30%write:  IOPS=22.5k, BW=87.9MiB/s (92.2MB/s)(1759MiB/20002msec)
Ubuntu 18.04 ZFS v0.7.5-1ubuntu16.4 10GB ram block device
  4k.read:      IOPS=3653k, BW=13.9GiB/s (14.0GB/s)(160GiB/11482msec)
  4k.write:     IOPS=2730k, BW=10.4GiB/s (11.2GB/s)(160GiB/15365msec)
  4k.randread:  IOPS=3335k, BW=12.7GiB/s (13.7GB/s)(160GiB/12578msec)
  4k.randwrite: IOPS=2535k, BW=9902MiB/s (10.4GB/s)(160GiB/16546msec)
  4k.70%read:   IOPS=2105k, BW=8221MiB/s (8620MB/s)(112GiB/13953msec)
  4k.30%write:  IOPS=902k, BW=3522MiB/s (3693MB/s)(47.0GiB/13953msec)
Ubuntu 18.04 ZFS v0.7.5-1ubuntu16.4 zvol on 10GB ram block device
  4k.read:      IOPS=213k, BW=833MiB/s (873MB/s)(16.3GiB/20003msec)
  4k.write:     IOPS=173k, BW=677MiB/s (710MB/s)(13.2GiB/20002msec)
  4k.randread:  IOPS=149k, BW=580MiB/s (609MB/s)(11.3GiB/20003msec)
  4k.randwrite: IOPS=26.2k, BW=102MiB/s (107MB/s)(2051MiB/20069msec)
  4k.70%read:   IOPS=35.3k, BW=138MiB/s (145MB/s)(2759MiB/20008msec)
  4k.30%write:  IOPS=15.2k, BW=59.2MiB/s (62.1MB/s)(1185MiB/20008msec)
Here's the script that generated the above results.
Code:
dofio () {
    # Run one fio job per access pattern against a device and relabel
    # fio's output lines so the results line up.
    local FIO="fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=16 --runtime=20 --group_reporting --bs=4k"
    local FILE=$1
    shift
    ($FIO "$@" --filename=${FILE} --rw=read                   |sed 's/ read:/4k.read:     /';
     $FIO "$@" --filename=${FILE} --rw=write                  |sed 's/write:/4k.write:    /';
     $FIO "$@" --filename=${FILE} --rw=randread               |sed 's/ read:/4k.randread: /';
     $FIO "$@" --filename=${FILE} --rw=randwrite              |sed 's/write:/4k.randwrite:/';
     $FIO "$@" --filename=${FILE} --rw=randrw --rwmixread=70  |sed 's/ read:/4k.70%read:  /' |
                                                               sed 's/write:/4k.30%write: /' ;
    )| grep IOPS=
}
modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024 ))  # 10GB RAM block device (rd_size is in KiB)
dd if=/dev/zero of=/dev/ram0 bs=1M                    # pre-fault the ramdisk pages
dofio /dev/ram0
P=b
zpool create ${P} /dev/ram0
VOL=volb
SIZE=$(zfs get -H -p -o value available ${P})
BLOCKSIZE=32768
SIZE=$(( ${SIZE} * 80 / ( 100 * ${BLOCKSIZE} ) * ${BLOCKSIZE} ))  # 80% of free space, rounded down to a whole block
sudo zfs create -s -b ${BLOCKSIZE} -V ${SIZE} ${P}/${VOL}         # -s = sparse zvol
dofio /dev/zvol/${P}/${VOL}
zpool destroy ${P}
rmmod brd
 
Reactions: TomUK and gigatexal

zxv

The more I C, the less I see.
Sep 10, 2017
155
55
28
Tried the same thing on Solaris 11.3.
Zvol IOPS are lower than they were on CentOS 7.6 / ZFS 0.6.5.
drat.

Code:
solaris 11.3 8G RAM block device
  4k.read:      io=752176KB, bw=4055.7MB/s, iops=1038.3K, runt= 20380msec
  4k.write:     io=469448KB, bw=2240.3MB/s, iops=573495 , runt= 20317msec
  4k.randread:  io=2855.4MB, bw=3188.4MB/s, iops=816206 , runt= 20166msec
  4k.randwrite: io=4051.5MB, bw=1996.8MB/s, iops=511168 , runt= 20491msec
  4k.70%read:   io=4008.6MB, bw=1831.6MB/s, iops=468863 , runt= 20080msec
  4k.30%write:  io=3474.4MB, bw=803819KB/s, iops=200954 , runt= 20080msec
solaris 11.3 zvol on 8G RAM block device
  4k.read:      io=3364.2MB, bw=3848.9MB/s, iops=985301 , runt= 20030msec
  4k.write:     io=1627.2MB, bw=502717KB/s, iops=125679 , runt= 20001msec
  4k.randread:  io=3773.4MB, bw=2645.8MB/s, iops=677308 , runt= 20004msec
  4k.randwrite: io=2176.6MB, bw=111147KB/s, iops=27786 , runt= 20052msec
  4k.70%read:   io=3159.5MB, bw=161229KB/s, iops=40307 , runt= 20066msec
  4k.30%write:  io=1354.9MB, bw=69140KB/s, iops=17284 , runt= 20066msec
 

efschu2

Member
Feb 14, 2019
68
12
8
proxmox kernel 4.15.18-9 | zfs 0.7.12-1

Code:
10GB ramdisk
  4k.read:      io=163840MB, bw=15796MB/s, iops=4043.9K, runt= 10372msec
  4k.write:     io=163840MB, bw=12245MB/s, iops=3134.8K, runt= 13380msec
  4k.randread:  io=163840MB, bw=14387MB/s, iops=3683.9K, runt= 11388msec
  4k.randwrite: io=163840MB, bw=9229.1MB/s, iops=2362.9K, runt= 17751msec
  4k.70%read:   io=114704MB, bw=8526.1MB/s, iops=2182.9K, runt= 13452msec
  4k.30%write:  io=49136MB, bw=3652.7MB/s, iops=935084, runt= 13452msec
10GB ramdisk zvol
  4k.read:      io=29757MB, bw=1487.8MB/s, iops=380854, runt= 20002msec
  4k.write:     io=16633MB, bw=851565KB/s, iops=212891, runt= 20001msec
  4k.randread:  io=20155MB, bw=1007.7MB/s, iops=257962, runt= 20002msec
  4k.randwrite: io=2770.5MB, bw=141650KB/s, iops=35412, runt= 20028msec
  4k.70%read:   io=3539.2MB, bw=181123KB/s, iops=45280, runt= 20009msec
  4k.30%write:  io=1517.8MB, bw=77672KB/s, iops=19417, runt= 20009msec
 

efschu2

Member
Feb 14, 2019
68
12
8
Has anyone posted this to the ZFS mailing list yet? I know there is a performance hit, but surely it shouldn't be worse by a factor of 10 to 65!?
 

lumdol

Member
Jun 9, 2016
30
2
8
37
This is interesting. Maybe I should hold off on those optanes and let my S3700s get a little longer in the tooth.

Are you using VMs? Would you mind spinning up Darwin with o3x for an additional datapoint?
 
Last edited:

lumdol

Member
Jun 9, 2016
30
2
8
37
I took your suggestion and did some research this weekend. Conclusion: ZoL has much better performance on CentOS, but it is still slow.

For comparison, on the same hardware with default settings, ZoL 0.7.9, benchmarked with pg_test_fsync:

Ubuntu 2200 iops
Debian 2000 iops
CentOS 8000 iops

FreeBSD 16000 iops

Ubuntu + XFS 34000 iops
Ubuntu + EXT4 32000 iops
Ubuntu + BcacheFS 14000 iops
I inquired with the ZFS on Linux folks and got the answer I was expecting:

One would need more data to see what's going on.

But default settings sticks out... if you mean 128k records...if you write 4k random sync writes over 128k records in a pool, you'll incur something like 64:1 IO amplification when RMW happens

you will do RMW at txg commit time for all small zil writes and if that becomes limiting...you may see very different numbers
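The 64:1 figure quoted above can be sanity-checked with a quick back-of-the-envelope loop (a sketch: it assumes each 4k random sync write triggers a full read-modify-write, i.e. the whole block is read once and written once):

```shell
# Approximate I/O amplification for a 4k write landing on a larger block,
# assuming one full-block read plus one full-block write per RMW.
for bs in 4096 8192 32768 131072; do
    echo "volblocksize=${bs}: amplification ~ $(( 2 * bs / 4096 )):1"
done
```

With 128k records that works out to 64:1, matching the quote; even at the 32k volblocksize used in the script earlier in the thread it is still 16:1.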

if you really really want to isolate zil performance...set zfs_txg_timeout to several minutes, raise zfs_dirty_data_max way up, do a bunch of little sync writes and watch zpool iostat -r
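Sketched out, the isolation steps suggested there might look like this (ZoL module parameter paths; the values are illustrative, not recommendations, and require root on a throwaway pool):

```shell
# Stretch the txg commit interval so ZIL writes dominate the measurement window
echo 600 > /sys/module/zfs/parameters/zfs_txg_timeout
# Raise the dirty-data cap so dirty pressure doesn't force an early txg commit
echo $(( 4 * 1024 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_dirty_data_max
# Then run the small sync writes and watch per-vdev request-size histograms
zpool iostat -r b 1
```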

but first thing to do is post your pool property values using zfs get all [pool]
 

efschu2

Member
Feb 14, 2019
68
12
8
...
But default settings sticks out... if you mean 128k records...if you write 4k random sync writes over 128k records in a pool, you'll incur something like 64:1 IO amplification when RMW happens
We are already working with 32k volblocksize:
BLOCKSIZE=32768
Results for 4k random writes (8 jobs, iodepth 16) with recordsize/volblocksize 4k:
Code:
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=/dev/zvol/${P}/${VOL} --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=25.8MiB/s][r=0,w=6600 IOPS][eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=5815: Fri Apr  5 08:25:49 2019
  write: IOPS=7374, BW=28.8MiB/s (30.2MB/s)(289MiB/10021msec)
    slat (nsec): min=1492, max=283778k, avg=21567.99, stdev=1379626.28
    clat (usec): min=34, max=382853, avg=17324.08, stdev=19804.98
     lat (usec): min=90, max=382855, avg=17345.85, stdev=19850.90
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    9], 10.00th=[   10], 20.00th=[   11],
     | 30.00th=[   12], 40.00th=[   14], 50.00th=[   15], 60.00th=[   16],
     | 70.00th=[   17], 80.00th=[   18], 90.00th=[   21], 95.00th=[   29],
     | 99.00th=[  115], 99.50th=[  155], 99.90th=[  268], 99.95th=[  288],
     | 99.99th=[  368]
   bw (  KiB/s): min= 2120, max= 5472, per=12.50%, avg=3688.05, stdev=522.15, samples=160
   iops        : min=  530, max= 1368, avg=921.99, stdev=130.54, samples=160
  lat (usec)   : 50=0.01%, 100=0.04%, 250=0.20%, 500=0.20%, 750=0.10%
  lat (usec)   : 1000=0.08%
  lat (msec)   : 2=0.14%, 4=0.23%, 10=11.65%, 20=76.78%, 50=7.48%
  lat (msec)   : 100=1.82%, 250=1.12%, 500=0.14%
  cpu          : usr=0.33%, sys=0.46%, ctx=15195, majf=0, minf=70
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.8%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,73904,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=28.8MiB/s (30.2MB/s), 28.8MiB/s-28.8MiB/s (30.2MB/s-30.2MB/s), io=289MiB (303MB), run=10021-10021msec
Code:
zfs get all
NAME    PROPERTY              VALUE                 SOURCE
b       type                  filesystem            -
b       creation              Fr Apr  5  8:19 2019  -
b       used                  535M                  -
b       available             345M                  -
b       referenced            96K                   -
b       compressratio         1.00x                 -
b       mounted               yes                   -
b       quota                 none                  default
b       reservation           none                  default
b       recordsize            4K                    local
b       mountpoint            /b                    default
b       sharenfs              off                   default
b       checksum              on                    default
b       compression           off                   default
b       atime                 on                    default
b       devices               on                    default
b       exec                  on                    default
b       setuid                on                    default
b       readonly              off                   default
b       zoned                 off                   default
b       snapdir               hidden                default
b       aclinherit            restricted            default
b       createtxg             1                     -
b       canmount              on                    default
b       xattr                 on                    default
b       copies                1                     default
b       version               5                     -
b       utf8only              off                   -
b       normalization         none                  -
b       casesensitivity       sensitive             -
b       vscan                 off                   default
b       nbmand                off                   default
b       sharesmb              off                   default
b       refquota              none                  default
b       refreservation        none                  default
b       guid                  14752277821102290775  -
b       primarycache          all                   default
b       secondarycache        all                   default
b       usedbysnapshots       0B                    -
b       usedbydataset         96K                   -
b       usedbychildren        534M                  -
b       usedbyrefreservation  0B                    -
b       logbias               latency               default
b       dedup                 off                   default
b       mlslabel              none                  default
b       sync                  standard              default
b       dnodesize             legacy                default
b       refcompressratio      1.00x                 -
b       written               96K                   -
b       logicalused           525M                  -
b       logicalreferenced     40K                   -
b       volmode               default               default
b       filesystem_limit      none                  default
b       snapshot_limit        none                  default
b       filesystem_count      none                  default
b       snapshot_count        none                  default
b       snapdev               hidden                default
b       acltype               off                   default
b       context               none                  default
b       fscontext             none                  default
b       defcontext            none                  default
b       rootcontext           none                  default
b       relatime              off                   default
b       redundant_metadata    all                   default
b       overlay               off                   default
b/volb  type                  volume                -
b/volb  creation              Fr Apr  5  8:21 2019  -
b/volb  used                  531M                  -
b/volb  available             345M                  -
b/volb  referenced            531M                  -
b/volb  compressratio         1.00x                 -
b/volb  reservation           none                  default
b/volb  volsize               704M                  local
b/volb  volblocksize          4K                    -
b/volb  checksum              on                    default
b/volb  compression           off                   default
b/volb  readonly              off                   default
b/volb  createtxg             22                    -
b/volb  copies                1                     default
b/volb  refreservation        none                  default
b/volb  guid                  10441742856011729804  -
b/volb  primarycache          all                   default
b/volb  secondarycache        all                   default
b/volb  usedbysnapshots       0B                    -
b/volb  usedbydataset         531M                  -
b/volb  usedbychildren        0B                    -
b/volb  usedbyrefreservation  0B                    -
b/volb  logbias               latency               default
b/volb  dedup                 off                   default
b/volb  mlslabel              none                  default
b/volb  sync                  standard              default
b/volb  refcompressratio      1.00x                 -
b/volb  written               531M                  -
b/volb  logicalused           524M                  -
b/volb  logicalreferenced     524M                  -
b/volb  volmode               default               default
b/volb  snapshot_limit        none                  default
b/volb  snapshot_count        none                  default
b/volb  snapdev               hidden                default
b/volb  context               none                  default
b/volb  fscontext             none                  default
b/volb  defcontext            none                  default
b/volb  rootcontext           none                  default
b/volb  redundant_metadata    all                   default
So the result gets even worse?

Latency is in the tens of milliseconds for a ramdisk, huh?

Results for /dev/ram0 direct:

Code:
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=/dev/ram0 --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
Jobs: 8 (f=8)
test: (groupid=0, jobs=8): err= 0: pid=15912: Fri Apr  5 10:22:03 2019
  write: IOPS=1524k, BW=5953MiB/s (6243MB/s)(8192MiB/1376msec)
    slat (nsec): min=1522, max=7510.0k, avg=2886.92, stdev=11973.56
    clat (nsec): min=1043, max=11356k, avg=75762.81, stdev=80754.63
     lat (usec): min=2, max=11358, avg=78.75, stdev=81.87
    clat percentiles (usec):
     |  1.00th=[   50],  5.00th=[   72], 10.00th=[   73], 20.00th=[   73],
     | 30.00th=[   74], 40.00th=[   74], 50.00th=[   74], 60.00th=[   75],
     | 70.00th=[   75], 80.00th=[   76], 90.00th=[   76], 95.00th=[   77],
     | 99.00th=[   90], 99.50th=[  145], 99.90th=[  445], 99.95th=[ 1336],
     | 99.99th=[ 2769]
   bw (  KiB/s): min=703640, max=821218, per=12.98%, avg=791497.06, stdev=31242.83, samples=16
   iops        : min=175910, max=205304, avg=197874.19, stdev=7810.66, samples=16
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=1.35%, 100=97.78%
  lat (usec)   : 250=0.71%, 500=0.05%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2=0.06%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=40.03%, sys=57.11%, ctx=781, majf=0, minf=78
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2097152,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=5953MiB/s (6243MB/s), 5953MiB/s-5953MiB/s (6243MB/s-6243MB/s), io=8192MiB (8590MB), run=1376-1376msec

Disk stats (read/write):
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
Results for mdadm RAID0 on /dev/ram0:

Code:
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=/dev/md0 --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
Jobs: 8 (f=8)
test: (groupid=0, jobs=8): err= 0: pid=22816: Fri Apr  5 10:27:30 2019
  write: IOPS=1345k, BW=5254MiB/s (5510MB/s)(8160MiB/1553msec)
    slat (nsec): min=1974, max=6919.2k, avg=3371.10, stdev=16099.68
    clat (nsec): min=1209, max=7531.3k, avg=84376.29, stdev=78147.61
     lat (usec): min=4, max=7534, avg=87.85, stdev=79.94
    clat percentiles (usec):
     |  1.00th=[   67],  5.00th=[   79], 10.00th=[   80], 20.00th=[   80],
     | 30.00th=[   81], 40.00th=[   81], 50.00th=[   81], 60.00th=[   82],
     | 70.00th=[   82], 80.00th=[   83], 90.00th=[   84], 95.00th=[   88],
     | 99.00th=[  105], 99.50th=[  178], 99.90th=[ 1188], 99.95th=[ 1369],
     | 99.99th=[ 3261]
   bw (  KiB/s): min=608088, max=746600, per=13.21%, avg=710512.42, stdev=36226.05, samples=19
   iops        : min=152024, max=186650, avg=177628.21, stdev=9056.20, samples=19
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=98.50%
  lat (usec)   : 250=1.19%, 500=0.14%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2=0.12%, 4=0.02%, 10=0.01%
  cpu          : usr=35.58%, sys=60.65%, ctx=1085, majf=0, minf=74
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2088960,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=5254MiB/s (5510MB/s), 5254MiB/s-5254MiB/s (5510MB/s-5510MB/s), io=8160MiB (8556MB), run=1553-1553msec

Disk stats (read/write):
    md0: ios=83/1984981, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
Btrfs on a file on the ramdisk:

Code:
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=./tmp -size=500m --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
test: Laying out IO file (1 file / 500MiB)
Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=54.2MiB/s][r=0,w=13.9k IOPS][eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=4675: Fri Apr  5 11:12:57 2019
  write: IOPS=37.9k, BW=148MiB/s (155MB/s)(1482MiB/10020msec)
    slat (usec): min=8, max=223277, avg=207.46, stdev=1660.49
    clat (usec): min=3, max=226527, avg=3169.97, stdev=6470.60
     lat (usec): min=194, max=227086, avg=3377.84, stdev=6681.59
    clat percentiles (usec):
     |  1.00th=[   285],  5.00th=[   791], 10.00th=[  1205], 20.00th=[  1582],
     | 30.00th=[  1844], 40.00th=[  2089], 50.00th=[  2343], 60.00th=[  2638],
     | 70.00th=[  3032], 80.00th=[  3556], 90.00th=[  4555], 95.00th=[  6521],
     | 99.00th=[ 15401], 99.50th=[ 33424], 99.90th=[113771], 99.95th=[120062],
     | 99.99th=[223347]
   bw (  KiB/s): min= 3840, max=30824, per=12.52%, avg=18961.75, stdev=6346.25, samples=160
   iops        : min=  960, max= 7706, avg=4740.44, stdev=1586.56, samples=160
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 250=0.13%, 500=3.14%
  lat (usec)   : 750=1.40%, 1000=2.40%
  lat (msec)   : 2=29.50%, 4=49.08%, 10=11.92%, 20=1.81%, 50=0.27%
  lat (msec)   : 100=0.24%, 250=0.10%
  cpu          : usr=1.80%, sys=19.06%, ctx=928477, majf=0, minf=89
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,379363,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=148MiB/s (155MB/s), 148MiB/s-148MiB/s (155MB/s-155MB/s), io=1482MiB (1554MB), run=10020-10020msec
ZFS IOPS=7,374
BTRFS filewrite IOPS=37,900
MDADM IOPS=1,345,000
DIRECT IOPS=1,524,000

Well, googling around tells me ZFS random write performance has always been this poor. It seems no work was ever done to improve this behaviour, or even to find out what is causing this huge bottleneck. I may have to switch to something else for databases.

Another weird thing: I create the vol as follows:
sudo zfs create -o primarycache=none -o secondarycache=none -o compression=off -s -b ${BLOCKSIZE} -V ${SIZE} ${P}/${VOL}

Then I run fio and watch perf top.

The functions eating the most CPU are lz4 compress and decompress, yet compression is off?
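One thing worth checking (an assumption on my part, not confirmed in the thread): ZFS always lz4-compresses certain metadata, and ZoL 0.7+ keeps ARC buffers compressed by default regardless of the dataset's compression property, which could explain lz4 showing up in perf top:

```shell
# 1 = compressed ARC enabled (the default on ZoL 0.7+)
cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
# Confirm the dataset-level setting really is off
zfs get compression b/volb
```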
 
Last edited:

lumdol

Member
Jun 9, 2016
30
2
8
37
He said defaults... defaults would be 128k recordsize.

Regardless, you are going to take a performance hit using ZFS any way you shake it, but it's not this extreme unless you don't understand how to tune it to your workload.

Since that’s a pretty wild performance hit it likely has to do with something in your tuning rather than a problem inherent in ZFS.

While he’s provided some info, he’s also run so many different configurations that it’s hard to get a handle on it and sort the wheat from the chaff.

Attacking it with performance scripts without making an effort to understand what’s going on more deeply is a flawed approach.

Reading the performance tuning docs for database workloads would be a good place to start: Performance tuning - OpenZFS
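As a starting point from those docs, a database-oriented dataset is usually tuned along these lines (a sketch; the pool/dataset names are examples, and the right values depend on the database's page size and caching behaviour):

```shell
# Match recordsize to the database page size (8K for PostgreSQL, 16K for InnoDB)
zfs set recordsize=8K tank/pgdata
# Let the database do its own data caching; keep ARC for metadata
zfs set primarycache=metadata tank/pgdata
# Trade ZIL indirection for throughput on large flushes (workload-dependent)
zfs set logbias=throughput tank/pgdata
# atime updates are pure overhead for a database
zfs set atime=off tank/pgdata
```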

What’s the configuration / sector size of the SSD? Are you using VMs for testing? Are you writing straight to the pool?

Try differently configured zvols and see where that gets you. Tweaking parameters will likely change your results for the better, pointing to user error rather than filesystem limitations.

Saying ZFS has poor random write performance as a blanket statement is missing the boat / the point entirely.

Take a look here at darkbasic’s response on different configurations: https://www.reddit.com/r/zfs/comments/8m1pxj/which_ones_are_good_sector_and_record_sizes_for/

I wish I could take a closer look myself however I’m in the middle of rehousing my zfs array in a new diy rack so I’m unable to provide any comparison data.

If it is some kind of bug it would be incredibly useful to share this with those who can best evaluate.

With the proliferation of NVMe ssds this is of huge relevance for present and future zfs builds.

Don’t waste your time “googling around”: go straight to the source. I’d advise joining the openzfslinux channel on Freenode IRC and posting your information where the maintainers can see what’s going on.

Mailing lists don’t have the benefit of real-time troubleshooting.

They can likely get to the bottom of this in a fraction of the time it will take from your own research and STH feedback.

Then please post your findings back here so it can be of use to others.

I’d put good money on it not being a limitation of ZFS.
 
Last edited:

zxv

The more I C, the less I see.
Sep 10, 2017
155
55
28
But default settings sticks out... if you mean 128k records...if you write 4k random sync writes over 128k records in a pool, you'll incur something like 64:1 IO amplification when RMW happens
Does this write amplification happen to the same extent when the vdevs are composed of mirrored pairs?

I had been using a 32K blocksize because I had not been able to measure a significant difference in IOPS for smaller blocksizes, *and* I want to use a relatively large L2ARC. The larger recordsize means less RAM consumed by pointers into L2ARC (e.g. L2ARC size × 70 bytes / recordsize = RAM consumed).
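To put numbers on that formula (a sketch, using the ~70-bytes-per-record header figure from above and a hypothetical 1 TiB L2ARC):

```shell
# RAM consumed by L2ARC headers: one ~70-byte ARC reference per cached record
l2arc=$(( 1024 * 1024 * 1024 * 1024 ))  # 1 TiB of L2ARC, as an example
for rs in 8192 32768; do
    echo "recordsize=${rs}: ~$(( l2arc / rs * 70 / 1024 / 1024 )) MiB of RAM for headers"
done
```

That works out to roughly 8.75 GiB of RAM at 8K versus about 2.2 GiB at 32K, i.e. the 1/4th reduction noted below.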

So I did a simple benchmark comparing 8K and 32K zvol blocksizes in my setup.
zfs create -o volmode=dev -V 100G -b 8K a/a
zfs create -o volmode=dev -V 100G -b 32K a/b

sequential 1MB 100% read and separately 100% write

8K: read: 4302MiB/s write: 2633MiB/s
32K: read: 9290MiB/s write: 5281MiB/s

random 4K 100% read and separately 100% write
8K: read: IOPS=185k lat 344.65 usec write: IOPS=59.4k lat 1073.97 usec
32K: read: IOPS=166k lat 384.39 usec write: IOPS=37.0k lat 1726.10 usec


This is primarily measuring the ARC, but that's my primary use case. In other discussions here we've been looking at the limitations of ZFS itself by benchmarking it on a ramdisk, and that has been helpful so far.

The sequential read and write rates double when blocksize increases from 8K to 32K.
Also, the RAM consumed by L2ARC pointers is reduced to 1/4th when blocksize increases from 8K to 32K.

I would call the read IOPS fairly close because there is significant variance in results from run to run with ZFS, even when rebuilding the dataset before each run.

The write IOPS increase 60% when blocksize is reduced from 32k to 8k.
The write latency is reduced ~40% when blocksize is reduced from 32k to 8k.

I imagine that the zvol default blocksize is 8K for some good reasons, though it's good to be aware of the tradeoffs. This suggests that sequential read/write bandwidth and L2ARC size issues are some of the tradeoffs.

sudo zfs create -o volmode=dev -V 100G -b 8K a/a
dofio /dev/zvol/a/a
________________________________ 1Mbyte sequential ________________
1m.read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 4 (f=4): [R(4)][100.0%][r=2858MiB/s][r=2857 IOPS][eta 00m:00s]
1m.read: (groupid=0, jobs=4): err= 0: pid=14011: Fri Apr 5 23:36:07 2019
read: IOPS=4301, BW=4302MiB/s (4511MB/s)(16.9GiB/4019msec)
slat (usec): min=37, max=7335, avg=100.71, stdev=281.25
clat (usec): min=2039, max=44517, avg=14743.68, stdev=5961.19
lat (usec): min=2600, max=44567, avg=14844.95, stdev=5946.86
clat percentiles (usec):
| 1.00th=[ 5735], 5.00th=[ 7308], 10.00th=[ 8291], 20.00th=[ 9765],
| 30.00th=[10814], 40.00th=[11994], 50.00th=[13173], 60.00th=[14746],
| 70.00th=[16909], 80.00th=[20055], 90.00th=[23725], 95.00th=[26346],
| 99.00th=[30802], 99.50th=[32637], 99.90th=[36963], 99.95th=[38536],
| 99.99th=[41681]
bw ( MiB/s): min= 2682, max= 5810, per=100.00%, avg=4304.86, stdev=304.90, samples=32
iops : min= 2682, max= 5810, avg=4304.25, stdev=304.87, samples=32
lat (msec) : 4=0.10%, 10=22.19%, 20=57.59%, 50=20.12%
cpu : usr=0.55%, sys=7.40%, ctx=18917, majf=0, minf=16517
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=99.7%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=17289,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=4302MiB/s (4511MB/s), 4302MiB/s-4302MiB/s (4511MB/s-4511MB/s), io=16.9GiB (18.1GB), run=4019-4019msec
1m.write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 4 (f=4): [W(4)][100.0%][w=2845MiB/s][w=2844 IOPS][eta 00m:00s]
1m.write: (groupid=0, jobs=4): err= 0: pid=14319: Fri Apr 5 23:36:11 2019
write: IOPS=2633, BW=2633MiB/s (2761MB/s)(10.3GiB/4003msec)
slat (usec): min=30, max=257658, avg=897.37, stdev=5344.39
clat (usec): min=4, max=329963, avg=23028.94, stdev=39215.30
lat (usec): min=334, max=373260, avg=23927.49, stdev=40486.02
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 7], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11],
| 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 81], 95.00th=[ 111],
| 99.00th=[ 201], 99.50th=[ 224], 99.90th=[ 326], 99.95th=[ 330],
| 99.99th=[ 330]
bw ( MiB/s): min= 476, max= 5198, per=100.00%, avg=2749.75, stdev=379.37, samples=29
iops : min= 476, max= 5198, avg=2749.75, stdev=379.37, samples=29
lat (usec) : 10=0.02%, 250=0.01%, 750=0.02%, 1000=0.02%
lat (msec) : 2=0.05%, 4=0.22%, 10=50.21%, 20=35.48%, 50=2.03%
lat (msec) : 100=4.98%, 250=6.63%, 500=0.33%
cpu : usr=17.12%, sys=4.17%, ctx=11879, majf=0, minf=124
IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=99.4%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10541,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=2633MiB/s (2761MB/s), 2633MiB/s-2633MiB/s (2761MB/s-2761MB/s), io=10.3GiB (11.1GB), run=4003-4003msec
________________________________ 4k random ________________________
4k.read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=766MiB/s][r=196k IOPS][eta 00m:00s]
4k.read: (groupid=0, jobs=4): err= 0: pid=18096: Fri Apr 5 23:36:17 2019
read: IOPS=185k, BW=722MiB/s (757MB/s)(2889MiB/4001msec)
slat (usec): min=2, max=16288, avg=13.16, stdev=78.85
clat (usec): min=3, max=42416, avg=331.19, stdev=896.48
lat (usec): min=27, max=42421, avg=344.65, stdev=903.65
clat percentiles (usec):
| 1.00th=[ 77], 5.00th=[ 103], 10.00th=[ 120], 20.00th=[ 143],
| 30.00th=[ 161], 40.00th=[ 178], 50.00th=[ 196], 60.00th=[ 219],
| 70.00th=[ 243], 80.00th=[ 297], 90.00th=[ 494], 95.00th=[ 742],
| 99.00th=[ 2606], 99.50th=[ 5932], 99.90th=[14484], 99.95th=[17171],
| 99.99th=[23462]
bw ( KiB/s): min=609656, max=851608, per=99.63%, avg=736620.86, stdev=20860.12, samples=28
iops : min=152416, max=212902, avg=184155.43, stdev=5214.83, samples=28
lat (usec) : 4=0.01%, 50=0.10%, 100=4.31%, 250=67.74%, 500=17.97%
lat (usec) : 750=4.94%, 1000=2.06%
lat (msec) : 2=1.64%, 4=0.50%, 10=0.47%, 20=0.22%, 50=0.02%
cpu : usr=9.55%, sys=41.26%, ctx=186161, majf=0, minf=201
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=739553,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=722MiB/s (757MB/s), 722MiB/s-722MiB/s (757MB/s-757MB/s), io=2889MiB (3029MB), run=4001-4001msec
4k.write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 1 (f=1): [_(2),f(1),_(1)][100.0%][w=202MiB/s][w=51.6k IOPS][eta 00m:00s]
4k.write: (groupid=0, jobs=4): err= 0: pid=18349: Fri Apr 5 23:36:27 2019
write: IOPS=59.4k, BW=232MiB/s (243MB/s)(1857MiB/8001msec)
slat (usec): min=2, max=4781, avg=15.87, stdev=49.03
clat (usec): min=14, max=41808, avg=1057.87, stdev=2704.93
lat (usec): min=34, max=41814, avg=1073.97, stdev=2704.12
clat percentiles (usec):
| 1.00th=[ 106], 5.00th=[ 147], 10.00th=[ 182], 20.00th=[ 233],
| 30.00th=[ 273], 40.00th=[ 310], 50.00th=[ 367], 60.00th=[ 424],
| 70.00th=[ 510], 80.00th=[ 685], 90.00th=[ 1450], 95.00th=[ 4752],
| 99.00th=[15664], 99.50th=[18744], 99.90th=[24511], 99.95th=[27132],
| 99.99th=[34341]
bw ( KiB/s): min=18144, max=510464, per=92.52%, avg=219915.87, stdev=45467.56, samples=60
iops : min= 4536, max=127616, avg=54978.73, stdev=11366.85, samples=60
lat (usec) : 20=0.01%, 50=0.01%, 100=0.73%, 250=23.42%, 500=44.93%
lat (usec) : 750=13.04%, 1000=4.47%
lat (msec) : 2=5.54%, 4=2.43%, 10=2.78%, 20=2.32%, 50=0.34%
cpu : usr=5.23%, sys=11.57%, ctx=198943, majf=0, minf=121
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,475444,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s (243MB/s-243MB/s), io=1857MiB (1947MB), run=8001-8001msec
________________________________ 4k random mixed 70% read / 30% write
4k.70%read: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 1 (f=1): [_(2),f(1),_(1)][100.0%][r=160MiB/s,w=68.4MiB/s][r=41.0k,w=17.5k IOPS][eta 00m:00s]
4k.70%read: (groupid=0, jobs=4): err= 0: pid=22845: Fri Apr 5 23:36:41 2019
read: IOPS=64.8k, BW=253MiB/s (265MB/s)(3037MiB/12001msec)
slat (usec): min=2, max=13564, avg=15.13, stdev=61.43
clat (usec): min=13, max=42503, avg=563.64, stdev=1124.28
lat (usec): min=22, max=42512, avg=579.00, stdev=1126.50
clat percentiles (usec):
| 1.00th=[ 84], 5.00th=[ 117], 10.00th=[ 139], 20.00th=[ 169],
| 30.00th=[ 198], 40.00th=[ 229], 50.00th=[ 260], 60.00th=[ 293],
| 70.00th=[ 334], 80.00th=[ 498], 90.00th=[ 1287], 95.00th=[ 1876],
| 99.00th=[ 5997], 99.50th=[ 7963], 99.90th=[12256], 99.95th=[14615],
| 99.99th=[19530]
bw ( KiB/s): min=41408, max=545320, per=98.67%, avg=255686.13, stdev=30975.45, samples=92
iops : min=10352, max=136330, avg=63921.52, stdev=7743.90, samples=92
write: IOPS=27.7k, BW=108MiB/s (114MB/s)(1300MiB/12001msec)
slat (usec): min=2, max=8910, avg=15.01, stdev=56.57
clat (usec): min=19, max=32948, avg=933.75, stdev=2143.72
lat (usec): min=38, max=32953, avg=948.98, stdev=2143.43
clat percentiles (usec):
| 1.00th=[ 98], 5.00th=[ 131], 10.00th=[ 153], 20.00th=[ 184],
| 30.00th=[ 212], 40.00th=[ 243], 50.00th=[ 273], 60.00th=[ 306],
| 70.00th=[ 351], 80.00th=[ 529], 90.00th=[ 2966], 95.00th=[ 5080],
| 99.00th=[11207], 99.50th=[13566], 99.90th=[19006], 99.95th=[21365],
| 99.99th=[23987]
bw ( KiB/s): min=18736, max=234696, per=98.69%, avg=109440.61, stdev=13205.45, samples=92
iops : min= 4684, max=58674, avg=27360.04, stdev=3301.37, samples=92
lat (usec) : 20=0.01%, 50=0.06%, 100=2.07%, 250=43.77%, 500=33.79%
lat (usec) : 750=5.69%, 1000=2.21%
lat (msec) : 2=5.94%, 4=2.94%, 10=2.95%, 20=0.55%, 50=0.03%
cpu : usr=5.95%, sys=25.68%, ctx=277519, majf=0, minf=133
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=777468,332711,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=253MiB/s (265MB/s), 253MiB/s-253MiB/s (265MB/s-265MB/s), io=3037MiB (3185MB), run=12001-12001msec
WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=1300MiB (1363MB), run=12001-12001msec

================================================================
sudo zfs create -o volmode=dev -V 100G -b 32K a/a
dofio /dev/zvol/a/a
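For anyone trying to replicate: the `dofio` script itself isn't posted, but the parameters visible in the output below (4 jobs, libaio, queue depth 16, group reporting, time-based runs) suggest a job file roughly like this. The section names, runtimes, and target path are assumptions inferred from the output, not the author's actual script:

```ini
; Illustrative fio job file only -- reconstructed from the output, not the real "dofio".
[global]
filename=/dev/zvol/a/a   ; or the raw ram device for the baseline run
direct=1
ioengine=libaio
iodepth=16
numjobs=4
group_reporting=1
time_based=1

[4k.randwrite]
rw=randwrite
bs=4k
runtime=8

[4k.70%read]
stonewall                ; run after the previous section finishes
rw=randrw
rwmixread=70
bs=4k
runtime=12
```

The sequential 1M and 4k random-read sections would follow the same pattern with `rw=read`/`rw=randread` and shorter runtimes.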
________________________________ 1Mbyte sequential ________________
1m.read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 4 (f=4): [R(4)][100.0%][r=9213MiB/s][r=9213 IOPS][eta 00m:00s]
1m.read: (groupid=0, jobs=4): err= 0: pid=56186: Fri Apr 5 23:36:55 2019
read: IOPS=9290, BW=9290MiB/s (9742MB/s)(36.3GiB/4001msec)
slat (usec): min=38, max=10755, avg=345.08, stdev=519.87
clat (usec): min=4, max=43494, avg=6527.21, stdev=2975.18
lat (usec): min=793, max=43622, avg=6873.40, stdev=3007.41
clat percentiles (usec):
| 1.00th=[ 1598], 5.00th=[ 2868], 10.00th=[ 3589], 20.00th=[ 4424],
| 30.00th=[ 5014], 40.00th=[ 5538], 50.00th=[ 6063], 60.00th=[ 6652],
| 70.00th=[ 7308], 80.00th=[ 8160], 90.00th=[ 9634], 95.00th=[11469],
| 99.00th=[17957], 99.50th=[20841], 99.90th=[26608], 99.95th=[28443],
| 99.99th=[32637]
bw ( MiB/s): min= 8070, max=11226, per=99.89%, avg=9279.79, stdev=267.18, samples=29
iops : min= 8070, max=11226, avg=9279.79, stdev=267.18, samples=29
lat (usec) : 10=0.01%, 500=0.01%, 750=0.01%, 1000=0.10%
lat (msec) : 2=1.73%, 4=12.50%, 10=77.12%, 20=7.88%, 50=0.65%
cpu : usr=0.95%, sys=18.89%, ctx=22059, majf=0, minf=16519
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.8%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=37171,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=9290MiB/s (9742MB/s), 9290MiB/s-9290MiB/s (9742MB/s-9742MB/s), io=36.3GiB (38.0GB), run=4001-4001msec
1m.write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 1 (f=1): [_(2),f(1),_(1)][100.0%][eta 00m:00s]
1m.write: (groupid=0, jobs=4): err= 0: pid=56438: Fri Apr 5 23:37:03 2019
write: IOPS=5280, BW=5281MiB/s (5537MB/s)(20.6GiB/4001msec)
slat (usec): min=29, max=101635, avg=210.60, stdev=1231.85
clat (usec): min=6, max=194528, avg=11561.39, stdev=12941.80
lat (usec): min=217, max=194579, avg=11772.70, stdev=13102.20
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 5], 10.00th=[ 6], 20.00th=[ 6],
| 30.00th=[ 6], 40.00th=[ 7], 50.00th=[ 8], 60.00th=[ 10],
| 70.00th=[ 12], 80.00th=[ 15], 90.00th=[ 21], 95.00th=[ 27],
| 99.00th=[ 86], 99.50th=[ 97], 99.90th=[ 122], 99.95th=[ 127],
| 99.99th=[ 136]
bw ( MiB/s): min= 2338, max= 8960, per=100.00%, avg=5288.12, stdev=585.37, samples=31
iops : min= 2338, max= 8960, avg=5288.09, stdev=585.37, samples=31
lat (usec) : 10=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.03%, 4=0.12%, 10=63.50%, 20=26.40%, 50=7.56%
lat (msec) : 100=1.98%, 250=0.38%
cpu : usr=41.58%, sys=8.22%, ctx=13122, majf=0, minf=127
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=99.7%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,21129,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=5281MiB/s (5537MB/s), 5281MiB/s-5281MiB/s (5537MB/s-5537MB/s), io=20.6GiB (22.2GB), run=4001-4001msec
________________________________ 4k random ________________________
4k.read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=672MiB/s][r=172k IOPS][eta 00m:00s]
4k.read: (groupid=0, jobs=4): err= 0: pid=3698: Fri Apr 5 23:37:07 2019
read: IOPS=166k, BW=648MiB/s (679MB/s)(2592MiB/4001msec)
slat (usec): min=2, max=12791, avg=18.16, stdev=64.10
clat (usec): min=4, max=69533, avg=366.00, stdev=515.18
lat (usec): min=29, max=69546, avg=384.39, stdev=525.86
clat percentiles (usec):
| 1.00th=[ 73], 5.00th=[ 99], 10.00th=[ 118], 20.00th=[ 143],
| 30.00th=[ 163], 40.00th=[ 182], 50.00th=[ 202], 60.00th=[ 231],
| 70.00th=[ 302], 80.00th=[ 490], 90.00th=[ 807], 95.00th=[ 1106],
| 99.00th=[ 2180], 99.50th=[ 3556], 99.90th=[ 5800], 99.95th=[ 6652],
| 99.99th=[10683]
bw ( KiB/s): min=458444, max=854128, per=100.00%, avg=670085.57, stdev=32693.03, samples=28
iops : min=114613, max=213532, avg=167521.43, stdev=8173.18, samples=28
lat (usec) : 10=0.01%, 50=0.07%, 100=5.12%, 250=58.94%, 500=16.37%
lat (usec) : 750=8.10%, 1000=5.06%
lat (msec) : 2=5.17%, 4=0.78%, 10=0.38%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=8.75%, sys=40.67%, ctx=136195, majf=0, minf=201
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=663663,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=648MiB/s (679MB/s), 648MiB/s-648MiB/s (679MB/s-679MB/s), io=2592MiB (2718MB), run=4001-4001msec
4k.write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 1 (f=1): [f(1),_(3)][100.0%][eta 00m:00s]
4k.write: (groupid=0, jobs=4): err= 0: pid=4114: Fri Apr 5 23:37:19 2019
write: IOPS=37.0k, BW=145MiB/s (152MB/s)(1157MiB/8003msec)
slat (usec): min=2, max=4109, avg=11.47, stdev=31.53
clat (usec): min=34, max=24698, avg=1714.37, stdev=1922.23
lat (usec): min=40, max=24705, avg=1726.10, stdev=1920.39
clat percentiles (usec):
| 1.00th=[ 115], 5.00th=[ 161], 10.00th=[ 225], 20.00th=[ 363],
| 30.00th=[ 570], 40.00th=[ 840], 50.00th=[ 1172], 60.00th=[ 1565],
| 70.00th=[ 2278], 80.00th=[ 2769], 90.00th=[ 3294], 95.00th=[ 4621],
| 99.00th=[ 8717], 99.50th=[13435], 99.90th=[19530], 99.95th=[21103],
| 99.99th=[23200]
bw ( KiB/s): min=32136, max=510336, per=100.00%, avg=148020.12, stdev=27407.49, samples=64
iops : min= 8034, max=127586, avg=37005.00, stdev=6852.07, samples=64
lat (usec) : 50=0.01%, 100=0.49%, 250=11.60%, 500=14.89%, 750=9.96%
lat (usec) : 1000=8.28%
lat (msec) : 2=21.74%, 4=26.07%, 10=6.18%, 20=0.70%, 50=0.09%
cpu : usr=3.93%, sys=8.35%, ctx=161763, majf=0, minf=121
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,296118,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=145MiB/s (152MB/s), 145MiB/s-145MiB/s (152MB/s-152MB/s), io=1157MiB (1213MB), run=8003-8003msec
________________________________ 4k random mixed 70% read / 30% write
4k.70%read: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.13-27-gef32d
Starting 4 processes
Jobs: 1 (f=1): [_(1),f(1),_(2)][100.0%][eta 00m:00s]
4k.70%read: (groupid=0, jobs=4): err= 0: pid=2044: Fri Apr 5 23:37:34 2019
read: IOPS=74.1k, BW=290MiB/s (304MB/s)(3477MiB/12010msec)
slat (usec): min=2, max=13029, avg=15.18, stdev=37.99
clat (usec): min=20, max=100428, avg=492.96, stdev=1060.50
lat (usec): min=27, max=100431, avg=508.34, stdev=1060.10
clat percentiles (usec):
| 1.00th=[ 117], 5.00th=[ 180], 10.00th=[ 212], 20.00th=[ 255],
| 30.00th=[ 285], 40.00th=[ 302], 50.00th=[ 314], 60.00th=[ 330],
| 70.00th=[ 367], 80.00th=[ 545], 90.00th=[ 889], 95.00th=[ 1205],
| 99.00th=[ 3195], 99.50th=[ 5211], 99.90th=[10814], 99.95th=[13304],
| 99.99th=[22938]
bw ( KiB/s): min=102840, max=520282, per=100.00%, avg=296655.00, stdev=35753.32, samples=96
iops : min=25710, max=130069, avg=74163.67, stdev=8938.30, samples=96
write: IOPS=31.7k, BW=124MiB/s (130MB/s)(1489MiB/12010msec)
slat (usec): min=2, max=3165, avg=14.85, stdev=15.94
clat (usec): min=38, max=100971, avg=808.63, stdev=1648.50
lat (usec): min=43, max=100977, avg=823.68, stdev=1646.81
clat percentiles (usec):
| 1.00th=[ 131], 5.00th=[ 196], 10.00th=[ 227], 20.00th=[ 273],
| 30.00th=[ 297], 40.00th=[ 310], 50.00th=[ 322], 60.00th=[ 343],
| 70.00th=[ 433], 80.00th=[ 1045], 90.00th=[ 1860], 95.00th=[ 2606],
| 99.00th=[ 7963], 99.50th=[11207], 99.90th=[18220], 99.95th=[21627],
| 99.99th=[31327]
bw ( KiB/s): min=43720, max=223567, per=100.00%, avg=127045.42, stdev=15368.37, samples=96
iops : min=10930, max=55891, avg=31761.25, stdev=3842.05, samples=96
lat (usec) : 50=0.01%, 100=0.45%, 250=16.97%, 500=59.04%, 750=7.12%
lat (usec) : 1000=4.71%
lat (msec) : 2=7.92%, 4=2.56%, 10=0.94%, 20=0.26%, 50=0.02%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=6.90%, sys=38.54%, ctx=230502, majf=0, minf=133
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=890123,381228,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=290MiB/s (304MB/s), 290MiB/s-290MiB/s (304MB/s-304MB/s), io=3477MiB (3646MB), run=12010-12010msec
WRITE: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=1489MiB (1562MB), run=12010-12010msec
 

zxv

The more I C, the less I see.
Sep 10, 2017
155
55
28
@efschu2 and I have reported results here for ZFS on a ramdisk that show a large disparity between ZFS and other filesystems for this *particular* scenario. We're trying to understand and discuss the results, methods of measuring them, and methods of tuning the systems. And we are sharing what we have as we learn.

I think some of the misunderstanding about @efschu2's results could be cleared up by looking at answers to your questions in his previous posts. Some of mine are closely related. And I understand that takes time, so no big deal!

We are not trying to be critical of ZFS. What we do want to do is understand why "ZFS has poor random write performance" relative to other filesystems for this specific scenario, and we are sharing what we have and how we generate it, to help each other.

Those are great links, and they have been helpful, but after going over them many times, well, here we are, still trying to adapt benchmarks and methods to understand the trade-offs discussed in them.

We have indeed been talking about taking parts of the discussion to IRC, but honestly, this forum has been far easier to use in terms of bookmarking threads or specific posts that attempt to summarize or script specific scenarios. There's nothing wrong with having a long write-up here, and taking a more interactive discussion to IRC.
 

lumdol

Member
Jun 9, 2016
30
2
8
37
Sorry I’ve been short on time lately and am unable to review in detail.

I realized too late that the OP has disappeared and you both have taken up the original convo. And then it went from the Optane PCIe drive to ramdisk and so forth.

What I'm saying is that, from my understanding, such a stark contrast in performance is not explained by the inherent differences between the filesystems.

And it seems notable enough to bring to the maintainers.

Perhaps you may be missing something they can bring to light. Or vice versa.

Maybe something like this: ZFS and VSP | Hitachi Vantara Community

Oracles dtrace tutorial / examples will likely provide some insight: Tutorial: DTrace by Example

As I said I’ve been unable to form a complete picture or attempt to reproduce myself.
 

zxv

The more I C, the less I see.
Sep 10, 2017
155
55
28
Hey, no worries. I just reactivated my irccloud account and got on #openzfs, so hopefully we'll be able to discuss there too.

Zfs does some wicked smart stuff under the hood, and given the resiliency, I personally find it easy to accept that there's some performance cost for the level of durability and resistance to silent data loss.

Like @efschu2, one of the things I'm also looking at is mdadm vs zvols for block storage on flash arrays, and even if there are performance differences, there are so many advantages to ZFS, that it's worth the effort to learn more about tuning it.
 

efschu2

Member
Feb 14, 2019
68
12
8
Two more things. First, I've retested with sync disabled, and it does not increase random IOPS on the ramdisk. Second, thinking about these tests: we are using a RAM (random access memory) device, so it should not make any difference whether we do sequential or random I/O. I'm out of ideas and am off on vacation for now. I will keep reading, but I have no machine to validate on until next week. Good luck.
 

zxv

The more I C, the less I see.
Sep 10, 2017
155
55
28
Two more things. First, I've retested with sync disabled, and it does not increase random IOPS on the ramdisk. Second, thinking about these tests: we are using a RAM (random access memory) device, so it should not make any difference whether we do sequential or random I/O. I'm out of ideas and am off on vacation for now. I will keep reading, but I have no machine to validate on until next week. Good luck.
I'm still looking at this, and I agree - low IOPS from ZFS on a ramdisk is counterintuitive. Either way, ramdisk results may be easier to replicate and compare.

I searched again for performance troubleshooting in blogs, etc., and the closest I could find was a discussion from the authors of the OpenZFS write throttle. They used dtrace:
Tuning the OpenZFS write throttle | Delphix
Adam Leventhal's blog » The OpenZFS write throttle
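For reference, the tunables those write-throttle articles discuss surface as ZFS-on-Linux module parameters. The values below are purely illustrative, not recommendations:

```
# /etc/modprobe.d/zfs.conf -- illustrative values only, defaults vary by release
options zfs zfs_dirty_data_max=4294967296      # cap on dirty (unwritten) data, in bytes
options zfs zfs_delay_min_dirty_percent=60     # dirty level at which write delay kicks in
options zfs zfs_delay_scale=500000             # steepness of the delay curve
```

They can also be changed at runtime by writing to `/sys/module/zfs/parameters/`, which is handier for A/B testing than reloading the module.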

Brendan Gregg seems to be switching to bpftrace:
Learn eBPF Tracing: Tutorial and Examples

I'm beginning to use bpftrace, in hopes that it will let us inspect things like the write queues and the effects of write throttling. However, iovisor's zfs examples are not generating any output, so it will take more time to get up to speed on it.

https://raw.githubusercontent.com/iovisor/bcc/master/tools/zfsdist.py
https://raw.githubusercontent.com/iovisor/bcc/master/tools/zfsslower.py
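While the bcc tools stay quiet, a bpftrace one-liner in the same spirit as zfsdist may be worth trying. This assumes the kernel exposes a probeable `zfs_write` symbol from the zfs module, which varies by ZoL version and how the module was built:

```
# Latency histogram for zfs_write() calls, in microseconds.
# Assumption: zfs_write is visible to kprobes on this kernel/module combo.
sudo bpftrace -e '
kprobe:zfs_write { @start[tid] = nsecs; }
kretprobe:zfs_write /@start[tid]/ {
    @us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'
```

If that attaches but prints nothing during a fio run, the I/O is likely entering ZFS through a different path (e.g. the zvol request queue), which would itself be a useful data point.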

I also took a step back, re-tested all my devices (e.g. smartctl -t long), and ran fio benchmarks to ensure all devices are performing consistently.

I'm still working on this. I'm also on #zfsonlinux and #openzfs on freenode if anyone wants to chat.
 