Nope, but with the help of a (cheap) RDMA/InfiniBand adapter that supports SR-IOV plus NVMe-oF you end up with something very close: a part of the 900p exposed as a 'native' NVMe device inside a VM ...
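Rough sketch of the idea using the in-kernel nvmet target over configfs (the NQN, device path and IP below are placeholders, and a real setup needs the SR-IOV VF passed through to the VM first):
# on the host that owns the 900p:
modprobe nvmet
modprobe nvmet-rdma
NQN=nqn.2019-04.example:900p-slice        # example NQN, pick your own
cd /sys/kernel/config/nvmet
mkdir subsystems/${NQN}
echo 1 > subsystems/${NQN}/attr_allow_any_host
mkdir subsystems/${NQN}/namespaces/1
echo /dev/nvme0n1p2 > subsystems/${NQN}/namespaces/1/device_path   # the 900p partition to export (example)
echo 1 > subsystems/${NQN}/namespaces/1/enable
mkdir ports/1
echo rdma        > ports/1/addr_trtype
echo ipv4        > ports/1/addr_adrfam
echo 192.168.0.2 > ports/1/addr_traddr    # IP on the RDMA-capable interface (example)
echo 4420        > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/${NQN} ports/1/subsystems/${NQN}
# inside the VM, which sees an SR-IOV VF of the same adapter:
modprobe nvme-rdma
nvme connect -t rdma -n ${NQN} -a 192.168.0.2 -s 4420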
I'm still searching for a better iSCSI backing store. I took your suggestion and did some research this weekend. Conclusion: ZoL has much better performance on CentOS, but is still slow.
For comparison, same hardware, default settings, ZoL 0.7.9, benchmarked with pg_test_fsync (a minimal invocation is sketched below the numbers):
Ubuntu 2200 iops
Debian 2000 iops
CentOS 8000 iops
FreeBSD 16000 iops
Ubuntu + XFS 34000 iops
Ubuntu + EXT4 32000 iops
Ubuntu + BcacheFS 14000 iops
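For anyone wanting to reproduce these numbers, this is roughly how pg_test_fsync gets invoked (the path is just an example):
# point the test file at the filesystem/zvol under test; -s = seconds per test
pg_test_fsync -f /path/to/pgdata/pg_test_fsync.out -s 5
# the iops above presumably come from the ops/sec it reports per wal_sync_method
# (fdatasync is the PostgreSQL default on Linux)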
CentOS 7.6, ZFS v0.6.5.11-1, 10GB RAM block device
4k.read: IOPS=2705k, BW=10.3GiB/s (11.1GB/s)(160GiB/15504msec)
4k.write: IOPS=2114k, BW=8256MiB/s (8657MB/s)(160GiB/19845msec)
4k.randread: IOPS=2691k, BW=10.3GiB/s (11.0GB/s)(160GiB/15589msec)
4k.randwrite: IOPS=2054k, BW=8024MiB/s (8414MB/s)(157GiB/20002msec)
4k.70%read: IOPS=1545k, BW=6035MiB/s (6328MB/s)(112GiB/19004msec)
4k.30%write: IOPS=662k, BW=2586MiB/s (2712MB/s)(47.0GiB/19004msec)
CentOS 7.6, ZFS v0.6.5.11-1, zvol on 10GB RAM block device
4k.read: IOPS=289k, BW=1130MiB/s (1185MB/s)(22.1GiB/20002msec)
4k.write: IOPS=145k, BW=565MiB/s (593MB/s)(11.0GiB/20001msec)
4k.randread: IOPS=295k, BW=1153MiB/s (1209MB/s)(22.5GiB/20002msec)
4k.randwrite: IOPS=32.3k, BW=126MiB/s (132MB/s)(2523MiB/20020msec)
4k.70%read: IOPS=52.5k, BW=205MiB/s (215MB/s)(4102MiB/20002msec)
4k.30%write: IOPS=22.5k, BW=87.9MiB/s (92.2MB/s)(1759MiB/20002msec)
Ubuntu 18.04, ZFS v0.7.5-1ubuntu16.4, 10GB RAM block device
4k.read: IOPS=3653k, BW=13.9GiB/s (14.0GB/s)(160GiB/11482msec)
4k.write: IOPS=2730k, BW=10.4GiB/s (11.2GB/s)(160GiB/15365msec)
4k.randread: IOPS=3335k, BW=12.7GiB/s (13.7GB/s)(160GiB/12578msec)
4k.randwrite: IOPS=2535k, BW=9902MiB/s (10.4GB/s)(160GiB/16546msec)
4k.70%read: IOPS=2105k, BW=8221MiB/s (8620MB/s)(112GiB/13953msec)
4k.30%write: IOPS=902k, BW=3522MiB/s (3693MB/s)(47.0GiB/13953msec)
Ubuntu 18.04, ZFS v0.7.5-1ubuntu16.4, zvol on 10GB RAM block device
4k.read: IOPS=213k, BW=833MiB/s (873MB/s)(16.3GiB/20003msec)
4k.write: IOPS=173k, BW=677MiB/s (710MB/s)(13.2GiB/20002msec)
4k.randread: IOPS=149k, BW=580MiB/s (609MB/s)(11.3GiB/20003msec)
4k.randwrite: IOPS=26.2k, BW=102MiB/s (107MB/s)(2051MiB/20069msec)
4k.70%read: IOPS=35.3k, BW=138MiB/s (145MB/s)(2759MiB/20008msec)
4k.30%write: IOPS=15.2k, BW=59.2MiB/s (62.1MB/s)(1185MiB/20008msec)
# dofio DEVICE [extra fio args] - run 4k sequential/random read/write/mixed fio
# passes against DEVICE and print one IOPS line per pattern
dofio () {
    local FIO="fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=16 --runtime=20 --group_reporting --bs=4k"
    local FILE=$1
    shift
    # relabel fio's "read:"/"write:" result lines so the patterns stay distinguishable
    ($FIO "$@" --filename=${FILE} --rw=read      | sed 's/ read:/4k.read: /';
     $FIO "$@" --filename=${FILE} --rw=write     | sed 's/write:/4k.write: /';
     $FIO "$@" --filename=${FILE} --rw=randread  | sed 's/ read:/4k.randread: /';
     $FIO "$@" --filename=${FILE} --rw=randwrite | sed 's/write:/4k.randwrite:/';
     $FIO "$@" --filename=${FILE} --rw=randrw --rwmixread=70 | sed 's/ read:/4k.70%read: /' |
         sed 's/write:/4k.30%write: /' ;
    ) | grep IOPS=
}
# create a 10GB RAM block device (rd_size is in KiB) and pre-fill it
modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024 ))
dd if=/dev/zero of=/dev/ram0 bs=1M
# baseline: fio directly against the ramdisk
dofio /dev/ram0
# pool plus sparse zvol on the ramdisk
P=b
zpool create ${P} /dev/ram0
VOL=volb
SIZE=$(zfs get -H -p -o value available ${P})
BLOCKSIZE=32768
# use 80% of the available space, rounded down to a multiple of the block size
SIZE=$(( ${SIZE} * 80 / ( 100 * ${BLOCKSIZE} ) * ${BLOCKSIZE} ))
zfs create -s -b ${BLOCKSIZE} -V ${SIZE} ${P}/${VOL}
dofio /dev/zvol/${P}/${VOL}
# cleanup
zpool destroy ${P}
rmmod brd
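Not something from the run above, just a sketch of how the same script could sweep a few volblocksize values (run it before the zpool destroy) to see the RMW effect discussed further down:
for BLOCKSIZE in 4096 8192 16384 32768 131072; do
    SIZE=$(zfs get -H -p -o value available ${P})
    SIZE=$(( ${SIZE} * 80 / ( 100 * ${BLOCKSIZE} ) * ${BLOCKSIZE} ))
    zfs create -s -b ${BLOCKSIZE} -V ${SIZE} ${P}/vol_${BLOCKSIZE}
    udevadm settle                      # wait for /dev/zvol/... to appear
    echo "volblocksize=${BLOCKSIZE}"
    dofio /dev/zvol/${P}/vol_${BLOCKSIZE}
    zfs destroy ${P}/vol_${BLOCKSIZE}
done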
Solaris 11.3, 8GB RAM block device
4k.read: io=752176KB, bw=4055.7MB/s, iops=1038.3K, runt= 20380msec
4k.write: io=469448KB, bw=2240.3MB/s, iops=573495 , runt= 20317msec
4k.randread: io=2855.4MB, bw=3188.4MB/s, iops=816206 , runt= 20166msec
4k.randwrite: io=4051.5MB, bw=1996.8MB/s, iops=511168 , runt= 20491msec
4k.70%read: io=4008.6MB, bw=1831.6MB/s, iops=468863 , runt= 20080msec
4k.30%write: io=3474.4MB, bw=803819KB/s, iops=200954 , runt= 20080msec
Solaris 11.3, zvol on 8GB RAM block device
4k.read: io=3364.2MB, bw=3848.9MB/s, iops=985301 , runt= 20030msec
4k.write: io=1627.2MB, bw=502717KB/s, iops=125679 , runt= 20001msec
4k.randread: io=3773.4MB, bw=2645.8MB/s, iops=677308 , runt= 20004msec
4k.randwrite: io=2176.6MB, bw=111147KB/s, iops=27786 , runt= 20052msec
4k.70%read: io=3159.5MB, bw=161229KB/s, iops=40307 , runt= 20066msec
4k.30%write: io=1354.9MB, bw=69140KB/s, iops=17284 , runt= 20066msec
10GB ramdisk
4k.read: io=163840MB, bw=15796MB/s, iops=4043.9K, runt= 10372msec
4k.write: io=163840MB, bw=12245MB/s, iops=3134.8K, runt= 13380msec
4k.randread: io=163840MB, bw=14387MB/s, iops=3683.9K, runt= 11388msec
4k.randwrite: io=163840MB, bw=9229.1MB/s, iops=2362.9K, runt= 17751msec
4k.70%read: io=114704MB, bw=8526.1MB/s, iops=2182.9K, runt= 13452msec
4k.30%write: io=49136MB, bw=3652.7MB/s, iops=935084, runt= 13452msec
10GB ramdisk zvol
4k.read: io=29757MB, bw=1487.8MB/s, iops=380854, runt= 20002msec
4k.write: io=16633MB, bw=851565KB/s, iops=212891, runt= 20001msec
4k.randread: io=20155MB, bw=1007.7MB/s, iops=257962, runt= 20002msec
4k.randwrite: io=2770.5MB, bw=141650KB/s, iops=35412, runt= 20028msec
4k.70%read: io=3539.2MB, bw=181123KB/s, iops=45280, runt= 20009msec
4k.30%write: io=1517.8MB, bw=77672KB/s, iops=19417, runt= 20009msec
But ZFS on Linux is not that bottlenecked if you use multiple threads (note the "--numjobs=16" in the fio command line above). Single-threaded random write, however, is REALLY slow.
I inquired with the ZFS on Linux folks and got the answer I was expecting:
I took your suggestion and did some research this weekend. Conclusion: ZoL has much better performance on CentOS, but is still slow.
For comparison, same hardware, default settings ZoL 0.7.9, benchmarked with pg_test_fsync
Ubuntu 2200 iops
Debian 2000 iops
CentOS 8000 iops
FreeBSD 16000 iops
Ubuntu + XFS 34000 iops
Ubuntu + EXT4 32000 iops
Ubuntu + BcacheFS 14000 iops
We are already working with 32k volblocksize...
But "default settings" sticks out... if you mean 128k records: if you write 4k random sync writes over 128k records in a pool, you'll incur something like 64:1 IO amplification when RMW (read-modify-write) happens.
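Back-of-the-envelope for that ratio (my arithmetic, not theirs): an uncached 4k write into a 128k record means reading the whole 128k record and writing a new 128k copy, i.e.
echo $(( (128 + 128) / 4 ))   # -> 64, hence roughly 64:1 per 4k written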
Results for 4k random write at 8 jobs / iodepth 16 (8T16Q), with recordsize/volblocksize 4k (the earlier zvol used BLOCKSIZE=32768):
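The exact setup for this run isn't in the paste (and judging by the sizes in the zfs get output further down, the pool here was much smaller than the 10GB one above); roughly, the zvol is recreated with a 4k block size and the pool recordsize dropped to 4K:
zfs set recordsize=4K ${P}
zfs destroy ${P}/${VOL}
BLOCKSIZE=4096
SIZE=$(zfs get -H -p -o value available ${P})
SIZE=$(( ${SIZE} * 80 / ( 100 * ${BLOCKSIZE} ) * ${BLOCKSIZE} ))
zfs create -s -b ${BLOCKSIZE} -V ${SIZE} ${P}/${VOL}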
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=/dev/zvol/${P}/${VOL} --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=25.8MiB/s][r=0,w=6600 IOPS][eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=5815: Fri Apr 5 08:25:49 2019
write: IOPS=7374, BW=28.8MiB/s (30.2MB/s)(289MiB/10021msec)
slat (nsec): min=1492, max=283778k, avg=21567.99, stdev=1379626.28
clat (usec): min=34, max=382853, avg=17324.08, stdev=19804.98
lat (usec): min=90, max=382855, avg=17345.85, stdev=19850.90
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 11],
| 30.00th=[ 12], 40.00th=[ 14], 50.00th=[ 15], 60.00th=[ 16],
| 70.00th=[ 17], 80.00th=[ 18], 90.00th=[ 21], 95.00th=[ 29],
| 99.00th=[ 115], 99.50th=[ 155], 99.90th=[ 268], 99.95th=[ 288],
| 99.99th=[ 368]
bw ( KiB/s): min= 2120, max= 5472, per=12.50%, avg=3688.05, stdev=522.15, samples=160
iops : min= 530, max= 1368, avg=921.99, stdev=130.54, samples=160
lat (usec) : 50=0.01%, 100=0.04%, 250=0.20%, 500=0.20%, 750=0.10%
lat (usec) : 1000=0.08%
lat (msec) : 2=0.14%, 4=0.23%, 10=11.65%, 20=76.78%, 50=7.48%
lat (msec) : 100=1.82%, 250=1.12%, 500=0.14%
cpu : usr=0.33%, sys=0.46%, ctx=15195, majf=0, minf=70
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.8%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,73904,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=28.8MiB/s (30.2MB/s), 28.8MiB/s-28.8MiB/s (30.2MB/s-30.2MB/s), io=289MiB (303MB), run=10021-10021msec
zfs get all
NAME PROPERTY VALUE SOURCE
b type filesystem -
b creation Fr Apr 5 8:19 2019 -
b used 535M -
b available 345M -
b referenced 96K -
b compressratio 1.00x -
b mounted yes -
b quota none default
b reservation none default
b recordsize 4K local
b mountpoint /b default
b sharenfs off default
b checksum on default
b compression off default
b atime on default
b devices on default
b exec on default
b setuid on default
b readonly off default
b zoned off default
b snapdir hidden default
b aclinherit restricted default
b createtxg 1 -
b canmount on default
b xattr on default
b copies 1 default
b version 5 -
b utf8only off -
b normalization none -
b casesensitivity sensitive -
b vscan off default
b nbmand off default
b sharesmb off default
b refquota none default
b refreservation none default
b guid 14752277821102290775 -
b primarycache all default
b secondarycache all default
b usedbysnapshots 0B -
b usedbydataset 96K -
b usedbychildren 534M -
b usedbyrefreservation 0B -
b logbias latency default
b dedup off default
b mlslabel none default
b sync standard default
b dnodesize legacy default
b refcompressratio 1.00x -
b written 96K -
b logicalused 525M -
b logicalreferenced 40K -
b volmode default default
b filesystem_limit none default
b snapshot_limit none default
b filesystem_count none default
b snapshot_count none default
b snapdev hidden default
b acltype off default
b context none default
b fscontext none default
b defcontext none default
b rootcontext none default
b relatime off default
b redundant_metadata all default
b overlay off default
b/volb type volume -
b/volb creation Fr Apr 5 8:21 2019 -
b/volb used 531M -
b/volb available 345M -
b/volb referenced 531M -
b/volb compressratio 1.00x -
b/volb reservation none default
b/volb volsize 704M local
b/volb volblocksize 4K -
b/volb checksum on default
b/volb compression off default
b/volb readonly off default
b/volb createtxg 22 -
b/volb copies 1 default
b/volb refreservation none default
b/volb guid 10441742856011729804 -
b/volb primarycache all default
b/volb secondarycache all default
b/volb usedbysnapshots 0B -
b/volb usedbydataset 531M -
b/volb usedbychildren 0B -
b/volb usedbyrefreservation 0B -
b/volb logbias latency default
b/volb dedup off default
b/volb mlslabel none default
b/volb sync standard default
b/volb refcompressratio 1.00x -
b/volb written 531M -
b/volb logicalused 524M -
b/volb logicalreferenced 524M -
b/volb volmode default default
b/volb snapshot_limit none default
b/volb snapshot_count none default
b/volb snapdev hidden default
b/volb context none default
b/volb fscontext none default
b/volb defcontext none default
b/volb rootcontext none default
b/volb redundant_metadata all default
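Side note: the handful of properties that actually matter here can be pulled without the full dump, e.g.:
zfs get -o name,property,value,source recordsize,volblocksize,compression,sync,logbias,primarycache b b/volb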
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=/dev/ram0 --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
Jobs: 8 (f=8)
test: (groupid=0, jobs=8): err= 0: pid=15912: Fri Apr 5 10:22:03 2019
write: IOPS=1524k, BW=5953MiB/s (6243MB/s)(8192MiB/1376msec)
slat (nsec): min=1522, max=7510.0k, avg=2886.92, stdev=11973.56
clat (nsec): min=1043, max=11356k, avg=75762.81, stdev=80754.63
lat (usec): min=2, max=11358, avg=78.75, stdev=81.87
clat percentiles (usec):
| 1.00th=[ 50], 5.00th=[ 72], 10.00th=[ 73], 20.00th=[ 73],
| 30.00th=[ 74], 40.00th=[ 74], 50.00th=[ 74], 60.00th=[ 75],
| 70.00th=[ 75], 80.00th=[ 76], 90.00th=[ 76], 95.00th=[ 77],
| 99.00th=[ 90], 99.50th=[ 145], 99.90th=[ 445], 99.95th=[ 1336],
| 99.99th=[ 2769]
bw ( KiB/s): min=703640, max=821218, per=12.98%, avg=791497.06, stdev=31242.83, samples=16
iops : min=175910, max=205304, avg=197874.19, stdev=7810.66, samples=16
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=1.35%, 100=97.78%
lat (usec) : 250=0.71%, 500=0.05%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.06%, 4=0.01%, 10=0.01%, 20=0.01%
cpu : usr=40.03%, sys=57.11%, ctx=781, majf=0, minf=78
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,2097152,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=5953MiB/s (6243MB/s), 5953MiB/s-5953MiB/s (6243MB/s-6243MB/s), io=8192MiB (8590MB), run=1376-1376msec
Disk stats (read/write):
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
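The /dev/md0 used in the next run isn't created anywhere in the paste; presumably it's just an md layer over the same ramdisk to measure md's own overhead, something like:
mdadm --create /dev/md0 --level=0 --raid-devices=1 --force /dev/ram0   # single-member stripe (mdadm needs --force for one device)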
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=/dev/md0 --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
Jobs: 8 (f=8)
test: (groupid=0, jobs=8): err= 0: pid=22816: Fri Apr 5 10:27:30 2019
write: IOPS=1345k, BW=5254MiB/s (5510MB/s)(8160MiB/1553msec)
slat (nsec): min=1974, max=6919.2k, avg=3371.10, stdev=16099.68
clat (nsec): min=1209, max=7531.3k, avg=84376.29, stdev=78147.61
lat (usec): min=4, max=7534, avg=87.85, stdev=79.94
clat percentiles (usec):
| 1.00th=[ 67], 5.00th=[ 79], 10.00th=[ 80], 20.00th=[ 80],
| 30.00th=[ 81], 40.00th=[ 81], 50.00th=[ 81], 60.00th=[ 82],
| 70.00th=[ 82], 80.00th=[ 83], 90.00th=[ 84], 95.00th=[ 88],
| 99.00th=[ 105], 99.50th=[ 178], 99.90th=[ 1188], 99.95th=[ 1369],
| 99.99th=[ 3261]
bw ( KiB/s): min=608088, max=746600, per=13.21%, avg=710512.42, stdev=36226.05, samples=19
iops : min=152024, max=186650, avg=177628.21, stdev=9056.20, samples=19
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=98.50%
lat (usec) : 250=1.19%, 500=0.14%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.12%, 4=0.02%, 10=0.01%
cpu : usr=35.58%, sys=60.65%, ctx=1085, majf=0, minf=74
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,2088960,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=5254MiB/s (5510MB/s), 5254MiB/s-5254MiB/s (5510MB/s-5510MB/s), io=8160MiB (8556MB), run=1553-1553msec
Disk stats (read/write):
md0: ios=83/1984981, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
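And the ./tmp in the last run is, as far as I can tell, a plain file on the ZFS filesystem itself rather than a zvol; easy to check before running:
cd /b               # dataset "b" is mounted here per the zfs get output above
stat -f -c %T .     # should print "zfs"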
fio --name=test --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --iodepth=16 --numjobs=8 --runtime=10 --group_reporting --bs=4k --filename=./tmp --size=500m --rw=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 8 processes
test: Laying out IO file (1 file / 500MiB)
Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=54.2MiB/s][r=0,w=13.9k IOPS][eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=4675: Fri Apr 5 11:12:57 2019
write: IOPS=37.9k, BW=148MiB/s (155MB/s)(1482MiB/10020msec)
slat (usec): min=8, max=223277, avg=207.46, stdev=1660.49
clat (usec): min=3, max=226527, avg=3169.97, stdev=6470.60
lat (usec): min=194, max=227086, avg=3377.84, stdev=6681.59
clat percentiles (usec):
| 1.00th=[ 285], 5.00th=[ 791], 10.00th=[ 1205], 20.00th=[ 1582],
| 30.00th=[ 1844], 40.00th=[ 2089], 50.00th=[ 2343], 60.00th=[ 2638],
| 70.00th=[ 3032], 80.00th=[ 3556], 90.00th=[ 4555], 95.00th=[ 6521],
| 99.00th=[ 15401], 99.50th=[ 33424], 99.90th=[113771], 99.95th=[120062],
| 99.99th=[223347]
bw ( KiB/s): min= 3840, max=30824, per=12.52%, avg=18961.75, stdev=6346.25, samples=160
iops : min= 960, max= 7706, avg=4740.44, stdev=1586.56, samples=160
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 250=0.13%, 500=3.14%
lat (usec) : 750=1.40%, 1000=2.40%
lat (msec) : 2=29.50%, 4=49.08%, 10=11.92%, 20=1.81%, 50=0.27%
lat (msec) : 100=0.24%, 250=0.10%
cpu : usr=1.80%, sys=19.06%, ctx=928477, majf=0, minf=89
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,379363,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=148MiB/s (155MB/s), 148MiB/s-148MiB/s (155MB/s-155MB/s), io=1482MiB (1554MB), run=10020-10020msec
Does this write amplification happen to the same extent when the vdevs are composed of mirrored pairs? (Referring to: "if you write 4k random sync writes over 128k records in a pool, you'll incur something like 64:1 IO amplification when RMW happens.")
Thanks very much for these. Maybe something like this: ZFS and VSP | Hitachi Vantara Community
Oracle's DTrace tutorial / examples will likely provide some insight: Tutorial: DTrace by Example.
I'm still looking at this, and I agree: if ZFS on a ramdisk shows low IOPS, it's counter-intuitive. Either way, ramdisk results may be easier to replicate and compare.
Two more things. First, I've retested with sync disabled, and it does not increase random IOPS on the ramdisk(?). Second, thinking about these tests: we are using a RAM (random access memory) device, so it should not make any difference whether the I/O is sequential or random.
I'm out of ideas and off on vacation for now. I'll keep reading, but have no machine to validate on until next week. Good luck.
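For completeness, the sync-disabled retest is roughly just:
zfs set sync=disabled ${P}/${VOL}
dofio /dev/zvol/${P}/${VOL}
zfs set sync=standard ${P}/${VOL}   # restore the default afterwards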