ZFS benchmark

Albert Yang

Member
Oct 26, 2017
55
0
6
26
Hi,
So I'm currently trying to benchmark my ZFS setup before getting a SLOG. Here are a few questions, if someone could shed some light:

1) What is the rule of thumb when sizing a SLOG? Does it depend on the RAM, the size of the disks, or the pool? If I have 4 disks of 4TB, what size of SSD do I need?
2) Would this be sufficient? The 58GB SSD: https://www.amazon.com/Intel-Optane-800P-58GB-XPoint/dp/B078ZJSD6F
3) I'm currently benchmarking with fio, and yes, I have pretty bad results, but I was reading that it also depends on the physical disk's sector size (mine is 512) and on the VM storage volblocksize, which defaults to 8k. Not sure if changing it would help?
Code:
cat /sys/block/sda/queue/hw_sector_size
512
4) Currently I have the ARC max set to 2GB, which I think might be too low. I have 32GB in total, with 26GB used by the VMs (I probably need to add more RAM), but what is the rule of thumb for the ARC max?
5) Can turning compression off help?
6) Would setting atime=off also help on writes, given that the VMs are RAW inside Proxmox?
Code:
zfs set atime=off rpool
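Related to the benchmarks: since a SLOG only accelerates synchronous writes, I may also run a small-block sync write test before and after adding one. This is just a sketch on my side (filename and all parameters are a guess):

```shell
# Hypothetical 4k sync random-write run; this is the workload a SLOG helps with.
fio --name=synctest --filename=synctest.bin --rw=randwrite --bs=4k --sync=1 \
    --numjobs=1 --iodepth=1 --filesize=1G --runtime=60 --time_based \
    --group_reporting && rm synctest.bin
```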
Thank you

Well, these are the stats (pretty bad, I know).
Proxmox Host
Code:
command: fio --filename=test --sync=1 --rw=randwrite --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Random write

test: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=4
fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [f(1)][100.0%][w=475MiB/s][w=475 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=18230: Fri Jul 26 16:36:21 2019
  write: IOPS=167, BW=168MiB/s (176MB/s)(10.0GiB/61028msec)
    clat (usec): min=99, max=7805.3k, avg=5948.35, stdev=158625.94
     lat (usec): min=104, max=7805.3k, avg=5958.56, stdev=158625.92
    clat percentiles (usec):
     |  1.00th=[    108],  5.00th=[    113], 10.00th=[    120],
     | 20.00th=[    141], 30.00th=[    151], 40.00th=[    157],
     | 50.00th=[    165], 60.00th=[    178], 70.00th=[    198],
     | 80.00th=[    219], 90.00th=[    277], 95.00th=[    347],
     | 99.00th=[    652], 99.50th=[   1500], 99.90th=[2533360],
     | 99.95th=[3137340], 99.99th=[6945768]
   bw (  KiB/s): min=147456, max=2461696, per=100.00%, avg=1037687.90, stdev=663809.16, samples=20
   iops        : min=  144, max= 2404, avg=1013.30, stdev=648.27, samples=20
  lat (usec)   : 100=0.04%, 250=86.66%, 500=11.78%, 750=0.66%, 1000=0.17%
  lat (msec)   : 2=0.26%, 4=0.07%, 10=0.03%, 20=0.05%, 50=0.04%
  lat (msec)   : 250=0.07%, 500=0.01%, 2000=0.04%, >=2000=0.13%
  cpu          : usr=0.14%, sys=2.64%, ctx=7440, majf=8, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=168MiB/s (176MB/s), 168MiB/s-168MiB/s (176MB/s-176MB/s), io=10.0GiB (10.7GB), run=61028-61028msec



command: fio --filename=test --sync=1 --rw=randread --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Random read

test: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=4
fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=76.0MiB/s][r=76 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=337: Fri Jul 26 16:48:21 2019
  read: IOPS=73, BW=73.4MiB/s (76.0MB/s)(10.0GiB/139469msec)
    clat (usec): min=168, max=326933, avg=13616.37, stdev=8044.79
     lat (usec): min=168, max=326934, avg=13616.73, stdev=8044.80
    clat percentiles (usec):
     |  1.00th=[   260],  5.00th=[  7439], 10.00th=[  8717], 20.00th=[ 10159],
     | 30.00th=[ 11076], 40.00th=[ 11994], 50.00th=[ 12649], 60.00th=[ 13435],
     | 70.00th=[ 14222], 80.00th=[ 15270], 90.00th=[ 17695], 95.00th=[ 23462],
     | 99.00th=[ 42730], 99.50th=[ 59507], 99.90th=[ 99091], 99.95th=[111674],
     | 99.99th=[135267]
   bw (  KiB/s): min=16384, max=102400, per=99.88%, avg=75091.18, stdev=12184.43, samples=278
   iops        : min=   16, max=  100, avg=73.25, stdev=11.91, samples=278
  lat (usec)   : 250=0.82%, 500=0.85%
  lat (msec)   : 2=0.90%, 4=0.19%, 10=16.28%, 20=73.71%, 50=6.58%
  lat (msec)   : 100=0.58%, 250=0.09%, 500=0.01%
  cpu          : usr=0.03%, sys=1.70%, ctx=10205, majf=0, minf=268
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=73.4MiB/s (76.0MB/s), 73.4MiB/s-73.4MiB/s (76.0MB/s-76.0MB/s), io=10.0GiB (10.7GB), run=139469-139469msec


command: fio --filename=test --sync=1 --rw=read --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Sequential read   

test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=4
fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [R(1)][97.6%][r=335MiB/s][r=335 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=16042: Fri Jul 26 16:50:29 2019
  read: IOPS=256, BW=257MiB/s (269MB/s)(10.0GiB/39919msec)
    clat (usec): min=173, max=1988.9k, avg=3896.22, stdev=27967.08
     lat (usec): min=174, max=1988.9k, avg=3896.47, stdev=27967.15
    clat percentiles (usec):
     |  1.00th=[    204],  5.00th=[    217], 10.00th=[    229],
     | 20.00th=[    260], 30.00th=[    318], 40.00th=[   1156],
     | 50.00th=[   2147], 60.00th=[   2933], 70.00th=[   3720],
     | 80.00th=[   4817], 90.00th=[   7308], 95.00th=[  10159],
     | 99.00th=[  26870], 99.50th=[  36439], 99.90th=[  74974],
     | 99.95th=[ 274727], 99.99th=[1384121]
   bw (  KiB/s): min= 2048, max=401408, per=100.00%, avg=283533.05, stdev=99968.47, samples=73
   iops        : min=    2, max=  392, avg=276.88, stdev=97.62, samples=73
  lat (usec)   : 250=17.56%, 500=16.98%, 750=2.29%, 1000=1.92%
  lat (msec)   : 2=9.26%, 4=25.09%, 10=21.77%, 20=3.32%, 50=1.53%
  lat (msec)   : 100=0.21%, 250=0.01%, 500=0.02%, 750=0.01%, 2000=0.03%
  cpu          : usr=0.07%, sys=6.32%, ctx=7006, majf=0, minf=266
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=257MiB/s (269MB/s), 257MiB/s-257MiB/s (269MB/s-269MB/s), io=10.0GiB (10.7GB), run=39919-39919msec
  
  
 
 Command: fio --filename=test --sync=1 --rw=write --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
 
 Sequential write
 
 test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][90.2%][eta 00m:05s]                         
test: (groupid=0, jobs=1): err= 0: pid=23005: Fri Jul 26 16:52:19 2019
  write: IOPS=219, BW=219MiB/s (230MB/s)(10.0GiB/46746msec)
    clat (usec): min=91, max=9317.8k, avg=4554.20, stdev=160615.65
     lat (usec): min=97, max=9317.8k, avg=4564.12, stdev=160615.78
    clat percentiles (usec):
     |  1.00th=[     97],  5.00th=[    108], 10.00th=[    122],
     | 20.00th=[    130], 30.00th=[    137], 40.00th=[    143],
     | 50.00th=[    151], 60.00th=[    161], 70.00th=[    176],
     | 80.00th=[    188], 90.00th=[    212], 95.00th=[    330],
     | 99.00th=[   1532], 99.50th=[   1762], 99.90th=[ 105382],
     | 99.95th=[4462740], 99.99th=[6207570]
   bw (  MiB/s): min=    2, max= 2500, per=100.00%, avg=1418.39, stdev=863.69, samples=13
   iops        : min=    2, max= 2500, avg=1418.31, stdev=863.68, samples=13
  lat (usec)   : 100=2.60%, 250=90.42%, 500=3.31%, 750=0.24%, 1000=0.35%
  lat (msec)   : 2=2.62%, 4=0.12%, 10=0.14%, 20=0.02%, 100=0.01%
  lat (msec)   : 250=0.09%, 500=0.01%, 2000=0.01%, >=2000=0.07%
  cpu          : usr=0.20%, sys=2.89%, ctx=11012, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=10.0GiB (10.7GB), run=46746-46746msec

And the VM:

Code:
command: fio --filename=test --sync=1 --rw=randwrite --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
    
    randwrite
    
    fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=117MiB/s][w=117 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=30330: Fri Jul 26 16:56:09 2019
  write: IOPS=89, BW=89.5MiB/s (93.9MB/s)(10.0GiB/114364msec)
    clat (usec): min=918, max=595797, avg=11132.76, stdev=26966.94
     lat (usec): min=932, max=595843, avg=11161.68, stdev=26969.05
    clat percentiles (usec):
     |  1.00th=[  1074],  5.00th=[  1156], 10.00th=[  1221], 20.00th=[  1336],
     | 30.00th=[  1500], 40.00th=[  1713], 50.00th=[  1942], 60.00th=[  2343],
     | 70.00th=[  4883], 80.00th=[ 14222], 90.00th=[ 28181], 95.00th=[ 49021],
     | 99.00th=[131597], 99.50th=[170918], 99.90th=[295699], 99.95th=[375391],
     | 99.99th=[492831]
   bw (  KiB/s): min= 2048, max=286720, per=99.79%, avg=91493.11, stdev=49831.22, samples=228
   iops        : min=    2, max=  280, avg=89.28, stdev=48.66, samples=228
  lat (usec)   : 1000=0.07%
  lat (msec)   : 2=51.91%, 4=17.09%, 10=6.06%, 20=9.51%, 50=10.50%
  lat (msec)   : 100=3.17%, 250=1.49%, 500=0.18%, 750=0.01%
  cpu          : usr=0.32%, sys=5.31%, ctx=36214, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=89.5MiB/s (93.9MB/s), 89.5MiB/s-89.5MiB/s (93.9MB/s-93.9MB/s), io=10.0GiB (10.7GB), run=114364-114364msec

Disk stats (read/write):
  vda: ios=359/36257, merge=110/26936, ticks=260/128020, in_queue=128308, util=92.90%
 

command: fio --filename=test --sync=1 --rw=randread --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Random read

test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=49.0MiB/s][r=49 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=30377: Fri Jul 26 17:03:11 2019
  read: IOPS=27, BW=27.7MiB/s (29.0MB/s)(8300MiB/300017msec)
    clat (usec): min=873, max=3333.3k, avg=36138.39, stdev=91739.19
     lat (usec): min=873, max=3333.3k, avg=36139.11, stdev=91739.19
    clat percentiles (usec):
     |  1.00th=[   1012],  5.00th=[   1106], 10.00th=[   1254],
     | 20.00th=[  13042], 30.00th=[  18220], 40.00th=[  21890],
     | 50.00th=[  25560], 60.00th=[  29230], 70.00th=[  33817],
     | 80.00th=[  42206], 90.00th=[  63177], 95.00th=[  88605],
     | 99.00th=[ 202376], 99.50th=[ 375391], 99.90th=[1317012],
     | 99.95th=[2122318], 99.99th=[3338666]
   bw (  KiB/s): min= 2043, max=67584, per=100.00%, avg=30456.64, stdev=15658.90, samples=558
   iops        : min=    1, max=   66, avg=29.71, stdev=15.30, samples=558
  lat (usec)   : 1000=0.78%
  lat (msec)   : 2=16.05%, 4=0.51%, 10=1.07%, 20=16.02%, 50=50.54%
  lat (msec)   : 100=11.08%, 250=3.19%, 500=0.37%, 750=0.11%, 1000=0.06%
  lat (msec)   : 2000=0.13%, >=2000=0.07%
  cpu          : usr=0.04%, sys=1.04%, ctx=69453, majf=0, minf=204
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=8300,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=27.7MiB/s (29.0MB/s), 27.7MiB/s-27.7MiB/s (29.0MB/s-29.0MB/s), io=8300MiB (8703MB), run=300017-300017msec

Disk stats (read/write):
  vda: ios=72206/557, merge=5772/489, ticks=354680/1624, in_queue=356132, util=98.63%



command: fio --filename=test --sync=1 --rw=read --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Sequential read   


test: (groupid=0, jobs=1): err= 0: pid=30435: Fri Jul 26 17:06:41 2019
  read: IOPS=183, BW=184MiB/s (193MB/s)(10.0GiB/55699msec)
    clat (usec): min=394, max=3498.9k, avg=5432.27, stdev=39175.67
     lat (usec): min=395, max=3498.9k, avg=5432.98, stdev=39175.68
    clat percentiles (usec):
     |  1.00th=[   685],  5.00th=[   799], 10.00th=[   848], 20.00th=[   955],
     | 30.00th=[  1139], 40.00th=[  1418], 50.00th=[  2089], 60.00th=[  2933],
     | 70.00th=[  4015], 80.00th=[  5669], 90.00th=[  9110], 95.00th=[ 15795],
     | 99.00th=[ 43254], 99.50th=[ 64226], 99.90th=[206570], 99.95th=[526386],
     | 99.99th=[775947]
   bw (  KiB/s): min= 2048, max=348160, per=100.00%, avg=204690.43, stdev=96757.40, samples=102
   iops        : min=    2, max=  340, avg=199.86, stdev=94.50, samples=102
  lat (usec)   : 500=0.12%, 750=2.07%, 1000=21.33%
  lat (msec)   : 2=25.58%, 4=20.82%, 10=21.22%, 20=5.19%, 50=2.93%
  lat (msec)   : 100=0.46%, 250=0.21%, 500=0.03%, 750=0.04%, 1000=0.01%
  lat (msec)   : >=2000=0.01%
  cpu          : usr=0.29%, sys=5.94%, ctx=35793, majf=0, minf=269
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=10.0GiB (10.7GB), run=55699-55699msec

Disk stats (read/write):
  vda: ios=41292/1655, merge=387/4867, ticks=106828/1292, in_queue=108056, util=98.08%


 Command: fio --filename=test --sync=1 --rw=write --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
 
 Sequential write
 
 test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=62.1MiB/s][w=62 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=30539: Fri Jul 26 17:10:32 2019
  write: IOPS=127, BW=128MiB/s (134MB/s)(10.0GiB/80072msec)
    clat (usec): min=851, max=1115.1k, avg=7792.02, stdev=32943.88
     lat (usec): min=862, max=1115.2k, avg=7814.53, stdev=32945.37
    clat percentiles (usec):
     |  1.00th=[   947],  5.00th=[  1004], 10.00th=[  1045], 20.00th=[  1123],
     | 30.00th=[  1205], 40.00th=[  1303], 50.00th=[  1467], 60.00th=[  1631],
     | 70.00th=[  1827], 80.00th=[  2311], 90.00th=[ 14222], 95.00th=[ 29492],
     | 99.00th=[135267], 99.50th=[223347], 99.90th=[429917], 99.95th=[566232],
     | 99.99th=[624952]
   bw (  KiB/s): min= 2048, max=362496, per=100.00%, avg=131168.47, stdev=80230.28, samples=159
   iops        : min=    2, max=  354, avg=128.04, stdev=78.35, samples=159
  lat (usec)   : 1000=4.51%
  lat (msec)   : 2=70.53%, 4=10.39%, 10=2.52%, 20=4.36%, 50=4.70%
  lat (msec)   : 100=1.55%, 250=1.05%, 500=0.33%, 750=0.05%, 2000=0.01%
  cpu          : usr=0.45%, sys=5.85%, ctx=35313, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=10.0GiB (10.7GB), run=80072-80072msec

Disk stats (read/write):
  vda: ios=309/33730, merge=202/17093, ticks=13188/81368, in_queue=94600, util=92.71%
 

gea

Well-Known Member
Dec 31, 2010
2,479
835
113
DE
1.
The SLOG must be at least 2x the size of the RAM-based write cache;
10 GB is the lower limit.

2.
Any Optane from the 800P-58 up is a very good SLOG for home or lab use.
The upper-class/datacenter Optane like the 4801X is only a little faster,
but it offers better write endurance and guaranteed power-loss protection,
which makes it a perfect SLOG for a production server.

3.
I have made a series of filebench benchmarks on Solarish with different types
of pools, sync enabled/disabled, RAM sizes, and SLOGs (from different disks, SSDs, ZeusRAM, NVMe, Optane). If you have filebench you can compare; otherwise you can use it to
decide about pool layouts.

https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf

4.
The read cache makes random I/O and access to metadata fast.
2 GB is stable but way too small if you want performance. Think of 8-32 GB or more.

Multi-VM performance is closely tied to IOPS and small random reads/writes.
Without enough RAM you are limited by pure disk performance. That is not a problem
with a pool made of Optane, but it is bad with disks.

See ESXi VM performance vs RAM:
https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf
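To act on the read-cache advice above: on Linux Open-ZFS the ARC cap is the zfs_arc_max module parameter. A minimal sketch, assuming an 8 GiB target (the figure and paths are mine, not part of the reply):

```shell
# Compute an 8 GiB ARC cap in bytes.
ARC_MAX=$((8 * 1024 * 1024 * 1024))
echo "$ARC_MAX"   # 8589934592
# Apply live (needs root):
#   echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
# Persist across reboots:
#   echo "options zfs zfs_arc_max=$ARC_MAX" >> /etc/modprobe.d/zfs.conf
```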

5.
No, keep compression on and use LZ4.

6.
Yes, atime=off reduces write operations; disable it.
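Points 5 and 6 above translate to two one-liners; "rpool" here is just the pool name from earlier in the thread:

```shell
# Keep compression enabled with LZ4 (cheap on CPU, often speeds up I/O):
zfs set compression=lz4 rpool
# Skip access-time updates to avoid a metadata write per read:
zfs set atime=off rpool
# Verify:
zfs get compression,atime rpool
```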
 

Albert Yang

Thanks for the reply.
1) As for the SLOG size, I did not really get what you mean by 2x the write cache size?
2) Yes, I remember you mentioned it; I want to give this one a try before buying the other Intel Optane to see the difference.
3) Thank you, I will check the benchmark. But what about the volblocksize? It differs between the VM and the hardware; is it recommended to change it?

Thank you
 

gea

The only task of a SLOG is to protect the content of the RAM-based write cache.
On Open-ZFS the default size of the write cache is 10% of RAM, max 4GB (the SLOG is not a write cache).

When the write cache is full, it is written to the pool as a large, fast sequential write,
while a second area of the RAM cache takes new writes.

This is why a SLOG must be 2x the write cache size (2 x 4GB) as a minimum.
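The sizing rule above can be checked with a little shell arithmetic; the 32 GiB RAM figure comes from earlier in the thread, and the 10 GiB floor is the lower limit stated in the first reply:

```shell
RAM_GB=32
WCACHE_GB=$((RAM_GB / 10))                               # write cache is ~10% of RAM...
if [ "$WCACHE_GB" -gt 4 ]; then WCACHE_GB=4; fi          # ...capped at 4 GiB on Open-ZFS
SLOG_MIN_GB=$((2 * WCACHE_GB))                           # must hold two cache flights
if [ "$SLOG_MIN_GB" -lt 10 ]; then SLOG_MIN_GB=10; fi    # 10 GiB lower limit
echo "minimum SLOG size: ${SLOG_MIN_GB} GiB"             # -> minimum SLOG size: 10 GiB
```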

On ZFS the default recordsize is 128k, and this is a good compromise.
For VM storage I recommend settings between 32k and 64k.
For a single-user media filer, 1M may be better.
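In command form, the advice above might look like the following sketch; "tank/vms" and the zvol name are hypothetical, and note that a zvol's volblocksize can only be chosen at creation time:

```shell
# Datasets: recordsize can be changed at any time (affects newly written data only).
zfs set recordsize=64k tank/vms
# Zvols (e.g. Proxmox RAW vdisks): volblocksize is fixed at creation.
zfs create -V 32G -o volblocksize=32k tank/vms/vm-100-disk-1
```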
 

Albert Yang

Thanks for the reply. So what would that look like, for example, with 32GB of RAM and a 7TB pool?

As for the VM storage: it is currently 8k, so would changing it to 32k or 64k help on reads, even though the disk's physical sector size is 512?

Thank you
 

gea

There is no direct relation between RAM size and pool size, but there is a relation between RAM size and performance (or the active amount of pool data), especially with slow disks. On a performant ZFS system, more than 80% of random reads can be delivered from the RAM read cache.

Do not confuse the physical sector size of disks (512B or 4k), or the blocksize of e.g. an iSCSI LUN or vdisk (8k), with the ZFS recordsize (default 128k). The first two are a given. Only the ZFS recordsize (how many physical blocks are read/written in one I/O; dedup, compression and checksums relate to it) is a tunable performance parameter.
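The three layers distinguished above can each be inspected separately; the device and dataset names here are placeholders:

```shell
# Physical sector size of the disk (a given):
cat /sys/block/sda/queue/physical_block_size
# Block size of a zvol-backed vdisk (fixed when the zvol was created):
zfs get volblocksize rpool/data/vm-100-disk-0
# ZFS recordsize (the tunable layer):
zfs get recordsize rpool
```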
 

Albert Yang

Thank you for the reply. In our previous conversation you mentioned:
"If you want to use a single disk as an slog for 4 datapools you must create 4 partition with a minimum size of 10 GB each. Add a partition per pool as slog."
So as long as I have 1 pool, the minimum for a SLOG is 10GB per pool? I was reading this document, and I think that's why I'm somewhat confused:
https://martin.heiland.io/2018/02/23/zfs-tuning/

And about the SLOG: I was reading that to improve read performance I would add a cache device, while to improve writes I would add a log device. But couldn't I just add both on the 58GB device?

Thank you
 

gea

A SLOG is a vdev that you add to the pool as a log device,
so you need a SLOG per pool.

Normally I would not dual-use a disk for SLOG and L2ARC.
Intel Optane is so fast that you can do it without serious disadvantages (especially as your RAM is low).

The recommended size for an L2ARC is 5x RAM (never more than 10x).
With 2 GB RAM you may add a 10-20 GB L2ARC (the L2ARC requires RAM to manage), but seriously, use more RAM. Even on Solarish, which has the lowest RAM needs for ZFS, I would prefer 8GB+ and would not go below 4GB.
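If you do go the dual-use route described above, one possible sketch on Linux is to partition the Optane and give one partition to each role (the device name, sizes, and the GPT type code are assumptions on my side):

```shell
# Two GPT partitions on a hypothetical Optane at /dev/nvme0n1:
sgdisk -n1:0:+10G -t1:bf01 /dev/nvme0n1   # 10 GiB for the SLOG
sgdisk -n2:0:+40G -t2:bf01 /dev/nvme0n1   # 40 GiB for the L2ARC
# Attach them to the pool as log and cache vdevs:
zpool add rpool log /dev/nvme0n1p1
zpool add rpool cache /dev/nvme0n1p2
```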
 

BackupProphet

Well-Known Member
Jul 2, 2014
796
284
63
Stavanger, Norway
kingmakers.no
A SLOG is not a write cache.

A SLOG takes only synchronous writes. That means writes with the following behavior:
write 4kb (2us latency) ->
flush to disk (120us latency) ->
write another 8kb (3us latency) ->
flush to disk (160us latency) ->
write 4kb (2us latency) ->
flush to disk (120 us latency)

Total time spent: 407 us.

Without a fast dedicated slog it would be something more like
write 4kb (2us latency) ->
flush to disk (4500us latency) ->
write another 8kb (3us latency) ->
flush to disk (4500us latency) ->
write 4kb (2us latency) ->
flush to disk (4500 us latency)

Total time spent: 13507 us.

Most writes are asynchronous and behave as follows:
write 4kb (2us latency) ->
write another 8kb (3us latency) ->
write 4kb (2us latency)

Total time spent 7 us.

Your operating system will handle the disk flushes automatically.
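The totals in the example above can be checked with shell arithmetic (latencies in microseconds as given):

```shell
echo $((2 + 120 + 3 + 160 + 2 + 120))      # sync writes with a fast SLOG -> 407
echo $((2 + 4500 + 3 + 4500 + 2 + 4500))   # sync writes without one -> 13507
echo $((2 + 3 + 2))                        # async writes, no flushes -> 7
```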

The Optane 800P is not something I would recommend either; it has low endurance and is pricey. Either get a 900P 280GB, or get the 16GB model, which is available cheap.

To leave more RAM for the disk cache, consider increasing your swap size. Too many configurations have no swap at all. I highly recommend a few GBs, and the Intel Optane is perfect for swap :)
 

Albert Yang

gea said:
(quoted above)
Thanks for the reply. About the L2ARC sizing: is that 10GB of SLOG? To be clear, this is going to be the lab before putting it into production on a server with 220GB of RAM and RAID-10 15k RPM disks of 600GB each, so before that I need to understand how to distribute the SLOG. As for the mirror, it is just in case the disk dies, so I have time to replace the SSD. So, rule of thumb: 10GB per pool? But one SLOG SSD can only be configured as either cache or log, not both, right?
Sorry for my ignorance.

Thank you
 

Albert Yang

BackupProphet said:
(quoted above)
Thank you for the reply. A bit more about the SLOG: there are two configurations to add to the pool, LOG or CACHE. I was looking at what you recommend; something like this?
https://www.amazon.com/Intel-Optane...ane+900p+16gb&qid=1564614425&s=gateway&sr=8-1

The idea is that my test lab has 32GB of RAM with 1TB 7200rpm disks in RAID-10 under ZFS, but the real setup has 220GB of RAM and 600GB 15k RPM disks in RAID-10, so I'm testing and benchmarking before and after with fio before touching production. Will the 16GB model be enough for the SLOG? And since we're running MSSQL, would a LOG or a CACHE device be better?
Thank you
 

gea

Well-Known Member
Dec 31, 2010
2,479
835
113
DE
The 16/32G Optane is not really bad, but it is more like a very good SATA SSD. The Optanes from the 800P up are around 3x as fast; see https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf, chapter 2.9 vs 2.10.

Optane is fast enough that you can partition it for a dual-use SLOG+L2ARC. With 32G RAM an L2ARC may not be really helpful.

For a cheap home/lab server the 16/32G Optane may be OK (although I had trouble with them in some setups).

For a production system, use datacenter disks (DC 4801X), probably in an Optane-only pool for ultrafast databases without the need for an extra SLOG.
 

Albert Yang

gea said:
(quoted above)
Thanks for the reply. I'm going to try out the SLOG and post back the results.
Thank you again.
 

gea

Only a matter of price.
The higher-capacity 480x models are as fast as the 900P and guarantee PLP (power-loss protection), which the 900P does not, but the 4801-100 is not as fast as the 900P.

In my lab I use a 900P and see no reason to replace it. If I had to suggest a production setup and only the 4801-100 or the 900P were affordable, I would suggest the 4801.
 