ZFS benchmark

Albert Yang

Member
Oct 26, 2017
55
0
6
26
Hi,
So I'm currently trying to benchmark my ZFS setup before getting a SLOG. Here are a few questions, if someone could shed some light:

1) What is the rule of thumb when sizing a SLOG? Does it depend on the RAM, the size of the disks, or the pool? If I have 4 disks of 4TB, what size of SSD do I need?
2) Would this be sufficient? The 58GB SSD: https://www.amazon.com/Intel-Optane-800P-58GB-XPoint/dp/B078ZJSD6F
3) I'm currently benchmarking with fio, and yes, I have pretty bad results, but I was reading that it also depends on the physical disk's sector size (mine is 512) and on the VM storage volblocksize, which defaults to 8k. Not sure if changing it would help?
Code:
cat /sys/block/sda/queue/hw_sector_size
512
4) Currently I have the ARC max set to 2GB, which I think might be too low. I have 32GB in total, with 26GB used by the VMs (I probably need to add more RAM), but what is the rule of thumb for the ARC max?
5) Can turning compression off help?
6) Would setting atime=off also help on writes, given that the VMs are RAW inside Proxmox?
Code:
zfs set atime=off rpool
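Related to the benchmarks: since a SLOG only accelerates synchronous writes, I may also run a small-block sync write test before and after adding one. This is just a sketch on my side (filename and all parameters are a guess):

```shell
# Hypothetical 4k sync random-write run; this is the workload a SLOG helps with.
fio --name=synctest --filename=synctest.bin --rw=randwrite --bs=4k --sync=1 \
    --numjobs=1 --iodepth=1 --filesize=1G --runtime=60 --time_based \
    --group_reporting && rm synctest.bin
```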
Thank you

Well, these are the stats (pretty bad, I know).
Proxmox Host
Code:
command: fio --filename=test --sync=1 --rw=randwrite --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Random write

test: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=4
fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [f(1)][100.0%][w=475MiB/s][w=475 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=18230: Fri Jul 26 16:36:21 2019
  write: IOPS=167, BW=168MiB/s (176MB/s)(10.0GiB/61028msec)
    clat (usec): min=99, max=7805.3k, avg=5948.35, stdev=158625.94
     lat (usec): min=104, max=7805.3k, avg=5958.56, stdev=158625.92
    clat percentiles (usec):
     |  1.00th=[    108],  5.00th=[    113], 10.00th=[    120],
     | 20.00th=[    141], 30.00th=[    151], 40.00th=[    157],
     | 50.00th=[    165], 60.00th=[    178], 70.00th=[    198],
     | 80.00th=[    219], 90.00th=[    277], 95.00th=[    347],
     | 99.00th=[    652], 99.50th=[   1500], 99.90th=[2533360],
     | 99.95th=[3137340], 99.99th=[6945768]
   bw (  KiB/s): min=147456, max=2461696, per=100.00%, avg=1037687.90, stdev=663809.16, samples=20
   iops        : min=  144, max= 2404, avg=1013.30, stdev=648.27, samples=20
  lat (usec)   : 100=0.04%, 250=86.66%, 500=11.78%, 750=0.66%, 1000=0.17%
  lat (msec)   : 2=0.26%, 4=0.07%, 10=0.03%, 20=0.05%, 50=0.04%
  lat (msec)   : 250=0.07%, 500=0.01%, 2000=0.04%, >=2000=0.13%
  cpu          : usr=0.14%, sys=2.64%, ctx=7440, majf=8, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=168MiB/s (176MB/s), 168MiB/s-168MiB/s (176MB/s-176MB/s), io=10.0GiB (10.7GB), run=61028-61028msec



command: fio --filename=test --sync=1 --rw=randread --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Random read

test: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=4
fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=76.0MiB/s][r=76 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=337: Fri Jul 26 16:48:21 2019
  read: IOPS=73, BW=73.4MiB/s (76.0MB/s)(10.0GiB/139469msec)
    clat (usec): min=168, max=326933, avg=13616.37, stdev=8044.79
     lat (usec): min=168, max=326934, avg=13616.73, stdev=8044.80
    clat percentiles (usec):
     |  1.00th=[   260],  5.00th=[  7439], 10.00th=[  8717], 20.00th=[ 10159],
     | 30.00th=[ 11076], 40.00th=[ 11994], 50.00th=[ 12649], 60.00th=[ 13435],
     | 70.00th=[ 14222], 80.00th=[ 15270], 90.00th=[ 17695], 95.00th=[ 23462],
     | 99.00th=[ 42730], 99.50th=[ 59507], 99.90th=[ 99091], 99.95th=[111674],
     | 99.99th=[135267]
   bw (  KiB/s): min=16384, max=102400, per=99.88%, avg=75091.18, stdev=12184.43, samples=278
   iops        : min=   16, max=  100, avg=73.25, stdev=11.91, samples=278
  lat (usec)   : 250=0.82%, 500=0.85%
  lat (msec)   : 2=0.90%, 4=0.19%, 10=16.28%, 20=73.71%, 50=6.58%
  lat (msec)   : 100=0.58%, 250=0.09%, 500=0.01%
  cpu          : usr=0.03%, sys=1.70%, ctx=10205, majf=0, minf=268
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=73.4MiB/s (76.0MB/s), 73.4MiB/s-73.4MiB/s (76.0MB/s-76.0MB/s), io=10.0GiB (10.7GB), run=139469-139469msec


command: fio --filename=test --sync=1 --rw=read --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Sequential read   

test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=4
fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [R(1)][97.6%][r=335MiB/s][r=335 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=16042: Fri Jul 26 16:50:29 2019
  read: IOPS=256, BW=257MiB/s (269MB/s)(10.0GiB/39919msec)
    clat (usec): min=173, max=1988.9k, avg=3896.22, stdev=27967.08
     lat (usec): min=174, max=1988.9k, avg=3896.47, stdev=27967.15
    clat percentiles (usec):
     |  1.00th=[    204],  5.00th=[    217], 10.00th=[    229],
     | 20.00th=[    260], 30.00th=[    318], 40.00th=[   1156],
     | 50.00th=[   2147], 60.00th=[   2933], 70.00th=[   3720],
     | 80.00th=[   4817], 90.00th=[   7308], 95.00th=[  10159],
     | 99.00th=[  26870], 99.50th=[  36439], 99.90th=[  74974],
     | 99.95th=[ 274727], 99.99th=[1384121]
   bw (  KiB/s): min= 2048, max=401408, per=100.00%, avg=283533.05, stdev=99968.47, samples=73
   iops        : min=    2, max=  392, avg=276.88, stdev=97.62, samples=73
  lat (usec)   : 250=17.56%, 500=16.98%, 750=2.29%, 1000=1.92%
  lat (msec)   : 2=9.26%, 4=25.09%, 10=21.77%, 20=3.32%, 50=1.53%
  lat (msec)   : 100=0.21%, 250=0.01%, 500=0.02%, 750=0.01%, 2000=0.03%
  cpu          : usr=0.07%, sys=6.32%, ctx=7006, majf=0, minf=266
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=257MiB/s (269MB/s), 257MiB/s-257MiB/s (269MB/s-269MB/s), io=10.0GiB (10.7GB), run=39919-39919msec
  
  
 
 Command: fio --filename=test --sync=1 --rw=write --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
 
 Sequential write
 
 test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][90.2%][eta 00m:05s]                         
test: (groupid=0, jobs=1): err= 0: pid=23005: Fri Jul 26 16:52:19 2019
  write: IOPS=219, BW=219MiB/s (230MB/s)(10.0GiB/46746msec)
    clat (usec): min=91, max=9317.8k, avg=4554.20, stdev=160615.65
     lat (usec): min=97, max=9317.8k, avg=4564.12, stdev=160615.78
    clat percentiles (usec):
     |  1.00th=[     97],  5.00th=[    108], 10.00th=[    122],
     | 20.00th=[    130], 30.00th=[    137], 40.00th=[    143],
     | 50.00th=[    151], 60.00th=[    161], 70.00th=[    176],
     | 80.00th=[    188], 90.00th=[    212], 95.00th=[    330],
     | 99.00th=[   1532], 99.50th=[   1762], 99.90th=[ 105382],
     | 99.95th=[4462740], 99.99th=[6207570]
   bw (  MiB/s): min=    2, max= 2500, per=100.00%, avg=1418.39, stdev=863.69, samples=13
   iops        : min=    2, max= 2500, avg=1418.31, stdev=863.68, samples=13
  lat (usec)   : 100=2.60%, 250=90.42%, 500=3.31%, 750=0.24%, 1000=0.35%
  lat (msec)   : 2=2.62%, 4=0.12%, 10=0.14%, 20=0.02%, 100=0.01%
  lat (msec)   : 250=0.09%, 500=0.01%, 2000=0.01%, >=2000=0.07%
  cpu          : usr=0.20%, sys=2.89%, ctx=11012, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=10.0GiB (10.7GB), run=46746-46746msec

And the VM:

Code:
command: fio --filename=test --sync=1 --rw=randwrite --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
    
    randwrite
    
    fio-3.15-4-g029b
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=117MiB/s][w=117 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=30330: Fri Jul 26 16:56:09 2019
  write: IOPS=89, BW=89.5MiB/s (93.9MB/s)(10.0GiB/114364msec)
    clat (usec): min=918, max=595797, avg=11132.76, stdev=26966.94
     lat (usec): min=932, max=595843, avg=11161.68, stdev=26969.05
    clat percentiles (usec):
     |  1.00th=[  1074],  5.00th=[  1156], 10.00th=[  1221], 20.00th=[  1336],
     | 30.00th=[  1500], 40.00th=[  1713], 50.00th=[  1942], 60.00th=[  2343],
     | 70.00th=[  4883], 80.00th=[ 14222], 90.00th=[ 28181], 95.00th=[ 49021],
     | 99.00th=[131597], 99.50th=[170918], 99.90th=[295699], 99.95th=[375391],
     | 99.99th=[492831]
   bw (  KiB/s): min= 2048, max=286720, per=99.79%, avg=91493.11, stdev=49831.22, samples=228
   iops        : min=    2, max=  280, avg=89.28, stdev=48.66, samples=228
  lat (usec)   : 1000=0.07%
  lat (msec)   : 2=51.91%, 4=17.09%, 10=6.06%, 20=9.51%, 50=10.50%
  lat (msec)   : 100=3.17%, 250=1.49%, 500=0.18%, 750=0.01%
  cpu          : usr=0.32%, sys=5.31%, ctx=36214, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=89.5MiB/s (93.9MB/s), 89.5MiB/s-89.5MiB/s (93.9MB/s-93.9MB/s), io=10.0GiB (10.7GB), run=114364-114364msec

Disk stats (read/write):
  vda: ios=359/36257, merge=110/26936, ticks=260/128020, in_queue=128308, util=92.90%
 

command: fio --filename=test --sync=1 --rw=randread --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Random read

test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=49.0MiB/s][r=49 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=30377: Fri Jul 26 17:03:11 2019
  read: IOPS=27, BW=27.7MiB/s (29.0MB/s)(8300MiB/300017msec)
    clat (usec): min=873, max=3333.3k, avg=36138.39, stdev=91739.19
     lat (usec): min=873, max=3333.3k, avg=36139.11, stdev=91739.19
    clat percentiles (usec):
     |  1.00th=[   1012],  5.00th=[   1106], 10.00th=[   1254],
     | 20.00th=[  13042], 30.00th=[  18220], 40.00th=[  21890],
     | 50.00th=[  25560], 60.00th=[  29230], 70.00th=[  33817],
     | 80.00th=[  42206], 90.00th=[  63177], 95.00th=[  88605],
     | 99.00th=[ 202376], 99.50th=[ 375391], 99.90th=[1317012],
     | 99.95th=[2122318], 99.99th=[3338666]
   bw (  KiB/s): min= 2043, max=67584, per=100.00%, avg=30456.64, stdev=15658.90, samples=558
   iops        : min=    1, max=   66, avg=29.71, stdev=15.30, samples=558
  lat (usec)   : 1000=0.78%
  lat (msec)   : 2=16.05%, 4=0.51%, 10=1.07%, 20=16.02%, 50=50.54%
  lat (msec)   : 100=11.08%, 250=3.19%, 500=0.37%, 750=0.11%, 1000=0.06%
  lat (msec)   : 2000=0.13%, >=2000=0.07%
  cpu          : usr=0.04%, sys=1.04%, ctx=69453, majf=0, minf=204
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=8300,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=27.7MiB/s (29.0MB/s), 27.7MiB/s-27.7MiB/s (29.0MB/s-29.0MB/s), io=8300MiB (8703MB), run=300017-300017msec

Disk stats (read/write):
  vda: ios=72206/557, merge=5772/489, ticks=354680/1624, in_queue=356132, util=98.63%



command: fio --filename=test --sync=1 --rw=read --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test

Sequential read   


test: (groupid=0, jobs=1): err= 0: pid=30435: Fri Jul 26 17:06:41 2019
  read: IOPS=183, BW=184MiB/s (193MB/s)(10.0GiB/55699msec)
    clat (usec): min=394, max=3498.9k, avg=5432.27, stdev=39175.67
     lat (usec): min=395, max=3498.9k, avg=5432.98, stdev=39175.68
    clat percentiles (usec):
     |  1.00th=[   685],  5.00th=[   799], 10.00th=[   848], 20.00th=[   955],
     | 30.00th=[  1139], 40.00th=[  1418], 50.00th=[  2089], 60.00th=[  2933],
     | 70.00th=[  4015], 80.00th=[  5669], 90.00th=[  9110], 95.00th=[ 15795],
     | 99.00th=[ 43254], 99.50th=[ 64226], 99.90th=[206570], 99.95th=[526386],
     | 99.99th=[775947]
   bw (  KiB/s): min= 2048, max=348160, per=100.00%, avg=204690.43, stdev=96757.40, samples=102
   iops        : min=    2, max=  340, avg=199.86, stdev=94.50, samples=102
  lat (usec)   : 500=0.12%, 750=2.07%, 1000=21.33%
  lat (msec)   : 2=25.58%, 4=20.82%, 10=21.22%, 20=5.19%, 50=2.93%
  lat (msec)   : 100=0.46%, 250=0.21%, 500=0.03%, 750=0.04%, 1000=0.01%
  lat (msec)   : >=2000=0.01%
  cpu          : usr=0.29%, sys=5.94%, ctx=35793, majf=0, minf=269
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
   READ: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=10.0GiB (10.7GB), run=55699-55699msec

Disk stats (read/write):
  vda: ios=41292/1655, merge=387/4867, ticks=106828/1292, in_queue=108056, util=98.08%


 Command: fio --filename=test --sync=1 --rw=write --bs=1m --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
 
 Sequential write
 
 test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=62.1MiB/s][w=62 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=30539: Fri Jul 26 17:10:32 2019
  write: IOPS=127, BW=128MiB/s (134MB/s)(10.0GiB/80072msec)
    clat (usec): min=851, max=1115.1k, avg=7792.02, stdev=32943.88
     lat (usec): min=862, max=1115.2k, avg=7814.53, stdev=32945.37
    clat percentiles (usec):
     |  1.00th=[   947],  5.00th=[  1004], 10.00th=[  1045], 20.00th=[  1123],
     | 30.00th=[  1205], 40.00th=[  1303], 50.00th=[  1467], 60.00th=[  1631],
     | 70.00th=[  1827], 80.00th=[  2311], 90.00th=[ 14222], 95.00th=[ 29492],
     | 99.00th=[135267], 99.50th=[223347], 99.90th=[429917], 99.95th=[566232],
     | 99.99th=[624952]
   bw (  KiB/s): min= 2048, max=362496, per=100.00%, avg=131168.47, stdev=80230.28, samples=159
   iops        : min=    2, max=  354, avg=128.04, stdev=78.35, samples=159
  lat (usec)   : 1000=4.51%
  lat (msec)   : 2=70.53%, 4=10.39%, 10=2.52%, 20=4.36%, 50=4.70%
  lat (msec)   : 100=1.55%, 250=1.05%, 500=0.33%, 750=0.05%, 2000=0.01%
  cpu          : usr=0.45%, sys=5.85%, ctx=35313, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=10.0GiB (10.7GB), run=80072-80072msec

Disk stats (read/write):
  vda: ios=309/33730, merge=202/17093, ticks=13188/81368, in_queue=94600, util=92.71%
 

gea

Well-Known Member
Dec 31, 2010
2,479
835
113
DE
1.
The SLOG must be at least 2x the size of the RAM-based write cache;
10 GB is the lower limit.

2.
Any Optane from the 800P-58 up is a very good SLOG for home or lab use.
The upper-class/datacenter Optane like the 4801X is only a little faster,
but it offers better write endurance and guaranteed power-loss protection,
which makes it a perfect SLOG for a production server.

3.
I have made a series of filebench benchmarks on Solarish with different types
of pools, sync enabled/disabled, RAM sizes, and SLOGs (from different disks, SSDs, ZeusRAM, NVMe, Optane). If you have filebench you can compare; otherwise you can use it to
decide about pool layouts.

https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf

4.
The read cache makes random I/O and access to metadata fast.
2 GB is stable but way too small if you want performance. Think of 8-32 GB or more.

Multi-VM performance is closely tied to IOPS and small random reads/writes.
Without enough RAM you are limited by pure disk performance. That is not a problem
with a pool made of Optane, but it is bad with disks.

See ESXi VM performance vs RAM:
https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf
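To act on the read-cache advice above: on Linux Open-ZFS the ARC cap is the zfs_arc_max module parameter. A minimal sketch, assuming an 8 GiB target (the figure and paths are mine, not part of the reply):

```shell
# Compute an 8 GiB ARC cap in bytes.
ARC_MAX=$((8 * 1024 * 1024 * 1024))
echo "$ARC_MAX"   # 8589934592
# Apply live (needs root):
#   echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
# Persist across reboots:
#   echo "options zfs zfs_arc_max=$ARC_MAX" >> /etc/modprobe.d/zfs.conf
```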

5.
No, keep compression on and use LZ4.

6.
Yes, atime=off reduces write operations; disable it.
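Points 5 and 6 above translate to two one-liners; "rpool" here is just the pool name from earlier in the thread:

```shell
# Keep compression enabled with LZ4 (cheap on CPU, often speeds up I/O):
zfs set compression=lz4 rpool
# Skip access-time updates to avoid a metadata write per read:
zfs set atime=off rpool
# Verify:
zfs get compression,atime rpool
```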
 

Albert Yang

Thanks for the reply.
1) As for the SLOG size, I did not really get what you mean by 2x the write cache size?
2) Yes, I remember you mentioned it; I want to give this one a try before buying the other Intel Optane to see the difference.
3) Thank you, I will check the benchmark. But what about the volblocksize? It differs between the VM and the hardware; is it recommended to change it?

Thank you
 

gea

The only task of a SLOG is to protect the content of the RAM-based write cache.
On Open-ZFS the default size of the write cache is 10% of RAM, max 4GB (the SLOG is not a write cache).

When the write cache is full, it is written to the pool as a large, fast sequential write,
while a second area of the RAM cache takes new writes.

This is why a SLOG must be 2x the write cache size (2 x 4GB) as a minimum.
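The sizing rule above can be checked with a little shell arithmetic; the 32 GiB RAM figure comes from earlier in the thread, and the 10 GiB floor is the lower limit stated in the first reply:

```shell
RAM_GB=32
WCACHE_GB=$((RAM_GB / 10))                               # write cache is ~10% of RAM...
if [ "$WCACHE_GB" -gt 4 ]; then WCACHE_GB=4; fi          # ...capped at 4 GiB on Open-ZFS
SLOG_MIN_GB=$((2 * WCACHE_GB))                           # must hold two cache flights
if [ "$SLOG_MIN_GB" -lt 10 ]; then SLOG_MIN_GB=10; fi    # 10 GiB lower limit
echo "minimum SLOG size: ${SLOG_MIN_GB} GiB"             # -> minimum SLOG size: 10 GiB
```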

On ZFS the default recordsize is 128k, and this is a good compromise.
For VM storage I recommend settings between 32k and 64k.
For a single-user media filer, 1M may be better.
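In command form, the advice above might look like the following sketch; "tank/vms" and the zvol name are hypothetical, and note that a zvol's volblocksize can only be chosen at creation time:

```shell
# Datasets: recordsize can be changed at any time (affects newly written data only).
zfs set recordsize=64k tank/vms
# Zvols (e.g. Proxmox RAW vdisks): volblocksize is fixed at creation.
zfs create -V 32G -o volblocksize=32k tank/vms/vm-100-disk-1
```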
 

Albert Yang

Thanks for the reply. So what would that look like, for example, with 32GB of RAM and a 7TB pool?

As for the VM storage: it is currently 8k, so would changing it to 32k or 64k help on reads, even though the disk's physical sector size is 512?

Thank you
 

gea

There is no direct relation between RAM size and pool size, but there is a relation between RAM size and performance (or the active amount of pool data), especially with slow disks. On a performant ZFS system, more than 80% of random reads can be delivered from the RAM read cache.

Do not confuse the physical sector size of disks (512B or 4k), or the blocksize of e.g. an iSCSI LUN or vdisk (8k), with the ZFS recordsize (default 128k). The first two are a given. Only the ZFS recordsize (how many physical blocks are read/written in one I/O; dedup, compression and checksums relate to it) is a tunable performance parameter.
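The three layers distinguished above can each be inspected separately; the device and dataset names here are placeholders:

```shell
# Physical sector size of the disk (a given):
cat /sys/block/sda/queue/physical_block_size
# Block size of a zvol-backed vdisk (fixed when the zvol was created):
zfs get volblocksize rpool/data/vm-100-disk-0
# ZFS recordsize (the tunable layer):
zfs get recordsize rpool
```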
 

Albert Yang

Thank you for the reply. In our previous conversation you mentioned:
"If you want to use a single disk as an slog for 4 datapools you must create 4 partition with a minimum size of 10 GB each. Add a partition per pool as slog."
So as long as I have 1 pool, the minimum for a SLOG is 10GB per pool? I was reading this document, and I think that's why I'm somewhat confused:
https://martin.heiland.io/2018/02/23/zfs-tuning/

And about the SLOG: I was reading that to improve read performance I would add a cache device, while to improve writes I would add a log device. But couldn't I just add both on the 58GB device?

Thank you
 

gea

A SLOG is a vdev that you add to the pool as a log device,
so you need a SLOG per pool.

Normally I would not dual-use a disk for SLOG and L2ARC.
Intel Optane is so fast that you can do it without serious disadvantages (especially as your RAM is low).

The recommended size for an L2ARC is 5x RAM (never more than 10x).
With 2 GB RAM you may add a 10-20 GB L2ARC (the L2ARC requires RAM to manage), but seriously, use more RAM. Even on Solarish, which has the lowest RAM needs for ZFS, I would prefer 8GB+ and would not go below 4GB.
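If you do go the dual-use route described above, one possible sketch on Linux is to partition the Optane and give one partition to each role (the device name, sizes, and the GPT type code are assumptions on my side):

```shell
# Two GPT partitions on a hypothetical Optane at /dev/nvme0n1:
sgdisk -n1:0:+10G -t1:bf01 /dev/nvme0n1   # 10 GiB for the SLOG
sgdisk -n2:0:+40G -t2:bf01 /dev/nvme0n1   # 40 GiB for the L2ARC
# Attach them to the pool as log and cache vdevs:
zpool add rpool log /dev/nvme0n1p1
zpool add rpool cache /dev/nvme0n1p2
```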
 

BackupProphet

Well-Known Member
Jul 2, 2014
796
284
63
Stavanger, Norway
kingmakers.no
A SLOG is not a write cache.

A SLOG takes only synchronous writes. That means writes with the following behavior:
write 4kb (2us latency) ->
flush to disk (120us latency) ->
write another 8kb (3us latency) ->
flush to disk (160us latency) ->
write 4kb (2us latency) ->
flush to disk (120 us latency)

Total time spent: 407 us.

Without a fast dedicated slog it would be something more like
write 4kb (2us latency) ->
flush to disk (4500us latency) ->
write another 8kb (3us latency) ->
flush to disk (4500us latency) ->
write 4kb (2us latency) ->
flush to disk (4500 us latency)

Total time spent: 13507 us.

Most writes are asynchronous and behave as follows:
write 4kb (2us latency) ->
write another 8kb (3us latency) ->
write 4kb (2us latency)

Total time spent 7 us.

Your operating system will handle the disk flushes automatically.
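The totals in the example above can be checked with shell arithmetic (latencies in microseconds as given):

```shell
echo $((2 + 120 + 3 + 160 + 2 + 120))      # sync writes with a fast SLOG -> 407
echo $((2 + 4500 + 3 + 4500 + 2 + 4500))   # sync writes without one -> 13507
echo $((2 + 3 + 2))                        # async writes, no flushes -> 7
```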

The Optane 800P is not something I would recommend either; it has low endurance and is pricey. Either get a 900P 280GB, or get the 16GB model, which is available cheap.

To leave more RAM for the disk cache, consider increasing your swap size. Too many configurations have no swap at all. I highly recommend a few GBs, and the Intel Optane is perfect for swap :)
 

Albert Yang

gea said:
(quoted above)
Thanks for the reply. About the L2ARC sizing: is that 10GB of SLOG? To be clear, this is going to be the lab before putting it into production on a server with 220GB of RAM and RAID-10 15k RPM disks of 600GB each, so before that I need to understand how to distribute the SLOG. As for the mirror, it is just in case the disk dies, so I have time to replace the SSD. So, rule of thumb: 10GB per pool? But one SLOG SSD can only be configured as either cache or log, not both, right?
Sorry for my ignorance.

Thank you
 

Albert Yang

BackupProphet said:
(quoted above)
Thank you for the reply. A bit more about the SLOG: there are two configurations to add to the pool, LOG or CACHE. I was looking at what you recommend; something like this?
https://www.amazon.com/Intel-Optane...ane+900p+16gb&qid=1564614425&s=gateway&sr=8-1

The idea is that my test lab has 32GB of RAM with 1TB 7200rpm disks in RAID-10 under ZFS, but the real setup has 220GB of RAM and 600GB 15k RPM disks in RAID-10, so I'm testing and benchmarking before and after with fio before touching production. Will the 16GB model be enough for the SLOG? And since we're running MSSQL, would a LOG or a CACHE device be better?
Thank you
 

gea

Well-Known Member
Dec 31, 2010
2,479
835
113
DE
The 16/32G Optane is not really bad, but it is more like a very good SATA SSD. The Optanes from the 800P up are around 3x as fast; see https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf, chapter 2.9 vs 2.10.

Optane is fast enough that you can partition it for a dual-use SLOG+L2ARC. With 32G RAM an L2ARC may not be really helpful.

For a cheap home/lab server the 16/32G Optane may be OK (although I had trouble with them in some setups).

For a production system, use datacenter disks (DC 4801X), probably in an Optane-only pool for ultrafast databases without the need for an extra SLOG.
 

Albert Yang

gea said:
(quoted above)
Thanks for the reply. I'm going to try out the SLOG and post back the results.
Thank you again.
 

gea

Only a matter of price.
The higher-capacity 480x models are as fast as the 900P and guarantee PLP (power-loss protection), which the 900P does not, but the 4801-100 is not as fast as the 900P.

In my lab I use a 900P and see no reason to replace it. If I had to suggest a production setup and only the 4801-100 or the 900P were affordable, I would suggest the 4801.
 