NVMe-oF 900p quick benchmarks


_alex

I just made some quick and dirty benchmarks that might be of interest: exporting a 200 GB partition of an Optane 900p via NVMe-oF.

Fabric is ConnectX-3 FDR10 through a switch, single port on each host.
Hosts are both Supermicro X10 with 16 GB DDR4-2133.
The target has an E5-2683 v3, the initiator an E5-1620 v3.
OS: Proxmox 5.1
NVMe-oF via the built-in kernel modules / nvme-cli
The partition is formatted with ext4 and mounted, on both sides with no special options.
CPU load on the target didn't exceed roughly 5% while fio ran on the initiator.

'Target' means a local direct mount of the partition on the target host, no NVMe-oF.
'Initiator' means the mount after 'nvme connect -t rdma ...' on the initiator host.
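For reference, the export on the target side went through the nvmet configfs interface and the connect/mount on the initiator via nvme-cli, roughly like below. This is only a sketch from memory; the NQN, the address 10.0.0.1 and the partition /dev/nvme0n1p1 are placeholders, not my exact values.

Code:
# --- target side: export the 900p partition via nvmet over RDMA (sketch) ---
modprobe nvmet-rdma
cd /sys/kernel/config/nvmet

# subsystem + namespace backed by the 900p partition
mkdir subsystems/nqn.2017-12.local:900p
echo 1 > subsystems/nqn.2017-12.local:900p/attr_allow_any_host
mkdir subsystems/nqn.2017-12.local:900p/namespaces/1
echo /dev/nvme0n1p1 > subsystems/nqn.2017-12.local:900p/namespaces/1/device_path
echo 1 > subsystems/nqn.2017-12.local:900p/namespaces/1/enable

# RDMA port and link the subsystem to it
mkdir ports/1
echo rdma     > ports/1/addr_trtype
echo ipv4     > ports/1/addr_adrfam
echo 10.0.0.1 > ports/1/addr_traddr
echo 4420     > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2017-12.local:900p ports/1/subsystems/

# --- initiator side: connect and mount ---
modprobe nvme-rdma
nvme connect -t rdma -a 10.0.0.1 -s 4420 -n nqn.2017-12.local:900p
mount /dev/nvme0n1 /mnt/900p    # device name on the initiator may differ, check 'nvme list'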

The purpose of this was just to get a quick idea of NVMe-oF; I want to compare it with SRP.

fio:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --readwrite=randrw --size=4G --numjobs=8 --group_reporting --rwmixread=XXX

Target, --rwmixread=100
Code:
Starting 8 processes
Jobs: 2 (f=2): [_(3),r(2),_(3)] [88.2% done] [2276MB/0KB/0KB /s] [583K/0/0 iops] [eta 00m:02s]
test: (groupid=0, jobs=8): err= 0: pid=2210: Sun Dec  3 12:43:57 2017
  read : io=32768MB, bw=2274.7MB/s, iops=582299, runt= 14406msec
  cpu          : usr=15.37%, sys=25.67%, ctx=7400831, majf=0, minf=68
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=8388608/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=32768MB, aggrb=2274.7MB/s, minb=2274.7MB/s, maxb=2274.7MB/s, mint=14406msec, maxt=14406msec

Disk stats (read/write):
  nvme0n1: ios=8380701/0, merge=0/0, ticks=6584136/0, in_queue=6627756, util=99.92%
Initiator, --rwmixread=100
Code:
Starting 8 processes
Jobs: 8 (f=8): [r(8)] [100.0% done] [2277MB/0KB/0KB /s] [583K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=28008: Sun Dec  3 12:48:03 2017
  read : io=32768MB, bw=2272.9MB/s, iops=581653, runt= 14422msec
  cpu          : usr=15.14%, sys=39.74%, ctx=6564187, majf=0, minf=62
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=8388608/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=32768MB, aggrb=2272.9MB/s, minb=2272.9MB/s, maxb=2272.9MB/s, mint=14422msec, maxt=14422msec

Disk stats (read/write):
  nvme0n1: ios=8374700/0, merge=0/0, ticks=7247468/0, in_queue=7432896, util=100.00%
Target, --rwmixread=0
Code:
Starting 8 processes
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 8 (f=8): [w(8)] [100.0% done] [0KB/2036MB/0KB /s] [0/521K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=13201: Sun Dec  3 13:06:12 2017
  write: io=32768MB, bw=2031.4MB/s, iops=520030, runt= 16131msec
  cpu          : usr=4.26%, sys=95.60%, ctx=24911, majf=0, minf=70
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=8388608/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=32768MB, aggrb=2031.4MB/s, minb=2031.4MB/s, maxb=2031.4MB/s, mint=16131msec, maxt=16131msec

Disk stats (read/write):
  nvme0n1: ios=0/8267104, merge=0/3, ticks=0/223076, in_queue=237644, util=100.00%
Initiator, --rwmixread=0
Code:
Starting 8 processes
Jobs: 8 (f=8): [w(8)] [100.0% done] [0KB/1527MB/0KB /s] [0/391K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=31060: Sun Dec  3 12:50:05 2017
  write: io=32768MB, bw=1523.8MB/s, iops=390077, runt= 21505msec
  cpu          : usr=4.83%, sys=85.82%, ctx=348697, majf=0, minf=61
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=8388608/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=32768MB, aggrb=1523.8MB/s, minb=1523.8MB/s, maxb=1523.8MB/s, mint=21505msec, maxt=21505msec

Disk stats (read/write):
  nvme0n1: ios=0/8341294, merge=0/4, ticks=0/225736, in_queue=229360, util=100.00%
Target, --rwmixread=75
Code:
Starting 8 processes
Jobs: 8 (f=8): [m(8)] [100.0% done] [1165MB/390.9MB/0KB /s] [298K/100K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=6065: Sun Dec  3 12:51:04 2017
  read : io=24575MB, bw=1155.4MB/s, iops=295763, runt= 21271msec
  write: io=8193.8MB, bw=394420KB/s, iops=98604, runt= 21271msec
  cpu          : usr=5.08%, sys=32.74%, ctx=6843756, majf=0, minf=69
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=6291182/w=2097426/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=24575MB, aggrb=1155.4MB/s, minb=1155.4MB/s, maxb=1155.4MB/s, mint=21271msec, maxt=21271msec
  WRITE: io=8193.8MB, aggrb=394419KB/s, minb=394419KB/s, maxb=394419KB/s, mint=21271msec, maxt=21271msec

Disk stats (read/write):
  nvme0n1: ios=6255606/2085554, merge=0/4, ticks=96736/38788, in_queue=138448, util=100.00%
Initiator, --rwmixread=75
Code:
Starting 8 processes
Jobs: 8 (f=8): [m(8)] [100.0% done] [1135MB/380.5MB/0KB /s] [291K/97.4K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=1742: Sun Dec  3 12:52:01 2017
  read : io=24575MB, bw=1214.9MB/s, iops=310998, runt= 20229msec
  write: io=8193.8MB, bw=414736KB/s, iops=103684, runt= 20229msec
  cpu          : usr=7.72%, sys=40.42%, ctx=6235672, majf=0, minf=61
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=6291182/w=2097426/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=24575MB, aggrb=1214.9MB/s, minb=1214.9MB/s, maxb=1214.9MB/s, mint=20229msec, maxt=20229msec
  WRITE: io=8193.8MB, aggrb=414736KB/s, minb=414736KB/s, maxb=414736KB/s, mint=20229msec, maxt=20229msec

Disk stats (read/write):
  nvme0n1: ios=6264696/2088706, merge=0/4, ticks=207960/66428, in_queue=280896, util=100.00%
Edit: posted a clean(er) run of rwmixread=0 on the initiator; I had forgotten to call sync, and it now reaches > 500k IOPS. So pure writes clearly suffer more over NVMe-oF.
Edit2: Explained 'Target' and 'Initiator'
 

_alex

First observations:
- reads on the initiator don't really suffer
- writes have a penalty of about 130k IOPS
- the mixed workload seems to have no penalty; it was actually a bit faster on the initiator, I guess due to the higher clock frequency of the 1620 v3
 

Patrick

Checking this from my phone, but did you try local results for comparison to NVMe-oF?
 

_alex

If anyone has an idea how to fight this loss in write IOPS, I'd be happy to dive into some tuning.
Currently I have no idea where to start.

Both servers have only a single NUMA node, so that can't be the issue.
Reads and also mixed workloads are fine, so setting IRQ affinity presumably won't help.

My best guess is the mlx4_core / mlx4_ib module parameters.
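For starters I'll probably just dump what the modules expose and go from there. A sketch of that, nothing applied yet; log_mtts_per_seg is only an example of a commonly tuned mlx4_core parameter, not a recommendation:

Code:
# list the parameters the modules expose
modinfo -p mlx4_core
modinfo -p mlx4_ib

# current values of the loaded module
grep -H . /sys/module/mlx4_core/parameters/* 2>/dev/null

# example of persisting a change on Proxmox/Debian (log_mtts_per_seg affects memory registration)
echo "options mlx4_core log_mtts_per_seg=4" > /etc/modprobe.d/mlx4.conf
update-initramfs -u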
 

Jeggs101

It may also be the network.

Almost 400k write IOPS over the network is good. What's your performance at lower queue depths, like QD4 or QD8?
 

_alex

This is an FDR10 IB network (not RoCE or IPoIB, just plain RDMA) that handled > 580k read IOPS and about 414k combined mixed,
so I guess it should be possible to get some more IOPS out of pure writes.
I think the next step is a check with null_blk; that should reveal what the network itself can handle (see the sketch below).
Maybe the network and device latencies together limit writes to the number I saw.

QD4 / QD8 is not that bad; I don't have the numbers at the moment and the servers are turned off,
but I remember that QD1 was still well above 30k IOPS.

I will definitely do more benchmarks with different block sizes and queue depths, but I'd like to rule out possible bottlenecks/misconfiguration first.
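The null_blk check itself should only need swapping the backing device of the nvmet namespace; a sketch, reusing the placeholder NQN from the first post:

Code:
# target: RAM-backed block device that completes I/O immediately
modprobe null_blk nr_devices=1 queue_mode=2 irqmode=0 completion_nsec=0 bs=4096

# point the exported namespace at it instead of the 900p partition, then rerun fio on the initiator
cd /sys/kernel/config/nvmet/subsystems/nqn.2017-12.local:900p/namespaces/1
echo 0 > enable
echo /dev/nullb0 > device_path
echo 1 > enable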
 

_alex

ext4 was the bottleneck; writes to the unmounted raw device give 560k IOPS.
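The raw-device run is just the same fio line as above pointed at the block device instead of a file on ext4 (careful, this overwrites the partition; the device name is whatever it shows up as on the respective host):

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/nvme0n1 --bs=4k --iodepth=64 --readwrite=randrw --size=4G --numjobs=8 --group_reporting --rwmixread=0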

What's your lower QD like QD4 or 8 performance?
All done on the initiator:
iodepth=4, numjobs=1:
reads and writes the same, 175k; 75/25 r/w mix is 128k read, 42k write

iodepth=8, numjobs=1:
reads 310k, writes 302k; 75/25 r/w mix is 227k read, 76k write

for fun, iodepth=1:
reads 48k, writes 49.8k; 75/25 r/w mix is 37k read, 12k write
 

niekbergboer

That 12k IOPS write is in line with what I saw using my new 900p as an SLOG on my 4-drive raidz2 pool: 35 MByte/s of actual writes to the 900p while running Bonnie++ with its byte-at-a-time writes on a sync=always dataset.

I use ~25 GByte of the 900p as SLOG and the remaining ~250 GByte as L2ARC cache. To make the most of the cache, I set the L2ARC write speed to 500 MByte/s, and I also cache sequential reads off the spinners.
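Concretely, those two knobs are module parameters (a sketch, assuming ZFS on Linux; on FreeBSD/FreeNAS the equivalents are sysctls):

Code:
# feed the L2ARC at up to ~500 MByte/s per feed interval (default is 8 MByte/s)
echo 524288000 > /sys/module/zfs/parameters/l2arc_write_max

# also cache prefetched (sequential) reads in the L2ARC
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

# persist across reboots
echo "options zfs l2arc_write_max=524288000 l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf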

These are four slow WD Red 8TB drives, but with the 900p it feels like one huge SSD pool (of somewhat slower SSDs).

Suffice it to say I'm very happy with my 900p.
 

_alex

How did you call bonnie++, just without args?

I'm just trying to figure out how to best use/distribute the massive amount of IOPS this amazing device delivers; putting everything on a single node would be a big waste.

I just managed to import a part of the 900p on my ESOS box, which quite surprisingly worked :)

I'm currently comparing it as SLOG against a Micron M500DC, on an unfortunately quite slow pool of 2x2 3TB SATA spinners (mixed WD Black and Seagate) in 'RAID 10'.

Here the 900p over IB beats the local M500DC, 4940 vs 3968 IOPS in the first quick run.
Edit: forget this, I need to have a closer look, but it's definitely close.

For sure an SLOG via NVMe-oF can not be considered safe, but with kernel 4.15 multipath I/O for NVMe will become available.
RFC: nvme multipath support [LWN.net]
kernel/git/torvalds/linux.git - Linux kernel source tree

If not with this release, I guess multipath I/O for NVMe-oF will be available soon.

So options for a fast SLOG on remote servers in a safe way (e.g. two boxes, each exporting SLOGs over redundant paths) will definitely become available :D
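If the native multipath code lands as described in the RFC, using it from the initiator should boil down to connecting to the same subsystem over two fabric paths and letting the kernel collapse them into a single namespace. A sketch under that assumption, with placeholder addresses and the placeholder NQN from above, untested since 4.15 isn't out yet:

Code:
# connect to the same subsystem over two separate fabric paths
nvme connect -t rdma -a 10.0.0.1 -s 4420 -n nqn.2017-12.local:900p
nvme connect -t rdma -a 10.0.1.1 -s 4420 -n nqn.2017-12.local:900p

# with CONFIG_NVME_MULTIPATH the namespace should show up only once
# (a single /dev/nvmeXnY backed by both controllers) - put the SLOG on that
nvme list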
 