MDRAID performance issues with NVMe and NUMA?


chocamo

New Member
Sep 27, 2021
I have a fairly long thread on level1techs, and it was suggested I ask whether anyone here has ideas.


I realized I was having an issue when certain PostgreSQL tasks were taking about 8 times longer than they do on an 8x Samsung 850 EVO SATA array in a different system.

It seems that when I put disks in an MDRAID (1 or 10), I end up with roughly 50% of single-disk performance for every disk I add. After a lot of back and forth, I've found that with fio I see the same results running separate fio jobs against different disks simultaneously, UNLESS I specify numa_mem_policy and cpus_allowed, in which case the performance hit is nowhere near as bad. My primary issue is that I have no idea how to apply NUMA optimizations at the various levels to ultimately improve PostgreSQL performance. I *think* this is my issue, but I'm open to any other suggestions. I'm also not sure whether a newer-generation EPYC with UMA would avoid this issue (and whether the Dell R7425 would even accept Zen 2 or later). Maybe my testing methodology is bad, maybe I'm bottlenecking somewhere else, I don't know; I'm doubting everything at this point, so any suggestions are greatly appreciated.
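For reference, the only NUMA control I know how to apply outside of fio is numactl on the whole process, so for Postgres I was going to try something along these lines (just a sketch; node 5 is where the drives attach per the lstopo output below, and the binary/data paths are placeholders for my install):

Code:
# stop the service, then start the cluster bound (CPUs + memory) to NUMA node 5
sudo systemctl stop postgresql
sudo -u postgres numactl --cpunodebind=5 --membind=5 \
    /usr/lib/postgresql/12/bin/pg_ctl -D /path/to/data start
One caveat I'm aware of: --membind caps the instance at that node's memory (63GB on this box), so shared_buffers and friends have to fit.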

Code:
Ubuntu 20.04
2 x AMD EPYC 7601
384GB of DDR4 2666 MT/s , 16GB sticks.
8 x Kioxia KCM6XVUL1T60 NVMe drives (PCIe 4.0 drives in a PCIe 3.0 system; tested other drives as well, same issue)
This is the gist of where I'm at.

lscpu
Code:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           1
Model name:                      AMD EPYC 7601 32-Core Processor
Stepping:                        2
CPU MHz:                         2195.847
BogoMIPS:                        4391.69
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       4 MiB
L2 cache:                        32 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120
NUMA node1 CPU(s):               2,10,18,26,34,42,50,58,66,74,82,90,98,106,114,122
NUMA node2 CPU(s):               4,12,20,28,36,44,52,60,68,76,84,92,100,108,116,124
NUMA node3 CPU(s):               6,14,22,30,38,46,54,62,70,78,86,94,102,110,118,126
NUMA node4 CPU(s):               1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121
NUMA node5 CPU(s):               3,11,19,27,35,43,51,59,67,75,83,91,99,107,115,123
NUMA node6 CPU(s):               5,13,21,29,37,45,53,61,69,77,85,93,101,109,117,125
NUMA node7 CPU(s):               7,15,23,31,39,47,55,63,71,79,87,95,103,111,119,127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev sev_es
lstopo
Code:
Machine (378GB total)
  Package L#0
    Die L#0
      NUMANode L#0 (P#0 63GB)
      # redacted for brevity
      HostBridge
       # redacted for brevity
    Die L#1
      NUMANode L#1 (P#1 63GB)
     # redacted for brevity
    Die L#2
      NUMANode L#2 (P#2 31GB)
     # redacted for brevity
    Die L#3
      NUMANode L#3 (P#3 31GB)
     # redacted for brevity
  Package L#1
    Die L#4
      NUMANode L#4 (P#4 63GB)
     # redacted for brevity
    Die L#5
      NUMANode L#5 (P#5 63GB)
      L3 L#10 (8192KB)
        L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (64KB) + Core L#40
          PU L#80 (P#3)
          PU L#81 (P#67)
        L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (64KB) + Core L#41
          PU L#82 (P#19)
          PU L#83 (P#83)
        L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (64KB) + Core L#42
          PU L#84 (P#35)
          PU L#85 (P#99)
        L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (64KB) + Core L#43
          PU L#86 (P#51)
          PU L#87 (P#115)
      L3 L#11 (8192KB)
        L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (64KB) + Core L#44
          PU L#88 (P#11)
          PU L#89 (P#75)
        L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (64KB) + Core L#45
          PU L#90 (P#27)
          PU L#91 (P#91)
        L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (64KB) + Core L#46
          PU L#92 (P#43)
          PU L#93 (P#107)
        L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (64KB) + Core L#47
          PU L#94 (P#59)
          PU L#95 (P#123)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI a3:00.0 (NVMExp)
                Block(Disk) "nvme0n1"
            PCIBridge
              PCI a4:00.0 (NVMExp)
                Block(Disk) "nvme1n1"
            PCIBridge
              PCI a5:00.0 (NVMExp)
                Block(Disk) "nvme2n1"
            PCIBridge
              PCI a6:00.0 (NVMExp)
                Block(Disk) "nvme3n1"
            PCIBridge
              PCI a7:00.0 (NVMExp)
                Block(Disk) "nvme4n1"
            PCIBridge
              PCI a8:00.0 (NVMExp)
                Block(Disk) "nvme5n1"
            PCIBridge
              PCI a9:00.0 (NVMExp)
                Block(Disk) "nvme6n1"
            PCIBridge
              PCI aa:00.0 (NVMExp)
                Block(Disk) "nvme7n1"
            PCIBridge
              PCI ad:00.0 (NVMExp)
                Block(Disk) "nvme8n1"
            PCIBridge
              PCI ae:00.0 (NVMExp)
                Block(Disk) "nvme9n1"
    Die L#6
      NUMANode L#6 (P#6 31GB)
      # redacted for brevity
    Die L#7
      NUMANode L#7 (P#7 31GB)
      # redacted for brevity

===
Running without fio numa arguments
===
Code:
fio --filename=$DISK --size=500GB --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --runtime=30 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1
/dev/nvme0n1
Code:
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][9.7%][r=2059MiB/s][r=527k IOPS][eta 00m:28s]
Jobs: 4 (f=4): [r(4)][13.3%][r=2114MiB/s][r=541k IOPS][eta 00m:26s]
Jobs: 4 (f=4): [r(4)][20.0%][r=2229MiB/s][r=571k IOPS][eta 00m:24s]
Jobs: 4 (f=4): [r(4)][26.7%][r=2175MiB/s][r=557k IOPS][eta 00m:22s]
Jobs: 4 (f=4): [r(4)][33.3%][r=1101MiB/s][r=282k IOPS][eta 00m:20s] <-- when second fio below started
Jobs: 4 (f=4): [r(4)][40.0%][r=1019MiB/s][r=261k IOPS][eta 00m:18s]
Jobs: 4 (f=4): [r(4)][46.7%][r=1018MiB/s][r=261k IOPS][eta 00m:16s]
Jobs: 4 (f=4): [r(4)][53.3%][r=1017MiB/s][r=260k IOPS][eta 00m:14s]
Jobs: 4 (f=4): [r(4)][60.0%][r=1016MiB/s][r=260k IOPS][eta 00m:12s]
Jobs: 4 (f=4): [r(4)][66.7%][r=1017MiB/s][r=260k IOPS][eta 00m:10s]
Jobs: 4 (f=4): [r(4)][73.3%][r=1011MiB/s][r=259k IOPS][eta 00m:08s]
Jobs: 4 (f=4): [r(4)][80.0%][r=1018MiB/s][r=261k IOPS][eta 00m:06s]
Jobs: 4 (f=4): [r(4)][86.7%][r=1006MiB/s][r=257k IOPS][eta 00m:04s]
Jobs: 4 (f=4): [r(4)][93.3%][r=988MiB/s][r=253k IOPS][eta 00m:02s]
Jobs: 4 (f=4): [r(4)][100.0%][r=987MiB/s][r=253k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=274084: Tue Sep 28 10:00:29 2021
  read: IOPS=343k, BW=1340MiB/s (1405MB/s)(39.3GiB/30001msec)
    slat (usec): min=2, max=4031, avg= 4.45, stdev= 2.91
    clat (usec): min=53, max=8357, avg=367.57, stdev=136.48
     lat (usec): min=59, max=8660, avg=372.15, stdev=136.00
    clat percentiles (usec):
     |  1.00th=[  186],  5.00th=[  194], 10.00th=[  198], 20.00th=[  204],
     | 30.00th=[  210], 40.00th=[  318], 50.00th=[  453], 60.00th=[  478],
     | 70.00th=[  486], 80.00th=[  494], 90.00th=[  506], 95.00th=[  515],
     | 99.00th=[  529], 99.50th=[  537], 99.90th=[  603], 99.95th=[  619],
     | 99.99th=[  668]
   bw (  MiB/s): min=  985, max= 2433, per=100.00%, avg=1346.57, stdev=131.57, samples=236
   iops        : min=252334, max=622988, avg=344723.19, stdev=33681.77, samples=236
  lat (usec)   : 100=0.01%, 250=37.51%, 500=48.17%, 750=14.31%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=15.46%, sys=38.91%, ctx=2088198, majf=0, minf=25005
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10292314,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=1340MiB/s (1405MB/s), 1340MiB/s-1340MiB/s (1405MB/s-1405MB/s), io=39.3GiB (42.2GB), run=30001-30001msec
Disk stats (read/write):
  nvme0n1: ios=10263491/0, merge=0/0, ticks=2772732/0, in_queue=2772733, util=99.68%
/dev/nvme1n1
Code:
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][10.0%][r=1042MiB/s][r=267k IOPS][eta 00m:27s]
Jobs: 4 (f=4): [r(4)][16.7%][r=1018MiB/s][r=261k IOPS][eta 00m:25s]
Jobs: 4 (f=4): [r(4)][23.3%][r=1016MiB/s][r=260k IOPS][eta 00m:23s]
Jobs: 4 (f=4): [r(4)][30.0%][r=1017MiB/s][r=260k IOPS][eta 00m:21s]
Jobs: 4 (f=4): [r(4)][36.7%][r=1014MiB/s][r=260k IOPS][eta 00m:19s]
Jobs: 4 (f=4): [r(4)][43.3%][r=1013MiB/s][r=259k IOPS][eta 00m:17s]
Jobs: 4 (f=4): [r(4)][50.0%][r=1020MiB/s][r=261k IOPS][eta 00m:15s]
Jobs: 4 (f=4): [r(4)][56.7%][r=1005MiB/s][r=257k IOPS][eta 00m:13s]
Jobs: 4 (f=4): [r(4)][63.3%][r=988MiB/s][r=253k IOPS][eta 00m:11s]
Jobs: 4 (f=4): [r(4)][70.0%][r=987MiB/s][r=253k IOPS][eta 00m:09s]
Jobs: 4 (f=4): [r(4)][76.7%][r=2539MiB/s][r=650k IOPS][eta 00m:07s] <-- when first fio above finished
Jobs: 4 (f=4): [r(4)][83.3%][r=2549MiB/s][r=653k IOPS][eta 00m:05s]
Jobs: 4 (f=4): [r(4)][90.0%][r=2552MiB/s][r=653k IOPS][eta 00m:03s]
Jobs: 4 (f=4): [r(4)][96.7%][r=2544MiB/s][r=651k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=2555MiB/s][r=654k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=274219: Tue Sep 28 10:00:38 2021
  read: IOPS=371k, BW=1449MiB/s (1520MB/s)(42.5GiB/30001msec)
    slat (nsec): min=1943, max=875056, avg=4063.87, stdev=1337.56
    clat (usec): min=48, max=1570, avg=339.82, stdev=153.38
     lat (usec): min=52, max=1575, avg=344.02, stdev=153.04
    clat percentiles (usec):
     |  1.00th=[  145],  5.00th=[  157], 10.00th=[  163], 20.00th=[  174],
     | 30.00th=[  186], 40.00th=[  231], 50.00th=[  388], 60.00th=[  449],
     | 70.00th=[  474], 80.00th=[  494], 90.00th=[  523], 95.00th=[  553],
     | 99.00th=[  586], 99.50th=[  594], 99.90th=[  635], 99.95th=[  701],
     | 99.99th=[  799]
   bw (  MiB/s): min=  983, max= 2599, per=98.84%, avg=1432.65, stdev=170.08, samples=236
   iops        : min=251732, max=665506, avg=366757.27, stdev=43540.73, samples=236
  lat (usec)   : 50=0.01%, 100=0.01%, 250=46.26%, 500=37.11%, 750=16.60%
  lat (usec)   : 1000=0.02%
  lat (msec)   : 2=0.01%
  cpu          : usr=18.39%, sys=40.51%, ctx=1747801, majf=0, minf=16755
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=11131700,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=1449MiB/s (1520MB/s), 1449MiB/s-1449MiB/s (1520MB/s-1520MB/s), io=42.5GiB (45.6GB), run=30001-30001msec
Disk stats (read/write):
  nvme1n1: ios=11018203/0, merge=0/0, ticks=2926228/0, in_queue=2926227, util=99.72%

/dev/md0
Code:
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][10.0%][r=1065MiB/s][r=273k IOPS][eta 00m:27s]
Jobs: 4 (f=4): [r(4)][16.7%][r=1013MiB/s][r=259k IOPS][eta 00m:25s]
Jobs: 4 (f=4): [r(4)][23.3%][r=1024MiB/s][r=262k IOPS][eta 00m:23s]
Jobs: 4 (f=4): [r(4)][30.0%][r=1023MiB/s][r=262k IOPS][eta 00m:21s]
Jobs: 4 (f=4): [r(4)][36.7%][r=1041MiB/s][r=266k IOPS][eta 00m:19s]
Jobs: 4 (f=4): [r(4)][43.3%][r=1033MiB/s][r=264k IOPS][eta 00m:17s]
Jobs: 4 (f=4): [r(4)][50.0%][r=1030MiB/s][r=264k IOPS][eta 00m:15s]
Jobs: 4 (f=4): [r(4)][56.7%][r=1031MiB/s][r=264k IOPS][eta 00m:13s]
Jobs: 4 (f=4): [r(4)][63.3%][r=1018MiB/s][r=261k IOPS][eta 00m:11s]
Jobs: 4 (f=4): [r(4)][70.0%][r=1021MiB/s][r=261k IOPS][eta 00m:09s]
Jobs: 4 (f=4): [r(4)][76.7%][r=1026MiB/s][r=263k IOPS][eta 00m:07s]
Jobs: 4 (f=4): [r(4)][83.3%][r=1030MiB/s][r=264k IOPS][eta 00m:05s]
Jobs: 4 (f=4): [r(4)][90.0%][r=1039MiB/s][r=266k IOPS][eta 00m:03s]
Jobs: 4 (f=4): [r(4)][96.7%][r=1040MiB/s][r=266k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=1047MiB/s][r=268k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=274674: Tue Sep 28 10:05:10 2021
  read: IOPS=265k, BW=1035MiB/s (1085MB/s)(30.3GiB/30001msec)
    slat (usec): min=2, max=1101, avg=11.99, stdev= 6.76
    clat (usec): min=83, max=2058, avg=469.26, stdev=51.72
     lat (usec): min=90, max=2078, avg=481.52, stdev=52.82
    clat percentiles (usec):
     |  1.00th=[  363],  5.00th=[  388], 10.00th=[  404], 20.00th=[  433],
     | 30.00th=[  449], 40.00th=[  461], 50.00th=[  469], 60.00th=[  478],
     | 70.00th=[  490], 80.00th=[  502], 90.00th=[  529], 95.00th=[  545],
     | 99.00th=[  619], 99.50th=[  685], 99.90th=[  775], 99.95th=[  807],
     | 99.99th=[  889]
   bw (  MiB/s): min=  967, max= 1160, per=100.00%, avg=1035.66, stdev= 8.32, samples=236
   iops        : min=247760, max=297022, avg=265130.27, stdev=2129.88, samples=236
  lat (usec)   : 100=0.01%, 250=0.01%, 500=78.07%, 750=21.75%, 1000=0.17%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=10.70%, sys=63.93%, ctx=4166929, majf=0, minf=33474
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=7947556,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=1035MiB/s (1085MB/s), 1035MiB/s-1035MiB/s (1085MB/s-1085MB/s), io=30.3GiB (32.6GB), run=30001-30001msec
Disk stats (read/write):
    md0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3973778/1, aggrmerge=0/0, aggrticks=109416/0, aggrin_queue=109415, aggrutil=99.60%
  nvme0n1: ios=5282115/1, merge=0/0, ticks=107440/0, in_queue=107439, util=99.60%
  nvme1n1: ios=2665441/1, merge=0/0, ticks=111392/0, in_queue=111392, util=99.60%

===
Running with fio numa arguments (cpus_allowed pinned to NUMA node 5, where the NVMe drives attach per lstopo)
===
Code:
fio --numa_mem_policy=local --cpus_allowed=3,11,19,27,35,43,51,59,67,75,83,91,99,107,115,123 --filename=$DISK --size=500GB --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --runtime=30 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1
/dev/nvme0n1
Code:
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][10.0%][r=2562MiB/s][r=656k IOPS][eta 00m:27s]
Jobs: 4 (f=4): [r(4)][16.7%][r=2518MiB/s][r=645k IOPS][eta 00m:25s]
Jobs: 4 (f=4): [r(4)][23.3%][r=2559MiB/s][r=655k IOPS][eta 00m:23s]
Jobs: 4 (f=4): [r(4)][30.0%][r=2680MiB/s][r=686k IOPS][eta 00m:21s]
Jobs: 4 (f=4): [r(4)][36.7%][r=2805MiB/s][r=718k IOPS][eta 00m:19s] <-- when second fio below started
Jobs: 4 (f=4): [r(4)][43.3%][r=2850MiB/s][r=730k IOPS][eta 00m:17s]
Jobs: 4 (f=4): [r(4)][50.0%][r=2853MiB/s][r=730k IOPS][eta 00m:15s]
Jobs: 4 (f=4): [r(4)][56.7%][r=2856MiB/s][r=731k IOPS][eta 00m:13s]
Jobs: 4 (f=4): [r(4)][63.3%][r=2853MiB/s][r=730k IOPS][eta 00m:11s]
Jobs: 4 (f=4): [r(4)][70.0%][r=2854MiB/s][r=731k IOPS][eta 00m:09s]
Jobs: 4 (f=4): [r(4)][76.7%][r=2854MiB/s][r=731k IOPS][eta 00m:07s]
Jobs: 4 (f=4): [r(4)][83.3%][r=2855MiB/s][r=731k IOPS][eta 00m:05s]
Jobs: 4 (f=4): [r(4)][90.0%][r=2851MiB/s][r=730k IOPS][eta 00m:03s]
Jobs: 4 (f=4): [r(4)][96.7%][r=2853MiB/s][r=730k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=2850MiB/s][r=730k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=274362: Tue Sep 28 10:02:08 2021
  read: IOPS=706k, BW=2759MiB/s (2893MB/s)(80.8GiB/30001msec)
    slat (nsec): min=1964, max=1039.4k, avg=3958.77, stdev=1451.21
    clat (usec): min=17, max=3655, avg=176.25, stdev=26.30
     lat (usec): min=21, max=3708, avg=180.35, stdev=26.88
    clat percentiles (usec):
     |  1.00th=[  153],  5.00th=[  159], 10.00th=[  163], 20.00th=[  167],
     | 30.00th=[  169], 40.00th=[  172], 50.00th=[  174], 60.00th=[  174],
     | 70.00th=[  176], 80.00th=[  180], 90.00th=[  184], 95.00th=[  190],
     | 99.00th=[  289], 99.50th=[  297], 99.90th=[  314], 99.95th=[  322],
     | 99.99th=[  388]
   bw (  MiB/s): min= 2301, max= 2865, per=100.00%, avg=2758.76, stdev=40.71, samples=236
   iops        : min=589214, max=733624, avg=706243.49, stdev=10422.24, samples=236
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=96.16%, 500=3.84%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=28.32%, sys=69.54%, ctx=698226, majf=0, minf=208
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=21188045,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=2759MiB/s (2893MB/s), 2759MiB/s-2759MiB/s (2893MB/s-2893MB/s), io=80.8GiB (86.8GB), run=30001-30001msec
Disk stats (read/write):
  nvme0n1: ios=21104260/0, merge=0/0, ticks=607526/0, in_queue=607526, util=99.67%
/dev/nvme1n1
Code:
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][10.0%][r=2896MiB/s][r=741k IOPS][eta 00m:27s]
Jobs: 4 (f=4): [r(4)][16.7%][r=2948MiB/s][r=755k IOPS][eta 00m:25s]
Jobs: 4 (f=4): [r(4)][23.3%][r=2947MiB/s][r=754k IOPS][eta 00m:23s]
Jobs: 4 (f=4): [r(4)][30.0%][r=2941MiB/s][r=753k IOPS][eta 00m:21s]
Jobs: 4 (f=4): [r(4)][36.7%][r=2941MiB/s][r=753k IOPS][eta 00m:19s]
Jobs: 4 (f=4): [r(4)][43.3%][r=2946MiB/s][r=754k IOPS][eta 00m:17s]
Jobs: 4 (f=4): [r(4)][50.0%][r=2945MiB/s][r=754k IOPS][eta 00m:15s]
Jobs: 4 (f=4): [r(4)][56.7%][r=2946MiB/s][r=754k IOPS][eta 00m:13s]
Jobs: 4 (f=4): [r(4)][63.3%][r=2944MiB/s][r=754k IOPS][eta 00m:11s]
Jobs: 4 (f=4): [r(4)][70.0%][r=2945MiB/s][r=754k IOPS][eta 00m:09s]
Jobs: 4 (f=4): [r(4)][76.7%][r=2905MiB/s][r=744k IOPS][eta 00m:07s]
Jobs: 4 (f=4): [r(4)][83.3%][r=2902MiB/s][r=743k IOPS][eta 00m:05s]
Jobs: 4 (f=4): [r(4)][90.0%][r=2910MiB/s][r=745k IOPS][eta 00m:03s]
Jobs: 4 (f=4): [r(4)][96.7%][r=2905MiB/s][r=744k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=2903MiB/s][r=743k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=274497: Tue Sep 28 10:02:17 2021
  read: IOPS=748k, BW=2920MiB/s (3062MB/s)(85.5GiB/30001msec)
    slat (nsec): min=1933, max=462866, avg=3743.12, stdev=899.09
    clat (usec): min=58, max=1476, avg=166.53, stdev=35.54
     lat (usec): min=62, max=1494, avg=170.40, stdev=35.63
    clat percentiles (usec):
     |  1.00th=[  121],  5.00th=[  130], 10.00th=[  135], 20.00th=[  141],
     | 30.00th=[  145], 40.00th=[  149], 50.00th=[  153], 60.00th=[  157],
     | 70.00th=[  172], 80.00th=[  206], 90.00th=[  221], 95.00th=[  229],
     | 99.00th=[  258], 99.50th=[  273], 99.90th=[  306], 99.95th=[  322],
     | 99.99th=[  537]
   bw (  MiB/s): min= 2735, max= 2955, per=100.00%, avg=2921.49, stdev=12.05, samples=236
   iops        : min=700190, max=756676, avg=747901.02, stdev=3085.99, samples=236
  lat (usec)   : 100=0.01%, 250=98.66%, 500=1.33%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=28.77%, sys=71.06%, ctx=40665, majf=0, minf=222
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=22426613,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=2920MiB/s (3062MB/s), 2920MiB/s-2920MiB/s (3062MB/s-3062MB/s), io=85.5GiB (91.9GB), run=30001-30001msec
Disk stats (read/write):
  nvme1n1: ios=22287125/0, merge=0/0, ticks=1030062/0, in_queue=1030062, util=99.67%
/dev/md0
Code:
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][10.0%][r=1391MiB/s][r=356k IOPS][eta 00m:27s]
Jobs: 4 (f=4): [r(4)][16.7%][r=1369MiB/s][r=350k IOPS][eta 00m:25s]
Jobs: 4 (f=4): [r(4)][23.3%][r=1395MiB/s][r=357k IOPS][eta 00m:23s]
Jobs: 4 (f=4): [r(4)][30.0%][r=1385MiB/s][r=355k IOPS][eta 00m:21s]
Jobs: 4 (f=4): [r(4)][36.7%][r=1371MiB/s][r=351k IOPS][eta 00m:19s]
Jobs: 4 (f=4): [r(4)][43.3%][r=1363MiB/s][r=349k IOPS][eta 00m:17s]
Jobs: 4 (f=4): [r(4)][50.0%][r=1369MiB/s][r=351k IOPS][eta 00m:15s]
Jobs: 4 (f=4): [r(4)][56.7%][r=1405MiB/s][r=360k IOPS][eta 00m:13s]
Jobs: 4 (f=4): [r(4)][63.3%][r=1416MiB/s][r=362k IOPS][eta 00m:11s]
Jobs: 4 (f=4): [r(4)][70.0%][r=1369MiB/s][r=350k IOPS][eta 00m:09s]
Jobs: 4 (f=4): [r(4)][76.7%][r=1385MiB/s][r=355k IOPS][eta 00m:07s]
Jobs: 4 (f=4): [r(4)][83.3%][r=1358MiB/s][r=348k IOPS][eta 00m:05s]
Jobs: 4 (f=4): [r(4)][90.0%][r=1368MiB/s][r=350k IOPS][eta 00m:03s]
Jobs: 4 (f=4): [r(4)][96.7%][r=1381MiB/s][r=354k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=1424MiB/s][r=365k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=274817: Tue Sep 28 10:06:15 2021
  read: IOPS=354k, BW=1384MiB/s (1451MB/s)(40.5GiB/30001msec)
    slat (usec): min=2, max=436, avg= 8.67, stdev= 5.05
    clat (usec): min=49, max=1029, avg=351.01, stdev=54.37
     lat (usec): min=79, max=1037, avg=359.90, stdev=55.53
    clat percentiles (usec):
     |  1.00th=[  258],  5.00th=[  277], 10.00th=[  293], 20.00th=[  310],
     | 30.00th=[  322], 40.00th=[  330], 50.00th=[  351], 60.00th=[  363],
     | 70.00th=[  375], 80.00th=[  383], 90.00th=[  404], 95.00th=[  437],
     | 99.00th=[  562], 99.50th=[  619], 99.90th=[  725], 99.95th=[  766],
     | 99.99th=[  840]
   bw (  MiB/s): min= 1262, max= 1544, per=100.00%, avg=1384.76, stdev=14.31, samples=236
   iops        : min=323212, max=395284, avg=354499.39, stdev=3663.33, samples=236
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.37%, 500=98.06%, 750=1.51%
  lat (usec)   : 1000=0.07%
  lat (msec)   : 2=0.01%
  cpu          : usr=13.81%, sys=64.59%, ctx=4947503, majf=0, minf=221
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10626978,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=1384MiB/s (1451MB/s), 1384MiB/s-1384MiB/s (1451MB/s-1451MB/s), io=40.5GiB (43.5GB), run=30001-30001msec
Disk stats (read/write):
    md0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5313489/0, aggrmerge=0/0, aggrticks=120938/0, aggrin_queue=120939, aggrutil=99.59%
  nvme0n1: ios=7208746/0, merge=0/0, ticks=117898/0, in_queue=117899, util=99.59%
  nvme1n1: ios=3418232/0, merge=0/0, ticks=123979/0, in_queue=123980, util=99.59%
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
Welcome to STH... From reading your last level1techs post you have problems on Linux, but you seem to be saying that Windows works well? Smells like a kernel problem. What Linux kernel are you running on Ubuntu 20.04? Can you boot/install something newer, like Arch with a really new kernel? Try kernel 5.14.8 with simple ext4 on independent disks and an increasing number of fio instances, e.g. each in a screen. See if the dent in I/O throughput shows up. Also try 5.4.148, which on Arch can be gotten from the AUR as the package linux-lts54. Sometimes there is a regression on hardware on newer kernels but not with an LTS kernel. No need to go back further than 5.4.
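Something like this would do it, one independent fio per disk in its own detached screen; start with one or two and work your way up (device list is a guess, adjust to yours):

Code:
# launch one fio instance per NVMe disk, each in a detached screen session
for i in 0 1 2 3; do
  screen -dmS fio-nvme$i fio --name=nvme$i --filename=/dev/nvme${i}n1 \
    --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --runtime=30 --time_based --group_reporting
done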
 

chocamo

New Member
Sep 27, 2021
I wouldn't say Windows is working, maybe just that it doesn't plummet quite as badly as Linux does. If you ignore sequential reads, you can see the performance drop off. Linux seems to hold up much better with sequential reads as well, getting up to about 10GiB/s in some scenarios. It's 4k where I have a large issue, and 8k (which Postgres prefers) where I really hit issues. Just a single run for each is reposted below for reference. With Ubuntu 20.04, I initially noticed the issue with the default 5.4 kernel. I installed the hardware enablement kernel, which is currently 5.11. I have also installed the latest Ubuntu 21.10 beta on kernel 5.10, and the same issue exists there. For fun I also tried the 5.13-lowlatency variant on 21.10, with no improvements.

Single Disk (CrystalDiskMark NVMe profile): [screenshot]

2 Disks in RAID1: [screenshot]

8 Disk RAID10: [screenshot]
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
So it "scales" to two drives but then levels off completely? Wow, bad.

How are those Kioxias connected to the CPUs? I.e., what does the full path look like?
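You can read that straight from sysfs without opening the box, e.g. with the PCI address from your lstopo output:

Code:
# NUMA node the first Kioxia hangs off
cat /sys/bus/pci/devices/0000:a3:00.0/numa_node
# full PCI bridge tree
lspci -tv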

What happens if you do NOT use RAID but single disks, and run CrystalDiskMark on each of them? That would eliminate any RAID slowdowns/issues.
 

chocamo

New Member
Sep 27, 2021
That's reflected in this image. They are pointing at one drive each (no RAID), just run at the same time. You can see that, compared to running against one disk by itself, neither quite hits the same speed as when run alone, like they are hitting some bottleneck that isn't the drive itself. Each drive connects via x4 PCIe lanes straight back to the CPU.


2 at the same time, neither in RAID: [screenshot]

single disk by itself (no RAID) again, for reference: [screenshot]

 

Stephan

Well-Known Member
Apr 21, 2017
Germany
Did you boot with "pcie_aspm=off rcu_nocbs=0-127 pci=noaer pci=nomsi processor.max_cstate=1" yet?

Turn off PCIe power management, offload RCU callbacks to threads on all your CPUs, no AER, and no MSI in case Dell's implementation of that is buggy too; force the CPU to C1 as the deepest power-save state. If things improve, delete the options one by one to see which one is responsible.
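On Ubuntu that would go into /etc/default/grub, roughly:

Code:
# /etc/default/grub -- then run: sudo update-grub && sudo reboot
GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off rcu_nocbs=0-127 pci=noaer pci=nomsi processor.max_cstate=1"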

If that all doesn't help, sell that Dell machine and buy a more recent one... (I know, I know)
 

chocamo

New Member
Sep 27, 2021
Did you boot with "pcie_aspm=off rcu_nocbs=0-127 pci=noaer pci=nomsi processor.max_cstate=1" yet?
I've tried a mix of these on and off. With pci=nomsi I end up in busybox on reboot, because the NVMe drives won't poll to boot off of (NVMe-only system). Current kernel params are: `nvme.poll_queues=64 nvme_core.io_timeout=2 nopti pcie_aspm=off rcu_nocbs=0-127 pci=noaer processor.max_cstate=1`

Have you tried benching with io_uring vs libaio?
I have; speed-wise it did better, but the iops throughput drop occurs with both.


I did find that overriding the "Performance" profile in the BIOS with a custom one has led to major progress. I'm thinking some of the OS overrides were not actually applying. I'm not sure if I am maxing the throughput when pinning cores now; it seems like I am, but I still have the same drop-off if I am not pinned to those cores specifically. I would be OK with that (not a lot of options) if I could somehow get MDRAID, or even better ZFS, to stick to those cores. It is a little discouraging that if I take this same fio config (no NUMA pinning) below, I get the same iops as from 8 Samsung 850 EVOs.
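As for making MDRAID itself stick to cores, the closest I've come is setting the affinity of the array's kernel thread after assembly; a rough sketch, assuming the thread shows up as md0_raid10 for my array (and for reads a lot of the work happens in the submitting process's context anyway, so this probably isn't the whole story):

Code:
# pin the md0 RAID kernel thread(s) to cores on the drives' NUMA node
pgrep md0_raid10 | while read -r pid; do
    taskset -cp 3,11,19,27,35,43,51,59 "$pid"
done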

Code:
[global]
name=NO NUMA CORE PINNING
ioengine=io_uring
direct=1
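; hipri = polled completions; relies on nvme poll queues (nvme.poll_queues is set on my kernel cmdline)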
hipri
readwrite=randread
bs=4k
iodepth=32
buffered=0
size=100%
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
log_max_value=1
group_reporting
size=5G
numjobs=4

[job1]
filename=/dev/nvme0n1

[job2]
filename=/dev/nvme1n1

[job3]
filename=/dev/nvme9n1

[job4]
filename=/dev/nvme4n1

[job5]
filename=/dev/nvme5n1

[job6]
filename=/dev/nvme6n1

[job7]
filename=/dev/nvme7n1

[job8]
filename=/dev/nvme8n1

Running a single job from above:
Code:
3# fio --section=job1 read-4k-anycpu.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=3396MiB/s][r=869k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=4): err= 0: pid=5742: Wed Sep 29 20:20:44 2021
  read: IOPS=867k, BW=3388MiB/s (3552MB/s)(99.2GiB/30001msec)
    slat (nsec): min=1754, max=347172, avg=4007.47, stdev=3781.39
    clat (usec): min=55, max=641, avg=143.09, stdev=16.56
     lat (usec): min=58, max=645, avg=147.19, stdev=16.58
    clat percentiles (usec):
     |  1.00th=[  113],  5.00th=[  124], 10.00th=[  128], 20.00th=[  133],
     | 30.00th=[  135], 40.00th=[  139], 50.00th=[  141], 60.00th=[  145],
     | 70.00th=[  149], 80.00th=[  153], 90.00th=[  161], 95.00th=[  169],
     | 99.00th=[  206], 99.50th=[  223], 99.90th=[  260], 99.95th=[  277],
     | 99.99th=[  314]
   bw (  MiB/s): min= 3263, max= 3472, per=100.00%, avg=3388.77, stdev=18.43, samples=240
   iops        : min=835488, max=888836, avg=867524.50, stdev=4718.08, samples=240
  lat (usec)   : 100=0.41%, 250=99.45%, 500=0.15%, 750=0.01%
  cpu          : usr=16.78%, sys=83.15%, ctx=6283, majf=0, minf=345
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=26017233,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=3388MiB/s (3552MB/s), 3388MiB/s-3388MiB/s (3552MB/s-3552MB/s), io=99.2GiB (107GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=34713237/0, merge=0/0, ticks=3239559/0, in_queue=3239559, util=99.77%
Running 2 jobs at once:
Code:
3# fio --section=job1 --section=job2 read-4k-anycpu.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 8 processes
Jobs: 8 (f=8): [r(8)][100.0%][r=2275MiB/s][r=582k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=7076: Wed Sep 29 20:33:51 2021
  read: IOPS=577k, BW=2252MiB/s (2362MB/s)(65.0GiB/30001msec)
    slat (nsec): min=1663, max=330217, avg=4003.64, stdev=1144.66
    clat (usec): min=42, max=920, avg=439.49, stdev=57.51
     lat (usec): min=49, max=924, avg=443.58, stdev=57.50
    clat percentiles (usec):
     |  1.00th=[  314],  5.00th=[  347], 10.00th=[  363], 20.00th=[  388],
     | 30.00th=[  408], 40.00th=[  424], 50.00th=[  441], 60.00th=[  457],
     | 70.00th=[  474], 80.00th=[  490], 90.00th=[  515], 95.00th=[  529],
     | 99.00th=[  570], 99.50th=[  578], 99.90th=[  611], 99.95th=[  627],
     | 99.99th=[  652]
   bw (  MiB/s): min= 2130, max= 2551, per=100.00%, avg=2253.73, stdev=15.16, samples=480
   iops        : min=545416, max=653066, avg=576954.38, stdev=3879.99, samples=480
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=84.61%, 750=15.39%
  lat (usec)   : 1000=0.01%
  cpu          : usr=6.22%, sys=93.76%, ctx=11115, majf=0, minf=1066
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=17297314,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2252MiB/s (2362MB/s), 2252MiB/s-2252MiB/s (2362MB/s-2362MB/s), io=65.0GiB (70.8GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=12572578/0, merge=0/0, ticks=5061965/0, in_queue=5061965, util=99.65%
  nvme1n1: ios=12489646/0, merge=0/0, ticks=5063550/0, in_queue=5063550, util=99.81%
Running all jobs at once:

Code:
3# fio read-4k-anycpu.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job4: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job5: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job6: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job7: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job8: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 32 processes
Jobs: 11 (f=1): [E(13),f(4),r(1),E(1),f(5),E(3),f(1),E(4)][32.8%][r=1907MiB/s][r=488k IOPS][eta 01m:24s]
job1: (groupid=0, jobs=32): err= 0: pid=5880: Wed Sep 29 20:21:48 2021
  read: IOPS=490k, BW=1912MiB/s (2005MB/s)(56.0GiB/30003msec)
    slat (nsec): min=1663, max=249213, avg=3977.72, stdev=879.51
    clat (usec): min=105, max=4618, avg=2086.98, stdev=510.77
     lat (usec): min=111, max=4623, avg=2091.05, stdev=510.84
    clat percentiles (usec):
     |  1.00th=[ 1565],  5.00th=[ 1598], 10.00th=[ 1631], 20.00th=[ 1663],
     | 30.00th=[ 1680], 40.00th=[ 1713], 50.00th=[ 1745], 60.00th=[ 1795],
     | 70.00th=[ 2671], 80.00th=[ 2737], 90.00th=[ 2769], 95.00th=[ 2835],
     | 99.00th=[ 2900], 99.50th=[ 2900], 99.90th=[ 2966], 99.95th=[ 2966],
     | 99.99th=[ 3032]
   bw (  MiB/s): min= 1884, max= 1949, per=100.00%, avg=1915.66, stdev= 0.46, samples=1888
   iops        : min=482359, max=499088, avg=490398.73, stdev=118.67, samples=1888
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=61.66%, 4=38.32%, 10=0.01%
  cpu          : usr=1.39%, sys=98.56%, ctx=53210, majf=0, minf=4408
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=14688177,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1912MiB/s (2005MB/s), 1912MiB/s-1912MiB/s (2005MB/s-2005MB/s), io=56.0GiB (60.2GB), run=30003-30003msec

Disk stats (read/write):
  nvme0n1: ios=3137283/0, merge=0/0, ticks=5075485/0, in_queue=5075485, util=99.08%
  nvme1n1: ios=3136687/0, merge=0/0, ticks=5074916/0, in_queue=5074916, util=99.24%
  nvme9n1: ios=3137085/0, merge=0/0, ticks=5075467/0, in_queue=5075467, util=99.32%
  nvme4n1: ios=1957565/0, merge=0/0, ticks=5081620/0, in_queue=5081620, util=99.49%
  nvme5n1: ios=1969055/0, merge=0/0, ticks=5113610/0, in_queue=5113610, util=99.60%
  nvme6n1: ios=1958087/0, merge=0/0, ticks=5081660/0, in_queue=5081660, util=99.65%
  nvme7n1: ios=1957581/0, merge=0/0, ticks=5081702/0, in_queue=5081702, util=99.85%
  nvme8n1: ios=3136283/0, merge=0/0, ticks=5075420/0, in_queue=5075420, util=99.99%
With CPU pinning:

Code:
[global]
name=NUMA CORE PINNING (MORE JOBS THAN CORES)
ioengine=io_uring
direct=1
hipri
readwrite=randread
bs=4k
iodepth=64
buffered=0
size=100%
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
log_max_value=1
group_reporting
size=5G
numjobs=4

[job1]
filename=/dev/nvme0n1
cpus_allowed=3,11

[job2]
filename=/dev/nvme1n1
cpus_allowed=19,27

[job3]
filename=/dev/nvme9n1
cpus_allowed=35,43

[job4]
filename=/dev/nvme4n1
cpus_allowed=51,59

[job5]
filename=/dev/nvme5n1
cpus_allowed=67,75

[job6]
filename=/dev/nvme6n1
cpus_allowed=83,91

[job7]
filename=/dev/nvme7n1
cpus_allowed=99,107

[job8]
filename=/dev/nvme8n1
cpus_allowed=115,123
Running a single job from above:
Code:
# fio --section=job1 read-4k-numa.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 2 (f=2): [_(1),r(1),_(1),r(1)][100.0%][r=2227MiB/s][r=570k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=4): err= 0: pid=6209: Wed Sep 29 20:24:17 2021
  read: IOPS=573k, BW=2238MiB/s (2347MB/s)(65.6GiB/30001msec)
    slat (nsec): min=1713, max=40096k, avg=5827.76, stdev=198652.91
    clat (nsec): min=180, max=40215k, avg=216661.59, stdev=1198233.12
     lat (usec): min=51, max=40216, avg=222.65, stdev=1214.73
    clat percentiles (usec):
     |  1.00th=[   94],  5.00th=[   98], 10.00th=[  100], 20.00th=[  102],
     | 30.00th=[  103], 40.00th=[  105], 50.00th=[  106], 60.00th=[  109],
     | 70.00th=[  111], 80.00th=[  113], 90.00th=[  119], 95.00th=[  128],
     | 99.00th=[  204], 99.50th=[12125], 99.90th=[15139], 99.95th=[18220],
     | 99.99th=[25297]
   bw (  MiB/s): min= 1995, max= 2451, per=100.00%, avg=2238.70, stdev=22.48, samples=236
   iops        : min=510806, max=627612, avg=573107.05, stdev=5754.61, samples=236
  lat (nsec)   : 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=12.40%, 250=86.67%
  lat (usec)   : 500=0.01%
  lat (msec)   : 2=0.01%, 4=0.02%, 10=0.16%, 20=0.68%, 50=0.04%
  cpu          : usr=10.90%, sys=39.03%, ctx=7086, majf=0, minf=241
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=17188005,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2238MiB/s (2347MB/s), 2238MiB/s-2238MiB/s (2347MB/s-2347MB/s), io=65.6GiB (70.4GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=23096941/0, merge=0/0, ticks=1456670/0, in_queue=1456670, util=99.78%
Running all jobs at once:
Code:
# fio read-4k-numa.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job4: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job5: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job6: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job7: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job8: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 32 processes
Jobs: 32 (f=32): [r(32)][100.0%][r=6818MiB/s][r=1745k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=32): err= 0: pid=6883: Wed Sep 29 20:31:45 2021
  read: IOPS=1744k, BW=6812MiB/s (7143MB/s)(200GiB/30011msec)
    slat (nsec): min=1663, max=33024k, avg=8663.91, stdev=271964.89
    clat (nsec): min=201, max=33415k, avg=577228.64, stdev=2224622.61
     lat (usec): min=24, max=33420, avg=586.12, stdev=2240.86
    clat percentiles (usec):
     |  1.00th=[  178],  5.00th=[  194], 10.00th=[  202], 20.00th=[  217],
     | 30.00th=[  249], 40.00th=[  269], 50.00th=[  285], 60.00th=[  306],
     | 70.00th=[  330], 80.00th=[  367], 90.00th=[  392], 95.00th=[  408],
     | 99.00th=[17171], 99.50th=[18220], 99.90th=[24249], 99.95th=[24249],
     | 99.99th=[26084]
   bw (  MiB/s): min= 6669, max= 6973, per=100.00%, avg=6818.56, stdev= 2.36, samples=1888
   iops        : min=1707299, max=1785275, avg=1745549.75, stdev=603.20, samples=1888
  lat (nsec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=30.21%, 500=67.80%, 750=0.14%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.02%, 10=0.28%, 20=1.16%, 50=0.36%
  cpu          : usr=5.91%, sys=44.00%, ctx=57339, majf=0, minf=1933
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=52332396,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=6812MiB/s (7143MB/s), 6812MiB/s-6812MiB/s (7143MB/s-7143MB/s), io=200GiB (214GB), run=30011-30011msec

Disk stats (read/write):
  nvme0n1: ios=10222256/0, merge=0/0, ticks=2503935/0, in_queue=2503935, util=99.12%
  nvme1n1: ios=10222991/0, merge=0/0, ticks=2502820/0, in_queue=2502820, util=99.27%
  nvme9n1: ios=10215854/0, merge=0/0, ticks=2502131/0, in_queue=2502131, util=99.32%
  nvme4n1: ios=7018503/0, merge=0/0, ticks=2544213/0, in_queue=2544213, util=99.52%
  nvme5n1: ios=7010068/0, merge=0/0, ticks=2541596/0, in_queue=2541596, util=99.59%
  nvme6n1: ios=7013893/0, merge=0/0, ticks=2540536/0, in_queue=2540536, util=99.64%
  nvme7n1: ios=7012250/0, merge=0/0, ticks=2541552/0, in_queue=2541552, util=99.86%
  nvme8n1: ios=10218930/0, merge=0/0, ticks=2511614/0, in_queue=2511614, util=99.99%

It seems like I need 4 cores to fully utilize a drive at its 870k+ iops (roughly 220k IOPS per core). Given that:

Code:
[global]
name=Only some disks, enough cores pinned for each job
ioengine=io_uring
direct=1
hipri
readwrite=randread
bs=4k
iodepth=32
buffered=0
size=100%
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
log_max_value=1
group_reporting
size=5G
numjobs=4

[job1]
filename=/dev/nvme0n1
cpus_allowed=3,11,19,27

[job2]
filename=/dev/nvme1n1
cpus_allowed=35,43,51,59

[job3]
filename=/dev/nvme9n1
cpus_allowed=67,75,83,91

[job4]
filename=/dev/nvme4n1
cpus_allowed=99,107,115,123
Running a single job from above:
Code:
# fio --section=job1 read-4k-alt-numa.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=3402MiB/s][r=871k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=4): err= 0: pid=7651: Wed Sep 29 20:38:41 2021
  read: IOPS=871k, BW=3401MiB/s (3566MB/s)(99.6GiB/30002msec)
    slat (nsec): min=1653, max=151282, avg=3336.21, stdev=1198.50
    clat (usec): min=56, max=526, avg=143.18, stdev=40.17
     lat (usec): min=60, max=531, avg=146.60, stdev=40.16
    clat percentiles (usec):
     |  1.00th=[   89],  5.00th=[   94], 10.00th=[   98], 20.00th=[  102],
     | 30.00th=[  108], 40.00th=[  115], 50.00th=[  155], 60.00th=[  165],
     | 70.00th=[  172], 80.00th=[  178], 90.00th=[  188], 95.00th=[  208],
     | 99.00th=[  243], 99.50th=[  262], 99.90th=[  297], 99.95th=[  314],
     | 99.99th=[  355]
   bw (  MiB/s): min= 3396, max= 3408, per=100.00%, avg=3402.00, stdev= 0.64, samples=240
   iops        : min=869424, max=872587, avg=870912.72, stdev=164.90, samples=240
  lat (usec)   : 100=15.50%, 250=83.74%, 500=0.76%, 750=0.01%
  cpu          : usr=17.71%, sys=82.20%, ctx=7919, majf=0, minf=247
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=26119867,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=3401MiB/s (3566MB/s), 3401MiB/s-3401MiB/s (3566MB/s-3566MB/s), io=99.6GiB (107GB), run=30002-30002msec

Disk stats (read/write):
  nvme0n1: ios=34825485/0, merge=0/0, ticks=4901935/0, in_queue=4901935, util=99.77%

Running 2 jobs from above:
Code:
# fio --section=job1 --section=job3 read-4k-alt-numa.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 8 processes
Jobs: 8 (f=8): [r(8)][100.0%][r=6247MiB/s][r=1599k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=7789: Wed Sep 29 20:39:52 2021
  read: IOPS=1598k, BW=6243MiB/s (6546MB/s)(183GiB/30001msec)
    slat (usec): min=2, max=230, avg= 4.11, stdev= 3.34
    clat (usec): min=31, max=655, avg=155.32, stdev=14.54
     lat (usec): min=35, max=657, avg=159.57, stdev=14.56
    clat percentiles (usec):
     |  1.00th=[  133],  5.00th=[  141], 10.00th=[  143], 20.00th=[  147],
     | 30.00th=[  149], 40.00th=[  151], 50.00th=[  153], 60.00th=[  157],
     | 70.00th=[  159], 80.00th=[  163], 90.00th=[  167], 95.00th=[  176],
     | 99.00th=[  219], 99.50th=[  233], 99.90th=[  269], 99.95th=[  285],
     | 99.99th=[  322]
   bw (  MiB/s): min= 6224, max= 6264, per=100.00%, avg=6246.23, stdev= 1.17, samples=480
   iops        : min=1593504, max=1603610, avg=1599034.60, stdev=299.47, samples=480
  lat (usec)   : 50=0.01%, 100=0.06%, 250=99.71%, 500=0.23%, 750=0.01%
  cpu          : usr=22.54%, sys=77.32%, ctx=15490, majf=0, minf=483
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=47946274,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=6243MiB/s (6546MB/s), 6243MiB/s-6243MiB/s (6546MB/s-6546MB/s), io=183GiB (196GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=32022142/0, merge=0/0, ticks=3173141/0, in_queue=3173141, util=99.73%
  nvme9n1: ios=31912933/0, merge=0/0, ticks=3159031/0, in_queue=3159031, util=99.80%
Running 4 jobs from above:
Code:
# fio read-4k-alt-numa.fio
job1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job2: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job3: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
job4: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 16 processes
Jobs: 2 (f=2): [E(8),r(2),E(6)][100.0%][r=6789MiB/s][r=1738k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=16): err= 0: pid=7963: Wed Sep 29 20:40:46 2021
  read: IOPS=1739k, BW=6792MiB/s (7121MB/s)(199GiB/30002msec)
    slat (usec): min=2, max=656, avg= 4.28, stdev= 1.41
    clat (usec): min=58, max=1314, avg=289.40, stdev=41.59
     lat (usec): min=85, max=1318, avg=293.81, stdev=41.57
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  229], 10.00th=[  237], 20.00th=[  249],
     | 30.00th=[  260], 40.00th=[  273], 50.00th=[  289], 60.00th=[  306],
     | 70.00th=[  318], 80.00th=[  330], 90.00th=[  343], 95.00th=[  355],
     | 99.00th=[  379], 99.50th=[  392], 99.90th=[  420], 99.95th=[  433],
     | 99.99th=[  469]
   bw (  MiB/s): min= 6784, max= 6810, per=100.00%, avg=6797.18, stdev= 0.41, samples=944
   iops        : min=1736903, max=1743534, avg=1740076.97, stdev=103.85, samples=944
  lat (usec)   : 100=0.01%, 250=20.97%, 500=79.02%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=13.13%, sys=86.77%, ctx=32525, majf=0, minf=969
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=52161948,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=6792MiB/s (7121MB/s), 6792MiB/s-6792MiB/s (7121MB/s-7121MB/s), io=199GiB (214GB), run=30002-30002msec

Disk stats (read/write):
  nvme0n1: ios=17219823/0, merge=0/0, ticks=4983216/0, in_queue=4983216, util=99.54%
  nvme1n1: ios=17206803/0, merge=0/0, ticks=4983594/0, in_queue=4983594, util=99.69%
  nvme9n1: ios=17324095/0, merge=0/0, ticks=5012664/0, in_queue=5012664, util=99.76%
  nvme4n1: ios=17188364/0, merge=0/0, ticks=4982518/0, in_queue=4982518, util=99.90%
 

acquacow

Well-Known Member
Feb 15, 2017
Will your BIOS let you go in and disable P-states and C-states? Disabling those can speed IOPS up quite a bit.
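From the OS side you can also force the frequency governor while you test, e.g. (cpupower ships in linux-tools on Ubuntu):

Code:
sudo apt install linux-tools-$(uname -r)
# hold every core at max frequency instead of letting it scale
sudo cpupower frequency-set -g performance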
 

chocamo

New Member
Sep 27, 2021
Will your BIOS let you go in and disable P-states and C-states? Disabling those can speed IOPS up quite a bit.
The last results I posted were with C-states disabled in the BIOS. I'm not sure if ASPM states are what's labeled "PCI ASPM L1 Link Power Management" in the Dell BIOS, but that was disabled as well.

A tl;dr of my last post:

- With CPU pinning in fio, I max out somewhere between 1739k and 2100k 4k read iops across all 8 disks at once. I'm not sure if I'm just hitting an overall CPU/PCIe 3.0 bottleneck (rough math below).
- Without CPU pinning, I get WORSE throughput than a single disk. How much worse just depends on how many disks I run fio against at the same time.
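Rough math on that ceiling: 1,739k IOPS x 4 KiB is about 6.7 GiB/s aggregate (matches the ~6800 MiB/s fio reported above). Eight drives at PCIe 3.0 x4 should be just under 4 GB/s of link bandwidth each, so ~31 GB/s aggregate; I don't think it's the lanes themselves, it smells more like a CPU or inter-die fabric limit.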