Proxmox identify cause of poor performance in Dual Socket system


Harmony

Member
Oct 6, 2022
I have a system with a Supermicro H12DSi-N6 board and 2x EPYC 7642. Once the system goes above about 30% CPU usage in the overall graph, I notice a steep decrease in memory performance and the system feels sluggish overall. How can I identify what might be the cause?

sysbench --test=memory --memory-block-size=4G --memory-total-size=32G run

Code:
24576.00 MiB transferred (2370.00 MiB/sec)


General statistics:
    total time:                          10.3682s
    total number of events:              6

Latency (ms):
         min:                                 1189.33
         avg:                                 1727.95
         max:                                 2697.68
         95th percentile:                     2680.11
         sum:                                10367.71

Threads fairness:
    events (avg/stddev):           6.0000/0.00
    execution time (avg/stddev):   10.3677/0.00


The RAM is DDR4-2666 and gets 13,000.00 MiB/sec in the same test with nothing running, compared to the above.
 

DavidWJohnston

Active Member
Sep 30, 2020
This may have something to do with NUMA-awareness or lack thereof in either your workload or the benchmarking tool, or possibly non-ideal BIOS settings around NUMA.

NUMA nodes group sockets/cores together with their fastest RAM. This can cause bottlenecks when one core needs to access memory across NUMA nodes.
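
One rough way to check whether that is what you're hitting (just a sketch, assuming the numactl package is installed on the host) is to re-run the same sysbench test pinned to one node, first against its local memory and then against the other node's memory, and compare the throughput:

Code:
# Same parameters as your test, CPUs and memory both pinned to node 0 (local access)
numactl --cpunodebind=0 --membind=0 sysbench --test=memory --memory-block-size=4G --memory-total-size=32G run

# CPUs on node 0, memory forced onto node 1 (remote access over the socket interconnect)
numactl --cpunodebind=0 --membind=1 sysbench --test=memory --memory-block-size=4G --memory-total-size=32G run

If the membind=1 run is dramatically slower than the membind=0 run, remote NUMA accesses are at least part of the story; if both are slow, the problem is probably elsewhere.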

I don't have an exact answer for you, but maybe one of these will get you started:

- Is there an actual problem? This effect may be simply an artifact of the way the benchmarking is being done. Try to artificially scale-up your prod workload to see if the benchmark reflects reality. How does your prod workload handle NUMA?

- Take a look at NUMA-related settings in your board manual and read about what they mean, and optimize them.

- Try to find a tool which can give you statistics about the NUMA status and resource usage. This would help confirm the root cause.
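
For example (again just a sketch; numastat ships with the numactl package):

Code:
# Per-node allocation counters; non-zero numa_miss / numa_foreign means allocations
# could not be satisfied from the preferred node
numastat

# Per-node memory usage in a /proc/meminfo-style layout (needs a reasonably recent numastat)
numastat -m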

Good Luck!

Edit: LOL looks like someone beat me to it :)
 

Harmony

Member
Oct 6, 2022
You took too long to type that out :) But thanks for the answer.

Rebooting the VM gives great results (for how long, I don't know), even without NUMA enabled; I will try both. I still don't understand why the host machine has poor performance though. The result in the first post was taken from the Proxmox node itself.
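
For reference, the per-VM toggle I mean is the NUMA option in the VM's processor settings; from the CLI it should be something like this (sketch only, 100 is a placeholder VM ID):

Code:
# Enable NUMA awareness for the guest
qm set 100 --numa 1

# ...and back off again for comparison
qm set 100 --numa 0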

Do you know any "non-ideal BIOS settings around NUMA" I should be aware of?
 

DavidWJohnston

Active Member
Sep 30, 2020
Yeah, I'm not quite sure what's going on either. For some reason the system appears to be allocating memory to a process outside of its CPU's NUMA node. Maybe another benchmarking tool would work better, or some parameters to make it more NUMA-aware.

I think this is your manual: Manual_Supermicro_H12DSi-NT6_v1.0_20210901.pdf - On the bottom of pg. 61 there is a setting called "NUMA nodes per socket". It would be interesting to see what it's set to, and try some different settings.

On the next page there is another NUMA setting about cache. Maybe try changing that one too.
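
One quick way to see what "Auto" actually resolves to, from the Linux side rather than the BIOS (just a sketch), is to check how many NUMA nodes the kernel sees before and after changing the setting:

Code:
# Number of NUMA nodes the kernel sees, and which CPUs belong to each
lscpu | grep -i numa

# Node memory sizes and the distance matrix; the node count should follow the NPS setting
numactl --hardware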

My knowledge level here isn't good enough to really help but maybe this weekend if I have some time I'll have a deeper look for you if needed.
 

Harmony

Member
Oct 6, 2022
NUMA nodes per socket is set to "Auto"

Rebooted the VM again; it seems to sometimes boot into a bad state.



If it helps


root@proxmoxhost3:~# numastat
Code:
                           node0           node1
numa_hit            195958341659    214547960812
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit            154685          153642
local_node          195929675706    214517495564
other_node              28411951        29865698
root@proxmoxhost3:~# numactl -H

Code:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 515867 MB
node 0 free: 266897 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 1 size: 516015 MB
node 1 free: 262410 MB
node distances:
node   0   1 
  0:  10  32 
  1:  32  10
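
If it helps to dig further, I can also check where a specific VM's memory actually landed, something like this (sketch; <pid> is a placeholder for the VM's QEMU/KVM process ID, e.g. from ps aux | grep kvm):

Code:
# Per-node memory breakdown for one VM's QEMU/KVM process
numastat -p <pid>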