Memory (config) diagnosis tool

Rand__

Well-Known Member
Mar 6, 2014
4,418
861
113
So in the meantime the tests completed - still no significant difference between

6 x 8GB + a 16GB NVDIMM module (with PMem) with 36 threads (array size 4GB):

upload_2019-10-28_20-44-18.png

6 x 8GB with 36 threads:

upload_2019-10-28_20-45-6.png

Will run one more test with 8GB and 36 threads as a final comparison point (to see whether there is any difference at all).
If that doesn't show any difference either, I'll need to check the BIOS for anything odd - this was an NVDIMM test box, so maybe I misconfigured something while trying to get NVDIMM firmware updates working (don't ask).

p.s. Can't attach the PDFs or the stats file since they are too large. Let me know if you want to see different results.
 

Rand__

Now that would be a particularly stupid mistake, which is not impossible :eek:

Edit:
pmbw.exe -S 4194304
Running benchmarks with array size up to 4194304.
CPUID: mmx sse avx
Detected 7873 MiB physical RAM and 36 CPUs.

Allocating 4096 MiB for testing.

4096 MiB should be 4 GiB, shouldn't it? ... I need more sleep
 

alex_stief

Active Member
May 31, 2016
603
180
43
35
The relevant quantity here is the array size.
You can also see it from the results... the bandwidth values are definitely in cache territory. The first plateau at up to 1500MB/s should be L2, then the step down to 850MB/s is L3.
That's what I was going for in an earlier post: you can leave out the smaller test sizes, as they never touch memory and run in the CPU cache instead.
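As a crude illustration of the effect (my own sketch, not part of pmbw): timing plain memory copies in Python shows the same qualitative behavior, with throughput dropping once the array no longer fits in cache. The sizes below are arbitrary examples:

```python
import time

def copy_bandwidth(nbytes, reps=5):
    """Best-of-N throughput of a plain memory copy, in GiB/s."""
    src = bytearray(nbytes)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full pass: read nbytes, write nbytes
        best = min(best, time.perf_counter() - t0)
    return nbytes / best / 2**30

# Small arrays stay in cache, large ones spill to main memory.
for size in (2**20, 2**24, 2**28):  # 1 MiB, 16 MiB, 256 MiB
    print(f"{size >> 20:4d} MiB: {copy_bandwidth(size):6.1f} GiB/s")
```

This only shows the trend - for absolute numbers you still need a real benchmark like pmbw with its hand-tuned loops.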
 

Rand__

What you say sounds plausible, but I still don't get it.

The array size is controlled by the -s/-S parameters (min/max). I currently use -S 4194304, which limits the max size to 4GiB.
Do you mean that when running on smaller sizes, the cache-based results are so dominant that we can't see the memory-based values properly?
So I need to also set -s to at least 22MiB (L3 size)?
 

alex_stief

I am saying that -S 4194304 is 4MB, not 4GB.
The amount of memory the program allocates is not directly related to this, and can be set separately (-M). It does not determine whether the test arrays fit into cache or not.
What you need for testing memory is -s 33554432. Note the lowercase letter: this sets the minimum array size to a value (probably) larger than the largest cache of your CPU.
You can then set the upper limit for the array size (-S) to some value 16-32 times this lower value. The bandwidth won't change that much once you are in memory, and running the larger tests will just eat up more time.
I would need to have a closer look at the source to tell you how -M affects the duration of your tests.
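For what it's worth, the -s/-S arguments are raw byte counts, which makes the MiB-vs-GiB mixup easy. A quick sanity check:

```python
# pmbw's -s/-S values are plain byte counts; converting them to
# human-readable units avoids the MiB-vs-GiB confusion above.
def to_mib(nbytes):
    return nbytes / 2**20

print(to_mib(4194304))   # -S 4194304 -> 4.0 MiB (not 4 GiB)
print(to_mib(33554432))  # -s 33554432 -> 32.0 MiB, larger than a 22 MiB L3
print(4 * 2**30)         # an actual 4 GiB would be 4294967296 bytes
```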
 

Rand__

Well, then why does
pmbw.exe -S 4194304 result in "Allocating 4096 MiB for testing.", I wonder - I agree that 4194304 is only 4 MB.
Confusing tool for the uninitiated :p

Ran it as suggested - interestingly, same 4096 response - so it probably does not echo the max/min array size:

pmbw.exe -s 33554432
Running benchmarks with array size at least 33554432.
CPUID: mmx sse avx
Detected 7873 MiB physical RAM and 36 CPUs.

Allocating 4096 MiB for testing.
 

alex_stief

Like I said, I would need to look at the source code to see how memory allocation is handled, and how it affects the benchmark.
But I can say with absolute certainty that in this benchmark, the array size is what determines whether it runs in cache or not - not the total amount of memory allocated for the benchmark.

Confusing tool for the uninitiated :p
We already touched on the subject of which qualifications are required to get meaningful results, and to make sense of them :D
 

Rand__

So I've been running 8GB and 48GB with "pmbw.exe -s 33554432" now, but for 8GB the impact of the cache still seems dominant:
upload_2019-10-30_21-20-38.png

48GB Ram (no NVDimm)
upload_2019-10-30_21-23-15.png

So I'll go up to -s 134217728 (128 MiB)
 

Rand__

Alright, so now we're talking - using an NVDIMM-N (as the 7th module) causes a ~20% memory bandwidth loss


8GB (single module): ~6 GiB/s

upload_2019-11-1_20-12-3.png


24GB (3 modules): ~17 GiB/s
upload_2019-11-1_20-12-51.png


48GB (6 modules): ~16 GiB/s

upload_2019-11-1_20-13-55.png

48GB + NVDIMM: ~13 GiB/s

upload_2019-11-1_20-14-46.png




So despite the NVDIMM sitting outside the six memory channels, there is a performance discrepancy compared to the hexa-channel setup...

The next questions are: how does it look with 5+1 modules (back to hexa?), and how would a full complement of 7+1 work out...
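The ~20% figure follows directly from the plateau values above (the GiB/s numbers below are approximate readings off the plots, sketched here for reference):

```python
# Approximate plateau bandwidths read off the plots above (GiB/s).
measured = {
    "1 module (8GB)": 6,
    "3 modules (24GB)": 17,
    "6 modules (48GB)": 16,
    "6 modules + NVDIMM": 13,
}

base = measured["6 modules (48GB)"]
loss = (base - measured["6 modules + NVDIMM"]) / base
print(f"NVDIMM penalty: {loss:.0%}")  # prints: NVDIMM penalty: 19%
```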
 

alex_stief

It might be worth taking a step back and checking the plausibility of the results so far.
The measured bandwidth is way below the theoretical value for each of the DIMM configurations tested. Now, reaching the theoretical limit is not easy; even highly optimized benchmarks like STREAM struggle to get to 90% of it.
But with results much lower than that, I would be hesitant to conclude that hardware changes were the main cause of a change in the measured value. And even if we could be certain about the cause, it would be impossible to measure the effect quantitatively. To use a car analogy: I would not measure the top speed of a supercar on a kart circuit.
pmbw measures a lot of different results. Is there any series of tests that gets closer to the maximum bandwidth of your platform? Otherwise, I would recommend switching to a different benchmark if you are interested in quantitative results.
 

Rand__

You are just trying to sell me the STREAM benchmark :p

So what is the actual maximum theoretical bandwidth for the 3 configurations?
Found this formula:
<64 IOs per channel> * <X channels> * <2,133 megabits per second per IO pin>
e.g. for 4 channels: 64 * 4 * 2133E6 bit/s = 68.3 GB/s, i.e. ~63.6 GiB/s

Now with a single module I am limited to a single channel, I assume, so that would be ~15.9 GiB/s (but o/c I have 2666 memory, so we are at ~19.9 GiB/s)

Got this result for 8GB

upload_2019-11-1_23-55-38.png

And this for 48 GB (6 modules)

upload_2019-11-1_23-57-26.png

Doubled, but not times 6...



Found also this:

Max bandwidth: 119.21 GiB/s
Single: 19.87 GiB/s
Double: 39.74 GiB/s
Quad: 79.47 GiB/s
Hexa: 119.21 GiB/s


That would match for single channel but be far off for hexa channel...
Might need to read some more...


But in the end, the absolute numbers are possibly not even relevant, as the main question originally was: will using an NVDIMM have an impact on memory bandwidth (yes), or is the system smart enough to correct for it after the driver has been loaded (no)?

Everything following that (identifying the best setup & optimization) is a secondary topic, I guess.
 

alex_stief

Theoretical maximum memory bandwidth for DDR:
64 bit (bus width) x 2 (for DDR) x "real" memory frequency (1333MHz for DDR4-2666) x number of channels.
1 channel DDR4-2666: 19.86 GiB/s
...so the same as the numbers you got. The STREAM benchmark can get close to that under the right circumstances, e.g. being compiled with the right instruction set and using dual-rank DIMMs.

In retrospect, I hope you can see why I tried to push you towards STREAM. It only needs to run for a few seconds (not hours), and can achieve the maximum bandwidth possible with the given hardware. Hence its use in performance analysis for HPC codes.
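The formula above, as a quick sanity-check script (a sketch, assuming a 1333 MHz "real" clock for DDR4-2666 as stated):

```python
# Theoretical peak: bus width (bytes) x 2 transfers/clock x real clock x channels.
def ddr_peak_gib(real_mhz, channels):
    bytes_per_transfer = 8  # 64-bit bus
    rate = bytes_per_transfer * 2 * real_mhz * 1e6 * channels
    return rate / 2**30

for ch in (1, 2, 4, 6):
    print(ch, round(ddr_peak_gib(1333, ch), 2))
# 1 channel DDR4-2666 -> ~19.86 GiB/s, matching the figure above
```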
 

Rand__

I'd be happy to try it if I could find a compiled binary for Win or Linux... ;)
But the STREAM page looks to have been active (and unchanged, layout-wise ;)) since 1988
 

alex_stief

It definitely has that old-school vibe to it :cool: But don't worry, it still does what it is supposed to do.
That's part of the point of not offering a pre-compiled binary for download: you can only unleash its full potential when compiling for your specific system.
 

Rand__

Yeah, that's what I was afraid of :p

Installed AIDA64 just for testing the other night...
upload_2019-11-2_8-23-13.png

upload_2019-11-2_8-24-57.png

Will check if that provides more accurate info (without having to actually test everything myself). This is the 5 DIMM + 1 NVDIMM config.
 

Rand__

Good idea, did that. Under moderator review...

7 modules, AIDA (6 + NVDIMM):
Significantly worse - based on this, a 5+1 setup looks preferable...

upload_2019-11-2_10-45-30.png

upload_2019-11-2_10-45-46.png


3+1 Modules:
upload_2019-11-2_19-10-39.png

upload_2019-11-2_19-10-54.png
 