Memory (config) diagnosis tool

Discussion in 'Software Stuff' started by Rand__, Oct 24, 2019.

  1. Rand__

    Rand__ Well-Known Member

    So in the meantime the tests have completed - still no significant difference between:

    6 x 8GB + a 16 GB NVDIMM module (with PGem), 36 threads (array size 4 GB):

    upload_2019-10-28_20-44-18.png

    6 x 8GB with 36 Threads:

    upload_2019-10-28_20-45-6.png

    Will run one more test with 8GB and 36 threads as a final comparison point (to see whether there is any difference at all).
    If that doesn't show any difference either, I'll need to check the BIOS for anything odd - this was an NVDIMM test box, so maybe I misconfigured something while trying to get NVDIMM firmware updates working (don't ask).

    P.S. Can't attach the PDFs or the stats file since they are too large. Let me know if you want to see different results.
     
    #21
  2. alex_stief

    alex_stief Active Member

    From the looks of that chart, you tested only up to 4 MB, not 4 GB - so you were only testing the CPU caches.
     
    #22
  3. Rand__

    Rand__ Well-Known Member

    Now that would be a particularly stupid mistake, which is not impossible :eek:

    Edit:
    pmbw.exe -S 4194304
    Running benchmarks with array size up to 4194304.
    CPUID: mmx sse avx
    Detected 7873 MiB physical RAM and 36 CPUs.

    Allocating 4096 MiB for testing.

    4096 MiB should be 4 GiB, shouldn't it? ... I need more sleep.
     
    #23
    Last edited: Oct 28, 2019
  4. alex_stief

    alex_stief Active Member

    The relevant quantity here is the array size.
    You can also see it from the results: the bandwidth values are definitely in cache territory. The first plateau at up to 1500 MB/s should be L2, then the step down to 850 MB/s is L3.
    That's what I was getting at in an earlier post: you can leave out the smaller test sizes, since they never touch memory and run in the CPU caches instead.
     
    #24
    Last edited: Oct 28, 2019
  5. Rand__

    Rand__ Well-Known Member

    What you say sounds plausible, but I still don't get it.

    The parameters relevant for the array size are -s/-S (min/max). I currently use -S 4194304, which limits the max size to 4 GiB.
    Do you mean that by running the smaller sizes, the cache-based results are so dominant that we can't see the memory-based values properly?
    So I also need to set -s to at least 22 MiB (L3 size)?
     
    #25
  6. alex_stief

    alex_stief Active Member

    I am saying that -S 4194304 is 4MB, not 4GB.
    The amount of memory the program allocates is not directly related to this, and can be set separately (-M). And it does not determine whether the test arrays fit into cache or not.
    What you need for testing memory is -s 33554432. Note that this is the lowercase letter, setting the minimum array size to a value (probably) larger than the largest cache of your CPU.
    And you can set the upper limit for the array size (-S) to a value 16-32 times the lower one. The bandwidth won't change that much once you are in memory, and running the larger tests will just eat up more time.
    I would need to have a closer look at the source to tell you how -M affects the duration of your tests.
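
    Put concretely, a run along those lines might look like this (an illustrative sketch, not a prescribed invocation: 33554432 B = 32 MiB as the minimum array size, and 1073741824 B = 1 GiB, i.e. 32x the minimum, as the maximum):

    pmbw.exe -s 33554432 -S 1073741824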
     
    #26
  7. Rand__

    Rand__ Well-Known Member

    Well then I wonder why
    pmbw.exe -S 4194304 results in "Allocating 4096 MiB for testing." - I agree that 4194304 is only 4 MB.
    Confusing tool for the uninitiated :p

    Will run as suggested - interestingly, the same 4096 response - so it probably does not echo the max/min array size:

    pmbw.exe -s 33554432
    Running benchmarks with array size at least 33554432.
    CPUID: mmx sse avx
    Detected 7873 MiB physical RAM and 36 CPUs.

    Allocating 4096 MiB for testing.
     
    #27
  8. alex_stief

    alex_stief Active Member

    Like I said, I would need to look at the source code to see how memory allocation is handled, and how it affects the benchmark.
    But I can say with absolute certainty that in this benchmark, the array size is what determines whether it runs in cache or not - not the total amount of memory allocated for the benchmark.

    We already touched on the subject of what qualifications are required to get meaningful results, and to make sense of them :D
     
    #28
    Last edited: Oct 29, 2019
  9. Rand__

    Rand__ Well-Known Member

    Yeah. I'll blame the documentation;)
     
    #29
  10. Rand__

    Rand__ Well-Known Member

    So I've been running 8GB and 48GB with "pmbw.exe -s 33554432" now, but for 8GB the impact of the cache still seems prevalent:
    upload_2019-10-30_21-20-38.png

    48 GB RAM (no NVDIMM)
    upload_2019-10-30_21-23-15.png

    So I will go to -s 134217728 (128 MiB).
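
    In command form, that next run could look something like this (just a sketch; the -S cap of 2147483648 B = 2 GiB is an assumed value, 16x the new 128 MiB minimum, per the rule of thumb above):

    pmbw.exe -s 134217728 -S 2147483648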
     
    #30
  11. Rand__

    Rand__ Well-Known Member

    Alright, so now we are talking - using an NVDIMM-N (as a 7th module) causes a 20% memory bandwidth loss.


    8 GB (single module): ~6 GiB/s

    upload_2019-11-1_20-12-3.png


    24 GB (3 modules): ~17 GiB/s
    upload_2019-11-1_20-12-51.png


    48 GB (6 modules): ~16 GiB/s

    upload_2019-11-1_20-13-55.png

    48 GB + NVDIMM: ~13 GiB/s

    upload_2019-11-1_20-14-46.png




    So despite the NVDIMM being outside the memory channels, there is a performance discrepancy compared to the hexa-channel setup...

    Next questions: how does it look with 5+1 modules (back to hexa?), and how would a full complement of 7+1 work out...
     
    #31
  12. alex_stief

    alex_stief Active Member

    It might be worth taking a step back and checking the plausibility of the results so far.
    The measured bandwidth is way below the theoretical value for each of the DIMM configurations tested. Now, reaching this theoretical limit is not easy; even highly optimized benchmarks like STREAM struggle to get to 90% of it.
    But with results much lower than that, I would be hesitant to draw the conclusion that hardware changes were the main cause of a change in the measured value. And even if we could be certain about the cause, it would be impossible to measure the effect quantitatively. Using a car analogy: I would not measure the top speed of a supercar on a kart circuit.
    pmbw reports a lot of different results. Is there any series of tests that gets closer to the maximum bandwidth of your platform? Otherwise, I would recommend switching to a different benchmark if you are interested in quantitative results.
     
    #32
    Last edited: Nov 1, 2019
  13. Rand__

    Rand__ Well-Known Member

    You are only trying to sell me the STREAM benchmark :p

    So what is the actual maximum theoretical bandwidth for the 3 configurations?
    Found this formula:
    <64 IOs per channel> * <X channels> * <2,133 megabits per second per IO pin>
    e.g. for 4 channels = 64 * 4 * 2133E6 = 54.6 GiB/s

    Now with a single module I am limited to a single channel, I assume, so that would be 14 GiB/s (but of course I have 2666 memory, so we are at ~17 GiB/s).

    Got this result for 8GB

    upload_2019-11-1_23-55-38.png

    And this for 48 GB (6 modules)

    upload_2019-11-1_23-57-26.png

    Doubled but not times 6...



    Found also this:

    Max bandwidth: 119.21 GiB/s
    Single: 19.87 GiB/s
    Double: 39.74 GiB/s
    Quad: 79.47 GiB/s
    Hexa: 119.21 GiB/s


    That would match for single channel but be far off for hexa channel...
    Might need to read some more...


    But in the end, the absolute numbers are possibly not even relevant, as the original main question was: does using an NVDIMM have an impact on memory bandwidth (yes), or is the system smart enough to correct it once the driver has been loaded (no)?

    Everything following that (identifying the best setup & optimization) is a secondary topic, I guess.
     
    #33
  14. alex_stief

    alex_stief Active Member

    Theoretical maximum memory bandwidth for DDR:
    64 bits (bus width) x 2 (for DDR) x "real" memory frequency (1333 MHz for DDR4-2666) x number of channels.
    1 channel DDR4-2666: 19.86 GiB/s
    ...so the same as the numbers you got. The STREAM benchmark can get close to that under the right circumstances, e.g. being compiled with the right instruction set, and using dual-rank DIMMs.
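
    Working that formula through for the module counts tested here (my arithmetic, assuming DDR4-2666 with one DIMM per channel):

    1 channel: 8 bytes x 2 x 1333 MHz = ~21.3 GB/s = ~19.9 GiB/s
    3 channels: 3 x 19.9 GiB/s = ~59.6 GiB/s
    6 channels: 6 x 19.9 GiB/s = ~119.2 GiB/s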

    In retrospect, I hope you can see why I tried to push you towards the STREAM benchmark. It only needs to run for a few seconds (not hours) and can get close to the maximum bandwidth possible with the given hardware. Hence its use in performance analysis for HPC codes.
     
    #34
    Last edited: Nov 1, 2019
  15. Rand__

    Rand__ Well-Known Member

    I'd be happy to try it if I could find a compiled binary for Windows or Linux... ;)
    But the STREAM page looks to have been active (and unchanged, layout-wise ;)) since 1988.
     
    #35
  16. alex_stief

    alex_stief Active Member

    It definitely has that old-school vibe to it :cool: But don't worry, it still does what it is supposed to do.
    That is part of the point of not making a pre-compiled binary available for direct download: you can only unleash its full potential by compiling it for your specific system.
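
    A minimal sketch of that, assuming gcc with OpenMP on Linux (the array size here is an illustrative value; it just needs to be large enough that the arrays dwarf the last-level caches):

    gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
    OMP_NUM_THREADS=36 ./stream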
     
    #36
  17. Rand__

    Rand__ Well-Known Member

    Yeah, that's what I was afraid of :p

    Installed AIDA64 just for testing the other night...
    upload_2019-11-2_8-23-13.png

    upload_2019-11-2_8-24-57.png

    Will check if that provides more accurate info (without having to actually test everything myself). This is the 5 DIMM + 1 NVDIMM config.
     
    #37
  18. alex_stief

    alex_stief Active Member

    In case you want to avoid all the hassle and get a more direct answer to your question, you can always go over to Software Tuning, Performance Optimization & Platform Monitoring.
    Some of the people there really know what they are talking about, and chances are they can tell you what is going on with NVDIMMs under the hood, without poking through any benchmarks.
     
    #38
  19. Rand__

    Rand__ Well-Known Member

    Good idea, did that. Under moderator review...

    7 modules, AIDA (6 + NVDIMM):
    Significantly worse - based on this, a 5+1 setup looks preferable...

    upload_2019-11-2_10-45-30.png

    upload_2019-11-2_10-45-46.png


    3+1 Modules:
    upload_2019-11-2_19-10-39.png

    upload_2019-11-2_19-10-54.png
     
    #39
    Last edited: Nov 2, 2019
