Is there an efficient way to move data inside RAM to another RAM location in C++ or C#?

holee · Apr 12, 2023

I need to move about 1 to 3 GB of data from one location in RAM to another, and repeat this several times.

When I used Buffer.MemoryCopy inside a Parallel.For loop, the CPU load was too high and the copy took a long time.

I'm already using 8-90% of the CPU because the program is performing other calculations, so the copy seems to be waiting for resources, and I think that is why it takes so long.

I've also looked for ways like DMA, but it seems to be only possible when communicating with peripherals.

Does anyone know how to minimize the CPU load, or how to move data from one RAM location to another at high speed?

Sean Ho · Apr 12, 2023

Memory management is complex, and there are a lot of layers between a memcpy in application software and actual bits in SDRAM. Are you able to share a bit more about the use case, what is producing the 1-3GB of data, and what is consuming it? Also, what hardware platform you're using, are NUMA issues relevant, etc.?

holee · Apr 12, 2023

Sean Ho said:
Memory management is complex, and there are a lot of layers between a memcpy in application software and actual bits in SDRAM. Are you able to share a bit more about the use case, what is producing the 1-3GB of data, and what is consuming it? Also, what hardware platform you're using, are NUMA issues relevant, etc.?

The data is an image that is stored non-contiguously in 1 TB of RAM.
So for a 50000*50000 image, I copy 50000 bytes, move the pointer to the next location, and copy 50000 bytes again.
This process must be repeated 50000 times.
The purpose of the copy is to transmit this data to another PC over RDMA.
A memory buffer for the RDMA transmission must be allocated separately,
so the main goal is to transfer the data from memory into this RDMA buffer.
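
The row-by-row copy described above can be sketched in C++ roughly as follows. This is a minimal illustration, not the poster's actual code: the `Region` struct, its field names, and `gather_region` are hypothetical, and it assumes a row-major image with a fixed stride between rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical description of a rectangular region inside a larger
// row-major image. None of these names come from the original post.
struct Region {
    const unsigned char* first_row;  // address of the region's first byte
    std::size_t row_bytes;           // bytes copied per row (e.g. 50000)
    std::size_t rows;                // number of rows (e.g. 50000)
    std::size_t src_stride;          // bytes between row starts in the source
};

// Copies the region row by row into one contiguous destination buffer
// and returns the number of bytes written.
std::size_t gather_region(const Region& r, unsigned char* dst,
                          std::size_t dst_capacity) {
    assert(r.src_stride >= r.row_bytes);
    assert(dst_capacity >= r.row_bytes * r.rows);
    const unsigned char* src = r.first_row;
    for (std::size_t y = 0; y < r.rows; ++y) {
        std::memcpy(dst, src, r.row_bytes);  // one contiguous row
        src += r.src_stride;                 // jump to the next source row
        dst += r.row_bytes;                  // destination stays packed
    }
    return r.row_bytes * r.rows;
}
```

Each `memcpy` call here moves only one row, so with 50000-byte rows the per-call overhead is small compared with the memory traffic itself. The total cost is still dominated by reading and writing 1 to 3 GB.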

Stephan · Apr 13, 2023

On AMD64, an unrolled, prefetching AVX2 memmove should be close to optimal. See dpdk/rte_memcpy.h at master · scylladb/dpdk for an example.

On the hardware side, check that your CPU has all memory channels properly populated for best performance. Check Intel ARK for how many memory channels the CPU has, and/or the computer's manual.

Popular public libraries like glibc all solved this problem years or decades ago. Don't forget to always benchmark your code.
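
As a rough illustration of the benchmarking advice, something like the following could be used to measure what plain `std::memcpy` achieves on a given machine before trying hand-written AVX2 loops. The helper name `memcpy_gib_per_s` is made up for this sketch, and a single-threaded timing loop like this is only a starting point, not a rigorous benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies `bytes` bytes `repeats` times with std::memcpy
// and returns the measured throughput in GiB/s.
double memcpy_gib_per_s(std::size_t bytes, int repeats) {
    assert(bytes > 0 && repeats > 0);
    std::vector<unsigned char> src(bytes, 1), dst(bytes, 0);
    std::memcpy(dst.data(), src.data(), bytes);  // warm-up copy, not timed

    volatile unsigned sink = 0;  // keeps the compiler from discarding the copies
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        src[0] = static_cast<unsigned char>(i);  // make every pass distinct
        std::memcpy(dst.data(), src.data(), bytes);
        sink = sink + dst[bytes - 1];            // read back so the copy is observable
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double gib = static_cast<double>(bytes) * repeats /
                       (1024.0 * 1024.0 * 1024.0);
    return gib / seconds;
}
```

Calling `memcpy_gib_per_s(std::size_t{1} << 30, 10)` would time ten 1 GiB copies. Running it while the program's other calculations are active gives a more honest picture of what the copy can achieve under load.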

RolloZ170 · Apr 13, 2023

What OS?
C# is slow (except when fully compiled).
C++ run under a debugger is slower than the compiled program run on its own.

paf · Apr 13, 2023

First, be sure you REALLY want/need to move all those bytes.
Why do you need to move all the bytes?

Are you sure a buffer for RDMA sending must be allocated separately?
Can't you send each "image chunk" without allocating any new memory?
Can't you generate the image in a contiguous RAM location?
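
One way to picture "sending each chunk without allocating new memory" is a list of (address, length) segments that describes the rows in place instead of copying them. The sketch below is generic C++: `Segment` and `build_segments` are hypothetical names, not part of any RDMA library. Whether a given RDMA stack accepts such a list, how many entries it allows per request, and whether the source memory can be registered at all are assumptions that would need checking.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-place description of one contiguous run of bytes.
struct Segment {
    const unsigned char* addr;
    std::size_t length;
};

// Describes `rows` rows of `row_bytes` bytes, `stride` bytes apart, as a
// list of segments. Rows that happen to be adjacent in memory are merged.
std::vector<Segment> build_segments(const unsigned char* first_row,
                                    std::size_t row_bytes,
                                    std::size_t rows,
                                    std::size_t stride) {
    assert(stride >= row_bytes);
    std::vector<Segment> segments;
    for (std::size_t y = 0; y < rows; ++y) {
        const unsigned char* row = first_row + y * stride;
        if (!segments.empty() &&
            segments.back().addr + segments.back().length == row) {
            segments.back().length += row_bytes;  // extend the previous run
        } else {
            segments.push_back({row, row_bytes});  // start a new run
        }
    }
    return segments;
}
```

No image bytes are touched while building the list, so the CPU cost is proportional to the number of rows rather than the number of bytes.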

krista · Apr 13, 2023

see if your rdma hw/lib/driver supports zero-copy: this *should* allow you to let the hw/lib/driver/kernel use your already allocated memory.

if you just need a read-only copy, you can get funky and map its pages into a different process, so all that gets updated is the system and process pagetables + bookkeeping.

memory optimizations are tricksey... so talk to me. i need details if i am to help you.

unwind-protect · Apr 13, 2023

I have a feeling that I/O is already happening when copying to this specially allocated buffer. That would also explain the slowness.

krista · Apr 13, 2023

unwind-protect said:
I have a feeling that I/O is already happening when copying to this specially allocated buffer. That would also explain the slowness.

entirely possible as a lot of rdma api/lib sort of automagically do their i/o.

unfortunately it's been a minute since i've played around with it.

hopefully i'll get a chance to screw with my ib-fdr setup and ue5 and see if i can do something interesting using multiple servers for extremely large and populated worlds while i look for a job...

Sean Ho · Apr 13, 2023

Agree with paf and krista that streaming may be more appropriate for your use case. Fewer/smaller buffers, not more.

krista · Apr 14, 2023

holee said:
The data is an image that is stored non-contiguously in 1 TB of RAM.
So for a 50000*50000 image, I copy 50000 bytes, move the pointer to the next location, and copy 50000 bytes again.
This process must be repeated 50000 times.
The purpose of the copy is to transmit this data to another PC over RDMA.
A memory buffer for the RDMA transmission must be allocated separately,
so the main goal is to transfer the data from memory into this RDMA buffer.

etymology first! i have a terrible addiction to curiosity which is rapidly leading to terminal end-stage fascination. speaking of fascination, it's changed much since it was first used to describe a sort of evil spell

From Latin fascinare ("to bewitch"), possibly from Ancient Greek βασκαίνειν (baskaínein, “to speak ill of; to curse”). Morphologically fascinate + -ion

from wiktionary

"terabyte" has an interesting root etymology as well: tera

From Ancient Greek τέρας (téras, “monster”). Also from Ancient Greek τέτταρες (téttares, “four”), by analogy with tetra- for being the fourth power of 10³. Doublet of terato-.

source wiktionary

this can mean [magic/sorcery,] or [sign, marvel, wonder] or [divine sign, omen, portent], or monster depending on context.

but don't go away just yet! if we recurse into the etymology of ancient greek's τέρας. we get

From Proto-Indo-European *kʷer-. See also Proto-Slavic *čarъ (“magic, sorcery”).

source wiktionary

... but i'll leave that exercise to the reader (hint: click on the link).

see? words change *lots* over time, and my fascination with nearly everything is indeed a curse!

plus "terabyte" means "magic marvelous monster + byte"

---

just as *words* change with time, programming concepts do as well.

what you are describing as your process is *a* way of doing what you are asking in a vague manner, but it's most likely a naïve solution and/or out-of-date with modern practices, let alone *best* practices.

unfortunately you have not provided enough information to enable us to make a useful comment and actually help. like johnny 5 says, "need more input".

- what are you trying to accomplish?

- is it remote screen or gpu sharing? if so, there are better ways of doing this besides a parallel loop copy bomb absolutely making a total CF of your memory subsystem.

- is it remote rendering? if so, same as above.

- if it's for large-world gaming, same thing with a much different solution (i am working on this problem, personally)

- so do you physically have more than a tib of ram on each machine? or are you planning on relying on an enormous pagefile?

- *why* isn't your tib-size buffer contiguous in the processes' view of memory?

answer these questions verbosely and we can probably help a fair bit!

holee · Apr 16, 2023

The reason I need this copy is an RDMA memory limitation.

RDMA pins memory in order to read or write data with the remote side.

But Windows limits the non-paged pool to 128 GB, so I can only use 128 GB of memory for RDMA.

My program finds and sends a 1 to 3 GB partial image from within the full 600 GB image.

It is impossible to predict where the 1 to 3 GB partial image will be located in the 600 GB image.

If RDMA could pin the entire 600 GB, the buffering would not be necessary, because RDMA is zero-copy.

But under the Windows limitation, RDMA can pin only 128 GB, so it is impossible to find and send the partial image directly.
(RDMA can send data directly only from within 128 GB of the full image.)

So my idea is to copy the partial image from inside the 600 GB full image into a buffer within a separate 128 GB region
and then transfer it to the other PC.
(BufferCopy -> RDMACopy -> BufferCopy -> RDMACopy ....)
(RDMACopy -> BufferCopy -> RDMACopy -> BufferCopy ....) (Double Buffering)

There are about 100 to 200 non-contiguous partial images in the full image.

So I need to keep copying memory into the buffer until all the RDMA transfers are done.
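
The BufferCopy/RDMACopy alternation above could be sketched like this in C++. It is only an outline under stated assumptions: `Chunk`, `SendFn`, and `send_double_buffered` are hypothetical names, the `send` callback stands in for whatever blocking RDMA call the real program uses, and in a real setup the two staging buffers would be the pinned RDMA buffers rather than ordinary vectors.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical description of one partial image, already contiguous here.
struct Chunk {
    const unsigned char* src;
    std::size_t bytes;
};

// Hypothetical stand-in for a blocking RDMA transfer of one staged buffer.
using SendFn = std::function<void(const unsigned char* data, std::size_t bytes)>;

// Fills one staging buffer while the other one is being sent.
void send_double_buffered(const std::vector<Chunk>& chunks,
                          std::size_t buffer_bytes, const SendFn& send) {
    std::array<std::vector<unsigned char>, 2> staging{
        std::vector<unsigned char>(buffer_bytes),
        std::vector<unsigned char>(buffer_bytes)};
    std::thread sender;  // sends the previously filled buffer

    for (std::size_t i = 0; i < chunks.size(); ++i) {
        std::vector<unsigned char>& buf = staging[i % 2];
        assert(chunks[i].bytes <= buffer_bytes);
        // BufferCopy: overlaps with the send of the other buffer, if any.
        std::memcpy(buf.data(), chunks[i].src, chunks[i].bytes);
        // Wait for the previous send, so sends stay in order and the other
        // buffer is free again before the next iteration overwrites it.
        if (sender.joinable()) sender.join();
        // RDMACopy: send this buffer in the background.
        sender = std::thread([&send, data = buf.data(), n = chunks[i].bytes] {
            send(data, n);
        });
    }
    if (sender.joinable()) sender.join();
}
```

Starting one thread per chunk keeps the sketch short. With 100 to 200 chunks of 1 to 3 GB each that overhead should be negligible, but a long-lived sender thread with a small queue would be the more usual structure.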

CyklonDX · Apr 16, 2023

While I haven't read everything;
*I'm also not going to include any code

I think the best way to move data inside RAM to another RAM location is to use buffers.

Though that is very inefficient for whatever purpose you plan to use this for...

You would be much better off keeping a single buffer with the original data and writing only the changes to a separate buffer that acts as an index and refers back to the original data points, rather than copying the whole thing.

So in short, your original buffer might look like this (simplified):

Origin
A1 = something123   B2 = something231
A2 = something      B1 = somethingb

Change
Pos_A2 = something+1

(Whatever type of 'file' or content you are working on, you can create segments to accomplish that, or if you have deeper access to the content you can just save the differences by line and position in the file.)

This way, when you flush the buffer to disk or wherever, you get much better performance while using much less memory.
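
A toy version of the "original buffer plus change index" idea might look like the following. This is only a sketch: `DeltaView` is a hypothetical name, and a per-byte map is far too fine-grained for gigabyte images, where changes would more likely be tracked per block or per segment.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical read/write view: the original data is never copied or
// modified, and only the changed positions are stored separately.
class DeltaView {
public:
    explicit DeltaView(const std::vector<unsigned char>& base) : base_(base) {}

    // Records a change without touching the original buffer.
    void write(std::size_t pos, unsigned char value) {
        assert(pos < base_.size());
        changes_[pos] = value;
    }

    // Returns the changed value if there is one, else the original byte.
    unsigned char read(std::size_t pos) const {
        assert(pos < base_.size());
        const auto it = changes_.find(pos);
        return it != changes_.end() ? it->second : base_[pos];
    }

    std::size_t changed_bytes() const { return changes_.size(); }

private:
    const std::vector<unsigned char>& base_;        // must outlive the view
    std::map<std::size_t, unsigned char> changes_;  // position -> new value
};
```

Memory use then grows with the number of changes rather than with the size of the original data.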

Just copying for the sake of duplication makes no sense. If you are looking to solve local server inter-CPU transfers, to access memory on a different CPU from the PCI ID of the network card etc..., just use memory mirroring mode on your server. It will cut memory in half but leave you with the exact same memory on both CPUs locally.
