At least in my case, running Kimi K2 Q3_K_M (split between system RAM on a consumer 9900X and 272GB of VRAM), I get about 300-400 t/s PP and 12-14 t/s TG, using llama.cpp with:
Code:
./llama-server -m '/run/media/pancho/MyDrive/models_llm_2tb/Kimi-K2.5-Q3_K_M-00001-of-00011.gguf' -c 32768 --no-mmap -mg 0 -ub 2048
For example:
Code:
prompt eval time = 11646.12 ms / 4394 tokens ( 2.65 ms per token, 377.29 tokens per second)
eval time = 50754.89 ms / 633 tokens ( 80.18 ms per token, 12.47 tokens per second)
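If you want to double-check the t/s figures from your own logs, here is a minimal Python sketch that recomputes them from the two timing lines above. The regex and the function name `tokens_per_second` are just my own helpers for illustration, not anything from llama.cpp, and the pattern assumes the log keeps the "<ms> ms / <n> tokens" layout shown here.

```python
import re

# Matches the "<ms> ms / <n> tokens" part of a timing line like the ones above.
# Assumption: the log format stays as shown in this post.
TIMING = re.compile(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")

def tokens_per_second(line: str) -> float:
    """Recompute throughput from elapsed milliseconds and token count."""
    match = TIMING.search(line)
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    elapsed_ms = float(match.group(1))
    tokens = int(match.group(2))
    return tokens / (elapsed_ms / 1000.0)

pp_line = "prompt eval time = 11646.12 ms / 4394 tokens ( 2.65 ms per token, 377.29 tokens per second)"
tg_line = "eval time = 50754.89 ms / 633 tokens ( 80.18 ms per token, 12.47 tokens per second)"

pp = tokens_per_second(pp_line)
tg = tokens_per_second(tg_line)
print(f"PP: {pp:.2f} t/s, TG: {tg:.2f} t/s")
```

Both values come out the same as what the log itself reports (377.29 and 12.47 t/s).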
It's not much, but pretty decent for having part of the model in RAM! I would guess an Epyc/Threadripper would be noticeably faster on TG, as I'm limited to about 70-75GB/s of bandwidth with my RAM. PP seems to be limited by both compute and PCIe: transfers from the CPU to the main (CUDA 0) GPU max out at 64GB/s, so I guess if PCIe 6.0 x16 existed on consumer boards and GPUs it would be even faster.
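To put rough numbers on that bandwidth guess, here is a back-of-the-envelope Python sketch. It only uses the 80.18 ms per token and the 70-75GB/s from above. The big assumption (mine, not measured) is that TG time is spent entirely on streaming weights from RAM, so it gives a ceiling on data read per token and an optimistic scaling estimate, not a prediction. The 200GB/s in the example is a made-up placeholder, not a spec for any real CPU.

```python
MS_PER_TOKEN = 80.18           # from the eval time line above
RAM_BW_GBPS = (70.0, 75.0)     # RAM bandwidth range quoted above, in GB/s

def max_gb_per_token(ms_per_token: float, bandwidth_gbps: float) -> float:
    """Upper bound on data streamed from RAM per token at a given bandwidth."""
    return (ms_per_token / 1000.0) * bandwidth_gbps

def scaled_tps(current_tps: float, current_bw: float, new_bw: float) -> float:
    """Optimistic TG estimate, ASSUMING TG is purely RAM-bandwidth-bound."""
    return current_tps * (new_bw / current_bw)

low = max_gb_per_token(MS_PER_TOKEN, RAM_BW_GBPS[0])
high = max_gb_per_token(MS_PER_TOKEN, RAM_BW_GBPS[1])
print(f"At most {low:.2f}-{high:.2f} GB read from RAM per token")

# Hypothetical placeholder bandwidth, only to show how the scaling works.
HYPOTHETICAL_BW_GBPS = 200.0
current_tps = 1000.0 / MS_PER_TOKEN
estimate = scaled_tps(current_tps, RAM_BW_GBPS[1], HYPOTHETICAL_BW_GBPS)
print(f"Optimistic TG at {HYPOTHETICAL_BW_GBPS:.0f} GB/s: {estimate:.1f} t/s")
```

In practice the GPU layers and PCIe transfers take part of those 80 ms too, so the real gain from more RAM bandwidth would be smaller than this linear scaling suggests.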
I was planning to get a Threadripper, but 256GB of DDR5 RDIMM 6000MHz RAM now costs more than all the GPUs I have bought (8) combined, which is just insane.
When RAM stops being that overpriced, probably in a few years, I will get a next-gen Epyc/Threadripper or something.