I used to run a two-socket system together with several GPUs. With KTransformers, a copy of the model can be loaded into the memory of each socket, which significantly speeds up GPU access to RAM.
In any case, you need to disable NUMA virtualization. Do not expect decent speeds under Windows. You can also use the ik_llama.cpp fork for model inference. On a two-socket system you will not be able to load both sockets fully, because the UPI link between them limits the speed.
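If you want to see how the sockets are exposed to the OS before tuning anything, here is a rough sketch that lists NUMA nodes and their CPUs. It assumes Linux and the standard sysfs layout under /sys/devices/system/node; the function names parse_cpulist and numa_nodes are my own and are not part of KTransformers or ik_llama.cpp.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs; empty dict if unavailable."""
    nodes = {}
    base = Path(root)
    if not base.is_dir():
        return nodes  # not Linux, or sysfs is not mounted
    for node_dir in base.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            # Directory names look like 'node0', 'node1', ...
            nodes[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

On a two-socket machine you would normally expect at least two nodes here. If only one shows up, the firmware or hypervisor is probably presenting the memory as a single interleaved node.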