Search results

  1. I

    ES Xeon Discussion

    I previously used a two-socket system together with several GPUs. With KTransformers, one model can be loaded onto each socket, which significantly speeds up GPU access to RAM.
  2. I

    ES Xeon Discussion

    In any case, you need to disable NUMA virtualization. Under Windows, do not expect normal speeds. You can also use the ik_llama.cpp fork for model inference. On a two-socket system, you will not be able to load it fully because of UPI speed limits.
  3. I

    ES Xeon Discussion

    Hi! Will this BIOS install on the SP2C741D16X-2T?
  4. I

    ES Xeon Discussion

    Adding: DeepSeek-R1-Q4_K_M, ik_llama.cpp built under Ubuntu.
    Generation:
    ~10 t/s - up to 100 tokens
    ~8 t/s - up to 2,000 tokens
    ~7 t/s - up to 5,000 tokens
    ~6 t/s - up to 10,000 tokens
    Bench:
    pp1000 - 105.46 t/s
    pp512 - 109.69 t/s
    tg128 - 8.74 t/s
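
To put the throughput figures in the last post into wall-clock terms, here is a rough sketch in Python. It assumes total time is simply prompt processing time plus generation time, and that the quoted rates hold as averages over the whole run. The helper name `estimate_seconds` and the example token counts are made up for illustration; only the two rates come from the post.

```python
def estimate_seconds(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Rough end-to-end time in seconds: prompt processing plus generation.

    pp_rate and tg_rate are in tokens per second. Treating both as constant
    is an assumption; the post shows generation slowing as the token count grows.
    """
    return prompt_tokens / pp_rate + output_tokens / tg_rate


# Rates from the post: pp1000 = 105.46 t/s, and ~8 t/s generation "up to 2,000 tokens".
# The 1,000-token prompt and 2,000-token reply are hypothetical example sizes.
total = estimate_seconds(1000, 2000, pp_rate=105.46, tg_rate=8.0)
print(f"{total:.0f} s (about {total / 60:.1f} min)")
```

With these numbers the generation phase dominates: reading the prompt takes under ten seconds, while producing the reply takes several minutes.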