Subject: Gemma 31B on a Minisforum MS-01 with RTX 4000 SFF Ada: what actually helped
I spent a few days trying to turn a small workstation/homelab box into a useful local LLM host. The end result is not "this replaces cloud/frontier models." It does not. But the tuning results were concrete enough that they may be useful to others trying to run local GGUF models on 20 GB class NVIDIA cards.
The short version:
- The useful win was increasing GPU layer offload after moving display ownership off the NVIDIA card.
- The best safe 31B default on this machine moved from -ngl 48 to -ngl 54.
- Decode improved from about 7.47 tok/s to about 8.81 tok/s.
- Long-prompt prefill improved from about 44-45 seconds to about 37-39 seconds on a 14.5k-token synthetic prompt.
- More aggressive batch/microbatch and checkpoint settings help specific workloads, but I would keep them as manual profiles, not defaults.
- A 20 GB card can run a usable local 31B model, but it still does not feel like a strong remote coding model.
Hardware
========
Host:
- Minisforum MS-01 class system
- CPU: Intel Core i9-12900H
- 20 logical CPUs
- RAM: 96 GB installed, about 93 GiB visible
- GPU: NVIDIA RTX 4000 SFF Ada Generation, 20 GB VRAM
- iGPU: Intel Alder Lake-P integrated graphics
- OS: Debian 12 with backported 6.12 kernel
- Kernel during tests: 6.12.74+deb12-amd64
- NVIDIA driver observed during tests: 535.261.03, CUDA 12.2 reported by nvidia-smi
The NVIDIA card is an RTX 4000 SFF Ada, not an A4000. It is a compact workstation card with 20 GB VRAM and modest power draw. That makes it attractive for a small box, but the 20 GB VRAM ceiling is very real for local LLMs.
Software stack
==============
Runtime:
- llama.cpp
- CUDA backend
- OpenAI-compatible llama-server
Primary text model:
- Gemma 4 31B instruct GGUF
- Quant: Q4_K_M
- Context: 16384
- KV cache: q8_0 / q8_0
- Reasoning disabled
- One parallel slot
Secondary service:
- Gemma 4 E4B multimodal GGUF Q8_0
- Used as a local image classification / OCR triage backend
- This is separate from the 31B tuning described below
Why the iGPU mattered
=====================
Initially the NVIDIA card was also carrying desktop/display duties. After enabling the Intel iGPU and forcing Xorg/SDDM onto Intel graphics, NVIDIA memory dropped to essentially idle when no model was running.
That reclaimed enough VRAM headroom to make a higher 31B offload setting viable.
This was not free. It cost some display convenience, including dual-monitor capability in the current setup. That tradeoff matters. If the machine is primarily a daily desktop, the display loss may not be worth the LLM gain. If it is primarily an inference box, it was worth testing.
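One way to confirm that the card really is idle after the display move is to read used/free memory from nvidia-smi. The following is a minimal sketch, not the check used for this post. The helper names are hypothetical, and the sample line reuses the -ngl 54 figures from the offload sweep purely as example input.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, free' MiB pairs (one GPU per line) into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        used, free = (int(field.strip()) for field in line.split(","))
        gpus.append({"used_mib": used, "free_mib": free})
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for used/free memory in MiB. Needs the NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.free",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_memory_csv(out)


# Example input in the shape nvidia-smi prints with the options above.
# The numbers are copied from the -ngl 54 row of the sweep, not re-measured.
sample = parse_memory_csv("17640, 1241\n")
print(sample)
```

On the host itself, calling query_gpu_memory() with no model loaded should show the used figure close to zero if Xorg/SDDM is no longer holding memory on the NVIDIA card.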
Benchmark shape
===============
These are not broad benchmarks. They are controlled, practical measurements against one local deployment.
The main benchmark used synthetic text-only chat prompts calibrated to about 14.5k tokens.
Two prompt layouts were used:
1. Stable-prefix pair
- Same long prefix
- Short changed suffix
- This measures normal prompt-cache reuse when the large unchanged material is at the front.
2. Middle-mutation pair
- Same beginning and same tail
- Small changed text in the middle
- This is a harder case for normal prefix caching.
Responses were capped very short. The benchmark is mostly about prefill/cache behavior, not answer quality.
A separate short-prompt decode test generated 256 tokens to estimate decode throughput.
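The two layouts can be illustrated with a small sketch. This is not the generator used for the benchmark: the function names are hypothetical, and plain word lists stand in for real tokens. It shows why the middle-mutation pair is harder for a prefix cache: the shared prefix ends at the edit, even though almost all of the content is identical.

```python
def common_prefix_len(a, b):
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stable_prefix_pair(body, suffix_a, suffix_b):
    """Same long prefix, short changed suffix."""
    return body + suffix_a, body + suffix_b


def middle_mutation_pair(head, tail, mid_a, mid_b):
    """Same beginning and same tail, small change in the middle."""
    return head + mid_a + tail, head + mid_b + tail


# Stand-in "tokens"; a real test would tokenize with the model's tokenizer.
body = [f"w{i}" for i in range(1000)]

s1, s2 = stable_prefix_pair(body, ["query-a"], ["query-b"])
m1, m2 = middle_mutation_pair(body[:500], body[500:], ["edit-a"], ["edit-b"])

stable_shared = common_prefix_len(s1, s2)
middle_shared = common_prefix_len(m1, m2)
print(f"stable prefix: {stable_shared} of {len(s1)} items reusable as a prefix")
print(f"middle mutation: {middle_shared} of {len(m1)} items reusable as a prefix")
```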
Starting baseline
=================
The working 31B baseline before the final tuning pass was:
-c 16384
-ngl 48
-np 1
--threads 12
--threads-batch 16
--cache-type-k q8_0
--cache-type-v q8_0
--cache-ram 32768
--fit off
--reasoning off
At -ngl 48, the 31B server used about 15.9 GiB VRAM.
Measured at -ngl 48:
Stable prefix cold: 43.9s
Stable prefix reuse: 1.85s
Middle mutation cold: 45.3s
Middle mutation reuse: 45.0s
Decode average: 7.47 tok/s
The stable-prefix result is important: llama.cpp's normal prompt cache was already very effective when the prompt shape was cache-friendly.
GPU layer offload sweep: the real win
=====================================
After freeing the NVIDIA card from display use, I swept higher -ngl values.
Load results:
-ngl 49: healthy, 16170 MiB used, 2711 MiB free
-ngl 50: healthy, 16526 MiB used, 2355 MiB free
-ngl 51: healthy, 16796 MiB used, 2085 MiB free
-ngl 52: healthy, 17100 MiB used, 1781 MiB free
-ngl 53: healthy, 17370 MiB used, 1511 MiB free
-ngl 54: healthy, 17640 MiB used, 1241 MiB free
-ngl 55: healthy, 17942 MiB used, 939 MiB free
-ngl 56: healthy, 18326 MiB used, 555 MiB free
-ngl 57: healthy, 18630 MiB used, 251 MiB free
-ngl 58: failed during CUDA compute-buffer allocation
The -ngl 58 failure was:
ggml_backend_cuda_buffer_type_alloc_buffer:
allocating 522.50 MiB on device 0: cudaMalloc failed: out of memory
graph_reserve: failed to allocate compute buffers
I treated 1 GiB free as the practical floor for a default. That made -ngl 54 the highest comfortable setting. The higher values load, but the margin is too small for a service I expect to restart reliably.
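The selection rule is simple enough to write down. The sketch below applies the 1 GiB floor to the sweep numbers above and also derives the average VRAM cost per extra offloaded layer. The function names are hypothetical, and the per-layer figure is only an average over this sweep, not a property of the model.

```python
# (used MiB, free MiB) per -ngl value, copied from the load results above.
SWEEP = {
    49: (16170, 2711),
    50: (16526, 2355),
    51: (16796, 2085),
    52: (17100, 1781),
    53: (17370, 1511),
    54: (17640, 1241),
    55: (17942, 939),
    56: (18326, 555),
    57: (18630, 251),
}
FLOOR_MIB = 1024  # the "1 GiB free" rule of thumb


def highest_safe_ngl(sweep, floor_mib):
    """Largest -ngl whose free memory is still at or above the floor."""
    safe = [ngl for ngl, (_, free) in sweep.items() if free >= floor_mib]
    return max(safe) if safe else None


def avg_mib_per_layer(sweep):
    """Average growth in used VRAM per additional offloaded layer."""
    lo, hi = min(sweep), max(sweep)
    return (sweep[hi][0] - sweep[lo][0]) / (hi - lo)


best = highest_safe_ngl(SWEEP, FLOOR_MIB)
per_layer = avg_mib_per_layer(SWEEP)
print(f"highest -ngl with >= {FLOOR_MIB} MiB free: {best}")
print(f"average cost per extra layer: {per_layer:.1f} MiB")
```

At roughly 300 MiB per layer, each step up in -ngl removes a large share of the remaining margin once free memory is near 1 GiB, which is why the last few values in the sweep load but are not comfortable.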
Comparison:
Setting   Stable cold   Stable reuse   Middle cold   Middle reuse   Decode
-ngl 48   43.9s         1.85s          45.3s         45.0s          7.47 tok/s
-ngl 54   37.0s         1.35s          38.5s         38.2s          8.81 tok/s
This is the one change I promoted to the launcher default.
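For a sense of scale, the relative changes can be computed directly from the comparison figures. This is plain arithmetic on the numbers reported above, with a hypothetical helper name; it adds no new measurements.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (value at -ngl 48, value at -ngl 54), copied from the comparison above.
results = {
    "stable cold (s)": (43.9, 37.0),
    "stable reuse (s)": (1.85, 1.35),
    "middle cold (s)": (45.3, 38.5),
    "middle reuse (s)": (45.0, 38.2),
    "decode (tok/s)": (7.47, 8.81),
}

changes = {name: round(pct_change(b, a), 1) for name, (b, a) in results.items()}
for name, delta in changes.items():
    print(f"{name:18s} {delta:+.1f}%")
```

In round terms that is about 15% less time on cold long prompts and about 18% more decode throughput from six additional offloaded layers.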
Current safe default:
-c 16384
-ngl 54
-np 1
--threads 12
--threads-batch 16
--cache-type-k q8_0
--cache-type-v q8_0
--cache-ram 32768
--fit off
--reasoning off
SWA / cache-reuse experiment
============================
I also tested:
--swa-full --cache-reuse 256
This was not a default-worthy result.
At the normal -ngl 48 shape, adding full SWA failed before the server ever reported healthy:
llama_kv_cache_iswa: using full-size SWA cache
ggml_backend_cuda_buffer_type_alloc_buffer:
allocating 5304.00 MiB on device 0: cudaMalloc failed: out of memory
To make it fit, I had to reduce offload to -ngl 40.
At -ngl 40:
Setting                     Stable cold   Stable reuse   Middle cold   Middle reuse
baseline -ngl 40            53.8s         2.47s          55.0s         54.5s
-ngl 40 + SWA/cache-reuse   63.3s         2.36s          66.8s         37.4s
So yes, it helped the middle-mutation retry case. But it slowed cold passes and required giving up GPU offload. I would not use it as the general default on this hardware.
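A rough break-even calculation makes the tradeoff concrete. The sketch assumes a session is one cold middle-mutation prompt followed by some number of middle-mutation retries, and it reuses the -ngl 54 timings from the offload comparison earlier in this post. The model and the function name are illustrative assumptions, not something measured.

```python
import math


def break_even_retries(cold_a, retry_a, cold_b, retry_b):
    """Smallest whole number of retries at which profile B is no slower
    than profile A in total time. Returns None if B never catches up."""
    saving_per_retry = retry_a - retry_b
    if saving_per_retry <= 0:
        return None
    extra_cold = cold_b - cold_a
    return max(0, math.ceil(extra_cold / saving_per_retry))


# (middle cold, middle reuse) in seconds, copied from the tables in this post.
BASELINE_NGL40 = (55.0, 54.5)
SWA_NGL40 = (66.8, 37.4)
DEFAULT_NGL54 = (38.5, 38.2)

vs_ngl40 = break_even_retries(*BASELINE_NGL40, *SWA_NGL40)
vs_ngl54 = break_even_retries(*DEFAULT_NGL54, *SWA_NGL40)
print(f"SWA profile vs baseline -ngl 40: pays off after {vs_ngl40} retry")
print(f"SWA profile vs default  -ngl 54: pays off after {vs_ngl54} retries")
```

Against the like-for-like -ngl 40 baseline the SWA profile wins after a single retry, but against the -ngl 54 default it needs dozens of retries on the same prompt before the slower cold pass is recovered.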
Batch and microbatch sweep
==========================
With -ngl 54 fixed, I tested larger batch settings.
Baseline upstream behavior is effectively:
--batch-size 2048
--ubatch-size 512
Results:
Batch/ubatch   Stable cold   Stable reuse   Middle cold   Middle reuse   Default-safe
2048/512       37.0s         1.35s          38.5s         38.2s          yes
4096/512       36.9s         1.35s          38.5s         38.1s          yes, no real gain
4096/1024      32.1s         1.40s          33.6s         18.1s          no, too tight
8192/1024      31.9s         1.41s          33.5s         17.9s          no, too tight
The ubatch 1024 cases are interesting. They accelerate prefill and make the middle-mutation retry much faster. But the memory margin is poor.
For 8192/1024, the stopped-server memory breakdown showed CUDA free memory down to about 247 MiB, with compute buffers around 1045 MiB.
That is too tight for an everyday default. I kept it as a manual high-throughput profile:
--batch-size 8192 --ubatch-size 1024
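To put the batch table in throughput terms, the stable-cold times can be converted to approximate prefill tokens per second. The sketch assumes the prompt is exactly 14,500 tokens (the post only says "about 14.5k") and that the prompt is processed in micro-batches of at most --ubatch-size tokens, so the results are estimates.

```python
import math

PROMPT_TOKENS = 14_500  # assumption: "about 14.5k tokens" taken as exact

# Stable-prefix cold time in seconds per batch/ubatch setting, from the table.
cold_seconds = {
    "2048/512": 37.0,
    "4096/512": 36.9,
    "4096/1024": 32.1,
    "8192/1024": 31.9,
}

prefill_tps = {k: PROMPT_TOKENS / s for k, s in cold_seconds.items()}

# Number of micro-batches needed for the prompt at each ubatch size.
micro_batches = {ub: math.ceil(PROMPT_TOKENS / ub) for ub in (512, 1024)}

for setting, tps in prefill_tps.items():
    print(f"{setting:10s} ~{tps:.0f} tok/s prefill")
for ub, count in micro_batches.items():
    print(f"ubatch {ub}: {count} micro-batches")
```

Under these assumptions the gain tracks the ubatch size rather than the logical batch size, which matches the table: 4096/512 is no faster than 2048/512, while both ubatch 1024 rows are.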
Context checkpoint spacing
==========================
The ubatch 1024 result suggested that context checkpoint behavior might be part of the middle-mutation improvement, so I tested checkpoint spacing with default batch settings.
The upstream default checkpoint interval is 8192 tokens.
Results:
Checkpoint interval   Stable cold   Stable reuse   Middle cold   Middle reuse
default 8192          37.0s         1.35s          38.5s         38.2s
4096                  37.1s         1.34s          38.8s         28.6s
2048                  37.9s         1.34s          39.7s         23.8s
1024                  38.1s         1.35s          40.0s         24.0s
The 2048 interval was best for middle-mutation retry. It did not increase resident VRAM the way ubatch 1024 did, and the stopped-server memory breakdown still showed about 953 MiB CUDA free memory.
The cost is slower cold prompts and larger prompt-cache state. At 2048, the server reported 8 checkpoints and about 6329 MiB cache state for one 14.5k-token prompt.
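The cache-state figure is easier to judge against the --cache-ram budget. The arithmetic below assumes the whole 6329 MiB counts against the 32768 MiB cache-ram limit, which is an assumption about accounting rather than something verified here. The per-checkpoint number is just an average, not a claim that the state is split evenly.

```python
CACHE_RAM_MIB = 32768   # --cache-ram value used throughout this post
STATE_MIB = 6329        # reported cache state for one 14.5k-token prompt
CHECKPOINTS = 8         # reported checkpoint count at the 2048 interval

avg_mib_per_checkpoint = STATE_MIB / CHECKPOINTS
prompts_that_fit = CACHE_RAM_MIB // STATE_MIB

print(f"average state per checkpoint: {avg_mib_per_checkpoint:.0f} MiB")
print(f"prompts of this size that fit in cache-ram: {prompts_that_fit}")
```

If that accounting holds, only a handful of distinct long prompts of this size can stay cached at once with the 2048 interval.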
I kept the default checkpoint interval unchanged, but added this as a manual profile:
--checkpoint-every-n-tokens 2048
For a balanced manual option:
--checkpoint-every-n-tokens 4096
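Which interval is "best" depends on how often a prompt is retried with a middle edit. A minimal cost model, assuming one cold pass followed by a given number of middle-mutation retries and using the middle-mutation timings from the results above (the function names are hypothetical):

```python
# interval -> (middle cold seconds, middle reuse seconds), from the results.
CHECKPOINT_RESULTS = {
    8192: (38.5, 38.2),
    4096: (38.8, 28.6),
    2048: (39.7, 23.8),
    1024: (40.0, 24.0),
}


def session_seconds(cold, retry, retries):
    """Total time for one cold pass plus a number of retries."""
    return cold + retries * retry


def best_interval(results, retries):
    """Interval with the lowest total session time for this retry count."""
    return min(results, key=lambda k: session_seconds(*results[k], retries))


for retries in (0, 1, 3):
    print(f"{retries} retries -> interval {best_interval(CHECKPOINT_RESULTS, retries)}")
```

With no retries the default interval wins on cold time alone; with even one middle-mutation retry the 2048 interval comes out ahead in this model. It does not account for the larger cache state that the tighter interval produces.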
Cache RAM
=========
I also tested increasing cache RAM from 32768 MiB to 65536 MiB before this final tuning pass. It did not produce a meaningful speed gain on the repeated long-prefix test.
Measured earlier:
cache-ram 32768: about 58.8s cold, 12.0s cached
cache-ram 65536: about 59.8s cold, 12.1s cached
The conclusion was simple: 32768 MiB was already enough for the tested single cached prefix. More cache RAM used system RAM but did not make that workload faster.
What I would keep
=================
Safe default:
-c 16384
-ngl 54
-np 1
--threads 12
--threads-batch 16
--cache-type-k q8_0
--cache-type-v q8_0
--cache-ram 32768
--fit off
--reasoning off
Manual high-throughput profile, if tight VRAM margin is acceptable:
--batch-size 8192 --ubatch-size 1024
Manual middle-mutation profile:
--checkpoint-every-n-tokens 2048
Experimental only:
-ngl 55 through -ngl 57
--swa-full --cache-reuse 256
Do not bother without another memory-saving change:
-ngl 58
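The safe default and the two manual profiles above could be composed in a small launcher helper. This is a sketch, not the launcher used on this machine: the function name, the profile names, and the model path are hypothetical, and the flags are only the ones listed in this post plus -m for the model file.

```python
import shlex

# Safe default flags as listed in this post. -np 1 is taken from the
# "Current safe default" listing in the offload sweep section.
SAFE_DEFAULT = [
    "-c", "16384",
    "-ngl", "54",
    "-np", "1",
    "--threads", "12",
    "--threads-batch", "16",
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
    "--cache-ram", "32768",
    "--fit", "off",
    "--reasoning", "off",
]

# Manual profiles layered on top of the default; profile names are made up.
MANUAL_PROFILES = {
    "high-throughput": ["--batch-size", "8192", "--ubatch-size", "1024"],
    "middle-mutation": ["--checkpoint-every-n-tokens", "2048"],
}


def build_command(model_path, profile=None, binary="llama-server"):
    """Return the argv list for the safe default plus an optional profile."""
    cmd = [binary, "-m", model_path, *SAFE_DEFAULT]
    if profile is not None:
        cmd += MANUAL_PROFILES[profile]
    return cmd


# Hypothetical model path, for illustration only.
example = build_command("/models/gemma-31b-q4_k_m.gguf", "middle-mutation")
print(shlex.join(example))
```

Keeping the profiles as explicit overlays means the default never picks up the tight-margin settings by accident; the experimental options are deliberately left out of the table.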
Operational notes
=================
A few practical lessons from this run:
1. 20 GB VRAM is the ceiling.
The system has 96 GB RAM and 20 CPU threads. That helps with loading, caching, and running hybrid configurations, but it does not turn a dense 31B model into a fast local coding assistant. GPU memory bandwidth and VRAM capacity dominate.
2. Moving display off NVIDIA can matter.
Moving Xorg/SDDM to the Intel iGPU gave the RTX 4000 SFF Ada a clean compute role and made -ngl 54 viable. But this had a user-experience cost: in my case, it affected dual-monitor behavior. Whether that is worth it depends on whether the box is a workstation or an inference appliance.
3. Prompt shape matters more than some exotic flags.
Stable-prefix caching was already excellent. If the large stable material is before the changing query, cache reuse is strong. Middle edits are much harder unless you use checkpoint or batch profiles tuned for that workload.
4. The "loads successfully" line is not enough.
-ngl 57 loaded, but left only about 251 MiB free. That is not a healthy default. You need enough margin for compute buffers, fragmentation, restarts, and slightly different prompt shapes.
5. This is usable, not magical.
The final safe profile gives about 8.8 tok/s decode on this setup. That is useful for local/private/offline work. It is not competitive with strong hosted models for serious coding throughput.
Final take
==========
The good news: this MS-01 plus RTX 4000 SFF Ada can run a useful local 31B GGUF model, and there are real tuning gains available. The best general gain was not an exotic trick; it was reclaiming VRAM and increasing layer offload from -ngl 48 to -ngl 54.
The bad news: the machine still does not become a frontier-class local coding box. The hardware is good enough for a local lab model, classifier sidecars, privacy-sensitive work, and experiments. It is not enough to make a 31B dense model feel like a modern hosted coding model.
If I were building around this class of hardware again, I would treat 20 GB Ada as a very capable edge/homelab inference card, not as a substitute for a larger VRAM GPU when the goal is comfortable local LLM coding.
Corrections and suggestions welcome. I would be especially interested in comparable llama.cpp numbers from other 20 GB cards, RTX 3090/4090 class cards, and anyone running similar long-prompt cache tests with Gemma-family GGUFs.