Gemma 31B on a Minisforum MS-01 with RTX 4000 SFF Ada: what actually helped


LDighera

New Member
Feb 4, 2024
26
3
3
Santa Barbara, CA

I spent a few days trying to turn a small workstation/homelab box into a useful local LLM host. The end result is not "this replaces cloud/frontier models." It does not. But the tuning results were concrete enough that they may be useful to others trying to run local GGUF models on 20 GB class NVIDIA cards.

The short version:

- The useful win was increasing GPU layer offload after moving display ownership off the NVIDIA card.
- The best safe 31B default on this machine moved from -ngl 48 to -ngl 54.
- Decode improved from about 7.47 tok/s to about 8.81 tok/s.
- Long prompt prefill improved from about 44-45 seconds to about 37-39 seconds on a 14.5k token synthetic prompt.
- More aggressive batch/microbatch and checkpoint settings help specific workloads, but I would keep them as manual profiles, not defaults.
- A 20 GB card can run a usable local 31B model, but it still does not feel like a strong remote coding model.


Hardware
========

Host:
- Minisforum MS-01 class system
- CPU: Intel Core i9-12900H
- 20 logical CPUs
- RAM: 96 GB installed, about 93 GiB visible
- GPU: NVIDIA RTX 4000 SFF Ada Generation, 20 GB VRAM
- iGPU: Intel Alder Lake-P integrated graphics
- OS: Debian 12 with backported 6.12 kernel
- Kernel during tests: 6.12.74+deb12-amd64
- NVIDIA driver observed during tests: 535.261.03, CUDA 12.2 reported by nvidia-smi

The NVIDIA card is an RTX 4000 SFF Ada, not an A4000. It is a compact workstation card with 20 GB VRAM and modest power draw. That makes it attractive for a small box, but the 20 GB VRAM ceiling is very real for local LLMs.


Software stack
==============

Runtime:
- llama.cpp
- CUDA backend
- OpenAI-compatible llama-server

Primary text model:
- Gemma 4 31B instruct GGUF
- Quant: Q4_K_M
- Context: 16384
- KV cache: q8_0 / q8_0
- Reasoning disabled
- One parallel slot

Secondary service:
- Gemma 4 E4B multimodal GGUF Q8_0
- Used as a local image classification / OCR triage backend
- This is separate from the 31B tuning described below


Why the iGPU mattered
=====================

Initially the NVIDIA card was also carrying desktop/display duties. After enabling the Intel iGPU and forcing Xorg/SDDM onto Intel graphics, NVIDIA memory dropped to essentially idle when no model was running.

That reclaimed enough VRAM headroom to make a higher 31B offload setting viable.

This was not free. It cost some display convenience, including dual-monitor capability in the current setup. That tradeoff matters. If the machine is primarily a daily desktop, the display loss may not be worth the LLM gain. If it is primarily an inference box, it was worth testing.


Benchmark shape
===============

These are not broad benchmarks. They are controlled, practical measurements against one local deployment.

The main benchmark used synthetic text-only chat prompts calibrated to about 14.5k tokens.

Two prompt layouts were used:

1. Stable-prefix pair
- Same long prefix
- Short changed suffix
- This measures normal prompt-cache reuse when the large unchanged material is at the front.

2. Middle-mutation pair
- Same beginning and same tail
- Small changed text in the middle
- This is a harder case for normal prefix caching.

Responses were capped very short. The benchmark is mostly about prefill/cache behavior, not answer quality.

A separate short-prompt decode test generated 256 tokens to estimate decode throughput.
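The two prompt layouts can be illustrated with a minimal sketch. It uses plain words as stand-in "tokens" (not real tokenizer output) and assumes the middle mutation sits at the exact midpoint, which the post does not specify. It only measures how much of each pair is shared as a leading prefix; how much of that the server actually reuses depends on the cache implementation.

```python
def common_prefix_len(a, b):
    """Count how many leading items two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical stand-in for a ~14.5k-token prompt body.
body = [f"w{i}" for i in range(14500)]

# Stable-prefix pair: same long prefix, short changed suffix.
stable_a = body + ["question", "A"]
stable_b = body + ["question", "B"]

# Middle-mutation pair: same beginning and tail, small change in the middle.
mid = len(body) // 2  # assumed mutation point
middle_a = body[:mid] + ["edit", "A"] + body[mid:]
middle_b = body[:mid] + ["edit", "B"] + body[mid:]

stable_shared = common_prefix_len(stable_a, stable_b)
middle_shared = common_prefix_len(middle_a, middle_b)

print(f"stable-prefix pair shares {stable_shared / len(stable_a):.1%} as a prefix")
print(f"middle-mutation pair shares {middle_shared / len(middle_a):.1%} as a prefix")
```

In the stable-prefix pair nearly the whole prompt is a common prefix. In the middle-mutation pair everything after the edit point no longer lines up as a prefix, even though the text itself is unchanged.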


Starting baseline
=================

The working 31B baseline before the final tuning pass was:

-c 16384
-ngl 48
-np 1
--threads 12
--threads-batch 16
--cache-type-k q8_0
--cache-type-v q8_0
--cache-ram 32768
--fit off
--reasoning off

At -ngl 48, the 31B server used about 15.9 GiB VRAM.

Measured at -ngl 48:

Stable prefix cold: 43.9s
Stable prefix reuse: 1.85s
Middle mutation cold: 45.3s
Middle mutation reuse: 45.0s
Decode average: 7.47 tok/s

The stable-prefix result is important: llama.cpp's normal prompt cache was already very effective when the prompt shape was cache-friendly.


GPU layer offload sweep: the real win
=====================================

After freeing the NVIDIA card from display use, I swept higher -ngl values.

Load results:

-ngl 49: healthy, 16170 MiB used, 2711 MiB free
-ngl 50: healthy, 16526 MiB used, 2355 MiB free
-ngl 51: healthy, 16796 MiB used, 2085 MiB free
-ngl 52: healthy, 17100 MiB used, 1781 MiB free
-ngl 53: healthy, 17370 MiB used, 1511 MiB free
-ngl 54: healthy, 17640 MiB used, 1241 MiB free
-ngl 55: healthy, 17942 MiB used, 939 MiB free
-ngl 56: healthy, 18326 MiB used, 555 MiB free
-ngl 57: healthy, 18630 MiB used, 251 MiB free
-ngl 58: failed during CUDA compute-buffer allocation

The -ngl 58 failure was:

ggml_backend_cuda_buffer_type_alloc_buffer:
allocating 522.50 MiB on device 0: cudaMalloc failed: out of memory
graph_reserve: failed to allocate compute buffers

I treated 1 GiB free as the practical floor for a default. That made -ngl 54 the highest comfortable setting. The higher values load, but the margin is too small for a service I expect to restart reliably.
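The selection rule above can be written down as a short sketch over the sweep numbers. The 1024 MiB floor is the post's own rule of thumb; the per-layer step is just the difference between adjacent rows and is not an official figure.

```python
# (used MiB, free MiB) per -ngl value, copied from the load results above.
sweep = {
    49: (16170, 2711),
    50: (16526, 2355),
    51: (16796, 2085),
    52: (17100, 1781),
    53: (17370, 1511),
    54: (17640, 1241),
    55: (17942, 939),
    56: (18326, 555),
    57: (18630, 251),
}

FLOOR_MIB = 1024  # practical free-VRAM floor for a default

# Highest -ngl that still leaves at least the floor free.
safe = [ngl for ngl, (_, free) in sweep.items() if free >= FLOOR_MIB]
best_ngl = max(safe)

# VRAM cost of each additional offloaded layer.
layers = sorted(sweep)
steps = [sweep[b][0] - sweep[a][0] for a, b in zip(layers, layers[1:])]
avg_step = sum(steps) / len(steps)

print(f"highest -ngl with >= {FLOOR_MIB} MiB free: {best_ngl}")
print(f"average VRAM per extra layer: {avg_step:.1f} MiB")
```

Each extra layer costs roughly 300 MiB here, so the 251 MiB left at -ngl 57 cannot cover another layer, let alone the 522.50 MiB compute buffer that failed to allocate at -ngl 58.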

Comparison:

Setting   Stable cold   Stable reuse   Middle cold   Middle reuse   Decode
-ngl 48   43.9s         1.85s          45.3s         45.0s          7.47 tok/s
-ngl 54   37.0s         1.35s          38.5s         38.2s          8.81 tok/s

This is the one change I promoted to the launcher default.

Current safe default:

-c 16384
-ngl 54
-np 1
--threads 12
--threads-batch 16
--cache-type-k q8_0
--cache-type-v q8_0
--cache-ram 32768
--fit off
--reasoning off
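For a quick sense of scale, the -ngl 48 versus -ngl 54 numbers can be turned into relative changes. This is only arithmetic on the measurements reported above; the dictionary keys are made up for the sketch.

```python
# Measurements from the comparison above (seconds, or tok/s for decode).
baseline = {"stable_cold_s": 43.9, "middle_cold_s": 45.3, "decode_tps": 7.47}
tuned = {"stable_cold_s": 37.0, "middle_cold_s": 38.5, "decode_tps": 8.81}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

changes = {key: pct_change(baseline[key], tuned[key]) for key in baseline}

for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```

Cold prefill time drops by roughly 15 to 16 percent and decode throughput rises by about 18 percent, from six additional offloaded layers.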


SWA / cache-reuse experiment
============================

I also tested:

--swa-full --cache-reuse 256

This was not a default-worthy result.

At the normal -ngl 48 shape, adding full SWA failed before health:

llama_kv_cache_iswa: using full-size SWA cache
ggml_backend_cuda_buffer_type_alloc_buffer:
allocating 5304.00 MiB on device 0: cudaMalloc failed: out of memory

To make it fit, I had to reduce offload to -ngl 40.

At -ngl 40:

Setting                     Stable cold   Stable reuse   Middle cold   Middle reuse
baseline -ngl 40            53.8s         2.47s          55.0s         54.5s
-ngl 40 + SWA/cache-reuse   63.3s         2.36s          66.8s         37.4s

So yes, it helped the middle-mutation retry case. But it slowed cold passes and required giving up GPU offload. I would not use it as the general default on this hardware.
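One way to frame that tradeoff is a break-even count: how many middle-mutation retries it takes before the slower cold pass pays for itself. The sketch below assumes a simplified workload of one cold pass followed by n retries and uses only the timings measured in this post.

```python
import math

def break_even_retries(cold_a, retry_a, cold_b, retry_b):
    """Smallest retry count n where profile B's total time
    (cold_b + n * retry_b) drops below profile A's, or None if it never does."""
    extra_cold = cold_b - cold_a      # what B costs up front
    saved_per_retry = retry_a - retry_b
    if saved_per_retry <= 0:
        return None
    if extra_cold < 0:
        return 0
    return math.floor(extra_cold / saved_per_retry) + 1

# Middle-mutation timings (cold, retry) in seconds.
ngl40_base = (55.0, 54.5)
ngl40_swa = (66.8, 37.4)
ngl54_default = (38.5, 38.2)

vs_same_offload = break_even_retries(*ngl40_base, *ngl40_swa)
vs_safe_default = break_even_retries(*ngl54_default, *ngl40_swa)

print(f"SWA profile beats -ngl 40 baseline after {vs_same_offload} retry")
print(f"SWA profile beats -ngl 54 default after {vs_safe_default} retries")
```

Against the same -ngl 40 baseline the SWA profile wins after a single retry, but against the -ngl 54 default it needs dozens of retries on the same prompt to come out ahead, which supports keeping it out of the general default.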


Batch and microbatch sweep
==========================

With -ngl 54 fixed, I tested larger batch settings.

Baseline upstream behavior is effectively:

--batch-size 2048
--ubatch-size 512

Results:

Batch/ubatch   Stable cold   Stable reuse   Middle cold   Middle reuse   Default-safe
2048/512       37.0s         1.35s          38.5s         38.2s          yes
4096/512       36.9s         1.35s          38.5s         38.1s          yes, no real gain
4096/1024      32.1s         1.40s          33.6s         18.1s          no, too tight
8192/1024      31.9s         1.41s          33.5s         17.9s          no, too tight

The ubatch 1024 cases are interesting. They accelerate prefill and make the middle-mutation retry much faster. But the memory margin is poor.

For 8192/1024, the stopped-server memory breakdown showed CUDA free memory down to about 247 MiB, with compute buffers around 1045 MiB.

That is too tight for an everyday default. I kept it as a manual high-throughput profile:

--batch-size 8192 --ubatch-size 1024
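The batch results can be restated as approximate prefill throughput. This sketch assumes the prompt is exactly 14,500 tokens and treats the whole stable-cold time as prefill, ignoring the very short capped response, so the figures are rough estimates rather than measured server statistics.

```python
PROMPT_TOKENS = 14500  # "about 14.5k" in the post; assumed exact here

# batch/ubatch -> (stable cold seconds, middle-mutation reuse seconds)
results = {
    "2048/512": (37.0, 38.2),
    "4096/512": (36.9, 38.1),
    "4096/1024": (32.1, 18.1),
    "8192/1024": (31.9, 17.9),
}

_, base_retry = results["2048/512"]

summary = {}
for name, (cold_s, retry_s) in results.items():
    summary[name] = {
        "prefill_tps": PROMPT_TOKENS / cold_s,     # approximate tokens per second
        "retry_speedup": base_retry / retry_s,     # relative to the 2048/512 retry
    }

for name, row in summary.items():
    print(f"{name}: ~{row['prefill_tps']:.0f} tok/s prefill, "
          f"{row['retry_speedup']:.2f}x middle-mutation retry")
```

On these numbers the ubatch 1024 profiles move prefill from roughly 390 to roughly 450 tok/s and about halve the middle-mutation retry time.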


Context checkpoint spacing
==========================

The ubatch 1024 result suggested that context checkpoint behavior might be part of the middle-mutation improvement, so I tested checkpoint spacing with default batch settings.

Default upstream checkpoint interval is 8192 tokens.

Results:

Checkpoint interval   Stable cold   Stable reuse   Middle cold   Middle reuse
default 8192          37.0s         1.35s          38.5s         38.2s
4096                  37.1s         1.34s          38.8s         28.6s
2048                  37.9s         1.34s          39.7s         23.8s
1024                  38.1s         1.35s          40.0s         24.0s

The 2048 interval was best for middle-mutation retry. It did not increase resident VRAM the way ubatch 1024 did, and the stopped-server memory breakdown still showed about 953 MiB CUDA free memory.

The cost is slower cold prompts and larger prompt-cache state. At 2048, the server reported 8 checkpoints and about 6329 MiB cache state for one 14.5k-token prompt.
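A rough sizing sketch for that cache state follows. It assumes the reported 6329 MiB is split evenly across the 8 checkpoints and that the 32768 MiB cache budget is the only limit on how many such prompts stay cached; neither assumption is confirmed by the server output quoted here.

```python
import math

PROMPT_TOKENS = 14500        # approximate prompt length from the benchmark
INTERVAL_TOKENS = 2048       # checkpoint spacing under test
REPORTED_CHECKPOINTS = 8     # as reported by the server
REPORTED_CACHE_MIB = 6329    # cache state for one prompt
CACHE_RAM_MIB = 32768        # configured cache budget

# A simple ceiling division happens to match the reported checkpoint count.
expected_checkpoints = math.ceil(PROMPT_TOKENS / INTERVAL_TOKENS)

# Average size per checkpoint under the even-split assumption.
mib_per_checkpoint = REPORTED_CACHE_MIB / REPORTED_CHECKPOINTS

# How many prompts of this size would fit in the cache budget.
prompts_that_fit = CACHE_RAM_MIB // REPORTED_CACHE_MIB

print(f"expected checkpoints: {expected_checkpoints}")
print(f"~{mib_per_checkpoint:.0f} MiB per checkpoint")
print(f"about {prompts_that_fit} such prompts fit in {CACHE_RAM_MIB} MiB")
```

Under those assumptions a 2048-token interval costs on the order of 800 MiB of system RAM per checkpoint, so only a handful of long prompts can be held at once.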

I kept the default checkpoint interval unchanged, but added this as a manual profile:

--checkpoint-every-n-tokens 2048

For a balanced manual option:

--checkpoint-every-n-tokens 4096


Cache RAM
=========

I also tested increasing cache RAM from 32768 MiB to 65536 MiB before this final tuning pass. It did not produce a meaningful speed gain on the repeated long-prefix test.

Measured earlier:

cache-ram 32768: about 58.8s cold, 12.0s cached
cache-ram 65536: about 59.8s cold, 12.1s cached

The conclusion was simple: 32768 MiB was already enough for the tested single cached prefix. More cache RAM used system RAM but did not make that workload faster.


What I would keep
=================

Safe default:

-c 16384
-ngl 54
-np 1
--threads 12
--threads-batch 16
--cache-type-k q8_0
--cache-type-v q8_0
--cache-ram 32768
--fit off
--reasoning off

Manual high-throughput profile, if tight VRAM margin is acceptable:

--batch-size 8192 --ubatch-size 1024

Manual middle-mutation profile:

--checkpoint-every-n-tokens 2048

Experimental only:

-ngl 55 through -ngl 57
--swa-full --cache-reuse 256

Do not bother without another memory-saving change:

-ngl 58


Operational notes
=================

A few practical lessons from this run:

1. 20 GB VRAM is the ceiling.

The system has 96 GB RAM and 20 CPU threads. That helps with loading, caching, and running hybrid configurations, but it does not turn a dense 31B model into a fast local coding assistant. GPU memory bandwidth and VRAM capacity dominate.

2. Moving display off NVIDIA can matter.

Moving Xorg/SDDM to the Intel iGPU gave the RTX 4000 SFF Ada a clean compute role and made -ngl 54 viable. But this had user-experience cost. In my case, it affected dual-monitor behavior. Whether that is worth it depends on whether the box is a workstation or an inference appliance.

3. Prompt shape matters more than some exotic flags.

Stable-prefix caching was already excellent. If the large stable material is before the changing query, cache reuse is strong. Middle edits are much harder unless you use checkpoint or batch profiles tuned for that workload.

4. The "loads successfully" line is not enough.

-ngl 57 loaded, but left only about 251 MiB free. That is not a healthy default. You need enough margin for compute buffers, fragmentation, restarts, and slightly different prompt shapes.

5. This is usable, not magical.

The final safe profile gives about 8.8 tok/s decode on this setup. That is useful for local/private/offline work. It is not competitive with strong hosted models for serious coding throughput.


Final take
==========

The good news: this MS-01 plus RTX 4000 SFF Ada can run a useful local 31B GGUF model, and there are real tuning gains available. The best general gain was not an exotic trick; it was reclaiming VRAM and increasing layer offload from -ngl 48 to -ngl 54.

The bad news: the machine still does not become a frontier-class local coding box. The hardware is good enough for a local lab model, classifier sidecars, privacy-sensitive work, and experiments. It is not enough to make a 31B dense model feel like a modern hosted coding model.

If I were building around this class of hardware again, I would treat 20 GB Ada as a very capable edge/homelab inference card, not as a substitute for a larger VRAM GPU when the goal is comfortable local LLM coding.

Corrections and suggestions welcome. I would be especially interested in comparable llama.cpp numbers from other 20 GB cards, RTX 3090/4090 class cards, and anyone running similar long-prompt cache tests with Gemma-family GGUFs.
 

Wasmachineman_NL

Wittgenstein the Supercomputer FTW!
Aug 7, 2019
2,361
877
113
>Corrections and suggestions welcome

Here's one: stop spamming STH and the greater internet with AI generated content like this.
 

bayleyw

Active Member
Jan 8, 2014
347
125
43
(1) I really don't see why people need to have AI generate them a huge story out of something that could be summarized as "Tried to run Gemma-4-31B on a 20GB GPU, got 8 tokens/second with -ngl 54"

(2) If you want to run a dense model, Qwen3.6-27B is a bit smaller and a bit better for code

(3) You should probably run an MoE. Qwen3.6-35B will basically run on anything - I think I had it running on quad-channel DDR4 and a 3080 at something like 50 tokens per second.
 

LDighera

Thanks, that was useful feedback.

Fair point on the length. I should have made the original post more of a short field note:

MS-01, RTX 4000 SFF Ada, Gemma 4 31B Q4_K_M, stable at -ngl 54, about 8.8 tok/s.

The reason I included the extra detail is that this box has a fairly specific constraint: low-profile GPU space and a practical 70W-class GPU envelope. I am not trying to compare it with a full desktop GPU setup. I am trying to find what works well inside the MS-01 limits. For that use case, the RTX 4000 SFF Ada still looks like a good fit.

Your MoE point was the useful part. I had not spent enough time on Gemma 4 26B A4B, so I tested it after your reply.

On this machine, 26B A4B fully offloads and is much faster than the 31B dense baseline. Text generation was around 63 tok/s in my quick test. In my local image-classifier path it was also faster than the smaller Gemma 4 E4B classifier: about 1.3x better wall-clock on a small fixed test set, with decode around 60-62 tok/s versus about 38 tok/s.

The catch is consistency. For my image-classification use case, 26B A4B handled every test image and gave plausible answers, but its category labels drifted compared with my current classifier. So I need to retune the prompt/schema and run a larger reviewed test before switching defaults.
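Label drift of this kind can be quantified with a small agreement check between the current classifier and a candidate model. The sketch below is illustrative only: the file names and category labels are hypothetical, and the real wrapper's JSON schema is not shown in this thread.

```python
from collections import Counter

def compare_labels(reference, candidate):
    """Return (agreement fraction, Counter of (reference, candidate) label pairs
    that differ) for two {image_name: label} mappings."""
    if reference.keys() != candidate.keys():
        raise ValueError("both runs must cover the same images")
    drift = Counter(
        (reference[name], candidate[name])
        for name in reference
        if reference[name] != candidate[name]
    )
    matched = len(reference) - sum(drift.values())
    return matched / len(reference), drift

# Hypothetical labels from two models on the same five images.
current = {"a.jpg": "receipt", "b.jpg": "photo", "c.jpg": "screenshot",
           "d.jpg": "document", "e.jpg": "photo"}
candidate = {"a.jpg": "receipt", "b.jpg": "photo", "c.jpg": "document",
             "d.jpg": "document", "e.jpg": "portrait"}

agreement, drift = compare_labels(current, candidate)
print(f"agreement: {agreement:.0%}")
for (old, new), count in drift.most_common():
    print(f"{old} -> {new}: {count}")
```

Listing the drifting label pairs, rather than only the agreement rate, shows whether the candidate is wrong or simply using a different but defensible category, which is the question a reviewed test needs to answer.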

On Qwen: I understand why you brought it up, and I may test it separately. For this particular deployment, though, I also care about model provenance and trust, not just speed. That is why I am keeping Gemma as my baseline for now. That is not a claim that Qwen is unusable; it is just part of my local decision criteria.

So, useful correction: MoE probably deserves more attention on this box than I gave it in the original post. The next thing I need is a larger reviewed accuracy run, not another speed-only number.

If anyone has comparable numbers from similar low-profile/70W GPUs, I would still be interested.
 

bayleyw

The MoE will certainly be a worse model, and Gemma-26B and Qwen-35B are comparable, so if your idea of trust is "made by an American tech company" then whatever floats your boat (though I would argue that Google is hardly a beacon of trust, and DeepMind's copyright violations during training are probably numerous).

I think the RTX 4000 Ada is in a class of its own. I have a B50 but that is a 16GB $349 GPU, not a 20GB $1400 GPU. The laptop I'm writing this on has a 5090M in it, which is a 150W GB203, but it has 900GB/sec of memory bandwidth, not 400GB/sec like its SFF sibling.
 

LDighera

That is a fair way to put it.

I would expect the MoE to be a worse general-purpose model than the 31B dense model too. My interest is more practical than theoretical: what is the best fit inside this particular box, with this particular low-profile/70W GPU constraint?

After your earlier reply I tested Gemma 4 26B A4B, and the result was useful but not a clean win.

For text, it is much faster than the 31B dense baseline on this MS-01. I saw around 63 tok/s in a quick local test, versus about 8.8 tok/s for Gemma 4 31B Q4_K_M at my current -ngl 54 setting.

For my image-classification use case, 26B A4B was also faster than the smaller Gemma 4 E4B classifier. It handled every image in a small fixed test set and gave plausible answers. The catch was consistency: its category labels drifted compared with my current classifier, so I would need to retune the prompt/schema and run a larger reviewed test before switching defaults.

So I agree with your basic point: MoE probably belongs in the test matrix for this box. I just do not yet know that it should replace the current default for my actual workflow.

On Qwen/Gemma trust: I did not mean that Google is automatically trustworthy. I agree that is too simple. My concern is more about provenance, operational risk, and what I am comfortable putting into a repeatable local pipeline. For now Gemma is the lower-friction baseline for me. That is a deployment preference, not proof that Qwen is bad.

On the GPU side, I also agree that the RTX 4000 SFF Ada is in a class of its own. That was basically the reason for the original post. It is expensive for the raw performance, but the combination of 20GB VRAM, low-profile size, no auxiliary power connector, mature CUDA support, and a 70W envelope is unusual.

One small spec note: if we are talking about my current RTX 4000 SFF Ada, NVIDIA lists it as 20GB GDDR6 at 70W; third-party spec tables put memory bandwidth at about 280 GB/s. The newer RTX PRO 4000 Blackwell SFF is the one NVIDIA lists at 24GB GDDR7, 70W, and 432 GB/s. Either way, a 5090M-class laptop GPU is a different power and bandwidth class, so I would not expect the MS-01 results to track that closely.

The useful next test on my side is not another speed run. It is a larger reviewed accuracy run comparing E4B, 26B A4B, and possibly Qwen if I decide to add it to the local test matrix.
 

LDighera

Small update, since this turned into a useful hardware lesson.

The 31B text tests are not the only thing I have been doing with this box. I also have a local image-classifier workflow running on it. The current production path is llama.cpp with Gemma 4 E4B multimodal, behind a small local wrapper, reading image files from a mounted NAS share and returning structured JSON labels. The latest test was a corpus-scale image classification run, not just a short prompt benchmark.

I also tested the official Gemma 4 E4B MTP path through a small local OpenAI-compatible Transformers sidecar. It worked technically, and the existing wrapper could talk to it, but it was not worth switching to for the classifier. In a production-style 13-image A/B, primary labels matched 11/13, watchlist behavior matched 12/13, and the best tuned run only improved end-to-end wrapper time by about 12%. So MTP remains interesting, but it is not my default classifier path.

The bigger discovery was thermal. Once I started a real corpus-scale image classification run, the RTX 4000 SFF Ada hit about 88C at the full 70W limit with the fan already at 100%. I paused the run, stopped the GPU services, and dropped the power cap.

Important caveat: this is not the stock NVIDIA cooler. My card is using the n3rdware aftermarket single-slot cooler, with liquid-metal thermal compound as part of that conversion. The stock NVIDIA cooler may not show the same behavior. In my setup, though, the reduced-size cooler is clearly the limiting factor during sustained CUDA work.

So the next step is not a different model yet. It is a proper low-power sustained-work profile: probably starting around 50W, then measuring throughput and temperature before scaling the corpus run again. Undervolting may be worth investigating, but I am going to treat that as a planned driver/tooling experiment rather than changing it in the middle of a live batch.
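A sustained-work profile needs temperature and power readings taken during the run, not just before and after. The following is a minimal sketch that polls `nvidia-smi` through its CSV query interface; the 85 C pause threshold is an arbitrary assumption for illustration, not a vendor limit, and the helper names are hypothetical.

```python
import subprocess

# Query fields supported by nvidia-smi's --query-gpu option.
QUERY = "temperature.gpu,power.draw,power.limit,fan.speed"

def parse_gpu_line(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    temp, draw, limit, fan = [field.strip() for field in line.split(",")]
    return {
        "temp_c": int(temp),
        "power_w": float(draw),
        "limit_w": float(limit),
        "fan_pct": int(fan),
    }

def read_gpu(index=0):
    """Read the current sample for one GPU (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits", "-i", str(index)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_line(out.strip().splitlines()[0])

def should_pause(sample, max_temp_c=85):
    """Assumed policy: pause the batch once the GPU reaches max_temp_c."""
    return sample["temp_c"] >= max_temp_c

# Example using a reading like the one described above (88 C at the 70 W cap).
example = parse_gpu_line("88, 69.50, 70.00, 100")
print(example, "pause:", should_pause(example))
```

A batch runner could call `read_gpu()` between images and stop submitting work while `should_pause()` is true. The power cap itself can be lowered with `nvidia-smi -pl 50`, which normally requires root and must stay within the range the card allows.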

The MS-01/RTX 4000 SFF combo is still useful. The lesson is that short benchmark runs and sustained corpus work are not the same thermal workload, especially after converting the GPU to a single-slot cooler.