-- update: see below for improved benchmark results --
I’ve got a 4-node cluster (3x ASUS GX10 + 1x Dell GB10, all Grace Blackwell with 128GB unified memory each) running Qwen3.5-397B-A17B-NVFP4 — a 397 billion parameter mixture-of-experts model with 17B active parameters per token, quantized to NVFP4.
Nodes are connected via 400GbE RDMA using a MikroTik CRS804-4DDQ-hRM switch. Serving with vLLM using tensor parallelism across all 4 nodes. Here are the llama-benchy results:
Results
Notes
I not entirely sure this is useful, but I thought I’d go large before I went sensible
J
I’ve got a 4-node cluster (3x ASUS GX10 + 1x Dell GB10, all Grace Blackwell with 128GB unified memory each) running Qwen3.5-397B-A17B-NVFP4 — a 397 billion parameter mixture-of-experts model with 17B active parameters per token, quantized to NVFP4.
Nodes are connected via 400GbE RDMA using a MikroTik CRS804-4DDQ-hRM switch. Serving with vLLM using tensor parallelism across all 4 nodes. Here are the llama-benchy results:
Results
| Prompt Size | Output Size | Prompt Speed | Generation Speed | Time to First Token |
|---|---|---|---|---|
| 128 tokens | 128 tokens | 541 t/s | 16.7 t/s | 240 ms |
| 128 tokens | 512 tokens | 579 t/s | 16.8 t/s | 224 ms |
| 128 tokens | 1024 tokens | 598 t/s | 16.8 t/s | 217 ms |
| 512 tokens | 128 tokens | 1,214 t/s | 16.9 t/s | 424 ms |
| 512 tokens | 512 tokens | 887 t/s | 17.2 t/s | 578 ms |
| 512 tokens | 1024 tokens | 1,273 t/s | 16.8 t/s | 404 ms |
| 2048 tokens | 128 tokens | 2,194 t/s | 16.8 t/s | 935 ms |
| 2048 tokens | 1024 tokens | 2,281 t/s | 16.6 t/s | 899 ms |
| 8192 tokens | 128 tokens | 2,052 t/s | 16.7 t/s | 3,994 ms |
- Generation speed is consistent ~17 tokens/sec.
- Prompt processing seems to scale ok (541 t/s at 128 tokens, 2,200+ t/s at 8K tokens)
- Time to first token is sub-second up to 2K token prompts, ~4 seconds at 8K tokens (I need to look at what was happening here more closely, I believe there were instabilities)
- 128K context window, `–enforce-eager`, FP8 KV cache
- CUDA graphs are disabled (`–enforce-eager`) because capturing them requires pre-allocating memory buffers that don’t fit alongside the model in 128GB unified memory — Claude tells me to expect ~20-30% throughput improvement once this is resolved
- Using `VLLM_TEST_FORCE_FP8_MARLIN=1` to force Marlin kernels for NVFP4 quantized weights as the default kernels don’t support SM120 (Blackwell) yet
- Engine crashes at larger prompt + output combinations (8K prompt with 512+ output tokens)
I not entirely sure this is useful, but I thought I’d go large before I went sensible
J
Last edited:


