Running Qwen3.5-397B on 4x DGX Spark


datagaucho

New Member
Mar 16, 2026
-- update: see below for improved benchmark results --

I’ve got a 4-node cluster (3x ASUS GX10 + 1x Dell GB10, all Grace Blackwell with 128GB unified memory each) running Qwen3.5-397B-A17B-NVFP4 — a 397 billion parameter mixture-of-experts model with 17B active parameters per token, quantized to NVFP4.

Nodes are connected via 400GbE RDMA using a MikroTik CRS804-4DDQ-hRM switch. Serving with vLLM using tensor parallelism across all 4 nodes. Here are the llama-benchy results:

Results

| Prompt Size | Output Size | Prompt Speed | Generation Speed | Time to First Token |
|---|---|---|---|---|
| 128 tokens | 128 tokens | 541 t/s | 16.7 t/s | 240 ms |
| 128 tokens | 512 tokens | 579 t/s | 16.8 t/s | 224 ms |
| 128 tokens | 1024 tokens | 598 t/s | 16.8 t/s | 217 ms |
| 512 tokens | 128 tokens | 1,214 t/s | 16.9 t/s | 424 ms |
| 512 tokens | 512 tokens | 887 t/s | 17.2 t/s | 578 ms |
| 512 tokens | 1024 tokens | 1,273 t/s | 16.8 t/s | 404 ms |
| 2048 tokens | 128 tokens | 2,194 t/s | 16.8 t/s | 935 ms |
| 2048 tokens | 1024 tokens | 2,281 t/s | 16.6 t/s | 899 ms |
| 8192 tokens | 128 tokens | 2,052 t/s | 16.7 t/s | 3,994 ms |

Notes
  • Generation speed is consistent at ~17 tokens/sec.
  • Prompt processing seems to scale OK (541 t/s at 128 tokens, about 2,200 t/s at 2K tokens, 2,052 t/s at 8K tokens)
  • Time to first token is sub-second up to 2K-token prompts and ~4 seconds at 8K tokens (I need to look at what was happening here more closely; I believe there were instabilities)
  • 128K context window, `--enforce-eager`, FP8 KV cache
  • CUDA graphs are disabled (`--enforce-eager`) because capturing them requires pre-allocating memory buffers that don’t fit alongside the model in 128GB unified memory. Claude tells me to expect ~20-30% throughput improvement once this is resolved
  • Using `VLLM_TEST_FORCE_FP8_MARLIN=1` to force Marlin kernels for NVFP4 quantized weights as the default kernels don’t support SM120 (Blackwell) yet
  • Engine crashes at larger prompt + output combinations (8K prompt with 512+ output tokens)
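
As a rough cross-check of the table: if prefill dominates, time to first token should be close to prompt size divided by prompt speed. A minimal Python sketch, assuming that simple model (the row values are copied from the results above; the function names are only illustrative):

```python
# Rows copied from the results table:
# (prompt tokens, prompt speed in t/s, reported TTFT in ms).
ROWS = [
    (128, 541, 240),
    (512, 1214, 424),
    (2048, 2194, 935),
    (8192, 2052, 3994),
]


def estimated_ttft_ms(prompt_tokens: int, prompt_tps: float) -> float:
    """Assumed model: TTFT is just the prefill time, prompt_tokens / prompt_tps."""
    return prompt_tokens / prompt_tps * 1000.0


def relative_error(estimate: float, reported: float) -> float:
    """Absolute difference as a fraction of the reported value."""
    return abs(estimate - reported) / reported


for prompt_tokens, prompt_tps, reported_ms in ROWS:
    est = estimated_ttft_ms(prompt_tokens, prompt_tps)
    err = relative_error(est, reported_ms)
    print(f"{prompt_tokens:>5} tokens: estimated {est:7.0f} ms, "
          f"reported {reported_ms} ms ({err:.1%} off)")
```

Under that assumption the ~4 second TTFT at 8K is simply what ~2,000 t/s prefill implies for 8,192 tokens. The benchmark tool may well derive prompt speed from TTFT in the first place, in which case this only confirms the table is internally consistent.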
Caveat: I have only run the benchmarks at this point; it was a bit of a journey to get it running. This is not going to compete with an 8xH100 setup on throughput, but the total hardware cost is roughly £15K, which seems like a fraction of what you’d normally need to run a 397B parameter model.

I’m not entirely sure this is useful, but I thought I’d go large before I went sensible :)


J
 

TrashMaster

Active Member
Sep 8, 2024
If you are going to be running an NVFP4 quant of a model, there is a huge ongoing discussion about size, performance, and quality on SM120 or SM121.

One big issue at play is floating-point non-determinism, which seems "worse" on Blackwell than on older-generation GPUs (Ada, H100-class) when measured with KLD, although the consensus is that this is more a software and library issue than the underlying hardware itself.
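
For anyone unfamiliar with the metric: KLD here means the KL divergence between the next-token distribution of a reference model and that of the quantized model, where lower means closer. A minimal pure-Python sketch of the formula, using made-up toy distributions rather than real model outputs:

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    p is the reference distribution, q the one being compared against it.
    eps guards against log(0); it is an illustrative choice, not a standard.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i == 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


# Toy next-token distributions over a 4-token vocabulary (hypothetical values).
reference = [0.70, 0.20, 0.05, 0.05]
quantized = [0.60, 0.25, 0.10, 0.05]

print(f"KL(reference || quantized) = {kl_divergence(reference, quantized):.4f} nats")
```

In practice this would be averaged over many token positions on a fixed evaluation text; the sketch only shows the per-position calculation.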

The TL;DR version is: the W4A16 and INT4 AWQ quants seem to be performing better, running faster, and coming out smaller than the NVFP4 ones (attached image: 1773676834758.png).

The best calibrated version tested so far is this guy: lukealonso/Qwen3.5-397B-A17B-NVFP4 (Hugging Face)

Some SM120 performance numbers with all the community tweaks are in my screenshot here: https://forums.servethehome.com/ind...ch-board-gpu-testing.52488/page-4#post-499926

The community testing tool we have been running for inference performance is voipmonitor/llm-inference-bench on GitHub: an LLM inference decode throughput benchmark with a Rich TUI dashboard. It measures token generation speed across concurrency levels and context lengths and supports the SGLang and vLLM engines, which is what most people use for inference.
 

datagaucho

New Member
Mar 16, 2026
Thanks for the quick response. I'll get the community testing tool running against my setup. I did read the other thread about AliExpress PCIe switches earlier ... but I guess I fell into the camp of little hot box collection rather than larger hot box collection :D

J
 

TrashMaster

Active Member
Sep 8, 2024
SM120 and SM121 users are all in the back of the bus together. All the libraries, enhancements, and focus are still on SM100 for the datacenter Blackwell cards.

That's why the /r/BlackwellPerformance discord is so valuable. There are special builds of vLLM, PyTorch, etc. that all zoom in on our architecture and getting the most out of it.

Also, I am curious about the multi-chassis vLLM NCCL config, so if you get a chance, dump all that somewhere.
 

datagaucho

New Member
Mar 16, 2026
[Attached image: 1773689037027.png]

Multi-Chassis vLLM + NCCL Config

Hardware: 4x GB10 nodes connected via 400GbE RDMA (MikroTik CRS804-4DDQ-hRM)

Network: Each node has a dedicated RDMA NIC (ConnectX-7) on a separate 10.200.0.0/24 subnet:
- Head: 10.200.0.1
- Workers: 10.200.0.2, 10.200.0.3, 10.200.0.4

Environment variables (all nodes):

VLLM_TEST_FORCE_FP8_MARLIN=1
RAY_memory_monitor_refresh_ms=0
GLOO_SOCKET_IFNAME=enp1s0f1np1
NCCL_IB_HCA=rocep1s0f1
MASTER_ADDR=10.200.0.1
RAY_ADDRESS=10.200.0.1:6379


vLLM serve command (head node only):

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 4 \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --kv-cache-dtype fp8_e4m3 \
  --skip-mm-profiling \
  --enforce-eager

Switch: MikroTik CRS804-4DDQ-hRM (~£1,000) — 4x QSFP56-DD 400G ports
Cables: 1m DAC/AOC cables, one per node
Link speed: 200Gb/s per node (the ConnectX-7 in the GB10 negotiates at 200G, not 400G). A single port is limited to PCIe Gen5 x4, so it doesn't actually reach 200Gb/s
MTU: 9000 (jumbo frames)
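
The PCIe ceiling mentioned above is easy to put a number on. A minimal Python sketch, assuming the standard PCIe Gen5 figures of 32 GT/s per lane with 128b/130b encoding and ignoring packet and protocol overhead (so real RDMA throughput will be lower still):

```python
# PCIe Gen5 signalling rate per lane and line-encoding efficiency (standard figures).
GEN5_GT_PER_S_PER_LANE = 32.0
ENCODING_EFFICIENCY = 128.0 / 130.0  # 128b/130b


def pcie_gen5_gbps(lanes: int) -> float:
    """Upper bound on usable bandwidth in Gb/s, before TLP/protocol overhead."""
    return GEN5_GT_PER_S_PER_LANE * lanes * ENCODING_EFFICIENCY


x4 = pcie_gen5_gbps(4)
print(f"PCIe Gen5 x4: ~{x4:.0f} Gb/s (~{x4 / 8:.1f} GB/s)")
print(f"Fraction of a 200 Gb/s link: {x4 / 200:.0%}")
```

So even with the link negotiated at 200G, a single port behind an x4 Gen5 connection tops out at roughly 126 Gb/s on paper.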

RDMA network: Dedicated 10.200.0.0/24 subnet, separate from management LAN
- 10.200.0.1 (head) through 10.200.0.4 (worker 3)
- Each node uses one port of a dual-port ConnectX-7

RoCE config:
- RoCE v1 (IB/RoCE v1 mode)
- PFC (Priority Flow Control): not enabled — works fine on a dedicated switch with no congestion
- No special MikroTik config needed beyond default switching — the CRS804 just bridges the QSFP-DD ports at L2

notes:
- The switch is basically a dumb L2 bridge for this use case — all RDMA traffic stays on the 10.200.0.0/24 subnet
- No VLAN config, no routing, no QoS
 

datagaucho

New Member
Mar 16, 2026
Single node, more modest model. Interesting that the single-request TG speed is not much higher than with the 397B, so perhaps I will play about with the 397B after all.

[Attached image: 1773694298344.png]
 

datagaucho

New Member
Mar 16, 2026
Software Stack

| Component | Version |
|---|---|
| vLLM | 0.16.0rc2.dev236+g3b30e6150.d20260218 (CUDA 13.0) |
| PyTorch | 2.10.0+cu130 |
| NCCL | 2.28.3-1 |
| Ray | (bundled with vLLM) |
| Docker image | avarok/dgx-vllm-nvfp4-kernel:v22 |

Model

  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4
  • Architecture: Mixture-of-Experts, 397B total parameters, 17B active per token
  • Quantization: NVFP4
  • Size on disk: 233GB (6 safetensor shards, ~47GB each)
  • Memory per node: ~55GB model weights after sharding
  • Storage: Local NVMe on each node at /home/xxxx/.cache/huggingface/
  • Note: Each node reads all 6 shards but only keeps 1/4 of the weights (TP=4)
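
A back-of-the-envelope memory budget per node, using only figures from this thread (233GB on disk, 6 shards, TP=4, 128GB per node, `--gpu-memory-utilization 0.85`). This is a sketch: it assumes weights split evenly across ranks and treats unified memory as if vLLM could claim the configured fraction of all of it, which may not hold exactly on GB10.

```python
MODEL_SIZE_GB = 233.0        # size on disk, from the post
NUM_SHARDS = 6               # safetensor shards, from the post
TP_SIZE = 4                  # --tensor-parallel-size
NODE_MEMORY_GB = 128.0       # unified memory per node
GPU_MEM_UTILIZATION = 0.85   # --gpu-memory-utilization


def per_node_weights_gb(model_gb: float, tp: int) -> float:
    """Assumes an even split of weights across tensor-parallel ranks."""
    return model_gb / tp


def vllm_budget_gb(node_gb: float, utilization: float) -> float:
    """Memory vLLM is allowed to use under the assumed utilization fraction."""
    return node_gb * utilization


weights = per_node_weights_gb(MODEL_SIZE_GB, TP_SIZE)
budget = vllm_budget_gb(NODE_MEMORY_GB, GPU_MEM_UTILIZATION)
print(f"Average shard size: {MODEL_SIZE_GB / NUM_SHARDS:.1f} GB")
print(f"Weights per node:   {weights:.1f} GB")
print(f"vLLM budget:        {budget:.1f} GB")
print(f"Left for KV cache and activations: {budget - weights:.1f} GB")
```

An even split comes out at about 58GB per node, a little above the ~55GB reported, and the average shard works out nearer 39GB than 47GB, so the quoted shard size is best read as approximate. Either way, roughly half the budget is left over, which fits with CUDA graph buffers being a tight squeeze.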

Container Setup

Head Node (gx10-306e)

Code:
docker run -d --name avarok-head \
  --network host \
  --gpus all \
  -v /dev/infiniband:/dev/infiniband \
  -v /home/xxxx/.cache/huggingface:/root/.cache/huggingface \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_DEBUG=WARN \
  -e NCCL_SOCKET_IFNAME=enp1s0f1np1 \
  -e NCCL_IB_HCA=rocep1s0f1 \
  -e NCCL_NET_GDR_LEVEL=2 \
  -e NCCL_IB_TIMEOUT=23 \
  -e NCCL_IB_GID_INDEX=0 \
  -e NCCL_ASYNC_ERROR_HANDLING=1 \
  -e TORCH_NCCL_BLOCKING_WAIT=1 \
  -e GLOO_SOCKET_IFNAME=enp1s0f1np1 \
  -e VLLM_HOST_IP=10.200.0.1 \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e MASTER_ADDR=10.200.0.1 \
  -e MASTER_PORT=29500 \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e RAY_memory_usage_threshold=1.0 \
  avarok/dgx-vllm-nvfp4-kernel:v22 ray-head
Worker Nodes

Same as head but with ray-worker mode and node-specific IPs:

Code:
docker run -d --name avarok-worker \
  --network host \
  --gpus all \
  -v /dev/infiniband:/dev/infiniband \
  -v /home/xxxx/.cache/huggingface:/root/.cache/huggingface \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_DEBUG=WARN \
  -e NCCL_SOCKET_IFNAME=enp1s0f1np1 \
  -e NCCL_IB_HCA=rocep1s0f1 \
  -e NCCL_NET_GDR_LEVEL=2 \
  -e NCCL_IB_TIMEOUT=23 \
  -e NCCL_IB_GID_INDEX=0 \
  -e NCCL_ASYNC_ERROR_HANDLING=1 \
  -e TORCH_NCCL_BLOCKING_WAIT=1 \
  -e GLOO_SOCKET_IFNAME=enp1s0f1np1 \
  -e VLLM_HOST_IP=<WORKER_RDMA_IP> \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e WORKER_IP=<WORKER_RDMA_IP> \
  -e HEAD_IP=10.200.0.1 \
  -e MASTER_ADDR=10.200.0.1 \
  -e MASTER_PORT=29500 \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e RAY_memory_usage_threshold=1.0 \
  avarok/dgx-vllm-nvfp4-kernel:v22 ray-worker
Replace <WORKER_RDMA_IP> with 10.200.0.2, 10.200.0.3, or 10.200.0.4.
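
If you would rather not hand-edit the command three times, the substitution can be scripted. A minimal Python sketch that prints one `docker run` line per worker; the flags and values are copied from the command above, while the helper name `worker_command` and the idea of generating the commands this way are my own assumptions, not part of the original setup:

```python
import shlex

WORKER_IPS = ["10.200.0.2", "10.200.0.3", "10.200.0.4"]
HEAD_IP = "10.200.0.1"
IMAGE = "avarok/dgx-vllm-nvfp4-kernel:v22"

# Environment shared by every worker, copied from the docker run command above.
COMMON_ENV = {
    "NCCL_IB_DISABLE": "0",
    "NCCL_DEBUG": "WARN",
    "NCCL_SOCKET_IFNAME": "enp1s0f1np1",
    "NCCL_IB_HCA": "rocep1s0f1",
    "NCCL_NET_GDR_LEVEL": "2",
    "NCCL_IB_TIMEOUT": "23",
    "NCCL_IB_GID_INDEX": "0",
    "NCCL_ASYNC_ERROR_HANDLING": "1",
    "TORCH_NCCL_BLOCKING_WAIT": "1",
    "GLOO_SOCKET_IFNAME": "enp1s0f1np1",
    "VLLM_TEST_FORCE_FP8_MARLIN": "1",
    "HEAD_IP": HEAD_IP,
    "MASTER_ADDR": HEAD_IP,
    "MASTER_PORT": "29500",
    "RAY_memory_monitor_refresh_ms": "0",
    "RAY_memory_usage_threshold": "1.0",
}


def worker_command(worker_ip: str) -> str:
    """Build the docker run command for one worker (hypothetical helper)."""
    # The two per-node variables take the place of <WORKER_RDMA_IP>.
    env = dict(COMMON_ENV, VLLM_HOST_IP=worker_ip, WORKER_IP=worker_ip)
    args = [
        "docker", "run", "-d", "--name", "avarok-worker",
        "--network", "host", "--gpus", "all",
        "-v", "/dev/infiniband:/dev/infiniband",
        "-v", "/home/xxxx/.cache/huggingface:/root/.cache/huggingface",
    ]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args += [IMAGE, "ray-worker"]
    return shlex.join(args)


for ip in WORKER_IPS:
    print(f"# run on {ip}")
    print(worker_command(ip))
```

The script only prints the commands; each one still has to be run on the matching node. The order of the `-e` flags differs from the original, which docker does not care about.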

vLLM Serve Command

Run inside the head container after all 4 Ray nodes are connected:

Code:
export VLLM_TEST_FORCE_FP8_MARLIN=1
export RAY_memory_monitor_refresh_ms=0
export GLOO_SOCKET_IFNAME=enp1s0f1np1
export NCCL_IB_HCA=rocep1s0f1
export MASTER_ADDR=10.200.0.1
export RAY_ADDRESS=10.200.0.1:6379

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 4 \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --kv-cache-dtype fp8_e4m3 \
  --skip-mm-profiling \
  --enforce-eager
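
Once the server is up, a single request against vLLM's OpenAI-compatible `/v1/completions` endpoint makes a quick smoke test before running a full benchmark. A minimal Python sketch using only the standard library; the base URL assumes the API is reachable on the head node's RDMA address (it listens on 0.0.0.0:8000), and the function names are illustrative:

```python
import json
import time
import urllib.request

BASE_URL = "http://10.200.0.1:8000"  # assumption: head node reachable at this address
MODEL = "nvidia/Qwen3.5-397B-A17B-NVFP4"


def build_request(prompt: str, max_tokens: int) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """End-to-end rate; includes prefill time, so it understates decode speed."""
    return completion_tokens / elapsed_s


def smoke_test(prompt: str = "Hello", max_tokens: int = 64) -> float:
    """Send one request and return end-to-end tokens/sec."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt, max_tokens), timeout=300) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Calling `smoke_test()` from a machine that can reach the head node should return a figure somewhat below the ~17 t/s generation speed, since it includes time to first token. It is a sanity check, not a replacement for a proper benchmark tool.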