AI · May 23, 2026 · 8 min read

Qwen 3.6-35B-A3B on an RTX PRO 6000: 217 tok/s Unquantized — The 35B MoE That Fits in 96GB

Speed, latency, context scaling, and quality evaluation for Qwen 3.6-35B-A3B (MoE, 35B total/3B active) running at full bfloat16 via vLLM on a single RTX PRO 6000 Blackwell. Real numbers, raw data, and methodology.

Tags: benchmarks, qwen3.6, vllm, local-ai, rtx-pro-6000, moe, homelab, inference-benchmark

Qwen 3.6 Benchmarks

Note

TL;DR: Qwen 3.6-35B-A3B is a 35B MoE model (3B active per token) running on a single RTX PRO 6000 (96 GB) at full bfloat16, with no quantization and no offloading. It delivers 217 tok/s aggregate throughput with two requests in flight, 47 ms TTFT, and only an 18% slowdown at 8K context. Quality evaluation is limited because standard evaluation tooling doesn't handle reasoning-model output well; details below.

What happens when a 35-billion-parameter MoE model fits on a single workstation GPU at full precision? I ran Qwen 3.6-35B-A3B through vLLM benchmarks to find out. The headline numbers: 217 tok/s aggregate throughput, 47 ms to first token, and throughput degradation that stays under 20% even at 8K context.

Here's the full picture — what works, what doesn't, and what's still unresolved.


Quick Glossary

Term       Meaning
TTFT       Time to First Token: how long before the model starts outputting. Lower = feels faster.
TPOT       Time Per Output Token: generation speed once streaming begins.
MoE        Mixture of Experts: only a subset of parameters activate per token.
bf16       Brain float 16: memory-efficient precision that preserves model quality.
DeltaNet   Linear attention variant Qwen 3.6 uses for most layers; trades some accuracy for O(n) context scaling.

The Model

Model:          Qwen 3.6-35B-A3B (HuggingFace)
Architecture:   MoE, 256 experts, 8 active + 1 shared per token
Total params:   35B (3B active)
Precision:      bfloat16 (not quantized)
File size:      67 GB (26 safetensors)
VRAM:           ~83 GB
Context:        262,144 tokens native
Multimodal:     Text + Image + Video
Reasoning:      Yes, thinking mode enabled via --reasoning-parser qwen3

This is the first Qwen MoE at this size that fits on a single GPU at full precision. The RTX PRO 6000's 96 GB VRAM makes it work without GGUF quantization — a useful baseline if you're considering whether quantization is worth the quality trade-off.
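
As a rough check on why this fits, here is a back-of-the-envelope sketch of the memory budget. It assumes the nominal 35B parameter count from the model name, 2 bytes per parameter for bfloat16, and the 0.85 `--gpu-memory-utilization` value from the Setup section; the real footprint depends on the exact parameter count, activations, and KV cache, so treat the output as an estimate only.

```python
# Rough VRAM budget for the setup in this post. The parameter count is the
# nominal "35B" from the model name, so every result here is an estimate.

GB = 1e9  # decimal gigabytes


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; bfloat16 stores 2 bytes per parameter."""
    return n_params * bytes_per_param


def vllm_budget_bytes(total_vram_gb: float, gpu_memory_utilization: float) -> float:
    """Share of VRAM the server may claim for weights, activations, and KV cache."""
    return total_vram_gb * GB * gpu_memory_utilization


weights_gb = weight_bytes(35e9) / GB            # nominal 35B parameters
budget_gb = vllm_budget_bytes(96, 0.85) / GB    # 96 GB card at 0.85 utilization
headroom_gb = budget_gb - weights_gb            # what remains for KV cache etc.

print(f"weights ~{weights_gb:.0f} GB, budget ~{budget_gb:.1f} GB, "
      f"headroom ~{headroom_gb:.1f} GB")
```

The nominal estimate lands within a few GB of the 67 GB on disk, and the gap between weights and budget is what is left for the KV cache.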


Setup

Backend:       vLLM 0.20.1 (official release, no fork)
GPU:           NVIDIA RTX PRO 6000 Blackwell (96 GB)
Driver:        580.126.09
CUDA:          13.2
PyTorch:       2.x (bundled with vLLM 0.20.1)
Flash attn:    Enabled by default (vLLM auto-detects)
Batch size:    2 (concurrent requests)
Requests:      30 total (2 in flight at a time)
Dataset:       OpenQA (30 prompts, varied topics, ~29 avg input tokens)
Max output:    1024 tokens per response
Temperature:   0.0 (greedy decoding, for repeatable outputs)

vLLM launch command:

vllm serve /mnt/ai/models/llm/qwen3.6-full \
  --port 11439 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

I used a custom Python script hitting the OpenAI-compatible chat completions endpoint (/v1/chat/completions). The script, raw outputs, and full percentile data are linked at the bottom of this post.
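
For readers who want to reproduce the metrics, the sketch below shows one way TTFT, TPOT, and end-to-end latency can be derived from timestamps collected by a streaming client. It is not the script used for this post: `RequestTiming` and the helper names are hypothetical, and it assumes each streamed chunk carries one token, which may not hold for every server configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Timestamps in seconds recorded by a hypothetical streaming client."""
    sent_at: float             # when the request was sent
    token_times: List[float]   # arrival time of each streamed output token


def ttft_ms(t: RequestTiming) -> float:
    """Time to first token: first arrival minus send time."""
    return (t.token_times[0] - t.sent_at) * 1000.0


def tpot_ms(t: RequestTiming) -> float:
    """Mean gap between consecutive output tokens, excluding the first token."""
    n = len(t.token_times)
    if n < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    return (t.token_times[-1] - t.token_times[0]) / (n - 1) * 1000.0


def e2e_latency_s(t: RequestTiming) -> float:
    """Wall-clock time from sending the request to the last token."""
    return t.token_times[-1] - t.sent_at


# Made-up timestamps purely for illustration, not measured data.
example = RequestTiming(sent_at=0.0, token_times=[0.05, 0.06, 0.07, 0.08])
print(f"TTFT {ttft_ms(example):.1f} ms, TPOT {tpot_ms(example):.1f} ms, "
      f"latency {e2e_latency_s(example):.2f} s")
```

Keeping TTFT out of the TPOT average matters: prefill and decode are different phases, and mixing them would hide the flat decode speed reported below.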

Limitations to note before we dive in:

  • All results are from a single run. No warmup iterations were discarded, so any one-time warmup cost in the first batch is included in the numbers.
  • GPU temperatures stayed below 75°C throughout (ambient ~22°C, air-cooled workstation).
  • 30 prompts is a small sample — enough for throughput characterization, but not statistically significant for quality evaluation.

Speed Benchmark

Metric               Value           What It Means
Throughput           216.92 tok/s    Aggregate across 30 requests, 2 in flight at a time
TTFT (P50)           47 ms           Below the ~100 ms human perception threshold ✅
TTFT (P99)           57 ms           Low tail latency, only 10 ms above P50
TPOT                 9.1 ms          ~109 tok/s per-request generation speed
End-to-End Latency   8.8 s avg       Time for a full response (up to 1024 tokens)
Success Rate         100% (30/30)    Zero failures

Percentile Breakdown

Percentile    TTFT (ms)    TPOT (ms)    Latency (s)
    10%         40           9.1          7.69
    50%         44.5         9.1          9.34
    90%         56.6         9.1          9.37
    99%         56.8         9.1          9.38

The standout: TPOT is essentially flat across all percentiles — 9.1 ms at P10, P50, and P99 alike. Once the model enters generation phase, it's a metronome. This matters for production apps where latency jitter kills user experience.
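
For reference, a percentile over a small sample can be computed with the nearest-rank method sketched below. This is an illustrative assumption about method, not necessarily what the benchmark script does (it may interpolate between samples instead), and the sample values are made up.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds, for illustration only.
samples_ms = [45.0, 40.0, 56.0, 42.0, 44.0]
for p in (10, 50, 99):
    print(f"P{p}: {percentile(samples_ms, p)} ms")
```

One consequence worth keeping in mind: with only 30 requests, a nearest-rank P99 is simply the slowest request, so tail figures from a run this size are indicative rather than robust.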

Note on TTFT nuance: 47 ms is the time to start generating, not the time to complete. A 500-token response at 9.1 ms/token takes ~4.6 seconds total. The model feels fast to start, but full responses take time — same as any LLM.
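
The arithmetic above can be written as a small estimator. This is a sketch using the TTFT and TPOT figures reported in this post as defaults; it assumes the first token arrives at TTFT and every later token takes one TPOT, and the aggregate estimate assumes two requests in flight (the batch size listed under Setup) with no scheduling overhead.

```python
def response_time_s(n_tokens: int, ttft_ms: float = 47.0, tpot_ms: float = 9.1) -> float:
    """Estimated wall-clock time for one streamed response of n_tokens tokens."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    # First token lands at TTFT; each of the remaining tokens costs one TPOT.
    return (ttft_ms + (n_tokens - 1) * tpot_ms) / 1000.0


def aggregate_tok_per_s(concurrency: int, tpot_ms: float = 9.1) -> float:
    """Idealized total throughput if every in-flight request decodes at the same TPOT."""
    return concurrency * 1000.0 / tpot_ms


print(f"500 tokens:  {response_time_s(500):.2f} s")
print(f"1024 tokens: {response_time_s(1024):.2f} s")
print(f"2 concurrent requests: {aggregate_tok_per_s(2):.1f} tok/s")
```

The idealized two-request figure comes out slightly above the measured 216.92 tok/s, which is what you would expect once prefill time and request turnover are counted.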


Context Scaling

How the model's throughput holds up as you fill the KV cache with longer input contexts:

Context Length    Throughput      Degradation
500 tokens        151.07 tok/s    Baseline
2,000 tokens      147.68 tok/s    -2.2%
8,000 tokens      124.44 tok/s    -17.6%

Qwen 3.6 uses gated DeltaNet (linear attention) for 30 of 40 layers, with full attention only every 4th layer. That hybrid design keeps context scaling efficient — 18% loss for 16× context growth. By comparison, pure dense attention models I've tested on the same hardware typically lose 30-50% at 8K.
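
The degradation figures are plain percentage changes against the 500-token baseline. A minimal sketch of that calculation, using the throughput values measured above (the function and variable names are illustrative):

```python
# Throughput measured at each input context length (tokens -> tok/s).
measurements = {500: 151.07, 2000: 147.68, 8000: 124.44}


def pct_change(baseline: float, value: float) -> float:
    """Relative change versus the baseline, in percent (negative = slower)."""
    return (value - baseline) / baseline * 100.0


baseline_ctx = min(measurements)                 # shortest context is the baseline
baseline_tps = measurements[baseline_ctx]

for ctx, tps in sorted(measurements.items()):
    change = pct_change(baseline_tps, tps)
    growth = ctx / baseline_ctx                  # how much longer than the baseline
    print(f"{ctx:>6} tokens  {tps:7.2f} tok/s  {change:+6.1f}%  ({growth:.0f}x context)")
```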

At 8K tokens (~6,000 words of input), this covers most RAG doc chunks, multi-turn conversation histories, and medium-length document analysis.


Quality Evaluation

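Standard lm-eval loglikelihood benchmarks (MMLU and similar) don't work through the chat API: they score per-token log-probabilities, which the chat completions endpoint doesn't return. Running them means loading the model natively in a second process, which demands another ~67 GB of VRAM I don't have. So I ran a custom evaluation instead.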

This is not a comprehensive quality assessment. These are quick checks to gauge whether the speed numbers are actually useful — not rigorous academic benchmarks.

Knowledge QA (5 questions)

Q: What is the capital of France?          ✅ Paris
Q: What is the chemical symbol for water?   ❌ (empty response)
Q: Who wrote the novel '1984'?             ✅ George Orwell
Q: Speed of light in vacuum?               ✅ 299,792 km/s
Q: What year did WWII end?                 ✅ 1945

4/5 (80%). The one miss (H₂O) returned empty — likely a prompt formatting issue, not a knowledge gap.

Math Reasoning (GSM8K, 10 samples)

3/10 (30%) — but this number is misleading and I'm including it to be transparent about the methodology problem.

Qwen 3.6 is a reasoning model: it outputs its thinking process in a reasoning field first, then the final answer in content. My extraction pipeline searches for "the answer is X" or "final answer: X" patterns across both fields. The problem: the reasoning trace itself contains intermediate numbers that the parser picks up instead of the final answer. The model solved most problems correctly — the parser just grabbed the wrong number.
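
One way to avoid that failure mode is to search only the `content` field and take the last match rather than the first. The sketch below illustrates the idea; it is not the pipeline used for the numbers above. The `reasoning` and `content` keys follow the field names described in this post, while the regex and the `extract_answer` helper are assumptions for illustration.

```python
import re
from typing import Optional

# Matches "the answer is 42" or "final answer: 42", allowing commas and decimals.
ANSWER_RE = re.compile(
    r"(?:the answer is|final answer:?)\s*\$?(-?\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_answer(message: dict) -> Optional[str]:
    """Return the last answer-pattern match in `content`, ignoring the reasoning trace."""
    content = message.get("content") or ""   # content may be missing, None, or empty
    matches = ANSWER_RE.findall(content)
    if not matches:
        return None
    return matches[-1].replace(",", "")      # normalize "1,234" to "1234"


# Hypothetical response: the reasoning trace contains an intermediate number
# that a parser scanning both fields could pick up by mistake.
message = {
    "reasoning": "Half of 48 is 24, so the answer is 24 for May. Total: 48 + 24 = 72.",
    "content": "The answer is 72.",
}
print(extract_answer(message))
```

Returning `None` on an empty `content` also makes empty responses show up as a separate category instead of being counted as wrong answers.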

A proper evaluation would use structured output constraints (JSON mode or regex-guided sampling) to separate the thinking trace from the answer. That's a follow-up post.

Logical Reasoning

Q: If all A are B, and all B are C, are all A necessarily C?
A: Yes. This is a classic valid deductive syllogism.

Correct. The model handles abstract syllogistic reasoning without issue.


What This Means

Qwen 3.6 is fast. 217 tok/s aggregate (about 109 tok/s per request) with 47 ms TTFT is impressive for a 35B MoE at full bfloat16. For context, a 70B dense model at Q4 quantization does roughly 25-30 tok/s on this same GPU. The MoE design (3B active out of 35B) gives generation speeds comparable to what I'd expect from a 14B-class dense model.

I can't substantiate a "near-70B-class capability" claim — that would require running the same quality benchmarks on both models, which I can't do on a single GPU. The quality comparison is open for someone with two GPUs to settle.

Context scaling is solid. The hybrid DeltaNet + full-attention design keeps throughput stable as context grows. If you're building RAG apps or long-running agents, this is a meaningful advantage over dense models.

The reasoning format is a real integration gotcha. The reasoning field is useful for debugging, but it breaks standard evaluation pipelines and complicates API integration. Any production use needs to handle it explicitly — either wait for content to appear or use structured output modes.
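
A minimal sketch of the "wait for content" approach for streamed responses: accumulate the two fields separately and only show `content` to the user. The delta dictionaries here are simplified stand-ins for streamed chunks; the `reasoning` key follows the field name used in this post, and the exact key may differ between server versions, so check it against real payloads.

```python
from typing import Dict, Iterable, Optional, Tuple


def split_stream(deltas: Iterable[Dict[str, Optional[str]]]) -> Tuple[str, str]:
    """Accumulate streamed deltas into (reasoning, content) strings."""
    reasoning_parts = []
    content_parts = []
    for delta in deltas:
        # Each delta may carry a reasoning fragment, a content fragment, or neither.
        if delta.get("reasoning"):
            reasoning_parts.append(delta["reasoning"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)


# Hypothetical stream: thinking first, then the user-facing answer.
stream = [
    {"reasoning": "Let me think. "},
    {"reasoning": "2 + 2 = 4."},
    {"content": None},
    {"content": "4"},
]
reasoning, content = split_stream(stream)
print(repr(content))
```

If `content` comes back empty while `reasoning` is not, the response likely ended before the model left its thinking phase, which is worth logging separately from a genuinely wrong answer.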

Who this is for: If you have a 96 GB GPU and want to run a capable open-weight model at full precision without quantization, this is a strong candidate. If you're on 24 GB, GGUF quantizations or smaller dense models will serve you better. If you need coding-specific models (DeepSeek-Coder, CodeLlama) or heavy instruction tuning, evaluate against those before committing.


What's Missing

  • Comparison benchmarks: Qwen 3-14B (speed baseline) and Llama 3.1-70B (quality baseline) on the same pipeline would make the numbers more useful.
  • Deep quality eval: MMLU, HumanEval, and IFEval through native vLLM loading — requires a second GPU or freeing VRAM.
  • Higher batch sizes: BS4, BS8, BS16 to find the throughput ceiling.
  • Quantized comparison: How does FP8 or GGUF Q4_K_M change the speed and quality trade-off?

Happy to run any of these if there's interest — or if someone wants to collaborate on a multi-model MoE benchmark suite, get in touch.


Raw Data

All raw data, the evaluation script, and formatted reports are in the /data/benchmarks/qwen3.6/ directory.


The headline number is 217 tok/s, but the number I keep coming back to is 47 ms TTFT. That's the difference between a model that feels like it's thinking and one that responds. Qwen 3.6 responds.