Qwen 3.6-35B-A3B on an RTX PRO 6000: 217 tok/s at Full bf16
Speed, latency, context scaling, and quality notes for Qwen 3.6-35B-A3B running at full bfloat16 on a single RTX PRO 6000 Blackwell via vLLM.
TL;DR: Qwen 3.6-35B-A3B runs on a single RTX PRO 6000 (96 GB) at full bfloat16 - no quantization, no offloading. I measured 217 tok/s throughput, 47 ms TTFT, and only 18% slowdown at 8K context. Speculative decoding improved token cadence but did not improve total throughput, and it hurt tail latency.
What happens when a 35-billion-parameter MoE model fits on a single GPU at full precision? I ran Qwen 3.6-35B-A3B through vLLM benchmarks to find out.
The short version: it is fast, the first token arrives almost immediately, and context scaling is better than I expected. The quality results are less conclusive because standard eval tooling does not handle Qwen's reasoning format cleanly through the chat API.
Quick Glossary
| Term | Meaning |
|---|---|
| TTFT | Time to First Token - how long before the model starts outputting. Lower = feels faster. |
| TPOT | Time Per Output Token - generation speed once streaming begins. |
| MoE | Mixture of Experts - only a subset of parameters activate per token. |
| bf16 | Brain float 16 - memory-efficient precision that preserves model quality. |
| DeltaNet | Linear attention variant that Qwen 3.6 uses for most layers. It trades some accuracy for O(n) context scaling. |
Results Summary
| Result | Measurement |
|---|---|
| Best throughput | 217.84 tok/s |
| Baseline TTFT avg | 47 ms |
| Baseline TTFT P99 | 57 ms |
| 8K context throughput | 124.44 tok/s baseline, 146.73 tok/s tuned |
| 8K slowdown | 17.6% baseline, 14.7% tuned |
| Speculative decoding | +0.4% throughput, worse P99 latency |
| Precision | bfloat16 model weights, no quantization |
The headline is not just throughput. It is the combination of full bf16 weights, single-GPU deployment, 47 ms average time to first token, and usable throughput even at 8K context.
The Model
| Property | Value |
|---|---|
| Model | Qwen 3.6-35B-A3B (HuggingFace) |
| Architecture | MoE - 256 experts, 8 active + 1 shared per token |
| Total params | 35B (3B active) |
| Precision | bfloat16 (not quantized) |
| File size | 71.9 GB (26 safetensors) |
| VRAM | ~83 GB |
| Context | 262,144 tokens native |
| Multimodal | Text + Image + Video (image and video use a unified vision encoder with temporal patching for video) |
| Reasoning | Yes - thinking mode enabled via --reasoning-parser qwen3 |
The RTX PRO 6000's 96 GB VRAM is what makes this interesting: the model fits at full precision without GGUF quantization or CPU offload. That gives a useful baseline before deciding whether quantization is worth the trade-off.
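As a rough sanity check on the fit, bf16 weights take about two bytes per parameter. The sketch below uses the nominal 35B parameter count and decimal gigabytes, both simplifying assumptions; the checkpoint size and VRAM capacity are the figures quoted above.

```python
# Back-of-envelope weight memory for a bf16 checkpoint.
# Assumptions: nominal 35e9 parameters, 2 bytes each, decimal GB (1e9 bytes).
TOTAL_PARAMS = 35e9
BYTES_PER_BF16 = 2
GPU_VRAM_GB = 96.0
CHECKPOINT_GB = 71.9  # reported size of the safetensors files

estimated_weights_gb = TOTAL_PARAMS * BYTES_PER_BF16 / 1e9
headroom_gb = GPU_VRAM_GB - CHECKPOINT_GB

print(f"Estimated weights: {estimated_weights_gb:.1f} GB")
print(f"Headroom after weights: {headroom_gb:.1f} GB")
```

Whatever headroom remains after the weights is what the KV cache, activations, and runtime buffers have to share.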
Setup
Backend: vLLM 0.20.1 (official release, no fork)
GPU: NVIDIA RTX PRO 6000 Blackwell (96 GB)
Driver: 580.126.09
CUDA: 13.2
PyTorch: 2.x (bundled with vLLM 0.20.1)
Flash attn: Enabled by default (vLLM auto-detects)
Batch size: 2 (concurrent requests)
Concurrency: 30 requests
Dataset: OpenQA (30 prompts, varied topics, ~29 avg input tokens)
Max output: 1024 tokens per response
Temperature: 0.0 (greedy decoding for speed)
I used a custom Python script against the OpenAI-compatible chat completions endpoint (/v1/chat/completions). Full commands, raw outputs, and the script are linked at the bottom of this post.
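For readers who want to reproduce the latency metrics, here is a minimal sketch of how TTFT and TPOT are commonly derived from a streamed response. It is not the benchmark script: the function names are hypothetical, and it assumes one timestamp per output token, which a real stream may not give you if chunks carry several tokens.

```python
import math


def stream_metrics(request_start, token_times):
    """Return (ttft_s, tpot_s) for one streamed response.

    request_start: timestamp when the request was sent.
    token_times: arrival timestamps of each output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT excludes the first token: decode time spread over the remaining tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Timestamps would typically come from `time.perf_counter()` around the streaming loop, with `percentile` applied to the per-request TTFT values to get P50 and P99.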
Limitations to note before we dive in:
- All results are from a single run. No warmup iterations were discarded (vLLM's CUDA graph compilation happens on the first batch).
- GPU temperatures stayed below 75°C throughout (ambient ~22°C, air-cooled workstation).
- 30 prompts is a small sample - enough for throughput characterization, but not statistically significant for quality evaluation.
Speed Benchmark
I compared a baseline vLLM launch against a tuned speculative-decoding launch. The tuned run also used FP8 KV cache, FlashInfer, chunked prefill, prefix caching, and a smaller sequence cap, so treat it as a full serving-profile comparison rather than a pure speculative-decoding A/B test.
Results
| Metric | Without Spec. Decode | With Spec. Decode | Delta |
|---|---|---|---|
| Throughput | 216.92 tok/s | 217.84 tok/s | +0.4% |
| TTFT avg | 47 ms | 162 ms | +245% |
| TTFT P50 | 44.5 ms | 65 ms | +46% |
| TTFT P99 | 57 ms | 1,038 ms | +1,721% |
| TPOT avg | 9.1 ms | 8.8 ms | −3.3% |
| TPOT P99 | 9.1 ms | 14.3 ms | +57% |
| Avg latency | 8.819 s | 8.684 s | −1.5% |
| P99 latency | 9.386 s | 14.736 s | +57% |
| Success rate | 100% (30/30) | 100% (30/30) | — |
| Decoded tok/iter | 1.00 | 2.37 | +137% |
| Spec acceptance | 0.2% | 57.8% | +57.6pp |
Speculative decoding reached 57.8% acceptance and produced 2.37 decoded tokens per iteration, but overall throughput barely moved. The draft-head overhead appears to cancel out the generation gain.
The bigger issue is latency. Baseline TTFT was extremely flat: 47 ms average, 57 ms P99. The tuned speculative run had slightly faster average latency, but P99 latency rose 57% and TTFT P99 jumped to 1,038 ms. For interactive chat, I would keep the baseline profile unless throughput under heavier batch sizes changes the trade-off.
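One way to sanity-check the baseline numbers is to see whether the aggregate throughput is consistent with the per-token latency. The sketch below assumes both concurrent streams decode at the average TPOT for the whole run, which is a simplification.

```python
# Cross-check aggregate throughput against per-token latency (baseline run).
# Assumption: both concurrent streams decode at the average TPOT throughout.
TPOT_MS = 9.1
CONCURRENT_STREAMS = 2
MEASURED_TOK_S = 216.92

per_stream_tok_s = 1000.0 / TPOT_MS
implied_tok_s = per_stream_tok_s * CONCURRENT_STREAMS
gap_pct = (implied_tok_s - MEASURED_TOK_S) / MEASURED_TOK_S * 100

print(f"Per stream: {per_stream_tok_s:.1f} tok/s")
print(f"Implied aggregate: {implied_tok_s:.1f} tok/s ({gap_pct:+.1f}% vs measured)")
```

The implied figure lands within a couple of percent of the measured 216.92 tok/s, so the throughput and TPOT columns agree with each other.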
Context Scaling
How throughput holds up as context length grows:
Results
| Context Length | Throughput (w/o spec) | Degradation (w/o spec) | Throughput (w/ spec) | Degradation (w/ spec) | Delta |
|---|---|---|---|---|---|
| 500 tokens | 151.07 tok/s | Baseline | 172.09 tok/s | Baseline | +13.9% |
| 2,000 tokens | 147.68 tok/s | −2.2% | 173.12 tok/s | +0.6% | +17.2% |
| 8,000 tokens | 124.44 tok/s | −17.6% | 146.73 tok/s | −14.7% | +17.9% |
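- The tuned profile is 14–18% faster than the baseline at every context length. Because it bundles FP8 KV cache, FlashInfer, and speculative decoding, the gain cannot be attributed to any single change.
- 500→8K degradation drops from 17.6% to 14.7%.
- At 2,000 tokens, the tuned run is slightly faster than its own 500-token result (173.12 vs 172.09 tok/s).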
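The degradation and delta columns can be recomputed directly from the throughput figures. This is a small sketch using the values from the table; the `pct_change` helper is just an illustrative name.

```python
# Throughput in tok/s by context length, copied from the context scaling table.
baseline = {500: 151.07, 2000: 147.68, 8000: 124.44}
tuned = {500: 172.09, 2000: 173.12, 8000: 146.73}


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


for ctx in sorted(baseline):
    base_deg = pct_change(baseline[500], baseline[ctx])
    tuned_deg = pct_change(tuned[500], tuned[ctx])
    gain = pct_change(baseline[ctx], tuned[ctx])
    print(
        f"{ctx:>5} tokens: baseline {base_deg:+.1f}%, "
        f"tuned {tuned_deg:+.1f}%, tuned vs baseline {gain:+.1f}%"
    )
```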
Qwen 3.6 uses gated DeltaNet (linear attention) for 30 of 40 layers, with full attention only every 4th layer. That hybrid design, combined with FP8 KV cache and FlashInfer, keeps context scaling efficient: the tuned run loses 14.7% of its throughput for 16× context growth. By comparison, pure dense attention models tested on the same hardware typically lose 30–50% at 8K.
An 8K-token context (~6,000 words of input) covers most RAG document chunks, multi-turn conversation histories, and medium-length document analysis.
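The 30-of-40 split follows from placing full attention on every 4th layer. The sketch below enumerates such a layout; putting the full-attention layer last in each group of four is an assumption made for illustration, not something confirmed here.

```python
# Hybrid layer layout: 40 layers, full attention on every 4th one.
# Assumption: the full-attention layer is the last in each group of four.
NUM_LAYERS = 40
FULL_ATTENTION_EVERY = 4

layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_EVERY == 0 else "deltanet"
    for i in range(NUM_LAYERS)
]

num_full = layer_types.count("full_attention")
num_linear = layer_types.count("deltanet")
print(f"{num_linear} DeltaNet layers, {num_full} full-attention layers")
```

In a layout like this, only the full-attention layers would be expected to hold a KV cache that grows with sequence length, which is one plausible reason the throughput curve stays relatively flat.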
Quality Evaluation
Standard lm-eval loglikelihood benchmarks (MMLU, GSM8K, etc.) do not work cleanly through the chat API. Native vLLM loading would need another large VRAM allocation, so I ran a small custom check instead.
This is not a comprehensive quality assessment. It is a quick sanity check to see whether the speed numbers are attached to useful outputs.
Knowledge QA (5 questions)
Q: What is the capital of France? ✅ Paris
Q: What is the chemical symbol for water? ❌ (empty response)
Q: Who wrote the novel '1984'? ✅ George Orwell
Q: Speed of light in vacuum? ✅ 299,792 km/s
Q: What year did WWII end? ✅ 1945
4/5 (80%). The one miss (H₂O) returned empty - likely a prompt formatting issue, not a knowledge gap.
Math Reasoning (GSM8K, 10 samples)
3/10 (30%), but the number is not reliable. Qwen 3.6 returns reasoning separately from final content, and my answer extractor sometimes grabbed intermediate numbers from the reasoning trace instead of the final answer.
A proper evaluation needs structured output constraints or a parser that treats reasoning and content separately.
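A parser along those lines could look like the sketch below. It reads only the final answer text and never touches the reasoning trace. The function name, the regex, and the message field names are assumptions for illustration; check what your server actually returns.

```python
import re

# Matches integers and decimals, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(message):
    """Return the last number in the final answer text, or None.

    `message` is a chat-completion message dict. Only `content` is read;
    any reasoning field is ignored. Field names here are assumptions.
    """
    content = message.get("content") or ""
    matches = NUMBER.findall(content)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))
```

For example, a message whose reasoning mentions 42 but whose content says "The answer is 21." yields 21.0, and an empty content field yields `None` instead of a number scraped from the trace.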
Logical Reasoning
Q: If all A are B, and all B are C, are all A necessarily C?
A: Yes. This is a classic valid deductive syllogism.
Correct. The model handles abstract syllogistic reasoning without issue.
What This Means
Qwen 3.6 is fast. 217 tok/s with 47 ms TTFT is impressive for a 35B MoE at full bfloat16. For context, a 70B dense model at Q4 quantization does roughly 25-30 tok/s on this same GPU. The MoE design (3B active out of 35B) gives generation speeds comparable to what I'd expect from a 14B-class dense model.
I can't substantiate a "near-70B-class capability" claim - that would require running the same quality benchmarks on both models, which I can't do on a single GPU. The quality comparison is open for someone with two GPUs to settle.
Context scaling is solid. The hybrid DeltaNet + full-attention design keeps throughput stable as context grows. If you're building RAG apps or long-running agents, this is a meaningful advantage over dense models.
The reasoning format is a real integration gotcha. The reasoning field is useful for debugging, but it breaks standard evaluation pipelines and complicates API integration. Any production use needs to handle it explicitly - either wait for content to appear or use structured output modes.
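When streaming, "waiting for content to appear" amounts to accumulating the reasoning and answer text in separate buffers. Here is a minimal sketch over already-parsed delta dicts; the function name is hypothetical, and the `reasoning` key is an assumption that may differ between server versions.

```python
def split_stream(deltas, reasoning_key="reasoning"):
    """Accumulate reasoning and answer text separately from streamed deltas.

    `deltas` is an iterable of per-chunk delta dicts. `reasoning_key` is an
    assumption; check the field name your server actually emits.
    """
    reasoning_parts = []
    content_parts = []
    for delta in deltas:
        reasoning_parts.append(delta.get(reasoning_key) or "")
        content_parts.append(delta.get("content") or "")
    return "".join(reasoning_parts), "".join(content_parts)
```

An empty answer string after the stream ends is worth treating as an explicit error case rather than a valid reply.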
Who this is for: If you have a 96 GB GPU and want to run a capable open-weight model at full precision without quantization, this is a strong candidate. If you're on 24 GB, GGUF quantizations or smaller dense models will serve you better. If you need coding-specific models (DeepSeek-Coder, CodeLlama) or heavy instruction tuning, evaluate against those before committing.
What's Missing
- Comparison benchmarks: Qwen 3-14B (speed baseline) and Llama 3.1-70B (quality baseline) on the same pipeline would make the numbers more useful.
- Deep quality eval: MMLU, HumanEval, and IFEval through native vLLM loading - requires a second GPU or freeing VRAM.
- Higher batch sizes: BS4, BS8, BS16 to find the throughput ceiling.
- Quantized comparison: How does FP8 or GGUF Q4_K_M change the speed and quality trade-off?
Those would make the result more useful as a buying or deployment guide. For now, this is best read as a single-GPU serving baseline.
Serving Configs
These are the two vLLM configurations I compared.
Baseline:
vllm serve /mnt/ai/models/llm/qwen3.6-full \
--port 11439 \
--gpu-memory-utilization 0.85 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Tuned/speculative:
vllm serve /mnt/ai/models/llm/qwen3.6-full \
--host 0.0.0.0 \
--port 11439 \
--served-model-name qwen-3.6-full \
--gpu-memory-utilization 0.82 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 16 \
--attention-backend flashinfer \
--enable-prefix-caching \
--enable-chunked-prefill \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--performance-mode throughput \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
The meaningful tuning changes are FP8 KV cache, FlashInfer, chunked prefill, prefix caching, lower --max-num-seqs, and native Qwen3 MTP speculative decoding. The raw benchmark files include the full percentile data.
Raw Data
All raw data, the evaluation script, and formatted reports are in the /data/benchmarks/qwen3.6/ directory:
- speed_results.txt - Full speed benchmark percentile data
- context_scaling.json - Context scaling measurements
- quality_results.json - Quality evaluation results
- benchmark_report.md - Formatted report
- eval_methodology.py - Custom evaluation script
The headline number is 217 tok/s, but the number I keep coming back to is 47 ms TTFT. That's the difference between a model that feels like it's thinking and one that responds. Qwen 3.6 responds.