Qwen 3.6-35B-A3B on an RTX PRO 6000: 217 tok/s at Full bf16

Qwen 3.6 Benchmarks

Note

TL;DR: Qwen 3.6-35B-A3B runs on a single RTX PRO 6000 (96 GB) at full bfloat16 - no quantization, no offloading. I measured 217 tok/s throughput, 47 ms TTFT, and only 18% slowdown at 8K context. Speculative decoding improved token cadence but did not improve total throughput, and it hurt tail latency.

What happens when a 35-billion-parameter MoE model fits on a single GPU at full precision? I ran Qwen 3.6-35B-A3B through vLLM benchmarks to find out.

The short version: it is fast, the first token arrives almost immediately, and context scaling is better than I expected. The quality results are less conclusive because standard eval tooling does not handle Qwen's reasoning format cleanly through the chat API.

Quick Glossary

Term	Meaning
TTFT	Time to First Token - how long before the model starts outputting. Lower = feels faster.
TPOT	Time Per Output Token - generation speed once streaming begins.
MoE	Mixture of Experts - only a subset of parameters activate per token.
bf16	Brain float 16 - memory-efficient precision that preserves model quality.
DeltaNet	Linear attention variant Qwen 3.6 uses for most layers, trades some accuracy for O(n) context scaling.

Results Summary

Result	Measurement
Best throughput	217.84 tok/s
Baseline TTFT avg	47 ms
Baseline TTFT P99	57 ms
8K context throughput	124.44 tok/s baseline, 146.73 tok/s tuned
8K slowdown	17.6% baseline, 14.7% tuned
Speculative decoding	+0.4% throughput, worse P99 latency
Precision	bfloat16 model weights, no quantization

The headline is not just throughput. It is the combination of full bf16 weights, single-GPU deployment, 47 ms average time to first token, and usable throughput even at 8K context.

The Model


Model	Qwen 3.6-35B-A3B (HuggingFace)
Architecture	MoE - 256 experts, 8 active + 1 shared per token
Total params	35B (3B active)
Precision	bfloat16 (not quantized)
File size	71.9 GB (26 safetensors)
VRAM	~83 GB
Context	262,144 tokens native
Multimodal	Text + Image + Video (image and video use a unified vision encoder with temporal patching for video)
Reasoning	Yes - thinking mode enabled via `--reasoning-parser qwen3`

The RTX PRO 6000's 96 GB VRAM is what makes this interesting: the model fits at full precision without GGUF quantization or CPU offload. That gives a useful baseline before deciding whether quantization is worth the trade-off.

Setup

Backend:       vLLM 0.20.1 (official release, no fork)
GPU:           NVIDIA RTX PRO 6000 Blackwell (96 GB)
Driver:        580.126.09
CUDA:          13.2
PyTorch:       2.x (bundled with vLLM 0.20.1)
Flash attn:    Enabled by default (vLLM auto-detects)
Batch size:    2 (concurrent requests)
Concurrency:   30 requests
Dataset:       OpenQA (30 prompts, varied topics, ~29 avg input tokens)
Max output:    1024 tokens per response
Temperature:   0.0 (greedy decoding for speed)

I used a custom Python script against the OpenAI-compatible chat completions endpoint (/v1/chat/completions). Full commands, raw outputs, and the script are linked at the bottom of this post.

Limitations to note before we dive in:

All results are from a single run. No warmup iterations were discarded (vLLM's CUDA graph compilation happens on the first batch).
GPU temperatures stayed below 75°C throughout (ambient ~22°C, air-cooled workstation).
30 prompts is a small sample - enough for throughput characterization, but not statistically significant for quality evaluation.

Speed Benchmark

I compared a baseline vLLM launch against a tuned speculative-decoding launch. The tuned run also used FP8 KV cache, FlashInfer, chunked prefill, prefix caching, and a smaller sequence cap, so treat it as a full serving-profile comparison rather than a pure speculative-decoding A/B test.

Results

Metric	Without Spec. Decode	With Spec. Decode	Delta
Throughput	216.92 tok/s	217.84 tok/s	+0.4%
TTFT avg	47 ms	162 ms	+245%
TTFT P50	44.5 ms	65 ms	+46%
TTFT P99	57 ms	1,038 ms	+1,721%
TPOT avg	9.1 ms	8.8 ms	−3.3%
TPOT P99	9.1 ms	14.3 ms	+57%
Avg latency	8.819 s	8.684 s	−1.5%
P99 latency	9.386 s	14.736 s	+57%
Success rate	100% (30/30)	100% (30/30)	—
Decoded tok/iter	1.00	2.37	+137%
Spec acceptance	0.2%	57.8%	+57.6pp

Speculative decoding reached 57.8% acceptance and produced 2.37 decoded tokens per iteration, but overall throughput barely moved. The draft-head overhead appears to cancel out the generation gain.

The bigger issue is latency. Baseline TTFT was extremely flat: 47 ms average, 57 ms P99. The tuned speculative run had slightly faster average latency, but P99 latency rose 57% and TTFT P99 jumped to 1,038 ms. For interactive chat, I would keep the baseline profile unless throughput under heavier batch sizes changes the trade-off.

Context Scaling

How throughput holds up as context length grows:

Results

Context Length	Throughput (w/o spec)	Degradation (w/o spec)	Throughput (w/ spec)	Degradation (w/ spec)	Delta
500 tokens	151.07 tok/s	Baseline	172.09 tok/s	Baseline	+13.9%
2,000 tokens	147.68 tok/s	−2.3%	173.12 tok/s	+0.6%	+17.2%
8,000 tokens	124.44 tok/s	−17.6%	146.73 tok/s	−14.7%	+17.9%

KV cache FP8 quantization + FlashInfer improve throughput by 14–18% across all context lengths.
500→8K degradation drops from 17.6% to 14.7%.
At 2,000 tokens, the tuned run slightly exceeds the 500-token baseline.

Qwen 3.6 uses gated DeltaNet (linear attention) for 30 of 40 layers, with full attention only every 4th layer. That hybrid design, combined with FP8 KV cache and flashinfer, keeps context scaling efficient — 14.7% loss for 16× context growth. By comparison, pure dense attention models tested on the same hardware typically lose 30–50% at 8K.

At 8K tokens (~6,000 words of input), this covers most RAG doc chunks, multi-turn conversation histories, and medium-length document analysis.

Quality Evaluation

Standard lm-eval loglikelihood benchmarks (MMLU, GSM8K, etc.) do not work cleanly through the chat API. Native vLLM loading would need another large VRAM allocation, so I ran a small custom check instead.

This is not a comprehensive quality assessment. It is a quick sanity check to see whether the speed numbers are attached to useful outputs.

Knowledge QA (5 questions)

Q: What is the capital of France?          ✅ Paris
Q: What is the chemical symbol for water?   ❌ (empty response)
Q: Who wrote the novel '1984'?             ✅ George Orwell
Q: Speed of light in vacuum?               ✅ 299,792 km/s
Q: What year did WWII end?                 ✅ 1945

4/5 (80%). The one miss (H₂O) returned empty - likely a prompt formatting issue, not a knowledge gap.

Math Reasoning (GSM8K, 10 samples)

3/10 (30%), but the number is not reliable. Qwen 3.6 returns reasoning separately from final content, and my answer extractor sometimes grabbed intermediate numbers from the reasoning trace instead of the final answer.

A proper evaluation needs structured output constraints or a parser that treats reasoning and content separately.

Logical Reasoning

Q: If all A are B, and all B are C, are all A necessarily C?
A: Yes. This is a classic valid deductive syllogism.

Correct. The model handles abstract syllogistic reasoning without issue.

What This Means

Qwen 3.6 is fast. 217 tok/s with 47 ms TTFT is impressive for a 35B MoE at full bfloat16. For context, a 70B dense model at Q4 quantization does roughly 25-30 tok/s on this same GPU. The MoE design (3B active out of 35B) gives generation speeds comparable to what I'd expect from a 14B-class dense model.

I can't substantiate a "near-70B-class capability" claim - that would require running the same quality benchmarks on both models, which I can't do on a single GPU. The quality comparison is open for someone with two GPUs to settle.

Context scaling is solid. The hybrid DeltaNet + full-attention design keeps throughput stable as context grows. If you're building RAG apps or long-running agents, this is a meaningful advantage over dense models.

The reasoning format is a real integration gotcha. The reasoning field is useful for debugging, but it breaks standard evaluation pipelines and complicates API integration. Any production use needs to handle it explicitly - either wait for content to appear or use structured output modes.

Who this is for: If you have a 96 GB GPU and want to run a capable open-weight model at full precision without quantization, this is a strong candidate. If you're on 24 GB, GGUF quantizations or smaller dense models will serve you better. If you need coding-specific models (DeepSeek-Coder, CodeLlama) or heavy instruction tuning, evaluate against those before committing.

What's Missing

Comparison benchmarks: Qwen 3-14B (speed baseline) and Llama 3.1-70B (quality baseline) on the same pipeline would make the numbers more useful.
Deep quality eval: MMLU, HumanEval, and IFEval through native vLLM loading - requires a second GPU or freeing VRAM.
Higher batch sizes: BS4, BS8, BS16 to find the throughput ceiling.
Quantized comparison: How does FP8 or GGUF Q4_K_M change the speed and quality trade-off?

Those would make the result more useful as a buying or deployment guide. For now, this is best read as a single-GPU serving baseline.

Serving Configs

These are the two vLLM configurations I compared.

Baseline:

vllm serve /mnt/ai/models/llm/qwen3.6-full \
  --port 11439 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Tuned/speculative:

vllm serve /mnt/ai/models/llm/qwen3.6-full \
  --host 0.0.0.0 \
  --port 11439 \
  --served-model-name qwen-3.6-full \
  --gpu-memory-utilization 0.82 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 16 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code \
  --performance-mode throughput \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

The meaningful tuning changes are FP8 KV cache, FlashInfer, chunked prefill, prefix caching, lower --max-num-seqs, and native Qwen3 MTP speculative decoding. The raw benchmark files include the full percentile data.

Raw Data

All raw data, the evaluation script, and formatted reports are in the /data/benchmarks/qwen3.6/ directory:

speed_results.txt - Full speed benchmark percentile data
context_scaling.json - Context scaling measurements
quality_results.json - Quality evaluation results
benchmark_report.md - Formatted report
eval_methodology.py - Custom evaluation script

The headline number is 217 tok/s, but the number I keep coming back to is 47 ms TTFT. That's the difference between a model that feels like it's thinking and one that responds. Qwen 3.6 responds.

Qwen 3.6-35B-A3B on an RTX PRO 6000: 217 tok/s at Full bf16

Quick Glossary

Results Summary

The Model

Setup

Speed Benchmark

Results

Context Scaling

Results

Quality Evaluation

Knowledge QA (5 questions)

Math Reasoning (GSM8K, 10 samples)

Logical Reasoning

What This Means

What's Missing

Serving Configs

Raw Data

Local LLM Benchmarks on RTX PRO 6000 Blackwell (2026): 8 Models, Real Numbers

Nemotron 3 Super vs Qwen 3.5: Tested on a Real Agentic Workflow