Local LLM Benchmarks on RTX PRO 6000 Blackwell (2026): 8 Models, Real Numbers
Prompt processing, token generation, and long-context benchmarks for 8 open-weight models on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), measured with llama-bench at context depths up to 65K tokens.

96 GB of VRAM. 8 models. 4 context depths. Here are the numbers — with the flags, the reps, and the surprises.
Multiple context lengths, flash attention on, 3 reps per measurement, full methodology. Numbers are for a single user running locally — batch size 1, no concurrency.
The Rig
| Component | Spec |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition |
| VRAM | 96 GB (97,247 MiB ≈ 95.0 GiB usable) |
| Driver | 580.126.09 · CUDA 13.2 |
| CPU | 30-core (VM host) |
| RAM | 172 GB |
Does CPU/RAM matter? For these tests, no. All models loaded fully into VRAM (zero CPU offloading, ngl=99). CPU only handles tokenization and the initial file load - both negligible. RAM is used while reading the GGUF from disk into VRAM; irrelevant once loaded. The bottleneck is GPU memory bandwidth.
Methodology
Tool: llama-bench (llama.cpp d23355af / build b8352)
GPU layers: -ngl 99 (all layers on GPU, no CPU offload)
Flash attn: -fa 1 (enabled - improves pp speed 5–11%)
Repetitions: -r 3 (mean ± stddev reported)
Models: Ollama GGUF blobs, loaded directly
Batch size: BS=1 (single user, not server throughput)
All models are Ollama-pulled GGUFs accessed directly via blob path - see Where Ollama Stores Your Models for how that works.
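That lookup can be sketched in a few lines. The layout below (a `layers` array whose `application/vnd.ollama.image.model` entry carries the GGUF digest) matches recent Ollama versions, but treat the paths and media type as assumptions; the inline manifest is a stub standing in for a real file under `~/.ollama/models/manifests/`.

```shell
# Sketch: resolve an Ollama manifest to its GGUF blob path (assumed layout).
# A real manifest lives at:
#   ~/.ollama/models/manifests/registry.ollama.ai/library/<model>/<tag>
manifest='{"layers":[
  {"mediaType":"application/vnd.ollama.image.model","digest":"sha256:abc123"},
  {"mediaType":"application/vnd.ollama.image.template","digest":"sha256:def456"}]}'
digest=$(printf '%s' "$manifest" | jq -r \
  '.layers[] | select(.mediaType == "application/vnd.ollama.image.model") | .digest')
blob=$(printf '%s' "$digest" | tr ':' '-')   # blob filenames swap ":" for "-"
echo "$HOME/.ollama/models/blobs/$blob"
```

Point llama-bench's `-m` at the printed path and it loads the GGUF directly, no export step needed.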
Note: TTFT (time-to-first-token) is not measured here - llama-bench measures throughput, not wall-clock latency. For TTFT you'd use the llama.cpp server + benchmark_serving.py or a tool like llama-benchy. That's a separate post.
Reading the Labels
The notation used throughout this post:
| Label | Meaning |
|---|---|
| pp128 | Prompt processing at 128 input tokens - short prompt, e.g. a quick question |
| pp512 | Prompt processing at 512 input tokens - a paragraph or short document chunk |
| pp2048 | Prompt processing at 2,048 input tokens - a few pages of context |
| tg128 | Token generation of 128 output tokens - a brief response |
| tg512 | Token generation of 512 output tokens - a longer reply or code block |
| @ d8192 | Measured with 8,192 tokens already in context (KV cache pre-filled) |
| @ d32768 | Measured with 32K tokens in context |
| @ d65536 | Measured with 65K tokens in context |
All speeds are in tokens per second (t/s). Higher is better.
Numbers are reported as mean ± stddev across 3 runs - e.g. 7,898 ± 208 means the average was 7,898 t/s and results varied by ±208 t/s between runs. A tight stddev (like ± 4) means the result is stable and reliable. A wide one (like ± 616) means the GPU was warming up or there was system jitter - treat those numbers with more caution.
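For the curious, the ± is a sample standard deviation over the three runs. A minimal sketch - the three run values below are illustrative, chosen to land exactly on the 7,898 ± 208 example:

```shell
# Mean and sample stddev (n-1 denominator), the usual way benchmark reps
# are summarized. Reads one value per line from stdin.
mean_sd() {
  awk '{ n++; s += $1; ss += $1*$1 }
       END { m = s/n; printf "%.0f ± %.0f\n", m, sqrt((ss - n*m*m)/(n-1)) }'
}
printf '7690\n7898\n8106\n' | mean_sd   # -> 7898 ± 208
```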
pp (prefill) is compute-bound - the GPU is doing dense matrix multiplications in parallel. tg (decode) is memory-bandwidth-bound - weights get loaded from VRAM on every single token. That's why a massive GPU like this can prefill at 8,000 t/s but only generate at 150 t/s on the same small model.
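That bandwidth framing turns into a usable back-of-envelope estimate: each decoded token streams roughly the whole weight file from VRAM, so decode speed is about effective bandwidth divided by model size. The ~1.79 TB/s figure is the card's published memory bandwidth; the 65% efficiency factor is an assumption, not a measurement.

```shell
# Rough decode-speed predictor: t/s ≈ effective VRAM bandwidth / model bytes.
# 1.79e12 B/s is the published spec; 0.65 is an assumed efficiency factor.
predict_tg() {
  awk -v gib="$1" 'BEGIN { printf "%.0f\n", (1.79e12 * 0.65) / (gib * 2^30) }'
}
predict_tg 6.6    # mistral-nemo Q4_0, 6.6 GiB   -> predicts 164; measured 158
predict_tg 39.6   # llama3.1:70b Q4_K_M, 39.6 GiB -> predicts 27; measured 27
```

Close enough across the table to confirm decode is bandwidth-bound, not compute-bound, for every dense model here.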
Prompt Processing (t/s - higher is better)
How fast the model reads your input. Relevant for RAG, document analysis, long-context work.
| Model | Type | Params | Quant | Size | pp128 | pp512 | pp2048 |
|---|---|---|---|---|---|---|---|
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 5,467 ± 616 | 8,325 ± 88 | 7,903 ± 4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 4,059 ± 499 | 5,918 ± 93 | 5,612 ± 2 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 1,578 ± 37 | 3,274 ± 21 | 3,337 ± 11 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 2,074 ± 169 | 2,601 ± 36 | 2,530 ± 4 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 658 ± 10 | 1,448 ± 2 | 1,430 ± 2 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 1,129 ± 12 | 1,327 ± 9 | 1,295 ± 3 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 1,013 ± 3 | 1,173 ± 7 | 1,147 ± 3 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 1,020 ± 6 | 1,176 ± 7 | 1,147 ± 3 |
Token Generation (t/s - higher is better)
How fast the model writes output. The number you feel during chat.
| Model | Type | Params | Quant | Size | tg128 | tg512 |
|---|---|---|---|---|---|---|
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 158 ± 0.1 | 156 ± 0.4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 117 ± 0.2 | 116 ± 0.6 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 124 ± 0.7 | 125 ± 0.1 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 56 ± 0.2 | 55 ± 0.5 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 54 ± 0.2 | 52 ± 0.5 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 31 ± 0.1 | 31 ± 0.2 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 27 ± 0.1 | 26 ± 0.2 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 25 ± 0.1 | 24 ± 0.2 |
Context Depth Scaling
This is where 96 GB VRAM makes a real difference. A 24 GB consumer card can't load a 70B Q4 at all, and even a 32B Q4 plus its KV cache overflows 24 GB before 32K context. Here's how speed degrades as the KV cache fills up.
| Model | Context Depth | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| mistral-nemo:12b | 0 | 7,898 ± 208 | 150 ± 0.2 |
| mistral-nemo:12b | 8K | 6,490 ± 218 | 128 ± 0.2 |
| mistral-nemo:12b | 32K | 3,603 ± 66 | 87 ± 0.1 |
| mistral-nemo:12b | 65K | 1,600 ± 14 | 61 ± 0.1 |
| qwen2.5:32b | 0 | 2,548 ± 19 | 54 ± 0.1 |
| qwen2.5:32b | 8K | 2,137 ± 37 | 49 ± 0.1 |
| qwen2.5:32b | 32K | 1,199 ± 12 | 40 ± 0.1 |
| qwen2.5:32b | 65K | 551 ± 2 | 32 ± 0.1 |
| llama3.1:70b | 0 | 1,166 ± 3 | 27 ± 0.1 |
| llama3.1:70b | 8K | 1,040 ± 9 | 25 ± 0.1 |
| llama3.1:70b | 32K | 691 ± 5 | 22 ± 0.1 |
| llama3.1:70b | 65K | 377 ± 1 | 19 ± 0.1 |
What the Numbers Show
Flash attention matters - especially for prompt processing.
Enabling -fa 1 lifted pp512 by 5–11% across all dense models. mistral-nemo gained 9%, qwen2.5-coder gained 10.8%. If you're benchmarking without it, your numbers are understated. MoE models (qwen3-next, mixtral) see smaller gains - attention is a smaller share of their per-token compute, so a faster attention kernel moves the total less.
MoE breaks the size-speed relationship.
qwen3-next:80b has 80B total parameters, activates ~3B per token, and generates at 124 t/s - faster than qwen2.5-coder:14b (117 t/s) and nearly as fast as mistral-nemo:12b (158 t/s). You're getting an 80B-quality model at 14B-class throughput.
Two things make this weirder. First, its token generation is flat across output length - tg128 and tg512 are virtually identical (124.33 vs 124.89 t/s), while every dense model in the table sheds a point or two as the KV cache grows. Second, its prompt processing actually improves with longer inputs: pp128 → pp512 → pp2048 goes 1,578 → 3,274 → 3,337 t/s. With only ~3B parameters active per token, small prefill batches can't saturate the GPU; larger batches keep the routed experts better utilized.
Smaller models degrade harder at long context.
This is the counter-intuitive part. At 65K context depth, mistral-nemo:12b loses 59% of its token generation speed (150 → 61 t/s). llama3.1:70b only loses 30% (27 → 19 t/s).
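Those loss percentages come straight from the context-depth table; a quick check with the tg128 values at depth 0 and 65K:

```shell
# Fraction of empty-context tg speed retained at 65K depth
retained() {
  awk -v base="$1" -v deep="$2" 'BEGIN { printf "%.0f%%\n", 100 * deep / base }'
}
retained 150 61   # mistral-nemo:12b -> 41% retained (59% lost)
retained 27 19    # llama3.1:70b     -> 70% retained (30% lost)
```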
The reason: KV cache size scales with context length and number of attention heads - not model size per se. For smaller models the KV cache becomes a larger fraction of total memory bandwidth relative to weight loading. The 70B model's weights dominate the bandwidth budget regardless of context, so the KV cache overhead is proportionally smaller.
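A rough sizing makes the mechanism concrete. KV cache per token is 2 (K and V) × layers × KV heads × head dim × bytes per element. The layer/head counts below are from the two models' published configs, and an fp16 cache (llama.cpp's default) is assumed:

```shell
# KV cache size in GiB:
# 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16) * tokens
kv_gib() {
  awk -v L="$1" -v H="$2" -v D="$3" -v T="$4" \
      'BEGIN { printf "%.1f\n", (2 * L * H * D * 2 * T) / 2^30 }'
}
kv_gib 40 8 128 65536   # mistral-nemo:12b @ 65K -> 10.0 GiB
kv_gib 80 8 128 65536   # llama3.1:70b     @ 65K -> 20.0 GiB
```

At 65K the 12B is streaming a 10 GiB cache against 6.6 GiB of weights - the cache dominates the bandwidth budget. The 70B streams a 20 GiB cache against 39.6 GiB of weights, so the weights still dominate. Hence the gentler slope.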
Practical implication: if you're doing long-context work (RAG over large documents, extended reasoning chains), a 32B+ model may actually be faster than a 12B at high context depths - and will definitely be more consistent.
What Didn't Load
These models are installed and accessible through Ollama but failed to load in llama-bench (llama.cpp d23355af):
| Model | Size | Error |
|---|---|---|
| qwen3.5:35b | 23 GB | key not found: qwen35moe.rope.dimension_sections |
| qwen3.5:122b | 81 GB | Same - unsupported MoE format |
| qwen3-coder-next:q4_K_M | 51 GB | Failed to load |
| qwen3-coder-next:q8_0 | 84 GB | Failed to load |
| nemotron-3-super:120b | 86 GB | Wrong tensor shape |
| llama4:16x17b | 63 GB | Failed to load |
| gpt-oss:20b | 13 GB | Failed to load |
| gpt-oss:120b | 61 GB | Failed to load |
| glm-4.7-flash | 18 GB | Failed to load |
| qwen3-vl:32b | 20 GB | Failed to load |
The pattern: every failing model was released in the last 2–3 months. Ollama (v0.17.1) ships these GGUFs with its own bundled engine, which gains architecture support ahead of upstream llama.cpp. As llama.cpp catches up I'll re-run and update.
How to Reproduce
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
CUDACXX=/usr/local/cuda/bin/nvcc cmake -B build \
-DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j$(nproc)
# Baseline throughput
./build/bin/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -fa 1 \
-p 128,512,2048 -n 128,512 \
-r 3
# Context depth scaling
./build/bin/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -fa 1 \
-p 512 -n 128 -r 3 \
-d 0,8192,32768,65536
For loading Ollama's cached GGUFs directly, see Where Ollama Stores Your Models.
The result that stuck with me: at 65K context, the 70B model degraded less than the 12B. More weight in VRAM isn't always slower — sometimes it's more stable. The 10 models that didn't load are the ones I'm most curious about. I'll re-run this post when llama.cpp catches up to the newer architectures.