Local LLM Benchmarks on RTX PRO 6000 Blackwell (2026): 8 Models, Real Numbers
Prompt processing, token generation, and long-context benchmarks for 8 open-weight models on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), measured with llama-bench at context depths up to 65K tokens.

96 GB of VRAM. 8 models. 4 context depths. Here are the numbers, with the flags, the reps, and the surprises.
Multiple context lengths, flash attention on, 3 reps per measurement, full methodology. Numbers are for a single user running locally - batch size 1, no concurrency.
The Rig
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition |
| VRAM | 96 GB (95.6 GB / 97,247 MiB usable) |
| Driver | 580.126.09 · CUDA 13.2 |
| CPU | 30-core (VM host) |
| RAM | 172 GB |
Does CPU/RAM matter? For these tests, no. All models loaded fully into VRAM (zero CPU offloading, ngl=99). CPU only handles tokenization and the initial file load, both negligible. RAM gets used while reading the GGUF from disk into VRAM; irrelevant once loaded. The bottleneck is GPU memory bandwidth.
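One way to verify the zero-offload claim on your own box is to watch VRAM before and after a model loads. A minimal check (the query fields are standard nvidia-smi options):

```shell
# With -ngl 99, used VRAM should grow by roughly the GGUF file size plus the
# KV cache; if it grows by less than the file size, layers are spilling to RAM.
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader
```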
Methodology
Tool: llama-bench (llama.cpp d23355af / build b8352)
GPU layers: -ngl 99 (all layers on GPU, no CPU offload)
Flash attn: -fa 1 (enabled - improves pp speed 5–11%)
Repetitions: -r 3 (mean ± stddev reported)
Models: Ollama GGUF blobs, loaded directly
Batch size: BS=1 (single user, not server throughput)
All models are Ollama-pulled GGUFs accessed directly via their blob paths; see Where Ollama Stores Your Models for how that works. TTFT (time-to-first-token) is not measured here: llama-bench measures throughput, not wall-clock latency. For TTFT you'd use the llama.cpp server with benchmark_serving.py or a tool like llama-benchy. That's a separate post.
Reading the Labels
The notation used throughout this post:
| Label | Meaning |
|---|---|
| pp128 | Prompt processing at 128 input tokens - short prompt, e.g. a quick question |
| pp512 | Prompt processing at 512 input tokens - a paragraph or short document chunk |
| pp2048 | Prompt processing at 2,048 input tokens - a few pages of context |
| tg128 | Token generation of 128 output tokens - a brief response |
| tg512 | Token generation of 512 output tokens - a longer reply or code block |
| @ d8192 | Measured with 8,192 tokens already in context (KV cache pre-filled) |
| @ d32768 | Measured with 32K tokens in context |
| @ d65536 | Measured with 65K tokens in context |
Numbers are reported as mean ± standard deviation across three runs. A low standard deviation (e.g. ± 4) means a stable result; a wide one (e.g. ± 616) means the GPU is heating up or the system is under other load, so treat that number with more care. All speeds are in tokens per second (t/s); higher is better. pp (prefill) is compute-bound: the GPU performs dense matrix multiplications in parallel. tg (generation) is memory-bandwidth-bound: the weights are read from VRAM for every token.
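The ± notation can be reproduced from raw per-run values with a few lines of awk. The three numbers below are hypothetical, and sample stddev (n-1 denominator) is one common convention; llama-bench's exact formula may differ:

```shell
# Mean ± sample stddev over three hypothetical pp512 runs (t/s)
echo "8325 8237 8413" | awk '{
  n = NF
  for (i = 1; i <= n; i++) sum += $i
  mean = sum / n
  for (i = 1; i <= n; i++) ss += ($i - mean) ^ 2
  printf "%.0f ± %.0f\n", mean, sqrt(ss / (n - 1))
}'
# → 8325 ± 88
```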
Prompt Processing (t/s - higher is better)
How fast the model reads your input. Relevant for RAG, document analysis, long-context work.
| Model | Type | Params | Quant | Size | pp128 | pp512 | pp2048 |
|---|---|---|---|---|---|---|---|
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 5,467 ± 616 | 8,325 ± 88 | 7,903 ± 4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 4,059 ± 499 | 5,918 ± 93 | 5,612 ± 2 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 1,578 ± 37 | 3,274 ± 21 | 3,337 ± 11 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 2,074 ± 169 | 2,601 ± 36 | 2,530 ± 4 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 658 ± 10 | 1,448 ± 2 | 1,430 ± 2 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 1,129 ± 12 | 1,327 ± 9 | 1,295 ± 3 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 1,013 ± 3 | 1,173 ± 7 | 1,147 ± 3 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 1,020 ± 6 | 1,176 ± 7 | 1,147 ± 3 |
Token Generation (t/s - higher is better)
How fast the model writes output. This is the number you feel during chat.
| Model | Type | Params | Quant | Size | tg128 | tg512 |
|---|---|---|---|---|---|---|
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 158 ± 0.1 | 156 ± 0.4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 117 ± 0.2 | 116 ± 0.6 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 124 ± 0.7 | 125 ± 0.1 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 56 ± 0.2 | 55 ± 0.5 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 54 ± 0.2 | 52 ± 0.5 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 31 ± 0.1 | 31 ± 0.2 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 27 ± 0.1 | 26 ± 0.2 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 25 ± 0.1 | 24 ± 0.2 |
Context Depth Scaling
This is where 96 GB VRAM actually matters. Most consumer cards (24 GB) OOM before 32K on a 70B model. Here's how speed degrades as the KV cache fills up.
| Model | Context Depth | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| mistral-nemo:12b | 0 | 7,898 ± 208 | 150 ± 0.2 |
| mistral-nemo:12b | 8K | 6,490 ± 218 | 128 ± 0.2 |
| mistral-nemo:12b | 32K | 3,603 ± 66 | 87 ± 0.1 |
| mistral-nemo:12b | 65K | 1,600 ± 14 | 61 ± 0.1 |
| qwen2.5:32b | 0 | 2,548 ± 19 | 54 ± 0.1 |
| qwen2.5:32b | 8K | 2,137 ± 37 | 49 ± 0.1 |
| qwen2.5:32b | 32K | 1,199 ± 12 | 40 ± 0.1 |
| qwen2.5:32b | 65K | 551 ± 2 | 32 ± 0.1 |
| llama3.1:70b | 0 | 1,166 ± 3 | 27 ± 0.1 |
| llama3.1:70b | 8K | 1,040 ± 9 | 25 ± 0.1 |
| llama3.1:70b | 32K | 691 ± 5 | 22 ± 0.1 |
| llama3.1:70b | 65K | 377 ± 1 | 19 ± 0.1 |
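The degradation percentages quoted in the takeaways come straight from this table; a one-liner reproduces them, with the tg128 values hard-coded from the depth-0 and depth-65K rows:

```shell
# Fraction of tg128 speed lost between depth 0 and depth 65536
awk 'BEGIN {
  printf "mistral-nemo:12b: %.0f%%\n", (1 - 61 / 150) * 100
  printf "llama3.1:70b:     %.0f%%\n", (1 - 19 / 27) * 100
}'
```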
What the Numbers Show
Flash attention matters, especially for prompt processing.
Enabling -fa 1 lifted pp512 by 5-11% across all dense models. mistral-nemo gained 9%, qwen2.5-coder gained 10.8%. If you're benchmarking without it, your numbers are understated. MoE models (qwen3-next, mixtral) see smaller gains since they're already structured differently at the kernel level.
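The flash-attention delta is easy to reproduce: run the same benchmark with the flag toggled and compare the pp512 rows of the two runs (the model path is a placeholder):

```shell
# A/B the flash-attention flag on one model
MODEL=/path/to/model.gguf
for fa in 0 1; do
  ./build/bin/llama-bench -m "$MODEL" -ngl 99 -fa "$fa" -p 512 -n 128 -r 3
done
```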
MoE breaks the size-speed relationship.
qwen3-next:80b has 80B total parameters, activates ~3B per token, and generates at 124 t/s. That's faster than qwen2.5-coder:14b (117 t/s) and nearly as fast as mistral-nemo:12b (158 t/s). You're getting an 80B-quality model at 14B-class throughput.
Two things make this weirder. First, its token generation is flat: tg128 and tg512 are virtually identical (124.33 vs 124.89). Dense models always degrade slightly; MoE doesn't, because the active parameter count stays constant regardless of sequence length. Second, its prompt processing actually increases at larger prompt sizes: pp128 to pp512 to pp2048 goes 1,578 to 3,274 to 3,337. The MoE routing gets more efficient with larger batches.
Small models degrade harder at long context.
At 65K context depth, mistral-nemo:12b loses 59% of its token generation speed (150 to 61 t/s). llama3.1:70b only loses 30% (27 to 19 t/s).
The reason: KV cache size scales with context length and attention head count, not model size. For smaller models, the KV cache becomes a larger fraction of total memory bandwidth. A 70B model's weights dominate the bandwidth budget regardless of context, so the KV cache overhead is proportionally smaller.
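That reasoning can be sanity-checked with shell arithmetic. A rough per-token KV cache size is 2 (K and V) × layers × KV heads × head dim × bytes per element; the layer and head counts below are the published configs for these two architectures (an assumption worth verifying against the GGUF metadata), assuming an unquantized fp16 cache:

```shell
# Per-token KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * 2 (fp16)
# llama3.1:70b: 80 layers, 8 KV heads (GQA), head_dim 128
big=$(( 2 * 80 * 8 * 128 * 2 ))      # 327,680 bytes/token
# mistral-nemo:12b: 40 layers, 8 KV heads, head_dim 128
small=$(( 2 * 40 * 8 * 128 * 2 ))    # 163,840 bytes/token
# Total cache at a 65,536-token depth, in GiB
echo "llama3.1:70b:     $(( big   * 65536 / 1073741824 )) GiB"
echo "mistral-nemo:12b: $(( small * 65536 / 1073741824 )) GiB"
```

Under those assumptions, at 65K the 12B's ~10 GiB cache outweighs its 6.6 GiB of weights, while the 70B's ~20 GiB cache is still only about half its 39.6 GiB weight footprint - exactly the proportional-overhead argument above.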
For long-context work (RAG over large documents, extended reasoning chains), a 32B+ model may actually be faster than a 12B at high context depths, and it'll definitely be more consistent.
What Didn't Load
These models are installed in Ollama but failed to load in llama-bench (llama.cpp d23355af):
| Model | Size | Error |
|---|---|---|
| qwen3.5:35b | 23 GB | key not found: qwen35moe.rope.dimension_sections |
| qwen3.5:122b | 81 GB | Same - unsupported MoE format |
| qwen3-coder-next:q4_K_M | 51 GB | Failed to load |
| qwen3-coder-next:q8_0 | 84 GB | Failed to load |
| nemotron-3-super:120b | 86 GB | Wrong tensor shape |
| llama4:16x17b | 63 GB | Failed to load |
| gpt-oss:20b | 13 GB | Failed to load |
| gpt-oss:120b | 61 GB | Failed to load |
| glm-4.7-flash | 18 GB | Failed to load |
| qwen3-vl:32b | 20 GB | Failed to load |
The pattern: every model released in the last 2-3 months. Ollama (v0.17.1) bundles GGUFs ahead of llama.cpp's upstream support. As llama.cpp adds support for these architectures I'll re-run and update.
How to Reproduce
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
CUDACXX=/usr/local/cuda/bin/nvcc cmake -B build \
-DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j$(nproc)
# Baseline throughput
./build/bin/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -fa 1 \
-p 128,512,2048 -n 128,512 \
-r 3
# Context depth scaling
./build/bin/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -fa 1 \
-p 512 -n 128 -r 3 \
-d 0,8192,32768,65536
For loading Ollama's cached GGUFs directly, see Where Ollama Stores Your Models.
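To sweep several models in one pass, wrap the depth-scaling command in a loop; the blob glob below is a placeholder for wherever your GGUFs live:

```shell
# Run the context-depth sweep for each model and keep a per-model log
for MODEL in /path/to/blobs/sha256-*; do
  ./build/bin/llama-bench -m "$MODEL" -ngl 99 -fa 1 \
    -p 512 -n 128 -r 3 -d 0,8192,32768,65536 \
    | tee "bench-$(basename "$MODEL").log"
done
```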
The result that stuck with me: at 65K context, the 70B model degrades less than the 12B. Bigger isn't always slower where it counts; sometimes it's more stable. The 10 models that didn't load are the ones I'm most curious about. I'll re-run the benchmarks and update this post when llama.cpp catches up to the newer architectures.