AI · March 17, 2026 · 11 min read

Local LLM Benchmarks on RTX PRO 6000 Blackwell (2026): 8 Models, Real Numbers

Prompt processing, token generation, and long-context benchmarks for 8 open-weight models on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), measured with llama-bench at context depths up to 65K tokens.

benchmarks · llm · local-ai · rtx-pro-6000 · homelab · llama.cpp · gguf · qwen · llama · mistral

Local LLM Benchmarks on RTX PRO 6000

96 GB of VRAM. 8 models. 4 context depths. Here are the numbers — with the flags, the reps, and the surprises.

Multiple context lengths, flash attention on, 3 reps per measurement, full methodology. Numbers are for a single user running locally — batch size 1, no concurrency.


The Rig

GPU:     NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
VRAM:    96 GB (95.6 GB / 97,247 MiB usable)
Driver:  580.126.09 · CUDA 13.2
CPU:     30-core (VM host)
RAM:     172 GB

Does CPU/RAM matter? For these tests, no. All models loaded fully into VRAM (zero CPU offloading, ngl=99). CPU only handles tokenization and the initial file load - both negligible. RAM is used while reading the GGUF from disk into VRAM; irrelevant once loaded. The bottleneck is GPU memory bandwidth.
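Since everything must fit in VRAM for ngl=99 to hold, a quick back-of-envelope check before pulling a model can save a long download. A minimal sketch with illustrative numbers - the KV budget here is a rough assumption, not a measurement:

```shell
# Back-of-envelope fit check: quantized weights + a KV-cache budget must
# stay under usable VRAM. Numbers are illustrative, not measured.
MODEL_GIB=40      # e.g. a 70B-class Q4_K_M from the tables below
KV_GIB=20         # rough KV-cache budget for a deep context (assumption)
VRAM_GIB=95
if [ $((MODEL_GIB + KV_GIB)) -le "$VRAM_GIB" ]; then
  echo "fits in VRAM, ngl=99 is safe"
else
  echo "will spill to CPU"
fi
```

If the sum exceeds usable VRAM, llama.cpp will either offload layers to CPU (tanking decode speed) or fail to allocate the KV cache at the requested context length.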


Methodology

Tool:         llama-bench (llama.cpp d23355af / build b8352)
GPU layers:   -ngl 99       (all layers on GPU, no CPU offload)
Flash attn:   -fa 1         (enabled - improves pp speed 5–11%)
Repetitions:  -r 3          (mean ± stddev reported)
Models:       Ollama GGUF blobs, loaded directly
Batch size:   BS=1 (single user, not server throughput)

All models are Ollama-pulled GGUFs accessed directly via blob path - see Where Ollama Stores Your Models for how that works.

Note: TTFT (time-to-first-token) is not measured here - llama-bench measures throughput, not wall-clock latency. For TTFT you'd use the llama.cpp server + benchmark_serving.py or a tool like llama-benchy. That's a separate post.
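That said, you can derive a crude lower bound on first-token latency from throughput alone: time to prefill the prompt plus one decode step. A sketch using mistral-nemo's pp2048 and tg128 figures from the tables below; it ignores tokenizer, sampler, and server overhead, so real TTFT will be somewhat higher:

```shell
# Crude TTFT proxy: prefill time for the prompt + one decode step.
# 7,903 t/s = mistral-nemo pp2048; 158 t/s = its tg128.
awk 'BEGIN { printf "~%.0f ms to first token for a 2,048-token prompt\n",
             1000 * (2048/7903 + 1/158) }'
```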


Reading the Labels

The notation used throughout this post:

| Label | Meaning |
| --- | --- |
| pp128 | Prompt processing at 128 input tokens - short prompt, e.g. a quick question |
| pp512 | Prompt processing at 512 input tokens - a paragraph or short document chunk |
| pp2048 | Prompt processing at 2,048 input tokens - a few pages of context |
| tg128 | Token generation of 128 output tokens - a brief response |
| tg512 | Token generation of 512 output tokens - a longer reply or code block |
| @ d8192 | Measured with 8,192 tokens already in context (KV cache pre-filled) |
| @ d32768 | Measured with 32K tokens in context |
| @ d65536 | Measured with 65K tokens in context |

All speeds are in tokens per second (t/s). Higher is better.

Numbers are reported as mean ± stddev across 3 runs - e.g. 7,898 ± 208 means the average was 7,898 t/s and results varied by ±208 t/s between runs. A tight stddev (like ± 4) means the result is stable and reliable. A wide one (like ± 616) means the GPU was warming up or there was system jitter - treat those numbers with more caution.

pp (prefill) is compute-bound - the GPU is doing dense matrix multiplications in parallel. tg (decode) is memory-bandwidth-bound - weights get loaded from VRAM on every single token. That's why a massive GPU like this can prefill at 8,000 t/s but only generate at 150 t/s on the same small model.
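A rough way to see the decode ceiling: every generated token re-reads all the weights, so peak tg is about memory bandwidth divided by weight size. A sketch assuming roughly 1.8 TB/s for this card - an assumed spec, check the datasheet for your exact SKU:

```shell
# Bandwidth-bound decode ceiling: tg_max ≈ memory bandwidth / weight bytes.
# BW_GBPS is an assumed spec for this card; 1.0737 converts GiB to GB.
BW_GBPS=1800
MODEL_GIB=6.6   # mistral-nemo:12b Q4_0 from the tables below
awk -v bw="$BW_GBPS" -v sz="$MODEL_GIB" \
  'BEGIN { printf "theoretical ceiling: ~%.0f t/s\n", bw / (sz * 1.0737) }'
```

For mistral-nemo that works out to roughly 254 t/s; the measured 158 t/s is about 62% of the ceiling, with the remainder going to KV-cache reads, activations, and kernel overhead.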


Prompt Processing (t/s - higher is better)

How fast the model reads your input. Relevant for RAG, document analysis, long-context work.

| Model | Type | Params | Quant | Size | pp128 | pp512 | pp2048 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 5,467 ± 616 | 8,325 ± 88 | 7,903 ± 4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 4,059 ± 499 | 5,918 ± 93 | 5,612 ± 2 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 1,578 ± 37 | 3,274 ± 21 | 3,337 ± 11 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 2,074 ± 169 | 2,601 ± 36 | 2,530 ± 4 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 658 ± 10 | 1,448 ± 2 | 1,430 ± 2 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 1,129 ± 12 | 1,327 ± 9 | 1,295 ± 3 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 1,013 ± 3 | 1,173 ± 7 | 1,147 ± 3 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 1,020 ± 6 | 1,176 ± 7 | 1,147 ± 3 |
Winner: mistral-nemo:12b at 8,325 t/s pp512 - by a wide margin. It reads a 512-token prompt in under 62 ms. For anything RAG-heavy or long-context, this is your model if the quality is acceptable. The 70B+ dense models are essentially tied - qwen2.5:72b and llama3.1:70b are within 3 t/s of each other. The standout surprise: qwen3-next:80b (an 80B MoE) outpaces qwen2.5:32b at pp2048 - the MoE routing gets more efficient as context grows.
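The "under 62 ms" figure falls straight out of the table - wall time is just token count divided by throughput:

```shell
# Prefill wall-time estimate: tokens / (t/s), using the pp512 figure above.
awk 'BEGIN { printf "%.1f ms\n", 1000 * 512 / 8325 }'
```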

Token Generation (t/s - higher is better)

How fast the model writes output. The number you feel during chat.

| Model | Type | Params | Quant | Size | tg128 | tg512 |
| --- | --- | --- | --- | --- | --- | --- |
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 158 ± 0.1 | 156 ± 0.4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 117 ± 0.2 | 116 ± 0.6 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 124 ± 0.7 | 125 ± 0.1 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 56 ± 0.2 | 55 ± 0.5 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 54 ± 0.2 | 52 ± 0.5 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 31 ± 0.1 | 31 ± 0.2 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 27 ± 0.1 | 26 ± 0.2 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 25 ± 0.1 | 24 ± 0.2 |
Winner: mistral-nemo:12b at 158 t/s - comfortably the fastest for output. But the real story is qwen3-next:80b, which generates at 124 t/s despite being an 80B model. That beats qwen2.5-coder:14b (117 t/s). If you need a large, capable model that still feels snappy in chat, qwen3-next is the pick. The 70B+ dense models (llama3.1, llama2, qwen2.5:72b) all land between 24 and 31 t/s - noticeably slower in conversation. Also note the near-zero stddev on token generation (± 0.1-0.7): once loaded, decode speed is rock solid.

Context Depth Scaling

This is where 96 GB VRAM makes a real difference. Most consumer cards (24 GB) OOM before 32K on a 70B model. Here's how speed degrades as the KV cache fills up.

| Model | Context Depth | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- |
| mistral-nemo:12b | 0 | 7,898 ± 208 | 150 ± 0.2 |
| mistral-nemo:12b | 8K | 6,490 ± 218 | 128 ± 0.2 |
| mistral-nemo:12b | 32K | 3,603 ± 66 | 87 ± 0.1 |
| mistral-nemo:12b | 65K | 1,600 ± 14 | 61 ± 0.1 |
| qwen2.5:32b | 0 | 2,548 ± 19 | 54 ± 0.1 |
| qwen2.5:32b | 8K | 2,137 ± 37 | 49 ± 0.1 |
| qwen2.5:32b | 32K | 1,199 ± 12 | 40 ± 0.1 |
| qwen2.5:32b | 65K | 551 ± 2 | 32 ± 0.1 |
| llama3.1:70b | 0 | 1,166 ± 3 | 27 ± 0.1 |
| llama3.1:70b | 8K | 1,040 ± 9 | 25 ± 0.1 |
| llama3.1:70b | 32K | 691 ± 5 | 22 ± 0.1 |
| llama3.1:70b | 65K | 377 ± 1 | 19 ± 0.1 |
Winner: llama3.1:70b for long-context work. It degrades the least - its generation speed drops only 30% at 65K vs baseline (27 → 19 t/s). mistral-nemo:12b loses 59% of its generation speed by 65K, dropping from 150 t/s to 61 t/s; qwen2.5:32b loses 41%. The 70B model's sheer weight size dominates the VRAM bandwidth budget, so the growing KV cache hurts it less proportionally. If you're summarising large documents or running multi-turn agents with long history, a bigger dense model holds up better than you'd expect.
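The degradation percentages quoted above come straight from the tg128 column - percent drop from depth 0 to 65K:

```shell
# Percent drop in tg128 from context depth 0 to 65K, using the table values.
drop() { awk -v b="$1" -v d="$2" 'BEGIN { printf "-%.0f%%\n", 100*(b-d)/b }'; }
drop 150 61   # mistral-nemo:12b
drop 54 32    # qwen2.5:32b
drop 27 19    # llama3.1:70b
```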

What the Numbers Show

Flash attention matters - especially for prompt processing.

Enabling -fa 1 lifted pp512 by 5–11% across all dense models. mistral-nemo gained 9%, qwen2.5-coder gained 10.8%. If you're benchmarking without it, your numbers are understated. MoE models (qwen3-next, mixtral) see smaller gains - they're already structured differently at the kernel level.

MoE breaks the size-speed relationship.

qwen3-next:80b has 80B total parameters, activates ~3B per token, and generates at 124 t/s - faster than qwen2.5-coder:14b (117 t/s) and nearly as fast as mistral-nemo:12b (158 t/s). You're getting an 80B-quality model at 14B-class throughput.

Two things make this weirder. First, its token generation is flat across sequence length - tg128 and tg512 are virtually identical (124.33 vs 124.89). Dense models always degrade slightly. MoE doesn't because the active parameter count stays constant regardless of how much you've generated. Second, its prompt processing actually increases at longer contexts: pp128 → pp512 → pp2048 goes 1,578 → 3,274 → 3,337. The MoE routing becomes more efficient with larger batches.

Smaller models degrade harder at long context.

This is the counter-intuitive part. At 65K context depth, mistral-nemo:12b loses 59% of its token generation speed (150 → 61 t/s). llama3.1:70b only loses 30% (27 → 19 t/s).

The reason: KV cache size scales with context length and number of attention heads - not model size per se. For smaller models the KV cache becomes a larger fraction of total memory bandwidth relative to weight loading. The 70B model's weights dominate the bandwidth budget regardless of context, so the KV cache overhead is proportionally smaller.
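You can put numbers on this. Per-token KV size is 2 (K and V) x layers x KV heads x head dim x bytes per element. The shapes below (80 vs 40 layers, 8 KV heads via GQA, head_dim 128, fp16 cache - llama-bench's default) are taken from the published architectures; treat them as assumptions and check your model's GGUF metadata:

```shell
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Assumed shapes: llama3.1:70b = 80 layers, mistral-nemo = 40 layers;
# both use 8 KV heads (GQA), head_dim 128, fp16 (2 bytes) cache.
awk 'BEGIN {
  kib = 1024; gib = 1024^3
  l70  = 2 * 80 * 8 * 128 * 2
  nemo = 2 * 40 * 8 * 128 * 2
  printf "llama3.1:70b  %3d KiB/token, %2.0f GiB at 65K\n", l70/kib,  l70*65536/gib
  printf "mistral-nemo  %3d KiB/token, %2.0f GiB at 65K\n", nemo/kib, nemo*65536/gib
}'
```

At 65K tokens, mistral-nemo's ~10 GiB of KV cache outweighs its 6.6 GiB of weights, so each decode step reads more cache than model. For the 70B, the ~20 GiB cache is still only about half its 39.6 GiB of weights - which is exactly why the small model degrades hardest.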

Practical implication: if you're doing long-context work (RAG over large documents, extended reasoning chains), a 32B+ model may actually be faster than a 12B at high context depths - and will definitely be more consistent.


What Didn't Load

These models are installed and accessible through Ollama but failed to load in llama-bench (llama.cpp d23355af):

| Model | Size | Error |
| --- | --- | --- |
| qwen3.5:35b | 23 GB | key not found: qwen35moe.rope.dimension_sections |
| qwen3.5:122b | 81 GB | Same - unsupported MoE format |
| qwen3-coder-next:q4_K_M | 51 GB | Failed to load |
| qwen3-coder-next:q8_0 | 84 GB | Failed to load |
| nemotron-3-super:120b | 86 GB | Wrong tensor shape |
| llama4:16x17b | 63 GB | Failed to load |
| gpt-oss:20b | 13 GB | Failed to load |
| gpt-oss:120b | 61 GB | Failed to load |
| glm-4.7-flash | 18 GB | Failed to load |
| qwen3-vl:32b | 20 GB | Failed to load |

The pattern: every model released in the last 2–3 months. Ollama (v0.17.1) bundles GGUFs ahead of llama.cpp's upstream support. As llama.cpp adds support for these architectures I'll re-run and update.


How to Reproduce

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp

CUDACXX=/usr/local/cuda/bin/nvcc cmake -B build \
  -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j$(nproc)

# Baseline throughput
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -fa 1 \
  -p 128,512,2048 -n 128,512 \
  -r 3

# Context depth scaling
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -fa 1 \
  -p 512 -n 128 -r 3 \
  -d 0,8192,32768,65536

For loading Ollama's cached GGUFs directly, see Where Ollama Stores Your Models.


The result that stuck with me: at 65K context, the 70B model degraded less than the 12B. More weight in VRAM isn't always slower — sometimes it's more stable. The 10 models that didn't load are the ones I'm most curious about. I'll re-run this post when llama.cpp catches up to the newer architectures.