AI · March 17, 2026 · 11 min read

Local LLM Benchmarks on RTX PRO 6000 Blackwell (2026): 8 Models, Real Numbers

Prompt processing, token generation, and long-context benchmarks for 8 open-weight models on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), measured with llama-bench at context depths up to 65K tokens.

benchmarks · llm · local-ai · rtx-pro-6000 · homelab · llama.cpp · gguf · qwen · llama · mistral

Local LLM Benchmarks on RTX PRO 6000

96 GB of VRAM. 8 models. 4 context depths. Here are the numbers — with the flags, the reps, and the surprises.

Multiple context lengths, flash attention on, 3 reps per measurement, full methodology. Numbers are for a single user running locally — batch size 1, no concurrency.


The Rig

GPU:     NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
VRAM:    96 GB (95.6 GB / 97,247 MiB usable)
Driver:  580.126.09 · CUDA 13.2
CPU:     30-core (VM host)
RAM:     172 GB

Does CPU/RAM matter? For these tests, no. All models loaded fully into VRAM (zero CPU offloading, ngl=99). CPU only handles tokenization and the initial file load - both negligible. RAM is used while reading the GGUF from disk into VRAM; irrelevant once loaded. The bottleneck is GPU memory bandwidth.
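Since everything must fit in VRAM for ngl=99 to hold, a quick back-of-envelope check before pulling a model can save a long download. A minimal sketch with illustrative numbers - the KV budget here is a rough assumption, not a measurement:

```shell
# Back-of-envelope fit check: quantized weights + a KV-cache budget must
# stay under usable VRAM. Numbers are illustrative, not measured.
MODEL_GIB=40      # e.g. a 70B-class Q4_K_M from the tables below
KV_GIB=20         # rough KV-cache budget for a deep context (assumption)
VRAM_GIB=95
if [ $((MODEL_GIB + KV_GIB)) -le "$VRAM_GIB" ]; then
  echo "fits in VRAM, ngl=99 is safe"
else
  echo "will spill to CPU"
fi
```

If the sum exceeds usable VRAM, llama.cpp will either offload layers to CPU (tanking decode speed) or fail to allocate the KV cache at the requested context length.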


Methodology

Tool:         llama-bench (llama.cpp d23355af / build b8352)
GPU layers:   -ngl 99       (all layers on GPU, no CPU offload)
Flash attn:   -fa 1         (enabled - improves pp speed 5–11%)
Repetitions:  -r 3          (mean ± stddev reported)
Models:       Ollama GGUF blobs, loaded directly
Batch size:   BS=1 (single user, not server throughput)

All models are Ollama-pulled GGUFs accessed directly via blob path - see Where Ollama Stores Your Models for how that works.

Note: TTFT (time-to-first-token) is not measured here - llama-bench measures throughput, not wall-clock latency. For TTFT you'd use the llama.cpp server + benchmark_serving.py or a tool like llama-benchy. That's a separate post.
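That said, you can derive a crude lower bound on first-token latency from throughput alone: time to prefill the prompt plus one decode step. A sketch using mistral-nemo's pp2048 and tg128 figures from the tables below; it ignores tokenizer, sampler, and server overhead, so real TTFT will be somewhat higher:

```shell
# Crude TTFT proxy: prefill time for the prompt + one decode step.
# 7,903 t/s = mistral-nemo pp2048; 158 t/s = its tg128.
awk 'BEGIN { printf "~%.0f ms to first token for a 2,048-token prompt\n",
             1000 * (2048/7903 + 1/158) }'
```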


Reading the Labels

The notation used throughout this post:

| Label | Meaning |
| --- | --- |
| pp128 | Prompt processing at 128 input tokens - short prompt, e.g. a quick question |
| pp512 | Prompt processing at 512 input tokens - a paragraph or short document chunk |
| pp2048 | Prompt processing at 2,048 input tokens - a few pages of context |
| tg128 | Token generation of 128 output tokens - a brief response |
| tg512 | Token generation of 512 output tokens - a longer reply or code block |
| @ d8192 | Measured with 8,192 tokens already in context (KV cache pre-filled) |
| @ d32768 | Measured with 32K tokens in context |
| @ d65536 | Measured with 65K tokens in context |

All speeds are in tokens per second (t/s). Higher is better.

Numbers are reported as mean ± stddev across 3 runs - e.g. 7,898 ± 208 means the average was 7,898 t/s and results varied by ±208 t/s between runs. A tight stddev (like ± 4) means the result is stable and reliable. A wide one (like ± 616) means the GPU was warming up or there was system jitter - treat those numbers with more caution.

pp (prefill) is compute-bound - the GPU is doing dense matrix multiplications in parallel. tg (decode) is memory-bandwidth-bound - weights get loaded from VRAM on every single token. That's why a massive GPU like this can prefill at 8,000 t/s but only generate at 150 t/s on the same small model.
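A rough way to see the decode ceiling: every generated token re-reads all the weights, so peak tg is about memory bandwidth divided by weight size. A sketch assuming roughly 1.8 TB/s for this card - an assumed spec, check the datasheet for your exact SKU:

```shell
# Bandwidth-bound decode ceiling: tg_max ≈ memory bandwidth / weight bytes.
# BW_GBPS is an assumed spec for this card; 1.0737 converts GiB to GB.
BW_GBPS=1800
MODEL_GIB=6.6   # mistral-nemo:12b Q4_0 from the tables below
awk -v bw="$BW_GBPS" -v sz="$MODEL_GIB" \
  'BEGIN { printf "theoretical ceiling: ~%.0f t/s\n", bw / (sz * 1.0737) }'
```

For mistral-nemo that works out to roughly 254 t/s; the measured 158 t/s is about 62% of the ceiling, with the remainder going to KV-cache reads, activations, and kernel overhead.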


Prompt Processing (t/s - higher is better)

How fast the model reads your input. Relevant for RAG, document analysis, long-context work.

| Model | Type | Params | Quant | Size | pp128 | pp512 | pp2048 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 5,467 ± 616 | 8,325 ± 88 | 7,903 ± 4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 4,059 ± 499 | 5,918 ± 93 | 5,612 ± 2 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 1,578 ± 37 | 3,274 ± 21 | 3,337 ± 11 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 2,074 ± 169 | 2,601 ± 36 | 2,530 ± 4 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 658 ± 10 | 1,448 ± 2 | 1,430 ± 2 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 1,129 ± 12 | 1,327 ± 9 | 1,295 ± 3 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 1,013 ± 3 | 1,173 ± 7 | 1,147 ± 3 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 1,020 ± 6 | 1,176 ± 7 | 1,147 ± 3 |
Winner: mistral-nemo:12b at 8,325 t/s pp512 - by a wide margin. It reads a 512-token prompt in under 62 ms. For anything RAG-heavy or long-context, this is your model if the quality is acceptable. The 70B+ dense models are essentially tied - qwen2.5:72b and llama3.1:70b are within 3 t/s of each other. The standout surprise: qwen3-next:80b (an 80B MoE) outpaces qwen2.5:32b at pp2048 - the MoE routing gets more efficient as context grows.
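The "under 62 ms" figure falls straight out of the table - wall time is just token count divided by throughput:

```shell
# Prefill wall-time estimate: tokens / (t/s), using the pp512 figure above.
awk 'BEGIN { printf "%.1f ms\n", 1000 * 512 / 8325 }'
```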

Token Generation (t/s - higher is better)

How fast the model writes output. The number you feel during chat.

| Model | Type | Params | Quant | Size | tg128 | tg512 |
| --- | --- | --- | --- | --- | --- | --- |
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 158 ± 0.1 | 156 ± 0.4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 117 ± 0.2 | 116 ± 0.6 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 124 ± 0.7 | 125 ± 0.1 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 56 ± 0.2 | 55 ± 0.5 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 54 ± 0.2 | 52 ± 0.5 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 31 ± 0.1 | 31 ± 0.2 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 27 ± 0.1 | 26 ± 0.2 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 25 ± 0.1 | 24 ± 0.2 |
Winner: mistral-nemo:12b at 158 t/s - comfortably the fastest for output. But the real story is qwen3-next:80b, which generates at 124 t/s despite being an 80B model. That beats qwen2.5-coder:14b (117 t/s). If you need a large, capable model that still feels snappy in chat, qwen3-next is the pick. The 70B+ dense models (llama3.1, llama2, qwen2.5:72b) all land between 24 and 31 t/s - noticeably slower in conversation. Also note the near-zero stddev on token generation (± 0.1-0.7): once loaded, decode speed is rock solid.

Context Depth Scaling

This is where 96 GB VRAM makes a real difference. Most consumer cards (24 GB) OOM before 32K on a 70B model. Here's how speed degrades as the KV cache fills up.

| Model | Context Depth | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- |
| mistral-nemo:12b | 0 | 7,898 ± 208 | 150 ± 0.2 |
| mistral-nemo:12b | 8K | 6,490 ± 218 | 128 ± 0.2 |
| mistral-nemo:12b | 32K | 3,603 ± 66 | 87 ± 0.1 |
| mistral-nemo:12b | 65K | 1,600 ± 14 | 61 ± 0.1 |
| qwen2.5:32b | 0 | 2,548 ± 19 | 54 ± 0.1 |
| qwen2.5:32b | 8K | 2,137 ± 37 | 49 ± 0.1 |
| qwen2.5:32b | 32K | 1,199 ± 12 | 40 ± 0.1 |
| qwen2.5:32b | 65K | 551 ± 2 | 32 ± 0.1 |
| llama3.1:70b | 0 | 1,166 ± 3 | 27 ± 0.1 |
| llama3.1:70b | 8K | 1,040 ± 9 | 25 ± 0.1 |
| llama3.1:70b | 32K | 691 ± 5 | 22 ± 0.1 |
| llama3.1:70b | 65K | 377 ± 1 | 19 ± 0.1 |
Winner: llama3.1:70b for long-context work. It degrades the least - its generation speed drops only 30% at 65K vs baseline (27 → 19 t/s). mistral-nemo:12b loses 59% of its generation speed by 65K, dropping from 150 t/s to 61 t/s; qwen2.5:32b loses 41%. The 70B model's sheer weight size dominates the VRAM bandwidth budget, so the growing KV cache hurts it less proportionally. If you're summarising large documents or running multi-turn agents with long history, a bigger dense model holds up better than you'd expect.
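The degradation percentages quoted above come straight from the tg128 column - percent drop from depth 0 to 65K:

```shell
# Percent drop in tg128 from context depth 0 to 65K, using the table values.
drop() { awk -v b="$1" -v d="$2" 'BEGIN { printf "-%.0f%%\n", 100*(b-d)/b }'; }
drop 150 61   # mistral-nemo:12b
drop 54 32    # qwen2.5:32b
drop 27 19    # llama3.1:70b
```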

What the Numbers Show

Flash attention matters - especially for prompt processing.

Enabling -fa 1 lifted pp512 by 5–11% across all dense models. mistral-nemo gained 9%, qwen2.5-coder gained 10.8%. If you're benchmarking without it, your numbers are understated. MoE models (qwen3-next, mixtral) see smaller gains - they're already structured differently at the kernel level.

MoE breaks the size-speed relationship.

qwen3-next:80b has 80B total parameters, activates ~3B per token, and generates at 124 t/s - faster than qwen2.5-coder:14b (117 t/s) and nearly as fast as mistral-nemo:12b (158 t/s). You're getting an 80B-quality model at 14B-class throughput.

Two things make this weirder. First, its token generation is flat across sequence length - tg128 and tg512 are virtually identical (124.33 vs 124.89). Dense models always degrade slightly. MoE doesn't because the active parameter count stays constant regardless of how much you've generated. Second, its prompt processing actually increases at longer contexts: pp128 → pp512 → pp2048 goes 1,578 → 3,274 → 3,337. The MoE routing becomes more efficient with larger batches.

Smaller models degrade harder at long context.

This is the counter-intuitive part. At 65K context depth, mistral-nemo:12b loses 59% of its token generation speed (150 → 61 t/s). llama3.1:70b only loses 30% (27 → 19 t/s).

The reason: KV cache size scales with context length and number of attention heads - not model size per se. For smaller models the KV cache becomes a larger fraction of total memory bandwidth relative to weight loading. The 70B model's weights dominate the bandwidth budget regardless of context, so the KV cache overhead is proportionally smaller.
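You can put numbers on this. Per-token KV size is 2 (K and V) x layers x KV heads x head dim x bytes per element. The shapes below (80 vs 40 layers, 8 KV heads via GQA, head_dim 128, fp16 cache - llama-bench's default) are taken from the published architectures; treat them as assumptions and check your model's GGUF metadata:

```shell
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Assumed shapes: llama3.1:70b = 80 layers, mistral-nemo = 40 layers;
# both use 8 KV heads (GQA), head_dim 128, fp16 (2 bytes) cache.
awk 'BEGIN {
  kib = 1024; gib = 1024^3
  l70  = 2 * 80 * 8 * 128 * 2
  nemo = 2 * 40 * 8 * 128 * 2
  printf "llama3.1:70b  %3d KiB/token, %2.0f GiB at 65K\n", l70/kib,  l70*65536/gib
  printf "mistral-nemo  %3d KiB/token, %2.0f GiB at 65K\n", nemo/kib, nemo*65536/gib
}'
```

At 65K tokens, mistral-nemo's ~10 GiB of KV cache outweighs its 6.6 GiB of weights, so each decode step reads more cache than model. For the 70B, the ~20 GiB cache is still only about half its 39.6 GiB of weights - which is exactly why the small model degrades hardest.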

Practical implication: if you're doing long-context work (RAG over large documents, extended reasoning chains), a 32B+ model may actually be faster than a 12B at high context depths - and will definitely be more consistent.


What Didn't Load

These models are installed and accessible through Ollama but failed to load in llama-bench (llama.cpp d23355af):

| Model | Size | Error |
| --- | --- | --- |
| qwen3.5:35b | 23 GB | key not found: qwen35moe.rope.dimension_sections |
| qwen3.5:122b | 81 GB | Same - unsupported MoE format |
| qwen3-coder-next:q4_K_M | 51 GB | Failed to load |
| qwen3-coder-next:q8_0 | 84 GB | Failed to load |
| nemotron-3-super:120b | 86 GB | Wrong tensor shape |
| llama4:16x17b | 63 GB | Failed to load |
| gpt-oss:20b | 13 GB | Failed to load |
| gpt-oss:120b | 61 GB | Failed to load |
| glm-4.7-flash | 18 GB | Failed to load |
| qwen3-vl:32b | 20 GB | Failed to load |

The pattern: every model released in the last 2–3 months. Ollama (v0.17.1) bundles GGUFs ahead of llama.cpp's upstream support. As llama.cpp adds support for these architectures I'll re-run and update.


How to Reproduce

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp

CUDACXX=/usr/local/cuda/bin/nvcc cmake -B build \
  -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j$(nproc)

# Baseline throughput
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -fa 1 \
  -p 128,512,2048 -n 128,512 \
  -r 3

# Context depth scaling
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -fa 1 \
  -p 512 -n 128 -r 3 \
  -d 0,8192,32768,65536

For loading Ollama's cached GGUFs directly, see Where Ollama Stores Your Models.


The result that stuck with me: at 65K context, the 70B model degraded less than the 12B. More weight in VRAM isn't always slower — sometimes it's more stable. The 10 models that didn't load are the ones I'm most curious about. I'll re-run this post when llama.cpp catches up to the newer architectures.