AI · March 17, 2026 · 10 min read

Local LLM Benchmarks on RTX PRO 6000 Blackwell (2026): 8 Models, Real Numbers

Prompt processing, token generation, and long-context benchmarks for 8 open-weight models on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), measured with llama-bench at context depths up to 65K tokens.

benchmarks · llm · local-ai · rtx-pro-6000 · homelab · llama.cpp · gguf · qwen · llama · mistral

Local LLM Benchmarks on RTX PRO 6000

96 GB of VRAM. 8 models. 4 context depths. Here are the numbers - with the flags, the reps, and the surprises.

Multiple context lengths, flash attention on, 3 reps per measurement, full methodology. Numbers are for a single user running locally - batch size 1, no concurrency.


The Rig

GPU:     NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
VRAM:    96 GB (95.6 GB / 97,247 MiB usable)
Driver:  580.126.09 · CUDA 13.2
CPU:     30-core (VM host)
RAM:     172 GB

Does CPU/RAM matter? For these tests, no. All models loaded fully into VRAM (zero CPU offloading, ngl=99). CPU only handles tokenization and the initial file load, both negligible. RAM gets used while reading the GGUF from disk into VRAM; irrelevant once loaded. The bottleneck is GPU memory bandwidth.
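Before pulling a model, the fit arithmetic is easy to sketch yourself. A minimal check, assuming the Q4 GGUF sizes from the tables in this post and a rough (unmeasured) 2 GiB runtime overhead:

```python
# Rough VRAM fit check: weights + KV cache + runtime overhead vs. GPU memory.
# Model sizes are the Q4 GGUF sizes from the tables in this post; the 2 GiB
# overhead is an assumed ballpark for CUDA context and compute buffers.
def fits_in_vram(model_gib, kv_cache_gib, vram_gib=96.0, overhead_gib=2.0):
    """Return True if the model should load fully on-GPU (ngl=99)."""
    return model_gib + kv_cache_gib + overhead_gib <= vram_gib

print(fits_in_vram(44.2, 20))  # qwen2.5:72b with a generous cache: True
print(fits_in_vram(74.1, 25))  # mixtral:8x22b with a huge cache: False
```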


Methodology

Tool:         llama-bench (llama.cpp d23355af / build b8352)
GPU layers:   -ngl 99       (all layers on GPU, no CPU offload)
Flash attn:   -fa 1         (enabled - improves pp speed 5–11%)
Repetitions:  -r 3          (mean ± stddev reported)
Models:       Ollama GGUF blobs, loaded directly
Batch size:   BS=1 (single user, not server throughput)

All models are Ollama-pulled GGUFs accessed directly via their blob paths (see Where Ollama Stores Your Models for how that works). TTFT (time-to-first-token) is not measured here: llama-bench measures throughput, not wall-clock latency. For TTFT you'd use the llama.cpp server plus benchmark_serving.py, or a tool like llama-benchy. That's a separate post.


Reading the Labels

The notation used throughout this post:

| Label | Meaning |
| --- | --- |
| pp128 | Prompt processing at 128 input tokens - short prompt, e.g. a quick question |
| pp512 | Prompt processing at 512 input tokens - a paragraph or short document chunk |
| pp2048 | Prompt processing at 2,048 input tokens - a few pages of context |
| tg128 | Token generation of 128 output tokens - a brief response |
| tg512 | Token generation of 512 output tokens - a longer reply or code block |
| @ d8192 | Measured with 8,192 tokens already in context (KV cache pre-filled) |
| @ d32768 | Measured with 32K tokens in context |
| @ d65536 | Measured with 65K tokens in context |

Numbers are reported as the mean ± standard deviation across three runs. A low standard deviation (e.g. ± 4) means a stable result; a wide one (e.g. ± 616) means the GPU was heating up or the system was under other load, so treat that number with more care. All speeds are in tokens per second (t/s); higher is better. pp (prefill) is compute-bound: the GPU performs dense matrix multiplications in parallel. tg (decode) is memory-bandwidth-bound: the weights are streamed from VRAM for every generated token.
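The bandwidth-bound claim for decode can be sanity-checked with back-of-envelope math. The ~1.8 TB/s memory bandwidth figure is an assumption for this card, and tg_ceiling is a hypothetical helper, not anything llama.cpp reports:

```python
# Decode is bandwidth-bound, so a ceiling estimate is simply
# tg_max ~= memory_bandwidth / bytes_streamed_per_token.
# The ~1.8 TB/s bandwidth is an assumed spec for this card, not measured.
def tg_ceiling(weights_gib, bandwidth_gb_s=1800):
    gib_per_s = bandwidth_gb_s / 1.073741824  # GB/s -> GiB/s
    return gib_per_s / weights_gib

# mistral-nemo:12b streams ~6.6 GiB of Q4 weights per generated token:
print(round(tg_ceiling(6.6)))  # 254 t/s ceiling; measured 158 t/s (~62%)
```

Real kernels landing at 50-70% of the naive bandwidth ceiling is normal; the gap covers KV cache reads, activations, and kernel launch overhead.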


Prompt Processing (t/s - higher is better)

How fast the model reads your input. Relevant for RAG, document analysis, long-context work.

| Model | Type | Params | Quant | Size | pp128 | pp512 | pp2048 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 5,467 ± 616 | 8,325 ± 88 | 7,903 ± 4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 4,059 ± 499 | 5,918 ± 93 | 5,612 ± 2 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 1,578 ± 37 | 3,274 ± 21 | 3,337 ± 11 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 2,074 ± 169 | 2,601 ± 36 | 2,530 ± 4 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 658 ± 10 | 1,448 ± 2 | 1,430 ± 2 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 1,129 ± 12 | 1,327 ± 9 | 1,295 ± 3 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 1,013 ± 3 | 1,173 ± 7 | 1,147 ± 3 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 1,020 ± 6 | 1,176 ± 7 | 1,147 ± 3 |
Winner: mistral-nemo:12b at 8,325 t/s pp512, by a wide margin. It reads a 512-token prompt in under 62ms. For RAG or long-context work, this is the one if quality holds up. Among 70B+ dense models, they're essentially tied: qwen2.5:72b and llama3.1:70b land within 3 t/s of each other. The surprise: qwen3-next:80b (an 80B MoE) outpaces qwen2.5:32b at pp2048. The MoE routing gets more efficient as context grows.
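The 62 ms figure is just throughput converted to latency. A minimal sketch (prefill_ms is a hypothetical helper; the numbers are from the table above):

```python
# Converting throughput to latency: time to prefill n tokens at pp t/s.
# Numbers are taken from the prompt-processing table in this post.
def prefill_ms(n_tokens, pp_tps):
    return 1000 * n_tokens / pp_tps

print(round(prefill_ms(512, 8325), 1))   # 61.5 ms: mistral-nemo:12b, pp512
print(round(prefill_ms(2048, 1147), 1))  # 1785.5 ms: qwen2.5:72b, pp2048
```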

Token Generation (t/s - higher is better)

How fast the model writes output. This is the number you feel during chat.

| Model | Type | Params | Quant | Size | tg128 | tg512 |
| --- | --- | --- | --- | --- | --- | --- |
| mistral-nemo:12b | Dense | 12B | Q4_0 | 6.6 GiB | 158 ± 0.1 | 156 ± 0.4 |
| qwen2.5-coder:14b | Dense | 15B | Q4_K_M | 8.4 GiB | 117 ± 0.2 | 116 ± 0.6 |
| qwen3-next:80b | MoE (3B active) | 80B | Q4_K_M | 46.9 GiB | 124 ± 0.7 | 125 ± 0.1 |
| qwen2.5:32b | Dense | 33B | Q4_K_M | 18.5 GiB | 56 ± 0.2 | 55 ± 0.5 |
| mixtral:8x22b | MoE | 141B total | Q4_0 | 74.1 GiB | 54 ± 0.2 | 52 ± 0.5 |
| llama2-uncensored:70b | Dense | 69B | Q4_0 | 36.2 GiB | 31 ± 0.1 | 31 ± 0.2 |
| llama3.1:70b | Dense | 71B | Q4_K_M | 39.6 GiB | 27 ± 0.1 | 26 ± 0.2 |
| qwen2.5:72b | Dense | 73B | Q4_K_M | 44.2 GiB | 25 ± 0.1 | 24 ± 0.2 |
Winner: mistral-nemo:12b at 158 t/s, comfortably the fastest for output. But the real story is row 3: qwen3-next:80b generates at 124 t/s despite being an 80B model. That beats qwen2.5-coder:14b (117 t/s). If you need a large, capable model that still feels fast in chat, qwen3-next is the pick. The 70B+ dense models (llama3.1, llama2, qwen2.5:72b) all land between 24-31 t/s, noticeably slower in conversation. Also note the near-zero stddev on token generation (± 0.1-0.7): once loaded, decode speed is rock solid.

Context Depth Scaling

This is where 96 GB VRAM actually matters. Most consumer cards (24 GB) OOM before 32K on a 70B model. Here's how speed degrades as the KV cache fills up.

| Model | Context Depth | pp512 (t/s) | tg128 (t/s) |
| --- | --- | --- | --- |
| mistral-nemo:12b | 0 | 7,898 ± 208 | 150 ± 0.2 |
| mistral-nemo:12b | 8K | 6,490 ± 218 | 128 ± 0.2 |
| mistral-nemo:12b | 32K | 3,603 ± 66 | 87 ± 0.1 |
| mistral-nemo:12b | 65K | 1,600 ± 14 | 61 ± 0.1 |
| qwen2.5:32b | 0 | 2,548 ± 19 | 54 ± 0.1 |
| qwen2.5:32b | 8K | 2,137 ± 37 | 49 ± 0.1 |
| qwen2.5:32b | 32K | 1,199 ± 12 | 40 ± 0.1 |
| qwen2.5:32b | 65K | 551 ± 2 | 32 ± 0.1 |
| llama3.1:70b | 0 | 1,166 ± 3 | 27 ± 0.1 |
| llama3.1:70b | 8K | 1,040 ± 9 | 25 ± 0.1 |
| llama3.1:70b | 32K | 691 ± 5 | 22 ± 0.1 |
| llama3.1:70b | 65K | 377 ± 1 | 19 ± 0.1 |
Winner: llama3.1:70b for long-context work. It degrades the least, only 30% slower at 65K vs baseline. mistral-nemo:12b loses 59% of generation speed by 65K, dropping from 150 to 61 t/s. qwen2.5:32b loses 41%. The 70B model's weights dominate the VRAM bandwidth budget, so the growing KV cache hurts it less proportionally. For large document summarization or long multi-turn agents, bigger holds up better than you'd expect.
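The degradation percentages fall straight out of the table. A quick sketch using the tg128 values at depth 0 and 65K:

```python
# Percent of baseline tg128 speed lost at 65K context, from the table above.
def pct_lost(tg_baseline, tg_at_depth):
    return round(100 * (tg_baseline - tg_at_depth) / tg_baseline)

print(pct_lost(150, 61))  # mistral-nemo:12b -> 59
print(pct_lost(54, 32))   # qwen2.5:32b      -> 41
print(pct_lost(27, 19))   # llama3.1:70b     -> 30
```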

What the Numbers Show

Flash attention matters, especially for prompt processing.

Enabling -fa 1 lifted pp512 by 5-11% across all dense models. mistral-nemo gained 9%, qwen2.5-coder gained 10.8%. If you're benchmarking without it, your numbers are understated. MoE models (qwen3-next, mixtral) see smaller gains since they're already structured differently at the kernel level.

MoE breaks the size-speed relationship.

qwen3-next:80b has 80B total parameters, activates ~3B per token, and generates at 124 t/s. That's faster than qwen2.5-coder:14b (117 t/s) and nearly as fast as mistral-nemo:12b (158 t/s). You're getting an 80B-quality model at 14B-class throughput.

Two things make this weirder. First, its token generation is flat: tg128 and tg512 are virtually identical (124.33 vs 124.89). Dense models always degrade slightly; this MoE doesn't, because the active parameter count stays constant no matter how far into the sequence you are. Second, its prompt processing actually increases at longer contexts: pp128 to pp512 to pp2048 goes 1,578 to 3,274 to 3,337. The MoE routing gets more efficient with larger batches.
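A rough sketch of why the MoE decodes so fast: per generated token, only the active parameters have to stream from VRAM. The ~4.8 bits/weight figure for Q4_K_M is an approximation, and shared attention weights and the KV cache are ignored, so treat this as upper-bound intuition only:

```python
# Per generated token, roughly only the active parameters stream from VRAM.
# ~4.8 bits/weight for Q4_K_M is an approximation; shared attention weights
# and the KV cache are ignored, so this is upper-bound intuition only.
def active_gib(params_billion, bits_per_weight=4.8):
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

dense_80b = active_gib(80)  # a dense 80B streams ~45 GiB per token
moe_3b = active_gib(3)      # qwen3-next's ~3B active: ~1.7 GiB per token
print(round(dense_80b / moe_3b))  # ~27x less VRAM traffic per token
```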

Small models degrade harder at long context.

At 65K context depth, mistral-nemo:12b loses 59% of its token generation speed (150 to 61 t/s). llama3.1:70b only loses 30% (27 to 19 t/s).

The reason: KV cache size scales with context length and attention head count, not model size. For smaller models, the KV cache becomes a larger fraction of total memory bandwidth. A 70B model's weights dominate the bandwidth budget regardless of context, so the KV cache overhead is proportionally smaller.
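The KV cache arithmetic is easy to sketch. The shape below (80 layers, 8 KV heads via GQA, head dim 128) approximates a llama3.1:70b-class model; an fp16 cache is assumed, and your runtime may quantize the cache differently:

```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens.
# Shape approximates a llama3.1:70b-class model (80 layers, 8 KV heads via
# GQA, head dim 128); fp16 cache assumed (2 bytes per element).
def kv_cache_gib(layers, kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_tokens / 1024**3

print(kv_cache_gib(80, 8, 128, 65536))  # 20.0 GiB at 65K context
```

Against 39.6 GiB of weights that 20 GiB is a sizable but not dominant addition; against a 6.6 GiB model it is triple the weight traffic, which is why the small model's decode speed falls off harder.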

For long-context work (RAG over large documents, extended reasoning chains), a 32B+ model may actually be faster than a 12B at high context depths, and it'll definitely be more consistent.


What Didn't Load

These models are installed in Ollama but failed to load in llama-bench (llama.cpp d23355af):

| Model | Size | Error |
| --- | --- | --- |
| qwen3.5:35b | 23 GB | key not found: qwen35moe.rope.dimension_sections |
| qwen3.5:122b | 81 GB | Same - unsupported MoE format |
| qwen3-coder-next:q4_K_M | 51 GB | Failed to load |
| qwen3-coder-next:q8_0 | 84 GB | Failed to load |
| nemotron-3-super:120b | 86 GB | Wrong tensor shape |
| llama4:16x17b | 63 GB | Failed to load |
| gpt-oss:20b | 13 GB | Failed to load |
| gpt-oss:120b | 61 GB | Failed to load |
| glm-4.7-flash | 18 GB | Failed to load |
| qwen3-vl:32b | 20 GB | Failed to load |

The pattern: every model that failed was released in the last 2-3 months. Ollama (v0.17.1) bundles GGUFs ahead of llama.cpp's upstream support. As llama.cpp adds support for these architectures I'll re-run and update.


How to Reproduce

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp

CUDACXX=/usr/local/cuda/bin/nvcc cmake -B build \
  -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j$(nproc)

# Baseline throughput
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -fa 1 \
  -p 128,512,2048 -n 128,512 \
  -r 3

# Context depth scaling
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 99 -fa 1 \
  -p 512 -n 128 -r 3 \
  -d 0,8192,32768,65536

For loading Ollama's cached GGUFs directly, see Where Ollama Stores Your Models.
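If you'd rather post-process results than read tables, llama-bench can emit machine-readable output with -o json. The field names below (avg_ts, stddev_ts, n_prompt, n_gen) match recent llama.cpp builds but may differ in yours, so treat this parser as a sketch over sample data:

```python
import json

# Parse llama-bench JSON output (`-o json`). Field names are assumptions
# based on recent llama.cpp builds; the sample mirrors this post's numbers.
sample = '''[
  {"model": "mistral-nemo 12B Q4_0", "n_prompt": 512, "n_gen": 0,
   "avg_ts": 8325.0, "stddev_ts": 88.0},
  {"model": "mistral-nemo 12B Q4_0", "n_prompt": 0, "n_gen": 128,
   "avg_ts": 158.0, "stddev_ts": 0.1}
]'''

for row in json.loads(sample):
    kind = "pp" if row["n_prompt"] else "tg"  # prefill vs decode result
    n = row["n_prompt"] or row["n_gen"]
    print(f'{row["model"]}: {kind}{n} = {row["avg_ts"]} ± {row["stddev_ts"]} t/s')
```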


The result that stuck with me: at 65K context, the 70B model degrades less than the 12B. More VRAM isn't always slower; sometimes it's more stable. The 10 models that didn't load are the ones I'm most curious about. I'll re-run this post when llama.cpp catches up to the newer architectures.