Engineering · March 16, 2026 · 3 min read

Where Ollama Stores Your Models (And How to Use Them Directly)

Ollama stores downloaded models as plain GGUF files. Here's how to find them, use them with llama.cpp tools, and what breaks when you try.

ollama · llm · homelab · gguf · llama.cpp · local-ai

Ollama stores every model you pull as a plain GGUF file. No proprietary format. No lock-in. Just a SHA256-named blob sitting in a directory.

Here's how to find it, what to do with it, and the one catch that'll bite you.


The Layout

Default location: ~/.ollama/models/. Override with OLLAMA_MODELS env var.

Two subdirectories:

models/
├── manifests/
│   └── registry.ollama.ai/library/qwen2.5/32b   ← tiny JSON, one per model:tag
└── blobs/
    └── sha256-eabc98a9...                         ← the actual GGUF (19GB)

Manifests are just metadata — they map a model:tag to its blob digests. The blob is the raw model file, named by its SHA256 hash.
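To make that mapping concrete, here's a toy manifest. The layer layout mirrors what Ollama writes (an OCI-style manifest with one layer per component); the digests and sizes are invented for illustration. The jq query is the same one used later in this post:

```shell
# Toy manifest: same layer structure Ollama writes, invented digests/sizes
cat > /tmp/toy-manifest.json <<'EOF'
{
  "schemaVersion": 2,
  "layers": [
    { "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:eabc98a9deadbeef", "size": 19000000000 },
    { "mediaType": "application/vnd.ollama.image.template",
      "digest": "sha256:0123456789abcdef", "size": 358 }
  ]
}
EOF

# Keep only the model layer; the others hold the prompt template, params, etc.
jq -r '.layers[]
       | select(.mediaType == "application/vnd.ollama.image.model")
       | .digest' /tmp/toy-manifest.json
# → sha256:eabc98a9deadbeef
```

The `select` on `mediaType` is what separates the multi-gigabyte model layer from the few-hundred-byte metadata layers sharing the same manifest.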


Finding the GGUF

Read the manifest, grab the model layer digest, build the path:

MODEL="qwen2.5:32b"
MODELS_DIR="${OLLAMA_MODELS:-$HOME/.ollama/models}"
NAME="${MODEL%:*}"
TAG="${MODEL#*:}"

DIGEST=$(jq -r '.layers[]
  | select(.mediaType == "application/vnd.ollama.image.model")
  | .digest' "$MODELS_DIR/manifests/registry.ollama.ai/library/$NAME/$TAG")

BLOB="$MODELS_DIR/blobs/${DIGEST/:/-}"
echo "$BLOB"
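The only transform between manifest and filesystem is a bash pattern substitution: manifests record digests as `sha256:<hex>`, while blob filenames use `sha256-<hex>`. A quick demo (digest value invented):

```shell
# Manifest form uses a colon; the on-disk filename uses a dash.
DIGEST="sha256:eabc98a9deadbeef"
echo "${DIGEST/:/-}"
# → sha256-eabc98a9deadbeef
```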

Confirm it's valid:

file "$BLOB"
# → data (GGUF format)
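If your `file` version doesn't recognize GGUF, you can check the magic bytes yourself: every GGUF file starts with the four ASCII bytes `GGUF`. This snippet writes a stand-in file so it's self-contained; against a real blob you'd run `head -c 4 "$BLOB"` instead:

```shell
# GGUF files begin with the 4-byte magic "GGUF" (per the GGUF spec).
# Stand-in file so the check runs anywhere:
printf 'GGUFxxxx' > /tmp/fake.gguf
head -c 4 /tmp/fake.gguf
# → GGUF
```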

What You Can Do With It

Benchmark with llama-bench — Ollama's --verbose flag gives you a single-run number. llama-bench gives you proper pp/tg splits across context lengths:

llama-bench -m $BLOB -p 512 -n 128

I ran this across four models on my rig (RTX PRO 6000 Blackwell, 97GB VRAM). Full hardware breakdown is a separate post — but here's what the numbers look like:

| Model | Quant | Size | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|---|
| mistral-nemo:12b | Q4_0 | 6.6 GiB | 7,702 | 150 |
| qwen2.5-coder:14b | Q4_K_M | 8.4 GiB | 5,555 | 111 |
| qwen2.5:32b | Q4_K_M | 18.5 GiB | 2,547 | 55 |
| llama3.1:70b | Q4_K_M | 39.6 GiB | 1,156 | 27 |

pp = prompt processing (how fast it reads your input). tg = token generation (how fast it writes output). Both matter — a slow tg is the one you actually feel.
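To turn those two rates into felt latency: time-to-first-token is roughly prompt_tokens / pp, and generation time is output_tokens / tg. Using the qwen2.5:32b row above with an assumed 2,000-token prompt and 500-token reply:

```shell
# Rough latency for qwen2.5:32b (pp512 = 2547 t/s, tg128 = 55 t/s):
# prefill ≈ prompt_tokens / pp, generation ≈ output_tokens / tg
awk 'BEGIN {
  printf "prefill: %.1fs  generation: %.1fs\n", 2000/2547, 500/55
}'
# → prefill: 0.8s  generation: 9.1s
```

Nine seconds of generation is what you sit through; the sub-second prefill barely registers. That's why tg is the number to watch.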

Debug with llama-cli — run the model outside Ollama's layer:

llama-cli -m $BLOB -p "What is 2+2?" -n 64

Audit what's eating your disk — map blobs back to model names:

MODELS_DIR="${OLLAMA_MODELS:-$HOME/.ollama/models}"
find "$MODELS_DIR/manifests" -type f | while read -r manifest; do
  name=$(echo "$manifest" | sed 's|.*/library/||' | tr '/' ':')
  digest=$(jq -r '.layers[]
    | select(.mediaType == "application/vnd.ollama.image.model")
    | .digest' "$manifest")
  blob="$MODELS_DIR/blobs/${digest/:/-}"
  size=$(du -sh "$blob" 2>/dev/null | cut -f1)
  echo "$size  $name"
done | sort -h

The Catch: Ollama Ships Ahead

This is the part that'll trip you up.

Ollama (v0.17.1) adds support for new model architectures before llama.cpp's master branch picks them up. So when you point a stock llama.cpp tool at an Ollama blob, it may refuse to load.

I hit this with qwen3.5:35b:

key not found in model: qwen35moe.rope.dimension_sections

The file isn't corrupt. llama.cpp just hasn't implemented that model's new format yet.

Fix: download the GGUF directly from HuggingFace instead of using the Ollama blob. HuggingFace repos typically note which llama.cpp version they're compatible with.


The abstraction Ollama gives you is convenient. But the files are yours — plain, open, reusable. Good to know when you need to reach past the API.


Questions or want to share your setup? Find me on GitHub or X.