Running SLMs Locally with Ollama, A Step by Step Tutorial
January 8, 2025 · 9 min read · by Muhammad Amal

TL;DR — Ollama 0.5 is the fastest way to put a local SLM behind an HTTP endpoint. Pull a model, write a Modelfile to pin parameters, hit /api/chat from your app, and harden with concurrency limits and a keep-alive policy. Not magic, just llama.cpp with a nice wrapper.

Ollama isn’t doing anything you couldn’t do with llama.cpp directly. What it gives you is a package manager, a daemon, an HTTP API, and a model registry. Those four things together turn “I want to use a local SLM” from a weekend project into an afternoon. The previous post in this series, the SLM landscape survey, covers which model to pick. This one assumes you’ve picked and want it running.

I run Ollama in production on a single Linux box and on my MacBook for dev. Both setups behave nearly identically, which is a small miracle given how different the underlying inference paths are (Metal vs CUDA). The only real catch is that Ollama hides things by default that you sometimes need to control, so this tutorial spends a lot of time on the Modelfile.

Heads up: I’m pinning to Ollama 0.5.x as it ships in January 2025. The CLI has been stable for a year, but the structured outputs feature is new in 0.5 and worth knowing about.

Installation and First Run

Step 1, install Ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On macOS, download the .dmg from ollama.com/download and drag the app to /Applications. The daemon starts automatically on first launch.

Confirm the install:

ollama --version
# ollama version is 0.5.4

Step 2, pull your first model

ollama pull llama3.2:3b-instruct-q4_K_M

The q4_K_M suffix is the quantization. For a 3B model on consumer hardware, q4_K_M is the sweet spot — 4-bit quantization with mixed precision on certain layers, roughly 2.0 GB on disk. If you have more VRAM and want sharper outputs, use q8_0 (around 3.5 GB).
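
The disk sizes above can be sanity-checked with back-of-envelope arithmetic: parameter count times bits per weight, divided by eight. This is a rough sketch, not how Ollama computes anything. The parameter count (about 3.2 billion) and the 4.85 bits-per-weight average for q4_K_M are my assumptions; q8_0 works out to 8.5 bits because each block of 32 weights carries a 16-bit scale.

```python
# Rough GGUF size estimate: parameters x bits-per-weight / 8.
BITS_PER_WEIGHT = {
    "q4_K_M": 4.85,  # assumed average; K-quants mix 4-bit and 6-bit tensors
    "q8_0": 8.5,     # 32 int8 weights plus one 16-bit scale per block
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Return the approximate file size in decimal gigabytes."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


# Assumption: Llama 3.2 3B has roughly 3.2 billion parameters.
for quant in ("q4_K_M", "q8_0"):
    print(f"{quant}: ~{estimate_size_gb(3.2e9, quant):.1f} GB")
```

The same function gives a quick answer to "will the 7B version fit?" before you spend bandwidth on the pull.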

The catalogue is at ollama.com/library. My usual suspects:

ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
ollama pull qwen2.5:3b-instruct-q4_K_M
ollama pull gemma2:2b-instruct-q4_K_M

Step 3, smoke-test from the CLI

ollama run llama3.2:3b-instruct-q4_K_M "Reply with the single word: ready"

If you see ready, the daemon, model loader, and inference path are working.

The Modelfile Is the Real API

The biggest mistake I see people make with Ollama is calling models with default parameters in production. Defaults are tuned for chat demos, not for your task. The Modelfile is how you pin them.

Step 1, dump the existing Modelfile of a pulled model

ollama show --modelfile llama3.2:3b-instruct-q4_K_M > base.Modelfile

You’ll see something like:

FROM /Users/you/.ollama/models/blobs/sha256-...
TEMPLATE "{{ if .System }}<|start_header_id|>system<|end_header_id|>..."
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

Step 2, write your own Modelfile

Create a file called extractor.Modelfile:

FROM llama3.2:3b-instruct-q4_K_M

# System prompt baked in
SYSTEM """You are a strict information extractor. You always reply with valid JSON, no commentary."""

# Inference parameters
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.05
PARAMETER num_ctx 8192
PARAMETER num_predict 512

# Stop sequences — keep the ones from base
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

Step 3, create the model

ollama create resume-extractor -f extractor.Modelfile

You now have a model called resume-extractor that you can call like any other. The system prompt and parameters travel with it. This is the unit of deployment.

Step 4, version it

I keep Modelfiles in git, next to the application code that uses them. When the prompt changes, the model name gets bumped: resume-extractor:v2. Treat the Modelfile like a database migration.
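
One way to catch a prompt change that slipped in without a version bump is to fingerprint the Modelfile text. The sketch below is illustrative only: render_modelfile and content_tag are hypothetical helpers of mine, not part of Ollama, and the hash is a check alongside the v2-style tags rather than a replacement for them.

```python
import hashlib


def render_modelfile(base: str, system: str, params: dict, stops: list) -> str:
    """Render a Modelfile as text from plain Python values."""
    lines = [f"FROM {base}", "", f'SYSTEM """{system}"""', ""]
    # Sort parameters so the same settings always produce the same text.
    lines += [f"PARAMETER {name} {value}" for name, value in sorted(params.items())]
    lines += [f'PARAMETER stop "{stop}"' for stop in stops]
    return "\n".join(lines) + "\n"


def content_tag(modelfile_text: str, length: int = 8) -> str:
    """Short, stable fingerprint of the Modelfile contents."""
    digest = hashlib.sha256(modelfile_text.encode("utf-8")).hexdigest()
    return digest[:length]


text = render_modelfile(
    base="llama3.2:3b-instruct-q4_K_M",
    system="You are a strict information extractor.",
    params={"temperature": 0.1, "num_ctx": 8192},
    stops=["<|eot_id|>"],
)
print(content_tag(text))
```

Storing the fingerprint next to the version tag in CI makes it easy to fail a build when the two drift apart.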

Calling Ollama from Python

The CLI is for poking. Real work happens over HTTP.

Step 1, install the official client

pip install "ollama==0.4.4"

Step 2, call the chat endpoint

# extract.py
from ollama import Client

client = Client(host="http://localhost:11434")

resp = client.chat(
    model="resume-extractor",
    messages=[
        {"role": "user", "content": "Led the backend team at Acme Corp from 2019 to 2023."}
    ],
    options={"temperature": 0.1},
)
print(resp["message"]["content"])

The options dict overrides Modelfile parameters per request. Use it sparingly — if you’re overriding every time, your Modelfile is wrong.
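
To make the override rule concrete, here is a small sketch of the merge and of the JSON body that ends up on the wire for POST /api/chat. MODELFILE_PARAMS is my stand-in for the PARAMETER lines baked into the model; the real merge happens inside the daemon, so treat effective_options as a mental model rather than Ollama's code.

```python
import json
from typing import Optional

# Assumed to mirror the PARAMETER lines in extractor.Modelfile.
MODELFILE_PARAMS = {"temperature": 0.1, "top_p": 0.9, "num_ctx": 8192}


def effective_options(modelfile_params: dict, request_options: Optional[dict]) -> dict:
    """Per-request options win over Modelfile parameters, key by key."""
    return {**modelfile_params, **(request_options or {})}


def build_chat_body(model: str, messages: list, options: Optional[dict] = None) -> bytes:
    """Build the JSON body for a non-streaming POST /api/chat request."""
    body = {"model": model, "messages": messages, "stream": False}
    if options:
        body["options"] = options
    return json.dumps(body).encode("utf-8")


print(effective_options(MODELFILE_PARAMS, {"temperature": 0.7}))
```

Only the keys you pass are overridden; everything else keeps its Modelfile value.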

Step 3, use structured outputs (Ollama 0.5+)

This is the killer feature added in 0.5. Pass a JSON schema and Ollama will constrain decoding to produce valid output.

from ollama import Client
import json

client = Client()

schema = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "role":    {"type": "string"},
        "start":   {"type": "integer"},
        "end":     {"type": "integer"},
    },
    "required": ["company", "role", "start", "end"],
}

resp = client.chat(
    model="resume-extractor",
    messages=[{"role": "user",
               "content": "Led the backend team at Acme Corp from 2019 to 2023."}],
    format=schema,
)

data = json.loads(resp["message"]["content"])
assert isinstance(data["start"], int)
print(data)

The output is constrained to valid JSON matching the schema, provided generation is not cut off by num_predict first. This used to require ugly retry loops or a sidecar like Outlines. Now it’s two lines.
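
Even with constrained decoding I still check the parsed result before it goes anywhere important, mostly because a response that hits the token limit can end mid-object. The sketch below is a minimal stdlib-only check, assuming the flat schema from this example; parse_and_check is a hypothetical helper and handles only the "string" and "integer" types used here.

```python
import json

SCHEMA = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "role": {"type": "string"},
        "start": {"type": "integer"},
        "end": {"type": "integer"},
    },
    "required": ["company", "role", "start", "end"],
}

# Only the JSON Schema types used above are covered.
PYTHON_TYPES = {"string": str, "integer": int}


def parse_and_check(raw: str, schema: dict) -> dict:
    """Parse model output and verify required keys and primitive types."""
    data = json.loads(raw)  # truncated output raises JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in schema.get("required", []):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    for key, spec in schema["properties"].items():
        if key not in data:
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[spec["type"]]):
            raise ValueError(f"wrong type for field: {key}")
    return data
```

In the example above, json.loads(resp["message"]["content"]) would become parse_and_check(resp["message"]["content"], SCHEMA), with a single ValueError handler covering every failure mode.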

Architecture

Here’s how Ollama actually sits in my stack:

+----------------+        HTTP         +----------------+
|  Your service  | ------------------> |    Ollama      |
|  (FastAPI etc) |  POST /api/chat     |    daemon      |
+----------------+                     +----------------+
        |                                      |
        |                                      | loads on demand
        |                                      v
        |                              +----------------+
        |                              |  llama.cpp     |
        |                              |  + GGUF model  |
        |                              +----------------+
        |
        | persists prompts + responses
        v
+----------------+
|  Postgres /    |
|  S3 / Sentry   |
+----------------+

Ollama is just an HTTP server in front of llama.cpp. Your service does all the prompt construction, retries, logging, and observability. Don’t try to make Ollama do those.
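
Since retries belong in the service, here is roughly what that layer can look like: a small wrapper with exponential backoff around any zero-argument call. This is a sketch under assumptions. with_retries is my own helper, and the default retry_on tuple uses built-in exception types as placeholders; substitute whatever your HTTP client actually raises on connection failures.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")
```

Usage is a one-liner around the client call, for example with_retries(lambda: client.chat(model="resume-extractor", messages=messages)). The injectable sleep argument exists so tests do not have to wait.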

Production Hardening

Default Ollama is fine for dev. For production you need to think about three things: concurrency, model residency, and observability.

Step 1, configure environment

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_FLASH_ATTENTION=1"

After editing:

sudo systemctl daemon-reload
sudo systemctl restart ollama

  • OLLAMA_KEEP_ALIVE=24h keeps the model in VRAM. Default is 5m, which kills your latency on the first request after idle.
  • OLLAMA_NUM_PARALLEL=4 lets 4 requests run against a single loaded model concurrently. Each parallel slot gets its own num_ctx window, so KV cache memory grows roughly fourfold; reduce num_ctx accordingly if VRAM is tight.
  • OLLAMA_MAX_LOADED_MODELS=2 caps how many models can be resident at once, which bounds total VRAM. If you have multiple Modelfiles in rotation, this stops them all trying to live in memory at once.
  • OLLAMA_FLASH_ATTENTION=1 enables flash attention if your hardware supports it.
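
To see how num_ctx and OLLAMA_NUM_PARALLEL interact with VRAM, a back-of-envelope KV cache estimate helps. The sketch assumes each parallel slot is allocated its own num_ctx tokens and that the cache is stored in 16-bit floats. The architecture numbers (28 layers, 8 KV heads, head dimension 128) are what I understand Llama 3.2 3B to use; check them against your model's metadata before trusting the result.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    num_ctx: int,
    num_parallel: int = 1,
    bytes_per_value: int = 2,  # assumes an f16 cache
) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx * num_parallel


# Assumed Llama 3.2 3B shape: 28 layers, 8 KV heads, head_dim 128.
for parallel in (1, 4):
    size = kv_cache_bytes(28, 8, 128, num_ctx=8192, num_parallel=parallel)
    print(f"num_parallel={parallel}: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache alone at num_ctx 8192 with four slots is larger than the q4_K_M weights, which is why the two settings have to be tuned together.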

Step 2, put a reverse proxy in front

Never expose :11434 directly. Nginx config:

server {
    listen 443 ssl http2;
    server_name slm.internal.example.com;

    ssl_certificate     /etc/ssl/certs/internal.crt;
    ssl_certificate_key /etc/ssl/private/internal.key;

    location / {
        proxy_pass         http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
        proxy_read_timeout 300s;
        proxy_buffering    off;  # critical for streaming
    }
}

The proxy_buffering off is non-negotiable if you want token streaming to work. I have lost an evening to that.

Step 3, add logging

Wrap every call. Cheapest possible version:

import time, uuid, logging, json
from ollama import Client

log = logging.getLogger("slm")
client = Client()

def chat(model: str, messages: list[dict], **opts) -> dict:
    rid = uuid.uuid4().hex[:8]
    t0 = time.monotonic()
    try:
        resp = client.chat(model=model, messages=messages, options=opts)
        dt = time.monotonic() - t0
        log.info(json.dumps({
            "rid": rid, "model": model, "dt_s": round(dt, 3),
            "prompt_tokens": resp.get("prompt_eval_count"),
            "completion_tokens": resp.get("eval_count"),
        }))
        return resp
    except Exception as e:
        log.exception(f"slm_fail rid={rid} model={model} err={e}")
        raise

The prompt_eval_count and eval_count fields are the only way to track token usage. Ollama doesn’t expose dollar costs because there are none, but you still want to watch them.
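
The same response also carries timing fields (prompt_eval_duration, eval_duration, load_duration), reported in nanoseconds. A small sketch of turning them into numbers worth graphing; summarize_timing is a hypothetical helper and the plain dict stands in for the client's response object.

```python
def summarize_timing(resp: dict) -> dict:
    """Convert Ollama's nanosecond durations into tokens/second and seconds."""

    def rate(count, duration_ns):
        # Fields can be missing, for example when the prompt was served from cache.
        if not count or not duration_ns:
            return None
        return round(count / (duration_ns / 1e9), 1)

    return {
        "prompt_tok_s": rate(resp.get("prompt_eval_count"), resp.get("prompt_eval_duration")),
        "gen_tok_s": rate(resp.get("eval_count"), resp.get("eval_duration")),
        "load_s": round((resp.get("load_duration") or 0) / 1e9, 3),
    }


example = {
    "prompt_eval_count": 30, "prompt_eval_duration": 150_000_000,
    "eval_count": 120, "eval_duration": 2_000_000_000,
    "load_duration": 5_000_000,
}
print(summarize_timing(example))
```

A load_s that jumps from milliseconds to seconds is the clearest sign that the model was evicted and reloaded.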

Common Pitfalls

  1. Defaulting num_ctx to a small value. The default in Ollama is 2048. If you send a 4000-token prompt, the model silently truncates from the front. Your system prompt vanishes. Fix: always set num_ctx in your Modelfile or per call, sized to your actual maximum prompt.

  2. Re-pulling on every CI run. ollama pull skips layers that are already in the local cache, but in CI you usually have no cache, so every run downloads the full model. Models are gigabytes. Fix: mount ~/.ollama/models as a CI cache volume keyed on the model digest.

  3. Confusing temperature 0 with deterministic output. Ollama (via llama.cpp) is not bit-exact deterministic even at temperature 0, because of nondeterministic CUDA kernels. Fix: set a seed parameter as well to remove sampling randomness, and set OLLAMA_DEBUG=1 while you investigate any flake.

  4. Loading too many models. Each loaded model occupies its full VRAM footprint. If you have three Modelfiles all based on llama3.2:3b, Ollama treats them as the same underlying model but applies the layered config — so they share VRAM. If they’re based on different base models, they don’t. Fix: minimize base model count.
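
For the first pitfall, a cheap guard in the calling service catches oversized prompts before the daemon silently truncates them. This is a rough sketch: the four-characters-per-token ratio is a crude heuristic for English text, not the model's tokenizer, and check_budget is a hypothetical helper.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; assumes about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def check_budget(messages: list, num_ctx: int, num_predict: int, system: str = "") -> dict:
    """Check that the prompt plus the reserved output fits inside num_ctx."""
    prompt_tokens = estimate_tokens(system) + sum(
        estimate_tokens(m["content"]) for m in messages
    )
    needed = prompt_tokens + num_predict  # generated tokens share the window
    return {"prompt_tokens": prompt_tokens, "needed": needed, "fits": needed <= num_ctx}


long_prompt = [{"role": "user", "content": "x" * 16000}]
print(check_budget(long_prompt, num_ctx=2048, num_predict=512))
print(check_budget(long_prompt, num_ctx=8192, num_predict=512))
```

Logging a warning when fits is false is usually enough. For exact counts, compare against prompt_eval_count from a real response.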

Troubleshooting

Symptom: First request after a quiet hour takes 30+ seconds. Diagnose: Model evicted from VRAM. Set OLLAMA_KEEP_ALIVE=24h or send a periodic warm-up ping.

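Symptom: Error: model 'foo' not found, try pulling it first even though you pulled it. Diagnose: You pulled through a different server instance. ollama pull is only a client; the model is stored by whichever ollama serve process handled the request, under that user’s ~/.ollama/models. The systemd daemon runs as the ollama user, so models pulled while a manually started server ran under your own account are invisible to it. Pull again with the systemd daemon running, or set OLLAMA_MODELS=/var/lib/ollama/models consistently for every instance.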

Symptom: Output is garbage tokens. Diagnose: Wrong chat template. This happens if you create a Modelfile FROM a base GGUF without re-specifying the TEMPLATE. Fix: inherit from a properly-templated registry model where you can (e.g. FROM llama3.2:3b-instruct-q4_K_M); if you must start from a raw .gguf file, supply the TEMPLATE yourself.
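
For the first symptom, the periodic warm-up ping can be a chat request with an empty messages list, which asks the daemon to load the model without generating anything. The sketch below only builds the request so it can be inspected and tested offline; the base URL and the 24h value are assumptions taken from the config in this post, and build_warmup_request is my own helper name.

```python
import json
from urllib.request import Request


def build_warmup_request(
    model: str,
    keep_alive: str = "24h",
    base_url: str = "http://localhost:11434",  # assumed default daemon address
) -> Request:
    """Build a POST /api/chat request that loads the model and sets keep_alive."""
    body = json.dumps(
        {"model": model, "messages": [], "keep_alive": keep_alive}
    ).encode("utf-8")
    return Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_warmup_request("resume-extractor")
print(req.get_method(), req.full_url)
```

To send it, pass the request to urllib.request.urlopen(req, timeout=60) from a cron job or your scheduler of choice.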

Streaming Responses

Tokens-as-they-arrive is non-negotiable for chat UIs. The Python client supports it directly.

from ollama import Client

client = Client()
for chunk in client.chat(
    model="resume-extractor",
    messages=[{"role": "user", "content": "Tell me a 3-line story."}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)

If you’re proxying through nginx, the proxy_buffering off directive I mentioned earlier is what makes this work. Without it, nginx batches the response and the user sees nothing for several seconds.
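
In a real service you usually want both the live tokens and the assembled text, for logging or persistence. A minimal sketch of that pattern follows; collect_stream is a hypothetical helper, and the plain dicts in the example stand in for the chunks the client yields, indexed the same way as in the loop above.

```python
from typing import Callable, Iterable, Optional


def collect_stream(
    chunks: Iterable[dict],
    on_token: Optional[Callable[[str], None]] = None,
) -> str:
    """Forward each piece of text as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        if not piece:
            continue  # the final chunk typically carries stats, not text
        if on_token is not None:
            on_token(piece)
        parts.append(piece)
    return "".join(parts)


fake_chunks = [
    {"message": {"content": "Once"}, "done": False},
    {"message": {"content": " upon a time."}, "done": False},
    {"message": {"content": ""}, "done": True},
]
full = collect_stream(fake_chunks, on_token=lambda t: print(t, end="", flush=True))
```

With the real client, pass client.chat(..., stream=True) as chunks and a websocket or SSE writer as on_token.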

Pulling and Running Your Own GGUFs

You’re not limited to Ollama’s registry. Any GGUF will load.

Step 1, put the GGUF somewhere readable

mkdir -p ~/local-models
cp /path/to/my-finetune-Q4_K_M.gguf ~/local-models/

Step 2, write a Modelfile that points to it

FROM /Users/me/local-models/my-finetune-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER temperature 0.1
PARAMETER num_ctx 8192

The TEMPLATE block is non-optional when you start from a raw GGUF. Copy it from a known-working Modelfile for the same base model family. If you skip this, you get garbage output and a head-scratching debug session.
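
It helps to see what the TEMPLATE above actually expands to. The sketch below mirrors its logic in plain Python so you can print the exact string the model would receive. It is a hand translation of this one Llama 3 style template, not Ollama's template engine, and render_llama3_prompt is a hypothetical name.

```python
def render_llama3_prompt(prompt: str, system: str = "") -> str:
    """Mirror the Llama 3 style TEMPLATE: optional system turn, user turn, open assistant turn."""
    out = ""
    if system:
        out += f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    if prompt:
        out += f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
    # The assistant header is left open so the model writes the reply.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out


print(render_llama3_prompt("test prompt", system="Reply in JSON."))
```

If the model was fine-tuned with a different format, the markers here are exactly what needs to change, and the stop parameters with them.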

Step 3, create and run

ollama create my-finetune -f Modelfile
ollama run my-finetune "test prompt"

This is the path I use to ship LoRA-fine-tuned variants — convert to GGUF, write a Modelfile, deploy as a new Ollama model name. Versioned in git like any other code.

Running Behind a Reverse Proxy with Auth

For anything that leaves localhost, add authentication. Ollama itself has none. Put it behind a proxy that does.

server {
    listen 443 ssl http2;
    server_name slm.example.com;

    ssl_certificate     /etc/letsencrypt/live/slm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/slm.example.com/privkey.pem;

    location /api/ {
        auth_basic           "SLM API";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass         http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
        proxy_buffering    off;
        proxy_read_timeout 300s;

        client_max_body_size 10m;
    }
}

For machine-to-machine auth I’d use API keys via a header check instead of basic auth, but the structural point holds: never expose :11434 raw. A common mistake is binding Ollama to 0.0.0.0 “temporarily for testing” and forgetting to switch back. Keep OLLAMA_HOST=127.0.0.1:11434 in the systemd unit so the daemon listens only on the loopback interface and external connections never reach it, regardless of firewall rules.
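
The header check itself is small. Here is a sketch of the comparison logic as it might live in the service or in a tiny auth endpoint that the proxy consults. The X-API-Key header name and both helper functions are my own choices for illustration; the points that matter are storing hashes rather than raw keys and comparing in constant time.

```python
import hashlib
import hmac


def hash_key(key: str) -> str:
    """Store and compare SHA-256 hashes, never the raw keys."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def is_authorized(headers: dict, allowed_hashes: list, header_name: str = "X-API-Key") -> bool:
    """Return True if the request carries a key whose hash is on the allow list."""
    supplied = next(
        (value for name, value in headers.items() if name.lower() == header_name.lower()),
        None,
    )
    if not supplied:
        return False
    supplied_hash = hash_key(supplied)
    ok = False
    for stored in allowed_hashes:
        # compare_digest avoids leaking match position through timing.
        ok |= hmac.compare_digest(supplied_hash, stored)
    return ok


ALLOWED = [hash_key("example-key-for-docs-only")]
print(is_authorized({"x-api-key": "example-key-for-docs-only"}, ALLOWED))
```

Rotate keys by adding the new hash to the list first and removing the old one once clients have switched.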

What’s Next

Ollama covers about 80% of the local SLM use cases I’ve shipped. For the other 20% — production-grade serving with batching, custom quantization, deep performance work — you want llama.cpp directly. That’s the next post. If you only ever use Ollama, you’ll be fine for most things; just write your Modelfiles like they’re code, because they are.