Structured Output and Function Calling on Local SLMs


January 22, 2025 · 10 min read · by Muhammad Amal

TL;DR — Use constrained decoding for structured output on local SLMs. Ollama 0.5’s format parameter, vLLM’s guided decoding, or Outlines all work. Function calling is structured output with a calling convention layered on top. Don’t pretend it’s magic — validate everything.

I’ve fine-tuned an extractor with LoRA and I’ve watched a beautifully fine-tuned 3B model hallucinate a trailing comma in production. Fine-tuning improves structural compliance, but it doesn’t guarantee it. If your downstream parser breaks on malformed JSON, you need a guarantee. That’s what constrained decoding gives you.

Constrained decoding rewrites the decoder’s logits at each step so that only tokens consistent with your schema are sampled. If the schema says the next character must be {, only tokens starting with { get probability mass. The model can’t produce invalid output because the decoder won’t let it.
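To make the masking step concrete, here is a toy sketch in plain Python. The five-token vocabulary, the logit values, and the helper names are all made up for illustration; real engines work on the model's full vocabulary and track the schema state with a compiled automaton rather than a prefix check.

```python
import math

def mask_logits(logits, vocab, allowed_prefixes):
    """Set the logit of every token that does not start with an allowed prefix to -inf."""
    return [
        logit if any(tok.startswith(p) for p in allowed_prefixes) else -math.inf
        for tok, logit in zip(vocab, logits)
    ]

def softmax(logits):
    """Plain softmax; exp(-inf) evaluates to 0.0, so masked tokens get zero probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary: unconstrained, the model would prefer to start with "Sure".
vocab = ["{", '{"', "Sure", "Here", "["]
logits = [1.0, 0.5, 3.0, 2.5, 0.2]

# The schema says the output must start with "{", so everything else is masked.
masked = mask_logits(logits, vocab, allowed_prefixes=["{"])
probs = softmax(masked)
```

After masking, only the two tokens that begin with { carry any probability, and their relative ordering is still the model's own. The constraint removes options; it does not pick among the ones that remain.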

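This post covers the three main ways to do constrained decoding on local SLMs as of January 2025 (Ollama, vLLM, and Outlines), with llama.cpp’s built-in grammars as a fourth option if you run it directly, plus how to layer function calling on top. Each has tradeoffs.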

The Three Tools

The space is consolidating but not consolidated. Here’s what works:

Tool         Where it lives        Speed cost    Schema language
-----------  --------------------  ------------  ----------------
Ollama       Built into 0.5+       ~5%           JSON Schema
vLLM         Built into 0.6+       ~3%           JSON Schema, regex
Outlines     Standalone library    ~10%          JSON Schema, CFG, regex
llama.cpp    Built-in (grammar)    ~5%           GBNF grammar

The right pick depends on what you’re already running. If Ollama is your serving layer, use its format parameter. If you serve with vLLM, use guided_json. If you run llama.cpp directly, use a GBNF grammar. If you have unusual schema needs (recursive structures, regex constraints), Outlines is the most expressive.

Constrained Output with Ollama 0.5

This is the easiest. Define a JSON schema, pass it as format, get guaranteed output.

from ollama import Client
import json

client = Client(host="http://localhost:11434")

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "amount_usd":     {"type": "number"},
        "due_date":       {"type": "string", "format": "date"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "qty":         {"type": "integer"},
                    "unit_price":  {"type": "number"},
                },
                "required": ["description", "qty", "unit_price"],
            },
        },
    },
    "required": ["invoice_number", "amount_usd", "due_date", "line_items"],
}

resp = client.chat(
    model="llama3.2:3b-instruct-q4_K_M",
    messages=[
        {"role": "system", "content": "Extract invoice details as strict JSON."},
        {"role": "user",   "content": open("invoice_42.txt").read()},
    ],
    format=invoice_schema,
)

data = json.loads(resp["message"]["content"])

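The output is guaranteed valid against the schema, provided generation runs to completion; a response cut off by a token limit can still be truncated JSON. Note: “valid” is not “correct.” If the model hallucinates the wrong amount, the schema doesn’t save you. It saves you from parser errors only.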

Step 1, decide what you actually want enforced

Don’t enforce field semantics that the schema can’t express. “Due date must be in the future” is application logic, not a schema constraint. Keep the schema describing structure; do business validation after.
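A minimal sketch of what "business validation after" can look like for the invoice example. The function name semantic_errors and the specific rules (non-negative amount, due date not in the past, line items summing to the total) are illustrative assumptions, not rules taken from any real pipeline.

```python
from datetime import date

def semantic_errors(invoice, today):
    """Return a list of business-rule violations for an already schema-valid invoice."""
    errors = []
    if invoice["amount_usd"] < 0:
        errors.append("amount_usd is negative")
    try:
        due = date.fromisoformat(invoice["due_date"])
    except ValueError:
        # A string can look like a date and still name a day that does not exist.
        errors.append("due_date is not a real calendar date")
    else:
        if due < today:
            errors.append("due_date is in the past")
    line_total = sum(item["qty"] * item["unit_price"] for item in invoice["line_items"])
    if invoice["line_items"] and abs(line_total - invoice["amount_usd"]) > 0.01:
        errors.append("line items do not sum to amount_usd")
    return errors

good = {
    "invoice_number": "A-42",
    "amount_usd": 1450.0,
    "due_date": "2025-02-01",
    "line_items": [
        {"description": "Widget", "qty": 2, "unit_price": 700.0},
        {"description": "Sticker", "qty": 1, "unit_price": 50.0},
    ],
}
bad = dict(good, due_date="2025-02-30", amount_usd=-5.0)

good_errors = semantic_errors(good, today=date(2025, 1, 22))
bad_errors = semantic_errors(bad, today=date(2025, 1, 22))
```

Passing today in as an argument keeps the check deterministic, which makes it easy to unit test.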

Step 2, set descriptive field names

Schema field names are part of the prompt. The model uses them to figure out what goes in each slot. due_date is much better than dt2.

Step 3, use enums for closed sets

For categorical fields, use "enum": [...]. The decoder will refuse any other value.

"status": {"type": "string", "enum": ["paid", "pending", "overdue"]}

Constrained Output with vLLM

vLLM’s guided decoding lives in extra_body of the OpenAI-style request.

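import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-empty")

# invoice_schema is the dict from the Ollama example above;
# invoice_text is the raw invoice text as a string.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "Extract invoice details as strict JSON."},
        {"role": "user",   "content": invoice_text},
    ],
    # guided_json is a vLLM extension, so it goes in extra_body
    extra_body={"guided_json": invoice_schema},
)

data = json.loads(resp.choices[0].message.content)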

vLLM’s main guidance modes are:

guided_json: a JSON schema
guided_regex: a regex
guided_choice: a list of allowed strings (one-of)
guided_grammar: a context-free grammar

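Start the server with --guided-decoding-backend xgrammar for the fastest backend. That value needs a vLLM build that ships xgrammar support, which as far as I can tell arrived after 0.6.4, so check your version’s --help output before relying on it.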

Constrained Output with Outlines

Outlines is a Python library that wraps any HF or local model with a constrained-decoding engine. Use it when you need expressive constraints that neither Ollama nor vLLM support.

Step 1, install

pip install "outlines==0.1.7" "transformers==4.47.1" "torch==2.5.1"

Step 2, define a Pydantic model

from pydantic import BaseModel, conint, confloat
from typing import Literal

class LineItem(BaseModel):
    description: str
    qty: conint(ge=1, le=10000)
    unit_price: confloat(ge=0)

class Invoice(BaseModel):
    invoice_number: str
    amount_usd: confloat(ge=0)
    status: Literal["paid", "pending", "overdue"]
    line_items: list[LineItem]

Step 3, generate

import outlines

model = outlines.models.transformers(
    "meta-llama/Llama-3.2-3B-Instruct",
    device="cuda",
)
generator = outlines.generate.json(model, Invoice)

result = generator(
    "Extract: Invoice #A-42 for $1450, pending. 2x Widget @ $700, 1x Sticker @ $50."
)
print(result)  # Invoice instance, fully typed

The result is a pydantic.BaseModel instance with all the validation Pydantic gives you. This is the friendliest interface of the three.

Function Calling, Explained Honestly

Function calling on SLMs is “structured output with a calling convention.” There is no magic dispatch. The model produces JSON that names a function and its arguments, and your code looks up that function and runs it. The OpenAI API hides the convention behind an SDK, but locally you see it directly.

Approach A, use the model’s native tool template

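Llama 3.2, Qwen2.5, and Phi-3.5 each ship a tool-calling chat template. Ollama applies it automatically when you pass tools; vLLM does the same once the server is started with --enable-auto-tool-choice and a --tool-call-parser that matches the model. This is the most natural path.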

tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice",
        "description": "Look up an invoice by number.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_number": {"type": "string"}},
            "required": ["invoice_number"],
        },
    },
}]

resp = client.chat(
    model="llama3.2:3b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Look up invoice A-42 for me."}],
    tools=tools,
)

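# The model returns a tool_calls array; it can be missing or None
# when the model answers in plain text, so fall back to an empty list.
for call in resp["message"].get("tool_calls") or []:
    name = call["function"]["name"]
    args = call["function"]["arguments"]
    print(f"call {name} with {args}")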

The catch: 3B-class models are inconsistent at picking the right tool. They hallucinate functions that don’t exist. They format arguments wrong. Always validate.
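One way to "always validate" is to check every tool call against a registry before running anything. The sketch below is a hedged illustration: REGISTRY, dispatch, and the stub get_invoice are hypothetical names, and the type check is deliberately simple.

```python
def get_invoice(invoice_number: str) -> dict:
    # Hypothetical stand-in for a real lookup.
    return {"invoice_number": invoice_number, "status": "pending"}

# Tool name -> (callable, {argument name: expected Python type})
REGISTRY = {
    "get_invoice": (get_invoice, {"invoice_number": str}),
}

def dispatch(name, args):
    """Run a model-proposed tool call only if the name and arguments check out."""
    if name not in REGISTRY:
        raise ValueError(f"unknown tool: {name!r}")  # hallucinated function
    func, spec = REGISTRY[name]
    if not isinstance(args, dict):
        raise ValueError("arguments must be an object")
    missing = spec.keys() - args.keys()
    extra = args.keys() - spec.keys()
    if missing or extra:
        raise ValueError(f"bad arguments: missing={sorted(missing)}, extra={sorted(extra)}")
    for key, expected in spec.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{key} should be of type {expected.__name__}")
    return func(**args)

result = dispatch("get_invoice", {"invoice_number": "A-42"})
```

Raising on every failure mode keeps the caller's choices explicit: retry with a corrective message, fall back to a default, or surface the error.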

Approach B, force a single tool with guided JSON

When you know which function to call, skip the dispatch layer. Just constrain the output to the function’s argument schema.

resp = client.chat(
    model="llama3.2:3b-instruct-q4_K_M",
    messages=[
        {"role": "system", "content": "Extract the invoice number to look up."},
        {"role": "user",   "content": "Look up invoice A-42 for me."},
    ],
    format={
        "type": "object",
        "properties": {"invoice_number": {"type": "string"}},
        "required": ["invoice_number"],
    },
)
args = json.loads(resp["message"]["content"])
result = get_invoice(**args)

This is what I use in production. Pick the tool at the application layer based on intent classification or routing. Then constrain the output to that tool’s args. Much more reliable than letting the model pick.

Architecture

How this stitches together in a real app:

+--------------+      +-----------------+      +-----------------+
|  User input  | ---> |  Router (SLM)   | ---> |  Selected tool  |
|              |      |  guided_choice  |      |  (Python fn)    |
+--------------+      +--------+--------+      +--------+--------+
                               |                        |
                               | tool name              | result
                               v                        v
                      +-----------------+      +-----------------+
                      |  Extractor      | ---> |  Response       |
                      |  (SLM + schema  |      |  formatter      |
                      |   for that tool)|      |  (template)     |
                      +-----------------+      +-----------------+

Two SLM calls, both constrained. The router picks one of a known set; the extractor fills in the arguments. This pattern is boring and it works. Trying to do everything in one call with tool_choice="auto" is where small models fall over.
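The two-call pattern can be sketched without committing to a backend. Here constrained_call(prompt, constraint) is a hypothetical wrapper you would write around Ollama's format or vLLM's guided_choice and guided_json; fake_model stands in for the SLM so the flow is runnable, and list_overdue is an invented second tool.

```python
import json

def get_invoice(invoice_number):
    # Hypothetical tool implementation.
    return {"invoice_number": invoice_number, "status": "pending"}

def list_overdue():
    # Hypothetical second tool, included so the router has a real choice.
    return []

TOOLS = {
    "get_invoice": {
        "fn": get_invoice,
        "schema": {
            "type": "object",
            "properties": {"invoice_number": {"type": "string"}},
            "required": ["invoice_number"],
            "additionalProperties": False,
        },
    },
    "list_overdue": {
        "fn": list_overdue,
        "schema": {"type": "object", "properties": {}, "additionalProperties": False},
    },
}

def handle(user_text, constrained_call):
    # Call 1: the router may only answer with one of the known tool names.
    tool_name = constrained_call(user_text, {"choice": sorted(TOOLS)})
    if tool_name not in TOOLS:
        raise ValueError(f"router returned unknown tool {tool_name!r}")
    tool = TOOLS[tool_name]
    # Call 2: the extractor is constrained to that tool's argument schema.
    args = json.loads(constrained_call(user_text, {"json_schema": tool["schema"]}))
    return tool["fn"](**args)

def fake_model(prompt, constraint):
    """Stand-in for the SLM; a real version would send the constraint to the server."""
    if "choice" in constraint:
        return "get_invoice"
    return json.dumps({"invoice_number": "A-42"})

result = handle("Look up invoice A-42 for me.", fake_model)
```

The membership check after the router call looks redundant when the choice is constrained, but it costs one line and protects you if the constraint is ever misconfigured.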

Common Pitfalls

Confusing schema-valid with semantically correct. Constrained decoding guarantees JSON parses. It does not guarantee the values are right. Always do post-extraction validation: dates in range, IDs that exist, amounts non-negative. Fix: keep two validation layers — structural (schema) and semantic (application).
Schemas that the tokenizer can’t tokenize cleanly. Some schema patterns (very deep nesting, very long enum lists) produce token sequences the model has never seen. The constraint engine still works but the model’s output quality collapses. Fix: keep schemas flat, prefer few-fields-many-rows over one-row-many-fields.
Forgetting to lower temperature. Constrained decoding still samples. With temperature=1.0 and a constrained vocabulary, you get the right structure with random-ish content. Fix: temperature 0.0-0.3 for extraction; higher only for generation tasks.
Using additionalProperties: true accidentally. JSON Schema defaults to allowing extra fields. With constrained decoding, this lets the model invent fields you didn’t ask for. Fix: set additionalProperties: false everywhere.
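For the last pitfall, a small helper can close every object in a schema so nothing is left open by accident. This is a sketch under a simplifying assumption: it treats any dict with "type": "object" or a "properties" key as an object schema, which is fine for plain schemas like the invoice one but would misfire on a field literally named "properties" or on dicts nested inside enum or default values.

```python
import copy

def close_schema(schema):
    """Return a deep copy with additionalProperties: false on every object schema."""
    closed = copy.deepcopy(schema)

    def visit(node):
        if isinstance(node, dict):
            if node.get("type") == "object" or "properties" in node:
                # setdefault leaves any explicit setting alone.
                node.setdefault("additionalProperties", False)
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(closed)
    return closed

open_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"qty": {"type": "integer"}},
                "required": ["qty"],
            },
        },
    },
    "required": ["invoice_number", "line_items"],
}

closed = close_schema(open_schema)
```

Whether a given engine honors additionalProperties is worth verifying against its documentation; the helper only makes your intent explicit in the schema.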

Troubleshooting

Symptom: Constrained output is structurally valid but the field values are nonsense (all empty strings, repeated “null”). Diagnose: The model genuinely doesn’t know the answer and the schema is forcing it to produce something. Either the prompt is missing context, or the constraint is over-eager. Add a nullable variant: "amount_usd": {"type": ["number", "null"]}.
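If several fields need the nullable treatment, it can be applied programmatically. The helper name make_nullable is hypothetical, and the sketch assumes each targeted property declares a "type".

```python
import copy

def make_nullable(schema, fields):
    """Return a copy of schema where each named property also accepts null."""
    out = copy.deepcopy(schema)
    for name in fields:
        prop = out["properties"][name]
        types = prop["type"] if isinstance(prop["type"], list) else [prop["type"]]
        if "null" not in types:
            types = types + ["null"]
        prop["type"] = types
    return out

base = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "amount_usd": {"type": "number"},
    },
    "required": ["invoice_number", "amount_usd"],
}

relaxed = make_nullable(base, ["amount_usd"])
```

The field stays in required, so the model still has to emit the key; it just gains an honest way to say "not found" instead of inventing a number.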

Symptom: Generation is dramatically slower with constrained decoding. Diagnose: Schema compilation overhead. The first call compiles the schema to an automaton (Ollama and vLLM use xgrammar or similar). Cache the schema object across calls; don’t rebuild per request. If you’re using Outlines, hold onto the generator object.
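A sketch of the caching advice, assuming you compile one generator per schema. compile_schema is a hypothetical stand-in for the expensive step; with Outlines it would be wherever you call outlines.generate.json. The counter exists only to show how often compilation happens.

```python
import json
from functools import lru_cache

compile_calls = 0  # demo-only counter

def compile_schema(schema_json):
    """Hypothetical stand-in for the expensive schema-to-automaton compilation."""
    global compile_calls
    compile_calls += 1
    schema = json.loads(schema_json)
    fields = sorted(schema.get("properties", {}))
    return lambda prompt: {"prompt": prompt, "fields": fields}

@lru_cache(maxsize=64)
def _cached_generator(schema_json):
    return compile_schema(schema_json)

def generator_for(schema):
    # Dicts are not hashable, so key the cache on a canonical JSON string.
    return _cached_generator(json.dumps(schema, sort_keys=True))

# Same schema, different key order: compiled once, reused afterwards.
first = generator_for({"type": "object", "properties": {"x": {"type": "string"}}})
second = generator_for({"properties": {"x": {"type": "string"}}, "type": "object"})
```

Sorting keys before hashing means two dicts that differ only in key order share one cache entry.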

Symptom: Tool calling works in tool_choice="auto" but picks the wrong tool. Diagnose: Either the tool descriptions are ambiguous or you have too many tools. SLMs perform best with ≤5 tools and crisp one-line descriptions. Split the problem into routing + extraction (Approach B above).

Streaming Constrained Output

A subtle gotcha: with constrained decoding, streaming still works but partial outputs are not valid JSON until generation completes. Don’t try to incrementally parse the stream — buffer it and parse at the end.

buffer = []
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": invoice_text}],
    extra_body={"guided_json": invoice_schema},
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    buffer.append(delta)
    # render progress in UI but DO NOT parse yet

result = json.loads("".join(buffer))

If you really need progressive parsing, look at JSON Lines (NDJSON) — emit one object per line and parse line-by-line. The schema constrains each line independently. This pattern is great for batch extraction over a stream of inputs.
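A minimal sketch of the line-by-line parsing, assuming the stream yields text chunks that can split anywhere, including mid-object. The generator name iter_ndjson is illustrative.

```python
import json

def iter_ndjson(chunks):
    """Yield one parsed object per completed line from an iterable of text chunks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Only text before a newline is a complete record.
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    # Handle a final record that was not newline-terminated.
    if pending.strip():
        yield json.loads(pending)

# Chunk boundaries deliberately fall in the middle of keys and objects.
chunks = [
    '{"id": 1, "st',
    'atus": "paid"}\n{"id"',
    ': 2, "status": "pending"}\n',
    '{"id": 3, "status": "overdue"}',
]
records = list(iter_ndjson(chunks))
```

Each record becomes available as soon as its newline arrives, so a UI can render rows progressively while later ones are still being generated.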

Testing Schemas Without a Live Model

When iterating on schemas, you don’t want to wait 30 seconds per attempt. You can check that a schema is well-formed, and that sample payloads pass it, entirely offline with the jsonschema library.

import jsonschema

# Validate the schema itself is well-formed JSON Schema
jsonschema.Draft202012Validator.check_schema(invoice_schema)

# Validate a sample object
sample = {"invoice_number": "A-42", "amount_usd": 1450.0,
          "due_date": "2025-02-01", "line_items": []}
jsonschema.validate(sample, invoice_schema)

I keep schema tests in my pytest suite. They fail fast when a schema is malformed, before the model ever runs. This catches more bugs than any model-side eval.

Picking a Backend for the Three Big Engines

A short decision tree based on what I’ve shipped:

You’re already on Ollama and your schema is simple: use the format parameter. Done. Five percent overhead, no extra deps.
You’re on vLLM and need throughput: guided_json with the xgrammar backend. Lowest overhead, highest QPS.
You need regex, recursive schemas, or context-free grammars: Outlines. The expressive power is unmatched, the speed cost is real but acceptable for batch workloads.
You’re on llama.cpp directly: use GBNF grammars via --grammar-file. Powerful but the grammar language has a learning curve.

There is no universally best choice. Match the tool to the runtime.
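For the llama.cpp route, here is a small GBNF grammar written out from Python so it can be passed via --grammar-file. The grammar itself is an illustrative assumption: it accepts only a one-field object whose status is one of three values. The file name is arbitrary.

```python
import tempfile
from pathlib import Path

# Raw string so the backslashes reach the grammar file unchanged.
STATUS_GRAMMAR = r'''
root   ::= "{" ws "\"status\"" ws ":" ws status ws "}"
status ::= "\"paid\"" | "\"pending\"" | "\"overdue\""
ws     ::= [ \t\n]*
'''.lstrip()

grammar_path = Path(tempfile.mkdtemp()) / "status.gbnf"
grammar_path.write_text(STATUS_GRAMMAR, encoding="utf-8")
```

You would then point the CLI at the file, for example llama-cli -m <your-model>.gguf --grammar-file status.gbnf -p "...", where the model path is a placeholder. The learning curve comes from writing rules like ws and string escaping by hand, which JSON Schema hides from you.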

When Constrained Decoding Hurts

Constrained decoding is not free quality. There are tasks where it makes things worse, and you should know what they are.

Tasks where the model needs to “think.” Chain-of-thought reasoning produces freeform text before the final answer. Constrain only the final output, not the reasoning. A two-stage prompt — let the model think freely, then constrain a follow-up extraction — works much better than constraining the full response.
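The two-stage prompt can be sketched as two calls through a single chat wrapper. chat(messages, schema) is a hypothetical function that returns the model's text and applies constrained decoding only when schema is not None; fake_chat stands in for the model so the flow runs as written.

```python
import json

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "number"}},
    "required": ["answer"],
    "additionalProperties": False,
}

def think_then_extract(question, chat):
    # Stage 1: unconstrained, so the model can reason in free text.
    reasoning = chat(
        [{"role": "user", "content": f"{question}\nThink step by step."}],
        None,
    )
    # Stage 2: constrained, and only asked to copy out the conclusion.
    final = chat(
        [
            {"role": "system", "content": "Extract the final answer from the reasoning as JSON."},
            {"role": "user", "content": reasoning},
        ],
        ANSWER_SCHEMA,
    )
    return reasoning, json.loads(final)

def fake_chat(messages, schema):
    """Stand-in for the SLM: free text without a schema, JSON with one."""
    if schema is None:
        return "2 widgets at 700 is 1400, plus 50 for the sticker, so 1450."
    return json.dumps({"answer": 1450})

reasoning, parsed = think_then_extract("What is the invoice total?", fake_chat)
```

The second call costs extra latency, but its job is narrow enough that a small model handles it reliably.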

Tasks where the schema is too rigid for the data. If you constrain phone_number to a strict regex and the input has an international format you didn’t anticipate, the model will produce nonsense to satisfy the regex. Fix: make the constraint permissive at the model layer and validate strictly downstream.
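A sketch of "permissive at the model layer, strict downstream" for the phone number case. The model-facing schema is just a string; the strictness lives in normalize_phone, a hypothetical helper whose 7 to 15 digit rule is an assumption chosen for illustration.

```python
import re

# What the model is constrained to: any string, no regex.
PHONE_FIELD_SCHEMA = {"type": "string"}

def normalize_phone(raw):
    """Return digits with an optional leading +, or None if the value is implausible."""
    cleaned = re.sub(r"[\s().-]", "", raw)
    if re.fullmatch(r"\+?\d{7,15}", cleaned):
        return cleaned
    return None

uk = normalize_phone("+44 20 7946 0958")
us = normalize_phone("(555) 010-9999")
junk = normalize_phone("call me maybe")
```

When the strict check fails you get a clean None to handle, instead of a fabricated value that happened to fit a regex.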

Tasks where the model would rather refuse. Some prompts trigger safety refusals. Constrained decoding forces the model to produce schema-valid output anyway, which sometimes means hallucinated nonsense framed as an answer. Fix: allow a {"refused": true, "reason": "..."} shape in the schema.
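One way to express the refusal shape is a top-level anyOf with two closed object schemas. Treat this as a sketch: support for anyOf and single-value enums varies between constraint engines, so check that your backend accepts them. The names interpret, EXTRACTION, and REFUSAL are illustrative.

```python
import json

EXTRACTION = {
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}},
    "required": ["invoice_number"],
    "additionalProperties": False,
}

REFUSAL = {
    "type": "object",
    "properties": {
        "refused": {"type": "boolean", "enum": [True]},
        "reason": {"type": "string"},
    },
    "required": ["refused", "reason"],
    "additionalProperties": False,
}

# The model may produce either shape; both are valid output.
SCHEMA_WITH_REFUSAL = {"anyOf": [EXTRACTION, REFUSAL]}

def interpret(raw):
    """Split a schema-valid response into ('refused', reason) or ('ok', data)."""
    data = json.loads(raw)
    if data.get("refused") is True:
        return ("refused", data["reason"])
    return ("ok", data)

ok = interpret('{"invoice_number": "A-42"}')
refused = interpret('{"refused": true, "reason": "no invoice in the text"}')
```

Downstream code then branches on the tag rather than guessing whether an odd-looking value was a disguised refusal.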

The general principle: constrain what your downstream code consumes, not what the model thinks.

What’s Next

Structured output is the single most important capability for shipping local SLMs. It turns a probabilistic language model into a deterministic component your downstream code can rely on. Combined with the fine-tuned models from the previous post, you get extraction pipelines that hit 95%+ accuracy on tasks you’d previously have paid GPT-4 for. The next post stitches everything together into a local RAG system, where structured output and retrieval multiply each other’s value. See the Outlines documentation for the full library reference.
