Structured Output and Function Calling on Local SLMs
TL;DR: Use constrained decoding for structured output on local SLMs. Ollama 0.5’s format parameter, vLLM’s guided decoding, or Outlines all work. Function calling is structured output with a calling convention layered on top. Don’t pretend it’s magic; validate everything.
I’ve fine-tuned an extractor with LoRA and I’ve watched a beautifully fine-tuned 3B model hallucinate a trailing comma in production. Fine-tuning improves structural compliance, but it doesn’t guarantee it. If your downstream parser breaks on malformed JSON, you need a guarantee. That’s what constrained decoding gives you.
Constrained decoding rewrites the decoder’s logits at each step so that only tokens consistent with your schema are sampled. If the schema says the next character must be {, only tokens starting with { get probability mass. The model can’t produce invalid output because the decoder won’t let it.
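To make the masking step concrete, here is a toy sketch in plain Python. The five-token vocabulary, the scores, and the "must start with {" rule are all made up for illustration; real engines compile the schema into an automaton and apply the mask over the full tokenizer vocabulary at every step.

```python
import math

def mask_logits(logits, vocab, is_allowed):
    """Return a copy of logits with disallowed tokens set to -inf."""
    return [
        score if is_allowed(token) else -math.inf
        for token, score in zip(vocab, logits)
    ]

def softmax(scores):
    """Softmax that tolerates -inf entries (at least one score must be finite)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and raw scores; unconstrained, the model prefers chatty prose.
vocab = ["Sure", "{", '{"', "Here", "["]
logits = [3.0, 1.0, 0.5, 2.0, 0.1]

# Hypothetical constraint: the output must begin with an opening brace.
masked = mask_logits(logits, vocab, lambda tok: tok.startswith("{"))
probs = softmax(masked)

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
```

Only the two brace-prefixed tokens keep any probability, and their relative ranking is still the model's own. The constraint removes options; it does not choose among the ones that remain.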
This post covers the three main ways to do constrained decoding on local SLMs in January 2025, plus how to layer function calling on top. Each has tradeoffs.
The Three Tools
The space is consolidating but not consolidated. Here’s what works:
Tool       Where it lives       Speed cost  Schema language
---------  -------------------  ----------  -----------------------
Ollama     Built into 0.5+      ~5%         JSON Schema
vLLM       Built into 0.6+      ~3%         JSON Schema, regex
Outlines   Standalone library   ~10%        JSON Schema, CFG, regex
llama.cpp  Built-in (grammar)   ~5%         GBNF grammar
The right pick depends on what you’re already running. If Ollama is your serving layer, use its format. If vLLM, use guided_json. If you have unusual schema needs (recursive structures, regex constraints), Outlines is the most expressive.
Constrained Output with Ollama 0.5
This is the easiest. Define a JSON schema, pass it as format, get guaranteed output.
from ollama import Client
import json

client = Client(host="http://localhost:11434")

# Structure only: field names, types, and which fields are required.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "amount_usd": {"type": "number"},
        "due_date": {"type": "string", "format": "date"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "qty": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "qty", "unit_price"],
            },
        },
    },
    "required": ["invoice_number", "amount_usd", "due_date", "line_items"],
}

# Read the raw invoice text; the context manager closes the file.
with open("invoice_42.txt") as f:
    invoice_text = f.read()

resp = client.chat(
    model="llama3.2:3b-instruct-q4_K_M",
    messages=[
        {"role": "system", "content": "Extract invoice details as strict JSON."},
        {"role": "user", "content": invoice_text},
    ],
    format=invoice_schema,  # constrain decoding to this schema
)
data = json.loads(resp["message"]["content"])
The output is guaranteed valid against the schema. Note: “valid” is not “correct.” If the model hallucinates the wrong amount, the schema doesn’t save you. It saves you from parser errors only.
Step 1, decide what you actually want enforced
Don’t enforce field semantics that the schema can’t express. “Due date must be in the future” is application logic, not a schema constraint. Keep the schema describing structure; do business validation after.
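A minimal sketch of that second, post-schema layer. The function name and the exact rules are illustrative assumptions; the point is that these checks run on an already schema-valid dict and report problems rather than trying to encode them in the schema.

```python
from datetime import date

def semantic_errors(invoice, today):
    """Return business-rule violations for a schema-valid invoice dict."""
    errors = []

    # The schema only says due_date is a string, so parse it defensively.
    try:
        due = date.fromisoformat(invoice["due_date"])
    except (TypeError, ValueError):
        errors.append("due_date is not an ISO date")
    else:
        if due <= today:
            errors.append("due_date must be in the future")

    if invoice["amount_usd"] < 0:
        errors.append("amount_usd must be non-negative")

    return errors

sample = {
    "invoice_number": "A-42",
    "amount_usd": 1450.0,
    "due_date": "2025-02-01",
    "line_items": [],
}
print(semantic_errors(sample, today=date(2025, 1, 15)))
```

Passing today in as an argument, instead of calling date.today() inside, keeps the check deterministic in tests.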
Step 2, set descriptive field names
Schema field names are part of the prompt. The model uses them to figure out what goes in each slot. due_date is much better than dt2.
Step 3, use enums for closed sets
For categorical fields, use "enum": [...]. The decoder will refuse any other value.
"status": {"type": "string", "enum": ["paid", "pending", "overdue"]}
Constrained Output with vLLM
vLLM’s guided decoding lives in extra_body of the OpenAI-style request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-empty")

# invoice_schema is the schema from the Ollama example;
# invoice_text holds the raw invoice text.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "Extract invoice details as strict JSON."},
        {"role": "user", "content": invoice_text},
    ],
    extra_body={"guided_json": invoice_schema},
)
vLLM supports three guidance modes:
- guided_json: a JSON schema
- guided_regex: a regex
- guided_choice: a list of allowed strings (one-of)
Start the server with --guided-decoding-backend xgrammar for the fastest backend in 0.6.4.
Constrained Output with Outlines
Outlines is a Python library that wraps any HF or local model with a constrained-decoding engine. Use it when you need expressive constraints that neither Ollama nor vLLM support.
Step 1, install
pip install "outlines==0.1.7" "transformers==4.47.1" "torch==2.5.1"
Step 2, define a Pydantic model
from pydantic import BaseModel, conint, confloat
from typing import Literal

class LineItem(BaseModel):
    description: str
    qty: conint(ge=1, le=10000)
    unit_price: confloat(ge=0)

class Invoice(BaseModel):
    invoice_number: str
    amount_usd: confloat(ge=0)
    status: Literal["paid", "pending", "overdue"]
    line_items: list[LineItem]
Step 3, generate
import outlines

model = outlines.models.transformers(
    "meta-llama/Llama-3.2-3B-Instruct",
    device="cuda",
)

# Build the generator once; it holds the compiled constraint.
generator = outlines.generate.json(model, Invoice)
result = generator(
    "Extract: Invoice #A-42 for $1450, pending. 2x Widget @ $700, 1x Sticker @ $50."
)
print(result)  # Invoice instance, fully typed
The result is a pydantic.BaseModel instance with all the validation Pydantic gives you. This is the friendliest interface of the three.
Function Calling, Explained Honestly
Function calling on SLMs is “structured output with a calling convention.” There is no magic dispatch. The model produces JSON that names a function and its arguments, and your code looks up that function and runs it. The OpenAI API hides the convention behind an SDK, but locally you see it directly.
Approach A, use the model’s native tool template
Llama 3.2, Qwen2.5, and Phi-3.5 each ship a tool-calling chat template. Ollama and vLLM apply it automatically when you pass tools. This is the most natural path.
tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice",
        "description": "Look up an invoice by number.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_number": {"type": "string"}},
            "required": ["invoice_number"],
        },
    },
}]

# client is the ollama Client from the first example, not the OpenAI one
resp = client.chat(
    model="llama3.2:3b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Look up invoice A-42 for me."}],
    tools=tools,
)

# The model returns a tool_calls array; it can be missing or None
# when the model answers in plain text instead.
for call in resp["message"].get("tool_calls") or []:
    name = call["function"]["name"]
    args = call["function"]["arguments"]
    print(f"call {name} with {args}")
The catch: 3B-class models are inconsistent at picking the right tool. They hallucinate functions that don’t exist. They format arguments wrong. Always validate.
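One way to do that validation is a small registry check before anything runs. This is a sketch under assumptions: the TOOLS registry layout, the dispatch helper, and the stand-in get_invoice are all hypothetical, and the type check is deliberately shallow (a full JSON Schema validator would be stricter).

```python
def get_invoice(invoice_number):
    """Stand-in tool for illustration; a real one would query a database."""
    return {"invoice_number": invoice_number, "status": "pending"}

# Hypothetical registry: each tool's callable plus its required argument types.
TOOLS = {
    "get_invoice": {"fn": get_invoice, "required": {"invoice_number": str}},
}

def dispatch(name, args):
    """Validate a model-proposed tool call, then run it. Raises ValueError if invalid."""
    spec = TOOLS.get(name)
    if spec is None:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    missing = [k for k in spec["required"] if k not in args]
    extra = [k for k in args if k not in spec["required"]]
    if missing or extra:
        raise ValueError(f"bad arguments: missing={missing} extra={extra}")

    for key, expected in spec["required"].items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{key} must be {expected.__name__}")

    return spec["fn"](**args)

print(dispatch("get_invoice", {"invoice_number": "A-42"}))
```

Raising on a bad call rather than guessing gives the application a clear place to retry, fall back, or ask the user.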
Approach B, force a single tool with guided JSON
When you know which function to call, skip the dispatch layer. Just constrain the output to the function’s argument schema.
# client is the ollama Client from the first example
resp = client.chat(
    model="llama3.2:3b-instruct-q4_K_M",
    messages=[
        {"role": "system", "content": "Extract the invoice number to look up."},
        {"role": "user", "content": "Look up invoice A-42 for me."},
    ],
    format={
        "type": "object",
        "properties": {"invoice_number": {"type": "string"}},
        "required": ["invoice_number"],
    },
)
args = json.loads(resp["message"]["content"])
# get_invoice is your own Python function for this tool
result = get_invoice(**args)
This is what I use in production. Pick the tool at the application layer based on intent classification or routing. Then constrain the output to that tool’s args. Much more reliable than letting the model pick.
Architecture
How this stitches together in a real app:
+--------------+ +-----------------+ +-----------------+
| User input | ---> | Router (SLM) | ---> | Selected tool |
| | | guided_choice | | (Python fn) |
+--------------+ +--------+--------+ +--------+--------+
| |
| tool name | result
v v
+-----------------+ +-----------------+
| Extractor | ---> | Response |
| (SLM + schema | | formatter |
| for that tool)| | (template) |
+-----------------+ +-----------------+
Two SLM calls, both constrained. The router picks one of a known set; the extractor fills in the arguments. This pattern is boring and it works. Trying to do everything in one call with tool_choice="auto" is where small models fall over.
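The diagram can be sketched as a small function that takes the two model calls as parameters. Everything named here (run_pipeline, the tools dict layout, the stubbed router and extractor) is a hypothetical illustration; in a real app choose_tool would wrap a guided_choice request and extract_args a schema-constrained one.

```python
def run_pipeline(user_input, tools, choose_tool, extract_args):
    """Route to one tool, extract its arguments, run it.

    tools: dict of tool name -> {"fn": callable, "schema": JSON Schema dict}
    choose_tool(user_input, names) -> one of names (constrained router call)
    extract_args(user_input, schema) -> dict (constrained extractor call)
    Returns (tool_name, tool_result).
    """
    names = sorted(tools)
    name = choose_tool(user_input, names)
    if name not in tools:  # defensive: a constrained router should never hit this
        raise ValueError(f"router returned unknown tool: {name!r}")
    args = extract_args(user_input, tools[name]["schema"])
    return name, tools[name]["fn"](**args)

# Stubs standing in for the two SLM calls so the sketch runs without a server.
def fake_router(text, names):
    return "get_invoice"

def fake_extractor(text, schema):
    return {"invoice_number": "A-42"}

tools = {
    "get_invoice": {
        "fn": lambda invoice_number: {"invoice_number": invoice_number, "status": "pending"},
        "schema": {
            "type": "object",
            "properties": {"invoice_number": {"type": "string"}},
            "required": ["invoice_number"],
        },
    },
}

name, result = run_pipeline("Look up invoice A-42 for me.", tools, fake_router, fake_extractor)
print(name, result)
```

Injecting the model calls keeps the control flow testable on its own, which is most of the appeal of the boring pattern.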
Common Pitfalls
- Confusing schema-valid with semantically correct. Constrained decoding guarantees that the JSON parses. It does not guarantee the values are right. Always do post-extraction validation: dates in range, IDs that exist, amounts non-negative. Fix: keep two validation layers, structural (schema) and semantic (application).
- Schemas that the tokenizer can’t tokenize cleanly. Some schema patterns (very deep nesting, very long enum lists) produce token sequences the model has never seen. The constraint engine still works but the model’s output quality collapses. Fix: keep schemas flat, prefer few-fields-many-rows over one-row-many-fields.
- Forgetting to lower temperature. Constrained decoding still samples. With temperature=1.0 and a constrained vocabulary, you get the right structure with random-ish content. Fix: temperature 0.0-0.3 for extraction; higher only for generation tasks.
- Using additionalProperties: true accidentally. JSON Schema defaults to allowing extra fields. With constrained decoding, this lets the model invent fields you didn’t ask for. Fix: set additionalProperties: false everywhere.
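For the last pitfall, setting the flag by hand on every nested object is easy to forget. A small helper can do it; this is a sketch that only walks the common schema-bearing keywords (properties, items, anyOf/oneOf/allOf, $defs), and close_schema is a made-up name, not part of any library.

```python
import copy

def close_schema(schema):
    """Return a deep copy with additionalProperties: false on every object schema."""
    closed = copy.deepcopy(schema)

    def visit(node):
        if not isinstance(node, dict):
            return
        if node.get("type") == "object" or "properties" in node:
            # setdefault keeps any value the author set explicitly.
            node.setdefault("additionalProperties", False)
        for key in ("properties", "$defs", "definitions"):
            for child in node.get(key, {}).values():
                visit(child)
        for key in ("anyOf", "oneOf", "allOf", "prefixItems"):
            for child in node.get(key, []):
                visit(child)
        visit(node.get("items"))

    visit(closed)
    return closed

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"qty": {"type": "integer"}},
            },
        },
    },
}
closed = close_schema(schema)
print(closed["additionalProperties"], closed["properties"]["line_items"]["items"]["additionalProperties"])
```

Because it uses setdefault, an explicit additionalProperties: true survives; swap in a plain assignment if you want the helper to override it.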
Troubleshooting
Symptom: Constrained output is structurally valid but the field values are nonsense (all empty strings, repeated “null”).
Diagnose: The model genuinely doesn’t know the answer and the schema is forcing it to produce something. Either the prompt is missing context, or the constraint is over-eager. Add a nullable variant: "amount_usd": {"type": ["number", "null"]}.
Symptom: Generation is dramatically slower with constrained decoding.
Diagnose: Schema compilation overhead. The first call compiles the schema into a grammar or automaton (vLLM can use xgrammar; Ollama builds a llama.cpp grammar). Reuse the same schema across calls instead of rebuilding it per request, so the engine can reuse its compiled form. If you’re using Outlines, hold onto the generator object.
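One way to hold onto compiled generators is a small cache keyed by the schema's canonical JSON. This is a sketch: make_cached_compiler is a hypothetical helper, and compile_schema stands in for whatever expensive call your engine exposes (with Outlines, something along the lines of building a generate.json generator from the schema).

```python
import json
from functools import lru_cache

def make_cached_compiler(compile_schema, maxsize=32):
    """Wrap an expensive compile function so each distinct schema compiles once."""

    @lru_cache(maxsize=maxsize)
    def _compile(schema_key):
        return compile_schema(json.loads(schema_key))

    def get(schema):
        # Canonical JSON makes equal schemas share a cache entry
        # regardless of dict key order.
        return _compile(json.dumps(schema, sort_keys=True))

    return get

# Stand-in for a slow compile step, counting how often it actually runs.
compile_calls = []

def slow_compile(schema):
    compile_calls.append(schema)
    return ("compiled", tuple(sorted(schema["properties"])))

get_generator = make_cached_compiler(slow_compile)
a = get_generator({"type": "object", "properties": {"x": {}, "y": {}}})
b = get_generator({"properties": {"y": {}, "x": {}}, "type": "object"})
print(len(compile_calls))
```

Dicts are not hashable, which is why the cache key is the serialized schema and not the dict itself.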
Symptom: Tool calling works in tool_choice="auto" but picks the wrong tool.
Diagnose: Either the tool descriptions are ambiguous or you have too many tools. SLMs perform best with ≤5 tools and crisp one-line descriptions. Split the problem into routing + extraction (Approach B above).
Streaming Constrained Output
A subtle gotcha: with constrained decoding, streaming still works but partial outputs are not valid JSON until generation completes. Don’t try to incrementally parse the stream — buffer it and parse at the end.
buffer = []
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": invoice_text}],
    extra_body={"guided_json": invoice_schema},
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    buffer.append(delta)
    # render progress in UI but DO NOT parse yet

# Parse once, after the stream has finished.
result = json.loads("".join(buffer))
If you really need progressive parsing, look at JSON Lines (NDJSON) — emit one object per line and parse line-by-line. The schema constrains each line independently. This pattern is great for batch extraction over a stream of inputs.
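A minimal line-by-line parser for that pattern might look like the following. It assumes the backend emits each object on a single line with a newline between objects; iter_ndjson is a made-up helper name, and the chunks list stands in for streamed deltas.

```python
import json

def iter_ndjson(chunks):
    """Yield one parsed object per complete line from an iterable of text chunks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        # A chunk may complete zero, one, or several lines.
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    if pending.strip():  # final line without a trailing newline
        yield json.loads(pending)

# Chunk boundaries deliberately fall mid-object, as they do in real streams.
chunks = ['{"id": 1}\n{"i', 'd": 2}\n', '{"id": 3}']
rows = list(iter_ndjson(chunks))
print(rows)
```

Splitting on raw newlines is safe for valid JSON because newlines inside strings are always escaped.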
Testing Schemas Without a Live Model
When iterating on schemas, you don’t want to wait 30 seconds per attempt. You can check that a schema is well-formed, and that sample objects satisfy it, entirely offline with the jsonschema package.
import jsonschema

# Validate the schema itself is well-formed JSON Schema
jsonschema.Draft202012Validator.check_schema(invoice_schema)

# Validate a sample object
sample = {
    "invoice_number": "A-42",
    "amount_usd": 1450.0,
    "due_date": "2025-02-01",
    "line_items": [],
}
jsonschema.validate(sample, invoice_schema)
I keep schema tests in my pytest suite. They fail fast when a schema is malformed, before the model ever runs. This catches more bugs than any model-side eval.
Picking a Backend for the Three Big Engines
A short decision tree based on what I’ve shipped:
- You’re already on Ollama and your schema is simple: use the format parameter. Done. Five percent overhead, no extra deps.
- You’re on vLLM and need throughput: guided_json with the xgrammar backend. Lowest overhead, highest QPS.
- You need regex, recursive schemas, or context-free grammars: Outlines. The expressive power is unmatched; the speed cost is real but acceptable for batch workloads.
- You’re on llama.cpp directly: use GBNF grammars via --grammar-file. Powerful, but the grammar language has a learning curve.
There is no universally best choice. Match the tool to the runtime.
When Constrained Decoding Hurts
Constrained decoding is not free quality. There are tasks where it makes things worse, and you should know what they are.
Tasks where the model needs to “think.” Chain-of-thought reasoning produces freeform text before the final answer. Constrain only the final output, not the reasoning. A two-stage prompt — let the model think freely, then constrain a follow-up extraction — works much better than constraining the full response.
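The two-stage prompt can be sketched with the model calls passed in as functions. The names think_then_extract, generate_free, and generate_constrained are hypothetical, and the prompt wording is only an example; the stubs at the bottom replace real model calls so the flow can be run as-is.

```python
def think_then_extract(question, schema, generate_free, generate_constrained):
    """Stage 1: unconstrained reasoning. Stage 2: constrained extraction.

    generate_free(prompt) -> str
    generate_constrained(prompt, schema) -> dict
    Returns (reasoning_text, extracted_dict).
    """
    reasoning = generate_free(
        f"{question}\n\nThink step by step, then state your final answer."
    )
    extraction_prompt = (
        "Extract the final answer from the reasoning below as JSON.\n\n"
        f"Question: {question}\n\nReasoning:\n{reasoning}"
    )
    return reasoning, generate_constrained(extraction_prompt, schema)

# Stubs standing in for the two model calls.
def fake_free(prompt):
    return "2 widgets at $700 is $1400, plus a $50 sticker. Final answer: 1450."

def fake_constrained(prompt, schema):
    # A real call would decode under the schema; here we just check the
    # reasoning was forwarded and return a fixed answer.
    assert "Final answer: 1450" in prompt
    return {"amount_usd": 1450.0}

answer_schema = {
    "type": "object",
    "properties": {"amount_usd": {"type": "number"}},
    "required": ["amount_usd"],
}
reasoning, answer = think_then_extract(
    "What is the invoice total?", answer_schema, fake_free, fake_constrained
)
print(answer)
```

Only the second call's output reaches downstream code, so only the second call needs the schema.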
Tasks where the schema is too rigid for the data. If you constrain phone_number to a strict regex and the input has an international format you didn’t anticipate, the model will produce nonsense to satisfy the regex. Fix: make the constraint permissive at the model layer and validate strictly downstream.
Tasks where the model would rather refuse. Some prompts trigger safety refusals. Constrained decoding forces the model to produce schema-valid output anyway, which sometimes means hallucinated nonsense framed as an answer. Fix: allow a {"refused": true, "reason": "..."} shape in the schema.
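A sketch of that refusal escape hatch, built by wrapping the real schema in an anyOf. The helper names are made up, and whether top-level anyOf is supported depends on your constraint backend, so treat that as an assumption to verify before relying on it.

```python
# Refusal shape from the text: {"refused": true, "reason": "..."}
REFUSAL_SCHEMA = {
    "type": "object",
    "properties": {
        "refused": {"type": "boolean", "enum": [True]},
        "reason": {"type": "string"},
    },
    "required": ["refused", "reason"],
    "additionalProperties": False,
}

def with_refusal(schema):
    """Wrap a schema so the model may return either an answer or a refusal."""
    return {"anyOf": [schema, REFUSAL_SCHEMA]}

def is_refusal(data):
    """True when a decoded object uses the refusal shape."""
    return isinstance(data, dict) and data.get("refused") is True

lookup_schema = {
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}},
    "required": ["invoice_number"],
}
guarded = with_refusal(lookup_schema)

for decoded in ({"invoice_number": "A-42"}, {"refused": True, "reason": "not an invoice"}):
    print("refusal" if is_refusal(decoded) else "answer", decoded)
```

Downstream code then branches on is_refusal before touching any answer fields.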
The general principle: constrain what your downstream code consumes, not what the model thinks.
What’s Next
Structured output is the single most important capability for shipping local SLMs. It turns a probabilistic language model into a deterministic component your downstream code can rely on. Combined with the fine-tuned models from the previous post, you get extraction pipelines that hit 95%+ accuracy on tasks you’d previously have paid GPT-4 for. The next post stitches everything together into a local RAG system, where structured output and retrieval multiply each other’s value. See the Outlines documentation for the full library reference.