Content Moderation for LLMs with Llama Guard 3.2
TL;DR — Llama Guard 3.2 ships in 1B and 8B variants, both trained specifically to classify LLM inputs and outputs against a configurable safety taxonomy. Run the 1B for cheap input pre-filtering, the 8B for output review, customize the taxonomy for your domain, and overlap moderation with generation to stay under 200ms p99.
Most LLM products eventually hit the same problem. Generic alignment is too coarse (it refuses things you want to allow and allows things you want to refuse), and rolling your own classifier from scratch is more work than the team has time for. Llama Guard 3.2 is the practical 2025 answer for a moderation layer that’s specific enough to be useful and general enough to ship without a research team.
Llama Guard 3.2 (released by Meta in early 2025) is a fine-tuned Llama 3.2 model whose job is to take a conversation and output a safety classification. It comes in 1B and 8B parameter sizes, both with the same taxonomy schema and prompt format. The 1B variant runs cheaply on CPU or a small GPU; the 8B fits on a single H100 and keeps up with whatever request rate you can feed it.
This tutorial walks through deploying it under vLLM 0.6, customizing the taxonomy, integrating with a real chat service, and dealing with the latency tax. I’m assuming you’ve read the prompt injection defenses post because moderation is one layer in the broader defense, not the whole story.
1. What Llama Guard Does
Llama Guard takes a conversation (one or more user and assistant turns) plus a taxonomy of unsafe content categories, and outputs whether the conversation is safe, and if not, which categories were violated.
input:
  taxonomy of unsafe content (your categories)
  + conversation [user, assistant, ...]
output:
  "safe"
  or
  "unsafe\nS1,S3"   # categories violated
This is more useful than a generic refusal classifier because:
- The taxonomy is configurable per call. You can use stricter rules for a children’s product and looser rules for an adult creative writing tool.
- The output includes the category, so you can route different violations to different responses (silent block, user warning, escalation to human review).
- It’s trained specifically for conversational context, so “the assistant said X” gets classified differently from “the user said X.”
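Since the output is only one or two lines of text, it helps to parse it into a structure once and use that everywhere. A minimal sketch follows; `Verdict` and `parse_verdict` are names I made up for illustration, and treating anything other than a clean `safe` as unsafe is my own fail-closed assumption, not something the model mandates.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    safe: bool
    categories: list[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse a Llama Guard completion such as 'safe' or 'unsafe\\nS1,S3'."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        # Assumption: an empty completion is treated as unsafe (fail closed).
        return Verdict(safe=False)
    if lines[0].lower() == "safe":
        return Verdict(safe=True)
    # Anything else counts as unsafe; the second line, if present,
    # carries the comma-separated category codes.
    categories: list[str] = []
    if len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return Verdict(safe=False, categories=categories)
```

Returning an empty category list for a bare `unsafe` keeps the caller's routing code from having to special-case a missing second line.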
2. Step 1, Deployment with vLLM
vLLM 0.6 serves Llama Guard 3.2 the same way it serves any Llama model. The OpenAI-compatible API makes integration trivial.
# Pull the model (assumes HF access granted)
huggingface-cli download meta-llama/Llama-Guard-3-2-8B \
  --local-dir /models/llama-guard-3-2-8b
# Launch vLLM. --served-model-name sets the "model" value clients must send;
# without it vLLM exposes the model under its --model path.
docker run --gpus=all -p 8000:8000 \
  -v /models:/models \
  vllm/vllm-openai:v0.6.4 \
  --model /models/llama-guard-3-2-8b \
  --served-model-name llama-guard \
  --tokenizer meta-llama/Llama-Guard-3-2-8B \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --enable-prefix-caching
Prefix caching matters here. Most of the input is the taxonomy, which doesn’t change between requests. Prefix caching reuses the KV cache for the taxonomy block, saving 50-80% of the per-request compute.
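To sanity-check how much of a prompt can be reused, you can measure the shared token prefix between two consecutive requests. The sketch below is only an estimate of the upper bound on reusable prefill. `cacheable_fraction` is a hypothetical helper, and the `block_size=16` default is my assumption about the cache's block granularity, so check it against your vLLM configuration.

```python
def cacheable_fraction(prev: list[int], new: list[int], block_size: int = 16) -> float:
    """Estimate the fraction of `new`'s prompt tokens reusable from `prev`.

    Both arguments are token-id lists for full prompts. Only tokens that
    match from the very first position count, because a prefix cache cannot
    reuse anything after the first differing token.
    """
    shared = 0
    for a, b in zip(prev, new):
        if a != b:
            break
        shared += 1
    # Assumption: the cache reuses whole blocks only, so round down.
    reusable = (shared // block_size) * block_size
    return reusable / len(new) if new else 0.0
```

With an 800-token taxonomy followed by a 200-token conversation, this returns 0.8, which lines up with the upper end of the savings range above. If the first token of the taxonomy changes, it returns 0.0.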
2.1 The 1B variant for the input path
For pre-filtering user input cheaply, deploy the 1B variant on CPU or a small GPU:
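# Assumes the 1B weights were downloaded to /models/llama-guard-3-2-1b.
# The stock vllm/vllm-openai image targets CUDA, so --device cpu may need
# an image built with vLLM's CPU backend instead.
docker run -p 8001:8000 \
  -v /models:/models \
  vllm/vllm-openai:v0.6.4 \
  --model /models/llama-guard-3-2-1b \
  --served-model-name llama-guard \
  --device cpu \
  --max-model-len 4096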
CPU inference on a modern x86 box does about 30 RPS at 100ms p99. Fine for input pre-filtering at most scales. Reserve the 8B GPU instance for output review where accuracy matters more.
3. Step 2, The Prompt Format
Llama Guard expects a specific prompt structure. The Hugging Face tokenizer ships a chat template that handles it, but it helps to understand what’s being sent.
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violent Crimes.
S2: Non-Violent Crimes.
...
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: what's the recipe for napalm?
<END CONVERSATION>
Provide your safety assessment for ONLY THE LAST User message in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The model completes with unsafe\nS1 or similar.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-2-8B")

def build_prompt(conversation: list[dict]) -> str:
    # Render the conversation through the model's own chat template
    return tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True,
    )

prompt = build_prompt([
    {"role": "user", "content": "How do I get into my locked car?"},
])
The chat template builds the right format. Don’t construct it yourself.
4. Step 3, Custom Taxonomies
The default taxonomy covers physical harm, sexual content, defamation, weapons, and a few other categories. Your product almost certainly has rules outside that list (no mention of competitors, no medical advice, no legal advice, etc.).
Custom categories slot into the prompt:
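CUSTOM_CATEGORIES = """
S14: Competitor Mentions.
S15: Medical Advice.
S16: Legal Advice.
S17: Financial Advice Outside Approved Topics.
"""

END_MARKER = "<END UNSAFE CONTENT CATEGORIES>"

def build_custom_prompt(conversation, role="User"):
    # role is unused here; the chat template decides which role to assess
    # from the conversation itself.
    body = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    if END_MARKER not in body:
        raise ValueError("chat template output has no category block to extend")
    # Surgical injection of custom categories just before the end marker
    return body.replace(END_MARKER, CUSTOM_CATEGORIES.lstrip("\n") + END_MARKER, 1)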
The model handles custom categories surprisingly well in my testing. Calibrate carefully: a custom category with a vague description will produce noisy classifications.
4.1 In-context examples for tricky categories
For categories where the boundary is fuzzy, append a few examples:
S14: Competitor Mentions.
Examples of unsafe:
- "You should switch to CompetitorX."
- "CompetitorY does this better."
Examples of safe:
- General discussion of the industry.
- Naming competitors in a comparison the user explicitly requested.
The chat template doesn’t natively support this, so you’ll need to format it into your prompt manually.
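One way to keep that manual formatting consistent is to render each category from a small data structure instead of hand-editing strings. This is a sketch under my own assumptions: `Category`, `render_category`, and `render_taxonomy` are hypothetical names, and the layout simply mirrors the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    code: str                      # e.g. "S14"
    name: str                      # e.g. "Competitor Mentions"
    unsafe_examples: list[str] = field(default_factory=list)
    safe_examples: list[str] = field(default_factory=list)


def render_category(cat: Category) -> str:
    """Render one category, with optional example lists, as prompt text."""
    lines = [f"{cat.code}: {cat.name}."]
    if cat.unsafe_examples:
        lines.append("Examples of unsafe:")
        lines += [f"- {ex}" for ex in cat.unsafe_examples]
    if cat.safe_examples:
        lines.append("Examples of safe:")
        lines += [f"- {ex}" for ex in cat.safe_examples]
    return "\n".join(lines)


def render_taxonomy(categories: list[Category]) -> str:
    """Join rendered categories into one block ending with a newline."""
    return "\n".join(render_category(c) for c in categories) + "\n"
```

The string returned by `render_taxonomy` can stand in for `CUSTOM_CATEGORIES` in the injection step from the previous section. Keeping categories as data also makes it easier to diff taxonomy changes in review.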
5. Step 4, Integration in a Chat Pipeline
Here’s the realistic integration. We run Llama Guard 1B on the input pre-emptively, the LLM generates, Llama Guard 8B reviews the output, the user sees the response only if it passes.
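import httpx
import asyncio

GUARD_1B = "http://localhost:8001/v1/chat/completions"
GUARD_8B = "http://localhost:8000/v1/chat/completions"

async def moderate(messages, model_url, role_to_check):
    # Send the raw conversation: vLLM's chat endpoint applies the model's own
    # chat template server-side, so pre-templating here would wrap the prompt
    # twice. role_to_check ("User" or "Agent") documents intent; the template
    # assesses the last turn in messages. A hand-built prompt string (for
    # example one with custom categories) would go to /v1/completions as
    # "prompt" instead.
    async with httpx.AsyncClient(timeout=2.0) as client:
        r = await client.post(model_url, json={
            "model": "llama-guard",  # must match the name the vLLM server exposes
            "messages": messages,
            "temperature": 0.0,
            "max_tokens": 20,
        })
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"].strip()
    if text.startswith("safe"):
        return {"safe": True, "categories": []}
    parts = text.split("\n", 1)
    categories = parts[1].split(",") if len(parts) > 1 else []
    return {"safe": False, "categories": [cat.strip() for cat in categories]}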
async def chat(user_message, conversation):
    # REFUSAL_MESSAGES (a dict keyed by category), GENERIC_REFUSAL,
    # llm_complete and log_unsafe_output are application-specific.
    # Input gate
    in_check = await moderate(
        conversation + [{"role": "user", "content": user_message}],
        GUARD_1B, "User",
    )
    if not in_check["safe"]:
        # Fall back to the generic refusal when no category, or an
        # unmapped one, comes back
        cats = in_check["categories"]
        return REFUSAL_MESSAGES.get(cats[0], GENERIC_REFUSAL) if cats else GENERIC_REFUSAL
    # Generate
    response = await llm_complete(conversation, user_message)
    # Output gate
    out_check = await moderate(
        conversation + [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": response},
        ],
        GUARD_8B, "Agent",
    )
    if not out_check["safe"]:
        log_unsafe_output(response, out_check["categories"])
        return GENERIC_REFUSAL
    return response
The input check uses 1B (fast, cheap). The output check uses 8B (more accurate, where it matters).
5.1 Overlapping moderation with generation
For latency, run output moderation in parallel with streaming. Buffer chunks of N tokens, moderate the rolling output, release to the user only after moderation passes:
generation:  [tok1][tok2][tok3][tok4][tok5][tok6][tok7][tok8]
                       |                       |                 |
                       v                       v                 v
moderation:        [batch1]                [batch2]          [batch3]
                       |                       |                 |
                       v                       v                 v
to user:           [tok1-4]                [tok5-8]          [tok...]
The user sees response chunks 200-500ms behind real generation but the system can hard-stop a bad output before it leaves the building.
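The buffering logic can be sketched as an async generator that sits between the LLM stream and the client. This is an illustration under assumptions, not a production implementation: `moderated_stream` is a hypothetical name, `is_safe` stands in for a call to the 8B guard on the rolling output, and `chunk_size=4` just mirrors the diagram.

```python
from typing import AsyncIterator, Awaitable, Callable


async def moderated_stream(
    tokens: AsyncIterator[str],
    is_safe: Callable[[str], Awaitable[bool]],
    chunk_size: int = 4,
) -> AsyncIterator[str]:
    """Yield text in chunks, each released only after moderation passes."""
    released = ""  # text already checked and sent to the user
    pending = ""   # text generated but not yet checked
    count = 0
    async for tok in tokens:
        pending += tok
        count += 1
        if count % chunk_size == 0:
            # Moderate the rolling output, not just the new chunk, so
            # content split across chunk boundaries is still seen whole.
            if not await is_safe(released + pending):
                return  # hard stop: nothing further is released
            yield pending
            released += pending
            pending = ""
    # Flush the final partial chunk, again only if it passes.
    if pending and await is_safe(released + pending):
        yield pending
```

While this generator awaits a moderation call, the LLM server keeps producing tokens, which is where the overlap comes from. Chunks released before a failure have already reached the user, so the caller still needs to decide what to show when the stream ends early.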
6. Step 5, Evaluation
Don’t ship a moderation layer without an eval harness. You need a labeled set of “should be allowed” and “should be blocked” examples for your domain.
import asyncio
import json
from sklearn.metrics import precision_recall_fscore_support

async def run_eval(path="eval-set.jsonl"):
    with open(path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    results = []
    for case in cases:
        check = await moderate(case["conversation"], GUARD_8B, case["role"])
        results.append({
            "expected_unsafe": case["label"] == "unsafe",
            "predicted_unsafe": not check["safe"],
        })
    y_true = [r["expected_unsafe"] for r in results]
    y_pred = [r["predicted_unsafe"] for r in results]
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
    return p, r, f1

if __name__ == "__main__":
    # await is not valid at module top level, so drive the loop explicitly
    asyncio.run(run_eval())
Run this in CI on every taxonomy change. A taxonomy edit that drops recall below your threshold blocks the deploy, because it lets more unsafe content through. A taxonomy edit that drops precision produces alerts but typically doesn’t block (you’d rather over-block than under-block during iteration).
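The CI gate itself can be a few lines that turn metrics into an exit code. The sketch below uses a hypothetical `gate` function, and the thresholds you pass in are yours to choose; which metric blocks and which only warns is a policy decision, so both are plain parameters here.

```python
import sys


def gate(
    metrics: dict[str, float],
    block_below: dict[str, float],
    warn_below: dict[str, float],
) -> int:
    """Return 1 if any blocking floor is violated, else 0.

    Violations of warn_below floors are printed but do not change the
    exit code.
    """
    exit_code = 0
    for name, floor in block_below.items():
        if metrics[name] < floor:
            print(f"BLOCK: {name}={metrics[name]:.3f} < {floor:.3f}", file=sys.stderr)
            exit_code = 1
    for name, floor in warn_below.items():
        if metrics[name] < floor:
            print(f"WARN: {name}={metrics[name]:.3f} < {floor:.3f}", file=sys.stderr)
    return exit_code
```

In a CI script you would pass the result to `sys.exit(...)` after computing precision and recall, putting the metric your policy treats as blocking in `block_below` and the other in `warn_below`.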
7. Architectural Diagram
        user
          |
          v
   +--------------+
   |    input     |--- Llama Guard 1B (CPU)
   |  pre-filter  |    timeout 200ms
   +------+-------+
          |
          v
   +--------------+
   |     LLM      |--- streaming
   +------+-------+
          |
          v
   +--------------+
   |    output    |--- Llama Guard 8B (GPU)
   |  moderation  |    chunked, overlap
   +------+-------+
          |
          v
        user
8. Common Pitfalls
Four common mistakes.
8.1 Using Llama Guard as your only defense
Llama Guard catches content issues in user inputs and model outputs. It doesn’t catch policy violations (the model citing a fact it shouldn’t have access to), prompt injection (the user manipulating the model into bypassing instructions), or system-level issues (the model making an authorized but harmful tool call). It’s one layer in a stack.
8.2 Treating “unsafe” as binary
The model returns a category. Your product probably wants different responses to different categories (refuse silently for some, warn the user for others, escalate to human review for the worst). Build the routing.
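A routing table can be as simple as a dict from category code to action, plus a rule for conversations that violate more than one category. Everything here is illustrative: the `Action` names, the category-to-action mapping, and the severity ordering are my assumptions, not a recommended policy.

```python
from enum import Enum


class Action(Enum):
    SILENT_BLOCK = "silent_block"
    WARN_USER = "warn_user"
    HUMAN_REVIEW = "human_review"


# Assumed severity order, least to most severe.
_SEVERITY = [Action.SILENT_BLOCK, Action.WARN_USER, Action.HUMAN_REVIEW]

# Hypothetical mapping; replace with your product's policy.
ROUTES = {
    "S1": Action.HUMAN_REVIEW,
    "S14": Action.SILENT_BLOCK,
    "S15": Action.WARN_USER,
}
DEFAULT_ACTION = Action.SILENT_BLOCK


def route(categories: list[str]) -> Action:
    """Pick the most severe action among the violated categories."""
    actions = [ROUTES.get(c, DEFAULT_ACTION) for c in categories]
    if not actions:
        # 'unsafe' with no category listed: fall back to the default.
        return DEFAULT_ACTION
    return max(actions, key=_SEVERITY.index)
```

Taking the most severe action means a message that trips both a low-stakes and a high-stakes category is never handled more leniently than the high-stakes one alone.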
8.3 Pinning the wrong model size for your traffic shape
If 99% of your traffic is benign and 1% is adversarial, paying for an 8B input filter is wasteful. If your traffic is high-risk by domain (adult content, financial advice, medical questions), the 1B will miss too much. Calibrate to your domain.
8.4 Forgetting the safety bypass tax
Llama Guard 3.2 is harder to jailbreak than its predecessor, but motivated attackers will find phrasings that pass. Don’t assume it’s a perfect filter; treat it as one signal in a layered system.
9. Troubleshooting
Three common failure modes.
9.1 Latency spikes on cold prefixes
If you change the taxonomy mid-traffic, prefix caching invalidates and you see latency spikes. Roll taxonomy changes gradually; pin to a known taxonomy version per request if you must change frequently.
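One way to roll a taxonomy change gradually is to bucket requests deterministically and send only a percentage to the new version. This is a sketch with assumed names: `pick_taxonomy_version`, the version labels, and the use of a request ID as the bucketing key are all hypothetical.

```python
import hashlib


def pick_taxonomy_version(
    request_id: str,
    stable: str,
    canary: str,
    canary_percent: int,
) -> str:
    """Deterministically assign a request to the stable or canary taxonomy.

    The same request_id always maps to the same version, so retries of a
    request are classified against the taxonomy it started with.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return canary if bucket < canary_percent else stable
```

Raising `canary_percent` in steps (for example 1, 10, 50, 100) shifts traffic to the new prefix a slice at a time instead of invalidating the cache for every request at once.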
9.2 The 1B variant misclassifying technical content
The 1B variant sometimes flags benign technical questions (security research, legitimate medical conversations, etc.) as unsafe. If your domain is heavily technical, upgrade to 8B for the input path or accept higher false positive rates and tune your refusal UX accordingly.
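To decide between those two options, it helps to measure the false positive rate per slice of benign traffic and not just overall. The sketch below assumes each eval result carries a `tag` field (for example "security" or "medical"); that field and the function name are my additions for illustration.

```python
from collections import defaultdict


def false_positive_rate_by_tag(results: list[dict]) -> dict[str, float]:
    """Compute the share of benign cases flagged unsafe, per tag.

    Each result needs: "tag" (str), "expected_unsafe" (bool),
    "predicted_unsafe" (bool). Cases labeled unsafe are ignored.
    """
    benign = defaultdict(int)
    flagged = defaultdict(int)
    for r in results:
        if r["expected_unsafe"]:
            continue
        benign[r["tag"]] += 1
        if r["predicted_unsafe"]:
            flagged[r["tag"]] += 1
    return {tag: flagged[tag] / n for tag, n in benign.items()}
```

Running the same eval set through the 1B and the 8B and comparing the per-tag rates shows whether the extra cost buys a meaningful drop for your technical slices.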
9.3 vLLM OOM after a load test
vLLM’s KV cache sizing is conservative by default but can be tuned beyond what the GPU can support. If you see CUDA OOM after a burst, lower --max-num-seqs and --gpu-memory-utilization. The 8B model on an H100 80GB comfortably handles 64 concurrent sequences at 8K context with --gpu-memory-utilization 0.85.
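A rough back-of-envelope for the KV cache budget helps when picking those values. The geometry below (32 layers, 8 KV heads, head dimension 128, 16-bit cache, about 16 GB of weights) is an assumption borrowed from a typical 8B Llama configuration, so check it against the model's config file. The estimate also ignores activation memory and other runtime overhead.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2 covers keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(
    gpu_bytes: float,
    utilization: float,
    weight_bytes: float,
    per_token_bytes: int,
) -> int:
    """Upper bound on tokens the KV cache can hold within the memory budget."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // per_token_bytes))


# Assumed geometry for an 8B Llama-style model (not verified for this model).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget_tokens = max_cached_tokens(80e9, 0.85, 16e9, per_token)
worst_case_tokens = 64 * 8192  # every sequence at full context
```

Under these assumptions the budget comes out somewhat below the worst case of 64 sequences at a full 8K each. The headroom in practice comes from moderation prompts usually being far shorter than the context limit. If your conversations routinely approach 8K tokens, that is a reason to lower --max-num-seqs.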
10. Wrapping Up
Llama Guard 3.2 is the moderation layer that finally feels practical to deploy without a research team. The 1B variant covers cheap input filtering, the 8B variant covers more accurate output review, custom taxonomies make it adaptable to your domain, and vLLM serving keeps the operational story familiar.
The places I see teams trip up: treating it as a complete solution instead of a layer, not building an eval set, not measuring false positive rates on real traffic, and shipping taxonomy changes without testing them. Each of those is fixable with the patterns above.
For more reading, the Llama Guard model card covers the supported categories in detail, and the vLLM docs describe the serving knobs. My next post in this series, Supply Chain Security for AI Models, covers how you make sure the moderation model itself is the one you think you’re running.