Triage Automation with LLMs and Zendesk: A Hands-On Tutorial
November 14, 2025 · 10 min read · by Muhammad Amal · programming

TL;DR — Don’t have an LLM auto-respond to customers. Have it classify, suggest, and pre-fill, and let agents one-click apply. The accuracy bar for full automation is unreachable; the bar for agent assist is easy and the ROI is huge.

Triage is where every support org loses time. A senior engineer reads the ticket, figures out it’s a billing question, reroutes it, and that’s three minutes gone. Multiply that by the number of misrouted tickets in a day and you’ve got a meaningful slice of your team’s capacity going to a task an LLM can do in under a second for a fraction of a cent.

This tutorial builds a working triage automation pipeline for Zendesk. It’s not the “AI agent that resolves tickets” vendor demo; that’s a fantasy at any volume above ten tickets a day. It’s a system that classifies, suggests routing, drafts a first response, and surfaces relevant knowledge base articles, all in a Zendesk sidebar that the agent reviews before anything goes to the customer. The accuracy bar for that is achievable. The risk profile is bounded.

I’ll target Python 3.12, OpenAI’s gpt-4o-mini for the bulk of classification work and gpt-4o for the harder synthesis, and Zendesk’s Apps Framework for the agent-facing UI. If your org standardized on Claude, swap gpt-4o for claude-3.7-sonnet with the Anthropic SDK 0.42; the structure doesn’t change. You’ll want to have read the earlier piece on RAG systems for technical support teams because the retrieval layer here builds on that.

What “triage” actually means

When a ticket lands, five things have to happen before an agent can do real work:

  1. Classify the issue (billing, technical, account, feature request).
  2. Identify the product area (auth, ingest, dashboard, billing).
  3. Assess urgency (does the stated impact warrant P1, P2, P3).
  4. Route to the right queue.
  5. Retrieve relevant context (similar past tickets, KB articles, customer history).

A human takes two to five minutes to do that. An LLM does it in under a second. The catch is that step 3, urgency, is the one humans are still better at because it requires reading subtle social signals. We’ll have the model do steps 1, 2, and 5 and propose the routing for step 4, and we’ll present urgency as a suggestion that the human ratifies.

   New ticket
        |
        v
   +----------+      +-------------+
   | classify | ---> | route       |
   | (1, 2, 3)|      | (4)         |
   +----+-----+      +------+------+
        |                   |
        v                   v
   +----------+      +-------------+
   | retrieve | ---> | draft reply |
   | (5)      |      | + KB links  |
   +----------+      +------+------+
                            |
                            v
                     +-------------+
                     | agent sees  |
                     | sidebar w/  |
                     | one click   |
                     +-------------+

That topology keeps the LLM as a recommender. The agent is always in the loop. We’ll build it.

Step 1, the classifier

Start with classification. Structured outputs (OpenAI’s response_format with JSON schema, or Anthropic’s tool use with strict schemas) make this nearly trivial. The prompt is the entire system.

import os
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

oai = OpenAI()

class Triage(BaseModel):
    category: Literal["billing", "technical", "account", "feature_request", "other"]
    product_area: Literal["auth", "ingest", "dashboard", "billing", "api", "mobile", "unknown"]
    urgency_signal: Literal["low", "medium", "high"]
    customer_sentiment: Literal["calm", "frustrated", "angry"]
    suggested_priority: Literal["p1", "p2", "p3", "p4"]
    reasoning: str = Field(description="One sentence, why these labels")

SYSTEM = """You are a triage classifier for an enterprise SaaS support team.
Classify the ticket using only the provided fields. Use 'unknown' or 'other'
when the signal is genuinely ambiguous; do not guess to fill in a confident
answer. Urgency_signal is derived from explicit statements of impact, not
from tone. Customer_sentiment is derived from tone, not from priority."""

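def triage_ticket(subject: str, body: str, account_tier: str) -> Triage:
    """Classify one ticket into the Triage schema using structured outputs."""
    user = f"Account tier: {account_tier}\nSubject: {subject}\n\nBody:\n{body}"
    resp = oai.beta.chat.completions.parse(
        model="gpt-4o-mini",
        temperature=0,  # keep labels as repeatable as the API allows
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
        ],
        response_format=Triage,  # the SDK derives the JSON schema from the model
    )
    parsed = resp.choices[0].message.parsed
    if parsed is None:
        # parsed is None when the model refuses; fail loudly instead of returning None
        raise ValueError(f"No structured output for ticket: {subject!r}")
    return parsed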

Two specific design choices. First, urgency and sentiment are separated. Customers can be angry about a P3 issue, and they can be calm about a P1 outage; conflating the two leads to noisy priority assignment. Second, the model has an explicit “unknown” escape hatch. Forcing classification on ambiguous tickets produces overconfident wrong answers; letting the model say “I don’t know” lets the routing layer fall back to a human.
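The fallback to a human can be sketched as a small routing function. The queue names and the ROUTES table below are hypothetical placeholders, not anything Zendesk defines. What matters is that the escape-hatch labels, and any combination nobody has mapped yet, never resolve to a specialist queue.

```python
# Hypothetical queue names; replace them with your own Zendesk groups.
HUMAN_TRIAGE_QUEUE = "human_triage"

ROUTES = {
    ("billing", "billing"): "billing_support",
    ("technical", "auth"): "identity_team",
    ("technical", "ingest"): "pipeline_team",
    ("account", "auth"): "identity_team",
}


def suggest_queue(category: str, product_area: str) -> str:
    """Return a suggested queue, or the human triage queue when unsure."""
    # The escape-hatch labels always go to a person.
    if category == "other" or product_area == "unknown":
        return HUMAN_TRIAGE_QUEUE
    # Unmapped combinations also fall back to a person.
    return ROUTES.get((category, product_area), HUMAN_TRIAGE_QUEUE)
```

The result is still only a suggestion for the sidebar; the agent decides whether to apply it.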

Step 2, evaluating the classifier before you ship

Classification accuracy you can’t measure is classification accuracy you can’t trust. Build a labeled set of 500 to 1000 real tickets with hand-graded labels. Run accuracy by class, not aggregate.

from collections import defaultdict

def evaluate(eval_set: list[dict]) -> dict:
    confusion = defaultdict(lambda: defaultdict(int))
    correct_by_class = defaultdict(int)
    total_by_class = defaultdict(int)
    for ex in eval_set:
        pred = triage_ticket(ex["subject"], ex["body"], ex["account_tier"])
        total_by_class[ex["category"]] += 1
        confusion[ex["category"]][pred.category] += 1
        if pred.category == ex["category"]:
            correct_by_class[ex["category"]] += 1
    per_class = {
        c: correct_by_class[c] / total_by_class[c]
        for c in total_by_class
    }
    overall = sum(correct_by_class.values()) / sum(total_by_class.values())
    return {"overall": overall, "per_class": per_class, "confusion": dict(confusion)}

The number to publish is per-class accuracy, not the aggregate. Aggregate hides the fact that the model nails “billing” at 97% and chokes on “feature_request” at 65%. If “feature_request” is 5% of your volume, the aggregate looks great while roughly a third of your feature requests get misrouted.
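To see how much the aggregate hides, here is the arithmetic with made-up counts that match those percentages: a two-class eval set where billing is 95% of the volume. The counts are illustrative only, not measurements.

```python
# Illustrative counts only: 95% billing, 5% feature requests.
total = {"billing": 1900, "feature_request": 100}
correct = {"billing": 1843, "feature_request": 65}  # 97% and 65% accuracy

per_class = {c: correct[c] / total[c] for c in total}
overall = sum(correct.values()) / sum(total.values())

print(round(overall, 3))                       # 0.954
print(round(per_class["feature_request"], 2))  # 0.65
```

An aggregate of 95.4% would pass most dashboards, while 35 of every 100 feature requests land in the wrong place.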

Set a deployment gate. Each category needs at least 90% accuracy before you ship. If “other” is over 15% of the eval set, you need more granular categories. If “unknown” product area is over 20%, your prompt isn’t giving the model enough product context.
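A gate like this is easy to encode so it runs in CI rather than in someone's head. The sketch below assumes you pass in the per-class accuracies from the evaluation plus two label tallies you collect yourself (category counts and product area counts over the eval set). The function name and argument shapes are my own choices, not part of any library.

```python
def check_gate(
    per_class: dict[str, float],
    category_counts: dict[str, int],
    product_area_counts: dict[str, int],
    min_accuracy: float = 0.90,
    max_other_share: float = 0.15,
    max_unknown_share: float = 0.20,
) -> list[str]:
    """Return human-readable gate failures; an empty list means ship."""
    failures = []
    # Every category must clear the accuracy floor on its own.
    for cls, acc in sorted(per_class.items()):
        if acc < min_accuracy:
            failures.append(f"{cls}: accuracy {acc:.0%} is below {min_accuracy:.0%}")

    # Too many 'other' labels means the category list is too coarse.
    n_cat = sum(category_counts.values())
    if n_cat and category_counts.get("other", 0) / n_cat > max_other_share:
        failures.append("'other' is too large a share; split it into finer categories")

    # Too many 'unknown' areas means the prompt lacks product context.
    n_area = sum(product_area_counts.values())
    if n_area and product_area_counts.get("unknown", 0) / n_area > max_unknown_share:
        failures.append("'unknown' product area is too common; add product context to the prompt")
    return failures
```

Failing the build on a non-empty list keeps a prompt tweak that quietly hurts one class from reaching agents.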

Step 3, retrieval for context

The classifier knows what kind of ticket it is. The retriever knows what similar tickets and KB articles look like. Wire them together.

from qdrant_client import QdrantClient, models
qdrant = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_KEY"])

def retrieve_context(triage: Triage, subject: str, body: str, k: int = 5):
    query = f"{subject}\n\n{body}"
    q_vec = oai.embeddings.create(
        model="text-embedding-3-large",
        input=[query],
    ).data[0].embedding
    filters = []
    if triage.product_area != "unknown":
        filters.append(models.FieldCondition(
            key="product_area",
            match=models.MatchAny(any=[triage.product_area]),
        ))
    if triage.category == "billing":
        filters.append(models.FieldCondition(
            key="source",
            match=models.MatchAny(any=["zendesk_kb", "billing_runbook"]),
        ))
    flt = models.Filter(must=filters) if filters else None
    hits = qdrant.query_points(
        collection_name="support_kb_v3",
        query=q_vec,
        limit=k,
        query_filter=flt,  # query_points takes query_filter, not filter
        with_payload=True,
    )
    return hits.points

Two filter rules. If the classifier identified a product area, restrict retrieval to that area. If the category is “billing”, restrict to KB and billing-specific runbooks (not ticket archive; billing tickets contain too much account-specific data to be safely retrieved across customers).

Step 4, drafting the agent-visible reply

The draft is a suggestion, never an auto-send. The prompt makes that explicit and includes constraints on what the model is allowed to commit to.

DRAFT_SYSTEM = """You are drafting a suggested first response for a human
support agent to review. The agent will edit before sending. Constraints:
- Never quote prices, timelines, or feature commitments.
- Cite KB articles by URL when you reference them.
- If the retrieved context doesn't answer the question, draft a clarifying
  question instead of guessing.
- Tone: professional, concise, no apologies for things that aren't our fault.
- Maximum 200 words."""

def draft_reply(triage: Triage, subject: str, body: str, context_hits) -> str:
    ctx_block = "\n\n".join(
        f"[KB {i}] {h.payload.get('url', '')}\n{h.payload['text'][:600]}"
        for i, h in enumerate(context_hits)
    )
    user = (
        f"Category: {triage.category}\n"
        f"Product area: {triage.product_area}\n"
        f"Suggested priority: {triage.suggested_priority}\n\n"
        f"Customer subject: {subject}\n"
        f"Customer body:\n{body}\n\n"
        f"Retrieved context:\n{ctx_block}"
    )
    resp = oai.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,
        messages=[
            {"role": "system", "content": DRAFT_SYSTEM},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

The “never quote prices, timelines, or feature commitments” constraint matters. Without it, the model will helpfully tell the customer “this should be resolved within 24 hours” and now you have a contractual problem. Constraint-based prompts are how you keep the LLM useful and safe at the same time.
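A prompt constraint is a request, not a guarantee, so it can help to lint the draft before it reaches the sidebar. The sketch below is a heuristic of my own, not part of the pipeline above. The two regexes catch obvious currency amounts and "within N hours" style timelines, will miss other phrasings, and only flag text for the agent rather than blocking anything.

```python
import re

# Heuristic patterns only; they miss many phrasings and may flag harmless ones.
COMMITMENT_PATTERNS = {
    "price": re.compile(r"[$€£]\s?\d"),
    "timeline": re.compile(
        r"\b(?:within|in)\s+\d+\s+(?:minutes?|hours?|days?|weeks?)\b",
        re.IGNORECASE,
    ),
}
MAX_WORDS = 200  # mirrors the word limit in the drafting prompt


def lint_draft(draft: str) -> list[str]:
    """Return the names of the constraints the draft appears to violate."""
    flags = [name for name, pat in COMMITMENT_PATTERNS.items() if pat.search(draft)]
    if len(draft.split()) > MAX_WORDS:
        flags.append("too_long")
    return flags


print(lint_draft("This should be resolved within 24 hours."))  # ['timeline']
```

Showing the flags next to the draft gives the agent a specific thing to check before inserting it.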

Step 5, the Zendesk integration

The Zendesk Apps Framework lets you ship a sidebar app that appears in the agent’s ticket view. The app calls your triage service, gets the structured output, and renders it.

// app.js, runs in the Zendesk sidebar
const client = ZAFClient.init();

client.on('app.registered', async () => {
  const ticketData = await client.get(['ticket.id', 'ticket.subject',
                                        'ticket.description', 'ticket.requester.organization']);
  const tier = ticketData['ticket.requester.organization']?.customField?.tier || 'business';

  const response = await fetch(`${BACKEND_URL}/triage`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-Service-Token': SERVICE_TOKEN },
    body: JSON.stringify({
      ticket_id: ticketData['ticket.id'],
      subject: ticketData['ticket.subject'],
      body: ticketData['ticket.description'],
      account_tier: tier,
    }),
  });

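  if (!response.ok) {
    // Leave the sidebar empty rather than rendering a broken payload.
    console.error(`triage request failed: ${response.status}`);
    return;
  }
  const triage = await response.json();
  renderSidebar(triage);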
});

function renderSidebar(triage) {
  document.getElementById('category').textContent = triage.category;
  document.getElementById('priority').textContent = triage.suggested_priority;
  document.getElementById('reasoning').textContent = triage.reasoning;
  document.getElementById('draft').value = triage.draft_reply;
  document.getElementById('apply-routing').onclick = () => applyRouting(triage);
  document.getElementById('insert-draft').onclick = () => insertDraft(triage.draft_reply);
}

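// Zendesk priorities are low, normal, high, urgent. This p1..p4 mapping is an
// assumption; adjust it to your own priority policy.
const PRIORITY_MAP = { p1: 'urgent', p2: 'high', p3: 'normal', p4: 'low' };

async function applyRouting(triage) {
  await client.invoke('ticket.tags.add', `area_${triage.product_area}`);
  await client.invoke('ticket.tags.add', `cat_${triage.category}`);
  await client.set('ticket.priority', PRIORITY_MAP[triage.suggested_priority]);
}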

async function insertDraft(text) {
  await client.invoke('ticket.editor.insert', text);
}

Two one-click actions: apply the routing tags and priority, or insert the draft into the reply editor. The agent reviews everything before the customer sees it. The latency from ticket open to sidebar populated should be under two seconds; if it’s longer, agents will ignore it.
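The sidebar reads the labels and a draft_reply field from a single JSON response, so the backend has to compose the three steps and keep an eye on that two-second budget. Here is a minimal sketch of that composition, with the steps passed in as plain callables so it runs without any API keys. It assumes the classifier returns a dict (for example the Pydantic model dumped to a dict); the function name and the elapsed_ms and over_budget fields are my own additions.

```python
import time
from typing import Callable


def build_sidebar_payload(
    subject: str,
    body: str,
    account_tier: str,
    classify: Callable[[str, str, str], dict],
    retrieve: Callable[[dict, str, str], list],
    draft: Callable[[dict, str, str, list], str],
    budget_ms: float = 2000.0,
) -> dict:
    """Run classify -> retrieve -> draft and return one JSON-ready dict."""
    start = time.perf_counter()
    triage = classify(subject, body, account_tier)
    hits = retrieve(triage, subject, body)
    reply = draft(triage, subject, body, hits)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {
        **triage,                        # category, product_area, and so on
        "draft_reply": reply,            # the field the sidebar puts in the draft box
        "elapsed_ms": round(elapsed_ms, 1),
        "over_budget": elapsed_ms > budget_ms,
    }
```

Logging over_budget per request tells you how often agents are staring at an empty sidebar.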

The Zendesk Apps Framework reference at developer.zendesk.com covers the available client APIs and the build pipeline.

Step 6, rollout pattern

Don’t enable this for the whole team on day one. The rollout that has worked for me:

Week 1, shadow mode. Backend runs on every new ticket, results are logged to a database, agents see nothing. You compare the LLM’s classification to what the agents actually did, by hand, on a sample of 200 tickets a day.

Week 2, opt-in beta. Five agents see the sidebar. They give feedback via a simple form (“was the category right? was the priority right? was the draft useful?”). You iterate on prompts based on their feedback.

Week 3-4, expanded beta. Half the team. Track classification accuracy in production, agent satisfaction via a weekly survey, time-to-first-response as the proxy ROI metric.

Week 5, default on. Everyone gets the sidebar by default. Add an “off” button per agent so anyone who doesn’t trust it can disable. Track who disables and why.

Ongoing. Weekly review of misclassified tickets. Monthly prompt update. Quarterly eval refresh with new labels.

This pattern works because it surfaces the failure modes in low-stakes settings before exposing them to production. The “shadow mode” week alone has caught more issues than any pre-launch eval set I’ve ever built.

Common Pitfalls

Auto-applying routing without agent review. The temptation is huge. The risk is also huge. One misclassification at scale means hundreds of mis-routed tickets in a day. Always require a human click for the first six months, then revisit with real production data.

Using a single model for everything. Classification is a cheap task; use gpt-4o-mini. Synthesis is harder; use gpt-4o or claude-3.7-sonnet. Mixing them right can cut your bill by around 5x without a noticeable quality loss; confirm both numbers against your own eval set.

Skipping the customer sentiment signal. A “calm” P3 and an “angry” P3 deserve different agent attention. Surface sentiment as a tag in the sidebar but never expose it to the customer. Agents will appreciate the heads-up.

Letting the prompt drift without versioning. Every prompt change is a behavioral change. Check prompts into git, tag them with versions, log which version was used on every triage. When something breaks, you can find the diff.
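One low-effort way to log the version is to derive it from the prompt text itself, so a change can't slip through without a new identifier. This is a sketch of that idea, meant to supplement git tags rather than replace them; the record fields are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_version(prompt: str) -> str:
    """Short content hash; it changes whenever the prompt text changes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def triage_log_record(ticket_id: int, prompt: str, model: str, labels: dict) -> str:
    """Build one JSON log line tying a triage result to its prompt version."""
    record = {
        "ticket_id": ticket_id,
        "prompt_version": prompt_version(prompt),
        "model": model,
        "labels": labels,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

When accuracy moves, grouping the log by prompt_version shows whether the change lines up with a prompt edit.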

Treating eval accuracy as static. Customer language evolves, product features change, error codes get renamed. Your eval set from January is stale by June. Refresh it quarterly with fresh production samples.
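When you refresh, sample per category rather than uniformly, or the rare classes end up with too few examples to measure. A minimal sketch, assuming each production ticket is a dict with a "category" key holding the agent's final label (a hypothetical shape):

```python
import random
from collections import defaultdict


def stratified_sample(tickets: list[dict], per_category: int, seed: int = 0) -> list[dict]:
    """Pick up to per_category tickets from each category, reproducibly."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ticket in tickets:
        by_category[ticket["category"]].append(ticket)

    sample = []
    for category in sorted(by_category):
        group = by_category[category]
        # Small categories contribute everything they have.
        sample.extend(rng.sample(group, min(per_category, len(group))))
    return sample
```

The sampled tickets still need hand grading; this only decides which ones are worth a grader's time.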

Troubleshooting

Symptom, classification accuracy drops suddenly without a prompt change. Almost always a corpus shift. A new product launched, a marketing campaign brought in different customer segments, or the support form added a field that changed how questions are phrased. Pull the last week of misclassified tickets, find the common pattern, and update the prompt with examples.
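Finding the common pattern usually starts with counting which label got mistaken for which. A small sketch, assuming each misclassified ticket is a dict with "expected" and "predicted" keys (hypothetical field names from your review export):

```python
from collections import Counter


def top_confusions(misclassified: list[dict], n: int = 3) -> list[tuple[tuple[str, str], int]]:
    """Return the n most frequent (expected, predicted) pairs with counts."""
    pairs = Counter((t["expected"], t["predicted"]) for t in misclassified)
    return pairs.most_common(n)


week = [
    {"expected": "feature_request", "predicted": "technical"},
    {"expected": "feature_request", "predicted": "technical"},
    {"expected": "account", "predicted": "billing"},
]
print(top_confusions(week, n=1))  # [(('feature_request', 'technical'), 2)]
```

Read the tickets behind the top pair first; that is where the new prompt examples should come from.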

Symptom, agents ignore the sidebar suggestions. Either the suggestions are wrong too often (fix accuracy) or the latency is too high (fix backend) or the suggestions are too generic (improve retrieval context). Sample five agents, watch them work for an hour each, and ask them when they used the sidebar and when they didn’t. The answer will be specific.

Symptom, the draft reply mentions features that don’t exist. Hallucination, almost always because the retrieved context didn’t contain a relevant document and the model made something up. Tighten the prompt to require explicit grounding (“if no retrieved document supports the answer, draft a clarifying question instead”), and check that your retrieval filter isn’t too restrictive (which leads to empty context and freewheeling generation).
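One way to handle an over-restrictive filter is to retry with the narrowing filters dropped while keeping the ones that exist for safety, such as a source restriction on billing tickets. The sketch below assumes a search callable that takes a list of filter conditions and returns hits. The split into required and optional filters is my own framing, not something the vector store provides.

```python
from typing import Callable


def retrieve_with_fallback(
    search: Callable[[list], list],
    required: list,
    optional: list,
    min_hits: int = 2,
) -> tuple[list, bool]:
    """Search with all filters, then relax the optional ones if results are thin.

    Returns (hits, grounded). grounded is False when nothing came back at all,
    which is the cue to draft a clarifying question instead of an answer.
    """
    hits = search(required + optional)
    if len(hits) < min_hits and optional:
        # Relax only the narrowing filters; safety filters stay on.
        hits = search(required)
    return hits, bool(hits)
```

Passing grounded into the drafting step lets the prompt switch to a clarifying question instead of generating from empty context.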

Wrapping Up

LLM triage isn’t a magic wand and it’s not a research project. It’s plumbing. Classify, retrieve, draft, present, let the human apply. The ROI shows up as a fifteen to forty percent reduction in time-to-first-response and a meaningful drop in misrouted tickets. The accuracy bar is reachable, the rollout pattern is well understood, and the risk profile is bounded when you keep humans in the loop.

Next in this series we widen the lens to measurement. How do you know your support engineering function is actually getting better, and which metrics reliably tell that story without driving the team to game them?