Auto-Triaging PMO Tickets With n8n and OpenAI, Lessons From Three Months In


May 16, 2023 · 8 min read · by Muhammad Amal

TL;DR:

LLM triage works well for label suggestion and severity estimation, and badly for assignee routing without explicit team-context priming.
Ask for structured JSON and validate it on receipt (function calling will simplify this once it ships); never parse free-form text in production.
Always keep a human gate on anything that closes, reassigns, or de-prioritizes a ticket, at least until you’ve watched it run for a quarter.

In late February I rigged up a quick n8n workflow that piped every new Jira ticket into GPT-3.5 with a prompt asking it to suggest labels and a priority. It was a Friday-afternoon experiment. Three months later, a refined version of that workflow is in production at a thirty-person engineering org, handling roughly 200 incoming tickets a week.

This post is the honest writeup. What worked, what didn’t, what almost caused a real outage, and where I’d recommend you start if you want to try the same thing on your own backlog. It builds on patterns from the Jira REST API v3 post and assumes you’ve got a self-hosted n8n with webhook ingestion humming.

A few notes before I get into it. First, “auto-triage” is doing a lot of work in the title — what I’m actually describing is “AI-assisted triage with explicit human gates.” Pure auto-triage with no human review is irresponsible at the current state of the technology, and I’ll explain why. Second, this is all GPT-4 (gpt-4-0314) as of May 2023. The Code Interpreter beta and ChatGPT plugins from March don’t factor in here — this is plain API. Third, the cost math has changed twice in the last year. Run your own numbers.

The shape of the workflow

The high-level flow is simple:

1. Jira webhook fires on jira:issue_created.
2. n8n verifies the signature and pulls the issue.
3. Strip ADF to plain text. Trim attachments and noise.
4. Send a prompt to GPT-4 with the issue summary, description, and a list of valid labels for the project.
5. Receive a structured response: suggested labels, severity, suggested team, confidence score.
6. If confidence is high, apply suggested labels via Jira API. Always set the severity field.
7. If confidence is low or the suggestion involves reassignment, post a comment with the suggestion and tag the on-call triage human.

The trick is steps 4 through 7. Everything before that is plumbing covered in earlier posts.
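The decision in steps 6 and 7 fits in one small function. What follows is a minimal sketch under assumptions, not the production workflow: the 0.8 threshold, the action names (`set_severity`, `apply_labels`, `comment_for_human`), and the rule that a ticket with no current team does not count as a reassignment are all mine for illustration.

```javascript
// Hypothetical threshold; tune it against your own validation set.
const CONFIDENCE_THRESHOLD = 0.8;

// Decide which actions the workflow may take on its own.
// `triage` is the parsed model response; `currentTeam` is the team the
// ticket belongs to right now, or null if it has none (assumption).
function planActions(triage, currentTeam) {
  // Step 6: the severity field is always set.
  const actions = [{ type: 'set_severity', value: triage.severity }];

  const highConfidence = triage.confidence >= CONFIDENCE_THRESHOLD;
  const involvesReassignment =
    currentTeam != null && triage.suggested_team !== currentTeam;

  // Step 6: labels are applied automatically only at high confidence.
  if (highConfidence) {
    actions.push({ type: 'apply_labels', value: triage.labels });
  }
  // Step 7: low confidence or any reassignment goes to a human via comment.
  if (!highConfidence || involvesReassignment) {
    actions.push({ type: 'comment_for_human', value: triage });
  }
  return actions;
}
```

Returning a list of planned actions, rather than calling Jira directly, keeps the gate easy to unit test and easy to log.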

Function calling, not prompt-then-parse

Function calling only arrives with gpt-4-0613 in June, so as of May 2023 you’re working with gpt-4-0314 and a structured prompt. The pattern that’s worked for me is to ask for JSON explicitly and validate hard on receipt.

A prompt I’ve iterated to:

const systemPrompt = `You are a senior engineering manager triaging incoming tickets.
Respond ONLY with a JSON object matching this exact schema, no markdown, no prose:

{
  "labels": string[],            // subset of provided valid labels
  "severity": "p0" | "p1" | "p2" | "p3",
  "suggested_team": string,      // one of: backend, frontend, mobile, platform, data, security
  "confidence": number,          // 0.0 to 1.0
  "reasoning": string            // one sentence, max 200 chars
}

Valid labels: ${validLabels.join(', ')}
Severity definitions:
  p0 = production down or data loss
  p1 = significant degradation, paying customer impact
  p2 = bug or feature request, no immediate impact
  p3 = nice-to-have, minor polish

If you are uncertain, lower the confidence score. Do not invent labels.`;

const userPrompt = `Title: ${issue.summary}\n\nDescription: ${issue.description}\n\nReporter: ${issue.reporter.email}`;

const { data } = await openai.post('/chat/completions', {
  model: 'gpt-4-0314',
  temperature: 0.2,
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userPrompt },
  ],
});

const raw = data.choices[0].message.content;
const parsed = JSON.parse(raw);   // wrapped in try/catch

Two things to note. temperature: 0.2 is low enough to get consistent output but not so low that the model becomes stubborn about edge cases. And the JSON parse is wrapped in a try/catch that, on failure, retries once with an explicit error message in the conversation.
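The validate-hard-and-retry-once pattern can be sketched like this. Treat it as an illustration under assumptions rather than the exact workflow code: `callModel` is a hypothetical async wrapper around the chat completions request that resolves to the reply string, and the checks simply mirror the schema in the system prompt above.

```javascript
const SEVERITIES = ['p0', 'p1', 'p2', 'p3'];
const TEAMS = ['backend', 'frontend', 'mobile', 'platform', 'data', 'security'];
const FENCE = '`'.repeat(3); // a markdown code fence, built without typing one

// Remove a leading fence (optionally tagged "json") and a trailing fence.
function stripFences(raw) {
  const open = new RegExp('^' + FENCE + '(?:json)?\\s*', 'i');
  const close = new RegExp('\\s*' + FENCE + '$');
  return raw.trim().replace(open, '').replace(close, '');
}

// Parse and validate one reply. Throws with a message that is specific
// enough to send back to the model on the retry.
function parseTriage(raw, validLabels) {
  const obj = JSON.parse(stripFences(raw)); // SyntaxError on malformed JSON
  if (obj === null || typeof obj !== 'object') {
    throw new Error('response is not a JSON object');
  }
  if (!Array.isArray(obj.labels)) throw new Error('labels must be an array');
  const unknown = obj.labels.filter((label) => !validLabels.includes(label));
  if (unknown.length > 0) throw new Error('unknown labels: ' + unknown.join(', '));
  if (!SEVERITIES.includes(obj.severity)) throw new Error('severity must be p0 to p3');
  if (!TEAMS.includes(obj.suggested_team)) throw new Error('unknown suggested_team');
  if (typeof obj.confidence !== 'number' || obj.confidence < 0 || obj.confidence > 1) {
    throw new Error('confidence must be a number between 0.0 and 1.0');
  }
  if (typeof obj.reasoning !== 'string') throw new Error('reasoning must be a string');
  return obj;
}

// One retry only: the rejected reply and the error go back into the
// conversation. A second failure propagates to the caller.
async function triageWithRetry(callModel, messages, validLabels) {
  const first = await callModel(messages);
  try {
    return parseTriage(first, validLabels);
  } catch (err) {
    const retryMessages = [
      ...messages,
      { role: 'assistant', content: first },
      { role: 'user', content: 'Your reply was rejected: ' + err.message + '. Reply with the JSON object only.' },
    ];
    return parseTriage(await callModel(retryMessages), validLabels);
  }
}
```

Checking labels against the provided list on receipt matters as much as the parse itself: "Do not invent labels" in the prompt is a request, and this is the enforcement.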

When OpenAI’s gpt-4-0613 lands later this year with function calling, this whole prompt simplifies dramatically. For now, the structured prompt plus retry-on-parse-error is the working pattern.

Severity calibration is the hard part

Suggesting labels is easy. The model is good at reading “the user can’t log in” and tagging auth, regression, mobile. What it’s bad at, out of the box, is calibrating severity.

The first version of this workflow rated almost every customer-reported issue as p1. After a week, the on-call triager was drowning. The fix was twofold:

First, anchor the severity definitions with real examples in the system prompt. Not abstract definitions, but actual past tickets:

Severity calibration examples from this project:
  p0: "Production API returning 500 on all checkout requests" (full outage)
  p1: "Mobile app crashes on startup for ~5% of users" (partial outage, paying customers)
  p2: "Export to CSV has wrong date format" (annoying bug, workaround exists)
  p3: "Add a tooltip to the settings icon" (polish)

Second, pipe the reporter’s role into the prompt. A bug reported by a customer-success rep on behalf of a paying customer is structurally different from a bug filed by an engineer during exploratory testing. The model can’t infer that without help.

After both changes, severity assignments matched human triage about 78% of the time on a 200-ticket validation set. Not great. Good enough to be a useful suggestion, not good enough to act on without review.

Cost containment

GPT-4 at May 2023 pricing is roughly $0.03 per 1K input tokens and $0.06 per 1K output tokens. A typical triage call is around 1,500 input tokens (system prompt + issue) and 200 output tokens, so call it $0.057 per ticket. At 200 tickets a week that’s $11.40, or about $50 a month. Cheap.
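To rerun that arithmetic with your own volumes, a few lines are enough. The prices and token counts below are the ones quoted above; the 52/12 weeks-per-month conversion is my assumption for turning the weekly figure into a monthly one.

```javascript
// USD per 1K tokens, as quoted above for GPT-4 in May 2023.
const PRICE_PER_1K = { input: 0.03, output: 0.06 };

// Cost of a single triage call in USD.
function costPerTicket(inputTokens, outputTokens) {
  return (
    (inputTokens / 1000) * PRICE_PER_1K.input +
    (outputTokens / 1000) * PRICE_PER_1K.output
  );
}

const perTicket = costPerTicket(1500, 200); // 0.045 + 0.012
const perWeek = perTicket * 200;            // 200 tickets a week
const perMonth = (perWeek * 52) / 12;       // average weeks per month

console.log(perTicket.toFixed(3)); // 0.057
console.log(perWeek.toFixed(2));   // 11.40
console.log(perMonth.toFixed(2));  // 49.40
```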

It gets expensive when you do two things I’d recommend against:

Calling GPT-4 on every comment. Tempting, especially when you want to update severity as the conversation evolves. The cost blows up and the value is marginal.
Using GPT-4 to summarize attachments. PDFs and logs balloon the input. Use embeddings or a cheap model (gpt-3.5-turbo) for any first-pass summarization and only escalate to GPT-4 for the final triage call.

I’ve also had good results with a two-tier routing pattern: every ticket goes to gpt-3.5-turbo first for a coarse label-and-severity pass. Only tickets flagged “complex” or “ambiguous” by the cheaper model get escalated to GPT-4. This cut total spend by about 60% with no measurable drop in triage quality.
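A sketch of that two-tier routing, with the assumptions spelled out: `callModel(model, ticket)` is a hypothetical async function that returns the parsed triage object, and the cheaper model is assumed to report its doubts in an extra `flags` array containing "complex" or "ambiguous". How the flag is actually carried is not specified above.

```javascript
// Flags from the cheap pass that trigger a GPT-4 call.
const ESCALATION_FLAGS = ['complex', 'ambiguous'];

// True when the cheap model marked the ticket as needing a closer look.
// `flags` is a hypothetical field added to the triage schema.
function needsEscalation(cheapResult) {
  return (cheapResult.flags || []).some((flag) => ESCALATION_FLAGS.includes(flag));
}

// Every ticket gets the cheap pass; only flagged ones pay for GPT-4.
async function twoTierTriage(ticket, callModel) {
  const cheap = await callModel('gpt-3.5-turbo', ticket);
  if (!needsEscalation(cheap)) {
    return { ...cheap, model: 'gpt-3.5-turbo' };
  }
  const full = await callModel('gpt-4-0314', ticket);
  return { ...full, model: 'gpt-4-0314' };
}
```

Recording which model produced the final answer makes it possible to compare the two tiers in the weekly audit.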

Guardrails that actually matter

There are three rails I’d consider non-negotiable.

Rail 1: Never let the LLM close, reassign, or de-prioritize. It can suggest these things via a comment. A human approves. The asymmetry is that a wrong label is annoying and reversible; a closed-as-duplicate ticket that wasn’t a duplicate disappears into the void.

Rail 2: Always pass through known-keyword overrides. If the title contains “security” or “data breach” or “PII,” I bypass the LLM and route directly to the security team with severity p1. Same for known regression keywords like “regression,” “broke,” or “stopped working” combined with a recent release tag. The LLM is allowed to upgrade severity but not downgrade past these heuristics.
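Rail 2 could be sketched as below. Two details are assumptions on my part because the rule above leaves them open: the regression case pins a p1 floor without choosing a team, and keywords match on word boundaries so that "broker" does not trigger "broke".

```javascript
const SEVERITY_RANK = { p0: 0, p1: 1, p2: 2, p3: 3 }; // lower = more severe
const SECURITY_KEYWORDS = ['security', 'data breach', 'PII'];
const REGRESSION_KEYWORDS = ['regression', 'broke', 'stopped working'];

// Case-insensitive, whole-word match against any keyword in the list.
function containsAny(text, keywords) {
  const pattern = new RegExp('\\b(?:' + keywords.join('|') + ')\\b', 'i');
  return pattern.test(text);
}

// Returns an override for the ticket, or null when no heuristic applies.
function keywordOverride(title, hasRecentReleaseTag) {
  if (containsAny(title, SECURITY_KEYWORDS)) {
    return { team: 'security', severity: 'p1', bypassLlm: true };
  }
  if (hasRecentReleaseTag && containsAny(title, REGRESSION_KEYWORDS)) {
    // Assumption: keep the LLM for labels and team, but floor the severity.
    return { team: null, severity: 'p1', bypassLlm: false };
  }
  return null;
}

// The LLM may upgrade severity past the floor but never downgrade below it.
function applySeverityFloor(llmSeverity, floor) {
  return SEVERITY_RANK[llmSeverity] <= SEVERITY_RANK[floor] ? llmSeverity : floor;
}
```

For example, `applySeverityFloor('p3', 'p1')` yields `'p1'`, while `applySeverityFloor('p0', 'p1')` keeps the model's `'p0'`.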

Rail 3: Sample and audit. I have a weekly cron that picks 10 random triaged tickets from the past 7 days and posts a Slack thread to the triage on-call with the LLM’s suggestion vs the eventual human classification. It takes 15 minutes a week to review and it’s the single most useful piece of feedback for prompt iteration.
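The sampling half of Rail 3 is a partial Fisher-Yates shuffle. This sketch leaves out the cron trigger and the Slack call, and the field names (`key`, `llmSeverity`, `humanSeverity`) are hypothetical stand-ins for whatever your ticket export contains.

```javascript
// Pick up to `n` distinct tickets uniformly at random without mutating
// the input. `random` is injectable so tests can be deterministic.
function sampleTickets(tickets, n, random = Math.random) {
  const pool = tickets.slice();
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    // Swap a random not-yet-chosen element into position i.
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// One line per ticket for the audit thread.
function auditLine(ticket) {
  const verdict = ticket.llmSeverity === ticket.humanSeverity ? 'match' : 'MISMATCH';
  return `${ticket.key}: LLM ${ticket.llmSeverity} vs human ${ticket.humanSeverity} (${verdict})`;
}
```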

The prompt I almost shipped

A near-miss worth sharing. An early version of the prompt included this instruction:

If the issue appears to be a duplicate of a recently closed ticket,
suggest action: "close-as-duplicate" with the ticket key.

In testing, the model would hallucinate ticket keys. It would confidently say “this is a duplicate of ENG-4831” and ENG-4831 would be a ticket about something completely unrelated, or wouldn’t exist at all. I caught it before it went near production, but only because the validation set happened to include a duplicate-of-something case.

Lesson: never let the LLM reference external IDs that you haven’t explicitly provided in context. If you want duplicate detection, build it with embeddings and a similarity search over real ticket data, then have the LLM only confirm or reject — not generate the candidate.
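The candidate-generation side of that design might look like the sketch below. It assumes every existing ticket already has an embedding vector stored alongside its key; how the vectors are produced is out of scope here. The 0.85 cutoff and the limit of three candidates are illustrative defaults, not tuned values.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the closest real tickets, best match first. Only these keys are
// ever shown to the LLM, so it cannot reference a ticket that does not exist.
function duplicateCandidates(newEmbedding, knownTickets, options = {}) {
  const { minScore = 0.85, limit = 3 } = options;
  return knownTickets
    .map((ticket) => ({
      key: ticket.key,
      score: cosineSimilarity(newEmbedding, ticket.embedding),
    }))
    .filter((candidate) => candidate.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

The LLM's job then shrinks to a yes or no per candidate, and an empty candidate list means the duplicate question is never asked at all.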

The OpenAI documentation on prompt engineering was being updated heavily through this period and it’s worth re-reading for current best practice.

Common Pitfalls

PII in prompts. Every ticket body goes to OpenAI. If your tickets contain customer emails, names, or payloads, you’re sending all of it. Redact before sending or get the legal review.
Rate limits. GPT-4 had aggressive rate limits in early 2023 (around 200 requests per minute for most accounts). A burst of new tickets after a deploy can saturate them. Queue the calls and back off on 429 responses.
Non-determinism. Even with temperature: 0.2, the same ticket can get slightly different responses on different days. Don’t build assertions in tests that depend on exact LLM output.
JSON parse failures. GPT-4 occasionally wraps JSON in markdown fences (```json) despite explicit instructions. Strip them defensively before parsing.
Context length surprises. A ticket with a long log paste can blow past 8K tokens. Truncate description content with a clear [content truncated] marker so the model knows it’s working with partial information.
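Two of these pitfalls, rate limits and context length, have short mechanical fixes. The sketch below makes several assumptions: roughly four characters per token as a budget heuristic (not a real tokenizer), a 6,000-token description budget, and an axios-style error object where a rate-limit failure exposes `err.response.status === 429`.

```javascript
const TRUNCATION_MARKER = '\n[content truncated]';

// Cut the description to a rough token budget and say so explicitly.
// Assumption: about 4 characters per token for English text.
function truncateDescription(text, maxTokens = 6000) {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + TRUNCATION_MARKER;
}

// Exponential backoff: 1s, 2s, 4s, ... capped at 30s by default.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run `fn`, retrying only on HTTP 429, up to `maxAttempts` calls in total.
async function withBackoff(fn, maxAttempts = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const rateLimited = Boolean(err && err.response && err.response.status === 429);
      if (!rateLimited || attempt >= maxAttempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Wrapping the chat completions request in `withBackoff(() => openai.post(...))` covers short bursts; a sustained backlog still needs a real queue in front of it.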

Wrapping Up

LLM-assisted triage is one of those use cases where the tech is genuinely useful today, but only if you treat it as a suggester rather than a decider. The economics work, the integration is straightforward in n8n, and the failure modes are mostly recoverable as long as you keep humans in the close/reassign loop.

The next post in this series moves to something less flashy but arguably more valuable: actual two-way sync between Jira and GitHub Issues, which is where every team I’ve worked with eventually wants to get to.
