background-shape
Building a Support Knowledge Base from Zendesk and Jira
November 5, 2025 · 11 min read · by Muhammad Amal programming

TL;DR — Tickets are messy, contaminated with PII, and full of obsolete workarounds. Building a knowledge base from them is mostly a deduping and filtering problem, not a retrieval problem. Get the ETL right and the rest is easy.

The instinct to “just dump every closed ticket into a vector store” is the single most expensive mistake I see support orgs make. The math looks great in a planning doc, ten years of tickets equals millions of resolved problems equals a goldmine. The reality is that ninety percent of those tickets are duplicates of duplicates, the workarounds in them are wrong because the product changed, and a meaningful fraction contain customer secrets that nobody got around to redacting.

This tutorial walks through the ETL pipeline I’ve used to turn raw ticket data from Zendesk and Jira Service Management into a knowledge base that’s actually fit to feed a retrieval system. We’ll cover extraction, deduplication, anonymization, scoring for canonical-ness, and the data model in Postgres 17 that holds it all together. The code targets Python 3.12 and assumes you’ve got API credentials for both systems.

I’ll assume you’re a senior support engineer or a manager scoping this work for an engineer. If you haven’t read the prior piece on RAG systems for technical support teams, the short version is that the ingestion layer is where most production RAG systems silently fail. This article is the long version of that warning.

What “knowledge base” actually means here

Two different things in the industry share the name. There’s the curated public-facing KB (Zendesk Help Center articles, Confluence pages, product docs), and there’s the implicit KB hidden in resolved tickets. The first is small, authoritative, and slow to update. The second is huge, noisy, and freshly relevant to whatever broke yesterday.

You need both. The curated KB is your “ground truth” layer; tickets are your “long tail and recent” layer. They get indexed into the same vector store but with different metadata flags so the retriever can weight them or filter on them.

            +--------------------+       +--------------------+
            |   Zendesk KB       |       |   Jira issues      |
            |   (Help Center)    |       |   (resolved)       |
            +----------+---------+       +----------+---------+
                       |                            |
            +----------v----------+      +----------v---------+
            |  fetch + filter     |      |  fetch + filter    |
            +----------+----------+      +----------+---------+
                       |                            |
                       +----------+    +------------+
                                  |    |
                       +----------v----v----------+
                       |  anonymize + dedupe      |
                       |  + canonical scoring     |
                       +----------+---------------+
                                  |
                       +----------v---------------+
                       |  Postgres 17 + pgvector  |
                       |  kb_documents table      |
                       +--------------------------+

That’s the pipeline. Everything below is filling in the boxes.

Step 1, the Postgres schema

Start with the data model. If you get the schema right, every subsequent step is straightforward. If you get it wrong, you’ll be running migrations at 2am six months from now.

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE kb_documents (
    id              BIGSERIAL PRIMARY KEY,
    source_system   TEXT NOT NULL CHECK (source_system IN ('zendesk_kb','zendesk_ticket','jira_issue','confluence')),
    source_id       TEXT NOT NULL,
    title           TEXT NOT NULL,
    body            TEXT NOT NULL,
    body_clean      TEXT NOT NULL,
    canonical_score REAL NOT NULL DEFAULT 0,
    visibility      TEXT NOT NULL CHECK (visibility IN ('public','internal','confidential')),
    product_area    TEXT[],
    labels          TEXT[],
    resolved_at     TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL,
    updated_at      TIMESTAMPTZ NOT NULL,
    ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    content_hash    TEXT NOT NULL,
    dup_cluster_id  BIGINT,
    embedding       VECTOR(3072),
    UNIQUE (source_system, source_id)
);

CREATE INDEX kb_docs_trgm ON kb_documents USING GIN (body_clean gin_trgm_ops);
CREATE INDEX kb_docs_labels ON kb_documents USING GIN (labels);
CREATE INDEX kb_docs_product ON kb_documents USING GIN (product_area);
CREATE INDEX kb_docs_vec ON kb_documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

The body field keeps the raw extracted text. body_clean is the anonymized, dedup-friendly version. Keep both. When the bot says something weird and you need to audit what was indexed, you’ll want the raw too. The content_hash field is for exact dedup, dup_cluster_id for near-dup grouping that we’ll compute later, and canonical_score is what tells the retriever which member of a dup cluster to prefer.

If you’re new to picking Postgres indexes for this kind of workload, my earlier writeup on choosing the right Postgres index covers the tradeoffs between HNSW, IVFFlat, GIN, and BRIN.

Step 2, Zendesk ticket extraction with safety rails

The Zendesk Tickets API gives you a firehose. The trick is filtering it down before it reaches your processing code.

import os
import httpx
from datetime import datetime, timedelta, timezone

ZD = f"https://{os.environ['ZD_SUBDOMAIN']}.zendesk.com"
AUTH = (f"{os.environ['ZD_EMAIL']}/token", os.environ['ZD_TOKEN'])

def iter_resolved_tickets(since: datetime):
    cursor = None
    with httpx.Client(timeout=60.0) as client:
        while True:
            params = {
                "filter[start_time]": int(since.timestamp()),
                "filter[type]": "ticket",
                "include": "comment_count,satisfaction_rating",
            }
            if cursor:
                params["cursor"] = cursor
            r = client.get(f"{ZD}/api/v2/incremental/tickets/cursor.json",
                           auth=AUTH, params=params)
            r.raise_for_status()
            data = r.json()
            for t in data["tickets"]:
                if t["status"] not in ("solved", "closed"):
                    continue
                if t["satisfaction_rating"] and t["satisfaction_rating"].get("score") == "bad":
                    continue
                if t["comment_count"] < 2 or t["comment_count"] > 25:
                    continue
                yield t
            if data["end_of_stream"]:
                break
            cursor = data["after_cursor"]

Three filters I want to call out. Tickets with a “bad” satisfaction rating are not your canonical answers, full stop. A ticket with one comment is incomplete (auto-closed or spam); a ticket with more than twenty-five is a death march that’s almost certainly customer-specific. The sweet spot for ingestion is two to twenty-five comments and a non-negative CSAT.

Pulling the conversation requires a second call per ticket.

def fetch_comments(client: httpx.Client, ticket_id: int):
    url = f"{ZD}/api/v2/tickets/{ticket_id}/comments.json"
    r = client.get(url, auth=AUTH, params={"include": "users"})
    r.raise_for_status()
    return [c for c in r.json()["comments"] if c["public"]]

The public filter is critical. Internal-only comments often contain customer-identifying troubleshooting or vendor escalation notes. They are not safe to put in a customer-facing knowledge base.

Step 3, Jira Service Management extraction

Jira’s structure is different. An “issue” in JSM corresponds to what most of us call a ticket. We want issues that are resolved (statusCategory = Done) and have a resolution that suggests an actual answer, not just “Won’t Do” or “Duplicate.”

import httpx, base64, os

JIRA = os.environ["JIRA_BASE_URL"]
JIRA_AUTH = "Basic " + base64.b64encode(
    f"{os.environ['JIRA_EMAIL']}:{os.environ['JIRA_TOKEN']}".encode()
).decode()

def iter_resolved_issues(project: str, since_days: int = 365):
    jql = (
        f'project = "{project}" AND statusCategory = Done '
        f'AND resolution in (Done, Fixed, "Workaround Provided") '
        f'AND resolved >= -{since_days}d '
        f'ORDER BY resolved DESC'
    )
    headers = {"Authorization": JIRA_AUTH, "Accept": "application/json"}
    next_token = None
    with httpx.Client(timeout=60.0) as client:
        while True:
            payload = {
                "jql": jql,
                "fields": ["summary", "description", "comment", "labels",
                           "components", "resolution", "resolutiondate", "created"],
                "maxResults": 100,
            }
            if next_token:
                payload["nextPageToken"] = next_token
            r = client.post(f"{JIRA}/rest/api/3/search/jql",
                            headers=headers, json=payload)
            r.raise_for_status()
            data = r.json()
            for issue in data.get("issues", []):
                yield issue
            if data.get("isLast", True):
                break
            next_token = data.get("nextPageToken")

The Jira Cloud platform moved to a token-based pagination in early 2025; the old startAt style is deprecated. If your team’s docs still reference startAt, update them.

Step 4, anonymization that you can actually trust

This is the step everyone underestimates. You cannot rely on regex alone, and you cannot rely on an LLM alone. Use both, in a defense-in-depth setup, and log every redaction so you can audit.

import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

EMAIL = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
UUID = re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b")
BEARER = re.compile(r"(?i)bearer\s+[A-Za-z0-9\-_\.=]+")

REGEX_RULES = [
    (EMAIL, "[EMAIL]"),
    (IPV4, "[IP]"),
    (UUID, "[UUID]"),
    (BEARER, "[BEARER_TOKEN]"),
]

def anonymize(text: str) -> tuple[str, list[dict]]:
    redactions = []
    cleaned = text
    for pat, label in REGEX_RULES:
        matches = list(pat.finditer(cleaned))
        for m in matches:
            redactions.append({"type": label, "span": [m.start(), m.end()]})
        cleaned = pat.sub(label, cleaned)
    results = analyzer.analyze(text=cleaned, language="en",
                               entities=["PERSON", "PHONE_NUMBER", "CREDIT_CARD",
                                         "US_SSN", "URL", "DOMAIN_NAME"])
    if results:
        anon = anonymizer.anonymize(text=cleaned, analyzer_results=results)
        cleaned = anon.text
        for r in results:
            redactions.append({"type": r.entity_type, "span": [r.start, r.end]})
    return cleaned, redactions

The Presidio layer catches person names, phone numbers, card numbers, and URLs that the regex layer misses. The regex layer catches structured tokens (UUIDs, bearer tokens) that Presidio is bad at. Together, they get you to roughly 99% recall on the categories that matter for support data. The remaining 1% is why every redaction is logged; you need to be able to find and fix what slipped through.

Store the redaction log next to the document. When a customer files a privacy complaint, you’ll need it.

Step 5, deduplication in two passes

Tickets cluster heavily around common issues. Twenty different customers hit the same TLS misconfiguration in the same week and you’ve got twenty near-duplicate tickets. You don’t want all twenty in your KB, you want the best one.

First pass is exact hash dedup against content_hash (a SHA-256 of body_clean normalized for whitespace). Fast, cheap, catches the obvious cases.

Second pass uses MinHash LSH on shingled text to cluster near-dups.

from datasketch import MinHash, MinHashLSH

def shingle(text: str, k: int = 5) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i+k]) for i in range(len(tokens)-k+1)}

def build_dup_clusters(docs: list[dict], threshold: float = 0.7):
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    minhashes = {}
    for d in docs:
        mh = MinHash(num_perm=128)
        for sh in shingle(d["body_clean"]):
            mh.update(sh.encode("utf-8"))
        minhashes[d["id"]] = mh
        lsh.insert(str(d["id"]), mh)
    seen = set()
    clusters = []
    for doc_id in minhashes:
        if doc_id in seen:
            continue
        members = [int(x) for x in lsh.query(minhashes[doc_id])]
        seen.update(members)
        clusters.append(members)
    return clusters

Threshold 0.7 is what I’ve landed on for support tickets. Higher and you miss reworded duplicates; lower and you cluster unrelated tickets that happen to share boilerplate language (“Hi, I’m having an issue with…”).

Step 6, canonical scoring

Within each cluster, pick one representative. The scoring formula I use weighs four signals.

def canonical_score(doc: dict) -> float:
    score = 0.0
    if doc["source_system"] == "zendesk_kb":
        score += 5.0
    if doc.get("csat") == "good":
        score += 1.5
    age_days = (datetime.now(timezone.utc) - doc["resolved_at"]).days if doc.get("resolved_at") else 999
    score += max(0, 2.0 - age_days / 180)
    score += min(2.0, doc.get("comment_count", 0) / 10)
    if "verified" in (doc.get("labels") or []):
        score += 3.0
    return score

KB articles dominate by design. A satisfied customer matters. Recency matters but with a floor. A “verified” label, applied by your senior engineers, is the strongest single signal you can engineer. Within each dup_cluster_id the document with the highest canonical_score gets a flag, and the retriever filters on that flag for L1 queries.

Step 7, putting it together

The full ETL run, cron’d to run hourly for KB articles and nightly for tickets, looks like this.

import psycopg
import hashlib, json

def upsert_document(cur, doc: dict):
    cur.execute("""
        INSERT INTO kb_documents (source_system, source_id, title, body, body_clean,
            canonical_score, visibility, product_area, labels, resolved_at,
            created_at, updated_at, content_hash)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        ON CONFLICT (source_system, source_id) DO UPDATE SET
            title = EXCLUDED.title,
            body = EXCLUDED.body,
            body_clean = EXCLUDED.body_clean,
            canonical_score = EXCLUDED.canonical_score,
            labels = EXCLUDED.labels,
            updated_at = EXCLUDED.updated_at,
            content_hash = EXCLUDED.content_hash
    """, (
        doc["source_system"], doc["source_id"], doc["title"], doc["body"],
        doc["body_clean"], doc["canonical_score"], doc["visibility"],
        doc["product_area"], doc["labels"], doc.get("resolved_at"),
        doc["created_at"], doc["updated_at"],
        hashlib.sha256(doc["body_clean"].encode()).hexdigest(),
    ))

def run_etl():
    with psycopg.connect(os.environ["PG_DSN"]) as conn, conn.cursor() as cur:
        for raw in iter_resolved_tickets(datetime.now(timezone.utc) - timedelta(days=1)):
            doc = transform_zendesk(raw)
            doc["body_clean"], _ = anonymize(doc["body"])
            doc["canonical_score"] = canonical_score(doc)
            upsert_document(cur, doc)
        conn.commit()

The transform_zendesk and transform_jira adapters are the boring glue code; their job is to take the raw API payload and produce the dict shape that upsert_document expects.

For the source-of-truth on the Jira API specifically, the Atlassian REST API reference is the only docs I trust, since they update faster than any vendor blog.

Common Pitfalls

Treating tickets as authoritative. Tickets are evidence, not policy. A workaround from 2022 in a closed ticket can read as gospel to an LLM. Always weight KB articles higher than tickets, and have a recency decay on ticket relevance.

Anonymizing only the body. Titles leak too. “Login failure for acme-corp.com on bastion-01” has the customer’s domain and a hostname. Run your anonymizer over the title with the same rules, and store the cleaned title separately.

Skipping the audit trail. Every redaction, every dedup decision, every canonical-score assignment should be logged. When a customer files a GDPR request or a security review asks “what’s in your KB,” you need to answer in minutes, not weeks.

Ignoring stale documents. A KB article that hasn’t been updated in three years and references a deprecated API is a liability. Add a staleness_score based on updated_at and have the retriever penalize stale results unless the user explicitly opts in.

Troubleshooting

Symptom, the same answer appears five times in retrieval results. Your dedup didn’t run, or your threshold is too high. Lower MinHash threshold to 0.6 and re-cluster. Verify that the retriever filters on dup_cluster_id and prefers the highest canonical_score within each cluster.

Symptom, customer-specific config leaked into a public answer. Anonymizer missed it. Check the redaction log for the offending document. Add a specific regex or NER rule. Backfill by re-running anonymization on the full corpus; don’t try to patch in place.

Symptom, KB feels outdated even though the cron is running. Two likely causes. First, Zendesk’s updated_at field doesn’t change when only labels change, so your incremental fetch misses re-categorizations. Second, your staleness_score might be turned off in the retriever. Add a full weekly resync that ignores updated_at and compares content_hash directly.

Wrapping Up

The knowledge base behind a good support RAG isn’t built; it’s curated. The extraction code is the easy half. The deduping, anonymization, and canonical scoring is what separates a system that you can actually trust to answer customers from one that you have to babysit forever.

Next in this series I’ll get into embedding strategies, specifically when a general-purpose embedding is fine, when you need to fine-tune, and how to evaluate the difference. If you’re scoping this work, plan for the ETL to take twice as long as you think. It always does.