Building an SRE Copilot for On Call Engineers

May 12, 2025 · 8 min read · by Muhammad Amal

TL;DR — Give the LLM a strict tool catalog (PromQL, LogQL, Tempo, runbook search), keep all tools read-only, drive the loop with LangGraph 0.2, and ship a Slack interface before you ship a web UI.

An on-call engineer at 2 AM doesn’t want a chatbot. They want answers. The right copilot is one that, given a PagerDuty alert, runs the same five queries the on-call would run, reads the runbook, and produces a brief that turns the first ten minutes into the last ten minutes.

The dangerous version of this copilot has cluster admin and a kubectl tool. The useful version is strictly read-only, runs in a sidecar, and never touches production state. It’s not a remediator. It’s a research assistant that compresses the boring part of on-call.

This tutorial builds that copilot. We’ll use claude-3.7-sonnet for tool use, LangGraph 0.2 for the loop, and the standard Grafana stack (Prometheus 3.0, Loki 3.3, Tempo 2.6) as the data sources. The copilot lives behind a Slack app, ingests PagerDuty webhooks, and posts a structured brief into the incident channel within 30 seconds of a page.

1. What the Copilot Does and Doesn’t Do

It does:

Run PromQL, LogQL, and TraceQL queries.
Search the runbook index.
Read recent deployment history from Argo CD 2.13.
Read Kubernetes resource state (events, pods, deployments).
Post a structured brief to Slack.

It does not:

Modify any resource.
Execute commands on hosts.
Offer any tool that works around the two rules above.

Read-only is not a guideline. It’s enforced by the service account.

# rbac/copilot.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-copilot-readonly
rules:
- apiGroups: [""]
  resources: ["pods", "events", "services", "endpoints", "nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["argoproj.io"]
  resources: ["applications", "applicationsets"]
  verbs: ["get", "list"]

No create, no update, no patch, no delete. This is the whole security story.
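A cheap way to keep it that way is a check in CI that fails when someone adds a write verb to the role. The sketch below is a minimal version of that idea: it assumes the rules have already been loaded from rbac/copilot.yaml into plain Python data, and the helper name `write_verbs` is hypothetical.

```python
# Verbs that cannot change cluster state.
READ_ONLY_VERBS = {"get", "list", "watch"}


def write_verbs(rules: list[dict]) -> list[tuple[str, str]]:
    """Return (resource, verb) pairs that grant more than read access."""
    offenders = []
    for rule in rules:
        for verb in rule.get("verbs", []):
            # Anything outside the read-only set, including "*", is flagged.
            if verb not in READ_ONLY_VERBS:
                for resource in rule.get("resources", []):
                    offenders.append((resource, verb))
    return offenders


# The rules from rbac/copilot.yaml, written out as Python data.
COPILOT_RULES = [
    {"apiGroups": [""],
     "resources": ["pods", "events", "services", "endpoints", "nodes"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"],
     "resources": ["deployments", "statefulsets", "daemonsets", "replicasets"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["argoproj.io"],
     "resources": ["applications", "applicationsets"],
     "verbs": ["get", "list"]},
]

if __name__ == "__main__":
    offenders = write_verbs(COPILOT_RULES)
    print("read-only" if not offenders else f"write access found: {offenders}")
```

Run it against every change to the role; an empty result means the role is still read-only.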

2. The Tool Catalog

Each LangGraph node calls the model through the Anthropic client and hands it the tool catalog. Each tool has a typed argument schema. Five tools is the right number: with fewer the copilot misses obvious moves, and with more it gets confused.

# copilot/tools.py
from pydantic import BaseModel, Field
from anthropic.types import ToolParam
import httpx

class PromQLArgs(BaseModel):
    query: str = Field(..., description="A PromQL expression")
    minutes: int = Field(15, ge=1, le=180)

class LogQLArgs(BaseModel):
    query: str
    minutes: int = Field(15, ge=1, le=120)
    limit: int = Field(100, ge=1, le=500)

class TraceArgs(BaseModel):
    service: str
    minutes: int = 15
    error_only: bool = True

class RunbookArgs(BaseModel):
    query: str

class DeployArgs(BaseModel):
    service: str
    hours: int = Field(24, ge=1, le=168)

TOOLS: list[ToolParam] = [
    {
        "name": "promql",
        "description": "Run a PromQL query and return the result series. Read-only.",
        "input_schema": PromQLArgs.model_json_schema(),
    },
    {
        "name": "logql",
        "description": "Run a LogQL query, returns up to `limit` log lines.",
        "input_schema": LogQLArgs.model_json_schema(),
    },
    {
        "name": "traces",
        "description": "Fetch recent traces for a service, optionally errors only.",
        "input_schema": TraceArgs.model_json_schema(),
    },
    {
        "name": "runbook_search",
        "description": "Search the internal runbook index. Returns top 3 results.",
        "input_schema": RunbookArgs.model_json_schema(),
    },
    {
        "name": "deploy_history",
        "description": "Recent deploys for a service from Argo CD.",
        "input_schema": DeployArgs.model_json_schema(),
    },
]

Each tool implementation is a thin async function. Here’s promql.

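import time

async def tool_promql(args: PromQLArgs) -> dict:
    # Prometheus wants Unix timestamps (or RFC 3339 strings) here; it does not
    # understand relative expressions such as "now-15m".
    end = time.time()
    start = end - args.minutes * 60
    async with httpx.AsyncClient(timeout=15) as c:
        r = await c.get("http://prometheus:9090/api/v1/query_range", params={
            "query": args.query,
            "start": start,
            "end": end,
            "step": "30s",
        })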
        r.raise_for_status()
        result = r.json()["data"]["result"]
    # Compact the response: keep last 20 points per series, max 10 series
    compact = []
    for s in result[:10]:
        values = s["values"][-20:]
        compact.append({"metric": s["metric"], "values": values})
    return {"series": compact, "truncated": len(result) > 10}

The compaction step matters. A raw PromQL response can be 500 KB. The model only needs the shape, not every sample. Keep tool responses under 4 KB each.
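Fixed limits of 20 points and 10 series don't guarantee a byte budget, because label sets vary in size. One possible refinement, sketched here with a hypothetical helper named `fit_to_budget`, shrinks the payload until its JSON encoding fits: it halves the points kept per series first, then drops trailing series. The 4000-byte default mirrors the 4 KB target above.

```python
import json


def fit_to_budget(series: list[dict], budget_bytes: int = 4000) -> dict:
    """Shrink a list of {"metric", "values"} series until its JSON fits the budget."""
    kept = list(series)
    full_points = max((len(s["values"]) for s in kept), default=1)
    points = max(full_points, 1)
    while True:
        payload = {
            # Keep only the most recent `points` samples of each series.
            "series": [{"metric": s["metric"], "values": s["values"][-points:]}
                       for s in kept],
            "truncated": points < full_points or len(kept) < len(series),
        }
        size = len(json.dumps(payload).encode("utf-8"))
        if size <= budget_bytes or not kept:
            return payload
        if points > 1:
            points //= 2   # first give up history
        else:
            kept.pop()     # then give up whole series, last one first


if __name__ == "__main__":
    demo = [{"metric": {"pod": f"api-{i}"},
             "values": [[1715500000 + 30 * j, "0.12"] for j in range(200)]}
            for i in range(10)]
    out = fit_to_budget(demo)
    print(len(json.dumps(out)), out["truncated"])
```

Unlike slicing the encoded string, the result is always valid JSON, so the model never sees a response cut off mid-value.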

3. The LangGraph Loop

LangGraph 0.2 gives us a typed state machine. The graph has two nodes, investigate and assess; the report step, posting the brief to Slack, runs after the graph finishes.

# copilot/graph.py
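import json
from typing import TypedDict

from anthropic import AsyncAnthropic
from langgraph.graph import StateGraph, END

from copilot.tools import TOOLS  # the catalog from section 2
# dispatch, SYSTEM_INVESTIGATE, and SYSTEM_ASSESS are assumed to be defined
# elsewhere in the package.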

claude = AsyncAnthropic()

class State(TypedDict):
    alert: dict
    transcript: list[dict]  # message history
    findings: list[str]
    tool_calls: int
    report: dict | None

MAX_TOOL_CALLS = 8

async def investigate(state: State) -> State:
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        system=SYSTEM_INVESTIGATE,
        tools=TOOLS,
        messages=state["transcript"],
    )
    if msg.stop_reason == "tool_use":
        # Record the assistant turn once, then answer every tool_use block
        # in a single user turn.
        state["transcript"].append({"role": "assistant", "content": msg.content})
        results = []
        for block in msg.content:
            if block.type == "tool_use":
                result = await dispatch(block.name, block.input)
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": json.dumps(result)[:4000]})
                state["tool_calls"] += 1
        state["transcript"].append({"role": "user", "content": results})
    else:
        # model decided to stop investigating; keep only its text blocks
        text = "".join(b.text for b in msg.content if b.type == "text")
        state["findings"].append(text)
    return state

def should_continue(state: State) -> str:
    if state["tool_calls"] >= MAX_TOOL_CALLS:
        return "assess"
    if state["findings"]:
        return "assess"
    return "investigate"

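async def assess(state: State) -> State:
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=800,
        system=SYSTEM_ASSESS,
        # The transcript contains tool_use blocks, so the tool definitions are
        # sent again; tool_choice "none" asks the model to answer in text.
        tools=TOOLS,
        tool_choice={"type": "none"},
        messages=state["transcript"] + [{
            "role": "user",
            "content": "Now produce the final brief as JSON.",
        }],
    )
    state["report"] = json.loads(msg.content[0].text)
    return state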

graph = StateGraph(State)
graph.add_node("investigate", investigate)
graph.add_node("assess", assess)
graph.set_entry_point("investigate")
graph.add_conditional_edges("investigate", should_continue,
                             {"investigate": "investigate", "assess": "assess"})
graph.add_edge("assess", END)
copilot = graph.compile()

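MAX_TOOL_CALLS = 8 is the budget. The pre-loaded probe in the initial state counts as the first call, so the model gets seven more tool uses to investigate. The check runs between model turns, so a turn that requests several tools at once can overshoot by a few calls. If the model hasn’t drawn a conclusion by then, the loop forces an assessment with whatever it’s got. That’s the safety valve.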
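The graph code calls a `dispatch` function that the post doesn't show. A minimal sketch of one way to write it follows, assuming a simple name-to-handler registry; the `register` decorator and the `echo` tool are hypothetical and exist only to make the example runnable. Errors are returned to the model as data rather than raised, so one bad query doesn't abort the whole run.

```python
import asyncio
from typing import Awaitable, Callable

ToolHandler = Callable[[dict], Awaitable[dict]]
TOOL_HANDLERS: dict[str, ToolHandler] = {}


def register(name: str):
    """Decorator that adds an async handler to the registry under `name`."""
    def wrap(fn: ToolHandler) -> ToolHandler:
        TOOL_HANDLERS[name] = fn
        return fn
    return wrap


@register("echo")
async def tool_echo(raw: dict) -> dict:
    # Stand-in tool for the demo; a real handler would validate `raw`
    # and query a backend.
    return {"echo": raw}


async def dispatch(name: str, raw_input: dict) -> dict:
    """Route a tool_use block to its handler; never raise into the graph."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return await handler(raw_input)
    except Exception as exc:  # report the failure to the model as data
        return {"error": f"{type(exc).__name__}: {exc}"}


if __name__ == "__main__":
    print(asyncio.run(dispatch("echo", {"query": "up"})))
    print(asyncio.run(dispatch("kubectl", {})))
```

In the copilot itself, the handler registered as "promql" would build PromQLArgs from the raw input and call tool_promql, so the ge/le bounds on the schema are enforced before any query runs.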

4. The Brief Schema

The output is structured. Slack rendering depends on it.

class Brief(BaseModel):
    headline: str  # one line
    impact: str   # what users see
    likely_cause: str
    evidence: list[str]  # bullet points, max 6
    suggested_next_steps: list[str]  # max 4, read-only verbs only
    runbook_links: list[str]
    confidence: float  # 0..1

suggested_next_steps is the trickiest field. The model loves to suggest kubectl delete pod. Any step that doesn’t start with a verb on the allowlist (“check”, “inspect”, “verify”, “review”, “page”, “look”, “compare”) is rewritten as a “Review:” item, so it reads as something to examine rather than a command to run. A small post-processor enforces it.

ALLOWED_VERBS = {"check", "inspect", "verify", "review", "page", "look", "compare"}

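def sanitize_steps(steps: list[str]) -> list[str]:
    out = []
    for s in steps:
        words = s.strip().split()
        if not words:
            continue  # skip empty steps instead of raising IndexError
        # Ignore trailing punctuation so "Check:" still matches "check".
        verb = words[0].lower().strip(",.:;")
        if verb in ALLOWED_VERBS:
            out.append(s)
        else:
            out.append(f"Review: {s}")
    return out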

This is belt-and-braces. The system prompt also forbids remediation commands, but post-processing is what you trust.
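The same reasoning applies to the limits written as comments in the schema: the model doesn't reliably respect "max 6" or "0..1" on its own. The sketch below shows one way to enforce them after parsing, assuming the brief arrives as a plain dict; `normalize_brief` and `first_line` are hypothetical helper names.

```python
def first_line(text: str) -> str:
    """Return the first non-empty line of `text`, or an empty string."""
    lines = text.strip().splitlines()
    return lines[0] if lines else ""


def normalize_brief(raw: dict) -> dict:
    """Apply the limits from the Brief schema comments to a parsed brief."""
    # Clamp confidence into 0..1 instead of trusting the model's number.
    confidence = min(1.0, max(0.0, float(raw.get("confidence", 0.0))))
    return {
        "headline": first_line(str(raw.get("headline", ""))),
        "impact": str(raw.get("impact", "")),
        "likely_cause": str(raw.get("likely_cause", "")),
        "evidence": [str(e) for e in raw.get("evidence", [])][:6],
        "suggested_next_steps":
            [str(s) for s in raw.get("suggested_next_steps", [])][:4],
        "runbook_links": [str(u) for u in raw.get("runbook_links", [])],
        "confidence": confidence,
    }


if __name__ == "__main__":
    print(normalize_brief({"headline": "5xx spike\non checkout",
                           "evidence": [f"point {i}" for i in range(9)],
                           "confidence": 1.7}))
```

Missing keys become empty values here rather than errors, which keeps a partially formed brief postable.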

5. The PagerDuty Webhook

PagerDuty fires a webhook on incident creation. The copilot runs and posts to Slack.

# api/webhook.py
import json
import os

import httpx
from fastapi import FastAPI, Request, BackgroundTasks

app = FastAPI()

@app.post("/pd/incident")
async def pd_incident(req: Request, bg: BackgroundTasks):
    event = await req.json()
    if event["event"]["event_type"] != "incident.triggered":
        return {"ok": True}
    bg.add_task(run_copilot, event["event"]["data"])
    return {"ok": True}

async def run_copilot(incident: dict):
    state = await build_initial_state(incident)
    final = await copilot.ainvoke(state)
    brief = final["report"]
    brief["suggested_next_steps"] = sanitize_steps(brief["suggested_next_steps"])
    await post_to_slack(incident["channel_id"], brief)
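The endpoint above accepts any POST. PagerDuty's v3 webhooks can be signed, and checking the signature keeps strangers from triggering copilot runs. The sketch below assumes the scheme as I understand it from PagerDuty's documentation: an X-PagerDuty-Signature header carrying one or more comma-separated `v1=<hex>` values, each an HMAC-SHA256 of the raw request body under the webhook secret. Verify that against the current docs before relying on it; the function name is hypothetical.

```python
import hashlib
import hmac


def signature_is_valid(secret: str, body: bytes, header: str) -> bool:
    """Check a raw request body against an X-PagerDuty-Signature header value."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = f"v1={digest}".encode("utf-8")
    # The header may list several signatures (assumed: comma-separated).
    candidates = [part.strip().encode("utf-8") for part in header.split(",")]
    # compare_digest avoids leaking how many leading characters matched.
    return any(hmac.compare_digest(c, expected) for c in candidates)


if __name__ == "__main__":
    secret, body = "example-secret", b'{"event": {"event_type": "incident.triggered"}}'
    good = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    print(signature_is_valid(secret, body, good))        # True
    print(signature_is_valid(secret, body, "v1=00ff"))   # False
```

Inside pd_incident, the check would run on `await req.body()` before the JSON is parsed, returning a 401 when it fails.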

Build the initial state by extracting the alert labels, the service name, and pre-loading a single PromQL probe so the model has at least one anchor data point.

async def build_initial_state(incident: dict) -> State:
    service = incident["custom_details"].get("service", "unknown")
    initial_probe = await tool_promql(PromQLArgs(
        query=f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[5m]))',
        minutes=30,
    ))
    return State(
        alert=incident,
        transcript=[{
            "role": "user",
            "content": json.dumps({
                "alert": incident,
                "initial_probe": initial_probe,
            }),
        }],
        findings=[],
        tool_calls=1,
        report=None,
    )
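The probe interpolates the service name from the alert payload straight into a PromQL string. A stray quote or backslash in that value would break the query, so it is worth escaping it first. This is a minimal sketch under that assumption; `promql_quote` and `error_rate_query` are hypothetical helpers, and the query is the same one used in build_initial_state.

```python
def promql_quote(value: str) -> str:
    """Return `value` as a double-quoted PromQL string literal."""
    escaped = (value.replace("\\", "\\\\")   # backslashes first
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return f'"{escaped}"'


def error_rate_query(service: str) -> str:
    """Build the 5xx rate probe for one service with the label value escaped."""
    return ("sum(rate(http_requests_total{"
            f"service={promql_quote(service)},"
            'status=~"5.."}[5m]))')


if __name__ == "__main__":
    print(error_rate_query("checkout"))
    print(error_rate_query('odd"name'))
```

For an ordinary service name the output is identical to the f-string version, so the change is safe to drop in.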

6. The Slack Render

Slack Block Kit. Keep it tight.

async def post_to_slack(channel: str, brief: dict):
    blocks = [
        {"type": "section", "text": {"type": "mrkdwn",
            "text": f"*{brief['headline']}*"}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Impact*\n{brief['impact']}"},
            {"type": "mrkdwn",
             "text": f"*Confidence*\n{int(brief['confidence']*100)}%"},
        ]},
        {"type": "section", "text": {"type": "mrkdwn",
            "text": f"*Likely cause*\n{brief['likely_cause']}"}},
        {"type": "section", "text": {"type": "mrkdwn",
            "text": "*Evidence*\n" + "\n".join(f"- {e}" for e in brief["evidence"])}},
        {"type": "section", "text": {"type": "mrkdwn",
            "text": "*Next steps*\n" + "\n".join(f"- {s}"
                                                  for s in brief["suggested_next_steps"])}},
    ]
    if brief["runbook_links"]:
        blocks.append({"type": "section", "text": {"type": "mrkdwn",
            "text": "*Runbooks*\n" + "\n".join(brief["runbook_links"])}})
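    async with httpx.AsyncClient() as c:
        r = await c.post("https://slack.com/api/chat.postMessage",
                         headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
                         json={"channel": channel, "blocks": blocks})
    # Slack reports most failures as HTTP 200 with {"ok": false, "error": ...}.
    body = r.json()
    if not body.get("ok"):
        raise RuntimeError(f"Slack rejected the message: {body.get('error')}")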

7. Cost and Latency Budget

A typical run uses 6-8 tool calls and 4-6 model calls (one per investigate loop plus assess). With claude-3.7-sonnet that’s about 15 cents per page at our token sizes. At 30 incidents a week, that’s twenty dollars a month. Cheap.
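The arithmetic behind that figure, using the numbers above and an average month of 52/12 weeks (the function name is just for illustration):

```python
COST_PER_PAGE_USD = 0.15     # about 15 cents per page, from the text
INCIDENTS_PER_WEEK = 30
WEEKS_PER_MONTH = 52 / 12    # average month length in weeks


def monthly_cost(cost_per_page: float, incidents_per_week: float) -> float:
    """Estimated monthly model spend in US dollars."""
    return cost_per_page * incidents_per_week * WEEKS_PER_MONTH


if __name__ == "__main__":
    weekly = COST_PER_PAGE_USD * INCIDENTS_PER_WEEK
    monthly = monthly_cost(COST_PER_PAGE_USD, INCIDENTS_PER_WEEK)
    print(f"weekly: ${weekly:.2f}, monthly: ${monthly:.2f}")
```

That comes to $4.50 a week and about $19.50 a month, so the bill scales linearly with page volume and stays small even at several times this rate.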

End-to-end latency is the sum of model calls and tool calls. PromQL with query_range over 30 minutes is 200-400 ms. Loki tail with limit=100 is 100-200 ms. At those speeds, eight tool calls add only about 2-3 seconds, so model calls account for most of the 18-25 seconds of total wall clock for a brief. Tool latency is the part you control: if your tools are slower than that, fix the tools first.

8. Common Pitfalls

Four mistakes you’ll make once.

Giving the model write access “just in case”. Don’t. The moment a tool can patch, the entire security model collapses. If it needs to act, escalate to a separate workflow.
No tool call ceiling. The model will happily run twenty queries chasing a hunch. Cap at 8, force an assessment, and let the human ask follow-ups.
Returning unbounded tool output. A LogQL query can return 50 MB. Truncate to 4 KB before the model sees it. If you don’t, you’ll burn context and the model will fixate on whatever happens to be in the last few KB.
Pre-injecting the model’s “favorite” cause. A common mistake is including phrases like “we recently saw issues with the database” in the system prompt. The model will always blame the database. Keep the system prompt situation-agnostic.

9. Troubleshooting

Three failures you’ll see in week one.

9.1 Copilot ignores the runbook

You’re storing runbooks but not searching them well. Make sure runbook_search returns the relevant top 3 and not the most recent 3. Embeddings on runbook titles plus a small content sample work better than full-text search.
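To make "relevant, not recent" concrete, here is a toy ranking sketch. It scores each runbook's title plus a short content sample against the query and returns the top three by cosine similarity. The bag-of-words `embed` function is only a stand-in so the example runs without a model; a real index would call an embedding model there. All names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_runbooks(query: str, runbooks: list[dict], k: int = 3) -> list[str]:
    """Rank runbooks by similarity of title + sample to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(rb["title"] + " " + rb["sample"])), rb["title"])
              for rb in runbooks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop zero-score entries so unrelated runbooks never pad the top k.
    return [title for score, title in scored[:k] if score > 0]


if __name__ == "__main__":
    books = [
        {"title": "Kafka consumer lag", "sample": "Inspect partition assignment"},
        {"title": "Checkout 5xx error rate spike", "sample": "Check payment gateway latency"},
    ]
    print(top_runbooks("checkout 5xx error rate", books))
```

Filtering out zero scores matters as much as the ranking: returning nothing is better than handing the model three runbooks for the wrong system.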

9.2 Brief contradicts the metrics

This is almost always context overload. The model has 60 KB of tool results in context and is pattern-matching on the wrong slice. Reduce per-tool response size to 2 KB and try again.
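Before changing limits, it helps to measure how much tool output is actually in context. This sketch sums the size of every tool_result block in a transcript shaped like the one built in the investigate node; the function name is hypothetical.

```python
def tool_result_bytes(transcript: list[dict]) -> int:
    """Total UTF-8 size of all tool_result contents in the message history."""
    total = 0
    for message in transcript:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string messages carry no tool results
        for block in content:
            # Assistant turns may hold SDK objects rather than dicts; skip those.
            if isinstance(block, dict) and block.get("type") == "tool_result":
                total += len(str(block.get("content", "")).encode("utf-8"))
    return total


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "alert payload"},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": "t1", "content": "x" * 3000},
        ]},
    ]
    print(tool_result_bytes(history), "bytes of tool output in context")
```

Logging this number per run makes it easy to see whether the briefs that contradict the metrics are also the ones with the most tool output.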

9.3 Slack post arrives blank

Block Kit is unforgiving about missing fields. Validate the brief against the schema before rendering, and substitute "-" for any empty string. Slack rejects blocks with empty text.text.
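The substitution is a few lines. This sketch uses a hypothetical `safe_text` helper; the 3000-character default reflects Slack's documented limit for section text, and the limit for `fields` entries is lower, so pass a smaller value there.

```python
def safe_text(value, limit: int = 3000) -> str:
    """Return text that is safe to place in a Block Kit text field."""
    text = "" if value is None else str(value).strip()
    if not text:
        return "-"  # Slack rejects empty text, so show a placeholder
    if len(text) > limit:
        return text[: limit - 1] + "…"  # keep the result within the limit
    return text


def section(title: str, body) -> dict:
    """Build one mrkdwn section block with a bold title and a guarded body."""
    return {"type": "section",
            "text": {"type": "mrkdwn", "text": f"*{title}*\n{safe_text(body)}"}}


if __name__ == "__main__":
    print(section("Likely cause", ""))
    print(len(safe_text("x" * 5000)))
```

Wrapping every brief field in `safe_text` before building the blocks means a sparse brief still renders, with dashes where the model had nothing to say.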

10. Wrapping Up

The right SRE copilot is small, read-only, and opinionated about output shape. It’s not trying to replace the on-call. It’s trying to give them a 30-second head start. Once it’s running, the team’s instinct will be to add write tools. Resist. Build a separate, approval-gated remediation pipeline instead.

For the remediation pattern that pairs with this copilot, see auto remediation pipelines with LLM agents and Argo Events. For the upstream foundations on tool use, the Anthropic tool use docs are the canonical reference.
