Building an SRE Copilot for On-Call Engineers
TL;DR: Give the LLM a strict tool catalog (PromQL, LogQL, Tempo, runbook search), keep all tools read-only, drive the loop with LangGraph 0.2, and ship a Slack interface before you ship a web UI.
An on-call engineer at 2 AM doesn’t want a chatbot. They want answers. The right copilot is one that, given a PagerDuty alert, runs the same five queries the on-call would run, reads the runbook, and produces a brief that turns the first ten minutes into the last ten minutes.
The dangerous version of this copilot has cluster admin and a kubectl tool. The useful version is strictly read-only, runs in a sidecar, and never touches production state. It’s not a remediator. It’s a research assistant that compresses the boring part of on-call.
This tutorial builds that copilot. We’ll use claude-3.7-sonnet for tool use, LangGraph 0.2 for the loop, and the standard Grafana stack (Prometheus 3.0, Loki 3.3, Tempo 2.6) as the data sources. The copilot lives behind a Slack app, ingests PagerDuty webhooks, and posts a structured brief into the incident channel within 30 seconds of a page.
1. What the Copilot Does and Doesn’t Do
It does:
- Run read-only PromQL, LogQL, and TraceQL queries.
- Search the runbook index.
- Read recent deployment history from Argo CD 2.13.
- Read Kubernetes resource state (events, pods, deployments).
- Post a structured brief to Slack.
It does not:
- Modify any resource.
- Execute commands on hosts.
- Offer any tool that could work around the two limits above.
Read-only is not a guideline. It’s enforced by the ClusterRole bound to the copilot’s service account.
# rbac/copilot.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-copilot-readonly
rules:
  - apiGroups: [""]
    resources: ["pods", "events", "services", "endpoints", "nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["argoproj.io"]
    resources: ["applications", "applicationsets"]
    verbs: ["get", "list"]
No create, no update, no patch, no delete. On the Kubernetes side, this is the whole security story. The Prometheus, Loki, and Tempo tools only ever issue query requests.
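One way to keep that guarantee from eroding is a CI check that fails when a write verb shows up in the role. The sketch below assumes the rules have already been loaded into plain Python dicts (for example with a YAML parser). `find_write_verbs` and `READ_VERBS` are hypothetical names for illustration, not part of any Kubernetes client.

```python
# Hypothetical CI check: flag any RBAC rule that grants more than read access.
READ_VERBS = {"get", "list", "watch"}


def find_write_verbs(rules: list[dict]) -> list[tuple[str, str]]:
    """Return (resource, verb) pairs that are not read-only."""
    violations = []
    for rule in rules:
        for verb in rule.get("verbs", []):
            if verb not in READ_VERBS:  # also catches the "*" wildcard
                for resource in rule.get("resources", []):
                    violations.append((resource, verb))
    return violations


# The rules from rbac/copilot.yaml, written out as dicts for the example.
COPILOT_RULES = [
    {"apiGroups": [""],
     "resources": ["pods", "events", "services", "endpoints", "nodes"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"],
     "resources": ["deployments", "statefulsets", "daemonsets", "replicasets"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["argoproj.io"],
     "resources": ["applications", "applicationsets"],
     "verbs": ["get", "list"]},
]

assert find_write_verbs(COPILOT_RULES) == []
```

Running this on every change to the role file means a reviewer does not have to spot a stray `patch` by eye.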
2. The Tool Catalog
The tool definitions reach the model through the Anthropic client; LangGraph 0.2 only drives the loop around it. Each tool is a function with a typed signature. Five tools is the right number: with fewer the copilot misses obvious moves, and with more it gets confused.
# copilot/tools.py
from pydantic import BaseModel, Field
from anthropic.types import ToolParam
import httpx

class PromQLArgs(BaseModel):
    query: str = Field(..., description="A PromQL expression")
    minutes: int = Field(15, ge=1, le=180)

class LogQLArgs(BaseModel):
    query: str
    minutes: int = Field(15, ge=1, le=120)
    limit: int = Field(100, ge=1, le=500)

class TraceArgs(BaseModel):
    service: str
    minutes: int = 15
    error_only: bool = True

class RunbookArgs(BaseModel):
    query: str

class DeployArgs(BaseModel):
    service: str
    hours: int = Field(24, ge=1, le=168)

# Each entry pairs a tool name with the JSON schema generated from its args model.
TOOLS: list[ToolParam] = [
    {
        "name": "promql",
        "description": "Run a PromQL query and return the result series. Read-only.",
        "input_schema": PromQLArgs.model_json_schema(),
    },
    {
        "name": "logql",
        "description": "Run a LogQL query, returns up to `limit` log lines.",
        "input_schema": LogQLArgs.model_json_schema(),
    },
    {
        "name": "traces",
        "description": "Fetch recent traces for a service, optionally errors only.",
        "input_schema": TraceArgs.model_json_schema(),
    },
    {
        "name": "runbook_search",
        "description": "Search the internal runbook index. Returns top 3 results.",
        "input_schema": RunbookArgs.model_json_schema(),
    },
    {
        "name": "deploy_history",
        "description": "Recent deploys for a service from Argo CD.",
        "input_schema": DeployArgs.model_json_schema(),
    },
]
Each tool implementation is a thin async function. Here’s promql.
import time

async def tool_promql(args: PromQLArgs) -> dict:
    # The Prometheus HTTP API takes Unix timestamps or RFC 3339 times,
    # not relative expressions like "now-15m".
    end = time.time()
    start = end - args.minutes * 60
    async with httpx.AsyncClient(timeout=15) as c:
        r = await c.get("http://prometheus:9090/api/v1/query_range", params={
            "query": args.query,
            "start": start,
            "end": end,
            "step": "30s",
        })
        r.raise_for_status()
        result = r.json()["data"]["result"]
    # Compact the response: keep last 20 points per series, max 10 series
    compact = []
    for s in result[:10]:
        values = s["values"][-20:]
        compact.append({"metric": s["metric"], "values": values})
    return {"series": compact, "truncated": len(result) > 10}
The compaction step matters. A raw PromQL response can be 500 KB. The model only needs the shape, not every sample. Keep tool responses under 4 KB each.
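The 4 KB ceiling can be enforced on the structure itself instead of by slicing the serialized string, since a character slice can cut the JSON mid-token. A minimal sketch, assuming each tool returns a dict with one list-valued field. `fit_to_budget` and its parameters are hypothetical names, not part of the tutorial's code.

```python
import json


def fit_to_budget(payload: dict, list_key: str, max_bytes: int = 4096) -> dict:
    """Drop trailing items from payload[list_key] until the JSON fits max_bytes.

    The input is not mutated. If the payload is still too large with an empty
    list, it is returned as-is and the caller has to handle it.
    """
    out = dict(payload)
    items = list(payload.get(list_key, []))
    out[list_key] = items  # same list object, so pop() below shrinks out too
    while items and len(json.dumps(out).encode("utf-8")) > max_bytes:
        items.pop()
        out["truncated"] = True
    return out


# Usage: 50 series of 20 points each is far over budget before trimming.
big = {
    "series": [{"metric": {"pod": f"p{i}"}, "values": [[i, "1.0"]] * 20}
               for i in range(50)],
    "truncated": False,
}
small = fit_to_budget(big, "series")
assert len(json.dumps(small).encode("utf-8")) <= 4096
```

The result is always valid JSON, and the `truncated` flag tells the model that it is looking at a sample.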
3. The LangGraph Loop
LangGraph 0.2 gives us a typed state machine. The graph has two nodes, investigate and assess. The report is what assess writes into the state before the graph ends.
# copilot/graph.py
import json
from typing import TypedDict

from anthropic import AsyncAnthropic
from langgraph.graph import StateGraph, END

from copilot.tools import TOOLS

# SYSTEM_INVESTIGATE, SYSTEM_ASSESS, and dispatch() are assumed to be
# defined elsewhere in the package.

claude = AsyncAnthropic()

class State(TypedDict):
    alert: dict
    transcript: list[dict]  # message history
    findings: list[str]
    tool_calls: int
    report: dict | None

MAX_TOOL_CALLS = 8

async def investigate(state: State) -> State:
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        system=SYSTEM_INVESTIGATE,
        tools=TOOLS,
        messages=state["transcript"],
    )
    if msg.stop_reason == "tool_use":
        # Record the assistant turn once, then answer every tool_use block
        # in a single user message, which is what the Messages API expects.
        state["transcript"].append({"role": "assistant", "content": msg.content})
        results = []
        for block in msg.content:
            if block.type == "tool_use":
                result = await dispatch(block.name, block.input)
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": json.dumps(result)[:4000]})
                state["tool_calls"] += 1
        state["transcript"].append({"role": "user", "content": results})
    else:
        # model decided to stop investigating
        text = "".join(b.text for b in msg.content if b.type == "text")
        state["findings"].append(text)
    return state

def should_continue(state: State) -> str:
    if state["tool_calls"] >= MAX_TOOL_CALLS:
        return "assess"
    if state["findings"]:
        return "assess"
    return "investigate"

async def assess(state: State) -> State:
    msg = await claude.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=800,
        system=SYSTEM_ASSESS,
        # The transcript contains tool_use blocks, so the tool definitions
        # are passed again; tool_choice "none" keeps this turn text-only.
        tools=TOOLS,
        tool_choice={"type": "none"},
        messages=state["transcript"] + [{
            "role": "user",
            "content": "Now produce the final brief as JSON.",
        }],
    )
    text = "".join(b.text for b in msg.content if b.type == "text")
    state["report"] = json.loads(text)
    return state

graph = StateGraph(State)
graph.add_node("investigate", investigate)
graph.add_node("assess", assess)
graph.set_entry_point("investigate")
graph.add_conditional_edges("investigate", should_continue,
                            {"investigate": "investigate", "assess": "assess"})
graph.add_edge("assess", END)
copilot = graph.compile()
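MAX_TOOL_CALLS = 8 is the budget, and the initial probe loaded by the webhook handler counts as the first call. The check runs between model turns, so a turn that requests several tools at once can overshoot by a call or two. If the model hasn’t drawn a conclusion by then, the loop forces an assessment with whatever it’s got. That’s the safety valve.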
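The investigate node calls `dispatch`, which the listing above leaves undefined. One possible shape is a name-to-handler registry that never raises, so a failed query becomes a tool result the model can read. The sketch below is an assumption about how that could look: `REGISTRY`, `register`, and the `echo` stand-in tool are hypothetical, and a real handler would validate its input with the matching Pydantic model before calling the backend.

```python
import asyncio
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[dict]]

# Hypothetical registry: tool name -> async handler taking the raw input dict.
REGISTRY: dict[str, Handler] = {}


def register(name: str) -> Callable[[Handler], Handler]:
    """Decorator that adds a handler to the registry under `name`."""
    def wrap(fn: Handler) -> Handler:
        REGISTRY[name] = fn
        return fn
    return wrap


async def dispatch(name: str, raw_input: dict, timeout_s: float = 15.0) -> dict:
    """Run one tool call and return an error dict instead of raising."""
    handler = REGISTRY.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return await asyncio.wait_for(handler(raw_input), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": f"{name} timed out after {timeout_s}s"}
    except Exception as exc:  # validation or backend failure
        return {"error": f"{name} failed: {exc}"}


@register("echo")
async def tool_echo(raw_input: dict) -> dict:
    # Stand-in handler for the example only.
    return {"echo": raw_input}


print(asyncio.run(dispatch("echo", {"q": "up"})))
print(asyncio.run(dispatch("kubectl", {})))
```

Because unknown names return an error, a tool the model hallucinates (such as `kubectl`) costs one budget slot and nothing else.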
4. The Brief Schema
The output is structured. Slack rendering depends on it.
class Brief(BaseModel):
    headline: str                    # one line
    impact: str                      # what users see
    likely_cause: str
    evidence: list[str]              # bullet points, max 6
    suggested_next_steps: list[str]  # max 4, read-only verbs only
    runbook_links: list[str]
    confidence: float                # 0..1
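suggested_next_steps is the trickiest field. The model loves to suggest kubectl delete pod. We pass a step through only if it starts with a verb on an allowlist: “check”, “inspect”, “verify”, “review”, “page”, “look”, “compare”. Anything else is prefixed with “Review:”, so it reads as something to evaluate rather than an instruction. A small post-processor enforces it.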
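ALLOWED_VERBS = {"check", "inspect", "verify", "review", "page", "look", "compare"}

def sanitize_steps(steps: list[str]) -> list[str]:
    out = []
    for s in steps:
        words = s.strip().split()
        if not words:
            continue  # skip empty steps instead of raising IndexError
        # Ignore trailing punctuation so "Check: ..." still matches "check"
        verb = words[0].lower().rstrip(":,.")
        if verb in ALLOWED_VERBS:
            out.append(s)
        else:
            out.append(f"Review: {s}")
    return out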
This is belt-and-braces. The system prompt also forbids imperatives, but post-processing is what you trust.
5. The PagerDuty Webhook
PagerDuty fires a webhook on incident creation. The copilot runs and posts to Slack.
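# api/webhook.py
from fastapi import FastAPI, Request, BackgroundTasks
import json, os, httpx

from copilot.graph import copilot
# sanitize_steps, build_initial_state, and post_to_slack are the helpers
# shown in this tutorial; import them from wherever you keep them.

app = FastAPI()

@app.post("/pd/incident")
async def pd_incident(req: Request, bg: BackgroundTasks):
    event = await req.json()
    if event["event"]["event_type"] != "incident.triggered":
        return {"ok": True}
    # Return immediately and run the investigation in the background
    bg.add_task(run_copilot, event["event"]["data"])
    return {"ok": True}

async def run_copilot(incident: dict):
    state = await build_initial_state(incident)
    final = await copilot.ainvoke(state)
    brief = final["report"]
    brief["suggested_next_steps"] = sanitize_steps(brief["suggested_next_steps"])
    # Assumes the incident dict carries the Slack channel for this incident
    await post_to_slack(incident["channel_id"], brief)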
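The endpoint above accepts any POST, so it is worth verifying that a request came from PagerDuty before spending model calls on it. The sketch below assumes PagerDuty's v3 webhook signing scheme as commonly documented: an `X-PagerDuty-Signature` header holding one or more comma-separated `v1=<hex>` values, each an HMAC-SHA256 of the raw request body keyed with the webhook secret. Treat that as an assumption to confirm against the PagerDuty docs. `verify_pd_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_pd_signature(secret: str, body: bytes, header: str) -> bool:
    """Check a raw body against a 'v1=<hex>,v1=<hex>' style signature header."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    for part in header.split(","):
        scheme, _, sig = part.strip().partition("=")
        # compare_digest avoids leaking how many leading characters matched
        if scheme == "v1" and hmac.compare_digest(sig.encode("utf-8"),
                                                  expected.encode("utf-8")):
            return True
    return False


# Usage with a locally computed signature.
body = b'{"event": {"event_type": "incident.triggered"}}'
good = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
assert verify_pd_signature("s3cret", body, f"v1={good}")
assert not verify_pd_signature("s3cret", body, "v1=deadbeef")
```

In the FastAPI handler this would run on `await req.body()` and `req.headers.get("x-pagerduty-signature", "")` before the JSON is parsed, returning a 401 on a mismatch.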
Build the initial state by extracting the alert labels, the service name, and pre-loading a single PromQL probe so the model has at least one anchor data point.
async def build_initial_state(incident: dict) -> State:
    service = incident["custom_details"].get("service", "unknown")
    # One pre-loaded probe: the 5xx rate for the affected service
    initial_probe = await tool_promql(PromQLArgs(
        query=f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[5m]))',
        minutes=30,
    ))
    return State(
        alert=incident,
        transcript=[{
            "role": "user",
            "content": json.dumps({
                "alert": incident,
                "initial_probe": initial_probe,
            }),
        }],
        findings=[],
        tool_calls=1,
        report=None,
    )
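The probe interpolates a service name taken from the alert payload straight into a PromQL string. The copilot is read-only, so the risk is a malformed or misleading query rather than damage, but escaping the label value is cheap. A minimal sketch, with `escape_label_value` and `error_rate_query` as hypothetical helper names:

```python
def escape_label_value(value: str) -> str:
    """Escape a string for use inside a double-quoted PromQL label matcher."""
    return (value.replace("\\", "\\\\")   # backslashes first
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def error_rate_query(service: str) -> str:
    """Build the 5xx-rate probe for one service."""
    safe = escape_label_value(service)
    return f'sum(rate(http_requests_total{{service="{safe}",status=~"5.."}}[5m]))'


print(error_rate_query("checkout"))
# sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
```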
6. The Slack Render
Slack Block Kit. Keep it tight.
import os

SLACK_TOKEN = os.environ["SLACK_TOKEN"]  # env var name is an assumption

async def post_to_slack(channel: str, brief: dict):
    blocks = [
        {"type": "section", "text": {"type": "mrkdwn",
            "text": f"*{brief['headline']}*"}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Impact*\n{brief['impact']}"},
            {"type": "mrkdwn",
             "text": f"*Confidence*\n{int(brief['confidence']*100)}%"},
        ]},
        {"type": "section", "text": {"type": "mrkdwn",
            "text": f"*Likely cause*\n{brief['likely_cause']}"}},
        {"type": "section", "text": {"type": "mrkdwn",
            "text": "*Evidence*\n" + "\n".join(f"- {e}" for e in brief["evidence"])}},
        {"type": "section", "text": {"type": "mrkdwn",
            "text": "*Next steps*\n" + "\n".join(f"- {s}"
                for s in brief["suggested_next_steps"])}},
    ]
    if brief["runbook_links"]:
        blocks.append({"type": "section", "text": {"type": "mrkdwn",
            "text": "*Runbooks*\n" + "\n".join(brief["runbook_links"])}})
    async with httpx.AsyncClient() as c:
        resp = await c.post("https://slack.com/api/chat.postMessage",
                            headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
                            json={"channel": channel, "blocks": blocks})
    # Slack answers HTTP 200 even on failure; the "ok" field carries the result
    data = resp.json()
    if not data.get("ok"):
        raise RuntimeError(f"Slack rejected the brief: {data.get('error')}")
7. Cost and Latency Budget
A typical run uses 6-8 tool calls and 4-6 model calls (one per investigate loop plus assess). With claude-3.7-sonnet that’s about 15 cents per page at our token sizes. At 30 incidents a week, that’s twenty dollars a month. Cheap.
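The monthly figure follows directly from the two numbers above. As a quick check, using the per-page cost and incident rate from this section and averaging a month as 52/12 weeks:

```python
COST_PER_PAGE_USD = 0.15      # from the estimate above
INCIDENTS_PER_WEEK = 30
WEEKS_PER_MONTH = 52 / 12     # about 4.33

weekly = COST_PER_PAGE_USD * INCIDENTS_PER_WEEK
monthly = weekly * WEEKS_PER_MONTH
print(f"weekly: ${weekly:.2f}, monthly: ${monthly:.2f}")
# weekly: $4.50, monthly: $19.50
```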
With healthy tools, end-to-end latency is dominated by model calls, not tool calls. PromQL with query_range over 30 minutes is 200-400 ms. A Loki query with limit=100 is 100-200 ms. Eight tool calls at those figures add up to roughly 3 seconds at most, so most of the 18-25 seconds of wall clock for a brief is spent in the 4-6 model calls. If your tools are slower than that, fix the tools first.
8. Common Pitfalls
Four mistakes you’ll make once.
- Giving the model write access “just in case”. Don’t. The moment a tool can patch, the entire security model collapses. If it needs to act, escalate to a separate workflow.
- No tool call ceiling. The model will happily run twenty queries chasing a hunch. Cap at 8, force an assessment, and let the human ask follow-ups.
- Returning unbounded tool output. A LogQL query can return 50 MB. Truncate to 4 KB before the model sees it. If you don’t, you’ll burn context and the model will fixate on whatever happens to be in the last few KB.
- Pre-injecting the model’s “favorite” cause. A common mistake is including phrases like “we recently saw issues with the database” in the system prompt. The model will always blame the database. Keep the system prompt situation-agnostic.
9. Troubleshooting
Three failures you’ll see in week one.
9.1 Copilot ignores the runbook
You’re storing runbooks but not searching them well. Make sure runbook_search returns the relevant top 3 and not the most recent 3. Embeddings on runbook titles plus a small content sample work better than full-text search.
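Ranking by relevance rather than recency comes down to scoring each runbook against the query and sorting. A minimal sketch using cosine similarity over precomputed embedding vectors: the three-dimensional vectors and the `INDEX` entries are made up for illustration, and in practice the vectors would come from whatever embedding model you use on the title plus a content sample.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_runbooks(query_vec: list[float], index: list[dict], k: int = 3) -> list[str]:
    """Return the titles of the k runbooks most similar to the query."""
    ranked = sorted(index, key=lambda doc: cosine(query_vec, doc["vector"]),
                    reverse=True)
    return [doc["title"] for doc in ranked[:k]]


# Toy index with hypothetical runbooks and hand-written vectors.
INDEX = [
    {"title": "Checkout 5xx spike", "vector": [0.9, 0.1, 0.0]},
    {"title": "Postgres failover", "vector": [0.1, 0.9, 0.1]},
    {"title": "Kafka consumer lag", "vector": [0.0, 0.2, 0.9]},
    {"title": "Ingress TLS expiry", "vector": [0.7, 0.0, 0.3]},
]

print(top_runbooks([1.0, 0.0, 0.1], INDEX, k=2))
# ['Checkout 5xx spike', 'Ingress TLS expiry']
```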
9.2 Brief contradicts the metrics
This is almost always context overload. The model has 60 KB of tool results in context and is pattern-matching on the wrong slice. Reduce per-tool response size to 2 KB and try again.
9.3 Slack post arrives blank
Block Kit is unforgiving about missing fields. Validate the brief against the schema before rendering, and substitute "-" for any empty string. Slack rejects blocks with empty text.text.
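The substitution can be a small pass over the brief right before rendering. A minimal sketch, assuming the brief is the plain dict produced by the assess step. `fill_blanks` is a hypothetical helper name.

```python
def fill_blanks(brief: dict, placeholder: str = "-") -> dict:
    """Replace blank strings with a placeholder and drop blank list items."""
    out = {}
    for key, value in brief.items():
        if isinstance(value, str):
            out[key] = value.strip() or placeholder
        elif isinstance(value, list):
            out[key] = [v for v in value if isinstance(v, str) and v.strip()]
        else:
            out[key] = value  # numbers such as confidence pass through
    return out


brief = {"headline": "", "impact": "  ", "likely_cause": "Bad deploy",
         "evidence": ["5xx rate tripled", ""], "confidence": 0.4}
print(fill_blanks(brief))
# {'headline': '-', 'impact': '-', 'likely_cause': 'Bad deploy',
#  'evidence': ['5xx rate tripled'], 'confidence': 0.4}
```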
10. Wrapping Up
The right SRE copilot is small, read-only, and opinionated about output shape. It’s not trying to replace the on-call. It’s trying to give them a 30-second head start. Once it’s running, the team’s instinct will be to add write tools. Resist. Build a separate, approval-gated remediation pipeline instead.
For the remediation pattern that pairs with this copilot, see auto remediation pipelines with LLM agents and Argo Events. For the upstream foundations on tool use, the Anthropic tool use docs are the canonical reference.