Catching Regressions with an AI Reviewer Agent on Pull Requests


February 17, 2026 · 11 min read · by Muhammad Amal programming

TL;DR: An AI reviewer agent reads the PR diff, not the whole repo. LangGraph 0.3 gives you a deterministic graph with retry and budget control. GitHub Actions runs it on pull_request and posts inline comments via the Reviews API.

Most code review tools that bolt an LLM onto a pull request fail the same way: they summarize the diff in prose nobody reads, or they hallucinate problems in code that’s perfectly fine. The summary gets a thumbs-up emoji and the actual regression — the one where someone changed a default argument from None to [] and now every caller shares mutable state — sails straight through.
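
To make that default-argument regression concrete, here is a minimal sketch of it. The function names are hypothetical and exist only for illustration; the point is how small the diff is compared with the damage it does.

```python
# Before the change: a fresh list is created on every call.
def add_tag_before(tag, tags=None):
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


# After the "harmless" refactor: the default list is built once, when the
# function is defined, and shared by every caller that omits `tags`.
def add_tag_after(tag, tags=[]):
    tags.append(tag)
    return tags


first = add_tag_after("urgent")
second = add_tag_after("billing")
print(second)  # ['urgent', 'billing'] : state leaked from the first call
```

The diff for this change is two deleted lines and one edited signature, which is the kind of hunk a prose summary tends to describe as a cleanup.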

I’ve been running an AI reviewer agent on pull requests across two production services for about four months. It does not replace human review. What it does is catch the boring, mechanical regressions that humans skim past on a Friday afternoon: swallowed exceptions, off-by-one in pagination, a removed null check, a config key renamed in one place but not the other. The trick is scoping the agent narrowly and feeding it structured context instead of dumping the repo into a prompt.

This post builds that agent with LangGraph 0.3 and wires it into GitHub Actions. The graph is small on purpose: fetch diff, analyze hunks, dedupe findings, post review. If you’ve already read the broader AI code review pipeline post, this is the deep dive on the reviewer node itself.

Why a graph and not a single prompt

You could write this as one client.messages.create() call. I did, originally. It broke in three ways. Large diffs blew the context window. A transient 529 from the API failed the whole job with no retry. And there was no place to enforce a token budget, so a 4,000-line refactor PR cost more than the feature was worth.

LangGraph 0.3 solves all three with explicit nodes and edges. Each node is a pure-ish function over a typed state object. You get checkpointing, conditional edges, and a natural seam to inject retries and budget checks. The graph for a PR reviewer looks like this:

fetch_diff -> chunk_hunks -> analyze (per chunk, with retry)
           -> dedupe_findings -> post_review

Let’s build it from the state up.

Project setup

# pyproject.toml
[project]
name = "pr-reviewer-agent"
version = "0.1.0"
requires-python = ">=3.12"
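dependencies = [
  "langgraph==0.3.27",
  "langchain-anthropic==0.3.9",
  "anthropic==0.49.0",
  "unidiff==0.7.5",
  "pygithub==2.6.1",
  "pydantic==2.11.3",
  "tenacity==9.0.0",   # retry decorator used in nodes/analyze.py
  "requests==2.32.3",  # raw patch download in nodes/fetch.py
]

unidiff parses the raw patch into structured hunks so the model never has to count line numbers itself. PyGithub handles the Reviews API, tenacity supplies the retry decorator, and requests downloads the patch. Pin everything, including packages that would otherwise arrive only as transitive dependencies: agent behavior drifts when a dependency moves under you.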

The state object

LangGraph state is a TypedDict. Keep it flat and serializable so checkpointing works.

# state.py
from typing import TypedDict
from pydantic import BaseModel


class Finding(BaseModel):
    file: str
    line: int           # line in the new file (RIGHT side of the diff)
    severity: str       # "blocker" | "warning" | "nit"
    category: str       # "regression" | "bug" | "style" | "security"
    message: str
    suggestion: str | None = None


class ReviewState(TypedDict):
    repo: str
    pr_number: int
    base_sha: str
    head_sha: str
    raw_diff: str
    hunks: list[dict]
    findings: list[Finding]   # no reducer: each write replaces the list
    tokens_used: int
    budget_exceeded: bool

findings is deliberately a plain list with no reducer. The analyze node below loops over every hunk inside a single invocation and returns the complete list, and dedupe_findings then replaces it with the deduplicated one. A reducer such as Annotated[list[Finding], add] only becomes necessary if you fan analyze out to one invocation per chunk. In that case LangGraph concatenates the partial lists, and dedupe has to write its result to a separate key, because returning a list to a reducer-backed key appends to it instead of replacing it.
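
The difference between overwriting a state key and merging into it can be illustrated without LangGraph at all. The sketch below is a plain-Python imitation of the update rule; apply_update is a hypothetical helper, not a LangGraph API.

```python
from operator import add


def apply_update(current: list, update: list, reducer=None) -> list:
    """Imitate a state-key update: with a reducer the old and new values
    are combined; without one the new value replaces the old."""
    return reducer(current, update) if reducer else update


chunk_a = ["finding-1"]
chunk_b = ["finding-2", "finding-3"]

# No reducer: the second update overwrites the first.
overwritten = apply_update(apply_update([], chunk_a), chunk_b)

# operator.add as the reducer: updates are concatenated.
merged = apply_update(apply_update([], chunk_a, add), chunk_b, add)

print(overwritten)  # ['finding-2', 'finding-3']
print(merged)       # ['finding-1', 'finding-2', 'finding-3']
```

The same rule applies to every node that writes the key, not only the one you had in mind when you added the reducer: a later node that returns a filtered list to an add-backed key ends up appending it.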

Fetching and chunking the diff

Never feed the agent the whole repository. Feed it the diff, and only the diff. The pull_request event gives you base and head SHAs; the GitHub API returns the patch directly.

# nodes/fetch.py
import os
from github import Github
from unidiff import PatchSet
from state import ReviewState

# Files that are noise for a reviewer agent.
SKIP_PATTERNS = (".lock", ".snap", "package-lock.json", "pnpm-lock.yaml")


def fetch_diff(state: ReviewState) -> dict:
    gh = Github(os.environ["GITHUB_TOKEN"])
    repo = gh.get_repo(state["repo"])
    pr = repo.get_pull(state["pr_number"])

    # Download the unified diff through the REST API's diff media type
    # (see _download_patch). Parsing it here fails fast on a malformed patch.
    patch = PatchSet(_download_patch(pr))
    return {"raw_diff": str(patch)}


def _download_patch(pr) -> str:
    import requests
    resp = requests.get(
        pr.url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github.v3.diff",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text


def chunk_hunks(state: ReviewState) -> dict:
    patch = PatchSet(state["raw_diff"])
    hunks: list[dict] = []
    for pfile in patch:
        if pfile.is_removed_file:
            continue
        if any(pfile.path.endswith(p) for p in SKIP_PATTERNS):
            continue
        for hunk in pfile:
            # Only review hunks that actually add or change lines.
            if not any(line.is_added for line in hunk):
                continue
            hunks.append({
                "file": pfile.path,
                "header": str(hunk).splitlines()[0],
                "body": str(hunk),
                "start_line": hunk.target_start,
            })
    return {"hunks": hunks}

Two decisions worth defending. First, skip lock files and snapshots — they’re generated, and a reviewer agent that comments on pnpm-lock.yaml trains people to ignore it. Second, chunk by hunk, not by file. A 600-line file with one changed line should produce one small prompt, not a 600-line one.
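
Both decisions can be seen in a dependency-free sketch. It is not the unidiff-based chunker above: split_file_diff and its regex are illustrative stand-ins that assume a well-formed unified diff for a single file.

```python
import re

SKIP_SUFFIXES = (".lock", ".snap", "package-lock.json", "pnpm-lock.yaml")
# Captures the new-file start line from a header like "@@ -10,3 +10,3 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def split_file_diff(path: str, file_diff: str) -> list[dict]:
    """Split one file's unified diff into per-hunk chunks.

    Returns [] for skipped files and drops hunks that add no lines."""
    if path.endswith(SKIP_SUFFIXES):
        return []
    chunks: list[dict] = []
    current = None
    for line in file_diff.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            current = {"file": path, "start_line": int(match.group(1)), "lines": [line]}
            chunks.append(current)
        elif current is not None:      # ignores the ---/+++ file header
            current["lines"].append(line)
    return [c for c in chunks if any(l.startswith("+") for l in c["lines"][1:])]


sample = "\n".join([
    "--- a/app/pager.py",
    "+++ b/app/pager.py",
    "@@ -10,3 +10,3 @@ def page(items, n):",
    "     start = n * SIZE",
    "-    end = start + SIZE",
    "+    end = start + SIZE + 1",
    "     return items[start:end]",
    "@@ -40,3 +40,2 @@ def reset():",
    "     cache.clear()",
    '-    log.info("reset")',
    "     return True",
])

chunks = split_file_diff("app/pager.py", sample)
print(len(chunks), chunks[0]["start_line"])  # 1 10
```

The second hunk only removes a line, so it is dropped, and the one remaining prompt carries four diff lines however long pager.py is.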

The analyze node

This is where the model earns its keep. Send one hunk at a time with a strict instruction to return JSON. The prompt must tell the model what a regression looks like, because “find bugs” is too vague to be useful.

# nodes/analyze.py
import json
from langchain_anthropic import ChatAnthropic
from anthropic import APIStatusError
from tenacity import retry, stop_after_attempt, wait_exponential
from state import ReviewState, Finding

MODEL = ChatAnthropic(
    model="claude-sonnet-4-5-20250929",
    temperature=0,
    max_tokens=1500,
)

SYSTEM = """You are a senior backend engineer reviewing a single diff hunk.
Report ONLY concrete regressions and bugs introduced by the added lines.
A regression is: removed validation, changed default that breaks callers,
swallowed exception, altered control flow, off-by-one, resource leak,
or a security weakening (auth, injection, secrets).

Do NOT report style preferences. Do NOT praise the code.
Return a JSON array of findings. Empty array if the hunk is clean.
Each finding: {file, line, severity, category, message, suggestion}.
'line' must be a line number from the RIGHT side of this hunk."""

TOKEN_BUDGET = 120_000


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=2, min=2, max=20),
    retry=lambda rs: isinstance(rs.outcome.exception(), APIStatusError),
    reraise=True,   # surface the original APIStatusError, not tenacity.RetryError
)
def _call_model(prompt: str):
    return MODEL.invoke(prompt)


def analyze(state: ReviewState) -> dict:
    if state.get("budget_exceeded"):
        return {}

    findings: list[Finding] = []
    tokens = state.get("tokens_used", 0)

    for hunk in state["hunks"]:
        if tokens > TOKEN_BUDGET:
            # Keep what we've found so far; the remaining hunks go unreviewed.
            return {
                "findings": findings,
                "budget_exceeded": True,
                "tokens_used": tokens,
            }

        prompt = (
            f"{SYSTEM}\n\n"
            f"File: {hunk['file']}\n"
            f"Hunk header: {hunk['header']}\n\n"
            f"```diff\n{hunk['body']}\n```"
        )
        try:
            resp = _call_model(prompt)
        except APIStatusError as exc:
            # Don't fail the whole review for one bad hunk.
            print(f"::warning::analyze failed for {hunk['file']}: {exc}")
            continue

        tokens += resp.usage_metadata["total_tokens"]
        findings.extend(_parse(resp.content, hunk["file"]))

    return {"findings": findings, "tokens_used": tokens}


def _parse(content: str, file: str) -> list[Finding]:
    """Tolerate fenced JSON and stray prose around the array."""
    text = content if isinstance(content, str) else str(content)
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        raw = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        print(f"::warning::could not parse findings for {file}")
        return []
    out = []
    for item in raw:
        try:
            out.append(Finding(**item))
        except (TypeError, ValueError):
            continue   # skip malformed finding, keep the rest
    return out

temperature=0 is non-negotiable for a reviewer. You want the same diff to produce the same comments on a re-run, and while temperature=0 does not guarantee identical output every time, it gets you as close as the API allows. The _parse function is defensive on purpose: models occasionally wrap JSON in prose despite instructions, and one malformed finding should not nuke the whole batch.
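
The prompt asks for RIGHT-side line numbers, but a raw hunk only states the starting line in its header, so the model has to count from there. One optional pre-processing step, sketched here as an assumption and not as part of the pipeline above, is to print the new-file line number next to each line before sending the hunk. annotate_hunk is a hypothetical helper. With unidiff you could likely read each line's target_line_no instead of re-deriving it from the header.

```python
import re

HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def annotate_hunk(hunk_text: str) -> str:
    """Prefix added and context lines with their new-file line number.

    Removed lines have no RIGHT-side number, so they get a blank gutter."""
    lines = hunk_text.splitlines()
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError("hunk must start with an @@ header")
    new_no = int(match.group(1))
    out = [lines[0]]
    for line in lines[1:]:
        if line.startswith(("-", "\\")):   # removed line or "\ No newline" marker
            out.append(f"{'':>5} {line}")
        else:                              # added or context line
            out.append(f"{new_no:>5} {line}")
            new_no += 1
    return "\n".join(out)


hunk = "\n".join([
    "@@ -10,3 +10,3 @@",
    " start = n * SIZE",
    "-end = start + SIZE",
    "+end = start + SIZE + 1",
    " return items[start:end]",
])
print(annotate_hunk(hunk))
```

In this example the added line is labeled 11, so the model can copy the number instead of computing it.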

Deduping findings

Run the same agent on a stacked PR and you’ll see the same finding twice — once per overlapping hunk. Dedupe before posting.

# nodes/dedupe.py
from state import ReviewState


def dedupe_findings(state: ReviewState) -> dict:
    seen: set[tuple] = set()
    unique = []
    for f in state["findings"]:
        key = (f.file, f.line, f.category)
        if key in seen:
            continue
        seen.add(key)
        unique.append(f)

    # Sort blockers first so they render at the top of the review.
    order = {"blocker": 0, "warning": 1, "nit": 2}
    unique.sort(key=lambda f: order.get(f.severity, 9))
    return {"findings": unique}

Posting the review

The GitHub Reviews API lets you attach comments to specific lines and set an overall event. Map blocker to REQUEST_CHANGES, everything else to COMMENT. Never auto-APPROVE from an agent — that’s a human’s signature.

# nodes/post.py
import os
from github import Github, GithubException
from state import ReviewState

SEV_ICON = {"blocker": "[blocker]", "warning": "[warning]", "nit": "[nit]"}


def post_review(state: ReviewState) -> dict:
    gh = Github(os.environ["GITHUB_TOKEN"])
    pr = gh.get_repo(state["repo"]).get_pull(state["pr_number"])

    findings = state["findings"]
    if not findings:
        pr.create_issue_comment("AI reviewer: no regressions found in this diff.")
        return {}

    comments = []
    for f in findings:
        body = f"{SEV_ICON.get(f.severity, '')} **{f.category}** — {f.message}"
        if f.suggestion:
            body += f"\n\n```suggestion\n{f.suggestion}\n```"
        comments.append({"path": f.file, "line": f.line, "body": body})

    has_blocker = any(f.severity == "blocker" for f in findings)
    event = "REQUEST_CHANGES" if has_blocker else "COMMENT"

    try:
        pr.create_review(
            body=_summary(findings),
            event=event,
            comments=comments,
        )
    except GithubException as exc:
        # Line-anchored comments fail if the line isn't in the diff.
        # Fall back to a single summary comment so feedback isn't lost.
        if exc.status == 422:
            pr.create_issue_comment(_summary(findings) + _flatten(comments))
        else:
            raise
    return {}


def _summary(findings) -> str:
    blockers = sum(1 for f in findings if f.severity == "blocker")
    return (
        f"### AI reviewer agent\n"
        f"{len(findings)} finding(s), {blockers} blocker(s). "
        f"Mechanical review only — human review still required."
    )


def _flatten(comments) -> str:
    lines = ["\n"]
    for c in comments:
        lines.append(f"- `{c['path']}:{c['line']}` — {c['body']}")
    return "\n".join(lines)

That 422 fallback is load-bearing. The Reviews API rejects a comment whose line isn’t part of the diff, and GitHub’s idea of “in the diff” includes a few context lines that the model sometimes overshoots. Catching 422 and reposting as a plain comment means a slightly-off line number degrades gracefully instead of failing the job.
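
A complementary option is to check line numbers locally before calling the API, so that one stray finding doesn't push every comment into the fallback. The sketch below assumes the hunk dictionaries produced by chunk_hunks (file, body, start_line). right_side_lines and split_anchorable are hypothetical helpers, and GitHub remains the authority on what counts as "in the diff", so the 422 handler should stay.

```python
def right_side_lines(start_line: int, hunk_body: str) -> set[int]:
    """New-file line numbers covered by a hunk (added and context lines)."""
    covered: set[int] = set()
    current = start_line
    for line in hunk_body.splitlines()[1:]:      # skip the @@ header
        if line.startswith(("-", "\\")):         # removed line or "\ No newline" marker
            continue
        covered.add(current)
        current += 1
    return covered


def split_anchorable(comments: list[dict], hunks: list[dict]):
    """Separate comments that land inside a reviewed hunk from those that don't."""
    covered: dict[str, set[int]] = {}
    for h in hunks:
        covered.setdefault(h["file"], set()).update(
            right_side_lines(h["start_line"], h["body"])
        )
    inline, orphaned = [], []
    for c in comments:
        in_diff = c["line"] in covered.get(c["path"], set())
        (inline if in_diff else orphaned).append(c)
    return inline, orphaned


hunks = [{
    "file": "app/pager.py",
    "start_line": 10,
    "body": "@@ -10,3 +10,3 @@\n start = n * SIZE\n-end = start + SIZE\n"
            "+end = start + SIZE + 1\n return items[start:end]\n",
}]
comments = [
    {"path": "app/pager.py", "line": 11, "body": "off-by-one"},
    {"path": "app/pager.py", "line": 57, "body": "not in this diff"},
]
inline, orphaned = split_anchorable(comments, hunks)
print([c["line"] for c in inline], [c["line"] for c in orphaned])  # [11] [57]
```

The inline list would go to create_review as before, and the orphaned entries could be appended to the review body so their feedback still reaches the author.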

Wiring the graph

# graph.py
from langgraph.graph import StateGraph, START, END
from state import ReviewState
from nodes.fetch import fetch_diff, chunk_hunks
from nodes.analyze import analyze
from nodes.dedupe import dedupe_findings
from nodes.post import post_review


def build_graph():
    g = StateGraph(ReviewState)
    g.add_node("fetch_diff", fetch_diff)
    g.add_node("chunk_hunks", chunk_hunks)
    g.add_node("analyze", analyze)
    g.add_node("dedupe", dedupe_findings)
    g.add_node("post_review", post_review)

    g.add_edge(START, "fetch_diff")
    g.add_edge("fetch_diff", "chunk_hunks")
    g.add_edge("chunk_hunks", "analyze")
    g.add_edge("analyze", "dedupe")
    g.add_edge("dedupe", "post_review")
    g.add_edge("post_review", END)
    return g.compile()


if __name__ == "__main__":
    import os
    app = build_graph()
    app.invoke({
        "repo": os.environ["GITHUB_REPOSITORY"],
        "pr_number": int(os.environ["PR_NUMBER"]),
        "base_sha": os.environ["BASE_SHA"],
        "head_sha": os.environ["HEAD_SHA"],
        "raw_diff": "",
        "hunks": [],
        "findings": [],
        "tokens_used": 0,
        "budget_exceeded": False,
    })
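
The graph above is strictly linear. If you want to use the conditional edges mentioned earlier, one natural place is right after chunking: when nothing reviewable is left, skip the model entirely. The router below is a sketch under that assumption. route_after_chunking is a hypothetical name, and the wiring is shown as a comment so the snippet runs without LangGraph installed.

```python
from typing import Any


def route_after_chunking(state: dict[str, Any]) -> str:
    """Return the name of the next node based on what chunk_hunks produced."""
    if not state.get("hunks"):
        return "post_review"    # nothing reviewable: skip analyze and dedupe
    return "analyze"


# Hypothetical wiring inside build_graph(), replacing the fixed
# chunk_hunks -> analyze edge:
#
#   g.add_conditional_edges(
#       "chunk_hunks",
#       route_after_chunking,
#       {"analyze": "analyze", "post_review": "post_review"},
#   )

print(route_after_chunking({"hunks": []}))                  # post_review
print(route_after_chunking({"hunks": [{"file": "a.py"}]}))  # analyze
```

Keeping the router a plain function of state makes it easy to unit test without compiling the graph.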

The GitHub Actions workflow

# .github/workflows/ai-review.yml
name: AI Reviewer Agent
on:
  pull_request:
    types: [opened, synchronize, reopened]

permissions:
  contents: read
  pull-requests: write   # required to post reviews

concurrency:
  group: ai-review-${{ github.event.pull_request.number }}
  cancel-in-progress: true   # newest commit wins, kill stale runs

jobs:
  review:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e .
      - name: Run reviewer
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_REPOSITORY: ${{ github.repository }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: python graph.py

concurrency with cancel-in-progress is the detail people miss. Push three commits in a minute and without it you get three overlapping reviews fighting over the same PR. See the GitHub Actions concurrency docs for the exact semantics.

Common Pitfalls

Feeding the whole file instead of the hunk. Context window cost scales with what you send. A hunk is usually 5-40 lines; a file can be thousands. Send the hunk plus its header.

Forgetting pull-requests: write. Depending on repository and organization settings, the default GITHUB_TOKEN may be read-only for PRs. Without the explicit permission the create_review call fails with a 403 that looks like an auth bug.

Letting the agent approve PRs. An APPROVE event from a bot defeats branch protection that requires human approval. Cap the agent at COMMENT and REQUEST_CHANGES.

No token budget. A massive refactor PR will happily consume thousands of tokens per hunk. The TOKEN_BUDGET check stops the bleed and posts whatever it found so far.

Non-zero temperature. A reviewer that flags different lines on every re-run erodes trust fast. Pin temperature=0.
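
On the token budget pitfall: the check in analyze only reacts after tokens have been spent. A cheap pre-flight estimate can set oversized hunks aside before any call is made. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption and not a tokenizer, and both helper names are hypothetical. It estimates input size only, so the in-loop check on reported usage remains the real guard.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about four characters per token.

    Assumption only; real tokenizers vary by model and by content."""
    return max(1, len(text) // 4)


def take_within_budget(hunks: list[dict], budget: int) -> tuple[list[dict], list[dict]]:
    """Keep hunks, in order, whose estimated cost still fits the budget;
    hunks that would overshoot are set aside instead of being sent."""
    kept, skipped, spent = [], [], 0
    for h in hunks:
        cost = estimate_tokens(h["body"])
        if spent + cost > budget:
            skipped.append(h)
            continue
        kept.append(h)
        spent += cost
    return kept, skipped


hunks = [
    {"file": "a.py", "body": "x" * 400},    # ~100 tokens
    {"file": "b.py", "body": "y" * 4000},   # ~1000 tokens
    {"file": "c.py", "body": "z" * 200},    # ~50 tokens
]
kept, skipped = take_within_budget(hunks, budget=200)
print([h["file"] for h in kept], [h["file"] for h in skipped])
# ['a.py', 'c.py'] ['b.py']
```

Listing the skipped files in the review summary tells the human reviewer which parts of the diff the agent never looked at.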

Troubleshooting

Symptom: review posts but every comment is on line 1. Cause: the model returned line numbers from the LEFT (old) side of the diff. Fix: the system prompt already says RIGHT side — also pass hunk.target_start so the model anchors correctly, and validate line >= hunk["start_line"] before posting.

Symptom: job fails with HTTP 422 on create_review. Cause: a comment references a line not present in the unified diff. Fix: the 422 handler in post_review reposts as an issue comment. If it happens constantly, your chunker is including context-only hunks — re-check the is_added filter.

Symptom: same finding appears twice. Cause: overlapping hunks in a stacked or rebased PR. Fix: confirm dedupe_findings runs before post_review and that the key includes category.

Symptom: agent flags generated files. Cause: SKIP_PATTERNS is missing your generator’s output. Fix: add the path suffix, or check for a @generated marker in the file header.

Symptom: random 529 Overloaded failures. Cause: API backpressure. Fix: the tenacity retry with exponential backoff already handles this; if it still fails after three tries the hunk is skipped with a ::warning:: rather than failing the build.
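
For the generated-files symptom, a marker check might look like the sketch below. looks_generated is a hypothetical helper, and the marker list is an assumption: @generated comes from the text above, while "DO NOT EDIT" is a convention some generators use. A hunk rarely includes the top of the file, so the caller would need to fetch the first lines of the file separately.

```python
GENERATED_MARKERS = ("@generated", "DO NOT EDIT")   # assumed conventions


def looks_generated(file_head: str, max_lines: int = 10) -> bool:
    """True if any of the first `max_lines` lines carries a generated-file marker."""
    head = file_head.splitlines()[:max_lines]
    return any(marker in line for line in head for marker in GENERATED_MARKERS)


print(looks_generated("// @generated by protoc\npackage api\n"))   # True
print(looks_generated("def handler():\n    return 200\n"))         # False
```

Limiting the scan to the first few lines keeps a file that merely mentions the marker deep in its body from being skipped.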

Wrapping Up

An AI reviewer agent is most valuable when it’s narrow: structured diffs in, JSON findings out, inline comments anchored to real lines. LangGraph 0.3 gives you the retries and budget seams that a single prompt can’t, and GitHub Actions makes it a zero-friction part of every PR. Next, wire the same findings into a metrics sink so you can measure precision over time and tune the system prompt against real false positives.
