SAST in 2024, Semgrep and AI Triage for Real Codebases
October 21, 2024 · 8 min read · by Muhammad Amal programming

TL;DR — Semgrep 1.90 alone produces too much noise on large repos. AI triage cuts the queue but only if you give it real context. Treat the triage layer as a filter, not an oracle.

Static analysis tools have always struggled with the same trade-off: precise enough to be useful, or fast enough to run in CI, pick one. Semgrep 1.90 is the best balance I’ve found, and it’s what I default to. The problem isn’t Semgrep. It’s that any analyzer that scans a multi-million-line codebase produces findings faster than humans can review them, and that’s where most security programs lose momentum.

I’ve spent the last several months wiring claude-3.5-sonnet into a Semgrep triage pipeline. The results have been good enough that I’d recommend the pattern to anyone with more than a few hundred open findings. But the path to “good enough” had more potholes than I expected. This post is the honest version.

If you’re new to Semgrep, the official rule writing docs are excellent. I’ll assume some familiarity with the basics and focus on the operational layer.

What Semgrep Is Good At, And What It Isn’t

Semgrep’s killer feature is the pattern language. Writing a rule for “any call to eval with a parameter that flowed from request.body” is a few lines, runs fast, and works across languages with minor adjustments. For a security engineer who can spend an afternoon writing rules tuned to your codebase’s idioms, the ROI is hard to beat.

Where it falls short:

  • Cross-function taint without aggressive tuning. Interprocedural analysis works, but you have to configure it and the false-positive rate climbs.
  • Framework-specific routes. A controller annotated with @Get('/api/v1/users') exposes an endpoint, but Semgrep doesn’t know that out of the box. You need custom rules that mark those handlers as taint sources.
  • High-level business logic flaws. Authorization gaps, IDORs, race conditions in payment flows. SAST is not the tool. Don’t try.

The realistic posture: Semgrep catches the well-known patterns reliably (SSRF, hardcoded secrets, command injection, deserialization), produces a long tail of probably-not-exploitable findings, and lets you encode codebase-specific patterns when you find recurring issues.

The Triage Problem

On a 2M-line codebase with the default registry rules, my last scan returned ~3,400 findings. Triaging them manually at five minutes apiece works out to roughly 280 hours of engineering time (3,400 × 5 ÷ 60). Not happening.

The traditional answer is rule tuning. Disable noisy rules, add path exclusions in .semgrepignore, suppress entire classes. This works but tends to disable things that occasionally do catch real bugs, and you can’t easily distinguish between “this rule is always wrong here” and “this rule is right one time in fifty.”

The new answer is AI triage. The idea is simple: for each finding, the LLM gets the Semgrep finding, the surrounding code, and a clear prompt, and returns a structured judgment. The model is far from perfect, but it’s much better than nothing and much faster than human review.
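To make the inputs concrete, here is a minimal sketch of turning one entry from Semgrep’s JSON output into the fields a triage prompt needs. The check_id, path, start.line and extra.message keys come from Semgrep’s --json results; the helper names, the output keys, and the 50-line radius default are my own choices for illustration.

```python
def code_window(source: str, line_number: int, radius: int = 50) -> str:
    """Return up to `radius` lines before and after a 1-indexed line."""
    lines = source.splitlines()
    start = max(0, line_number - 1 - radius)       # 0-indexed, inclusive
    end = min(len(lines), line_number + radius)    # 0-indexed, exclusive
    return "\n".join(lines[start:end])


def triage_inputs(result: dict, source: str, project_blurb: str) -> dict:
    """Map one entry of Semgrep's JSON `results` array to prompt fields."""
    line = result["start"]["line"]
    return {
        "rule_id": result["check_id"],
        "rule_message": result["extra"]["message"],
        "file_path": result["path"],
        "line_number": line,
        "code_context": code_window(source, line),
        "project_blurb": project_blurb,
    }
```

The window is clamped at both ends of the file, so a finding on line 3 simply gets less leading context instead of raising an error.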

A Triage Prompt That Behaves

Here’s the prompt structure I landed on, after a lot of iteration:

You are a security engineer triaging a SAST finding. You must classify
it into one of: TRUE_POSITIVE, LIKELY_FALSE_POSITIVE, NEEDS_HUMAN.

Output ONLY a JSON object with this schema:
{
  "verdict": "TRUE_POSITIVE" | "LIKELY_FALSE_POSITIVE" | "NEEDS_HUMAN",
  "confidence": 0.0..1.0,
  "reasoning": "string, <= 280 chars",
  "suggested_fix": "string or null"
}

Tool: Semgrep 1.90
Rule: {rule_id}
Rule description: {rule_message}
File: {file_path}
Line: {line_number}

Code context (50 lines before, 50 lines after the finding):

{code_context}


Project context: {project_blurb}

Rules for your judgment:
- Mark TRUE_POSITIVE only if you can name the exploitable input and reach.
- Mark LIKELY_FALSE_POSITIVE only with concrete reasoning (e.g., input is
  validated upstream, parameter is a compile-time constant).
- When in doubt, NEEDS_HUMAN. Do not guess.

The schema constraint is doing real work. Without it, the model produces prose that’s hard to aggregate. The “do not guess” instruction reduces wrong verdicts in both directions; the model is more willing to admit uncertainty when given an explicit escape hatch.
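The schema only helps if you enforce it on the way back in. A minimal sketch of a parser, assuming the model’s reply arrives as a raw string: anything that fails validation is downgraded to NEEDS_HUMAN, so a malformed reply can never dismiss a finding. The function names are illustrative, not part of any library.

```python
import json

VERDICTS = {"TRUE_POSITIVE", "LIKELY_FALSE_POSITIVE", "NEEDS_HUMAN"}


def needs_human(reason: str) -> dict:
    """Fallback verdict used whenever the model output cannot be trusted."""
    return {"verdict": "NEEDS_HUMAN", "confidence": 0.0,
            "reasoning": reason, "suggested_fix": None}


def parse_verdict(raw: str) -> dict:
    """Validate the model's JSON reply against the prompt's schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return needs_human("model output was not valid JSON")
    if not isinstance(data, dict):
        return needs_human("model output was not a JSON object")
    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in VERDICTS:
        return needs_human("missing or unknown verdict")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (isinstance(confidence, bool)
            or not isinstance(confidence, (int, float))
            or not 0.0 <= confidence <= 1.0):
        return needs_human("confidence missing or out of range")
    return {
        "verdict": verdict,
        "confidence": float(confidence),
        "reasoning": str(data.get("reasoning", ""))[:280],
        "suggested_fix": data.get("suggested_fix"),
    }
```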

I run this with claude-3.5-sonnet. It’s cheaper than o1 for batch triage and the quality is high enough that the marginal improvement from a reasoning model doesn’t justify the cost at this scale.

Calibration, The Step Everyone Skips

Before trusting any of this in production, you have to calibrate. Get a labeled corpus of 100 findings: 50 you’ve personally confirmed are true positives, 50 you’ve confirmed are false. Run the triage prompt against all 100. Look at the confusion matrix.

What you’re checking:

  • False negatives on true positives. The model dismissing real bugs. This is the failure mode that matters most. If you see any, your prompt isn’t conservative enough.
  • Noise that survives triage. The model keeping known false positives in the queue as TRUE_POSITIVE or NEEDS_HUMAN. Less bad, but if it’s rampant, triage isn’t actually saving time.
  • Confidence calibration. Are the high-confidence verdicts more often correct? If confidence is uncorrelated with accuracy, the model isn’t really triaging, it’s guessing.
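The three checks above can be computed from the labeled corpus in a few lines. This is a sketch under my own assumptions: each record is a (is_real_bug, verdict, confidence) tuple, “dismissed” means a LIKELY_FALSE_POSITIVE verdict, and 0.8 is an arbitrary cut-off between high and low confidence.

```python
from collections import defaultdict


def calibration_report(records, high_confidence=0.8):
    """Summarize triage quality on a hand-labeled corpus.

    records: iterable of (is_real_bug: bool, verdict: str, confidence: float).
    """
    real = noise = dismissed_real = dismissed_noise = 0
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for is_real, verdict, confidence in records:
        dismissed = verdict == "LIKELY_FALSE_POSITIVE"
        if is_real:
            real += 1
            dismissed_real += dismissed
        else:
            noise += 1
            dismissed_noise += dismissed
        # Accuracy only makes sense for decisive verdicts.
        if verdict != "NEEDS_HUMAN":
            bucket = "high" if confidence >= high_confidence else "low"
            correct = (verdict == "TRUE_POSITIVE") == is_real
            buckets[bucket][0] += correct
            buckets[bucket][1] += 1
    return {
        "dismissed_real_bugs": dismissed_real,
        "recall_in_review_queue": (real - dismissed_real) / real if real else None,
        "noise_filtered_out": dismissed_noise / noise if noise else None,
        "accuracy_by_confidence": {b: c / t for b, (c, t) in buckets.items()},
    }
```

If accuracy in the "high" bucket is not clearly better than in the "low" bucket, the confidence field is not carrying information and should not drive any gating decision.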

My current calibration on internal-ish corpora: ~95% recall on true positives when the TRUE_POSITIVE and NEEDS_HUMAN verdicts are combined. That means the queue I actually review still contains roughly 19 out of every 20 real bugs, but it’s 5 to 8 times shorter than the raw Semgrep output.

The 5% the model dismisses incorrectly is real risk. I mitigate by also running a secondary check: for any rule where historical true-positive rate is above some threshold, force NEEDS_HUMAN regardless of model verdict.
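A sketch of that secondary check, with assumptions spelled out: rule_tp_rate is a hypothetical mapping from rule ID to its historical true-positive rate, the 0.3 threshold is a placeholder, and the verdict dict is assumed to carry the finding’s rule_id. This version only overrides dismissals, on the assumption that a TRUE_POSITIVE should still be allowed to fail the build; drop that condition to force NEEDS_HUMAN unconditionally.

```python
def apply_rule_override(verdict: dict, rule_tp_rate: dict,
                        threshold: float = 0.3) -> dict:
    """Escalate dismissals for rules that have historically found real bugs.

    verdict:      triage result, assumed to include a "rule_id" key.
    rule_tp_rate: hypothetical {rule_id: historical true-positive rate}.
    threshold:    placeholder value; tune it against your own history.
    """
    rate = rule_tp_rate.get(verdict["rule_id"], 0.0)
    if verdict["verdict"] == "LIKELY_FALSE_POSITIVE" and rate >= threshold:
        # Return a copy so the model's original verdict stays auditable.
        return {**verdict, "verdict": "NEEDS_HUMAN",
                "model_verdict": verdict["verdict"]}
    return verdict
```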

Wiring It Into CI

The naive setup is to run Semgrep on every PR, then triage the diff. That works for PR-scoped scanning but doesn’t handle the backlog of pre-existing findings.

What I actually run:

  1. PR job: Semgrep diff scan + triage on new findings only. Fails the PR if any verdict is TRUE_POSITIVE with confidence above 0.7.
  2. Nightly full scan: Semgrep against the whole repo, triage everything, write to a database, update a dashboard. Slow, but offline.
  3. Weekly review: human walks the NEEDS_HUMAN queue, sorted by severity and rule reliability score.

# .github/workflows/semgrep-pr.yml (excerpt)
- name: Semgrep diff scan
  run: |
    semgrep --config=auto --baseline-commit=${{ github.event.pull_request.base.sha }} \
            --json --output=findings.json --error
  continue-on-error: true

- name: AI triage
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: python triage.py findings.json > triage.json

- name: Gate PR
  run: python gate.py triage.json --fail-on=TRUE_POSITIVE --min-confidence=0.7

The gate.py step is what actually fails the build. The split between triage and gating matters; it lets you change the gating policy without touching the triage logic.
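For reference, a minimal sketch of what a gate.py along these lines could look like. It assumes triage.json is a JSON list of verdict objects that also carry file_path, line_number and rule_id; that layout is my assumption, as is treating --min-confidence as an inclusive lower bound.

```python
import argparse
import json
import sys


def blocking(findings, fail_on, min_confidence):
    """Return the findings whose verdict and confidence should fail the build."""
    return [
        f for f in findings
        if f.get("verdict") == fail_on
        and f.get("confidence", 0.0) >= min_confidence
    ]


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail a PR on triaged findings")
    parser.add_argument("triage_file")
    parser.add_argument("--fail-on", default="TRUE_POSITIVE")
    parser.add_argument("--min-confidence", type=float, default=0.7)
    args = parser.parse_args(argv)

    with open(args.triage_file, encoding="utf-8") as fh:
        findings = json.load(fh)

    hits = blocking(findings, args.fail_on, args.min_confidence)
    for f in hits:
        # Print one line per blocking finding so the CI log explains the failure.
        print(f"{f.get('file_path')}:{f.get('line_number')} "
              f"{f.get('rule_id')}: {f.get('reasoning')}", file=sys.stderr)
    return 1 if hits else 0
```

Calling sys.exit(main()) under the usual __main__ guard turns this into the script the workflow invokes. Because the policy lives entirely in the two flags, tightening or loosening the gate is a one-line change to the workflow file.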

Custom Rules, The Force Multiplier

The triage layer makes Semgrep usable, but custom rules are what make it valuable. Every time a real bug is found in code review or post-incident, write a Semgrep rule for the pattern. Six months in, your custom-rule corpus catches more genuine issues than the registry rules do, because it encodes your codebase’s specific failure modes.

A worked example: we had two incidents in a quarter where a developer called an internal HTTP client with a URL that wasn’t validated against the allowlist. The rule:

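rules:
  - id: internal-http-client-unvalidated-url
    message: HttpClient call with non-allowlisted URL
    severity: WARNING
    languages: [python]
    # pattern-not is only valid inside a `patterns` block, so the
    # positive and negative patterns are combined here.
    patterns:
      - pattern-either:
          - patterns:
              # Client created earlier in the same scope
              - pattern-inside: |
                  $CLIENT = HttpClient(...)
                  ...
              - pattern: $CLIENT.request($URL, ...)
          # Client created and used in one expression
          - pattern: HttpClient(...).request($URL, ...)
      # Skip calls whose URL argument is wrapped in validate_url(...)
      - pattern-not: $X.request(validate_url(...), ...)
    metadata:
      category: security
      cwe: CWE-918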

This kind of pattern is too codebase-specific to ever appear in the public registry. It’s also dead simple to write once you know what to look for. The triage layer handles its findings well and rarely false-positives on them.
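A rule like this deserves a test fixture next to it. The sketch below uses Semgrep’s test annotations (a ruleid: comment on the line before an expected match, ok: before a line that must not match) and assumes the rule reports on the request(...) call itself. HttpClient and validate_url here are stand-in stubs, and the allowlisted host is made up; the real ones live in your codebase.

```python
from urllib.parse import urlparse

# Hypothetical allowlist, for illustration only.
ALLOWED_HOSTS = {"billing.internal.example"}


class HttpClient:
    """Stub standing in for the internal HTTP client; makes no network calls."""

    def request(self, url, **kwargs):
        return ("GET", url)


def validate_url(url):
    """Stub validator: return the URL only if its host is allowlisted."""
    if urlparse(url).hostname not in ALLOWED_HOSTS:
        raise ValueError(f"URL not allowlisted: {url}")
    return url


def fetch_unvalidated(url):
    client = HttpClient()
    # ruleid: internal-http-client-unvalidated-url
    return client.request(url)


def fetch_validated(url):
    client = HttpClient()
    # ok: internal-http-client-unvalidated-url
    return client.request(validate_url(url))
```

Running Semgrep in test mode against the rule and this file then checks both directions at once: the unvalidated call must be reported and the validated one must not.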

I touched on a related pattern around AI-assisted authoring in AI assisted detection rules, Sigma and YARA in 2024. Same lesson there: the model is a great first draft, but the domain expertise has to come from you.

Gotchas

  • Cost runaway. Triage with frontier models adds up. A nightly run against 3,000 findings at a few thousand tokens each is real money. Cache by (rule_id, file_hash, line_hash) so unchanged findings don’t re-triage.
  • Context window limits on huge files. A 50-line window is usually fine; for some findings (deserialization in a 4,000-line legacy module) you need more. Have a strategy for chunking that doesn’t lose the surrounding control flow.
  • Generated code. Semgrep happily finds “bugs” in generated protobuf or OpenAPI client code. Exclude generated paths or you’ll triage the same false positives every night.
  • Rule version drift. Semgrep registry rules change. A finding that was true positive last month might be a different rule today. Pin your ruleset to a specific revision in CI and bump intentionally.
  • The model agrees too easily. If you write “this code is in a test, it’s probably fine” in your prompt, the model will dutifully classify everything in tests as false positive, including a real bug in a test that runs against production data. Be careful what context you feed in.
  • Suppression abuse. Once you allow // nosemgrep comments, developers will use them. Require a code review with a security label for any new suppression, and audit them quarterly.
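For the cost-runaway item, a minimal sketch of the (rule_id, file_hash, line_hash) cache key. The in-memory dict is a placeholder for whatever store the nightly job actually writes to, and the class and function names are illustrative.

```python
import hashlib


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(rule_id: str, file_text: str, line_number: int) -> tuple:
    """Build (rule_id, file_hash, line_hash) for a finding on a 1-indexed line."""
    lines = file_text.splitlines()
    in_range = 1 <= line_number <= len(lines)
    line_text = lines[line_number - 1].strip() if in_range else ""
    return (rule_id, _sha256(file_text), _sha256(line_text))


class TriageCache:
    """In-memory stand-in for a persistent verdict cache."""

    def __init__(self):
        self._store = {}

    def get_or_triage(self, key, triage_fn):
        # Only call the (expensive) model when the key has not been seen.
        if key not in self._store:
            self._store[key] = triage_fn()
        return self._store[key]
```

Hashing the whole file is deliberately conservative: any edit to the file invalidates its cached verdicts, which costs some re-triage but avoids reusing a verdict after the surrounding context changed.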

Wrapping Up

SAST has had a reputation problem for years because the tools produced more output than anyone could consume. Semgrep 1.90 plus an AI triage layer is the first combination I’ve used that genuinely closes that loop. The output is short enough that engineers actually read it. The signal-to-noise is high enough that they don’t ignore it.

The work isn’t done after you wire it up. The maintenance is in the corpus of custom rules, the triage prompt calibration, and the discipline to keep the false-negative rate honest. Done well, SAST goes from “the noise we ignore” to “the cheap insurance we depend on.” That transition is worth the effort.