AI Assisted Detection Rules, Sigma and YARA in 2024
October 9, 2024 · 7 min read · by Muhammad Amal programming

TL;DR — LLMs are great at drafting Sigma and YARA rules from incident notes. They are mediocre at correctness and lousy at false-positive intuition. Treat output as a first draft that always gets tested against a corpus.

I run a small detection engineering practice on the side. Over the past year I’ve moved from “let me hand-write this Sigma rule from scratch” to “let me describe the behavior to claude-3.5-sonnet and iterate on what it produces.” The productivity gain is real. The hidden cost, if you’re not careful, is rules that look reasonable, ship cleanly, and quietly miss the thing they were supposed to catch.

This post is what I wish someone had written a year ago. Concrete prompts, concrete review steps, and concrete failure modes. The goal isn’t to convince you to use AI for detection work; that ship has sailed and the productivity case is obvious. The goal is to make sure the rules you ship actually fire on what they claim to fire on.

I’m going to use Sigma for the log-analytics examples and YARA for binary/file content. The patterns generalize.

Where LLMs Earn Their Keep

The honest framing: large language models excel at the boring, structural parts of detection engineering. They are passable at semantic correctness. They are unreliable at operational tuning.

Tasks where I let the model take the lead:

  • Translating a threat report’s narrative description into a draft Sigma rule with the right logsource, fields, and modifiers.
  • Converting a rule between backends (Splunk SPL, KQL, Elastic Lucene) via sigma-cli and then checking the result reads sensibly.
  • Generating YARA strings from a list of IoCs in an analyst’s notes, including the wide/ascii variants you’d otherwise forget.
  • Drafting the description, references, tags, and MITRE mappings that nobody likes writing.

Tasks where I do not trust the model alone:

  • Choosing whether selection_a and not filter_b is the right shape, or whether the filter belongs inside selection_a.
  • Estimating false-positive rates. Models will confidently produce rules that match every PowerShell invocation in your fleet.
  • Crafting regexes for binary content. YARA’s regex engine is PCRE-like but not PCRE, which makes it a minefield, and the model will happily write a pattern that doesn’t compile, or worse, compiles and matches nothing.

A Workflow That Survives Contact With Production

Here’s the loop I run for new Sigma rules. It’s longer than “ask the model to write it,” but it’s the difference between a rule that fires and a rule that pretends to.

  1. Write a one-paragraph behavioral description. What did the attacker do, in plain English, citing log fields.
  2. Provide the model 2 to 3 sample log events from your SIEM, redacted for privacy. JSON or whatever shape your platform emits.
  3. Ask for a Sigma rule plus a separate list of fields it relied on and the reasoning for each condition clause.
  4. Run the rule through sigma-cli to convert to your query language.
  5. Run that query against a known-good corpus to measure baseline false positive rate.
  6. Run it against a known-bad corpus or an emulation harness like Atomic Red Team.
  7. Iterate on the prompt with the failures.

Steps 5 and 6 are the ones nobody wants to do. They are also the only steps that matter.
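
To make steps 5 and 6 concrete, here is a minimal sketch of the numbers I would want out of them. The `CorpusResult` type, the `passes_gate` helper, and both default thresholds are hypothetical names and values chosen for illustration, not part of any Sigma tooling; pick thresholds that fit your own alert budget.

```python
from dataclasses import dataclass


@dataclass
class CorpusResult:
    """Hit counts from running one converted rule against both corpora."""
    benign_total: int   # events in the known-good corpus
    benign_hits: int    # of those, how many the rule matched
    attack_total: int   # events or traces in the known-bad corpus
    attack_hits: int    # of those, how many the rule matched

    @property
    def false_positive_rate(self) -> float:
        return self.benign_hits / self.benign_total if self.benign_total else 0.0

    @property
    def detection_rate(self) -> float:
        return self.attack_hits / self.attack_total if self.attack_total else 0.0


def passes_gate(result: CorpusResult,
                max_fpr: float = 0.001,        # assumed threshold
                min_detection: float = 0.9) -> bool:  # assumed threshold
    """True when the rule is quiet enough on benign data and still catches the attack."""
    return (result.false_positive_rate <= max_fpr
            and result.detection_rate >= min_detection)
```

A rule with 2 hits in 5,000 benign events and 19 hits in 20 attack traces clears these defaults; the same rule with 50 benign hits does not.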

# Example draft produced from a prompt about suspicious cmd /c chain
title: Suspicious Cmd Chained To PowerShell With Encoded Args
id: 9f4a1c7e-1b2c-4f4d-9b6a-1c2e3f4a5b6c
status: experimental
description: Detects cmd.exe spawning powershell.exe with -EncodedCommand
logsource:
  category: process_creation
  product: windows
detection:
  selection:
    Image|endswith: '\powershell.exe'
    ParentImage|endswith: '\cmd.exe'
    CommandLine|contains:
      - '-EncodedCommand'
      - '-enc '
      - '-e '
  condition: selection
falsepositives:
  - Legitimate admin scripting frameworks
level: medium

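That draft looks fine. It’s also wrong in subtle ways. The '-e ' entry is a bare substring match, so it fires on any command line where some argument happens to end in -e, such as a path or a script’s own parameter. The list also misses other abbreviations PowerShell accepts for the same flag, such as -ec. And while Sigma string matching is case-insensitive by specification, whether the converted query stays case-insensitive depends on your backend and normalization pipeline, so -ENC can slip past '-enc '. The model didn’t know which normalization your pipeline uses. You do.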
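
A small Python sketch of the difference between substring matching and token matching, useful for sanity-checking a corpus before deciding how to tighten the rule. Assumptions: `naive_match` mimics how a case-sensitive backend would evaluate the draft's three strings, and `token_match` assumes PowerShell accepts any prefix of `-EncodedCommand` plus the `-ec` alias. Both function names are hypothetical, and this is not how sigma-cli evaluates rules.

```python
_FULL = "encodedcommand"
# Every prefix "e", "en", ... "encodedcommand", plus the "ec" alias (assumed set).
_PREFIXES = {_FULL[:i] for i in range(1, len(_FULL) + 1)} | {"ec"}


def naive_match(cmdline: str) -> bool:
    """Case-sensitive substring check, as a strict backend might run the draft."""
    return any(s in cmdline for s in ("-EncodedCommand", "-enc ", "-e "))


def token_match(cmdline: str) -> bool:
    """Match only whole whitespace-delimited tokens that look like the flag."""
    for token in cmdline.split():
        if token[0] in "-/" and token[1:].lower() in _PREFIXES:
            return True
    return False
```

On `powershell.exe -ENC SQBFAFgA` the naive check misses and the token check hits; on a command line whose path argument ends in `-e` it is the other way round. The token check still fires on a script that defines its own `-e` parameter, so it reduces noise rather than eliminating it.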

Evaluating LLM-Generated Rules

I run every AI-drafted rule through a small harness before merging. It does three things:

  1. Compile the rule (sigma convert -t splunk from sigma-cli for Sigma, yarac for YARA). Syntax errors fail loud.
  2. Run against a labeled corpus. I keep around a few thousand benign events and a smaller set of attack traces from Atomic Red Team and CALDERA executions.
  3. Diff the AST against a previously approved version if this is an update, to catch silent semantic drift.
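
For the drift check in step 3, a shallow comparison of the two `detection` mappings already catches most of what I care about. This sketch assumes both rules have been parsed into plain dicts by whatever YAML loader you use; `detection_drift` is a hypothetical helper, and it is a key-level diff rather than a true AST diff.

```python
def detection_drift(old: dict, new: dict) -> dict:
    """Report which top-level detection entries were added, removed, or changed."""
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        # Present in both versions but with different content.
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }
```

If an update quietly adds a `filter` block and rewrites `condition`, both show up in the report, which is usually enough to trigger a second look.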

For YARA rules the harness extends to a perf check. A rule whose only condition is $mz at 0 matches every PE file it is pointed at, and a rule built on very short strings or unanchored regexes slows down every scan on the endpoint. The model has no concept of scan cost. You have to enforce it.

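# Minimal eval loop
# -p 4: four scan threads, -w: suppress warnings, -s: print the matched strings
# (add -r if the corpus has subdirectories)
yara -p 4 -w -s candidate.yar ./corpus/ > matches.txt
# score.py is my own script; its flags are not part of YARA
python3 score.py --rule candidate.yar \
                 --benign-matches matches.txt \
                 --baseline ./rules/approved.yar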

The score.py is whatever you want it to be. Mine produces a markdown report with match counts, baseline drift, and a flag if the rule’s average scan time per file crossed a threshold.
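
As a starting point, the match-counting part can be very small. This is a sketch, not my actual script: it assumes the default `yara` output format of one `rule_name file_path` line per match, with the `-s` string details on following lines that start with a hex offset. `count_matches` is a hypothetical name.

```python
from collections import Counter


def count_matches(yara_output: str) -> Counter:
    """Count matched files per rule from captured yara stdout."""
    counts: Counter = Counter()
    for line in yara_output.splitlines():
        line = line.strip()
        if not line or line.startswith("0x"):
            continue  # blank line, or a string-detail line emitted by -s
        rule_name, _, path = line.partition(" ")
        if path:  # ignore anything that is not "rule path"
            counts[rule_name] += 1
    return counts
```

Feeding it the contents of matches.txt gives a per-rule hit count that can be compared against the same count for the approved baseline.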

A Concrete Prompt I Use

The prompt matters more than the model. Here’s a stripped-down version of what I send to claude-3.5-sonnet for new Sigma drafts:

You are drafting a Sigma rule. Output ONLY valid Sigma YAML in a single
fenced block, then a short bullet list explaining each detection clause.

Constraints:
- Use logsource categories from the official Sigma taxonomy.
- Prefer field-level modifiers (|contains, |endswith) over regex.
- Include falsepositives, tags (attack.* taxonomy), and level.
- Do not invent fields. If a field name is unclear, ask before producing the rule.

Behavioral description: {description}

Sample events:
{events}

The “do not invent fields” instruction is load-bearing. Without it the model will sometimes synthesize a plausible-sounding field name (ProcessIntegrityLevel, CommandLineNormalized) that does not exist in your data. The “ask before producing” escape hatch turns out to be effective; models do use it when given permission.
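
I don't rely on the prompt alone for this. A cheap post-check is to compare the field names the rule uses against the keys that actually appear in the sample events. The sketch below assumes the rule's `detection` section and the events are already parsed into dicts, and that events are flat (no nested objects); `rule_fields` and `invented_fields` are hypothetical helpers.

```python
def rule_fields(detection: dict) -> set[str]:
    """Collect field names from a Sigma detection mapping, stripping modifiers."""
    fields: set[str] = set()
    for name, block in detection.items():
        if name == "condition":
            continue
        # A selection may be one mapping or a list of mappings.
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict):
                for key in item:
                    fields.add(key.split("|", 1)[0])  # "Image|endswith" -> "Image"
    return fields


def invented_fields(detection: dict, sample_events: list[dict]) -> set[str]:
    """Fields the rule references that never appear in any sample event."""
    seen: set[str] = set()
    for event in sample_events:
        seen.update(event)
    return rule_fields(detection) - seen
```

An empty result does not prove the rule is right, but a non-empty one is a reliable signal that the model made something up or that the samples were too narrow.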

For the YARA equivalent, swap in constraints about string anchoring, the condition block, and an explicit “no overly broad strings shorter than 4 bytes unless justified” rule.

Where I’d Push Harder

Tooling is moving fast. Some directions I think are underexplored:

  • Counterfactual generation. Ask the model to produce three variants of the malicious behavior the rule should still catch, then test the rule against synthesized log lines for each. This catches over-fitted rules.
  • Cross-validation against existing rule corpora. Before publishing, embed the new rule and find the closest existing rules in your repo. Either it’s a duplicate or it conflicts with something. Both worth knowing.
  • AI-assisted triage of alerts the rule fires. Once the rule is live, route low-confidence alerts through a secondary LLM step that pulls related context and produces a one-paragraph analyst summary.
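
For the cross-validation idea, you can get a rough first pass without an embedding model. The sketch below swaps embeddings for plain token-set Jaccard similarity over the rule text, which is a much cruder technique but needs no dependencies. `closest_rules` and the in-memory `repo` mapping of rule name to rule text are assumptions for illustration.

```python
import re


def tokens(rule_text: str) -> set[str]:
    """Lowercased word tokens from raw rule text."""
    return set(re.findall(r"\w+", rule_text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_rules(candidate: str, repo: dict[str, str],
                  top_n: int = 3) -> list[tuple[str, float]]:
    """Return the top_n most similar existing rules, highest score first."""
    cand = tokens(candidate)
    scored = [(name, jaccard(cand, tokens(text))) for name, text in repo.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Anything that scores high deserves a manual look for duplication or a conflicting filter before the new rule is merged.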

The triage angle pairs well with the broader pattern I wrote about in SAST in 2024, Semgrep and AI triage for real codebases. Different domain, same shape: AI for the first 70% of the work, humans for the 30% that actually requires judgment.

For folks new to Sigma, the canonical reference is the SigmaHQ repository, which has the schema spec and a large rule corpus you can use as in-context examples.

Gotchas

Things I’ve been burned by:

  • Field name drift across vendors. A Sigma rule that’s correct against the official taxonomy may not match how your SIEM normalizes fields. Always test against your own corpus, not the schema.
  • YARA false negatives from PE specifics. Asking the model to detect a specific malware family by strings often produces rules that miss packed variants. Pair string-based detection with behavioral indicators.
  • MITRE ATT&CK tag hallucination. The model will sometimes invent technique IDs that don’t exist or have been deprecated. Validate every tag against MITRE’s published ATT&CK JSON before merging.
  • Over-constraining with timestamps. I once had a rule that included an EventTime predicate the model invented based on the sample I provided. It worked perfectly in tests and fired exactly zero times in production.
  • Encoded payload variants. Models tend to forget about case-insensitivity, alternate spellings, and Unicode normalization. Add these explicitly to your prompt template.
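
The ATT&CK tag problem in particular is easy to automate. A minimal sketch, assuming you have already extracted the set of valid technique IDs from MITRE's published data; loading that data is left out, and the tiny `known_ids` set in the usage note is hand-written for illustration only. `unknown_technique_tags` is a hypothetical helper.

```python
import re

# Matches technique tags such as attack.t1059 or attack.t1059.001.
# Tactic tags like attack.execution do not match and are ignored.
_TECHNIQUE_TAG = re.compile(r"^attack\.(t\d{4}(?:\.\d{3})?)$", re.IGNORECASE)


def unknown_technique_tags(tags: list[str], known_ids: set[str]) -> list[str]:
    """Return the technique tags whose ID is not in known_ids."""
    unknown = []
    for tag in tags:
        match = _TECHNIQUE_TAG.match(tag)
        if match and match.group(1).upper() not in known_ids:
            unknown.append(tag)
    return unknown
```

With `known_ids = {"T1059.001", "T1027"}`, the tag list `["attack.execution", "attack.t1059.001", "attack.t9999"]` comes back as `["attack.t9999"]`, which is the one to reject in review.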

Wrapping Up

AI-assisted detection authoring is genuinely good. It collapses the time from “we saw this behavior” to “we have a rule.” What it does not do is replace the disciplined eval loop that detection engineering has always required. If anything, the eval loop becomes more important because the rules look more polished, which masks defects.

My next experiment is wiring the same workflow into a CI job so that proposed rule PRs get evaluated automatically against the corpus and the report shows up as a check. The friction of running the harness manually is the main reason it sometimes gets skipped. Remove the friction, and the discipline holds.