Auto Remediation for Cloud Security Findings
October 14, 2024 · 7 min read · by Muhammad Amal

TL;DR — Auto-remediation works when it’s narrow, reversible, and observable. Start with three findings, not thirty. Wire rollback before you wire fixes. Measure outcomes, not closure rates.

I’ve worked on cloud security tooling at two companies now, and the pattern around auto-remediation is almost always the same. Year one: nobody automates anything. Year two: someone proposes “let’s auto-fix everything.” Year three: production outage. Year four: everyone is afraid of auto-remediation and we’re back to year one.

There is a middle ground. It looks like a small number of very well-understood findings, fixed by code that has been reviewed like production code, with audit trails and rollback paths. This post is about how to design that middle ground for AWS Security Hub and GCP Security Command Center (SCC).

The mental shift I want you to make: stop thinking of auto-remediation as “fixing the finding.” Start thinking of it as “responding to an event with a well-defined action under a policy.” Same thing, different framing, totally different blast radius when something goes wrong.

Pick The Right Findings First

Not all findings are created equal. I sort them on two axes: how reversible the fix is, and how confident I am that the fix is correct.

Always-safe, fully reversible:

  • S3 buckets created with PublicAccessBlock disabled. Fix: enable it. Rollback: trivial.
  • Security groups with 0.0.0.0/0 on management ports (22, 3389). Fix: remove the rule. Rollback: re-add with a ticket.
  • GCS buckets with allUsers IAM bindings on objects that aren’t tagged as public. Fix: remove the binding.
  • IAM access keys older than 180 days for service accounts that have an alternative auth method. Fix: deactivate (not delete) the key.

Maybe-safe, reversible with care:

  • Open RDS snapshots. Fix: mark private. Rollback exists but might break a shared analytics workflow you didn’t know about.
  • Unused IAM roles. Fix: detach policies. This one’s a foot-gun; “unused” often means “used by a job that runs quarterly.”

Do not auto-remediate:

  • Anything that affects compute capacity directly. Stopping instances, scaling down. The blast radius is too big.
  • Anything that deletes data. Yes, even “obviously orphaned” data.
  • Anything in a production account during business hours unless there’s an active incident.

Start with the first category. Get six months of clean operation under your belt before touching the second.

A Two-Stage Pipeline

The architecture I keep landing on:

Finding -> Normalize -> Policy -> Action Plan -> Approval Gate -> Execute -> Verify -> Notify
                                                 (auto or human)

Each stage is independent. Each stage logs. Each stage can be paused.

The normalize step matters more than it looks. AWS Security Hub findings come in ASFF format; GCP SCC has its own schema; if you bring in other vendors you have more shapes. Normalize early to one internal representation so the policy engine doesn’t have to know about source-specific quirks.

from dataclasses import dataclass

@dataclass
class Finding:
    source: str               # "aws-sh" | "gcp-scc"
    finding_id: str
    resource_arn: str
    resource_type: str
    severity: str             # critical | high | medium | low
    rule_id: str              # e.g. "s3-bucket-public-access-prohibited"
    account_id: str
    region: str
    raw: dict                 # for audit
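
To make the normalize step concrete, here is a minimal sketch of an ASFF-to-internal mapper. The function name normalize_asff, the choice of GeneratorId as the rule identifier, and folding INFORMATIONAL into "low" are my assumptions for illustration, not something Security Hub prescribes. A GCP SCC mapper would be a sibling function that returns the same type.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    source: str
    finding_id: str
    resource_arn: str
    resource_type: str
    severity: str
    rule_id: str
    account_id: str
    region: str
    raw: dict


# Assumption: the internal model has four levels, so INFORMATIONAL maps to "low".
_ASFF_SEVERITY = {
    "CRITICAL": "critical",
    "HIGH": "high",
    "MEDIUM": "medium",
    "LOW": "low",
    "INFORMATIONAL": "low",
}


def normalize_asff(raw: dict) -> Finding:
    """Map one ASFF finding dict to the internal Finding shape."""
    resource = raw["Resources"][0]  # assumption: first resource is the target
    label = raw.get("Severity", {}).get("Label", "LOW")
    return Finding(
        source="aws-sh",
        finding_id=raw["Id"],
        resource_arn=resource["Id"],
        resource_type=resource["Type"],
        severity=_ASFF_SEVERITY.get(label, "low"),
        rule_id=raw["GeneratorId"],  # assumption: used as the rule identifier
        account_id=raw["AwsAccountId"],
        # Prefer the resource's own region over the finding-level region.
        region=resource.get("Region") or raw.get("Region", ""),
        raw=raw,  # kept untouched for audit
    )
```

Keeping the untouched source payload in raw means the policy engine never reads it, but the audit log still can.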

The policy engine maps (rule_id, severity, account_class) to either auto or manual. Account class is a tag on the account: prod-customer, prod-internal, staging, sandbox. Auto-remediation in prod-customer requires more conservative rules than in sandbox. This sounds obvious, and yet I’ve seen multiple incidents from a single policy treating all accounts the same.
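
A lookup table is usually enough for that mapping. The sketch here is illustrative: the AUTO_POLICY table, the decide function, and the specific severities allowed per account class are assumptions I made up to show the shape, with "manual" as the default for anything not listed.

```python
# (rule_id, account_class) -> severities that may be fixed without a human.
# Anything not listed falls through to "manual".
ALL_SEVERITIES = {"critical", "high", "medium", "low"}

AUTO_POLICY = {
    ("s3-bucket-public-access-prohibited", "sandbox"): ALL_SEVERITIES,
    ("s3-bucket-public-access-prohibited", "staging"): ALL_SEVERITIES,
    ("s3-bucket-public-access-prohibited", "prod-internal"): {"critical", "high"},
    # prod-customer is deliberately absent in this sketch: always manual.
}


def decide(rule_id: str, severity: str, account_class: str) -> str:
    """Return "auto" only for explicitly allowed combinations."""
    allowed = AUTO_POLICY.get((rule_id, account_class), set())
    return "auto" if severity in allowed else "manual"
```

The point of the default is that a new rule or a new account class cannot become auto-remediated by accident; someone has to add a row, and that row goes through review.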

The Approval Gate

For anything in the maybe-safe category, build a real approval gate. Slack with a button. Or PagerDuty. Or a ticket. Whatever your team actually uses.

The gate must include enough context to make the decision without leaving the message:

  • The exact resource and what’s being changed.
  • The Terraform or CLI command that will execute.
  • A rollback command.
  • A 24-hour expiry. Pending actions that go unapproved get auto-rejected and re-queued as manual.
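
The list above can be sketched as a small data structure plus two functions. PendingAction, approve, and resolve are hypothetical names; how the message reaches Slack or PagerDuty is left out on purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

APPROVAL_TTL = timedelta(hours=24)


@dataclass
class PendingAction:
    resource: str
    change_summary: str
    command: str           # the Terraform or CLI command that will execute
    rollback_command: str
    created_at: datetime
    approved_by: Optional[str] = None

    @property
    def expires_at(self) -> datetime:
        return self.created_at + APPROVAL_TTL


def approve(action: PendingAction, approver: str, now: datetime) -> None:
    """Record an approval, refusing it once the window has closed."""
    if now >= action.expires_at:
        raise ValueError("approval window has expired")
    action.approved_by = approver


def resolve(action: PendingAction, now: datetime) -> str:
    """Return "execute", "pending", or "manual" for a gated action."""
    if action.approved_by is not None:
        return "execute"
    if now >= action.expires_at:
        return "manual"  # auto-rejected, re-queued for a human
    return "pending"


created = datetime(2024, 10, 14, 9, 0, tzinfo=timezone.utc)
example = PendingAction(
    resource="arn:aws:rds:eu-west-1:111122223333:snapshot:demo",
    change_summary="mark snapshot private",
    command="<command shown to the approver>",
    rollback_command="<rollback shown to the approver>",
    created_at=created,
)
```

Passing now in explicitly keeps the expiry logic easy to test without mocking the clock.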

For things in the always-safe category, you can skip the gate but not the audit trail. Every auto-action goes to an immutable log with the input finding, the action taken, and the result. I use a separate S3 bucket with Object Lock for this. It saves you the next time auditors come around, and it’s the only way to do honest post-incident analysis when the bot breaks something.
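
Here is a minimal sketch of how such an audit record could be built. audit_put_kwargs is a hypothetical helper; ObjectLockMode and ObjectLockRetainUntilDate are real put_object parameters, but the COMPLIANCE mode, the 365-day retention, and the key layout are my assumptions, and the bucket must already have Object Lock enabled.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def audit_put_kwargs(bucket, finding, action, result, now, retain_days=365):
    """Build keyword arguments for an S3 put_object call with Object Lock."""
    body = json.dumps(
        {
            "finding": finding,
            "action": action,
            "result": result,
            "recorded_at": now.isoformat(),
        },
        sort_keys=True,
    )
    # Finding IDs are often ARNs full of ":" and "/", so hash them for the key.
    digest = hashlib.sha256(finding["Id"].encode("utf-8")).hexdigest()
    return {
        "Bucket": bucket,
        "Key": f"actions/{now:%Y/%m/%d}/{digest}.json",
        "Body": body.encode("utf-8"),
        "ObjectLockMode": "COMPLIANCE",  # assumption: strictest mode
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


example = audit_put_kwargs(
    bucket="audit-bucket",
    finding={"Id": "arn:aws:securityhub:demo/finding/1"},
    action="enable-pab",
    result="success",
    now=datetime(2024, 10, 14, 12, 0, tzinfo=timezone.utc),
)
```

With a boto3 S3 client in hand, the write itself would be client.put_object(**example). Building the arguments in a pure function means the record format can be unit-tested without touching AWS.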

A Concrete Lambda Pattern

Here’s the shape of a remediation Lambda that I’d actually ship for the “S3 PublicAccessBlock disabled” finding from AWS Security Hub. EventBridge triggers it on a Security Hub finding ingestion event.

import boto3
import json
import os

s3 = boto3.client("s3")
sh = boto3.client("securityhub")
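# This sketch reuses the default credentials. In production, build the audit
# client from a separate write-only role so the bot cannot alter its own trail.
audit = boto3.client("s3")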

ALLOWED_RULE = "s3-bucket-level-public-access-prohibited"
ALLOWED_ACCOUNTS = set(os.environ["ALLOWED_ACCOUNTS"].split(","))

def handler(event, _ctx):
    finding = event["detail"]["findings"][0]
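    # The GeneratorId format differs by standard and version, so confirm this
    # string against a real finding before relying on the check.
    if finding["GeneratorId"] != f"aws-foundational/{ALLOWED_RULE}":
        return {"skipped": "rule not in allowlist"}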
    if finding["AwsAccountId"] not in ALLOWED_ACCOUNTS:
        return {"skipped": "account not in allowlist"}

    bucket = finding["Resources"][0]["Id"].split(":")[-1]

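    # Pre-action snapshot for rollback. A bucket with no PublicAccessBlock
    # configuration raises an error, which we record as "nothing to restore".
    try:
        before = s3.get_public_access_block(Bucket=bucket)[
            "PublicAccessBlockConfiguration"]
    except s3.exceptions.ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchPublicAccessBlockConfiguration":
            raise
        before = None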

    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    sh.batch_update_findings(
        FindingIdentifiers=[{
            "Id": finding["Id"],
            "ProductArn": finding["ProductArn"],
        }],
        Workflow={"Status": "RESOLVED"},
        Note={
            "Text": "Auto-remediated by sec-bot v1",
            "UpdatedBy": "sec-bot",
        },
    )

    audit.put_object(
        Bucket=os.environ["AUDIT_BUCKET"],
        Key=f"actions/{finding['Id']}.json",
        Body=json.dumps({
            "finding": finding,
            "before": before,
            "action": "enable-pab",
            "result": "success",
        }),
    )

    return {"remediated": bucket}

A few things to call out. The allowlist of rules is at the top, not in some external config that could change without review. The audit write goes to a different bucket with a different IAM role; the remediation function can write there but cannot delete or modify. And the Security Hub finding is closed only after the action succeeds, so a failure surfaces as an open finding for human follow-up.

Verify, Don’t Trust

The bug I see most often: the remediation function succeeds, marks the finding closed, and nobody notices that the next scan still shows the same finding. The action ran, but a sibling resource or an account-level setting overrode it.

Add a verification step that re-queries the resource state after the change and compares against the desired state. If they don’t match, escalate. Don’t close the finding optimistically.
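
For the PublicAccessBlock case, a verification step can be this small. DESIRED_PAB, pab_mismatches, and verify_pab are hypothetical names; the only AWS call is get_public_access_block, and the client is passed in so the comparison can be tested with a stub.

```python
DESIRED_PAB = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def pab_mismatches(actual: dict, desired: dict = DESIRED_PAB) -> dict:
    """Return {setting: actual_value} for every setting that differs."""
    return {
        key: actual.get(key)
        for key, want in desired.items()
        if actual.get(key) != want
    }


def verify_pab(s3_client, bucket: str) -> dict:
    """Re-read the bucket's PublicAccessBlock and report any mismatches."""
    resp = s3_client.get_public_access_block(Bucket=bucket)
    return pab_mismatches(resp["PublicAccessBlockConfiguration"])
```

An empty dict means the bucket is in the desired state and the finding can be marked resolved. Anything else should escalate and leave the finding open.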

For longer-lived resources I run a separate “drift watcher” that periodically compares known-remediated resources against the policy. Easy to write, catches the cases where someone reverts the fix manually or where a Terraform apply blows away the bot’s work.
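
A drift watcher in its simplest form is a loop over what the bot has already fixed. In this sketch, find_drift and the record shape (a "resource" key and a "desired" dict) are assumptions, and fetch_state stands in for whatever function reads the live configuration.

```python
from typing import Callable, Iterable


def find_drift(
    remediated: Iterable[dict],
    fetch_state: Callable[[str], dict],
) -> list:
    """Return one entry per resource whose live state no longer matches."""
    drifted = []
    for record in remediated:
        actual = fetch_state(record["resource"])
        diff = {
            key: actual.get(key)
            for key, want in record["desired"].items()
            if actual.get(key) != want
        }
        if diff:
            drifted.append({"resource": record["resource"], "diff": diff})
    return drifted
```

Run it on a schedule and feed the result into the same notify path as new findings, so a reverted fix gets the same attention as a fresh one.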

I wrote a related piece about secrets scanning in 2024, TruffleHog and Gitleaks in CI, that touches on a similar drift problem for credentials. Same lesson: detect, fix, then verify.

External reference worth bookmarking: the AWS Security Hub automation rules documentation. Native automation rules can handle the trivial cases without writing Lambda at all, which I’d prefer when possible.

Gotchas

Things that have bitten me or people I work with:

  • IAM permission creep on the remediation role. Start with one IAM action per remediation type. The temptation to add * will be strong. Resist.
  • Cross-region surprises. Security Hub aggregates across regions but findings carry the source region. Your remediation function must call the API in the right region, not the aggregator region.
  • Race conditions with deployment pipelines. Bot fixes a public S3 bucket, Terraform applies five minutes later and re-opens it. Either coordinate or stop fighting the IaC and instead open a PR.
  • Multiple findings for the same resource. A noisy resource generates a finding per rule. Your function might run three times in parallel against the same bucket. Make actions idempotent.
  • Auto-remediation hiding upstream bugs. If you’re closing the same finding type 200 times a week, the bot is a band-aid on a process failure. Fix the process.
  • The “closure rate” metric. Easy to game. The metric that matters is mean time to actual safe state, measured against re-occurrence.
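
On that last point, here is one way the two numbers could be computed from remediation records. The record fields (detected_at, verified_safe_at, resource, rule_id) and both definitions are my assumptions; adjust them to whatever your audit log actually stores.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_safe_state(records):
    """Average of (verified_safe_at - detected_at) over verified records."""
    durations = [
        (r["verified_safe_at"] - r["detected_at"]).total_seconds()
        for r in records
        if r.get("verified_safe_at") is not None
    ]
    return timedelta(seconds=mean(durations)) if durations else None


def recurrence_rate(records):
    """Fraction of (resource, rule) pairs that were flagged more than once."""
    counts = Counter((r["resource"], r["rule_id"]) for r in records)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)


t0 = datetime(2024, 10, 14, 9, 0)
sample_records = [
    {"resource": "bucket-a", "rule_id": "pab", "detected_at": t0,
     "verified_safe_at": t0 + timedelta(minutes=10)},
    {"resource": "bucket-a", "rule_id": "pab", "detected_at": t0,
     "verified_safe_at": t0 + timedelta(minutes=30)},
    {"resource": "bucket-b", "rule_id": "pab", "detected_at": t0,
     "verified_safe_at": t0 + timedelta(minutes=20)},
]
```

A falling time-to-safe-state with a rising recurrence rate is the signature of a bot papering over an upstream problem.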

What’s Next

Auto-remediation is a small piece of a larger picture. The bigger investment is in preventing the findings in the first place: better defaults in your account-baseline modules, service control policies that make the bad state structurally impossible, and CI checks that block the regression before it ships.

I’d rather close zero findings because none exist than close a thousand because the bot is heroically holding back the tide. Build the bot, but spend more of your week on the structural fixes.