LLM Red Teaming, Practical Techniques for 2024
TL;DR — LLM red teaming is useful only when it’s structured. Define what success looks like for the attacker, attack the application, not the model, and turn findings into evals you can run forever.
LLM red teaming the way it’s usually done is theater. Someone sits down with a chat window, finds a clever way to make the model say something embarrassing, screenshots it, posts it to a Slack channel. The team pats them on the back, marks the bug as fixed by adjusting the system prompt, and the same attack works again two weeks later under a slightly different phrasing.
I’ve done a fair amount of this work for production systems, and the version that actually moves the needle is much more boring. It looks like security testing in any other domain: defined threat models, structured attack inventories, repeatable test cases, and a clear definition of what counts as a finding. This post is what I’ve learned about doing it well.
The reference models in this post are claude-3.5-sonnet and o1-preview. The techniques generalize, but the specifics of what works and what doesn’t vary by model.
Define The Attacker’s Win
The first question to answer, and the one most red teams skip: what does the attacker actually want? “Make the model say something bad” is too vague to act on. Useful goal definitions look like:
- Cause the agent to call
delete_accounton an account the attacker doesn’t own. - Cause the agent to disclose a document tagged
confidentialfrom a different user’s tenant. - Cause the agent to emit content that meets a specific abuse category defined by the platform’s policy.
- Cause the agent to consume more than $X in API costs from a single attacker session.
- Cause the agent to produce code with a specific backdoor pattern that bypasses CI checks.
Each of these maps to a concrete observable. You can write a test that checks whether the attacker won. That’s the entire point. If you can’t write a clear pass/fail check, you’re not doing security testing — you’re doing vibes-based evaluation.
Attack The Application, Not The Model
The same model behaves very differently inside different applications. A bare model with no tools is essentially a question-answering machine; the worst it can do is produce text. A model wired into a customer support flow with the ability to issue refunds is a substantially more interesting attack surface.
Red team the application, not the underlying model. The bugs that matter are application-level: bad tool gating, missing authorization checks on retrieval, output filters that don’t catch exfiltration shapes, prompt structures that make injection more likely. These are the bugs you can actually fix.
This is also why generic “jailbreak” lists from public datasets are of limited value. They tell you the model has weak spots — useful context — but they don’t tell you whether your application is exploitable. A jailbreak that produces a recipe for a cocktail in an unrelated chat is meaningless if your application can’t issue refunds based on its outputs.
A Working Attack Inventory
I keep a structured inventory of attack techniques, mapped to which categories of system they apply to. The categories I track in 2024:
Direct prompt injection. Variants of “ignore previous instructions” and its descendants. Worth running because it still occasionally works on poorly defended applications.
Indirect prompt injection. Instructions embedded in retrieved documents, tool outputs, or shared workspace content. This is where most of the real bugs live in 2024.
Tool misuse. Crafting inputs that cause the model to call tools with arguments it shouldn’t. The recurring pattern is convincing the model that a benign-looking request requires a dangerous tool call as a side effect.
Identity confusion. Multi-tenant attacks where the attacker tries to get the model to act as if they were a different user. “On behalf of user X” framings, fake authentication tokens in input, plausible-looking session identifiers.
Cost amplification. Inputs that cause the model to consume disproportionate compute. Long generation loops, recursive tool calls, encouraging the model to “think step by step in detail at length.” Less glamorous than a data leak, but real money.
Output exfiltration. Coercing the model to emit data via covert channels. Markdown image URLs with payloads, base64 in the response, structured tags that downstream systems parse.
Policy bypass. Standard “make the model produce content it shouldn’t” testing. Lowest-priority for most applications because the consequences are reputational, but matters for consumer-facing platforms.
Memory and cross-session pollution. If the application has any persistent memory or shared context, can the attacker plant something that affects future sessions or other users?
For each category I have a starter set of 20 to 50 prompts and the procedure for evaluating outcomes. The set grows over time as I find new variants.
Automate The Boring Parts
Manual testing finds creative attacks. Automation finds regressions. You want both, and you should automate as much as possible so the manual work goes into the genuinely novel.
I run a test harness that:
- Loads attack templates from a YAML file.
- Renders each template against a target endpoint, varying parameters (target document IDs, user contexts, etc.).
- Captures the model’s full response, including any tool calls.
- Scores each attempt against a defined oracle.
@dataclass
class AttackCase:
id: str
category: str
template: str
target_endpoint: str
success_oracle: str # name of a function in oracles.py
def run_case(case: AttackCase, params: dict) -> AttackResult:
prompt = case.template.format(**params)
response = call_target(case.target_endpoint, prompt)
won = ORACLES[case.success_oracle](response, params)
return AttackResult(
case_id=case.id,
params=params,
response=response,
attacker_won=won,
)
The oracle functions are the load-bearing piece. They turn a probabilistic output into a deterministic pass/fail. For exfiltration tests the oracle might be “the response contains the canary string we planted in the target document.” For tool misuse it might be “the recorded tool call list includes delete_account with an argument matching victim_account_id.”
When the oracle is hard to define, the test isn’t ready. Don’t add it to the suite until you can score it automatically.
Manual Sessions With Discipline
The complement to automation: scheduled manual sessions where a human probes for novel attacks. I block out two hours, define a specific target (e.g., “find a way to make the agent leak the content of a document outside the requester’s tenant”), and write up every attempt in a structured log.
The log entries include: the input, the response, whether the attacker won, the suspected reason for success or failure, and a candidate test case to add to the regression suite. This is how the corpus grows. Casual screenshot-driven testing produces stories; structured logging produces tests.
For reasoning models like o1-preview, the chain of thought is part of the attack surface even when you don’t see it. Some attacks succeed by getting the model to commit to a course of action during its hidden reasoning that the visible output then carries out. Probe by asking the model to summarize its reasoning explicitly, then compare against the actual behavior.
Scoring And Reporting
The classic mistake in red team reporting is producing a list of “the model said this thing it shouldn’t” findings without prioritization. A useful report ranks findings by:
- Severity. What was the attacker’s win condition? Data exfiltration is higher than reputational policy violation.
- Reliability. Does the attack succeed consistently, or only occasionally? An unreliable attack is still real, but a 100% reliable one is more urgent.
- Surface area. Does the attack work only with specific inputs, or is it broad? A jailbreak that works for any restricted topic is more important than one that only works for a single edge case.
- Defensibility. Can you fix it with a clear engineering change, or is the fix at the model level? Application-level fixes go in this sprint; model-level concerns become risk acceptance or vendor escalation.
Each finding gets a recommended mitigation. Not “improve the system prompt.” Actual changes: add a tool argument validation, tighten the retrieval filter, change the agent loop’s iteration cap, add an output inspector regex.
The patterns are very similar to what I wrote about for prompt injection defenses in LLM apps, patterns for 2024 — most red team findings get mitigated by architectural changes rather than prompt engineering.
A useful external reference for taxonomy is the OWASP Top 10 for LLM Applications. The categories are coarse but provide common vocabulary for reporting findings to non-specialists.
Turning Findings Into Evals
Every confirmed finding becomes a regression test. This is the highest-leverage step in the whole process. A red team that doesn’t produce evals is a red team that finds the same bugs every six months.
For each finding:
- Add a case to the automated suite with the exact input and an oracle that detects the failure mode.
- Tag the case with the finding ID and the date.
- Wire the suite into CI so any change to the application or model version triggers a re-run.
- Track pass rates over time. A regression on a previously-fixed case is a production-critical alert.
The corpus that results from a year of disciplined red teaming becomes the most valuable asset you have for evaluating model upgrades. When a new model version drops and you’re tempted to roll it out, you re-run the suite. If the previously-mitigated cases come back, you have data to push back with.
Gotchas
- Testing against the chat playground instead of the production endpoint. They have different system prompts, different tool configs, different rate limits. Always test the actual integration.
- Letting the attacker oracle be the model itself. “Ask the same model if its response is bad” is convenient and unreliable. Use independent oracles where possible.
- Mistaking model improvements for fixes. A bug that disappears after a model upgrade may reappear after another model upgrade. Don’t close findings without a corresponding application-level change.
- Forgetting about non-text channels. Tool calls, function arguments, hidden reasoning, embeddings. Attacks land in any of these; testing must include all of them.
- Counting jailbreaks as findings without business context. A jailbreak is a vulnerability; whether it’s a finding depends on whether your application gives the model authority over anything that matters.
- No defined out-of-scope. Red teams without scope grow until they’re testing the entire vendor’s safety model, which is not your job. Define what’s in scope and stick to it.
- Confidential prompts leaking through test artifacts. Red team logs contain attack payloads and successful exploit traces. Treat them like incident reports: restricted access, retention policy, secure storage.
Wrapping Up
Useful LLM red teaming is unsexy. It looks like writing test cases, scoring them automatically, prioritizing findings by impact, and feeding fixed bugs back into a regression suite. The party-trick jailbreaks make for good demos but don’t move the security posture of a production system.
If you’re starting a red team practice from scratch, my advice: pick three attacker win conditions specific to your application, build twenty attack cases against each, automate the scoring, and run the suite weekly. The corpus and the discipline are worth more than the cleverness of any individual attack. Build them first; clever comes later.