Designing Tools for LLM Agents, Function Schemas That Survive Production

May 10, 2024 · 7 min read · by Muhammad Amal

TL;DR — The model is good at picking tools, bad at structuring arguments. Tight schemas, narrow types, and informative error messages do more for reliability than any prompt change.

When an agent misbehaves, the failure is almost never “the LLM didn’t understand the goal.” It’s “the LLM picked the right tool and passed it garbage.” Off-by-one date ranges, currency strings instead of integers, customer IDs with stray whitespace, optional fields that aren’t actually optional. The model produces plausible-shaped JSON because it produces plausible-shaped everything, and your tool either crashes or, worse, silently does the wrong thing.

Most of the time I spend on agent reliability is now spent on the tool layer, not the prompt. Schema design, validation, error message wording. None of it is glamorous, all of it pays off. This post is the checklist I run through before I let a tool near production traffic.

We’ll cover schema design for both OpenAI and Anthropic, what validation belongs where, how to write error messages the model can act on, and the idempotency pattern that has saved me from multi-thousand-dollar refund bugs.

Schemas should be narrow, not flexible

The instinct to make tools “powerful and flexible” is a trap. A tool that accepts ten optional parameters will eventually get called with the wrong combination. A tool with three required parameters and few or no optionals leaves far less room for that.

Start by writing the Pydantic model. Let it generate the JSON schema. Don’t write JSON schemas by hand.

from pydantic import BaseModel, Field
from typing import Literal

class CreateRefundInput(BaseModel):
    order_id: str = Field(..., pattern=r"^ord_[a-z0-9]{10}$",
                          description="Order identifier in canonical form ord_xxxxxxxxxx.")
    amount_cents: int = Field(..., gt=0, le=1_000_000,
                              description="Refund amount in cents. Must be positive integer.")
    reason: Literal["customer_request", "damaged", "wrong_item", "duplicate"] = Field(
        ..., description="Refund reason. Use customer_request when unsure.")
    notify_customer: bool = Field(default=True,
                                  description="Whether to email the customer.")

print(CreateRefundInput.model_json_schema())

Three patterns worth calling out. First, regex constraints on identifiers catch about 80% of bad calls before they hit your code. Second, Literal types for enums force the model to pick from a fixed set, which works much better than a free-form string with “valid values” in the description. Third, descriptions read by the LLM are not the same as descriptions read by humans. Write them for the LLM. Short, specific, with an example or default when ambiguous.
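To see those constraints do their job, here is a minimal sketch that feeds the model a deliberately sloppy call. It assumes Pydantic v2. The model is repeated without descriptions so the snippet runs on its own, and the bad argument values and the failing_fields helper are made up for illustration.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class CreateRefundInput(BaseModel):
    order_id: str = Field(..., pattern=r"^ord_[a-z0-9]{10}$")
    amount_cents: int = Field(..., gt=0, le=1_000_000)
    reason: Literal["customer_request", "damaged", "wrong_item", "duplicate"]
    notify_customer: bool = True


def failing_fields(raw_args: dict) -> list:
    """Return the names of the arguments that fail schema validation."""
    try:
        CreateRefundInput.model_validate(raw_args)
    except ValidationError as e:
        return sorted({str(err["loc"][0]) for err in e.errors()})
    return []


# Hypothetical model output: stray whitespace, a currency string, a made-up reason.
sloppy_call = {
    "order_id": " ord_abc1234567",
    "amount_cents": "$20",
    "reason": "broken",
}
print(failing_fields(sloppy_call))
```

All three arguments are rejected before any refund logic runs, and the model gets told about every one of them in a single round trip instead of one per retry.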

OpenAI and Anthropic want the same shape, mostly

For OpenAI’s tool calling, you pass a list of tools with a function block per tool.

from openai import OpenAI

client = OpenAI()
schema = CreateRefundInput.model_json_schema()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Refund order ord_abc1234567 for $20, item was damaged."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "create_refund",
            "description": "Issue a refund for a completed order.",
            "parameters": schema,
        }
    }],
    tool_choice="auto",
)

Anthropic’s tool use is structured the same way under a different key.

from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    tools=[{
        "name": "create_refund",
        "description": "Issue a refund for a completed order.",
        "input_schema": schema,
    }],
    messages=[{"role": "user", "content": "Refund order ord_abc1234567 for $20, item was damaged."}],
)

The schema you feed both is the same JSON. What differs is how you parse responses and pass tool results back. Anthropic uses tool_use content blocks and tool_result blocks; OpenAI uses tool_calls on the message and role: tool for results. The Anthropic tool use docs lay out the full message shape clearly, and you should read them once before assuming the OpenAI mental model maps.
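A rough sketch of that difference, written against stand-in objects so it runs without an API key. The field names (tool_calls, function.arguments, tool_call_id, tool_use, tool_result, tool_use_id) follow the two SDKs' response shapes; run_tool and the fake responses are hypothetical.

```python
import json
from types import SimpleNamespace
from typing import Optional


def run_tool(name: str, args: dict) -> str:
    # Hypothetical dispatcher; a real one would route to create_refund and friends.
    return f"OK: {name} called with {sorted(args)}"


def openai_tool_messages(assistant_message) -> list:
    """Build one role "tool" message per entry in message.tool_calls."""
    messages = []
    for call in assistant_message.tool_calls or []:
        try:
            args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            content = run_tool(call.function.name, args)
        except json.JSONDecodeError:
            content = "ERROR: arguments were not valid JSON. Resend the call."
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages


def anthropic_tool_message(response) -> Optional[dict]:
    """Build a single user message with one tool_result block per tool_use block."""
    results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": run_tool(block.name, block.input)}  # input is already a dict
        for block in response.content
        if block.type == "tool_use"
    ]
    return {"role": "user", "content": results} if results else None


# Stand-ins shaped like the SDK response objects, so the sketch runs offline.
args = {"order_id": "ord_abc1234567", "amount_cents": 2000, "reason": "damaged"}
fake_openai = SimpleNamespace(tool_calls=[SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="create_refund", arguments=json.dumps(args)))])
fake_anthropic = SimpleNamespace(content=[
    SimpleNamespace(type="text", text="Issuing the refund."),
    SimpleNamespace(type="tool_use", id="toolu_1", name="create_refund", input=args)])

print(openai_tool_messages(fake_openai))
print(anthropic_tool_message(fake_anthropic))
```

In both cases the assistant turn that requested the tool has to be appended to the history before these result messages, otherwise the provider rejects the follow-up request.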

One small but important difference. As of May 2024, Claude 3 models are noticeably better at picking when not to call a tool. GPT-4-turbo will sometimes make a tool call even when the answer is in its head. For high-volume agents this matters; add a “no_tool_needed” terminal tool if you see this happen.
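One way such a terminal tool could look, as a sketch. The tool name comes from the paragraph above; the description, the "answer" field, and the helper functions are assumptions for illustration, not a provider feature.

```python
# Hypothetical terminal tool: the description and the "answer" field are illustrative.
NO_TOOL_NEEDED = {
    "name": "no_tool_needed",
    "description": "Answer directly. Use when no other tool is required to respond.",
    "input_schema": {
        "type": "object",
        "properties": {
            "answer": {"type": "string", "description": "Final reply to the user."},
        },
        "required": ["answer"],
        "additionalProperties": False,
    },
}


def to_openai_tool(tool: dict) -> dict:
    """Rewrap an Anthropic-style tool definition in OpenAI's function block."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        },
    }


def is_terminal(tool_name: str) -> bool:
    """The agent loop stops and returns the answer when this tool is chosen."""
    return tool_name == NO_TOOL_NEEDED["name"]


print(to_openai_tool(NO_TOOL_NEEDED)["function"]["name"])
```

The agent loop checks is_terminal on each tool call and returns the answer argument to the user instead of executing anything.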

Validate twice, server-side always

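The model produces JSON. Your tool executes it. Between those two things, you validate, twice: first the shape of the arguments against the schema, then the business rules against live state. Pydantic does most of the first pass for free.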

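from pydantic import ValidationError

def create_refund(raw_args: dict) -> str:
    # First pass: shape and types, checked against the schema.
    try:
        args = CreateRefundInput.model_validate(raw_args)
    except ValidationError as e:
        return f"ERROR: invalid arguments. {format_validation_error(e)}"
    # Second pass: business rules against live state.
    # get_remaining_refundable and execute_refund are your own service functions.
    if args.amount_cents > get_remaining_refundable(args.order_id):
        return (f"ERROR: refund amount {args.amount_cents} cents exceeds remaining "
                f"refundable balance. Look up the order first.")
    return execute_refund(args)

def format_validation_error(e: ValidationError) -> str:
    # One "field: message" entry per failed field, joined on a single line.
    return "; ".join(f"{'.'.join(str(x) for x in err['loc'])}: {err['msg']}" for err in e.errors())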

The format of the error message matters more than people realize. The model reads it and decides what to do next. “ERROR: invalid arguments. amount_cents: Input should be greater than 0” tells the model exactly which argument to fix. “Invalid input” tells it nothing, and you get a retry with the same broken input.

I keep a personal rule. Every tool error must include the field name, what was wrong, and ideally a hint at the fix. If you can’t write that error, your tool’s contract isn’t tight enough.
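A sketch of that rule in code, assuming Pydantic v2. Each error carries the field name, Pydantic's message, the value the model actually sent, and a per-field hint. RefundArgs, FIELD_HINTS, and describe_errors are hypothetical names, and the hint text is an example.

```python
from pydantic import BaseModel, Field, ValidationError


class RefundArgs(BaseModel):
    amount_cents: int = Field(gt=0, le=1_000_000)


# Hypothetical per-field hints; written for the model, not for a human.
FIELD_HINTS = {
    "amount_cents": "Pass an integer number of cents, e.g. 2000 for $20.00.",
}


def describe_errors(e: ValidationError) -> str:
    """Format every validation error as: field, what was wrong, what was sent, hint."""
    parts = []
    for err in e.errors():
        field = ".".join(str(x) for x in err["loc"])
        hint = FIELD_HINTS.get(field, "")
        parts.append(f"{field}: {err['msg']} (got {err['input']!r}). {hint}".strip())
    return "ERROR: " + " | ".join(parts)


try:
    RefundArgs.model_validate({"amount_cents": -5})
except ValidationError as e:
    message = describe_errors(e)
    print(message)
```

Echoing the received value back is cheap and helps the model notice when it passed dollars instead of cents or negated an amount.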

Idempotency keys are non-negotiable for side-effecting tools

This is the one I learned the expensive way. An agent retried a refund tool call because it didn’t see the first response in time. The customer got refunded twice. The fix is a client-generated idempotency key the agent passes in, which your service treats as a uniqueness constraint.

class CreateRefundInput(BaseModel):
    order_id: str = Field(..., pattern=r"^ord_[a-z0-9]{10}$")
    amount_cents: int = Field(..., gt=0, le=1_000_000)
    reason: Literal["customer_request", "damaged", "wrong_item", "duplicate"]
    idempotency_key: str = Field(..., min_length=16, max_length=64,
                                 description="Unique key for this operation. Reuse to retry safely.")

You can generate the key in the agent wrapper rather than asking the LLM to invent one, which is cleaner. In that setup, keep idempotency_key on the service’s input model but leave it out of the schema you hand to the model: the model never sees the key as part of its reasoning, and the wrapper adds it before calling the actual service. Stripe’s approach to this is documented well in their idempotency guide and worth borrowing wholesale.

Tool descriptions are prompts, treat them like prompts

The description field on a tool is part of the system prompt every time the model is asked to consider that tool. Long, ambiguous descriptions waste tokens and confuse the model. The pattern that works for me is one sentence on what the tool does, one sentence on when to use it, and constraints inline on the argument descriptions.

Bad: “This tool will look up information about an order from the orders database, including the status, items, customer, shipping address, and any associated refunds or returns. Use it whenever the user wants information about an order.”

Better: “Fetch a single order by id. Use when the user references an order by number or asks about its status.”

The first version is 50 tokens. The second is 22. Multiply by every tool in your set and every turn of the conversation, and you’ve cut your tool-description overhead by more than half for free.

For broader context on how tool selection interacts with agent loops, see /blog/production-agents-langgraph-state-machines/.

Common Pitfalls

The recurring ones across the agent codebases I’ve reviewed.

Accepting strings where ints belong. “Amount” should be int cents, never a string. Currency symbols, commas, decimals, all model-generated, all wrong eventually.
Free-text enums. “status: string” with “valid values: open, closed, pending” in the description fails. Use Literal types.
Returning raw exceptions to the model. Stack traces eat tokens and confuse the model. Return one-line, structured errors.
No tool for “I don’t know.” Without an explicit way to say “ask the user,” the model will hallucinate arguments. Add a clarify tool.
Tools that read and write in the same call. Split them. The model handles “look up then act” much better than “look up and act atomically.”
Forgetting additionalProperties: false. Without it, the model can add invented fields and you won’t notice.
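Two of these pitfalls can be closed in the Pydantic model config. The sketch below assumes Pydantic v2, where extra="forbid" rejects unknown fields and puts additionalProperties: false into the generated schema, and strict=True turns off coercion of strings into ints. StrictRefundInput and is_rejected are illustrative names.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class StrictRefundInput(BaseModel):
    # extra="forbid" rejects unknown fields and emits additionalProperties: false.
    # strict=True stops Pydantic from coercing "2000" into 2000.
    model_config = ConfigDict(extra="forbid", strict=True)

    order_id: str = Field(pattern=r"^ord_[a-z0-9]{10}$")
    amount_cents: int = Field(gt=0, le=1_000_000)
    reason: Literal["customer_request", "damaged", "wrong_item", "duplicate"]


def is_rejected(raw_args: dict) -> bool:
    """True when the arguments fail validation."""
    try:
        StrictRefundInput.model_validate(raw_args)
    except ValidationError:
        return True
    return False


good = {"order_id": "ord_abc1234567", "amount_cents": 2000, "reason": "damaged"}
print(StrictRefundInput.model_json_schema()["additionalProperties"])
print(is_rejected(good))
print(is_rejected({**good, "priority": "high"}))       # invented field
print(is_rejected({**good, "amount_cents": "2000"}))   # string where an int belongs
```

Without these settings Pydantic's defaults would silently drop the invented field and quietly turn "2000" into 2000, which is exactly the "you won't notice" failure mode.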

Wrapping Up

Tool design is the unglamorous craft of agent engineering. Nobody writes blog posts about renaming a parameter from amount to amount_cents, but that rename has saved me more incidents than any prompt change. The model is a structured-output machine sitting on top of probabilistic text generation. Give it tight, narrow, well-named interfaces and it behaves. Give it flexibility and it will find a creative way to misuse every degree of freedom you offered.

The next time an agent does something inexplicable in production, look at the tool call arguments before you look at the prompt. Nine times out of ten the model picked the right tool. The arguments are where things went sideways, and that’s a problem you can fix with code, not with prayers.
