Few-Shot Prompting and In-Context Learning

January 17, 2023 · 4 min read · by Muhammad Amal ai

TL;DR — Few-shot = include 2-5 worked examples in the prompt. The model pattern-matches. More reliable than long instructions for novel task formats. Cost: 100-500 extra tokens per call. Worth it for tasks where zero-shot prompts fail >10% of the time.

After prompt basics, few-shot prompting is the next technique to learn. Zero-shot (just instructions) works for common tasks. Few-shot (instructions + examples) works for novel ones.

When few-shot wins

Zero-shot prompts work great when the task is well-known: “summarize this text”, “translate to French”, “extract phone numbers”. The model has seen millions of examples in training.

Few-shot wins when:

Output format is novel or app-specific
The task involves your domain conventions
Zero-shot is getting 70-85% accuracy and you need 90%+
The instructions to fully specify the task would be longer than just showing 3 examples

For a custom ticket classifier with your specific labels, few-shot reliably outperforms zero-shot.

The basic shape

Task: Classify customer feedback as positive, negative, or neutral.

Examples:

Feedback: "Love the new dashboard! Much cleaner than before."
Classification: positive

Feedback: "Login is broken since the update. Cannot log in."
Classification: negative

Feedback: "When will the API export feature ship?"
Classification: neutral

Feedback: "{user_input}"
Classification:

The model completes after “Classification:” because it pattern-matches the previous examples.

Three examples covering the spread are usually enough. Beyond 5, returns diminish.
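If the prompt is assembled in code, the layout above could be generated from a list of (text, label) pairs. This is a minimal sketch, not a fixed recipe: the function name `build_few_shot_prompt` and its parameters are assumptions made for illustration, and the user input in the usage lines is made up.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    task: str,
    examples: Sequence[Tuple[str, str]],
    user_input: str,
    input_label: str = "Feedback",
    output_label: str = "Classification",
) -> str:
    """Assemble a few-shot prompt: task line, worked examples, then the open slot."""
    lines = [f"Task: {task}", "", "Examples:", ""]
    for text, label in examples:
        lines.append(f'{input_label}: "{text}"')
        lines.append(f"{output_label}: {label}")
        lines.append("")
    # End on the bare output label so the model completes it.
    lines.append(f'{input_label}: "{user_input}"')
    lines.append(f"{output_label}:")
    return "\n".join(lines)


# The three examples from the prompt above, one per class.
EXAMPLES = [
    ("Love the new dashboard! Much cleaner than before.", "positive"),
    ("Login is broken since the update. Cannot log in.", "negative"),
    ("When will the API export feature ship?", "neutral"),
]

prompt = build_few_shot_prompt(
    "Classify customer feedback as positive, negative, or neutral.",
    EXAMPLES,
    "The export button does nothing.",  # hypothetical user input
)
print(prompt)
```

Keeping the examples in a list rather than a hard-coded string makes it easier to swap individual examples later without touching the layout.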

Choosing examples

Three rules:

1. Cover the spread. One example per output class. Don’t include three positives and skip negatives.

2. Match difficulty. Don’t pick only easy cases. Include at least one edge case so the model handles edges in real input.

3. Match your real input distribution. Use real production samples (anonymized) as examples. Synthetic examples don’t capture quirks of real data.

For our ticket classifier:

Examples:

Ticket: "Charged twice for the same month — please refund the duplicate."
Category: billing | Priority: high | Reason: financial issue, customer affected

Ticket: "Is there a dark mode planned for the dashboard?"
Category: other | Priority: low | Reason: feature inquiry, no urgency

Ticket: "API returns 503 on all requests since 9am, our integration is down."
Category: technical | Priority: high | Reason: outage, customer blocked

Ticket: "Can't find where to update my billing address."
Category: account | Priority: medium | Reason: usability question, moderate impact

Four examples, four categories, mix of priorities. Real-shaped tickets.
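Rule 1 (one example per output class) can be checked mechanically. The sketch below assumes you have a pool of labeled tickets; `pick_covering_examples` and `missing_labels` are hypothetical helper names, and the second billing ticket in the pool is invented for illustration. In practice the pool would hold anonymized production samples.

```python
from typing import Dict, List, Set, Tuple

Example = Tuple[str, str]  # (ticket text, category label)


def pick_covering_examples(pool: List[Example], max_examples: int = 5) -> List[Example]:
    """Keep the first ticket seen for each category, up to max_examples."""
    chosen: Dict[str, Example] = {}
    for text, label in pool:
        if len(chosen) >= max_examples:
            break
        # setdefault leaves an already-chosen category untouched.
        chosen.setdefault(label, (text, label))
    return list(chosen.values())


def missing_labels(examples: List[Example], labels: Set[str]) -> Set[str]:
    """Return the categories that no example covers."""
    return labels - {label for _, label in examples}


POOL = [
    ("Charged twice for the same month.", "billing"),
    ("Refund has not arrived yet.", "billing"),  # invented duplicate-category ticket
    ("Is there a dark mode planned?", "other"),
    ("API returns 503 on all requests.", "technical"),
    ("Can't find where to update my billing address.", "account"),
]

examples = pick_covering_examples(POOL)
gaps = missing_labels(examples, {"billing", "technical", "account", "other"})
print(len(examples), gaps)
```

This only enforces coverage. Picking an edge case and matching the real input distribution (rules 2 and 3) still needs a human looking at the data.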

Output format consistency

Critical: example outputs must follow the EXACT format you want. If examples show “Category: billing”, don’t expect “category: billing” or {"category": "billing"} from the model.

For JSON output:

Examples:

Ticket: "Charged twice — please refund."
JSON: {"category": "billing", "priority": "high"}

Ticket: "Dark mode planned?"
JSON: {"category": "other", "priority": "low"}

Match output style precisely; the model copies the style.
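Even with consistent examples, it seems prudent to validate what comes back before using it. The following sketch parses the JSON form shown above and rejects anything outside the expected values. The function name `parse_ticket_json` is hypothetical, and the allowed sets are assumed from the categories and priorities used in this post's examples.

```python
import json
from typing import Optional

# Assumed label sets, taken from the ticket examples in this post.
ALLOWED_CATEGORIES = {"billing", "technical", "account", "other"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}


def parse_ticket_json(raw: str) -> Optional[dict]:
    """Return the parsed ticket fields, or None if the output is off-format."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    category = data.get("category")
    priority = data.get("priority")
    # Check the type first so unhashable values (lists, dicts) cannot raise.
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        return None
    return {"category": category, "priority": priority}


ok = parse_ticket_json('{"category": "billing", "priority": "high"}')
bad = parse_ticket_json("Category: billing | Priority: high")
print(ok, bad)
```

A `None` result is a signal to retry or route the ticket to a fallback path. Which of those to do is a design choice this sketch leaves open.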

Cost math

Few-shot adds tokens. Three 50-token examples = 150 extra tokens per call. At $0.02 per 1K tokens, that’s $0.003 extra per call. Across 1M calls, it adds up to $3,000.

If few-shot lifts accuracy from 80% to 95%, 15% fewer outputs need manual rework, which saves $15K+ in human time. The math works out for most cases.

For ultra-high-volume cases ($100K+ in API calls), evaluate carefully — fine-tuning a smaller model might be cheaper than few-shot prompting a big one.
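The token arithmetic in this section can be wrapped in a small helper so you can plug in your own numbers. This is a sketch: `few_shot_overhead_usd` is a hypothetical name, and the price is whatever your provider charges per 1K input tokens.

```python
def few_shot_overhead_usd(
    n_examples: int,
    tokens_per_example: int,
    usd_per_1k_tokens: float,
    n_calls: int,
) -> float:
    """Extra spend caused by the examples alone, across n_calls requests."""
    extra_tokens_per_call = n_examples * tokens_per_example
    total_extra_tokens = extra_tokens_per_call * n_calls
    return total_extra_tokens / 1000 * usd_per_1k_tokens


# The figures from this section: three 50-token examples, $0.02/1K, 1M calls.
overhead = few_shot_overhead_usd(3, 50, 0.02, 1_000_000)
print(f"${overhead:,.0f}")
```

With the section's inputs this comes to roughly $3,000, matching the estimate above. Comparing that figure against the value of the rework avoided is the actual decision.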

Chain-of-thought variant

For tasks requiring reasoning, prompt the model to think step-by-step:

Examples:

Q: A truck holds 12 boxes. Each box has 20 books. How many books in 3 trucks?
A: 12 × 20 = 240 books per truck. 240 × 3 = 720. Answer: 720.

Q: A user pays $99/month. With a 20% annual discount paid upfront, what's the yearly cost?
A: Full-price year = $99 × 12 = $1188. With 20% discount: 1188 × 0.8 = $950.40. Answer: $950.40.

Q: {real question}
A:

The model writes out its reasoning first, then the answer. Working through the steps catches many arithmetic mistakes the model would otherwise make.

Cost: longer outputs. For backend usage where you only need the final answer, parse it out: take the last line, or use a delimiter.
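One way to parse out the final answer is to rely on the "Answer:" delimiter used in the examples above. The sketch below takes the last "Answer:" in the completion and drops the sentence-ending period. The name `extract_answer` is hypothetical, and the approach assumes the model keeps following the example format.

```python
import re
from typing import Optional

# Capture whatever follows "Answer:" up to the end of that line.
ANSWER_RE = re.compile(r"Answer:\s*(.+?)\s*$", re.MULTILINE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    # Use the last marker in case the reasoning mentions "Answer:" earlier.
    return matches[-1].rstrip(".")


completion = "12 × 20 = 240 books per truck. 240 × 3 = 720. Answer: 720."
print(extract_answer(completion))
```

Returning `None` when the marker is missing lets the caller decide whether to retry or flag the response, rather than silently using the wrong text.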

Self-consistency

For maximum reliability on tricky tasks, run the same prompt 5 times with temperature=0.5. Take the majority answer.

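from collections import Counter

def consensus_classify(text, n=5):
    # classify() is your single-call classifier; sample it n times at temperature 0.5.
    results = [classify(text, temperature=0.5) for _ in range(n)]
    # Majority vote: return the most frequent label.
    return Counter(results).most_common(1)[0][0]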

5× the cost. Useful only for high-stakes individual decisions, not bulk batch jobs.

When few-shot DOESN’T help

Tasks the model already nails zero-shot (translation, summarization)
Tasks requiring information the model doesn’t have (your customer DB)
Tasks needing strict factual accuracy (RAG works better — covered in later months)

Match the technique to the problem. Few-shot is a tool, not the answer.

Common Pitfalls

Examples too similar. Three positive sentiment examples; model assumes everything is positive.

Examples in different format than instructions. Model gets confused.

Too many examples. Past 5, diminishing returns; context budget consumed.

Cherry-picked easy examples. Production data is harder. Use representative samples.

No iteration on examples. First set is rarely best. Track accuracy; swap weak examples.

Examples that leak prompt injection. If your “examples” come from real user data without sanitization, attackers can inject instructions through them.
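Tracking accuracy while iterating on examples needs only a small labeled set and a scoring function. In this sketch, `accuracy` is a hypothetical helper, and `stub_classify` is a keyword stand-in for a real model call so that the snippet runs on its own.

```python
from typing import Callable, Iterable, Tuple


def accuracy(
    classify_fn: Callable[[str], str],
    labeled: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of labeled samples where classify_fn returns the expected label."""
    samples = list(labeled)
    if not samples:
        raise ValueError("labeled set is empty")
    correct = sum(1 for text, expected in samples if classify_fn(text) == expected)
    return correct / len(samples)


def stub_classify(text: str) -> str:
    # Stand-in for an LLM call: a real version would send the few-shot prompt.
    return "negative" if "broken" in text.lower() else "positive"


LABELED = [
    ("Login is broken since the update.", "negative"),
    ("Love the new dashboard!", "positive"),
    ("When will the API export feature ship?", "neutral"),
]

score = accuracy(stub_classify, LABELED)
print(f"{score:.2f}")
```

Running the same labeled set against each candidate example set gives a like-for-like comparison, which makes it clearer which examples are worth swapping.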

Wrapping Up

Few-shot = 2-5 examples in the prompt. Beats zero-shot when format is novel or domain-specific. Roughly $3K extra per $30K of base cost, for an accuracy bump of 10-15 points. Friday: streaming responses.
