Reviewing AI-Suggested Code

December 9, 2022 · 4 min read · by Muhammad Amal

TL;DR — AI code looks more correct than it is. Review checklist: API existence, types match, edge cases handled, security implications, performance reasonable, idioms match codebase. Reviewing AI-generated code is a distinct skill — partly faster than human review (uniform style) and partly slower (deceptive surface plausibility).

The previous post covered prompt-style comments. This one covers what happens after Copilot produces a suggestion. The accept-and-move-on pattern is dangerous; structured review is mandatory.

Why AI code needs different review

Human code reviews focus on: intent, design, alternative approaches, side effects. The author already verified the syntax compiles and the basic logic works (they ran it).

AI code skips those checks. The author (you) is also the reviewer (you). Copilot generates; you decide if it’s right. Often the surface looks plausible; the substance is wrong.

Specific failure modes:

Hallucinated APIs. Function calls that don’t exist.
Wrong package version. Code matches an old version of the library.
Subtle off-by-one. Confidently incorrect.
Missing edge case. Happy path; null inputs crash.
Inappropriate algorithm. O(n²) where O(n) was easy.
Insecure pattern. SQL concat, weak crypto.
Wrong idiom for codebase. Functional where everything else is OOP, or vice versa.

Each of these passes a “looks right” review, so you need to check actively.
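To make the off-by-one and missing-edge-case failure modes concrete, here is a small sketch. The function names last_n_naive and last_n are hypothetical, made up for illustration; the first is the kind of suggestion that reads fine and still fails on a boundary value.

```python
def last_n_naive(items: list, n: int) -> list:
    """Plausible-looking suggestion: return the last n items."""
    # Bug: when n == 0, items[-0:] is the same as items[0:], i.e. the whole list.
    return items[-n:]


def last_n(items: list, n: int) -> list:
    """Reviewed version: handle n == 0 and reject negative n explicitly."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return []
    return items[-n:]


data = [1, 2, 3, 4]
print(last_n_naive(data, 2))  # [3, 4]
print(last_n_naive(data, 0))  # [1, 2, 3, 4]  (wrong: expected an empty list)
print(last_n(data, 0))        # []
```

The happy path (n = 2) is identical in both versions, which is why a quick skim doesn’t catch the difference.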

The review checklist

For every non-trivial AI suggestion:

1. Does the code run?

LSP shows no errors after accepting?
Imports resolve?
Types match?

2. Does the API exist?

The functions called — are they real?
Library version compatible?
Common hallucination: lodash.flatten exists but lodash.flattenAll doesn’t.

3. Edge cases:

Null / empty input?
Single element vs many?
Boundary values?
Concurrency where relevant?

4. Security:

Any input getting concatenated into SQL/HTML/shell?
Any secrets in code or logs?
Any unbounded input (DoS vector)?

5. Performance:

Algorithm complexity reasonable?
N+1 queries?
Allocation in a hot path?

6. Idiom match:

Same style as the surrounding code?
Same error handling pattern?
Same logging pattern?

For most simple completions, 1-2 minutes total. For complex generated functions, more.
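For the security item, a minimal sketch of what “input getting concatenated into SQL” looks like next to the parameterized alternative. It uses the standard library’s sqlite3 with an in-memory database; the table, the helper names (make_db, find_user_unsafe, find_user), and the payload are assumptions for illustration only.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "viewer")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # The pattern to reject in review: user input concatenated into SQL.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized query: the driver treats name as data, never as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"
print(find_user_unsafe(conn, payload))  # returns every row in the table
print(find_user(conn, payload))         # []
```

Both functions behave the same for a normal name like "alice", so only a hostile input shows the problem.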

Reading the code you didn’t write

Reading code you wrote = checking your assumptions. Reading code you didn’t write = building a model first.

For AI code:

Start with the signature: what’s the input, output?
Read the implementation in order
Where does each variable come from?
Where does each side effect happen?
Does the flow match the intent?

If you don’t deeply understand a generated line: don’t accept it. Either ask Copilot to explain (via a comment), or rewrite manually.

Patterns that catch hallucinated APIs

LSP red squiggles after acceptance: the most reliable signal. Compile errors immediately surface unknown function calls.

For dynamic languages without strong type checking:

# AI suggested
result = my_lib.fancy_helper(data)

# Verify
from my_lib import fancy_helper   # explicit import; ImportError if doesn't exist

Or:

print(dir(my_lib))   # see what actually exists

The seconds you spend verifying save the minutes you’d spend debugging “why does this fail in production but not locally” (because the function did exist locally with a different signature).
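The same check can be scripted. The sketch below is one possible approach, not a standard tool: api_exists and has_parameter are hypothetical helper names, and the standard library’s json module stands in for whatever library the suggestion calls.

```python
import importlib
import inspect
import json


def api_exists(module_name: str, attr: str) -> bool:
    """Return True if module_name imports and exposes a callable named attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return callable(getattr(module, attr, None))


def has_parameter(func, name: str) -> bool:
    """Return True if func declares a parameter called name.

    inspect.signature can raise ValueError for some C-implemented callables,
    so this is only a quick check for pure-Python functions.
    """
    return name in inspect.signature(func).parameters


print(api_exists("json", "dumps"))          # True
print(api_exists("json", "dump_pretty"))    # False: no such function
print(has_parameter(json.dumps, "indent"))  # True
print(has_parameter(json.dumps, "pretty"))  # False: no such keyword
```

The signature check matters for the “exists locally with a different signature” case: the name resolves, but the arguments the suggestion passes may not.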

When Copilot is more reliable than usual

Specific contexts where I’m more confident accepting:

Code that closely mirrors what’s already in the file (Copilot picks up local idioms)
Standard library calls (well-trained)
Heavily-documented libraries (pgx, axios, requests, etc.)
Boilerplate (mappers, getters, simple constructors)

Less confident:

New libraries / version-specific code
Cross-file logic
Anything async-related
DSLs or templating
Security primitives
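A sketch of why async code sits on the less-confident list: a missing await doesn’t crash, it quietly changes the result. The functions is_admin, can_delete_buggy, and can_delete are hypothetical names for illustration.

```python
import asyncio


async def is_admin(user: str) -> bool:
    await asyncio.sleep(0)  # stand-in for a real lookup
    return user == "root"


async def can_delete_buggy(user: str) -> bool:
    check = is_admin(user)  # missing await: this is a coroutine object
    allowed = bool(check)   # coroutine objects are always truthy
    check.close()           # demo only: avoids the "never awaited" warning
    return allowed


async def can_delete(user: str) -> bool:
    return await is_admin(user)


print(asyncio.run(can_delete_buggy("guest")))  # True  (wrong)
print(asyncio.run(can_delete("guest")))        # False
```

A type checker or an async-aware lint rule can flag this class of mistake, which is one reason to run them on AI-heavy code.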

Pre-commit hygiene

For AI-heavy days, I lean more on tools:

LSP errors before commit. No squiggles at staging.
Linters and type checkers (golangci-lint --enable-all, eslint, mypy --strict). They catch missing awaits, unused imports, and type mismatches; off-by-one errors usually still need a test.
Unit tests for new code. AI generates tests too; I write at least one I trust to verify the AI-generated code under it.
Quick benchmark for hot paths. A 5-line benchmark catches O(n²) regressions.

Same pre-commit hygiene as before AI, just more important.
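A quick benchmark along these lines might look like the sketch below, using the standard library’s timeit. The two functions are hypothetical stand-ins for “what the AI suggested” and “the obvious linear version”; the input size and repeat count are arbitrary.

```python
import timeit


def has_duplicates_quadratic(items: list) -> bool:
    """Compare every pair: O(n^2)."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items: list) -> bool:
    """Use a set: O(n) on average, but items must be hashable."""
    return len(set(items)) != len(items)


data = list(range(1000))  # worst case for the quadratic version: no duplicates
slow = timeit.timeit(lambda: has_duplicates_quadratic(data), number=3)
fast = timeit.timeit(lambda: has_duplicates_linear(data), number=3)
print(f"quadratic: {slow:.4f}s  linear: {fast:.4f}s")
```

Absolute numbers vary by machine; what matters is how the gap grows when you increase the input size.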

Pull request hygiene

When a PR contains AI-generated code:

Don’t disclose unless your team policy requires it (some do for legal/IP reasons)
DO verify the code passes the same review bar as human-written code
Don’t treat “Copilot wrote it” as an excuse for sloppiness

For reviewers reviewing your PR: they don’t know it was AI-generated. The code stands on its own.

Speed expectations

I’m faster on greenfield code with Copilot — boilerplate-heavy work is dramatically accelerated. I’m at the same speed (or slower) on:

Code reviews (AI doesn’t help review)
Architectural design (AI doesn’t help)
Debugging (AI sometimes misleads)
Hot-path performance work (verify-cost is high)
Security-sensitive code (review-cost is high)

Net: faster overall, but the speedup is concentrated in specific kinds of work.

Common Pitfalls

Tab and ship. Don’t.

Accepting confidence as correctness. AI sounds confident even when wrong.

Skipping tests because “AI wrote it.” The lack of a human author makes tests more important, not less.

Not reading line-by-line. Skimming is fine for low-stakes code; read hot paths line by line.

Treating LSP green as proof. A clean LSP means the code parses and the types check; it doesn’t verify behavior.

Ignoring linter rules. Linters catch a category of AI errors. Run them.

No code review on AI PRs. Self-review by the author plus Copilot doesn’t replace a second pair of eyes.

Wrapping Up

Review AI code with the same rigor as human code; the failure modes are different but real. Monday: pair programming with AI.

