Reviewing AI-Suggested Code
TL;DR — AI code looks more correct than it is. Review checklist: API existence, types match, edge cases handled, security implications, performance reasonable, idioms match codebase. Reviewing AI-generated code is a distinct skill — partly faster than human review (uniform style) and partly slower (deceptive surface plausibility).
With prompt-style comments covered, the next question is what happens after Copilot produces a suggestion. The accept-and-move-on pattern is dangerous; structured review is mandatory.
Why AI code needs different review
Human code reviews focus on: intent, design, alternative approaches, side effects. The author already verified the syntax compiles and the basic logic works (they ran it).
AI code skips those checks. The author (you) is also the reviewer (you). Copilot generates; you decide if it’s right. Often the surface looks plausible; the substance is wrong.
Specific failure modes:
- Hallucinated APIs. Function calls that don’t exist.
- Wrong package version. Code matches an old version of the library.
- Subtle off-by-one. Confidently incorrect.
- Missing edge case. Happy path; null inputs crash.
- Inappropriate algorithm. O(n²) where O(n) was easy.
- Insecure pattern. SQL concat, weak crypto.
- Wrong idiom for codebase. Functional where everything else is OOP, or vice versa.
Each of these passes a “looks right” review, so each needs active checking.
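As a small illustration of the "missing edge case" failure mode, here is a minimal sketch. Both function names are hypothetical, and the first body stands in for the kind of happy-path suggestion described above, not for any actual Copilot output.

```python
def average_suggested(values):
    # Hypothetical AI-style suggestion: reads fine, works on the happy path,
    # and raises ZeroDivisionError on an empty list.
    return sum(values) / len(values)


def average_reviewed(values):
    # Reviewed version: the empty (or None) input case is handled explicitly,
    # so the caller gets a clear error instead of a division by zero.
    if not values:
        raise ValueError("average() requires at least one value")
    return sum(values) / len(values)
```

Nothing in the first version looks wrong at a glance; the problem only shows up when you ask what happens with zero elements.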
The review checklist
For every non-trivial AI suggestion:
1. Does the code run?
- LSP shows no errors after accepting?
- Imports resolve?
- Types match?
2. Does the API exist?
- The functions called — are they real?
- Library version compatible?
- Common hallucination: lodash.flatten exists but lodash.flattenAll doesn’t.
3. Edge cases:
- Null / empty input?
- Single element vs many?
- Boundary values?
- Concurrency where relevant?
4. Security:
- Any input getting concatenated into SQL/HTML/shell?
- Any secrets in code or logs?
- Any unbounded input (DoS vector)?
5. Performance:
- Algorithm complexity reasonable?
- N+1 queries?
- Allocation in a hot path?
6. Idiom match:
- Same style as the surrounding code?
- Same error handling pattern?
- Same logging pattern?
For most simple completions, 1-2 minutes total. For complex generated functions, more.
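For the security item, the pattern to look for is usually easy to spot once you know it. The sketch below uses Python's standard `sqlite3` module; the `users` table and the `find_user_*` helpers are made up for illustration.

```python
import sqlite3


def make_demo_db():
    # In-memory database with a hypothetical `users` table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


def find_user_unsafe(conn, name):
    # Reject in review: user input is concatenated into the SQL string.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized query: `name` is passed as data, never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

Calling `find_user_unsafe(conn, "x' OR '1'='1")` returns every row in the table, while `find_user_safe` with the same input returns nothing. Both versions behave identically on ordinary names, which is why the unsafe one survives a casual review.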
Reading the code you didn’t write
Reading code you wrote = checking your assumptions. Reading code you didn’t write = building a model first.
For AI code:
- Start with the signature: what’s the input, output?
- Read the implementation in order
- Where does each variable come from?
- Where does each side effect happen?
- Does the flow match the intent?
If you don’t deeply understand a generated line: don’t accept it. Either ask Copilot to explain (via a comment), or rewrite manually.
Patterns that catch hallucinated APIs
LSP red squiggles after acceptance: the most reliable signal. Compile errors immediately surface unknown function calls.
For dynamic languages without strong type checking:
# AI suggested
result = my_lib.fancy_helper(data)
# Verify
from my_lib import fancy_helper # explicit import; ImportError if doesn't exist
Or:
print(dir(my_lib)) # see what actually exists
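The seconds you spend verifying save the minutes you’d spend debugging “why does this fail in production but not locally” (typically because the two environments have different versions of the library, so the function exists locally but is missing or has a different signature in production).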
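The same checks can be wrapped in a small helper. This is a sketch using only the standard library (`importlib`, `inspect`, `importlib.metadata`); `check_api` and `installed_version` are hypothetical names, not part of any tool.

```python
import importlib
import inspect
from importlib import metadata


def check_api(module_name, attr_name):
    """Return the callable's signature as a string, or raise if it is missing."""
    module = importlib.import_module(module_name)  # ModuleNotFoundError if absent
    if not hasattr(module, attr_name):
        raise AttributeError(f"{module_name} has no attribute {attr_name!r}")
    obj = getattr(module, attr_name)
    if not callable(obj):
        raise TypeError(f"{module_name}.{attr_name} is not callable")
    try:
        return str(inspect.signature(obj))
    except (TypeError, ValueError):
        # Some C-implemented callables do not expose a signature.
        return "<signature unavailable>"


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

For example, `check_api("json", "dumps")` returns the real parameter list, so you can compare it against the arguments the suggestion passes. Running `installed_version` in both environments is a quick way to spot a version mismatch.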
When Copilot is more reliable than usual
Specific contexts where I’m more confident accepting:
- Code that closely mirrors what’s already in the file (Copilot picks up local idioms)
- Standard library calls (well-trained)
- Heavily-documented libraries (pgx, axios, requests, etc.)
- Boilerplate (mappers, getters, simple constructors)
Less confident:
- New libraries / version-specific code
- Cross-file logic
- Anything async-related
- DSLs or templating
- Security primitives
Pre-commit hygiene
For AI-heavy days, I lean more on tools:
- LSP errors before commit. No squiggles at staging.
- Linter rules (golangci-lint --enable-all, eslint, mypy --strict). Catches off-by-one, missing await, unused imports.
- Unit tests for new code. AI generates tests too; I write at least one I trust to verify the AI-generated code under it.
- Quick benchmark for hot paths. A 5-line benchmark catches O(n²) regressions.
Same pre-commit hygiene as before AI, just more important.
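A quick benchmark of the kind mentioned above might look like the following sketch, using the standard `timeit` module. The duplicate-detection functions are invented examples, and `time_at_sizes` is a hypothetical helper.

```python
import timeit


def has_duplicates_quadratic(items):
    # Hypothetical suggestion: compares every pair, O(n^2).
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    # Set-based version: O(n) on average for hashable items.
    return len(set(items)) != len(items)


def time_at_sizes(func, sizes, repeats=3):
    """Best-of-`repeats` wall time in seconds for `func` at each input size."""
    results = {}
    for n in sizes:
        data = list(range(n))  # no duplicates, so neither version exits early
        timings = timeit.repeat(lambda: func(data), number=1, repeat=repeats)
        results[n] = min(timings)
    return results
```

Compare `time_at_sizes(has_duplicates_quadratic, [1000, 2000])` with the linear version. If the time roughly quadruples when the input doubles, the function is probably quadratic. Exact numbers depend on the machine, so look at the ratio, not the absolute values.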
Pull request hygiene
When a PR contains AI-generated code:
- You don’t need to disclose it unless your team policy requires it (some do for legal/IP reasons)
- Do verify the code passes the same review bar as human-written code
- Don’t treat “Copilot wrote it” as an excuse for sloppiness
For reviewers reviewing your PR: they don’t know it was AI-generated. The code stands on its own.
Speed expectations
I’m faster on greenfield code with Copilot — boilerplate-heavy work is dramatically accelerated. I’m at the same speed (or slower) on:
- Code reviews (AI doesn’t help review)
- Architectural design (AI doesn’t help)
- Debugging (AI sometimes misleads)
- Hot-path performance work (verify-cost is high)
- Security-sensitive code (review-cost is high)
Net: faster overall, but the speedup is concentrated in specific kinds of work.
Common Pitfalls
Tab and ship. Don’t.
Accepting confidence as correctness. AI sounds confident even when wrong.
Skipping tests because “AI wrote it.” The lack of a human author makes tests more important, not less.
Not reading line-by-line. Skim is fine for low-stakes; line-by-line for hot paths.
Treating LSP green as proof. LSP passes for syntactically valid code; doesn’t verify behavior.
Ignoring linter rules. Linters catch a category of AI errors. Run them.
No code review on AI PRs. Self-review by author + Copilot doesn’t replace a second pair of eyes.
Wrapping Up
Review AI code with the same rigor as human code; the failure modes are different but real. Monday: pair programming with AI.