background-shape
The 2023 LLM Tooling Retrospective, What Actually Changed About My Workflow
December 27, 2023 · 10 min read · by Muhammad Amal programming

TL;DR — LLM tools became real this year for senior engineering work, not just boilerplate / the wins were in reading, exploring, and reviewing — not in writing / what stuck was the stuff that respected my context, not the stuff that tried to replace me.

A year ago I was using GitHub Copilot for autocomplete and not much else. Today my dev setup includes Copilot Chat, Cursor, two different chat assistants in browser tabs, and a small CLI script that pipes diffs into an LLM for first-pass review. It’s been a strange year. Not all of these will survive 2024, and I want to write down what I actually use, what I tried and dropped, and what I think changed about the practice of senior engineering work.

This is the eighth post in my December series and the first of two retrospectives. Tomorrow’s post zooms out further — a year-end retro on leadership and team practice. This one is narrower: tools, workflow, and the question of where LLMs are and aren’t pulling weight in my day-to-day.

I’ll be honest about what worked and what didn’t. There’s a strong cultural pull to be either an LLM evangelist or an LLM skeptic this year. I’m neither. The tools are useful, they’re also flawed, and the right framing for a senior engineer is the same as for any other tool — what’s the marginal value, what’s the failure mode, when do I reach for it.

What Shipped in 2023, Briefly

The LLM tooling landscape in 2023 moved fast enough that even people working in the space lost track of releases. A non-exhaustive list of things that changed how I work:

  • ChatGPT API (March 2023): GPT-3.5-turbo at low enough cost to actually script against. The first time “throw an LLM at it” became a defensible architectural choice for production features.
  • GPT-4 (March 2023): A real step up in reasoning quality. Slow and expensive, but the quality difference for non-trivial tasks was obvious from day one.
  • GitHub Copilot Chat (GA in late 2023 after preview): Chat interface inside the editor, with codebase awareness. Different from the autocomplete experience, more useful for architectural questions.
  • Cursor (rising throughout 2023): Fork of VS Code with deeper LLM integration. The “Apply” flow and codebase-wide context handling changed how I navigate unfamiliar code.
  • GPT-4 Turbo and DevDay (November 6, 2023): Longer context (128k), lower price, JSON mode, function calling improvements. The economics of LLM-in-the-loop tooling shifted significantly.
  • Anthropic’s Claude 2 / 2.1 (mid-late 2023): The 100k+ context window made some workflows possible that hadn’t been before — feeding entire codebases or long technical documents into a single conversation.

That’s a lot of shipping in twelve months. The signal in the noise, for me, was that the second half of the year was when these tools stopped being toys for boilerplate generation and started being useful for actual senior engineering work.

What I Use Now (and Why)

Here’s my current setup, with honest assessments of where each tool earns its keep.

Copilot autocomplete

Still on. Mostly invisible. Saves me from typing things I already know how to write — boilerplate test scaffolds, repetitive type signatures, the body of a function I’ve already named clearly. I disable it when I’m thinking, because the suggestions become noise. It’s the most useful when I’m tired and doing mechanical work; least useful when I’m doing the kind of work that’s worth doing as a senior engineer.

Honest take: I’d miss it if it went away, but the daily-value contribution is small and stable. It’s a 5-10% productivity tool, not a transformative one. Anyone telling you it’s 10x is selling something.

Copilot Chat / Cursor for codebase exploration

This is where the year’s biggest change landed for me. Walking into an unfamiliar codebase — a client project, a service I haven’t touched in two years, an open source library — used to take an hour or two of grep, file-tree skimming, and tracing. With Cursor’s codebase-aware chat, I can ask “where is authentication handled in this repo” or “show me the entry point for the worker pipeline” and get directly to the relevant files in under a minute.

This is the tooling shift that genuinely changed my workflow. Reading code went from a chore to a fast lookup. Cursor’s “Apply” flow — where you can preview LLM-suggested edits as a diff before accepting — is the right interaction model. I don’t trust the model to write production code directly. I trust it to propose, and I review the diff.

GPT-4 / Claude for design conversations

When I’m thinking through an architecture problem, I increasingly use an LLM as a rubber duck. Not because it has better answers than I do — it usually doesn’t, on senior-level work — but because the act of articulating the problem clearly enough for the model to respond forces my own thinking. The model occasionally surfaces an angle I hadn’t considered, mostly by virtue of being trained on a lot of similar conversations.

What I learned to do: paste the actual context (the RFC, the error message, the schema) into the conversation. Hand-waved prompts get hand-waved answers. With concrete artifacts in context, the responses get noticeably more useful. This is also where the longer context windows from late 2023 matter — being able to drop in a whole module and ask “what’s the worst design smell here” is a different conversation than asking abstractly.

LLM-assisted code review

I wrote a small script that pipes a git diff to an LLM with a “review this change as a senior engineer would” prompt. It runs locally, takes maybe ten seconds, and produces a first-pass review of my own changes before I push the PR. About 30% of the comments are useful — usually catching things like missing error handling, naming inconsistencies, or test gaps. 70% are noise or wrong. But running it costs me ten seconds and catches enough that I keep using it.

I do not use it on others’ PRs. The signal-to-noise is too low to trust as feedback to another human, and there’s something quietly insulting about an LLM-generated code review showing up on someone’s PR. Reserve it for self-review.

Documentation and writing

LLMs are very good at first drafts of documentation, decision logs, and writeups. I use them constantly for that. The trick is to feed them my own raw notes — bullet points, half-sentences, code snippets — and ask for a structured draft. I then rewrite in my own voice. The LLM is doing the structural work; I’m doing the judgment work.

This is how I’ve ended up writing more, not less, despite spending less time on each piece. The activation energy for a writeup dropped because I no longer have to face the blank page.

What I Tried and Dropped

Not everything stuck. A few things I picked up during 2023 and abandoned:

Auto-PR generators. Tools that auto-generate PR descriptions from diffs. The output was always either too generic (“This PR updates the authentication module”) or hallucinated context (“This change addresses the long-standing concern about…”). I write my own PR descriptions. It takes two minutes and they’re honest.

Agent-style coding tools. Several products this year promised to take a ticket and produce a PR autonomously. I tried a few. On non-trivial tickets, the output was unusable — it would produce something that looked right but missed the actual constraints. On trivial tickets, I could’ve done it in less time than it took to set up the agent. The agent paradigm needs another generation to be useful for production work, in my opinion.

LLM-based commit message generators. Same problem as PR generators. The model can describe what changed, but it can’t describe why, which is the only part of a commit message worth writing. I write my own.

Generic chat as a search replacement. Early in the year I tried using ChatGPT instead of Google for technical questions. It works for stable, well-known topics — Python stdlib, classic algorithms, common libraries. It’s actively dangerous for anything new, fast-moving, or version-specific. I went back to a mix of docs, source code, and targeted search.

What This Changes About Senior Engineering Work

The thing I find myself thinking about most is this: LLMs are very good at the exploratory phase of work — reading, asking, prototyping — and much less useful at the committing phase — designing, deciding, owning. That maps cleanly onto what’s already true about senior engineering work. Most of the value is in the second phase. The first phase is necessary but not where the leverage is.

What this means practically:

  • The barrier to exploring an unfamiliar area dropped. I’ll spike a prototype in a domain I don’t know before declining the work. I’ll skim a library’s source to evaluate it before adopting. The exploration cost is down by maybe 5x.
  • The bar for written artifacts went up, slightly, because first drafts are cheaper. Internal docs, RFCs, and decision logs are more numerous and slightly higher quality on the teams I work with. (More on this in my post on RFCs and decision logs.)
  • The value of clear thinking and clear writing went up, not down. The bottleneck moved from “can you produce a document” to “can you produce a document worth reading.” The latter is harder.
  • Code review remains a human activity, but the prep work — running a self-review, scanning for obvious issues — is partially automated.

The senior engineers who’ll do best in the next few years, I think, are the ones who treat LLMs as power tools — useful for specific tasks, dangerous if you stop paying attention, never substitutes for judgment. The ones who’ll struggle are at both extremes: full evangelists who let the tools make decisions, and full skeptics who refuse to engage at all.

Common Pitfalls

Trusting LLM output as fact. This bites people most often with library APIs and version-specific behavior. The model will confidently describe a function that doesn’t exist, or describe one that exists in a different version. Always verify against actual docs or actual code.

Over-prompting. Long, elaborate prompts often produce worse results than short, concrete ones. If you can’t explain what you want in two or three sentences, the model probably can’t either. Iterate the prompt, don’t bloat it.

Letting LLMs make decisions instead of inform them. The model can list tradeoffs. The model cannot weight them — that requires context about your team, your business, and your constraints. Treat output as input to your judgment, not as the decision itself.

Confusing speed with quality. Producing a 600-word writeup in two minutes feels productive. If the writeup is wrong, you’ve just produced two minutes’ worth of harm at unprecedented speed. The fastest path is rarely the right path for senior-level work.

Hidden context drift. When an LLM has been in a long conversation, it loses track of earlier context, especially constraints you stated near the start. Restart the conversation periodically, or paste key context back in. The longer the conversation, the more aggressively you have to re-anchor.

Wrapping Up

2023 was the year LLM tooling crossed from “interesting if you squint” to “actually useful in daily senior engineering work.” The wins concentrated in reading, exploring, and reviewing rather than in writing production code. My workflow looks different than it did twelve months ago, mostly in subtle ways. I expect 2024 to bring another step change, probably in agentic coding tools — I’ll be watching, but my baseline assumption is that the wins will again be in the exploratory phase rather than the committing phase.

Tomorrow, the proper year-end retro: leadership, team practice, and what 2023 taught me beyond tooling.