Productivity Metrics That Actually Matter
December 23, 2022 · 5 min read · by Muhammad Amal · programming

TL;DR — Don’t measure lines of code or commits. Use DORA (deploy frequency, lead time, MTTR, change failure rate) or SPACE (satisfaction, performance, activity, communication, efficiency). AI tools shift the inputs, not the outcomes. Outcomes still matter.

After the legal side of AI, this post turns to the operational layer. With AI tools claiming 25-50% productivity gains, the question is “by what measure?” Most existing productivity measures don’t capture those gains well.

What NOT to measure

Lines of code, commits per day, hours worked — all bad. They:

  • Reward verbose code
  • Reward many small commits over fewer thoughtful ones
  • Encourage padding

Worse: AI tools make all of these easier to inflate. A Copilot user can write more lines per hour, but that doesn’t mean they ship more value.

Don’t go there.

DORA — the four metrics

DevOps Research and Assessment (DORA) metrics. Mature; widely adopted.

1. Deployment frequency. How often do you ship to production?

  • Elite: multiple times per day
  • High: between weekly and daily
  • Medium: between weekly and monthly
  • Low: less than monthly

2. Lead time for changes. From commit to production.

  • Elite: < 1 hour
  • High: < 1 day
  • Medium: < 1 week
  • Low: > 1 week

3. Mean Time to Recovery (MTTR). When something breaks, how fast do you fix it?

  • Elite: < 1 hour
  • High: < 1 day
  • Medium: < 1 day
  • Low: > 1 week

4. Change failure rate. What % of deploys cause incidents? (The three lower bands are identical in DORA’s 2021 benchmark; that is not a typo.)

  • Elite: 0-15%
  • High: 16-30%
  • Medium: 16-30%
  • Low: 16-30%

These four cover speed (1, 2) and reliability (3, 4). Improving all four together = real improvement.

AI tools shift #1 and #2 (you ship more often, faster). #3 and #4 depend on judgment AI doesn’t help with directly.
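
As a rough illustration of how the four metrics could be computed from raw records, here is a minimal Python sketch. The `Deploy` and `Incident` records, their field names, and the sample data are hypothetical, and using the median for lead time (rather than the mean) is an assumption of this sketch, not something DORA prescribes here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass
class Deploy:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_incident: bool    # whether this deploy triggered an incident


@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_summary(deploys, incidents, period_days):
    """Compute the four DORA metrics for one team over one period."""
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deploys]
    recoveries = [hours(i.resolved_at - i.started_at) for i in incidents]
    failures = sum(1 for d in deploys if d.caused_incident)
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_h": median(lead_times) if lead_times else None,
        "mttr_h": mean(recoveries) if recoveries else None,
        "change_failure_rate": failures / len(deploys) if deploys else None,
    }


def lead_time_tier(lead_time_h: float) -> str:
    """Map a lead time in hours onto the bands listed for metric 2."""
    if lead_time_h < 1:
        return "Elite"
    if lead_time_h < 24:
        return "High"
    if lead_time_h < 24 * 7:
        return "Medium"
    return "Low"  # assumption: exactly one week counts as Low


# Hypothetical sample: four deploys and two incidents over two days.
t0 = datetime(2022, 12, 1, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(hours=2), False),
    Deploy(t0, t0 + timedelta(hours=4), True),
    Deploy(t0, t0 + timedelta(hours=6), False),
    Deploy(t0, t0 + timedelta(hours=8), False),
]
incidents = [
    Incident(t0, t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=90)),
]

summary = dora_summary(deploys, incidents, period_days=2)
print(summary)
print(lead_time_tier(summary["median_lead_time_h"]))  # High
```

Note that the function takes a whole team's deploys at once; nothing in it is keyed by engineer.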

SPACE — broader framework

For human-side measurement, SPACE:

  • Satisfaction: are engineers happy?
  • Performance: are systems and code working?
  • Activity: are PRs being merged, code being written?
  • Communication & collaboration: are decisions reaching the right people?
  • Efficiency & flow: is work moving without blockers?

Each is tracked with a mix of metrics, surveys, and anecdotes. SPACE pairs well with DORA: DORA for output; SPACE for sustainability.

What changes with AI tools

Output metrics likely improve:

  • Deploy frequency may rise (faster scaffolding)
  • Lead time may shrink
  • Lines of code per dev definitely rises (and matters less)

Outcome metrics might not change:

  • Customer satisfaction
  • Revenue per engineer
  • Number of features shipped that actually got used

If AI makes you 25% faster at writing code but you ship the same features, the productivity gain didn’t translate to outcomes. Common pattern.
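
One possible reason for that pattern is simple arithmetic: writing code is only part of the path from idea to production. The sketch below applies an Amdahl's-law-style calculation. The 30% coding share is a made-up number for illustration, not a measurement, and "25% faster" is read here as a 1.25x speedup on the coding portion only.

```python
def end_to_end_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Overall speedup when only the coding share of the work gets faster."""
    if not 0 <= coding_fraction <= 1:
        raise ValueError("coding_fraction must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Time left after the speedup, as a fraction of the original time.
    remaining = (1 - coding_fraction) + coding_fraction / coding_speedup
    return 1 / remaining


# Hypothetical split: 30% of idea-to-production time is spent writing code.
overall = end_to_end_speedup(coding_fraction=0.30, coding_speedup=1.25)
print(f"{overall:.3f}")  # 1.064, i.e. about 6% faster end to end
```

Under those assumptions, review, CI, release, and waiting time absorb most of the gain.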

A useful dashboard

For a 10-engineer team:

┌─────────────────────────────────────────────────────────────┐
│ Engineering Health — Last 30 Days                            │
├─────────────────────────────────────────────────────────────┤
│ Deploy frequency:      8.2/day      ●●●○ (target: 10/day)   │
│ Lead time:             4.5 hours    ●●●● (target: < 8h)     │
│ MTTR:                  45 min       ●●●● (target: < 2h)     │
│ Change failure rate:   12%          ●●●○ (target: < 15%)    │
├─────────────────────────────────────────────────────────────┤
│ PRs merged:            142          (-5% vs prev period)    │
│ Reviews avg time:      3.2 hours    (target: < 8h)          │
│ Build success rate:    97%          ●●●●                    │
├─────────────────────────────────────────────────────────────┤
│ Engineer satisfaction: 7.2/10       (quarterly survey)      │
│ On-call pages:         18 / week    (-30% vs prev quarter)  │
│ Postmortems published: 3            (process working)       │
└─────────────────────────────────────────────────────────────┘

Three sections: DORA, supporting activity metrics, human / health metrics. Reviewed weekly; trends matter more than absolute numbers.
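
Since trends matter more than absolute numbers, the "vs prev period" annotations are the part worth automating. A minimal sketch of how such a label could be produced follows; the function names are made up for this example, and the previous-period value of 150 is hypothetical.

```python
def pct_change(current: float, previous: float):
    """Percentage change from the previous period, or None if undefined."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100


def trend_label(current: float, previous: float, period: str = "prev period") -> str:
    """Format a change the way the dashboard rows above show it."""
    change = pct_change(current, previous)
    if change is None:
        return "(n/a)"
    return f"({change:+.0f}% vs {period})"


# 142 PRs merged this period against a hypothetical 150 in the previous one.
print(trend_label(142, 150))  # (-5% vs prev period)
```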

What I personally track

Less formal:

  • Time from “idea to user-facing feature” — best signal of overall throughput
  • Number of unplanned interrupts per week — measure of how much OOO / on-call dominates
  • Fraction of week in “deep work” (uninterrupted 90+ min blocks) — leading indicator of sustained output

These aren’t team-level metrics. They’re personal health checks.
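
The deep-work fraction is easy to compute by hand or from a calendar export. Here is a minimal sketch, assuming you already have the length of each uninterrupted block in minutes; the 40-hour week and the sample blocks are assumptions for illustration.

```python
def deep_work_fraction(block_minutes, week_minutes=40 * 60, threshold=90):
    """Share of the working week spent in uninterrupted blocks of 90+ minutes."""
    if week_minutes <= 0:
        raise ValueError("week_minutes must be positive")
    deep = sum(m for m in block_minutes if m >= threshold)
    return deep / week_minutes


# Hypothetical week: only the 120, 95, and 180 minute blocks qualify.
blocks = [120, 45, 95, 30, 180, 60]
fraction = deep_work_fraction(blocks)
print(f"{fraction:.0%}")  # 16%
```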

Anti-patterns

Stack ranking by metrics. Optimizing the metric distracts from the work. Goodhart’s law.

Per-engineer DORA metrics. Deploy frequency is a team metric. Per-engineer breakdown is meaningless and gameable.

Adding tools to track activity. Productivity tracking software (“how many keystrokes per hour”) signals a lack of trust, and engineers respond by typing performatively. Don’t.

AI-tool-specific metrics. “How many Copilot completions accepted?” doesn’t measure productivity. Ignore.

Optimization without buy-in. Imposing DORA metrics top-down without team buy-in = check-the-box compliance. Engineers should want to know the numbers.

When metrics help

For teams looking to identify bottlenecks:

  • Low deploy frequency? CI is slow OR release process is heavy OR fear of breaking things
  • High change failure rate? Test gaps OR poor incident response OR over-aggressive deploys
  • High MTTR? Poor observability OR no on-call rotation OR runbooks missing
  • High lead time? Approval bottleneck OR code review backlog OR slow CI

Each metric points at specific systemic issues. Address those.
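
The mapping above can be written down as a small lookup so that a metrics snapshot produces a list of things to investigate. This is only a sketch: the targets are borrowed from the example dashboard earlier in the post, and the metric key names are hypothetical.

```python
# Candidate causes copied from the list above, keyed by symptom.
CAUSES = {
    "low deploy frequency": ["slow CI", "heavy release process", "fear of breaking things"],
    "high change failure rate": ["test gaps", "poor incident response", "over-aggressive deploys"],
    "high MTTR": ["poor observability", "no on-call rotation", "missing runbooks"],
    "high lead time": ["approval bottleneck", "code review backlog", "slow CI"],
}

# Hypothetical targets, taken from the example dashboard.
TARGETS = {
    "deploys_per_day": 10.0,
    "lead_time_h": 8.0,
    "mttr_h": 2.0,
    "change_failure_rate": 0.15,
}


def suspects(metrics, targets=TARGETS):
    """Return the candidate causes for every metric that misses its target."""
    flagged = {}
    if metrics["deploys_per_day"] < targets["deploys_per_day"]:
        flagged["low deploy frequency"] = CAUSES["low deploy frequency"]
    if metrics["change_failure_rate"] >= targets["change_failure_rate"]:
        flagged["high change failure rate"] = CAUSES["high change failure rate"]
    if metrics["mttr_h"] >= targets["mttr_h"]:
        flagged["high MTTR"] = CAUSES["high MTTR"]
    if metrics["lead_time_h"] >= targets["lead_time_h"]:
        flagged["high lead time"] = CAUSES["high lead time"]
    return flagged


# Values from the example dashboard: only deploy frequency misses its target.
current = {"deploys_per_day": 8.2, "lead_time_h": 4.5, "mttr_h": 0.75, "change_failure_rate": 0.12}
for symptom, causes in suspects(current).items():
    print(f"{symptom}: check {', '.join(causes)}")
# low deploy frequency: check slow CI, heavy release process, fear of breaking things
```

The output is a list of questions to ask the team, not a verdict; the causes are candidates to rule in or out.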

When metrics don’t help

For high-functioning teams already shipping well:

  • Marginal metric improvement is rarely worth the optimization cost
  • Focus on outcomes (customer impact, revenue, growth) instead
  • Engineers’ intuition about “what’s slow” is often more useful than the dashboard

Use metrics as diagnostics. Stop measuring constantly once they’re green.

Common Pitfalls

Lines-of-code metric. Always wrong. AI makes it even more meaningless.

Commits per day. Encourages padding.

Hours worked. Encourages presenteeism.

Per-engineer DORA. Team metric, not individual.

Tools for activity tracking. Trust your team.

Optimization to hit a number. The number isn’t the goal; the underlying improvement is.

Wrapping Up

Measure outcomes more than inputs. DORA + SPACE > LoC + commits. AI tools shift inputs; outcomes still depend on judgment.