What I Learned Containerizing 11 Services in One Month

January 31, 2022 · 6 min read · by Muhammad Amal · programming

TL;DR — Containerize first, refactor second. The boring services come out first; never let the exciting ones tempt you. Build the shared infra (CI cache, Compose stack, schema isolation) once, in week one. Most of “microservices” is operational, not architectural.

A month ago I’d never run this monolith in a container. As of today: monolith containerized, two Go services extracted (notifications + billing), shared Postgres set up with per-service schemas, local-dev stack reduced to docker compose up. Eleven container images in total across the stack, counting databases, caches, and tooling.

This is the retro. What worked, what didn’t, what I’d do differently, and one thing I’d cut entirely if I were starting over. Posting on the last weekday of January as a stake in the ground; February’s theme shifts to Postgres performance tuning and CI/CD with GitHub Actions, building on what’s here.

What worked

Containerizing the monolith before touching any extraction. This was the highest-leverage decision of the month. The first week was unsexy: getting the PHP monolith into a Dockerfile, getting docker compose up working on three different laptops, getting the CI building images. No new architecture. No new services. Just packaging. From week two onward, every subsequent thing — extracting notifications, designing the billing API, setting up gRPC — leaned on the work in week one. If I’d gone microservices-first I’d still be debugging local-dev tooling instead of shipping services.

The shadow → cutover pattern. Both Go services ran on shadow traffic for a week before we flipped authority to them. Notifications had two real bugs we caught only because the shadow service queued things slightly differently. Billing had a timezone bug that would have eaten money in production. Neither would have been caught by tests. The shadow pattern earns its slowness.
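
For readers who haven’t seen the pattern, here is a minimal sketch of the idea rather than the code we actually ran. The Handler type, the Mismatch struct, and the request names are all hypothetical, and a real shadow call would normally run asynchronously so it cannot slow down or fail the primary path; it is synchronous here only to keep the example short.

```go
package main

import "fmt"

// Handler is a hypothetical "process one request" signature.
type Handler func(req string) (string, error)

// Mismatch records a request where the shadow disagreed with the primary.
type Mismatch struct {
	Req     string
	Primary string
	Shadow  string
}

// Shadow returns a Handler that sends every request to both implementations.
// Callers only ever see the primary's result; disagreements go to record.
func Shadow(primary, shadow Handler, record func(Mismatch)) Handler {
	return func(req string) (string, error) {
		pOut, pErr := primary(req)
		sOut, sErr := shadow(req)
		// Compare the outputs and whether each side failed.
		if pOut != sOut || (pErr == nil) != (sErr == nil) {
			record(Mismatch{Req: req, Primary: pOut, Shadow: sOut})
		}
		return pOut, pErr
	}
}

func main() {
	monolith := func(req string) (string, error) { return "queued:" + req, nil }
	extracted := func(req string) (string, error) {
		if req == "invoice-7" { // deliberately wrong, to show a recorded mismatch
			return "dropped:" + req, nil
		}
		return "queued:" + req, nil
	}
	h := Shadow(monolith, extracted, func(m Mismatch) {
		fmt.Printf("mismatch on %s: primary=%q shadow=%q\n", m.Req, m.Primary, m.Shadow)
	})
	h("invoice-6")
	h("invoice-7")
}
```

Cutover is then just swapping which implementation is passed as primary, ideally after the mismatch log has stayed empty for a while.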

Per-service schemas in shared Postgres. Skipping “one cluster per service” saved us probably six weeks of ops work and gave up basically nothing in terms of boundary enforcement. The role + schema isolation is enforced by the database, not by convention. If the volume grows past what this can handle, the migration to separate clusters is straightforward; doing it preemptively would have been waste.
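
To make “enforced by the database” concrete, here is a rough sketch of the kind of statements involved, generated from Go so they can live next to the service code. The _svc role suffix and the IsolationSQL helper are assumptions for illustration, not our actual migration; a real setup also needs passwords or another auth method, plus a one-time decision about the public schema.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe restricts names to plain lowercase identifiers so they can be
// interpolated into DDL without quoting (DDL cannot take bind parameters).
var identRe = regexp.MustCompile(`^[a-z_][a-z0-9_]*$`)

// IsolationSQL returns the statements that give one service its own login
// role and its own schema. The "<service>_svc" role name is an assumption.
func IsolationSQL(service string) ([]string, error) {
	if !identRe.MatchString(service) {
		return nil, fmt.Errorf("unsafe identifier %q", service)
	}
	role := service + "_svc"
	return []string{
		fmt.Sprintf("CREATE ROLE %s LOGIN;", role),
		// The role owns the schema, so only it (and superusers) can use it.
		fmt.Sprintf("CREATE SCHEMA %s AUTHORIZATION %s;", service, role),
		// Explicit, even though new schemas grant nothing to PUBLIC by default.
		fmt.Sprintf("REVOKE ALL ON SCHEMA %s FROM PUBLIC;", service),
		// Unqualified table names resolve inside the service's own schema.
		fmt.Sprintf("ALTER ROLE %s SET search_path = %s;", role, service),
	}, nil
}

func main() {
	for _, svc := range []string{"billing", "notifications"} {
		stmts, err := IsolationSQL(svc)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		for _, s := range stmts {
			fmt.Println(s)
		}
	}
}
```

With roles set up this way, a query from the notifications role against a billing table fails with a permission error instead of relying on code review to catch it.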

gRPC for service-to-service, REST at the edge. Generated clients across Go + Python + PHP saved real glue code. Built-in deadlines saved at least one incident already (a hung downstream that would have cascaded into a billing latency spike). The proto repo is a separate thing with its own versioning; that’s a small investment with a big payoff once the third consumer shows up.
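
The deadline point is easier to see in code. The sketch below uses only the standard library: callDownstream is a hypothetical stand-in for a generated gRPC client method, and the durations are made up. In grpc-go the deadline travels on the same context.Context, so the calling pattern is the same shape.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callDownstream is a hypothetical stand-in for a generated client method.
// It takes `latency` to answer, unless ctx ends first.
func callDownstream(ctx context.Context, latency time.Duration) error {
	t := time.NewTimer(latency)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the budget is spent
	}
}

// chargeWithDeadline gives the downstream call a fixed time budget so a hung
// dependency turns into a fast error instead of a pile-up of waiting requests.
func chargeWithDeadline(parent context.Context, budget, latency time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	return callDownstream(ctx, latency)
}

// demo runs one hung call and one healthy call and returns both results.
func demo() (hung, fast error) {
	hung = chargeWithDeadline(context.Background(), 20*time.Millisecond, time.Hour)
	fast = chargeWithDeadline(context.Background(), 200*time.Millisecond, time.Millisecond)
	return hung, fast
}

func main() {
	hung, fast := demo()
	fmt.Println("hung downstream:", hung)
	fmt.Println("healthy downstream:", fast)
}
```

The caller of the hung dependency gets its error back after the budget, not after an hour, which is the property that stops one stuck service from cascading.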

Distroless + Buildx cache. 14 MB Go images, 35-second CI builds when deps haven’t changed. Pulls are basically instant. The “two-stage Dockerfile” advice everyone gives is correct but understates how much the cache mounts and registry-backed cache layer add on top.

What didn’t

I underestimated how much “ops” the team would have to learn. The codebase work was the easy part. Teaching the team to read structured logs in a JSON aggregator, to understand readiness vs liveness, to debug a request that crossed three services via correlated trace IDs — that’s a weeks-long retraining, and it’s not technically about microservices, it’s about distributed systems. Should have started the training in week one alongside the containerization work, not in week three when people started running into things they didn’t know.
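
The readiness vs liveness distinction is the one I ended up explaining most often, so here is a small sketch of it. The /healthz and /readyz paths follow a common Kubernetes convention and are assumptions here, as are the Service type and the probe helper; none of this is our production handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Service tracks whether its dependencies (database, queues) are reachable.
type Service struct {
	ready int32 // 1 when ready to take traffic; accessed atomically
}

// SetReady is called by whatever checks the dependencies.
func (s *Service) SetReady(ok bool) {
	var v int32
	if ok {
		v = 1
	}
	atomic.StoreInt32(&s.ready, v)
}

// Handler exposes the two probes.
func (s *Service) Handler() http.Handler {
	mux := http.NewServeMux()
	// Liveness: "the process is not wedged". Failing this gets the container restarted.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: "send me traffic". Failing this only removes the pod from rotation.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&s.ready) == 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe sends one in-memory GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &Service{}
	h := s.Handler()
	fmt.Println("before deps are up:", probe(h, "/healthz"), probe(h, "/readyz"))
	s.SetReady(true)
	fmt.Println("after deps are up: ", probe(h, "/healthz"), probe(h, "/readyz"))
}
```

The key design choice is that a dependency outage fails readiness only. If it failed liveness too, the orchestrator would restart a healthy process that is merely waiting on Postgres.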

Two-day detour into a service mesh. I spent a day reading about Istio and another spinning up Linkerd in a test cluster, convinced we needed it for mTLS and traffic shaping. We didn’t. Direct gRPC + Kubernetes Services + a pair of NetworkPolicies covered every case. Mesh might earn its complexity at 30+ services. At 3 it’s a tax.

Composer + opcache + FPM tuning took longer than expected. Two full days of “the container starts fine but production load makes FPM queue requests.” The fix was tuning pm.max_children against the container memory limit. I had read about this. I had even noted it as a pitfall in the production Dockerfile post. I still hit it. Some lessons you have to learn at 2 AM.
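
The arithmetic behind that tuning is simple enough to write down. This is a back-of-the-envelope sketch, not a formula from the PHP-FPM docs: the MaxChildren helper is hypothetical, and the 512/128/48 MiB figures are placeholder numbers, not our measurements. Measure real per-worker memory under load before trusting any result.

```go
package main

import "fmt"

// MaxChildren estimates pm.max_children for a memory-limited container.
// limitMiB is the container memory limit, reservedMiB is headroom for the
// FPM master, opcache shared memory and anything else in the container,
// and perWorkerMiB is the measured resident size of one busy worker.
func MaxChildren(limitMiB, reservedMiB, perWorkerMiB int) int {
	if perWorkerMiB <= 0 {
		return 1
	}
	n := (limitMiB - reservedMiB) / perWorkerMiB
	if n < 1 {
		return 1 // FPM needs at least one worker
	}
	return n
}

func main() {
	// Placeholder numbers: 512 MiB limit, 128 MiB reserved, 48 MiB per worker.
	fmt.Println("pm.max_children =", MaxChildren(512, 128, 48))
}
```

Set the value too high and the container gets OOM-killed under load; too low and FPM queues requests while memory sits idle, which is the symptom described above.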

Premature standardization on a “platform library.” Started writing a shared Go library for logging + config + health checks across services. Got 80% through and realized two services had genuinely different needs and the abstraction was already leaking. Reverted to copy-paste-and-evolve. Library extraction is a Year 2 problem, not a Month 1 problem.

What I’d do differently

Set up tracing on day one, not month three. I held off on Jaeger / OpenTelemetry because “we’re not big enough yet.” That was wrong. The cost of adding distributed tracing once you have N services is N times the cost of adding it when you have 1. Spans on the monolith from day one would have made everything since easier.

Adopt make targets at the team level immediately. A Makefile with make up, make down, make test, make logs, make seed saves new-hire confusion and reduces the surface area people have to know. We added it in week three; should have been week zero.

Write the runbook as the work is happening. Not after. Every operational thing I learned this month — “if billing’s readiness flaps, check Postgres connection saturation” — I now have to reconstruct from memory and Slack scrollback. Should have been writing the runbook entries the same day I learned them.

Pick the third microservice before extracting the second. Knowing what’s next clarifies your shared concerns. We extracted billing without knowing whether service #3 would be inventory or auth, and the proto repo / shared CI patterns we set up assumed billing’s specific shape. With a clear next target, those decisions would have generalized better.

What I’d skip entirely

The “should we use Kubernetes vs ECS vs Nomad” debate. Skip it. Whatever your team already runs containers on is the right answer. The 5% efficiency difference between platforms is dwarfed by the cost of context-switching. We had Kubernetes; we kept it. If you have Cloud Run or ECS, use it. The principles in this month’s posts work on any of them.

What February looks like

Theme: Postgres performance tuning, advanced indexing, and CI/CD with GitHub Actions. The shape of the month: one week on Postgres tuning (we’ll need it once these services start hitting real traffic), one week on indexing strategies, one week on CI/CD patterns specific to multi-repo deploys, one week on monitoring + alerting that ties the whole thing together.

If there’s anything from January you’d like me to go deeper on, drop a comment (once comments ship; they’re not live on this site yet) or email me. The wrap-up of this month is the start of the next.
