Advanced GitHub Actions: Reusable Workflows, OIDC, and Matrix Patterns That Don't Become Spaghetti
June 23, 2023 · 8 min read · by Muhammad Amal programming

TL;DR — Stop copy-pasting .github/workflows/ci.yml across repos. Use reusable workflows for org-wide CI logic. / OIDC trust to AWS/GCP/Azure eliminates long-lived AWS_ACCESS_KEY_ID secrets — set it up once, never rotate again. / Composite actions handle the small-but-shared steps; matrix strategies need fail-fast: false and max-parallel to be usable at scale.

GitHub Actions stopped being “Jenkins but easier” around the time reusable workflows landed. The platform now has enough primitives to build a real CI/CD layer for an org, but most teams I see are still treating each repo’s ci.yml like an island. Below is the pattern I push every team toward: a single platform repo that holds the shared CI logic, OIDC-based cloud auth, and a small set of well-versioned composite actions.

This is the seventh post in the platform engineering series. For the GitOps side that consumes these CI artifacts, see ArgoCD ApplicationSets at scale. For progressive delivery downstream, see Argo Rollouts and Flagger.

Reusable workflows: one definition, many callers

A reusable workflow lives in a repo and is called by other workflows via uses: org/repo/.github/workflows/file.yml@ref. The caller workflow looks small; the called workflow does the work. This is exactly what you want for “every Go service should run the same lint, test, build, scan, push pipeline.”

The platform repo’s reusable workflow:

# acme/platform-ci/.github/workflows/go-service-ci.yml
name: Go Service CI
on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
      go-version:
        required: false
        type: string
        default: '1.20'
      push-image:
        required: false
        type: boolean
        default: false
    secrets:
      registry-token:
        required: false

permissions:
  contents: read
  id-token: write
  packages: write

jobs:
  test:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with:
          go-version: ${{ inputs.go-version }}
          cache: true
      - run: go test ./... -race -coverprofile=cover.out
      - uses: actions/upload-artifact@v3
        with:
          name: coverage
          path: cover.out

  lint:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: golangci/golangci-lint-action@v3
        with:
          version: v1.53

  build:
    needs: [test, lint]
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v2
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-build
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v1
        id: ecr
      - uses: docker/build-push-action@v4
        with:
          context: .
          push: ${{ inputs.push-image }}
          tags: |
            ${{ steps.ecr.outputs.registry }}/${{ inputs.service-name }}:${{ github.sha }}
            ${{ steps.ecr.outputs.registry }}/${{ inputs.service-name }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

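  scan:
    needs: build
    # Skip on PR builds (nothing was pushed); each job is a fresh runner, so log in to ECR again.
    if: ${{ inputs.push-image }}
    runs-on: ubuntu-22.04
    steps:
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-build
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v1
        id: ecr
      - uses: aquasecurity/trivy-action@0.11.2
        with:
          image-ref: ${{ steps.ecr.outputs.registry }}/${{ inputs.service-name }}:${{ github.sha }}
          severity: 'CRITICAL,HIGH'
          exit-code: '1'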

And the per-service caller workflow shrinks to:

# acme/checkout/.github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main]
  pull_request:

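jobs:
  ci:
    # A called workflow can only use permissions the caller grants,
    # so the OIDC exchange needs id-token: write set here.
    permissions:
      contents: read
      id-token: write
      packages: write
    uses: acme/platform-ci/.github/workflows/go-service-ci.yml@v1
    with:
      service-name: checkout
      go-version: '1.20'
      push-image: ${{ github.event_name == 'push' }}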

The win: when the platform team updates the Trivy version, fixes a flaky test runner, or adds an SBOM step, every service gets it on the next CI run. No 80-PR campaign across stream-team repos.

Pin reusable workflows to versions, not main

@v1 above is a release tag. Use git tags (or, even better, immutable SHAs) for reusable workflow refs. @main means every consumer’s next CI run executes whatever was last pushed to the platform repo. That includes the in-progress fix that breaks everything.

Cut a release of platform-ci like any other product. Semver. Changelog. Major version bumps when you change the input contract. Consumers can pin to @v1 for the major and get minor and patch updates automatically, as long as you move the v1 tag forward on each release, or pin to a SHA for full immutability.
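As a rough sketch of the two pinning styles side by side, a caller could look like the following. The job names are illustrative and the SHA is a made-up placeholder, not a real platform-ci commit.

```yaml
# acme/checkout/.github/workflows/ci.yml (caller side, illustrative)
name: CI
on:
  pull_request:

permissions:
  contents: read
  id-token: write
  packages: write

jobs:
  ci-tracking-major:
    # Runs whatever commit the v1 tag currently points at.
    uses: acme/platform-ci/.github/workflows/go-service-ci.yml@v1
    with:
      service-name: checkout

  ci-fully-pinned:
    # Placeholder SHA for illustration; substitute a real 40-character commit SHA.
    uses: acme/platform-ci/.github/workflows/go-service-ci.yml@0123456789abcdef0123456789abcdef01234567
    with:
      service-name: checkout
```

The tag form trades immutability for convenience: it changes when the platform team re-points v1. The SHA form changes only when someone edits the consumer repo.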

OIDC: kill the long-lived AWS keys

Until OIDC trust landed, the standard pattern was an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY stored as repo secrets. Every CI run had cloud credentials in environment variables, the keys had to be rotated, and leaked credentials in build logs were a recurring problem. OIDC makes this go away entirely.

GitHub Actions issues a JWT per workflow run, signed by GitHub. AWS IAM trusts GitHub’s OIDC provider and exchanges the JWT for a short-lived STS session. No long-lived keys. The trust policy on the IAM role looks like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:acme/*:ref:refs/heads/main"
        }
      }
    }
  ]
}

The sub condition is the security boundary. The pattern above lets any workflow on main of any repo in the acme org assume the role. For sensitive roles (prod deploy), narrow it: repo:acme/checkout:environment:prod requires the workflow run to be against the prod GitHub Environment, which has its own approval gates.

The GitHub docs at docs.github.com document the sub claim format exhaustively. Read it carefully — getting the trust policy wrong gives any repo in the org access to prod.
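To make the claim format concrete, here is a sketch of how job configuration changes the sub value GitHub puts in the token, following the documented formats. It assumes the workflow lives in acme/checkout and reuses the role ARNs from this post; the workflow and job names are illustrative.

```yaml
name: OIDC subject examples
on:
  push:
    branches: [main]

permissions:
  id-token: write   # without this no OIDC token is issued
  contents: read

jobs:
  build:
    # No environment on the job:
    #   sub = repo:acme/checkout:ref:refs/heads/main
    runs-on: ubuntu-22.04
    steps:
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-build
          aws-region: us-east-1

  deploy:
    # Job bound to an environment:
    #   sub = repo:acme/checkout:environment:prod
    # The branch no longer appears in the claim, so a ref-based
    # trust condition will not match this job.
    needs: build
    runs-on: ubuntu-22.04
    environment: prod
    steps:
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy-prod
          aws-region: us-east-1
```

Runs triggered by pull_request get repo:acme/checkout:pull_request as the subject, so they do not match a refs/heads/main pattern either.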

GCP and Azure have equivalent OIDC integrations. Use them. Long-lived service account keys belong in 2018.
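For comparison, a minimal sketch of the same keyless pattern on the other two clouds, using the google-github-actions/auth and azure/login actions. The project number, pool, provider, service account, and secret names below are all hypothetical; the federation itself has to be configured on the cloud side first.

```yaml
name: Keyless cloud auth (sketch)
on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  gcp:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4   # the auth action expects a checkout before it runs
      - uses: google-github-actions/auth@v1
        with:
          # Hypothetical Workload Identity Federation pool/provider and service account.
          workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/github
          service_account: ci-builder@acme-platform.iam.gserviceaccount.com

  azure:
    runs-on: ubuntu-22.04
    steps:
      - uses: azure/login@v1
        with:
          # These are identifiers, not credentials; the secret names are hypothetical.
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
```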

Composite actions for the small shared pieces

Reusable workflows are heavy. For small composable pieces — “set up the cache, fetch a vault secret, post a Slack notification” — composite actions are the right grain.

# acme/platform-actions/notify-slack/action.yml
name: Notify Slack
inputs:
  webhook-url:
    required: true
  status:
    required: true
  service:
    required: true
runs:
  using: composite
  steps:
    - shell: bash
      # Pass inputs through env instead of interpolating them into the script,
      # so a crafted input value cannot inject shell commands.
      env:
        STATUS: ${{ inputs.status }}
        SERVICE: ${{ inputs.service }}
        WEBHOOK_URL: ${{ inputs.webhook-url }}
      run: |
        emoji=":white_check_mark:"
        if [[ "$STATUS" != "success" ]]; then emoji=":x:"; fi
        payload=$(jq -nc \
          --arg svc "$SERVICE" \
          --arg sha "${GITHUB_SHA::7}" \
          --arg actor "$GITHUB_ACTOR" \
          --arg emoji "$emoji" \
          --arg url "$GITHUB_SERVER_URL/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID" \
          '{text: ("\($emoji) *\($svc)* deploy by \($actor) at \($sha) - <\($url)|details>")}')
        curl -fsS -X POST -H 'Content-Type: application/json' \
          -d "$payload" "$WEBHOOK_URL"

Call it from any workflow: uses: acme/platform-actions/notify-slack@v2. Composite actions are not isolated runners — they execute in the caller’s job, sharing the workspace. That makes them fast and easy to chain.
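A minimal sketch of a caller, assuming a webhook stored in a secret named SLACK_WEBHOOK_URL and a deploy script at ./scripts/deploy.sh (both hypothetical):

```yaml
name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh          # hypothetical deploy step
      - name: Notify Slack
        if: always()                      # report failures too, not only successes
        uses: acme/platform-actions/notify-slack@v2
        with:
          webhook-url: ${{ secrets.SLACK_WEBHOOK_URL }}   # hypothetical secret name
          status: ${{ job.status }}       # success, failure, or cancelled
          service: checkout
```

Because the action runs inside the caller's job, job.status reflects the steps that ran before it, which is what the success check in the action relies on.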

Matrix patterns that don’t get out of hand

Matrix builds turn 1-job workflows into 30-job workflows in one line. Useful, but fail-fast and max-parallel deserve more attention than they get.

strategy:
  fail-fast: false
  max-parallel: 4
  matrix:
    go-version: ['1.19', '1.20']
    os: [ubuntu-22.04, ubuntu-20.04]
    include:
      - go-version: '1.20'
        os: ubuntu-22.04
        coverage: true
    exclude:
      - go-version: '1.19'
        os: ubuntu-20.04

  • fail-fast: false. Without this, one failing combination cancels the others. For test matrices you usually want to see all failures, not just the first.
  • max-parallel: 4. Caps concurrent jobs. Set this when you have many matrix dimensions or run on self-hosted runners with limited capacity. Without it, a 30-cell matrix tries to start all 30 jobs at once, as far as runner availability allows.
  • include adds extra combinations or, when an entry matches an existing combination as it does here, attaches extra keys to it (coverage: true on the Go 1.20 / ubuntu-22.04 cell). exclude drops combinations from the cartesian product. Use them when the matrix doesn’t quite match what you want.
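To show where the extra coverage key ends up, here is a sketch of a job that consumes the strategy above. The steps are illustrative and reuse the test command from the reusable workflow earlier in the post.

```yaml
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      max-parallel: 4
      matrix:
        go-version: ['1.19', '1.20']
        os: [ubuntu-22.04, ubuntu-20.04]
        include:
          - go-version: '1.20'
            os: ubuntu-22.04
            coverage: true
        exclude:
          - go-version: '1.19'
            os: ubuntu-20.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with:
          go-version: ${{ matrix.go-version }}
      - run: go test ./... -race -coverprofile=cover.out
      # matrix.coverage is unset (falsy) in every cell except the one extended by include.
      - if: ${{ matrix.coverage }}
        uses: actions/upload-artifact@v3
        with:
          name: coverage
          path: cover.out
```

This expands to three jobs: 1.19 on ubuntu-22.04, 1.20 on ubuntu-22.04 (with coverage upload), and 1.20 on ubuntu-20.04.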

For dynamic matrices (e.g., “test against every service that changed in this PR”), generate the matrix in a setup job and pass it via needs.setup.outputs.services:

jobs:
  setup:
    runs-on: ubuntu-22.04
    outputs:
      services: ${{ steps.changed.outputs.services }}
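    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so origin/main...HEAD can find a merge base
      - id: changed
        run: |
          # "|| true" keeps the step green when grep matches nothing; jq then emits [].
          services=$(git diff --name-only origin/main...HEAD \
            | { grep -oP '^services/\K[^/]+' || true; } | sort -u | jq -R . | jq -sc .)
          echo "services=$services" >> "$GITHUB_OUTPUT"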

  test:
    needs: setup
    if: needs.setup.outputs.services != '[]'
    strategy:
      matrix:
        service: ${{ fromJson(needs.setup.outputs.services) }}
    runs-on: ubuntu-22.04
    steps:
      - run: echo "Testing ${{ matrix.service }}"

This pattern shrinks PR CI time enormously in monorepos. Only the changed services get tested.

Environments and approvals: where deployment lives

For prod deploys, GitHub Environments are the gate. An environment can require manual approval, restrict to specific branches, and hold its own secrets. Plus it integrates with the OIDC sub claim above.

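deploy-prod:
  needs: build
  runs-on: ubuntu-22.04
  permissions:
    id-token: write   # required for the OIDC token exchange
    contents: read
  environment:
    name: prod
    url: https://checkout.acme.io
  steps:
    - uses: aws-actions/configure-aws-credentials@v2
      with:
        role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy-prod
        aws-region: us-east-1
    - uses: aws-actions/amazon-ecr-login@v1
      id: ecr
    # Assumes kubectl is already pointed at the target cluster (kubeconfig setup omitted).
    - env:
        ECR: ${{ steps.ecr.outputs.registry }}
      run: |
        kubectl set image deployment/checkout \
          checkout=${ECR}/checkout:${{ github.sha }} -n checkout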

Configure the prod environment with required reviewers (people or a team), a deployment branch restriction (main only), and a wait timer if you want a short window to abort. The timer is set in whole minutes, so the smallest delay is one minute rather than 30 seconds. The approval is logged on the workflow run page: auditable, not Slack-message-only.

In a GitOps world this job usually doesn’t kubectl directly; it bumps an image tag in the deploy repo and lets ArgoCD pick it up. Either way, the environment gate is the right place to enforce the approval.
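A sketch of what the GitOps variant of that job might look like, under several assumptions that are not from this post: a deploy repo named acme/deploy, a token secret named DEPLOY_REPO_TOKEN with write access to it, a kustomize overlay at apps/checkout/overlays/prod whose image is named checkout, and kustomize being available on the runner.

```yaml
jobs:
  deploy-prod:
    needs: build
    runs-on: ubuntu-22.04
    environment:
      name: prod
      url: https://checkout.acme.io
    steps:
      - uses: actions/checkout@v4
        with:
          repository: acme/deploy                   # hypothetical deploy repo
          token: ${{ secrets.DEPLOY_REPO_TOKEN }}   # hypothetical token with write access
      - env:
          IMAGE: 123456789012.dkr.ecr.us-east-1.amazonaws.com/checkout
        run: |
          cd apps/checkout/overlays/prod            # hypothetical overlay path
          # Point the "checkout" image at the tag built from this commit.
          kustomize edit set image "checkout=${IMAGE}:${GITHUB_SHA}"
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git commit -am "checkout: deploy ${GITHUB_SHA::7}"
          git push
```

The environment gate still applies: the commit to the deploy repo happens only after the prod approval, and ArgoCD does the actual rollout.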

Common Pitfalls

  • Reusable workflows that take 12 inputs. That’s a sign the workflow is doing too much. Split it into two reusable workflows that compose, not one monolith.
  • No version pinning on third-party actions. uses: some-action/foo@v1 lets the maintainer change v1 under you. Pin to a full SHA for any action that touches secrets. The action ecosystem has had supply-chain incidents.
  • OIDC trust policy too broad. A sub pattern of repo:acme/* lets every repo in the org assume the role. Scope by repo and by environment for prod-touching roles.
  • Workflow concurrency unbounded. Add concurrency: { group: "${{ github.workflow }}-${{ github.ref }}", cancel-in-progress: true } on PR workflows. The quotes matter: an unquoted ${{ is not valid inside a YAML flow mapping. Without the concurrency setting, every push to a PR starts another run while the old one keeps going.
  • Caching without keys that invalidate. actions/cache is great until you have a stale cache from six months ago. Include lockfile hashes and tool versions in the cache key.
  • Self-hosted runners on the default network. A self-hosted runner that can reach internal infrastructure and is also shared with public repos is a lateral movement risk. Use ephemeral runners that handle one job and are then discarded, for example Actions Runner Controller (ARC) runner scale sets on Kubernetes.
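  • Secrets in workflow logs. GitHub redacts exact secret values wherever they appear in the log, but not values derived from them. echo "$DB_PASSWORD" | base64 prints the encoded password unmasked, and the same goes for URL-encoded or otherwise transformed copies. Register derived values with the ::add-mask:: workflow command before anything can print them, and be careful with values assembled from secrets.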
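Two of the pitfalls above are easiest to see in block-style YAML. The following is a sketch, with an illustrative GO_VERSION variable and the standard Go cache directories on Linux; adjust the paths and lockfile for other toolchains.

```yaml
name: CI
on:
  pull_request:

# Cancel the superseded run when a new commit lands on the same ref.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ubuntu-22.04
    env:
      GO_VERSION: '1.20'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with:
          go-version: ${{ env.GO_VERSION }}
          cache: false   # built-in caching off so the explicit key below is the only cache
      - uses: actions/cache@v3
        with:
          path: |
            ~/.cache/go-build
            ~/go/pkg/mod
          # Tool version and lockfile hash both appear in the key, so bumping
          # either one starts from a fresh cache instead of a stale one.
          key: go-${{ runner.os }}-${{ env.GO_VERSION }}-${{ hashFiles('**/go.sum') }}
      - run: go test ./...
```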

Wrapping Up

GitHub Actions in 2023 is a real CI/CD platform if you treat it like one. Centralize logic in reusable workflows, kill long-lived cloud credentials with OIDC, gate prod with Environments. The cost is a small platform repo and the discipline to version it.

Final post in this series: cluster cost engineering with Karpenter and KEDA. Because all of this CI/CD machinery scheduling pods has a bill at the end of the month.