Cluster Cost Engineering, Karpenter, KEDA, and the End of Static Node Groups
June 27, 2023 · 8 min read · by Muhammad Amal

TL;DR

  • Static node groups plus cluster-autoscaler is a cost ceiling, not a cost strategy.
  • Karpenter picks instance types per workload, bin-packs aggressively, and consolidates spare capacity.
  • KEDA scales workloads to zero based on real signals (queue depth, event count), not just CPU.
  • Cost engineering needs metrics (OpenCost) wired to teams, not just a Grafana dashboard nobody reads.

The first time I looked at a Kubernetes bill closely, I found we were running 40% over-provisioned across the fleet. Every team had asked for “a bit more headroom,” every node group was 60% utilized on average, and the spot instance story was three Slack threads and a comment in someone’s Notion. None of this was unusual. Most clusters I see today have the same shape.

Karpenter and KEDA are the two pieces that flipped this for me. Together they take cluster capacity from “ops decision made quarterly” to “scheduler-level optimization made every few minutes.” If your bill is meaningful — and at 50+ engineers and prod traffic it is — the savings pay for the migration in under a month.

This is the final post in the June series. For the GitOps and progressive delivery layers, see progressive delivery with Argo Rollouts and advanced GitHub Actions patterns.

Why cluster-autoscaler isn’t enough

The classic cluster-autoscaler reads pending pods and adds nodes to a pre-defined Auto Scaling Group. It does what it says: if pods are pending, add more of whatever instance type the ASG specifies. That works. It also ties each ASG to one instance shape (mixed-instance ASGs only behave when every type has the same CPU and memory), encourages a small number of large node groups (because managing many ASGs is painful), and has no opinion about cost.

The failure modes show up at scale:

  • A pod that needs most of an m5.4xlarge (16 vCPUs) lands on one, and the next pod, requesting 200m of CPU, also gets an m5.4xlarge because that’s the only type the ASG offers. You’re paying full price for about 1% utilization.
  • Spot instances exist in a separate ASG, the workload labels are inconsistent, and the priority expander is “best effort.” You drift back onto on-demand without noticing.
  • Scaling down is conservative. Empty nodes hang around for 10+ minutes “in case.” Across hundreds of nodes that’s real money.

Karpenter rethinks this. It’s a node-provisioning controller that looks at pending pods, asks “what’s the cheapest combination of instance types that fits these pods,” and provisions them directly. No ASG. No predefined instance type list per workload class.

Karpenter in practice

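Karpenter’s v1beta1 API, which shipped in v0.32, configures capacity through NodePool (formerly Provisioner) and EC2NodeClass (formerly AWSNodeTemplate) resources; releases such as 0.29 from mid-2023 still use the older pair: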

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        node-tier: shared
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64, arm64]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [c, m, r]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['4']
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
        - key: topology.kubernetes.io/zone
          operator: In
          values: [us-east-1a, us-east-1b, us-east-1c]
      nodeClassRef:
        name: default
      taints: []
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  role: KarpenterNodeRole-acme
  subnetSelectorTerms:
    - tags: { karpenter.sh/discovery: acme }
  securityGroupSelectorTerms:
    - tags: { karpenter.sh/discovery: acme }
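  blockDeviceMappings:
    # Bottlerocket keeps the OS on /dev/xvda; images, logs, and pod storage live on /dev/xvdb
    - deviceName: /dev/xvda
      ebs: { volumeType: gp3, volumeSize: 4Gi, encrypted: true }
    - deviceName: /dev/xvdb
      ebs: { volumeType: gp3, volumeSize: 100Gi, encrypted: true }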

Three things doing heavy lifting here:

  • Broad instance type requirements. Karpenter picks from the c, m, and r families, generation 5 and newer, on both architectures. The result: it launches the cheapest on-demand instance that satisfies the pods’ requests, and for spot it uses EC2’s price-capacity-optimized allocation, which weighs price against interruption risk.
  • consolidationPolicy: WhenUnderutilized. This is the part that quietly saves money. When a node is under-loaded and the pods on it could fit on existing nodes, Karpenter drains and terminates it. No fixed scale-down delay; it’s bin-packing in real time.
  • expireAfter: 720h. Every node gets recycled after 30 days. AMI rotation happens automatically. This is the cheapest way to keep nodes patched.

Spot instances become much more useful in this model. Karpenter honors karpenter.sh/capacity-type: spot preferences, reacts to the 2-minute interruption notice by draining the node as soon as the notice arrives (this needs its SQS interruption queue configured), and falls back to on-demand when spot capacity is unavailable and the NodePool allows both capacity types. For batch and stateless workloads, you can be 80% spot without operational pain.
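A workload opts into spot by selecting on the capacity-type label. The following is a minimal sketch rather than a recommended manifest: the report-renderer name, the batch namespace, the image, and the request sizes are all hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-renderer          # hypothetical workload name
  namespace: batch               # hypothetical namespace
spec:
  replicas: 6
  selector:
    matchLabels:
      app: report-renderer
  template:
    metadata:
      labels:
        app: report-renderer
    spec:
      # Only schedule onto nodes Karpenter launched as spot capacity.
      nodeSelector:
        karpenter.sh/capacity-type: spot
      # Prefer spreading replicas across zones so one reclaim does not take them all.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: report-renderer
      # Leave time to finish in-flight work inside the 2-minute notice.
      terminationGracePeriodSeconds: 60
      containers:
        - name: worker
          image: registry.example.com/report-renderer:1.0.0   # placeholder image
          resources:
            requests: { cpu: 500m, memory: 512Mi }
```

Note the trade-off: a hard nodeSelector pins the workload to spot, so it will sit pending rather than fall back to on-demand. Drop the selector if you would rather let Karpenter choose between the two capacity types.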

KEDA: scale-to-zero on real signals

The Horizontal Pod Autoscaler scales on CPU and memory by default. Most workloads I care about don’t have a CPU-shaped load. They have a queue depth. Or an event rate. Or “is it business hours.”

KEDA (Kubernetes Event-driven Autoscaling) 2.10 lets you scale a Deployment based on 60+ different signals, including SQS queue length, Kafka lag, Redis list length, Prometheus query result, cron schedule, and HTTP request rate. And critically, it scales to zero.

A worker that processes SQS messages:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders
        queueLength: '20'
        awsRegion: us-east-1
        identityOwner: operator
      authenticationRef:
        name: sqs-trigger-auth

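When the queue is empty, the deployment scales to zero pods and the workload itself costs nothing. When messages appear, KEDA scales up to roughly one pod per 20 messages of queue depth, capped at maxReplicaCount. The cooldownPeriod (300 seconds) is how long KEDA waits after the last active trigger before dropping from one replica to zero; scale-down between higher replica counts follows the HPA’s default 5-minute stabilization window. Together they avoid thrashing on bursty traffic.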
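If the worker still flaps between replica counts on bursty queues, the HPA that KEDA manages can be tuned from the same ScaledObject. This is a sketch of the advanced block only; the window and policy values are illustrative, not measured recommendations.

```yaml
# Fragment of the order-processor ScaledObject spec; values are illustrative.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Act on the highest replica recommendation seen in the last 5 minutes.
          stabilizationWindowSeconds: 300
          policies:
            # Remove at most half of the current replicas per minute.
            - type: Percent
              value: 50
              periodSeconds: 60
```

These fields are passed through to the HPA's standard behavior section, so they shape scaling between 1 and maxReplicaCount; the final step to zero is still controlled by cooldownPeriod.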

For HTTP services, scale-to-zero is harder because you need request-driven activation. KEDA has an HTTP add-on that proxies the first request, scales up the deployment, and forwards once a pod is ready. It works, but the cold-start latency (often 10-30 seconds depending on image pull and app startup) makes it best suited for internal/staging services, not user-facing prod.

A Prometheus-driven scaler — useful for services where business metrics drive load:

triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.platform-observability:9090
      metricName: checkout_requests_per_second
      threshold: '100'
      query: |
        sum(rate(http_requests_total{job="checkout"}[2m]))

One pod per 100 RPS. Combined with Karpenter, the cluster shrinks at night when traffic falls, scales up gradually in the morning, and never over-provisions during low-traffic windows.
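The "is it business hours" signal can be expressed with KEDA's cron scaler. A minimal sketch, assuming a New York working day and a floor of four replicas (both are made-up values):

```yaml
triggers:
  - type: cron
    metadata:
      timezone: America/New_York    # assumed; any IANA time zone name
      start: '0 8 * * 1-5'          # hold the floor from 08:00, Monday to Friday
      end: '0 18 * * 1-5'           # release the floor at 18:00
      desiredReplicas: '4'          # illustrative floor during the window
```

When this sits in the same triggers list as the Prometheus scaler, the replica count follows whichever trigger asks for more, so the cron entry acts as a daytime floor and the request rate still drives peaks. Outside the window the cron trigger is inactive and the other triggers (or minReplicaCount) decide.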

OpenCost: making the bill visible per team

You can’t optimize what you can’t measure. OpenCost (formerly Kubecost’s OSS engine, now a CNCF Sandbox project) reads pod usage and node cost and produces per-namespace, per-workload, per-label cost attribution.

The hookup is minimal:

apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: opencost
  namespace: opencost
spec:
  interval: 1h
  url: https://opencost.github.io/opencost-helm-chart
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: opencost
  namespace: opencost
spec:
  interval: 1h
  chart:
    spec:
      chart: opencost
      version: 1.21.0
      sourceRef: { kind: HelmRepository, name: opencost }
  values:
    opencost:
      prometheus:
        external:
          enabled: true
          url: http://prometheus.platform-observability:9090
      exporter:
        defaultClusterId: prod-us

Once running, OpenCost exports Prometheus metrics such as node_total_hourly_cost and container_cpu_allocation, and serves per-namespace and per-label allocations through its API; both can feed Grafana 10 dashboards. The trick is to scope dashboards to teams rather than show “the entire cluster cost.” A team that sees its own namespace’s cost line, and where it spiked last week, will optimize. A team that sees the global cluster cost will scroll past it.

The OpenCost docs at opencost.io cover the metric set and the integrations. For cloud cost reconciliation (matching pod costs to actual AWS bills), the cloud-billing integration is worth setting up; the spot price fluctuations otherwise produce attribution drift.
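One way to get a per-namespace cost series for those team dashboards is a pair of recording rules over the OpenCost metrics. Treat this as a sketch under several assumptions: that the Prometheus Operator (and its PrometheusRule CRD) is installed, that both metrics carry a node label in your scrape setup, and that each node has exactly one price series. The rule and object names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: team-cost-rules               # hypothetical name
  namespace: platform-observability
spec:
  groups:
    - name: opencost-namespace-cost
      interval: 5m
      rules:
        # Hourly CPU cost per namespace: allocated cores times the node's per-core price.
        - record: namespace:cpu_cost_hourly:sum      # hypothetical rule name
          expr: |
            sum by (namespace) (
              container_cpu_allocation
              * on (node) group_left ()
              node_cpu_hourly_cost
            )
        # Hourly memory cost per namespace: allocated GiB times the node's per-GiB price.
        - record: namespace:ram_cost_hourly:sum      # hypothetical rule name
          expr: |
            sum by (namespace) (
              container_memory_allocation_bytes / 1024 / 1024 / 1024
              * on (node) group_left ()
              node_ram_hourly_cost
            )
```

Check the label set on your own metrics before relying on the join; if node is missing or relabeled, the expressions return nothing rather than a wrong number, which at least fails visibly.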

Pod resource right-sizing: the boring win

Karpenter packs nodes based on pod requests. If your pods have requests set to 2x what they actually use, Karpenter still bin-packs to those requests. Most over-provisioning lives here.

Vertical Pod Autoscaler in recommendation mode (updateMode: Off) is the cheapest way to find the gap:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: checkout
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: 'Off'
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed: { cpu: 50m, memory: 64Mi }
        maxAllowed: { cpu: 2, memory: 4Gi }

Let VPA observe for about a week so the recommendation covers a full traffic cycle, then run kubectl describe vpa checkout -n checkout to see the recommended requests. Don’t auto-apply: VPA and HPA fight in confusing ways when both act on CPU or memory. Surface the recommendations in a quarterly cost-engineering review and have teams update their Helm values.

The pattern that works socially: a slow opt-in. Don’t mandate right-sizing org-wide. Pick two teams as pilots, show the cost reduction (and the prod stability — over-requested pods aren’t more stable, they’re more expensive), then roll out as a recommendation everyone follows because they want to.

Common Pitfalls

  • Karpenter without disruption budgets. Aggressive consolidation will drain nodes during business hours. Set a PodDisruptionBudget on every prod workload and put Karpenter’s karpenter.sh/do-not-disrupt: "true" annotation on stateful pods.
  • Spot for stateful workloads. Databases and queues on spot instances are bad ideas. Tag those workloads to require karpenter.sh/capacity-type: on-demand via node selectors.
  • KEDA with minReplicaCount: 0 on prod HTTP services. Scale-to-zero on a user-facing service means the first morning request waits for a cold start. Use minReplicaCount: 2 for prod, 0 for internal-only.
  • Mixing HPA and KEDA. KEDA creates its own HPA under the hood. Don’t define a separate HPA on the same deployment.
  • OpenCost dashboards nobody owns. A dashboard without an owner is decorative. Assign each cost dashboard to a specific person (or the platform PM) who reports on it monthly.
  • Karpenter NodePool with no limits. Without a limits block, a misbehaving job that requests 10,000 CPUs will get them, briefly. Set sane upper bounds.
  • Forgetting GPU and ARM nodes. Karpenter handles both. ARM64 (Graviton) is meaningfully cheaper for stateless workloads. Many container images are already multi-arch; rebuild the rest with docker buildx.
  • Not factoring in data transfer. Cross-AZ traffic charges are not part of compute cost engineering but show up on the same bill. Karpenter’s topology.kubernetes.io/zone requirements can be used to keep chatty workloads zone-local.
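The first two pitfalls come down to a few lines of YAML per workload. A minimal sketch, assuming the checkout pods carry an app: checkout label (an assumption, not something shown earlier) and that the second document is merged into a stateful workload's existing pod template:

```yaml
# Cap voluntary evictions so consolidation drains one checkout pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
  namespace: checkout
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: checkout                  # assumed pod label
---
# Fragment of a stateful workload's pod template (StatefulSet or Deployment).
spec:
  template:
    metadata:
      annotations:
        # Karpenter will not voluntarily disrupt a node while this pod runs on it.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      # Keep the pod off spot capacity entirely.
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
```

The annotation also blocks node expiry for as long as the pod is running, so use it sparingly or those nodes never get recycled.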

Wrapping Up

The shape of cluster cost in 2023 is: Karpenter doing the bin-packing, KEDA doing the application-level scaling, OpenCost doing the attribution, and a quarterly cost-engineering review where teams see their own numbers. None of this is exotic. All of it requires committing to a measurement loop and letting the data drive decisions.

This wraps up the June series on platform engineering, GitOps, Kubernetes operations, and CI/CD. The throughline: take the cognitive load off product teams, give them a paved path, and instrument everything so the platform team can prove its value. The tools change every six months. The shape doesn’t.