Cluster Cost Engineering: Karpenter, KEDA, and the End of Static Node Groups
TL;DR
- Static node groups plus cluster-autoscaler is a cost ceiling, not a cost strategy.
- Karpenter picks instance types per workload, bin-packs aggressively, and consolidates spare capacity.
- KEDA scales workloads to zero based on real signals (queue depth, event count), not just CPU.
- Cost engineering needs metrics (OpenCost) wired to teams, not just a Grafana dashboard nobody reads.
The first time I looked at a Kubernetes bill closely, I found we were running 40% over-provisioned across the fleet. Every team had asked for “a bit more headroom,” every node group was 60% utilized on average, and the spot instance story was three Slack threads and a comment in someone’s Notion. None of this was unusual. Most clusters I see today have the same shape.
Karpenter and KEDA are the two pieces that flipped this for me. Together they take cluster capacity from “ops decision made quarterly” to “scheduler-level optimization made every few minutes.” If your bill is meaningful — and at 50+ engineers and prod traffic it is — the savings pay for the migration in under a month.
This is the final post in the June series. For the GitOps and progressive delivery layers, see progressive delivery with Argo Rollouts and advanced GitHub Actions patterns.
Why cluster-autoscaler isn’t enough
The classic cluster-autoscaler reads pending pods and adds nodes to a pre-defined Auto Scaling Group. It does what it says: if pods are pending, add more of whatever instance type the ASG specifies. That works. It also locks you into a single instance type per ASG, encourages a small number of large node groups (because managing many ASGs is painful), and has no opinion about cost.
The failure modes show up at scale:
- A pod sized for a 16-vCPU node schedules onto an m5.4xlarge, and the next 200m-CPU pod also gets an m5.4xlarge because that’s the ASG. You’re paying full price for roughly 1% utilization.
- Spot instances exist in a separate ASG, the workload labels are inconsistent, and the priority expander is “best effort.” You drift back onto on-demand without noticing.
- Scaling down is conservative. Empty nodes hang around for 10+ minutes “in case.” Across hundreds of nodes that’s real money.
Karpenter rethinks this. It’s a node-provisioning controller that looks at pending pods, asks “what’s the cheapest combination of instance types that fits these pods,” and provisions them directly. No ASG. No predefined instance type list per workload class.
Karpenter in practice
Karpenter’s v1beta1 API (introduced in v0.32, late 2023) configures via NodePool (formerly Provisioner) and EC2NodeClass (formerly AWSNodeTemplate) resources:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        node-tier: shared
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64, arm64]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [c, m, r]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['4']
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
        - key: topology.kubernetes.io/zone
          operator: In
          values: [us-east-1a, us-east-1b, us-east-1c]
      nodeClassRef:
        name: default
      taints: []
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  role: KarpenterNodeRole-acme
  subnetSelectorTerms:
    - tags: { karpenter.sh/discovery: acme }
  securityGroupSelectorTerms:
    - tags: { karpenter.sh/discovery: acme }
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs: { volumeType: gp3, volumeSize: 100Gi, encrypted: true }
Three things doing heavy lifting here:
- Broad instance type requirements. Karpenter picks from c, m, and r families across multiple generations, on both architectures. The result: it finds the cheapest available instance that satisfies the pod’s requests, accounting for spot price fluctuations.
- consolidationPolicy: WhenUnderutilized. This is the part that quietly saves money. When a node is under-loaded and the pods on it could fit on existing nodes, Karpenter drains and terminates it. No fixed scale-down delay; it’s bin-packing in real time.
- expireAfter: 720h. Every node gets recycled after 30 days. AMI rotation happens automatically. This is the cheapest way to keep nodes patched.
Spot instances become much more useful in this model. Karpenter respects karpenter.sh/capacity-type: spot preferences, handles the interruption notice when its interruption queue is configured (draining the node before the 2-minute mark), and falls back to on-demand when spot capacity is unavailable. For batch and stateless workloads, you can be 80% spot without operational pain.
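As a rough sketch of how a workload opts in to spot under the NodePool above, a batch Job can pin itself with a nodeSelector on karpenter.sh/capacity-type. The Job name, namespace, image, and resource figures here are hypothetical. Pinning means the pod waits for spot capacity rather than falling back; leave the selector off to let Karpenter choose among the capacity types the NodePool allows.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report            # hypothetical name
  namespace: batch                # hypothetical namespace
spec:
  backoffLimit: 6                 # retry budget if a pod fails
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        # Require spot capacity; Karpenter provisions a matching node.
        karpenter.sh/capacity-type: spot
      terminationGracePeriodSeconds: 60   # keep shutdown well inside the 2-minute notice
      containers:
        - name: report
          image: registry.example.com/nightly-report:1.0.0   # hypothetical image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```

The work itself still has to be safe to restart, since an interrupted pod is replaced and starts over.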
KEDA: scale-to-zero on real signals
The Horizontal Pod Autoscaler scales on CPU and memory by default. Most workloads I care about don’t have a CPU-shaped load. They have a queue depth. Or an event rate. Or “is it business hours.”
KEDA (Kubernetes Event-driven Autoscaling) 2.10 lets you scale a Deployment based on 60+ different signals: SQS queue length, Kafka lag, Redis list length, Prometheus query result, cron schedule, HTTP request rate. And critically, it scales to zero.
A worker that processes SQS messages:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders
        queueLength: '20'
        awsRegion: us-east-1
        identityOwner: operator
      authenticationRef:
        name: sqs-trigger-auth
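When the queue is empty, the deployment scales to zero pods. Cost: zero, beyond the namespace overhead. When messages appear, KEDA scales up, targeting one pod per 20 messages of queue depth, capped at the maxReplicaCount of 50. The cooldownPeriod (5 minutes) is how long KEDA waits after the last active trigger before dropping back to zero, which avoids thrashing on bursty traffic.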
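One detail worth sketching: cooldownPeriod only governs the final step down to zero. Scale-down between one replica and maxReplicaCount goes through the HPA that KEDA manages, and that behavior can be tuned in the ScaledObject’s advanced block. The window and policy numbers below are assumptions to tune, not recommendations.

```yaml
# Fragment to merge into the order-processor ScaledObject spec.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Act on the highest recommendation seen in the last 5 minutes (assumed value).
          stabilizationWindowSeconds: 300
          policies:
            # Remove at most half of the current replicas per minute (assumed value).
            - type: Percent
              value: 50
              periodSeconds: 60
```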
For HTTP services, scale-to-zero is harder because you need request-driven activation. KEDA has an HTTP add-on that proxies the first request, scales up the deployment, and forwards once a pod is ready. It works, but the cold-start latency (often 10-30 seconds depending on image pull and app startup) makes it best suited for internal/staging services, not user-facing prod.
A Prometheus-driven scaler — useful for services where business metrics drive load:
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.platform-observability:9090
      metricName: checkout_requests_per_second
      threshold: '100'
      query: |
        sum(rate(http_requests_total{job="checkout"}[2m]))
One pod per 100 RPS. Combined with Karpenter, the cluster shrinks at night when traffic falls, scales up gradually in the morning, and never over-provisions during low-traffic windows.
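The “is it business hours” signal mentioned earlier maps onto KEDA’s cron trigger. A minimal sketch for an internal service, assuming a hypothetical Deployment called internal-admin in an internal-tools namespace and a New York working day:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: internal-admin            # hypothetical name
  namespace: internal-tools       # hypothetical namespace
spec:
  scaleTargetRef:
    name: internal-admin
  minReplicaCount: 0
  maxReplicaCount: 4
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York   # assumed timezone
        start: 0 8 * * 1-5           # weekdays 08:00: trigger becomes active
        end: 0 18 * * 1-5            # weekdays 18:00: trigger becomes inactive
        desiredReplicas: "2"
```

Outside the window the trigger reports inactive, so with minReplicaCount: 0 the Deployment drops to zero once the cooldown period has passed.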
OpenCost: making the bill visible per team
You can’t optimize what you can’t measure. OpenCost (formerly Kubecost’s OSS engine, now a CNCF Sandbox project) reads pod usage and node cost and produces per-namespace, per-workload, per-label cost attribution.
The hookup is minimal:
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: opencost
  namespace: opencost
spec:
  interval: 1h
  url: https://opencost.github.io/opencost-helm-chart
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: opencost
  namespace: opencost
spec:
  interval: 1h
  chart:
    spec:
      chart: opencost
      version: 1.21.0
      sourceRef: { kind: HelmRepository, name: opencost }
  values:
    opencost:
      prometheus:
        external:
          enabled: true
          url: http://prometheus.platform-observability:9090
      exporter:
        defaultClusterId: prod-us
Once running, OpenCost exports Prometheus metrics such as node_total_hourly_cost and container_cpu_allocation, and its allocation API rolls cost up per namespace; both can feed Grafana 10 dashboards. The trick is to scope dashboards to teams, not show “the entire cluster cost.” A team that sees their own namespace’s cost line, and where it spiked last week, will optimize. A team that sees the global cluster cost will scroll past it.
The OpenCost docs at opencost.io cover the metric set and the integrations. For cloud cost reconciliation (matching pod costs to actual AWS bills), the cloud-billing integration is worth setting up; the spot price fluctuations otherwise produce attribution drift.
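OpenCost’s metrics only reach Grafana if Prometheus scrapes the exporter. A minimal sketch of a scrape job, assuming the Helm release above creates a Service named opencost in the opencost namespace and serves metrics on port 9003 (check all three against your install):

```yaml
scrape_configs:
  - job_name: opencost
    honor_labels: true        # keep the labels OpenCost sets rather than overwriting them
    scrape_interval: 1m
    metrics_path: /metrics
    static_configs:
      - targets:
          - opencost.opencost.svc:9003   # assumed Service name, namespace, and port
```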
Pod resource right-sizing: the boring win
Karpenter packs nodes based on pod requests. If your pods have requests set to 2x what they actually use, Karpenter still bin-packs to those requests. Most over-provisioning lives here.
Vertical Pod Autoscaler in recommendation mode (updateMode: Off) is the cheapest way to find the gap:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: checkout
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: 'Off'
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed: { cpu: 50m, memory: 64Mi }
        maxAllowed: { cpu: 2, memory: 4Gi }
Let VPA observe for a week, then kubectl describe vpa checkout -n checkout shows recommended requests. Don’t auto-apply (that fights with HPA in confusing ways when both act on CPU or memory). Surface the recommendations in a quarterly cost-engineering review and have teams update their Helm values.
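For orientation, the recommendation lives in the VPA object’s status and has roughly the shape below. The container name is an assumption and every number is made up for illustration, not a measurement; target is the field teams would normally copy into their requests.

```yaml
# Shape of the VPA status; all values are illustrative placeholders.
status:
  recommendation:
    containerRecommendations:
      - containerName: checkout   # assumed container name
        lowerBound:               # below this the container is likely under-provisioned
          cpu: 120m
          memory: 256Mi
        target:                   # the recommended requests
          cpu: 250m
          memory: 512Mi
        uncappedTarget:           # target before minAllowed/maxAllowed are applied
          cpu: 250m
          memory: 512Mi
        upperBound:               # above this the request is likely wasteful
          cpu: "1"
          memory: 1Gi
```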
The pattern that works socially: a slow opt-in. Don’t mandate right-sizing org-wide. Pick two teams as pilots, show the cost reduction (and the prod stability — over-requested pods aren’t more stable, they’re more expensive), then roll out as a recommendation everyone follows because they want to.
Common Pitfalls
- Karpenter without disruption budgets. Aggressive consolidation will drain nodes during business hours. Set a PodDisruptionBudget on every prod workload and use Karpenter’s karpenter.sh/do-not-disrupt annotation on stateful pods.
- Spot for stateful workloads. Databases and queues on spot instances are bad ideas. Tag those workloads to require karpenter.sh/capacity-type: on-demand via node selectors.
- KEDA with no minReplica for prod HTTP services. Scale-to-zero on a user-facing service means the first morning request waits for a cold start. Use minReplicaCount: 2 for prod, 0 for internal-only.
- Mixing HPA and KEDA. KEDA creates its own HPA under the hood. Don’t define a separate HPA on the same deployment.
- OpenCost dashboards nobody owns. A dashboard without an owner is decorative. Assign each cost dashboard to a specific person (or the platform PM) who reports on it monthly.
- Karpenter NodePool with no limits. Without a limits block, a misbehaving job that requests 10,000 CPUs will get them, briefly. Set sane upper bounds.
- Forgetting GPU and ARM nodes. Karpenter handles both. ARM64 (Graviton) is meaningfully cheaper for stateless workloads. Many container images are already multi-arch; rebuild the rest with docker buildx.
- Not factoring in egress. Cross-AZ egress costs are not part of compute cost engineering but show up on the same bill. Karpenter’s topology.kubernetes.io/zone requirements can be used to keep chatty workloads zone-local.
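The first two pitfalls come down to a few lines of YAML per workload. A sketch, assuming the checkout pods carry an app: checkout label and using a hypothetical orders-db StatefulSet as the stateful example:

```yaml
# Limits how many checkout pods a node drain may evict at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout
  namespace: checkout
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: checkout               # assumed pod label
---
# Stateful workload kept on on-demand nodes and exempt from voluntary disruption.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orders-db                 # hypothetical name
  namespace: orders
spec:
  serviceName: orders-db
  replicas: 3
  selector:
    matchLabels:
      app: orders-db
  template:
    metadata:
      labels:
        app: orders-db
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      containers:
        - name: db
          image: registry.example.com/orders-db:1.0.0   # hypothetical image
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```

The annotation is a blunt tool: nodes hosting those pods won’t be consolidated, so keep it to the workloads that genuinely can’t tolerate a reschedule.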
Wrapping Up
The shape of cluster cost in 2023 is: Karpenter doing the bin-packing, KEDA doing the application-level scaling, OpenCost doing the attribution, and a quarterly cost-engineering review where teams see their own numbers. None of this is exotic. All of it requires committing to a measurement loop and letting the data drive decisions.
This wraps up the June series on platform engineering, GitOps, Kubernetes operations, and CI/CD. The throughline: take the cognitive load off product teams, give them a paved path, and instrument everything so the platform team can prove its value. The tools change every six months. The shape doesn’t.