SPIFFE and SPIRE for Service Identity: A Hands-On Tutorial
TL;DR: SPIFFE specifies what a workload identity looks like; SPIRE 1.10 is the production-ready implementation that issues those identities based on cryptographically verifiable attestation. Install the server, set up node and workload attestation, fetch SVIDs from your apps, and you’ve replaced static service credentials with rotating, cryptographic identity.
Every multi-service system eventually invents some half-baked notion of service identity. Shared API keys in a vault. JWTs signed by a “machine” identity provider. Mutual TLS with a static CA where every cert lasts a year. These patterns share a problem: the credential is decoupled from the workload that uses it. If a process gets compromised, the stolen credential keeps working for whoever holds a copy until someone rotates it by hand.
SPIFFE (Secure Production Identity Framework For Everyone) is the answer the industry converged on, and SPIRE 1.10 (August 2025) is the reference implementation that’s finally easy enough to operate at small scale without a dedicated platform team. The core idea: a workload’s identity is derived from cryptographically verified attributes of the runtime (which kernel namespace, which Kubernetes service account, which AWS instance metadata), and the credential is rotated every hour so a leak buys you an hour at most, not months.
This tutorial walks through installing SPIRE, registering workloads, and using SVIDs from Go and Python apps. It builds on the Zero Trust setup from earlier in this series, but the post stands alone if all you need is workload identity.
1. What SPIFFE Actually Specifies
SPIFFE is three things: an identity format, a credential format, and an API for delivering credentials.
Identity:
  spiffe://trust-domain/path
  e.g. spiffe://prod.acme.com/ns/payments/sa/api
Credential (X.509-SVID):
  X.509 cert where the SPIFFE ID is the URI SAN
  + private key
  + trust bundle of the issuer
Workload API:
  gRPC endpoint, usually a Unix socket
  streams SVIDs as they rotate
A workload calls the Workload API, gets back its current SVID and the trust bundle of the SPIRE server. Mutual TLS between two workloads verifies the peer’s SVID against the bundle.
What’s missing from this list: a way for the Workload API to know which workload is calling, given that the caller has no credential yet. Solving that is the whole job of SPIRE.
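To make the identity format concrete, here is a minimal Python sketch that splits a SPIFFE ID into trust domain and path. The function name parse_spiffe_id is hypothetical, and the allowed character sets are my reading of the SPIFFE ID spec (lowercase trust domain, case-sensitive path segments); treat both as assumptions and use a real library such as go-spiffe for production validation.

```python
import re

# Character sets as read from the SPIFFE ID spec (an assumption of this
# sketch); a real SPIFFE library should do this check in production.
_TRUST_DOMAIN = re.compile(r"[a-z0-9._-]+")
_PATH_SEGMENT = re.compile(r"[a-zA-Z0-9._-]+")


def parse_spiffe_id(raw: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, path) or raise ValueError."""
    prefix = "spiffe://"
    if not raw.startswith(prefix):
        raise ValueError("scheme must be spiffe://")
    trust_domain, slash, rest = raw[len(prefix):].partition("/")
    if not _TRUST_DOMAIN.fullmatch(trust_domain):
        raise ValueError(f"invalid trust domain: {trust_domain!r}")
    if not slash:
        return trust_domain, ""  # the ID of the trust domain itself
    for segment in rest.split("/"):
        # Empty segments (including a trailing slash) and dot segments fail.
        if segment in (".", "..") or not _PATH_SEGMENT.fullmatch(segment):
            raise ValueError(f"invalid path segment: {segment!r}")
    return trust_domain, "/" + rest


print(parse_spiffe_id("spiffe://prod.acme.com/ns/payments/sa/api"))
```

An uppercase trust domain or a trailing slash raises ValueError here, which is the kind of early failure you want before an ID reaches a registration.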
2. Installing SPIRE 1.10
We’ll install on Kubernetes 1.31, which is the easiest case. The official Helm chart for SPIRE 1.10 ships hardened defaults.
helm repo add spiffe https://spiffe.github.io/helm-charts-hardened/
helm repo update
helm install spire-crds spiffe/spire-crds \
  --namespace spire-server \
  --create-namespace

cat <<EOF > spire-values.yaml
global:
  spire:
    trustDomain: "ai.internal"
    clusterName: "prod-us-east-1"
spire-server:
  controllerManager:
    enabled: true
  ca_subject:
    country: "US"
    organization: "ACME"
    common_name: "ACME SPIRE CA"
spire-agent:
  workloadAttestors:
    k8s:
      enabled: true
      skipKubeletVerification: false
EOF

helm install spire spiffe/spire \
  --namespace spire-server \
  --values spire-values.yaml \
  --version 0.24.0
After a minute, you’ll have a SPIRE server StatefulSet and a SPIRE agent DaemonSet running one agent pod on every node. Verify:
kubectl -n spire-server get pods
kubectl -n spire-server exec -it spire-server-0 -- /opt/spire/bin/spire-server healthcheck
2.1 What just happened
The SPIRE server is the root of trust. It generates a CA, holds the root key, and issues SVIDs. The SPIRE agent on each node proves its identity to the server through node attestation (on Kubernetes this is typically a projected service account token that the server validates against the cluster) and receives its own SVID in return. Workloads on that node talk to the local agent through a Unix socket.
The two-tier design means workloads never talk to the server directly. The agent attests itself to the server (node attestation), then attests each process that calls its socket (workload attestation) before handing over an SVID.
3. Registering Workloads
A workload registration is a rule: “any process matching these selectors gets this SPIFFE ID.” With the controller manager enabled, you create registrations declaratively.
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: payments-api
spec:
  spiffeIDTemplate: |-
    spiffe://ai.internal/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}
  podSelector:
    matchLabels:
      app: payments-api
  workloadSelectorTemplates:
    - "k8s:ns:{{ .PodMeta.Namespace }}"
    - "k8s:sa:{{ .PodSpec.ServiceAccountName }}"
    - "k8s:pod-label:app:payments-api"
  dnsNameTemplates:
    - "payments-api.{{ .PodMeta.Namespace }}.svc"
The controller watches for pods matching app: payments-api, computes the SPIFFE ID from the template, and pushes a registration to the SPIRE server. When the pod starts, the SPIRE agent matches the workload selectors against the running process (looking up its Kubernetes attributes via the kubelet API) and issues an SVID.
3.1 Selector matching in depth
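Selectors are AND-ed within a registration: a registration matches only when every one of its selectors is present on the workload. A workload with ns:payments and sa:api matches a registration that lists both selectors, and it also matches one that lists only ns:payments, because the workload may carry extra selectors the registration does not mention. Multiple selector types are supported: Kubernetes attributes, Unix process attributes (UID, GID, binary path), Docker labels, AWS IMDS data, and more.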
Be specific. A registration with only ns:payments will issue the same identity to every pod in that namespace, which defeats the point. Pin to service account at minimum; pin to pod labels when you need finer granularity.
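The warning about ns:payments-only registrations follows from how matching works: a registration applies when its selectors are a subset of what the agent observed on the workload. Here is a small Python sketch of that rule. The function name matches and the sample selector sets are made up for illustration; this is not SPIRE’s code.

```python
def matches(registration_selectors: set[str], workload_selectors: set[str]) -> bool:
    """A registration applies when all of its selectors appear on the workload."""
    return registration_selectors <= workload_selectors


# Selectors the agent might observe for two pods in the same namespace.
api_pod = {"k8s:ns:payments", "k8s:sa:api", "k8s:pod-label:app:payments-api"}
batch_pod = {"k8s:ns:payments", "k8s:sa:batch", "k8s:pod-label:app:reconciler"}

broad = {"k8s:ns:payments"}                 # namespace only
pinned = {"k8s:ns:payments", "k8s:sa:api"}  # namespace plus service account

for name, registration in (("broad", broad), ("pinned", pinned)):
    print(name, "api:", matches(registration, api_pod),
          "batch:", matches(registration, batch_pod))
```

The broad registration matches both pods, so both would receive the same identity. The pinned one matches only the API pod.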
4. Fetching an SVID from Go
The official go-spiffe library handles the streaming gRPC client. The typical pattern:
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx := context.Background()

	// Connect to the local SPIRE agent; the source keeps the SVID and bundle fresh.
	src, err := workloadapi.NewX509Source(ctx,
		workloadapi.WithClientOptions(
			workloadapi.WithAddr("unix:///run/spire/agent-sockets/api.sock"),
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	svid, err := src.GetX509SVID()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("I am %s", svid.ID)

	// Server config that requires any client SVID from our trust domain
	serverTLS := tlsconfig.MTLSServerConfig(src, src, tlsconfig.AuthorizeMemberOf(
		svid.ID.TrustDomain(),
	))

	srv := &http.Server{
		Addr:      ":8443",
		TLSConfig: serverTLS,
		// Placeholder handler; replace with your application's routes.
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}),
	}
	// Empty cert and key paths: the certificate comes from TLSConfig.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
The X509Source runs a goroutine that keeps the SVID fresh. The TLS config consumes the source, so cert rotation is transparent to the application. No file watching, no SIGHUP, no nothing.
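The reason rotation is invisible to the app is that the TLS layer asks the source for the current credential on each handshake instead of holding a copy. A simplified Python analogue of that pattern follows. RotatingSource and its methods are hypothetical names for illustration, not go-spiffe’s implementation.

```python
import threading


class RotatingSource:
    """Holds the current credential; readers always get the latest one."""

    def __init__(self, initial: str) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def update(self, new: str) -> None:
        # Called by the background watcher whenever the agent streams a new SVID.
        with self._lock:
            self._current = new

    def get(self) -> str:
        # Called once per TLS handshake, so new connections see new credentials.
        with self._lock:
            return self._current


source = RotatingSource("svid-serial-1")
print(source.get())
source.update("svid-serial-2")  # rotation happens behind the app's back
print(source.get())
```

Connections already established keep the credential they negotiated with; only handshakes after update() pick up the new one.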
4.1 Authorizing specific peers
AuthorizeMemberOf accepts any SVID from your trust domain. Usually too broad. To restrict to a specific service:
// needs: import "github.com/spiffe/go-spiffe/v2/spiffeid"
allowed := spiffeid.RequireFromString("spiffe://ai.internal/ns/agents/sa/orchestrator")
serverTLS := tlsconfig.MTLSServerConfig(src, src, tlsconfig.AuthorizeID(allowed))
Now only the orchestrator can connect. Other workloads get their TLS handshake rejected.
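The difference between the two authorizers is easy to see once they are written as plain predicates over the peer’s SPIFFE ID. The Python functions below are hypothetical stand-ins that mirror the behaviour described for AuthorizeMemberOf and AuthorizeID. They are a sketch, not the go-spiffe code, and they assume the certificate chain has already been verified against the trust bundle.

```python
from typing import Callable

Authorizer = Callable[[str], bool]


def trust_domain_of(spiffe_id: str) -> str:
    prefix = "spiffe://"
    if not spiffe_id.startswith(prefix):
        raise ValueError("not a SPIFFE ID")
    return spiffe_id[len(prefix):].split("/", 1)[0]


def authorize_member_of(trust_domain: str) -> Authorizer:
    """Accept any peer whose ID belongs to the given trust domain."""
    return lambda peer_id: trust_domain_of(peer_id) == trust_domain


def authorize_id(allowed: str) -> Authorizer:
    """Accept exactly one SPIFFE ID."""
    return lambda peer_id: peer_id == allowed


orchestrator = "spiffe://ai.internal/ns/agents/sa/orchestrator"
payments = "spiffe://ai.internal/ns/payments/sa/api"

broad = authorize_member_of("ai.internal")
strict = authorize_id(orchestrator)
print(broad(payments), strict(payments), strict(orchestrator))
```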
5. Fetching an SVID from Python
The py-spiffe library is the official Python client. Less mature than go-spiffe but workable.
# requires py-spiffe>=2.0
# NOTE: module paths and attribute names have changed between py-spiffe
# releases; check this snippet against the version you install.
from pyspiffe.workloadapi.default_workload_api_client import (
    DefaultWorkloadApiClient,
)
from pyspiffe.spiffe_id.trust_domain import TrustDomain
import ssl

client = DefaultWorkloadApiClient(
    spiffe_socket_path="unix:///run/spire/agent-sockets/api.sock"
)

# Synchronous fetch
svid = client.fetch_x509_svid()
print("I am", svid.spiffe_id)

# Build an SSL context for a server.
# load_cert_chain() only accepts file paths, not in-memory PEM.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile=svid.cert_chain_pem_path,
                    keyfile=svid.private_key_pem_path)
ctx.verify_mode = ssl.CERT_REQUIRED
ctx.load_verify_locations(cadata=client.fetch_x509_bundles().get(
    TrustDomain.parse("ai.internal")
).x509_authorities_as_pem)
For long-running processes, watch the stream so you get rotation updates:
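# Depending on the py-spiffe release, the callback may receive an X509Context
# (SVIDs plus bundles) rather than a bare SVID; adjust the body accordingly.
def on_update(svid):
    print(f"new svid, expires {svid.cert_chain[0].not_valid_after}")
    # update your server's TLS context here

client.watch_x509_context(on_update)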
Given the thinner Python tooling, most Python services end up running Envoy as a sidecar (per the Zero Trust post) and treating Envoy as the SVID consumer.
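One practical wrinkle if you do terminate TLS in Python: ssl.SSLContext.load_cert_chain takes file paths, so credentials held in memory have to be written out first. The sketch below assumes you can obtain the certificate chain and key as PEM bytes from your client library. The helper name write_pem_pair and the file names are hypothetical.

```python
import os
import tempfile


def write_pem_pair(cert_chain_pem: bytes, private_key_pem: bytes,
                   directory: str) -> tuple[str, str]:
    """Write cert chain and key into directory and return (cert_path, key_path)."""
    paths = []
    for name, data in (("svid.pem", cert_chain_pem), ("svid.key", private_key_pem)):
        path = os.path.join(directory, name)
        # 0o600 applies when the file is created; an existing file keeps its mode.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        paths.append(path)
    return paths[0], paths[1]


# Placeholder bytes stand in for real PEM data from the Workload API.
with tempfile.TemporaryDirectory() as tmp:
    cert_path, key_path = write_pem_pair(b"CERT-PEM", b"KEY-PEM", tmp)
    print(os.path.basename(cert_path), os.path.basename(key_path))
```

With real PEM data you would pass the two returned paths to load_cert_chain. Keep the directory on tmpfs or a similarly private location, since the key touches disk.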
6. Federation Between Trust Domains
If you operate multiple trust domains (per-region, per-environment, per-acquired-company), federation lets workloads from one trust the workloads of another. The mechanism is a bundle exchange.
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterFederatedTrustDomain
metadata:
  name: staging
spec:
  className: spire-mgmt
  trustDomain: "staging.ai.internal"
  bundleEndpointURL: "https://spire-server.staging.ai.internal/bundle"
  bundleEndpointProfile:
    type: https_spiffe
    endpointSPIFFEID: "spiffe://staging.ai.internal/spire/server"
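The SPIRE server fetches the staging trust bundle on a schedule. Workloads whose registrations list that trust domain under federatesWith can now authenticate peers presenting SVIDs from staging.ai.internal. Federation is configured per direction: this resource only teaches prod to trust staging. For mutual TLS in both directions, configure the mirror image on the staging side.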
+--------------+                           +--------------+
|     prod     |      bundle exchange      |   staging    |
| spire-server |<--------- HTTPS --------->| spire-server |
+--------------+                           +--------------+
       |                                          |
       v                                          v
+--------------+                           +--------------+
|     prod     |<----- mTLS with both ---->|   staging    |
|   workload   |       trust bundles       |   workload   |
+--------------+                           +--------------+
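On the verifying side, federation boils down to picking the right bundle for the peer’s trust domain. The Python sketch below models bundles as a dictionary keyed by trust domain. The function name bundle_for_peer and the placeholder CA strings are made up for illustration.

```python
def bundle_for_peer(peer_id: str, bundles: dict[str, list[str]]) -> list[str]:
    """Return the CA certificates to verify a peer against, keyed by its trust domain."""
    trust_domain = peer_id.removeprefix("spiffe://").split("/", 1)[0]
    try:
        return bundles[trust_domain]
    except KeyError:
        raise LookupError(f"no trust bundle for {trust_domain}") from None


# Before federation: the prod workload only knows its own bundle.
bundles = {"ai.internal": ["prod-root-ca"]}
staging_peer = "spiffe://staging.ai.internal/ns/agents/sa/orchestrator"

try:
    bundle_for_peer(staging_peer, bundles)
except LookupError as err:
    print("handshake would fail:", err)

# After federation: the staging bundle arrives alongside the local one.
bundles["staging.ai.internal"] = ["staging-root-ca"]
print(bundle_for_peer(staging_peer, bundles))
```

A valid SVID from a trust domain with no entry in the map can never verify, which is what a missing or one-sided federation setup looks like from the workload.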
7. Operational Details Nobody Mentions
A few things the documentation glosses over.
7.1 Backup the root key
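If the SPIRE server’s root CA key is lost, you have to re-bootstrap trust for every agent and workload. Keep the key material in a managed service rather than only on the server’s disk. SPIRE can hold keys in a KMS through its key manager plugins, or chain to an upstream authority whose key lives in a managed secret store, as in this fragment: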
spire-server:
  caKeyType: ec-p256
  upstreamAuthority:
    awsSecretsManager:
      enabled: true
      keyName: "spire-root-key"
Or use Vault as the upstream authority. Either way, don’t store the root key only in the StatefulSet’s PVC.
7.2 Plan for agent restarts
When a SPIRE agent restarts, every workload it served briefly loses SVID rotation. With one-hour TTLs and rotation triggered at the 30-minute mark, the SVID a workload holds always has at least about 30 minutes left, so that is your grace window. A DaemonSet runs exactly one agent per node, so there is no second agent to fail over to; rely on the DaemonSet restarting the pod quickly and make sure your workload library reconnects.
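The grace window is simple arithmetic, sketched here in Python. The function name grace_remaining is hypothetical, and the rotate-at-half-TTL default is taken from the numbers above rather than read from your agent’s configuration.

```python
from datetime import datetime, timedelta, timezone


def grace_remaining(issued_at: datetime, ttl: timedelta, now: datetime,
                    rotate_fraction: float = 0.5) -> tuple[timedelta, bool]:
    """Return (time left on the SVID in hand, whether rotation is already due)."""
    expires_at = issued_at + ttl
    rotate_at = issued_at + ttl * rotate_fraction
    remaining = max(expires_at - now, timedelta(0))
    return remaining, now >= rotate_at


issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
ttl = timedelta(hours=1)

# Agent dies right as rotation was due: 30 minutes until workloads break.
print(grace_remaining(issued, ttl, issued + timedelta(minutes=30)))
# Agent dies just after a rotation: most of the hour is left.
print(grace_remaining(issued, ttl, issued + timedelta(minutes=10)))
```

If your agent outages routinely approach the first number, lengthen the TTL or fix the restarts; shortening TTLs shrinks this window proportionally.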
7.3 Watch the clock
Cert validation is time-sensitive. If a node’s clock drifts more than a couple of minutes, SVIDs appear expired or not-yet-valid. Run NTP. Alert on clock skew above 30 seconds.
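To see why a couple of minutes of drift is enough to break things, here is a small Python sketch of the validity check as a skewed node would perform it. The function name svid_status and the sample timestamps are illustrative; real TLS stacks do this check internally.

```python
from datetime import datetime, timedelta, timezone


def svid_status(not_before: datetime, not_after: datetime,
                true_now: datetime, node_skew: timedelta) -> str:
    """Classify a certificate as seen by a node whose clock is off by node_skew."""
    local_now = true_now + node_skew
    if local_now < not_before:
        return "not yet valid"
    if local_now > not_after:
        return "expired"
    return "valid"


not_before = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
not_after = not_before + timedelta(hours=1)

# A peer presents an SVID issued 30 seconds ago.
just_issued = not_before + timedelta(seconds=30)
print(svid_status(not_before, not_after, just_issued, timedelta(0)))
print(svid_status(not_before, not_after, just_issued, timedelta(minutes=-2)))

# The same SVID one minute before it expires, seen by a node running fast.
near_expiry = not_after - timedelta(minutes=1)
print(svid_status(not_before, not_after, near_expiry, timedelta(minutes=2)))
```

Freshly rotated SVIDs are the ones that fail on a slow clock, which is why skew problems tend to show up as intermittent handshake errors right after rotation.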
8. Common Pitfalls
Four pitfalls I’ve hit.
8.1 The agent socket isn’t mounted
If your pod can’t see /run/spire/agent-sockets/api.sock, the workload API call fails with a confusing error. Mount the socket via a hostPath volume or use the CSI driver that ships with the Helm chart. The CSI driver is preferred because it scopes the mount to authorized workloads.
8.2 Workload selectors not matching
The Workload API call returns no SVID because no registration matched. Run kubectl exec spire-agent -- /opt/spire/bin/spire-agent api fetch x509 -socketPath ... to confirm the agent is serving the Workload API, then check the agent logs (at debug level) for the selectors it attested for your pod. Common cause: pod labels not matching the registration’s selector.
8.3 Federation bundle staleness
The bundle exchange relies on each side fetching the other’s bundle on a schedule. If the schedule lags or the endpoint is unreachable, you keep using a stale bundle. Monitor the last-refresh metric on both sides.
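Whatever metric your deployment exposes for the last successful refresh, the alert condition is a comparison against a multiple of the refresh interval. Here is a hedged Python sketch of that check. The function name bundle_is_stale and the three-missed-refreshes threshold are assumptions for illustration, not SPIRE defaults.

```python
from datetime import datetime, timedelta, timezone


def bundle_is_stale(last_refresh: datetime, now: datetime,
                    refresh_interval: timedelta, missed_allowed: int = 3) -> bool:
    """True when more than missed_allowed refresh intervals have passed without a refresh."""
    return now - last_refresh > refresh_interval * missed_allowed


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
interval = timedelta(minutes=5)

print(bundle_is_stale(now - timedelta(minutes=4), now, interval))   # healthy
print(bundle_is_stale(now - timedelta(minutes=40), now, interval))  # alert
```

Run the same check from both trust domains, since each side refreshes independently.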
8.4 Trust domain typos
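Trust domain names must be lowercase and SPIFFE ID paths are case-sensitive, so IDs are compared as exact strings. A typo (Prod.ai.internal or prd.ai.internal instead of prod.ai.internal) means a rejected ID or no SVID. Define the trust domain in a single place in your config and reference it.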
9. Troubleshooting
Three failure modes I see most often.
9.1 SVID expired errors
If workloads start failing with expired SVID errors, the SPIRE agent is either down or unable to reach the server. Check the agent logs, then the server logs. Usually network policy or DNS.
9.2 SVID issued but mTLS fails
Two SVIDs, both valid, mTLS handshake still fails. Almost always trust bundle mismatch. The verifier doesn’t have the issuer’s bundle. Either you’re federating across trust domains and the federation isn’t set up, or one side is using a stale bundle.
9.3 Controller manager not reconciling
You created a ClusterSPIFFEID but no registration appears in spire-server entry show. Check the controller manager logs in the SPIRE server pod. Usually a webhook configuration error or a missing RBAC binding.
10. Wrapping Up
SPIFFE and SPIRE are the boring infrastructure layer that makes Zero Trust actually possible. Once you have it, every other security control gets easier: policy decisions can name workloads by stable identity, audit logs identify the actor cryptographically, mTLS gives you data-in-flight encryption with peer authentication for free.
The first install is the hardest part. The Helm chart gets you 90% of the way; the remaining 10% is registering your workloads correctly and wiring the workload API client into your apps. Budget a week for the first service, a day each for the next ten, an hour each after that.
For more reading, the SPIFFE specification and the SPIRE concepts guide are the canonical references. My next post in this series, Policy as Code with OPA 1.0, shows how to make policy decisions against these identities.