# Canary Deployments

## What Is a Canary Deployment?

A canary deployment routes a small percentage of production traffic to the new version while the old version serves the rest. Unlike feature flags (which toggle features at the application level), canary deployments operate at the infrastructure level -- users do not know they are hitting a different version.
The name comes from the "canary in a coal mine" -- a small group of users acts as an early warning system. If the canary version shows degraded metrics, traffic is shifted back to the stable version before most users are affected.
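At its core, the traffic split is just a weighted routing decision. In practice it lives in a load balancer or service mesh (Envoy, NGINX, Istio), but a minimal Python sketch makes the mechanism concrete (illustrative only; the weights and simulation are assumptions, not production routing code):

```python
import random

def route(canary_weight: float) -> str:
    """Send a request to 'canary' with probability canary_weight, else 'stable'."""
    return "canary" if random.random() < canary_weight else "stable"

# Simulate 100,000 requests at a 5% canary weight.
random.seed(42)
counts = {"canary": 0, "stable": 0}
for _ in range(100_000):
    counts[route(0.05)] += 1

print(counts)  # roughly 5,000 requests land on the canary
```

The same principle scales down to a single header-based rule ("route internal employees to the canary first") or up to a mesh-wide weighted `VirtualService`, as in the Istio example later in this section.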
## Deployment Strategy Comparison
| Strategy | Traffic Split | Rollback Speed | Infrastructure Cost | Observability Need |
|---|---|---|---|---|
| Canary | 1-10% new, rest old | Seconds (shift traffic) | Low (few new pods) | High (compare metrics) |
| Blue-Green | 100% switch | Seconds (DNS/LB switch) | High (2x infrastructure) | Medium |
| Rolling | Gradual pod replacement | Minutes (scale down new) | Low | Medium |
| Shadow/Dark | 0% user-facing (mirror) | N/A (no user impact) | Medium (duplicate processing) | High |
### When to Use Each
- Canary: Default choice for critical services where you want statistical validation before full rollout
- Blue-Green: When you need instant, complete rollback capability (e.g., database schema changes)
- Rolling: For non-critical services where gradual replacement is sufficient
- Shadow: For validating a complete rewrite against production traffic without user impact
## Canary Analysis with Kayenta (Spinnaker)
Kayenta is Netflix's automated canary analysis tool, integrated with Spinnaker. It compares metrics between the canary and baseline versions and produces a statistical judgment.
```json
{
  "canaryConfig": {
    "name": "checkout-service-canary",
    "judge": {
      "judgeConfigurations": {},
      "name": "NetflixACAJudge-v1.0"
    },
    "metrics": [
      {
        "name": "error_rate",
        "groups": ["Errors"],
        "query": {
          "type": "prometheus",
          "customInlineTemplate": "sum(rate(http_requests_total{status=~\"5..\",app=\"checkout\",version=\"${scope}\"}[5m])) / sum(rate(http_requests_total{app=\"checkout\",version=\"${scope}\"}[5m]))"
        },
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "critical": true,
            "mustHaveData": true
          }
        },
        "scopeName": "default"
      },
      {
        "name": "latency_p99",
        "groups": ["Latency"],
        "query": {
          "type": "prometheus",
          "customInlineTemplate": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app=\"checkout\",version=\"${scope}\"}[5m])) by (le))"
        },
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "critical": true
          }
        },
        "scopeName": "default"
      },
      {
        "name": "saturation_cpu",
        "groups": ["Saturation"],
        "query": {
          "type": "prometheus",
          "customInlineTemplate": "avg(rate(container_cpu_usage_seconds_total{app=\"checkout\",version=\"${scope}\"}[5m]))"
        },
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "critical": false
          }
        },
        "scopeName": "default"
      }
    ],
    "classifier": {
      "groupWeights": {
        "Errors": 40,
        "Latency": 35,
        "Saturation": 25
      }
    }
  }
}
```
### How Kayenta Scores Canaries
- Collect metrics from both canary and baseline for the analysis window
- Compare distributions using the Mann-Whitney U test (non-parametric)
- Score each metric as Pass, Marginal, or Fail
- Apply group weights (Errors 40%, Latency 35%, Saturation 25%)
- Produce a final score (0-100). Typically, >70 = promote, <50 = rollback, 50-70 = extend observation
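The weighting step above can be sketched in a few lines of Python. This is a simplified illustration, not Kayenta's actual implementation; the Pass/Marginal/Fail values and the example inputs are assumptions:

```python
def weighted_canary_score(group_results: dict, group_weights: dict) -> float:
    """Combine per-group Pass/Marginal/Fail verdicts into a 0-100 score."""
    value = {"Pass": 1.0, "Marginal": 0.5, "Fail": 0.0}
    total_weight = sum(group_weights.values())
    return 100.0 * sum(
        group_weights[group] * value[result]
        for group, result in group_results.items()
    ) / total_weight

score = weighted_canary_score(
    {"Errors": "Pass", "Latency": "Pass", "Saturation": "Marginal"},
    {"Errors": 40, "Latency": 35, "Saturation": 25},
)
print(score)  # 87.5 -> above the 70 threshold, so this canary would be promoted
```

Note how the weighting encodes priorities: a Saturation wobble only costs 12.5 points, while a failed Errors group alone would drag the score to 60 and block promotion.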
## Canary with Argo Rollouts (Kubernetes-Native)
For teams not using Spinnaker, Argo Rollouts provides Kubernetes-native canary deployments:
```yaml
# argo-canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        istio:
          virtualService:
            name: checkout-vsvc
            routes:
            - primary
      steps:
      - setWeight: 5              # 5% to canary
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: canary-analysis
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 25             # 25% to canary
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: canary-analysis
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 50             # 50% to canary
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: canary-analysis
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 100            # full traffic; the canary is promoted to stable
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis
spec:
  args:
  - name: canary-hash
  metrics:
  - name: error-rate
    interval: 1m
    count: 5                      # five measurements, one minute apart
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{status=~"5..",app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}"}[5m]))
          /
          sum(rate(http_requests_total{app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}"}[5m]))
  - name: latency-p99
    interval: 1m
    count: 5
    successCondition: result[0] < 0.5  # p99 under 500ms
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99, sum(rate(
            http_request_duration_seconds_bucket{app="checkout",
              rollouts_pod_template_hash="{{args.canary-hash}}"}[5m])) by (le))
```
## Key Decisions for Canary Deployments

### How Much Traffic to Send to Canary?
| Traffic Percentage | Use Case | Risk Level |
|---|---|---|
| 1% | High-risk changes (payment, auth) | Very low |
| 5% | Standard feature releases | Low |
| 10% | Low-risk changes with high confidence | Low |
| 25% | Changes that need more traffic volume for statistical significance | Medium |
### How Long to Observe?
The observation window depends on traffic volume and the statistical significance required:
- High traffic services (>1000 rps): 10-15 minutes provides enough data points
- Medium traffic (100-1000 rps): 30-60 minutes
- Low traffic (<100 rps): 2-6 hours (consider synthetic traffic augmentation)
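These windows follow from simple arithmetic: the canary must accumulate enough requests for the metric comparison to be meaningful. A back-of-the-envelope calculation (the 10,000-request target is an illustrative assumption, not a statistical guarantee):

```python
def canary_requests(rps: float, canary_fraction: float, window_minutes: float) -> float:
    """Total requests the canary sees during the observation window."""
    return rps * canary_fraction * window_minutes * 60

def minutes_needed(rps: float, canary_fraction: float,
                   target_requests: float = 10_000) -> float:
    """Window length needed for the canary to accumulate target_requests."""
    return target_requests / (rps * canary_fraction * 60)

print(canary_requests(1000, 0.05, 15))  # 45,000 requests: plenty at high traffic
print(round(minutes_needed(50, 0.05)))  # ~67 minutes at low traffic
```

This is why low-traffic services need hours-long windows (or synthetic traffic): at 50 rps and a 5% split, the canary sees only 150 requests per minute.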
### What Metrics to Compare?
At minimum, compare these between canary and baseline:
- Error rate (critical -- always include)
- Latency percentiles (p50, p95, p99)
- Saturation metrics (CPU, memory per pod)
- Business metrics (conversion rate, revenue per request -- if available in real-time)
## Common Canary Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Insufficient traffic | Cannot reach statistical significance | Increase canary percentage or observation window |
| Only checking averages | Masks tail latency regressions | Compare p95 and p99, not just mean |
| No automatic rollback | Human delay allows more users to be affected | Configure automatic rollback on metric threshold |
| Ignoring business metrics | Technically fast but functionally broken | Include conversion rate and error count in canary analysis |
| Same-version canary | Canary always passes because it is identical to baseline | Verify canary is actually running the new version |
| Cache warming effects | Canary starts slow due to cold caches | Allow a warm-up period before starting metric comparison |
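In Argo Rollouts, the last two pitfalls can be addressed directly in the analysis configuration: `initialDelay` defers metric collection past the warm-up period, and `failureLimit` triggers an automatic abort. A sketch (the template name, delay, and thresholds here are illustrative values, not recommendations):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis-with-warmup
spec:
  metrics:
  - name: error-rate
    initialDelay: 2m    # skip the cold-cache period before comparing metrics
    interval: 1m
    count: 10
    failureLimit: 2     # after 2 failed measurements, the rollout aborts automatically
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: ...      # same error-rate query as in the earlier example
```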
## Canary Deployment Checklist
Before enabling canary deployments:
- Metrics pipeline can differentiate traffic by version (labels, headers, or pod identity)
- Automated rollback is configured (not just manual intervention)
- Analysis window is long enough for your traffic volume
- Both canary and baseline are monitored by the same dashboards
- Alert routing accounts for canary failures (do not page for expected experiments)
- The team understands that a rollback is a success, not a failure -- you caught a problem before it reached all users