# Canary Deployments

## What Is a Canary Deployment?

A canary deployment routes a small percentage of production traffic to the new version while the old version serves the rest. Unlike feature flags (which toggle features at the application level), canary deployments operate at the infrastructure level -- users do not know they are hitting a different version.
The name comes from the "canary in a coal mine" -- a small group of users acts as an early warning system. If the canary version shows degraded metrics, traffic is shifted back to the stable version before most users are affected.
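At its core, the traffic split is just a weighted routing decision. In practice it lives in a load balancer or service mesh (Envoy, NGINX, Istio), but a minimal Python sketch makes the mechanism concrete (illustrative only; the weights and simulation are assumptions, not production routing code):

```python
import random

def route(canary_weight: float) -> str:
    """Send a request to 'canary' with probability canary_weight, else 'stable'."""
    return "canary" if random.random() < canary_weight else "stable"

# Simulate 100,000 requests at a 5% canary weight.
random.seed(42)
counts = {"canary": 0, "stable": 0}
for _ in range(100_000):
    counts[route(0.05)] += 1

print(counts)  # roughly 5,000 requests land on the canary
```

The same principle scales down to a single header-based rule ("route internal employees to the canary first") or up to a mesh-wide weighted `VirtualService`, as in the Istio example later in this section.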
## Deployment Strategy Comparison
| Strategy | Traffic Split | Rollback Speed | Infrastructure Cost | Observability Need |
|---|---|---|---|---|
| Canary | 1-10% new, rest old | Seconds (shift traffic) | Low (few new pods) | High (compare metrics) |
| Blue-Green | 100% switch | Seconds (DNS/LB switch) | High (2x infrastructure) | Medium |
| Rolling | Gradual pod replacement | Minutes (scale down new) | Low | Medium |
| Shadow/Dark | 0% user-facing (mirror) | N/A (no user impact) | Medium (duplicate processing) | High |
### When to Use Each
- Canary: Default choice for critical services where you want statistical validation before full rollout
- Blue-Green: When you need instant, complete rollback capability (e.g., database schema changes)
- Rolling: For non-critical services where gradual replacement is sufficient
- Shadow: For validating a complete rewrite against production traffic without user impact
## Canary Analysis with Kayenta (Spinnaker)
Kayenta is Netflix's automated canary analysis tool, integrated with Spinnaker. It compares metrics between the canary and baseline versions and produces a statistical judgment.
```json
{
  "canaryConfig": {
    "name": "checkout-service-canary",
    "judge": {
      "judgeConfigurations": {},
      "name": "NetflixACAJudge-v1.0"
    },
    "metrics": [
      {
        "name": "error_rate",
        "groups": ["Errors"],
        "query": {
          "type": "prometheus",
          "customInlineTemplate": "sum(rate(http_requests_total{status=~\"5..\",app=\"checkout\",version=\"${scope}\"}[5m])) / sum(rate(http_requests_total{app=\"checkout\",version=\"${scope}\"}[5m]))"
        },
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "critical": true,
            "mustHaveData": true
          }
        },
        "scopeName": "default"
      },
      {
        "name": "latency_p99",
        "groups": ["Latency"],
        "query": {
          "type": "prometheus",
          "customInlineTemplate": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app=\"checkout\",version=\"${scope}\"}[5m])) by (le))"
        },
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "critical": true
          }
        },
        "scopeName": "default"
      },
      {
        "name": "saturation_cpu",
        "groups": ["Saturation"],
        "query": {
          "type": "prometheus",
          "customInlineTemplate": "avg(rate(container_cpu_usage_seconds_total{app=\"checkout\",version=\"${scope}\"}[5m]))"
        },
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "critical": false
          }
        },
        "scopeName": "default"
      }
    ],
    "classifier": {
      "groupWeights": {
        "Errors": 40,
        "Latency": 35,
        "Saturation": 25
      }
    }
  }
}
```
### How Kayenta Scores Canaries
- Collect metrics from both canary and baseline for the analysis window
- Compare distributions using the Mann-Whitney U test (non-parametric)
- Score each metric as Pass, Marginal, or Fail
- Apply group weights (Errors 40%, Latency 35%, Saturation 25%)
- Produce a final score (0-100). Typically, >70 = promote, <50 = rollback, 50-70 = extend observation
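The weighting step above can be sketched in a few lines of Python. This is a simplified illustration, not Kayenta's actual implementation; the Pass/Marginal/Fail values and the example inputs are assumptions:

```python
def weighted_canary_score(group_results: dict, group_weights: dict) -> float:
    """Combine per-group Pass/Marginal/Fail verdicts into a 0-100 score."""
    value = {"Pass": 1.0, "Marginal": 0.5, "Fail": 0.0}
    total_weight = sum(group_weights.values())
    return 100.0 * sum(
        group_weights[group] * value[result]
        for group, result in group_results.items()
    ) / total_weight

score = weighted_canary_score(
    {"Errors": "Pass", "Latency": "Pass", "Saturation": "Marginal"},
    {"Errors": 40, "Latency": 35, "Saturation": 25},
)
print(score)  # 87.5 -> above the 70 threshold, so this canary would be promoted
```

Note how the weighting encodes priorities: a Saturation wobble only costs 12.5 points, while a failed Errors group alone would drag the score to 60 and block promotion.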
## Canary with Argo Rollouts (Kubernetes-Native)
For teams not using Spinnaker, Argo Rollouts provides Kubernetes-native canary deployments:
```yaml
# argo-canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        istio:
          virtualService:
            name: checkout-vsvc
            routes:
            - primary
      steps:
      - setWeight: 5              # 5% to canary
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: canary-analysis
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 25             # 25% to canary
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: canary-analysis
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 50             # 50% to canary
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: canary-analysis
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 100            # full traffic; the canary is promoted to stable
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis
spec:
  args:
  - name: canary-hash
  metrics:
  - name: error-rate
    interval: 1m
    count: 5                      # five measurements, one minute apart
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{status=~"5..",app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}"}[5m]))
          /
          sum(rate(http_requests_total{app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}"}[5m]))
  - name: latency-p99
    interval: 1m
    count: 5
    successCondition: result[0] < 0.5  # p99 under 500ms
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99, sum(rate(
            http_request_duration_seconds_bucket{app="checkout",
              rollouts_pod_template_hash="{{args.canary-hash}}"}[5m])) by (le))
```
## Key Decisions for Canary Deployments

### How Much Traffic to Send to Canary?
| Traffic Percentage | Use Case | Risk Level |
|---|---|---|
| 1% | High-risk changes (payment, auth) | Very low |
| 5% | Standard feature releases | Low |
| 10% | Low-risk changes with high confidence | Low |
| 25% | Changes that need more traffic volume for statistical significance | Medium |
### How Long to Observe?
The observation window depends on traffic volume and the statistical significance required:
- High traffic services (>1000 rps): 10-15 minutes provides enough data points
- Medium traffic (100-1000 rps): 30-60 minutes
- Low traffic (<100 rps): 2-6 hours (consider synthetic traffic augmentation)
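These windows follow from simple arithmetic: the canary must accumulate enough requests for the metric comparison to be meaningful. A back-of-the-envelope calculation (the 10,000-request target is an illustrative assumption, not a statistical guarantee):

```python
def canary_requests(rps: float, canary_fraction: float, window_minutes: float) -> float:
    """Total requests the canary sees during the observation window."""
    return rps * canary_fraction * window_minutes * 60

def minutes_needed(rps: float, canary_fraction: float,
                   target_requests: float = 10_000) -> float:
    """Window length needed for the canary to accumulate target_requests."""
    return target_requests / (rps * canary_fraction * 60)

print(canary_requests(1000, 0.05, 15))  # 45,000 requests: plenty at high traffic
print(round(minutes_needed(50, 0.05)))  # ~67 minutes at low traffic
```

This is why low-traffic services need hours-long windows (or synthetic traffic): at 50 rps and a 5% split, the canary sees only 150 requests per minute.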
### What Metrics to Compare?
At minimum, compare these between canary and baseline:
- Error rate (critical -- always include)
- Latency percentiles (p50, p95, p99)
- Saturation metrics (CPU, memory per pod)
- Business metrics (conversion rate, revenue per request -- if available in real-time)
## Common Canary Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Insufficient traffic | Cannot reach statistical significance | Increase canary percentage or observation window |
| Only checking averages | Masks tail latency regressions | Compare p95 and p99, not just mean |
| No automatic rollback | Human delay allows more users to be affected | Configure automatic rollback on metric threshold |
| Ignoring business metrics | Technically fast but functionally broken | Include conversion rate and error count in canary analysis |
| Same-version canary | Canary always passes because it is identical to baseline | Verify canary is actually running the new version |
| Cache warming effects | Canary starts slow due to cold caches | Allow a warm-up period before starting metric comparison |
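In Argo Rollouts, the last two pitfalls can be addressed directly in the analysis configuration: `initialDelay` defers metric collection past the warm-up period, and `failureLimit` triggers an automatic abort. A sketch (the template name, delay, and thresholds here are illustrative values, not recommendations):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis-with-warmup
spec:
  metrics:
  - name: error-rate
    initialDelay: 2m    # skip the cold-cache period before comparing metrics
    interval: 1m
    count: 10
    failureLimit: 2     # after 2 failed measurements, the rollout aborts automatically
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: ...      # same error-rate query as in the earlier example
```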
## Canary Deployment Checklist
Before enabling canary deployments:
- Metrics pipeline can differentiate traffic by version (labels, headers, or pod identity)
- Automated rollback is configured (not just manual intervention)
- Analysis window is long enough for your traffic volume
- Both canary and baseline are monitored by the same dashboards
- Alert routing accounts for canary failures (do not page for expected experiments)
- The team understands that a rollback is a success, not a failure -- you caught a problem before it reached all users