
Litmus Chaos Experiments

What Is LitmusChaos?

LitmusChaos is a CNCF (Cloud Native Computing Foundation) incubating project that provides a complete framework for practicing chaos engineering on Kubernetes. It offers a rich library of pre-built experiments, a declarative YAML-based experiment definition, and built-in probes to validate system behavior during chaos.

For QA architects working with Kubernetes-based systems, Litmus is the most practical starting point for chaos engineering because it integrates natively with the Kubernetes API and follows familiar Kubernetes resource patterns.


Litmus Architecture

  +-------------------+
  | ChaosCenter UI    |  Optional web dashboard for experiment management
  +--------+----------+
           |
           v
  +--------+----------+
  | Chaos Operator    |  Watches for ChaosEngine resources
  | (K8s Controller)  |
  +--------+----------+
           |
           v
  +--------+----------+
  | ChaosEngine       |  Links an experiment to a target application
  +--------+----------+
           |
           v
  +--------+----------+
  | ChaosExperiment   |  Defines WHAT failure to inject (pod-kill, network-loss, etc.)
  +--------+----------+
           |
           v
  +--------+----------+
  | Runner Pod        |  Executes the experiment and collects results
  +--------+----------+
           |
           v
  +--------+----------+
  | ChaosResult       |  Stores the experiment outcome (Pass/Fail/Awaited)
  +-------------------+

Writing Your First Experiment: Pod Delete

The pod-delete experiment verifies that your application survives pod termination -- the most basic resilience test. If your app cannot handle a pod being killed, it cannot handle anything.

Step 1: Define the ChaosExperiment

# chaos-experiment-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: ["", "apps"]
        resources: ["pods", "deployments"]
        verbs: ["get", "list", "delete"]
    image: "litmuschaos/go-runner:latest"
    imagePullPolicy: Always
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"          # chaos lasts 60 seconds
      - name: CHAOS_INTERVAL
        value: "10"          # kill a pod every 10 seconds
      - name: FORCE
        value: "false"       # graceful termination (SIGTERM)
      - name: TARGET_PODS
        value: ""            # random pod selection
      - name: PODS_AFFECTED_PERC
        value: "50"          # kill 50% of pods

Step 2: Create the ChaosEngine

The ChaosEngine links the experiment to your target application and defines observability probes:

# chaos-engine-checkout.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=checkout-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: PODS_AFFECTED_PERC
              value: "50"
        probe:
          - name: "checkout-availability"
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: "https://checkout.internal/health"
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5s
              interval: 5s
              retry: 2
              probePollingInterval: 2s
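
The engine references a chaosServiceAccount named litmus-admin, which must already exist with permission to manage pods and chaos resources in the target namespace. A minimal sketch of that RBAC (the official litmus-admin manifest grants a broader rule set; trim or extend to match your cluster policy):

```yaml
# rbac-litmus-admin.yaml -- minimal sketch, not the official manifest
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: litmus-admin
  namespace: production
rules:
  - apiGroups: ["", "apps", "batch", "litmuschaos.io"]
    resources: ["pods", "pods/log", "events", "deployments", "jobs",
                "chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: litmus-admin
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: production
```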

Step 3: Apply and Monitor

# Apply the experiment and engine
kubectl apply -f chaos-experiment-pod-delete.yaml
kubectl apply -f chaos-engine-checkout.yaml

# Watch experiment progress
kubectl get chaosengine checkout-chaos -n production -w

# Check results when complete
kubectl get chaosresult checkout-chaos-pod-delete -n production -o yaml
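
It helps to know the result object's layout before automating on it. A ChaosResult for the engine above would look roughly like this (values illustrative; the exact schema varies slightly across Litmus versions):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: checkout-chaos-pod-delete   # <engine-name>-<experiment-name>
  namespace: production
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                   # Pass | Fail | Awaited
    failStep: "N/A"
  probeStatuses:
    - name: checkout-availability
      type: httpProbe
      status:
        verdict: Passed
```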

Litmus Probes: Validating Behavior During Chaos

Probes are what transform chaos experiments from "break and hope" into "break and measure." Litmus supports four probe types:

HTTP Probe

Continuously checks that an HTTP endpoint returns expected responses during chaos:

probe:
  - name: "api-health-check"
    type: httpProbe
    mode: Continuous
    httpProbe/inputs:
      url: "https://api.internal/health"
      method:
        get:
          criteria: ==
          responseCode: "200"
    runProperties:
      probeTimeout: 10s
      interval: 5s
      retry: 3

Command Probe

Runs a shell command and evaluates its exit code or output:

probe:
  - name: "database-connectivity"
    type: cmdProbe
    mode: Edge    # runs at start and end of chaos
    cmdProbe/inputs:
      # Litmus compares the command's stdout against the expected value,
      # so echo the exit status (-q suppresses pg_isready's own output)
      command: "pg_isready -q -h db.internal -p 5432; echo $?"
      comparator:
        type: int
        criteria: ==
        value: "0"   # pg_isready exits 0 when the server accepts connections
    runProperties:
      probeTimeout: 15s
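
One caveat: the default experiment image is unlikely to ship pg_isready. A cmdProbe can instead run its command in a dedicated pod via a source section that names an image providing the binary (image tag illustrative):

```yaml
cmdProbe/inputs:
  command: "pg_isready -q -h db.internal -p 5432; echo $?"
  source:
    image: "postgres:16-alpine"   # any image that bundles pg_isready
  comparator:
    type: int
    criteria: ==
    value: "0"
```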

Prometheus Probe

Queries Prometheus and asserts metric values stay within bounds:

probe:
  - name: "error-rate-within-slo"
    type: promProbe
    mode: Continuous
    promProbe/inputs:
      endpoint: "http://prometheus.monitoring:9090"
      query: >
        sum(rate(http_requests_total{status=~"5..",service="checkout"}[5m]))
        /
        sum(rate(http_requests_total{service="checkout"}[5m]))
      comparator:
        type: float
        criteria: <
        value: "0.01"    # error rate must stay under 1%
    runProperties:
      probeTimeout: 10s
      interval: 30s

Kubernetes Probe

Checks the state of Kubernetes resources:

probe:
  - name: "min-replicas-available"
    type: k8sProbe
    mode: Continuous
    k8sProbe/inputs:
      group: apps
      version: v1
      resource: deployments
      namespace: production
      fieldSelector: "metadata.name=checkout-service"
      operation: present
    runProperties:
      probeTimeout: 10s
      interval: 10s

Common Experiment Patterns

Network Chaos: Simulating Latency

# network-latency-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-network-latency
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "120"
      - name: NETWORK_LATENCY
        value: "500"         # 500ms additional latency
      - name: JITTER
        value: "100"         # +/- 100ms jitter
      - name: DESTINATION_IPS
        value: ""            # all destinations
      - name: DESTINATION_HOSTS
        value: "payment-service.production.svc.cluster.local"
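
As with pod-delete, the experiment only runs once a ChaosEngine references it. A sketch pairing this latency experiment with a Prometheus probe on p99 latency (engine name, metric name, and threshold are illustrative):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=checkout-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        probe:
          - name: "p99-latency-under-2s"
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring:9090"
              query: >
                histogram_quantile(0.99,
                  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
              comparator:
                type: float
                criteria: <
                value: "2"    # p99 must stay under 2 seconds during chaos
            runProperties:
              probeTimeout: 10s
              interval: 30s
```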

DNS Chaos: Simulating Resolution Failures

# dns-chaos-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-dns-error
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"
      - name: TARGET_HOSTNAMES
        value: "external-api.vendor.com"
      - name: CHAOS_TYPE
        value: "error"     # return NXDOMAIN for target hostnames

Disk Fill: Simulating Storage Pressure

# disk-fill-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: disk-fill
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "120"
      - name: FILL_PERCENTAGE
        value: "90"         # fill disk to 90%
      - name: CONTAINER_PATH
        value: "/var/log"   # target the log directory

Automating Litmus in CI/CD

Integrate chaos experiments into your deployment pipeline so that resilience is validated on every release:

# .github/workflows/chaos-gate.yml
name: Chaos Resilience Gate
on:
  push:
    branches: [main]

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3

      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/

      - name: Wait for rollout
        run: kubectl rollout status deployment/checkout-service -n staging --timeout=300s

      - name: Run pod-delete chaos experiment
        run: |
          kubectl apply -f chaos/experiments/pod-delete.yaml
          kubectl apply -f chaos/engines/checkout-pod-delete.yaml

      - name: Wait for chaos completion
        run: |
          # ChaosResult exposes no "complete" condition for kubectl wait,
          # so poll until the verdict moves past Awaited (up to 10 minutes)
          for i in $(seq 1 60); do
            VERDICT=$(kubectl get chaosresult checkout-chaos-pod-delete \
              -n staging -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null)
            if [ -n "$VERDICT" ] && [ "$VERDICT" != "Awaited" ]; then
              break
            fi
            sleep 10
          done

      - name: Validate chaos result
        run: |
          VERDICT=$(kubectl get chaosresult checkout-chaos-pod-delete \
            -n staging -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Chaos experiment verdict: $VERDICT"
          if [ "$VERDICT" != "Pass" ]; then
            echo "FAIL: System did not survive chaos experiment"
            exit 1
          fi

Experiment Design Best Practices

  1. One variable at a time. Do not inject network latency and kill pods simultaneously in your first experiments. Isolate variables to understand each failure mode independently.
  2. Define probes before running. If you run an experiment without probes, you are just breaking things without measuring the impact.
  3. Start with the smallest blast radius. Kill one pod, not 50%. Add 100ms latency, not 5 seconds. Increase gradually as you build confidence.
  4. Document every experiment. Record the hypothesis, configuration, results, and action items. This creates institutional knowledge about your system's failure modes.
  5. Make experiments idempotent. Running the same experiment twice should not leave the system in a different state. Litmus handles cleanup automatically, but verify your probes are not creating side effects.
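
Practice 3 translates directly into engine configuration. A conservative first run pins the chaos to a single, explicitly named pod for a short window (pod name hypothetical; this is a fragment of a ChaosEngine spec, not a complete manifest):

```yaml
# Smallest useful blast radius: one named pod, 30 seconds of chaos
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "30"
          - name: TARGET_PODS
            value: "checkout-service-7d4b9c-x2k1f"   # hypothetical pod name
          - name: PODS_AFFECTED_PERC
            value: ""        # not used when TARGET_PODS names specific pods
```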