Litmus Chaos Experiments
What Is LitmusChaos?
LitmusChaos is a CNCF (Cloud Native Computing Foundation) incubating project that provides a complete framework for practicing chaos engineering on Kubernetes. It offers a rich library of pre-built experiments, declarative YAML-based experiment definitions, and built-in probes that validate system behavior during chaos.
For QA architects working with Kubernetes-based systems, Litmus is the most practical starting point for chaos engineering because it integrates natively with the Kubernetes API and follows familiar Kubernetes resource patterns.
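Before any experiments can run, the Litmus control plane must be installed in the cluster. A minimal sketch using the official Helm chart (the release name "chaos" and the "litmus" namespace are illustrative choices; check the Litmus docs for the version matching your cluster):

```shell
# Add the LitmusChaos Helm repository and install the control plane.
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

helm install chaos litmuschaos/litmus \
  --namespace litmus \
  --create-namespace

# Verify the operator pods are running before creating any chaos resources
kubectl get pods -n litmus
```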
Litmus Architecture
+--------------------+
|  ChaosCenter UI    |  Optional web dashboard for experiment management
+---------+----------+
          |
          v
+---------+----------+
|  Chaos Operator    |  Watches for ChaosEngine resources
|  (K8s Controller)  |
+---------+----------+
          |
          v
+---------+----------+
|  ChaosEngine       |  Links an experiment to a target application
+---------+----------+
          |
          v
+---------+----------+
|  ChaosExperiment   |  Defines WHAT failure to inject (pod-kill, network-loss, etc.)
+---------+----------+
          |
          v
+---------+----------+
|  Runner Pod        |  Executes the experiment and collects results
+---------+----------+
          |
          v
+---------+----------+
|  ChaosResult       |  Stores the experiment outcome (Pass/Fail/Awaited)
+--------------------+
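ChaosEngine, ChaosExperiment, and ChaosResult are custom resource definitions (CRDs). Once the operator is installed, you can confirm they are registered (a sketch, assuming kubectl access to the cluster):

```shell
# List the custom resources that Litmus registers under its API group.
# Expect entries for chaosengines, chaosexperiments, and chaosresults.
kubectl api-resources --api-group=litmuschaos.io
```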
Writing Your First Experiment: Pod Delete
The pod-delete experiment verifies that your application survives pod termination -- the most basic resilience test. If your app cannot handle a pod being killed, it cannot handle anything.
Step 1: Define the ChaosExperiment
# chaos-experiment-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: ["", "apps"]
        resources: ["pods", "deployments"]
        verbs: ["get", "list", "delete"]
    image: "litmuschaos/go-runner:latest"
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"    # chaos lasts 60 seconds
      - name: CHAOS_INTERVAL
        value: "10"    # kill a pod every 10 seconds
      - name: FORCE
        value: "false" # graceful termination (SIGTERM)
      - name: TARGET_PODS
        value: ""      # empty = random pod selection
      - name: PODS_AFFECTED_PERC
        value: "50"    # kill 50% of pods
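The permissions block above only declares what the experiment needs; at runtime the experiment executes under a dedicated service account. A minimal sketch of that account and its binding (names are illustrative, and the account must live in the namespace where the ChaosEngine runs; the official litmus-admin manifests grant a broader verb set, including events and pods/log):

```yaml
# rbac-pod-delete.yaml -- illustrative ServiceAccount + Role + RoleBinding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-role
  namespace: litmus
rules:
  # Mirror the permissions declared in the ChaosExperiment
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments"]
    verbs: ["get", "list", "delete"]
  # Allow the runner to manage Litmus resources
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-rolebinding
  namespace: litmus
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: litmus
roleRef:
  kind: Role
  name: pod-delete-role
  apiGroup: rbac.authorization.k8s.io
```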
Step 2: Create the ChaosEngine
The ChaosEngine links the experiment to your target application and defines observability probes:
# chaos-engine-checkout.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=checkout-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: PODS_AFFECTED_PERC
              value: "50"
        probe:
          - name: "checkout-availability"
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: "https://checkout.internal/health"
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5s
              interval: 5s
              retry: 2
              probePollingInterval: 2s
Step 3: Apply and Monitor
# Apply the experiment and engine
kubectl apply -f chaos-experiment-pod-delete.yaml
kubectl apply -f chaos-engine-checkout.yaml
# Watch experiment progress
kubectl get chaosengine checkout-chaos -n production -w
# Check results when complete
kubectl get chaosresult checkout-chaos-pod-delete -n production -o yaml
Litmus Probes: Validating Behavior During Chaos
Probes are what transform chaos experiments from "break and hope" into "break and measure." Litmus supports four probe types:
HTTP Probe
Continuously checks that an HTTP endpoint returns expected responses during chaos:
probe:
  - name: "api-health-check"
    type: httpProbe
    mode: Continuous
    httpProbe/inputs:
      url: "https://api.internal/health"
      method:
        get:
          criteria: ==
          responseCode: "200"
    runProperties:
      probeTimeout: 10s
      interval: 5s
      retry: 3
Command Probe
Runs a shell command and evaluates its exit code or output:
probe:
  - name: "database-connectivity"
    type: cmdProbe
    mode: Edge  # runs at start and end of chaos
    cmdProbe/inputs:
      # The comparator evaluates the command's stdout, so echo the
      # exit status of pg_isready rather than relying on its output.
      command: "pg_isready -h db.internal -p 5432 > /dev/null 2>&1; echo $?"
      comparator:
        type: int
        criteria: ==
        value: "0"  # exit status 0 = server accepting connections
    runProperties:
      probeTimeout: 15s
Prometheus Probe
Queries Prometheus and asserts metric values stay within bounds:
probe:
  - name: "error-rate-within-slo"
    type: promProbe
    mode: Continuous
    promProbe/inputs:
      endpoint: "http://prometheus.monitoring:9090"
      query: >
        sum(rate(http_requests_total{status=~"5..",service="checkout"}[5m]))
        /
        sum(rate(http_requests_total{service="checkout"}[5m]))
      comparator:
        type: float
        criteria: <
        value: "0.01"  # error rate must stay under 1%
    runProperties:
      probeTimeout: 10s
      interval: 30s
Kubernetes Probe
Checks the state of Kubernetes resources:
probe:
  - name: "checkout-deployment-present"
    type: k8sProbe
    mode: Continuous
    k8sProbe/inputs:
      group: apps
      version: v1
      resource: deployments
      namespace: production
      fieldSelector: "metadata.name=checkout-service"
      operation: present  # checks existence (operations: create, delete, present, absent)
    runProperties:
      probeTimeout: 10s
      interval: 10s
Common Experiment Patterns
Network Chaos: Simulating Latency
# network-latency-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-network-latency
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "120"
      - name: NETWORK_LATENCY
        value: "500"  # 500ms additional latency
      - name: JITTER
        value: "100"  # +/- 100ms jitter
      - name: DESTINATION_IPS
        value: ""     # no IP filter; scope by hostname instead
      - name: DESTINATION_HOSTS
        value: "payment-service.production.svc.cluster.local"
DNS Chaos: Simulating Resolution Failures
# dns-chaos-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-dns-error
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"
      - name: TARGET_HOSTNAMES
        value: "external-api.vendor.com"
      - name: CHAOS_TYPE
        value: "error"  # return NXDOMAIN for target hostnames
Disk Fill: Simulating Storage Pressure
# disk-fill-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: disk-fill
  namespace: litmus
spec:
  definition:
    scope: Namespaced
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "120"
      - name: FILL_PERCENTAGE
        value: "90"       # fill ephemeral storage to 90%
      - name: CONTAINER_PATH
        value: "/var/log" # target the log directory
Automating Litmus in CI/CD
Integrate chaos experiments into your deployment pipeline so that resilience is validated on every release:
# .github/workflows/chaos-gate.yml
name: Chaos Resilience Gate
on:
  push:
    branches: [main]
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        # (cluster credential / kubeconfig configuration omitted for brevity)
      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/
      - name: Wait for rollout
        run: kubectl rollout status deployment/checkout-service -n staging --timeout=300s
      - name: Run pod-delete chaos experiment
        run: |
          kubectl apply -f chaos/experiments/pod-delete.yaml
          kubectl apply -f chaos/engines/checkout-pod-delete.yaml
      - name: Wait for chaos completion
        run: |
          # ChaosResult has no "complete" condition; wait on the phase field instead
          kubectl wait chaosresult/checkout-chaos-pod-delete \
            --for=jsonpath='{.status.experimentStatus.phase}'=Completed \
            -n staging --timeout=600s
      - name: Validate chaos result
        run: |
          VERDICT=$(kubectl get chaosresult checkout-chaos-pod-delete \
            -n staging -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Chaos experiment verdict: $VERDICT"
          if [ "$VERDICT" != "Pass" ]; then
            echo "FAIL: System did not survive chaos experiment"
            exit 1
          fi
Experiment Design Best Practices
- One variable at a time. Do not inject network latency and kill pods simultaneously in your first experiments. Isolate variables to understand each failure mode independently.
- Define probes before running. If you run an experiment without probes, you are just breaking things without measuring the impact.
- Start with the smallest blast radius. Kill one pod, not 50%. Add 100ms latency, not 5 seconds. Increase gradually as you build confidence.
- Document every experiment. Record the hypothesis, configuration, results, and action items. This creates institutional knowledge about your system's failure modes.
- Make experiments idempotent. Running the same experiment twice should not leave the system in a different state. Litmus handles cleanup automatically, but verify your probes are not creating side effects.
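"Start with the smallest blast radius" maps directly onto the experiment tunables shown earlier. A conservative first-run override for the pod-delete engine might look like the fragment below (values illustrative; per the Litmus tunable defaults, an empty PODS_AFFECTED_PERC targets a single pod):

```yaml
# Fragment of a ChaosEngine spec: conservative first-run tunables
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "30"    # a short 30-second window
          - name: CHAOS_INTERVAL
            value: "30"    # at most one kill within that window
          - name: FORCE
            value: "false" # graceful SIGTERM, not SIGKILL
          - name: PODS_AFFECTED_PERC
            value: ""      # empty = experiment default (one pod)
```

Once this passes repeatedly, widen the radius one tunable at a time: raise the percentage, lengthen the duration, then switch FORCE to "true".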