Kubernetes Scaling and Container Performance Testing
Validating Kubernetes Auto-Scaling
Kubernetes Horizontal Pod Autoscaler (HPA) promises automatic scaling based on CPU, memory, or custom metrics. But "configured" does not mean "working." Performance testing must validate that auto-scaling behaves correctly under real traffic conditions: that it scales up fast enough to absorb traffic spikes and scales down gracefully without disrupting active connections.
k6 Test for HPA Validation
// k6-container-scaling-test.js
// Verify Kubernetes HPA responds correctly to traffic spikes
import http from 'k6/http';
import { check } from 'k6';
import { Trend } from 'k6/metrics';

const scalingLatency = new Trend('scaling_response_time', true);

export const options = {
  scenarios: {
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 10,
      timeUnit: '1s',
      preAllocatedVUs: 50,
      maxVUs: 500,
      stages: [
        { duration: '1m', target: 10 },   // baseline
        { duration: '30s', target: 200 }, // sudden spike
        { duration: '5m', target: 200 },  // sustain spike (HPA should scale)
        { duration: '30s', target: 10 },  // drop back
        { duration: '5m', target: 10 },   // verify scale-down
      ],
    },
  },
  thresholds: {
    // Even during spike, 95th percentile should stay under 2s
    // (once HPA has scaled, which may take 1-2 minutes)
    http_req_duration: ['p(95)<2000'],
    http_req_failed: ['rate<0.05'],
  },
};

export default function () {
  const res = http.get('https://app.example.com/api/heavy-computation');
  scalingLatency.add(res.timings.duration);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'no 503 (service unavailable)': (r) => r.status !== 503,
  });
}
What to Monitor During the Test
While k6 runs, monitor the Kubernetes cluster in a parallel terminal or dashboard:
# Watch pod count change in real-time
kubectl get pods -l app=my-service -w
# Watch HPA status
kubectl get hpa my-service-hpa -w
# Check HPA events for scaling decisions
kubectl describe hpa my-service-hpa
Expected Timeline
T+0:00 - 10 req/s, 3 pods (baseline)
T+1:00 - Spike to 200 req/s, latency increases immediately
T+1:30 - HPA detects CPU > target, begins scaling
T+2:00 - New pods scheduled, pulling images
T+2:30 - New pods running, latency begins to decrease
T+3:00 - Full scale-up complete (e.g., 15 pods), latency normalized
T+6:30 - Traffic drops to 10 req/s
T+7:00 - HPA begins scale-down (cooldown period)
T+11:30 - Scale-down complete, back to 3 pods
Critical question: what happens to users between T+1:00 and T+3:00 (the scaling gap)? This is where you discover whether your HPA configuration is adequate.
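With `k6 run --out json=results.json`, the scaling gap can be quantified after the fact: k6's JSON output is one JSON object per line, with `http_req_duration` samples carrying a `data.time` timestamp and `data.value` in milliseconds. A nearest-rank p95 sketch over a time window:

```javascript
// scaling-gap-p95.js -- p95 of http_req_duration inside [startMs, endMs)
// from k6's line-delimited JSON output. Nearest-rank percentile, as a sketch.
function p95InWindow(lines, startMs, endMs) {
  const values = [];
  for (const line of lines) {
    if (!line.trim()) continue;
    const point = JSON.parse(line);
    if (point.type !== 'Point' || point.metric !== 'http_req_duration') continue;
    const t = Date.parse(point.data.time);
    if (t >= startMs && t < endMs) values.push(point.data.value);
  }
  if (values.length === 0) return null; // no samples in the window
  values.sort((a, b) => a - b);
  return values[Math.min(values.length - 1, Math.floor(values.length * 0.95))];
}

// Usage (after the run):
// const lines = require('fs').readFileSync('results.json', 'utf8').split('\n');
// console.log(p95InWindow(lines, startMs, endMs));
```

Run it once for the scaling-gap window and once for steady state; the gap between the two numbers is the cost of your current HPA reaction time.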
HPA Configuration for Performance
Basic CPU-Based HPA
# hpa-cpu-based.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # scale up when avg CPU > 60%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30 # react to spikes quickly
      policies:
        - type: Percent
          value: 100 # can double pod count per scaling event
          periodSeconds: 60
        - type: Pods
          value: 5 # or add 5 pods, whichever is larger
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10 # remove at most 10% of pods per interval
          periodSeconds: 60
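The scaleUp policy arithmetic can be sketched as a function: with `selectPolicy: Max`, the controller allows whichever policy permits the larger replica count per `periodSeconds`. A simplified sketch (the real controller also applies the stabilization window and min/max bounds):

```javascript
// hpa-scaleup-math.js -- how the Percent and Pods policies above combine
// under selectPolicy: Max.
function maxScaleUp(currentReplicas, percent = 100, pods = 5) {
  const byPercent = Math.ceil(currentReplicas * (1 + percent / 100)); // "double"
  const byPods = currentReplicas + pods;                              // "+5 pods"
  return Math.max(byPercent, byPods);
}

// From 3 pods, one scaling period can reach max(6, 8) = 8;
// from 20 pods, it can reach max(40, 25) = 40.
```

At small replica counts the Pods policy dominates; at large counts the Percent policy does, which is exactly why the manifest specifies both.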
Custom Metrics HPA (Requests Per Second)
CPU-based HPA is often too slow for traffic-driven scaling. Custom metrics based on request rate can be more responsive:
# hpa-custom-metrics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100 # target 100 req/s per pod
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
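The per-pod target drives the standard HPA formula from the Kubernetes documentation, `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`, clamped to min/max replicas. A sketch using the values from the manifest above:

```javascript
// hpa-desired-replicas.js -- the core HPA scaling formula, with the
// 100 req/s-per-pod target and 3/50 replica bounds from the manifest.
function desiredReplicas(current, avgRps, { target = 100, min = 3, max = 50 } = {}) {
  const desired = Math.ceil(current * (avgRps / target));
  return Math.min(max, Math.max(min, desired));
}

// 5 pods each seeing 180 req/s against a 100 req/s target:
// ceil(5 * 1.8) = 9 pods.
```

Because the metric is request rate rather than CPU, the formula reacts on the first metrics-sync after the spike, without waiting for CPU pressure to build.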
Performance Testing Architecture Decision Matrix
Different architectures have different performance concerns. Use this matrix to select the right testing strategy:
| Architecture | Key Concern | Primary Tool | Key Metric |
|---|---|---|---|
| Monolith | Thread pool exhaustion | k6 / JMeter | Concurrent connections |
| Microservices | Inter-service latency, cascading failures | k6 + distributed tracing | End-to-end p99 latency |
| Serverless | Cold starts, concurrency limits | k6 + CloudWatch | TTFB, concurrent executions |
| Edge/CDN | Cache hit ratio, origin load | k6 from multiple regions | Cache hit %, origin req/s |
| LLM-backed | Token throughput, rate limits | k6 custom metrics | TPS, TTFT, rate limit hits |
| Event-driven | Queue depth, consumer lag | k6 + queue metrics | Consumer lag, processing latency |
Testing Cascading Failures in Microservices
In a microservices architecture, a slow downstream service can cause cascading failures upstream. Test this scenario explicitly:
// k6-cascading-failure-test.js
// Test: What happens when the payment service is slow?
import http from 'k6/http';
import { check } from 'k6';
import { Trend, Rate } from 'k6/metrics';

const orderLatency = new Trend('order_creation_latency', true);
const cascadeErrorRate = new Rate('cascade_errors');

export const options = {
  scenarios: {
    normal_traffic: {
      executor: 'constant-arrival-rate',
      rate: 50,
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 50,
      maxVUs: 200,
    },
  },
  thresholds: {
    order_creation_latency: ['p(95)<5000'], // total order flow under 5s
    cascade_errors: ['rate<0.1'],           // under 10% cascade errors
  },
};

export default function () {
  // This test runs against the order service while separately
  // injecting latency into the payment service (via Litmus/Chaos Mesh)
  const res = http.post(
    'https://staging.example.com/api/orders',
    JSON.stringify({
      items: [{ sku: 'TEST-1', qty: 1 }],
      payment: { method: 'card', token: 'tok_test' },
    }),
    { headers: { 'Content-Type': 'application/json' }, timeout: '30s' }
  );
  orderLatency.add(res.timings.duration);
  check(res, {
    'order created or gracefully degraded': (r) =>
      r.status === 201 || r.status === 202 || r.status === 503,
    'no 500 internal errors': (r) => r.status !== 500,
  });
  // A 503 with a Retry-After header is acceptable (circuit breaker open)
  // A 500 is a cascading failure bug
  cascadeErrorRate.add(res.status === 500);
}
Run this k6 test simultaneously with a Litmus network-latency experiment on the payment service to validate that circuit breakers, timeouts, and fallback logic work correctly.
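The pass/fail logic those checks encode can be isolated into a plain function. This is a sketch: the `Retry-After` convention and the exact set of "graceful" status codes are assumptions to adapt to your own circuit-breaker contract.

```javascript
// classify-response.js -- distinguish graceful degradation from a
// cascading failure, mirroring the checks in the k6 test above.
function classify(status, headers = {}) {
  if (status === 201 || status === 202) return 'success';
  // An open circuit breaker should shed load explicitly and tell
  // clients when to retry.
  if (status === 503 && headers['Retry-After']) return 'degraded';
  // An unhandled timeout bubbling up as a 500 is the cascade bug.
  if (status === 500) return 'cascade-failure';
  return 'other';
}
```

Keeping this as a shared helper means the same definition of "graceful" is used by the k6 checks, dashboards, and any post-hoc analysis.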
Resource Limits and Performance
Kubernetes resource requests and limits directly affect performance. Under-provisioned containers are CPU-throttled or OOMKilled at the worst possible moment, precisely when load peaks:
# resource-config-for-performance.yaml
resources:
  requests:
    cpu: 500m     # guaranteed CPU allocation
    memory: 512Mi # guaranteed memory allocation
  limits:
    cpu: 2000m    # burst capacity (4x request)
    memory: 1Gi   # hard memory limit (OOMKill if exceeded)
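Requests and limits also determine the pod's QoS class, which controls eviction order under node pressure. A simplified, single-container sketch of the standard Kubernetes rules:

```javascript
// qos-class.js -- derive the Kubernetes QoS class implied by a resource
// block (simplified to one container). The manifest above, with limits
// above requests, lands in Burstable: it can burst, but it is evicted
// before Guaranteed pods when the node runs out of memory.
function qosClass({ requests = {}, limits = {} }) {
  const keys = ['cpu', 'memory'];
  if (keys.every((k) => !requests[k] && !limits[k])) return 'BestEffort';
  if (keys.every((k) => requests[k] && limits[k] && requests[k] === limits[k]))
    return 'Guaranteed';
  return 'Burstable';
}
```

For latency-critical services, setting requests equal to limits (Guaranteed) trades burst capacity for predictable scheduling and eviction behavior.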
Performance Testing Resource Configurations
Test your service under different resource configurations to find the optimal settings:
| Configuration | CPU Request/Limit | Memory Request/Limit | Test Result |
|---|---|---|---|
| Minimal | 100m/500m | 128Mi/256Mi | 50 req/s, p99=2.5s, OOMKill under load |
| Conservative | 250m/1000m | 256Mi/512Mi | 100 req/s, p99=800ms, stable |
| Optimal | 500m/2000m | 512Mi/1Gi | 200 req/s, p99=400ms, stable |
| Generous | 1000m/4000m | 1Gi/2Gi | 200 req/s, p99=350ms, diminishing returns |
The "optimal" configuration achieves target performance without wasting resources. Load testing is the only way to find this sweet spot.
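One way to make the diminishing returns explicit is throughput per requested CPU core, computed straight from the table above:

```javascript
// efficiency.js -- req/s delivered per requested CPU core
// (millicores converted to cores).
function reqPerCore(reqPerSec, cpuRequestMillicores) {
  return reqPerSec / (cpuRequestMillicores / 1000);
}

// Conservative: 100 req/s / 0.25 cores = 400 req/s per core
// Optimal:      200 req/s / 0.5  cores = 400 req/s per core
// Generous:     200 req/s / 1.0  cores = 200 req/s per core (half the efficiency)
```

"Optimal" doubles throughput at the same efficiency as "Conservative"; "Generous" buys almost nothing while halving efficiency, which is what "diminishing returns" means in concrete terms.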
CI Pipeline Integration: Full Performance and Chaos Workflow
# .github/workflows/performance-chaos.yml
name: Performance & Chaos Pipeline
on:
  push:
    branches: [main]
jobs:
  performance-budget:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build
      - name: Lighthouse CI
        uses: treosh/lighthouse-ci-action@v11
        with:
          configPath: ./lighthouserc.json
  load-test-staging:
    needs: performance-budget
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run k6 load tests
        uses: grafana/k6-action@v0.4.0
        with:
          filename: tests/performance/k6-load-test.js
          flags: --out json=results.json
        env:
          K6_TARGET_URL: ${{ secrets.STAGING_URL }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: k6-results
          path: results.json
  chaos-staging:
    needs: load-test-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Litmus
        run: |
          kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
      - name: Run chaos experiments
        run: |
          kubectl apply -f chaos/pod-delete-experiment.yaml
          kubectl apply -f chaos/network-delay-experiment.yaml
      - name: Wait for chaos completion
        run: |
          kubectl wait --for=condition=complete chaosresult/checkout-chaos \
            --timeout=600s
      - name: Verify SLOs held during chaos
        run: |
          python scripts/verify_slo_during_chaos.py \
            --prometheus-url ${{ secrets.PROMETHEUS_URL }} \
            --slo-config slo/checkout-api.yaml \
            --chaos-window 10m
This pipeline validates every merge to main for both performance (Lighthouse + k6) and resilience (Litmus chaos). The system must not only be fast; it must stay fast when things go wrong.
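The verification step's core logic can be sketched as follows, in JavaScript here for consistency with the other examples; the actual `scripts/verify_slo_during_chaos.py` is repository-specific, and the error-rate samples are assumed to come from a Prometheus range query over the chaos window.

```javascript
// verify-slo.js -- did the SLO hold during the chaos window?
// errorRateSamples: error-rate values (0..1) sampled over the window.
function sloHeld(errorRateSamples, maxErrorRate = 0.01) {
  // Missing data is treated as a failure: if Prometheus has no samples
  // for the window, the chaos run cannot be declared safe.
  if (errorRateSamples.length === 0) return false;
  const worst = Math.max(...errorRateSamples);
  return worst <= maxErrorRate;
}
```

Failing the CI job when this returns false turns "the system stays fast when things go wrong" from an aspiration into an enforced gate.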