
Kubernetes Scaling and Container Performance Testing

Validating Kubernetes Auto-Scaling

Kubernetes Horizontal Pod Autoscaler (HPA) promises automatic scaling based on CPU, memory, or custom metrics. But "configured" does not mean "working." Performance testing must validate that auto-scaling behaves correctly under real traffic conditions: that it scales up fast enough to absorb traffic spikes and scales down gracefully without disrupting active connections.


k6 Test for HPA Validation

// k6-container-scaling-test.js
// Verify Kubernetes HPA responds correctly to traffic spikes
import http from 'k6/http';
import { check } from 'k6';
import { Trend } from 'k6/metrics';

const scalingLatency = new Trend('scaling_response_time', true);

export const options = {
  scenarios: {
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 10,
      timeUnit: '1s',
      preAllocatedVUs: 50,
      maxVUs: 500,
      stages: [
        { duration: '1m', target: 10 },    // baseline
        { duration: '30s', target: 200 },   // sudden spike
        { duration: '5m', target: 200 },    // sustain spike (HPA should scale)
        { duration: '30s', target: 10 },    // drop back
        { duration: '5m', target: 10 },     // verify scale-down
      ],
    },
  },
  thresholds: {
    // Even during spike, 95th percentile should stay under 2s
    // (once HPA has scaled, which may take 1-2 minutes)
    http_req_duration: ['p(95)<2000'],
    http_req_failed: ['rate<0.05'],
  },
};

export default function () {
  const res = http.get('https://app.example.com/api/heavy-computation');
  scalingLatency.add(res.timings.duration);

  check(res, {
    'status is 200': (r) => r.status === 200,
    'no 503 (service unavailable)': (r) => r.status !== 503,
  });
}

What to Monitor During the Test

While k6 runs, monitor the Kubernetes cluster in a parallel terminal or dashboard:

# Watch pod count change in real-time
kubectl get pods -l app=my-service -w

# Watch HPA status
kubectl get hpa my-service-hpa -w

# Check HPA events for scaling decisions
kubectl describe hpa my-service-hpa
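To correlate pod count against the k6 timeline afterwards, it helps to log the replica count with timestamps rather than just watching it scroll by. A minimal Python sketch (assuming kubectl is on PATH and the pods carry the app=my-service label used above):

```python
# poll_pod_count.py -- log the Running pod count over time so it can be
# lined up against the k6 stages afterwards (illustrative helper)
import subprocess
import time

def count_running_pods(kubectl_output: str) -> int:
    """Count rows whose STATUS column is 'Running' in `kubectl get pods` output."""
    count = 0
    for line in kubectl_output.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) >= 3 and cols[2] == "Running":
            count += 1
    return count

def poll(label: str = "app=my-service", interval_s: int = 10) -> None:
    """Print a timestamped pod count every interval_s seconds; Ctrl-C to stop."""
    while True:
        out = subprocess.run(
            ["kubectl", "get", "pods", "-l", label],
            capture_output=True, text=True, check=True,
        ).stdout
        print(f"{time.strftime('%H:%M:%S')}  running_pods={count_running_pods(out)}")
        time.sleep(interval_s)

# Run poll() in a spare terminal for the duration of the k6 test.
```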

Expected Timeline

T+0:00  - 10 req/s, 3 pods (baseline)
T+1:00  - Spike to 200 req/s, latency increases immediately
T+1:30  - HPA detects CPU > target, begins scaling
T+2:00  - New pods scheduled, pulling images
T+2:30  - New pods running, latency begins to decrease
T+3:00  - Full scale-up complete (e.g., 15 pods), latency normalized
T+6:30  - Traffic drops to 10 req/s
T+7:00  - HPA begins scale-down (cooldown period)
T+11:30 - Scale-down complete, back to 3 pods

Critical question: what happens to users between T+1:00 and T+3:00 (the scaling gap)? This is where you discover whether your HPA configuration is adequate.
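The cost of that gap can be estimated before the test even runs. A back-of-the-envelope sketch (the gap duration and error fraction here are illustrative, not measured values):

```python
def failed_requests_during_gap(spike_rps: float, gap_seconds: float,
                               error_rate: float) -> float:
    """Estimate requests degraded or lost while HPA catches up with a spike."""
    return spike_rps * gap_seconds * error_rate

# 200 req/s spike, ~120 s until new pods absorb the load,
# 20% of requests failing or timing out in the meantime:
print(failed_requests_during_gap(200, 120, 0.20))  # 4800.0
```

Nearly five thousand affected requests from a two-minute gap is why the scale-up behavior settings below matter.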


HPA Configuration for Performance

Basic CPU-Based HPA

# hpa-cpu-based.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale up when avg CPU > 60%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # react to spikes quickly
      policies:
        - type: Percent
          value: 100      # can double pod count per scaling event
          periodSeconds: 60
        - type: Pods
          value: 5         # or add 5 pods, whichever is larger
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10       # remove at most 10% of pods per interval
          periodSeconds: 60

Custom Metrics HPA (Requests Per Second)

CPU-based HPA is often too slow for traffic-driven scaling. Custom metrics based on request rate can be more responsive:

# hpa-custom-metrics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100   # target 100 req/s per pod
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
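For the Pods-type metric above, HPA derives the desired replica count from the average metric value across pods. The documented formula, sketched in Python with this HPA's min/max bounds:

```python
import math

def desired_replicas(current_replicas: int, avg_metric_per_pod: float,
                     target_per_pod: float, min_r: int = 3, max_r: int = 50) -> int:
    """HPA scaling formula:
    desired = ceil(currentReplicas * currentAverage / targetAverage),
    clamped to [minReplicas, maxReplicas].
    """
    desired = math.ceil(current_replicas * avg_metric_per_pod / target_per_pod)
    return max(min_r, min(max_r, desired))

# 3 pods each seeing 400 req/s against the 100 req/s target -> scale to 12
print(desired_replicas(3, 400, 100))  # 12
```

Note that the custom metric itself (http_requests_per_second) must be exposed to the HPA through a metrics adapter such as prometheus-adapter; the HPA object alone does nothing without it.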

Performance Testing Architecture Decision Matrix

Different architectures have different performance concerns. Use this matrix to select the right testing strategy:

| Architecture  | Key Concern                               | Primary Tool             | Key Metric                       |
|---------------|-------------------------------------------|--------------------------|----------------------------------|
| Monolith      | Thread pool exhaustion                    | k6 / JMeter              | Concurrent connections           |
| Microservices | Inter-service latency, cascading failures | k6 + distributed tracing | End-to-end p99 latency           |
| Serverless    | Cold starts, concurrency limits           | k6 + CloudWatch          | TTFB, concurrent executions      |
| Edge/CDN      | Cache hit ratio, origin load              | k6 from multiple regions | Cache hit %, origin req/s        |
| LLM-backed    | Token throughput, rate limits             | k6 custom metrics        | TPS, TTFT, rate limit hits       |
| Event-driven  | Queue depth, consumer lag                 | k6 + queue metrics       | Consumer lag, processing latency |

Testing Cascading Failures in Microservices

In a microservices architecture, a slow downstream service can cause cascading failures upstream. Test this scenario explicitly:

// k6-cascading-failure-test.js
// Test: What happens when the payment service is slow?
import http from 'k6/http';
import { check } from 'k6';
import { Trend, Rate } from 'k6/metrics';

const orderLatency = new Trend('order_creation_latency', true);
const cascadeErrorRate = new Rate('cascade_errors');

export const options = {
  scenarios: {
    normal_traffic: {
      executor: 'constant-arrival-rate',
      rate: 50,
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 50,
      maxVUs: 200,
    },
  },
  thresholds: {
    order_creation_latency: ['p(95)<5000'],  // total order flow under 5s
    cascade_errors: ['rate<0.1'],             // under 10% cascade errors
  },
};

export default function () {
  // This test runs against the order service while separately
  // injecting latency into the payment service (via Litmus/Chaos Mesh)
  const res = http.post('https://staging.example.com/api/orders',
    JSON.stringify({
      items: [{ sku: "TEST-1", qty: 1 }],
      payment: { method: "card", token: "tok_test" },
    }),
    { headers: { 'Content-Type': 'application/json' }, timeout: '30s' }
  );

  orderLatency.add(res.timings.duration);

  check(res, {
    'order created or gracefully degraded': (r) =>
      r.status === 201 || r.status === 202 || r.status === 503,
    'no 500 internal errors': (r) => r.status !== 500,
  });

  // A 503 with a retry-after header is acceptable (circuit breaker open)
  // A 500 is a cascading failure bug
  cascadeErrorRate.add(res.status === 500);
}

Run this k6 test simultaneously with a Litmus network-latency experiment on the payment service to validate that circuit breakers, timeouts, and fallback logic work correctly.
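A sketch of the companion Litmus experiment. The field names follow the Litmus ChaosEngine CRD, but the namespace, label, service account, and values here are assumptions for this example; check them against your Litmus version before use:

```yaml
# chaos/payment-network-delay.yaml (illustrative)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-latency
  namespace: staging
spec:
  engineState: active
  appinfo:
    appns: staging
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY       # injected delay in ms
              value: "2000"
            - name: TOTAL_CHAOS_DURATION  # seconds; cover the k6 run
              value: "600"
```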


Resource Limits and Performance

Kubernetes resource requests and limits directly affect performance. Under-provisioned containers throttle at the worst possible moment:

# resource-config-for-performance.yaml
resources:
  requests:
    cpu: 500m        # guaranteed CPU allocation
    memory: 512Mi    # guaranteed memory allocation
  limits:
    cpu: 2000m       # burst capacity (4x request)
    memory: 1Gi      # hard memory limit (OOMKill if exceeded)
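To see whether the CPU limit is actually throttling the container under load, read the cgroup stats from inside the pod. A sketch that parses cgroup v2's cpu.stat (field names per the kernel's cgroup v2 interface; cgroup v1 exposes the same counters under cpu.cfs_* files instead):

```python
def parse_throttle_stats(cpu_stat_text: str) -> dict:
    """Parse the `key value` lines of /sys/fs/cgroup/cpu.stat (cgroup v2)."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

def throttle_ratio(stats: dict) -> float:
    """Fraction of CFS scheduling periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

# Inside the pod during a load test:
#   stats = parse_throttle_stats(open("/sys/fs/cgroup/cpu.stat").read())
sample = "usage_usec 8120000\nnr_periods 1000\nnr_throttled 240\nthrottled_usec 950000"
print(round(throttle_ratio(parse_throttle_stats(sample)), 2))  # 0.24
```

A throttle ratio anywhere near 0.24, as in this made-up sample, means latency spikes are being caused by the limit, not by the application.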

Performance Testing Resource Configurations

Test your service under different resource configurations to find the optimal settings:

| Configuration | CPU Request/Limit | Memory Request/Limit | Test Result                               |
|---------------|-------------------|----------------------|-------------------------------------------|
| Minimal       | 100m/500m         | 128Mi/256Mi          | 50 req/s, p99=2.5s, OOMKill under load    |
| Conservative  | 250m/1000m        | 256Mi/512Mi          | 100 req/s, p99=800ms, stable              |
| Optimal       | 500m/2000m        | 512Mi/1Gi            | 200 req/s, p99=400ms, stable              |
| Generous      | 1000m/4000m       | 1Gi/2Gi              | 200 req/s, p99=350ms, diminishing returns |

The "optimal" configuration achieves target performance without wasting resources. Load testing is the only way to find this sweet spot.


CI Pipeline Integration: Full Performance and Chaos Workflow

# .github/workflows/performance-chaos.yml
name: Performance & Chaos Pipeline
on:
  push:
    branches: [main]

jobs:
  performance-budget:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build
      - name: Lighthouse CI
        uses: treosh/lighthouse-ci-action@v11
        with:
          configPath: ./lighthouserc.json

  load-test-staging:
    needs: performance-budget
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run k6 load tests
        uses: grafana/k6-action@v0.4.0
        with:
          filename: tests/performance/k6-load-test.js
          flags: --out json=results.json
        env:
          K6_TARGET_URL: ${{ secrets.STAGING_URL }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: k6-results
          path: results.json

  chaos-staging:
    needs: load-test-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Litmus
        run: |
          kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
      - name: Run chaos experiment
        run: |
          kubectl apply -f chaos/pod-delete-experiment.yaml
          kubectl apply -f chaos/network-delay-experiment.yaml
      - name: Wait for chaos completion
        run: |
          kubectl wait --for=condition=complete chaosresult/checkout-chaos \
            --timeout=600s
      - name: Verify SLOs held during chaos
        run: |
          python scripts/verify_slo_during_chaos.py \
            --prometheus-url ${{ secrets.PROMETHEUS_URL }} \
            --slo-config slo/checkout-api.yaml \
            --chaos-window 10m

This pipeline ensures that every merge to main is validated for both performance (Lighthouse + k6) and resilience (Litmus chaos). The system must not only be fast; it must stay fast when things go wrong.
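The verification script the chaos job calls can be small at its core. A sketch of the essential check, assuming Prometheus's standard /api/v1/query HTTP API; the actual PromQL queries and thresholds would live in slo/checkout-api.yaml, and error handling is omitted:

```python
import json
import urllib.parse
import urllib.request

def query_prometheus(base_url: str, promql: str) -> float:
    """Run an instant PromQL query and return the first sample's value."""
    url = f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    # instant-vector results look like [{"metric": {...}, "value": [ts, "1.4"]}]
    return float(body["data"]["result"][0]["value"][1])

def slo_held(measured: float, threshold: float, lower_is_better: bool = True) -> bool:
    """Compare a measured value against its SLO threshold."""
    return measured <= threshold if lower_is_better else measured >= threshold

# e.g. p95 latency over the 10-minute chaos window must stay under 2 s:
# p95 = query_prometheus(PROM_URL,
#     'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))')
print(slo_held(1.4, 2.0))                             # True  (latency SLO)
print(slo_held(0.992, 0.99, lower_is_better=False))   # True  (availability SLO)
```

Exiting nonzero when any SLO check fails is what lets the GitHub Actions step above fail the pipeline.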