
Load Testing LLM Endpoints

Why LLM Load Testing Requires Special Attention

Traditional load testing assumes predictable response times and linear scaling. LLM endpoints violate both assumptions:

  • Variable response times. A simple query may complete in 1 second; a complex reasoning task may take 30 seconds.
  • Token-based rate limits. You can exhaust your rate limit with a few large requests or many small ones.
  • Cold starts. Serverless LLM deployments can add 5-30 seconds to the first request after idle.
  • Provider-side queuing. When the provider is under load, requests queue server-side, making client-side concurrency irrelevant.
  • Real money per request. Unlike traditional APIs, every LLM request costs money in proportion to its token count. A load test that sends 10,000 requests to GPT-4 can cost hundreds of dollars.
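The token-based limit in particular trips up capacity planning: a tokens-per-minute (TPM) budget can be exhausted by a handful of large requests or by thousands of small ones. A quick back-of-the-envelope helper (the 30,000 TPM figure is an illustrative assumption, not any specific provider's limit):

```python
def requests_until_tpm_exhausted(tpm_limit: int, tokens_per_request: int) -> int:
    """How many requests of a given size fit in one minute's token budget."""
    return tpm_limit // tokens_per_request

# Assume an illustrative 30,000 tokens-per-minute budget.
TPM = 30_000

# A few large requests drain it as fast as many small ones:
print(requests_until_tpm_exhausted(TPM, tokens_per_request=10_000))  # 3 large requests/min
print(requests_until_tpm_exhausted(TPM, tokens_per_request=300))     # 100 small requests/min
```

This is why a load test plan needs a token profile per request shape, not just a requests-per-second target.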

k6 Script for LLM Endpoint Load Testing

// k6-llm-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Trend, Counter, Rate } from 'k6/metrics';

// Custom LLM-specific metrics
const ttft = new Trend('time_to_first_token', true);
const totalGenTime = new Trend('total_generation_time', true);
const tokensPerSecond = new Trend('tokens_per_second');
const rateLimitHits = new Counter('rate_limit_hits');
const timeoutRate = new Rate('timeout_rate');

export const options = {
  scenarios: {
    // Simulate gradual ramp to find the breaking point
    ramp_to_limit: {
      executor: 'ramping-arrival-rate',
      startRate: 1,
      timeUnit: '1s',
      preAllocatedVUs: 50,
      maxVUs: 200,
      stages: [
        { duration: '2m', target: 5 },    // 5 req/s
        { duration: '3m', target: 10 },   // 10 req/s
        { duration: '3m', target: 20 },   // 20 req/s -- likely hits rate limits
        { duration: '2m', target: 5 },    // cool down
      ],
    },
  },
  thresholds: {
    time_to_first_token: ['p(95)<3000'],     // TTFT under 3s for 95th pctile
    total_generation_time: ['p(95)<15000'],   // Total gen under 15s
    tokens_per_second: ['avg>30'],            // At least 30 tok/s average
    timeout_rate: ['rate<0.05'],              // Under 5% timeouts
  },
};

// Varied prompts to simulate realistic usage
const prompts = [
  "Summarize the key differences between REST and GraphQL in 3 sentences.",
  "Write a Python function that validates an email address using regex.",
  "Explain the CAP theorem to a junior developer.",
  "Generate a SQL query to find the top 10 customers by revenue last quarter.",
  "What are the SOLID principles? Give a one-line explanation of each.",
];

export default function () {
  const prompt = prompts[Math.floor(Math.random() * prompts.length)];

  const payload = JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 256,
    stream: false,
  });

  const startTime = Date.now();

  const res = http.post(
    'https://api.example.com/v1/chat/completions',
    payload,
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${__ENV.LLM_API_KEY}`,
      },
      timeout: '30s',
    }
  );

  const elapsed = Date.now() - startTime;

  // Track rate limit responses separately
  if (res.status === 429) {
    rateLimitHits.add(1);
    console.warn(`Rate limited at ${new Date().toISOString()}`);
    sleep(5); // back off to avoid cascading rate limits
    return;
  }

  // Track timeouts
  timeoutRate.add(res.status === 0 || elapsed > 29000);

  if (res.status === 200) {
    const body = JSON.parse(res.body);
    const completionTokens = body.usage?.completion_tokens || 0;
    const totalTime = elapsed / 1000; // seconds

    // Approximate TTFT (for accurate TTFT, use streaming with k6 WebSocket)
    ttft.add(elapsed * 0.15); // rough heuristic for non-streaming
    totalGenTime.add(elapsed);

    if (totalTime > 0 && completionTokens > 0) {
      tokensPerSecond.add(completionTokens / totalTime);
    }

    check(res, {
      'status is 200': (r) => r.status === 200,
      'response has content': () => body.choices?.[0]?.message?.content?.length > 0,
      'under token budget': () => body.usage?.total_tokens < 1000,
    });
  }

  sleep(Math.random() * 2 + 0.5); // think time between requests
}

Cold Start Testing Pattern

Serverless LLM deployments (AWS Bedrock, self-hosted on Lambda/Cloud Run) exhibit cold start penalties that can dramatically affect user experience:

// k6-cold-start-test.js
import http from 'k6/http';
import { Trend } from 'k6/metrics';
import { sleep } from 'k6';

const coldStartLatency = new Trend('cold_start_latency', true);
const warmLatency = new Trend('warm_latency', true);

export const options = {
  scenarios: {
    cold_start: {
      executor: 'per-vu-iterations',
      vus: 1, // sequential to isolate cold starts
      iterations: 10,
      // Each iteration includes a 5-minute idle sleep, so the executor's
      // default 10-minute maxDuration must be raised or k6 aborts early.
      maxDuration: '2h',
    },
  },
};

export default function () {
  // Cold start: wait long enough for the instance to scale down
  sleep(300); // 5 minutes idle -- adjust based on your provider's scale-down policy

  const coldRes = http.post('https://llm.example.com/v1/completions', JSON.stringify({
    prompt: "Hello", max_tokens: 5,
  }), { headers: { 'Content-Type': 'application/json' }, timeout: '60s' });

  coldStartLatency.add(coldRes.timings.duration);
  console.log(`Cold start: ${coldRes.timings.duration}ms`);

  // Warm requests: rapid fire while the instance is hot
  for (let i = 0; i < 5; i++) {
    const warmRes = http.post('https://llm.example.com/v1/completions', JSON.stringify({
      prompt: "Hello", max_tokens: 5,
    }), { headers: { 'Content-Type': 'application/json' }, timeout: '30s' });

    warmLatency.add(warmRes.timings.duration);
    sleep(1);
  }
}

Cold Start Mitigation Strategies

  • Provisioned concurrency. Pre-warm N instances (AWS Lambda provisioned concurrency, Cloud Run min-instances). Trade-off: you pay for idle capacity.
  • Keep-alive pings. Periodic health-check requests prevent scale-to-zero. Trade-off: minimal cost, but adds complexity.
  • Model caching. Keep model weights in memory across invocations. Trade-off: requires a persistent runtime (not pure serverless).
  • Edge deployment. Deploy smaller models at the edge for low-latency inference. Trade-off: limited model capability.
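The keep-alive trade-off is easy to quantify before committing to it. A sketch comparing monthly ping cost against the idle window (the 15-minute window, per-ping price, and 0.8 safety factor are all made-up placeholders):

```python
def keepalive_monthly_cost(scale_down_idle_s: int, cost_per_ping: float,
                           safety_factor: float = 0.8) -> float:
    """Cost of pinging often enough to keep one instance warm for 30 days.

    Pings fire at safety_factor * the provider's scale-down idle window,
    so a slightly late ping does not let the instance go cold.
    """
    interval_s = scale_down_idle_s * safety_factor
    pings_per_month = (30 * 24 * 3600) / interval_s
    return pings_per_month * cost_per_ping

# Assume a 15-minute scale-down window and $0.0001 per minimal ping request.
print(f"${keepalive_monthly_cost(900, 0.0001):.2f}/month")  # $0.36/month
```

Comparing that figure against the monthly price of one provisioned instance usually settles the choice quickly.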

Rate Limit Testing

Understanding your provider's rate limit behavior is critical for capacity planning:

// k6-rate-limit-test.js
import http from 'k6/http';
import { Counter, Trend } from 'k6/metrics';

const rateLimits = new Counter('rate_limit_responses');
const responseStatus = new Counter('response_status');
const retryAfter = new Trend('retry_after_seconds');

export const options = {
  scenarios: {
    burst: {
      executor: 'constant-arrival-rate',
      rate: 100,       // deliberately exceed expected rate limit
      timeUnit: '1s',
      duration: '2m',
      preAllocatedVUs: 100,
      maxVUs: 200,
    },
  },
};

export default function () {
  const res = http.post('https://api.example.com/v1/chat/completions',
    JSON.stringify({
      model: "gpt-4o-mini",  // use cheapest model for rate limit testing
      messages: [{ role: "user", content: "Say hi" }],
      max_tokens: 5,
    }),
    { headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.LLM_API_KEY}`,
    }}
  );

  responseStatus.add(1, { status: String(res.status) });

  if (res.status === 429) {
    rateLimits.add(1);
    const retryHeader = res.headers['Retry-After'];
    // Retry-After may be delay-seconds or an HTTP date; only record the numeric form
    const seconds = parseFloat(retryHeader);
    if (!isNaN(seconds)) {
      retryAfter.add(seconds);
    }
  }
}

Cost-Aware Load Testing

LLM load tests cost real money. Plan your budget:

# estimate_load_test_cost.py
def estimate_cost(
    requests: int,
    avg_prompt_tokens: int = 200,
    avg_completion_tokens: int = 150,
    model: str = "gpt-4o",
) -> dict:
    """Estimate the cost of a load test run."""
    pricing = {
        "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
        "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
        "claude-3-5-sonnet": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    }

    if model not in pricing:
        return {"error": f"Unknown model: {model}"}

    p = pricing[model]
    input_cost = requests * avg_prompt_tokens * p["input"]
    output_cost = requests * avg_completion_tokens * p["output"]

    return {
        "model": model,
        "requests": requests,
        "estimated_input_cost": f"${input_cost:.2f}",
        "estimated_output_cost": f"${output_cost:.2f}",
        "estimated_total_cost": f"${input_cost + output_cost:.2f}",
    }

# Example: 1000 requests to GPT-4o
print(estimate_cost(1000, model="gpt-4o"))
# {'model': 'gpt-4o', 'requests': 1000,
#  'estimated_input_cost': '$0.50', 'estimated_output_cost': '$1.50',
#  'estimated_total_cost': '$2.00'}

Cost Optimization Tips for Load Testing

  1. Use the cheapest model for rate limit and throughput testing. You do not need GPT-4o to test whether your rate limiter works. Use gpt-4o-mini.
  2. Minimize max_tokens. Set max_tokens: 5 for tests that only measure latency, not output quality.
  3. Cache where possible. If your system has a semantic cache, verify it works under load by sending repeated queries.
  4. Test in short bursts. Instead of a 30-minute sustained test, use a 5-minute ramp with aggressive rate increases to find the breaking point quickly.
  5. Budget per test run. Set a hard cost ceiling (e.g., $10 per CI run) and design tests within that constraint.

CI Integration for LLM Load Tests

# .github/workflows/llm-performance.yml
name: LLM Performance Gate
on:
  push:
    branches: [main]
    paths:
      - 'src/ai/**'       # Only run when AI-related code changes
      - 'prompts/**'

jobs:
  llm-load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run LLM load test
        uses: grafana/k6-action@v0.4.0
        with:
          filename: tests/performance/k6-llm-load-test.js
          flags: --out json=llm-results.json
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}

      - name: Check cost threshold
        run: |
          python scripts/check_llm_test_cost.py llm-results.json --max-cost 10.00

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: llm-load-test-results
          path: llm-results.json
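The check_llm_test_cost.py gate referenced above is not shown in this chapter. One possible shape, assuming the k6 JSON output has been aggregated into a summary with total token counts (the field names and per-token prices here are assumptions, not part of any k6 output format):

```python
# Assumed per-token prices for the model under test (USD per token).
INPUT_PRICE = 2.50 / 1_000_000
OUTPUT_PRICE = 10.00 / 1_000_000

def run_cost(summary: dict) -> float:
    """Compute a run's cost from aggregated token counts."""
    return (summary["total_prompt_tokens"] * INPUT_PRICE +
            summary["total_completion_tokens"] * OUTPUT_PRICE)

def check_cost_ceiling(summary: dict, max_cost: float) -> bool:
    """Return False when the run exceeded its budget."""
    cost = run_cost(summary)
    print(f"Load test cost: ${cost:.2f} (ceiling ${max_cost:.2f})")
    return cost <= max_cost

# 1,000 requests at ~200 prompt / ~150 completion tokens each:
summary = {"total_prompt_tokens": 200_000, "total_completion_tokens": 150_000}
assert check_cost_ceiling(summary, max_cost=10.00)  # $2.00 -- under budget
```

In the real script this would read the JSON file path and --max-cost from the command line and exit non-zero when check_cost_ceiling returns False, which is what makes the CI step fail.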

LLM load testing requires a mindset shift from "how many requests per second" to "how do cost, latency, and quality behave as concurrency increases." The tools are the same (k6, Locust), but the metrics and constraints are fundamentally different.