LLM Performance Metrics

Why LLM Performance Is Different

LLM-backed features introduce performance characteristics that traditional load testing does not account for. Unlike a REST API that returns a complete response in one round-trip, LLM endpoints have variable response times, token-based throughput limits, cold start latency, and provider rate limiting. A QA architect who applies traditional latency percentiles to LLM endpoints will miss the metrics that actually matter to users.


Key Metrics for LLM Performance

Primary Metrics

| Metric | Description | Typical Range | Why It Matters |
| --- | --- | --- | --- |
| Time to First Token (TTFT) | Latency before the first token is generated | 200ms - 5s | Perceived responsiveness -- users notice when streaming starts |
| Tokens per Second (TPS) | Generation throughput after the first token | 30-100 tok/s | User experience for streaming responses |
| Total Generation Time | End-to-end time including all tokens | 1s - 60s | Request timeout planning and SLO definitions |
| Cold Start Latency | First request after an idle period (serverless) | 5s - 30s | Serverless deployment planning |
| Rate Limit Headroom | Distance from the provider rate limit (tokens/min or requests/min) | Varies by tier | Burst handling capacity |
| Context Window Utilization | Prompt + completion token ratio | 10-100% | Cost and latency correlation |

Secondary Metrics

| Metric | Description | Why It Matters |
| --- | --- | --- |
| Token Budget Compliance | Percentage of responses within max_tokens | Prevents runaway costs |
| Retry Rate | Percentage of requests requiring retry (429, 500, timeout) | Service reliability |
| Provider Fallback Rate | How often the system falls back to a secondary provider | Primary provider stability |
| Cache Hit Rate | Percentage of requests served from semantic cache | Cost optimization and latency |
| Streaming Drop Rate | Percentage of streams that disconnect before completion | Network reliability |
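Several of these rates reduce to simple aggregation over request logs. A minimal sketch, assuming hypothetical per-request log fields (the field names here are illustrative, not any particular logging schema):

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    # Hypothetical per-request record; field names are illustrative.
    status: int          # HTTP status (429/500 indicate retryable failures)
    retried: bool        # whether the request needed at least one retry
    used_fallback: bool  # whether a secondary provider served the request
    cache_hit: bool      # whether the semantic cache served the request

def secondary_metrics(logs: list[RequestLog]) -> dict:
    """Aggregate secondary LLM metrics as percentages over a log window."""
    n = len(logs)
    if n == 0:
        return {}
    return {
        "retry_rate": 100 * sum(l.retried for l in logs) / n,
        "fallback_rate": 100 * sum(l.used_fallback for l in logs) / n,
        "cache_hit_rate": 100 * sum(l.cache_hit for l in logs) / n,
    }
```

Feeding these percentages into a dashboard alongside the primary latency metrics makes provider instability visible before it becomes an outage.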

Understanding TTFT vs Total Generation Time

Request sent                      Response complete
    |                                    |
    |--[TTFT]-->|                        |
    |           |---[Token streaming]--->|
    |           |                        |
    |           First token              Last token
    |                                    |
    |--------[Total Generation Time]---->|

TTFT is what determines perceived responsiveness. A user staring at a blank screen for 3 seconds feels slow, even if the total response arrives in 5 seconds. With streaming, a 500ms TTFT followed by 4.5 seconds of token-by-token output feels much faster than a 5-second wait for a complete response.

Rule of thumb:

  • TTFT < 1s: Users perceive the response as "instant"
  • TTFT 1-3s: Acceptable with a loading indicator
  • TTFT > 3s: Users begin to disengage
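These thresholds can be encoded directly in test assertions or alerting. A minimal sketch of the same bands:

```python
def classify_ttft(ttft_ms: float) -> str:
    """Map a TTFT measurement to the perceived-responsiveness bands above."""
    if ttft_ms < 1000:
        return "instant"
    if ttft_ms <= 3000:
        return "acceptable"  # show a loading indicator in this band
    return "disengagement risk"
```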

Measuring LLM Metrics

Non-Streaming Measurement

For non-streaming endpoints, you can only measure total generation time directly. TTFT must be estimated:

# llm_metrics_collector.py
import time
from dataclasses import dataclass

@dataclass
class LLMMetrics:
    ttft_ms: float
    total_generation_ms: float
    tokens_per_second: float
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    model: str

def measure_non_streaming(client, prompt: str, model: str = "gpt-4o") -> LLMMetrics:
    """Measure LLM performance for a non-streaming request."""
    start = time.perf_counter()

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=False,
    )

    total_ms = (time.perf_counter() - start) * 1000
    completion_tokens = response.usage.completion_tokens
    tps = completion_tokens / (total_ms / 1000) if total_ms > 0 else 0

    return LLMMetrics(
        ttft_ms=total_ms * 0.15,  # heuristic: ~15% of total time is prefill
        total_generation_ms=total_ms,
        tokens_per_second=tps,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=response.usage.total_tokens,
        model=model,
    )

Streaming Measurement (Accurate TTFT)

For accurate TTFT, you must use the streaming API:

import time

def measure_streaming(client, prompt: str, model: str = "gpt-4o") -> LLMMetrics:
    """Measure LLM performance for a streaming request with accurate TTFT."""
    start = time.perf_counter()
    ttft = None
    token_count = 0
    full_response = ""
    prompt_tokens = 0      # populated from the usage chunk; stays 0 if it never arrives
    completion_tokens = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
        stream_options={"include_usage": True},
    )

    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = (time.perf_counter() - start) * 1000
            full_response += chunk.choices[0].delta.content
            token_count += 1  # counts content chunks, which approximate tokens

        # Usage arrives in the final chunk when stream_options is set
        if getattr(chunk, 'usage', None):
            prompt_tokens = chunk.usage.prompt_tokens
            completion_tokens = chunk.usage.completion_tokens

    total_ms = (time.perf_counter() - start) * 1000
    generation_time = total_ms - (ttft or 0)
    if completion_tokens == 0:
        completion_tokens = token_count  # fall back to the chunk count if usage is absent
    tps = completion_tokens / (generation_time / 1000) if generation_time > 0 else 0

    return LLMMetrics(
        ttft_ms=ttft or total_ms,
        total_generation_ms=total_ms,
        tokens_per_second=tps,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        model=model,
    )

Context Window Economics

The size of the prompt directly affects latency and cost. Understanding this relationship is critical for performance optimization:

| Context Usage | Typical TTFT Impact | Cost Impact | Optimization |
| --- | --- | --- | --- |
| < 1K tokens | Baseline | Baseline | None needed |
| 1-4K tokens | +100-300ms | 2-4x | Summarize context |
| 4-16K tokens | +300ms-1s | 4-16x | RAG with relevance filtering |
| 16-64K tokens | +1-5s | 16-64x | Aggressive context pruning |
| 64-128K tokens | +5-15s | 64-128x | Redesign the approach |
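The utilization metric from the primary table can be computed directly from token counts. A minimal sketch, assuming a 128K context window (adjust for the model in use):

```python
def context_stats(prompt_tokens: int, completion_tokens: int,
                  context_window: int = 128_000) -> dict:
    """Context-window utilization and the prompt's share of total tokens.

    The 128K default is an assumption; pass the actual window for your model.
    """
    total = prompt_tokens + completion_tokens
    return {
        "utilization_pct": 100 * total / context_window,
        "prompt_share_pct": 100 * prompt_tokens / total if total else 0.0,
    }
```

A high prompt share is the usual signal that context pruning or RAG filtering will pay off, since prefill dominates both TTFT and cost.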

Practical optimization strategies:

  1. Prompt caching. Many providers cache the prefix of repeated prompts, reducing TTFT for subsequent requests with the same system prompt.
  2. Context pruning. Remove irrelevant conversation history before sending to the model.
  3. Semantic caching. Cache responses for semantically similar queries to avoid LLM calls entirely.
  4. Model routing. Send simple queries to smaller, faster models. Reserve large models for complex tasks.
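Strategy 4 can be sketched as a trivial router. The length threshold, marker words, and model names below are illustrative assumptions, not a recommendation:

```python
def route_model(prompt: str, history_turns: int = 0) -> str:
    """Route simple queries to a small, fast model; escalate otherwise.

    The 200-character threshold, marker words, and model names are
    placeholders for illustration only.
    """
    complex_markers = ("analyze", "compare", "write a", "refactor")
    if (len(prompt) > 200
            or history_turns > 6
            or any(m in prompt.lower() for m in complex_markers)):
        return "large-model"   # placeholder name
    return "small-fast-model"  # placeholder name
```

Production routers typically use a classifier or a cheap LLM call rather than string heuristics, but the performance payoff is the same: most traffic lands on the low-latency tier.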

Setting SLOs for LLM Features

LLM SLOs should be defined separately from traditional API SLOs:

# llm-slo-definition.yaml
service: ai-chatbot
llm_slos:
  - name: streaming_responsiveness
    metric: time_to_first_token
    target: p95 < 2000ms
    window: 7d

  - name: generation_throughput
    metric: tokens_per_second
    target: avg > 40 tok/s
    window: 7d

  - name: completion_time
    metric: total_generation_time
    target: p95 < 10000ms
    window: 7d

  - name: availability
    metric: successful_requests / total_requests
    target: 99.5%
    window: 30d
    # Note: lower than typical API SLOs because LLM providers
    # have higher baseline error rates

  - name: rate_limit_headroom
    metric: requests_used / rate_limit
    target: peak < 80%
    window: 1d
    alert_at: 70%
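Evaluating the streaming_responsiveness SLO from a window of measurements can be sketched with a nearest-rank p95; the 2000 ms default mirrors the target above:

```python
import math

def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    s = sorted(values)
    rank = math.ceil(0.95 * len(s))  # 1-based nearest rank
    return s[rank - 1]

def ttft_slo_met(ttft_samples_ms: list[float], target_ms: float = 2000.0) -> bool:
    """True if p95 TTFT over the window is under the SLO target."""
    return p95(ttft_samples_ms) < target_ms
```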

Benchmarking Across Providers

When evaluating LLM providers, run standardized benchmarks:

# llm_benchmark.py
BENCHMARK_PROMPTS = [
    {"name": "short_qa", "prompt": "What is 2+2?", "expected_tokens": 10},
    {"name": "medium_summary", "prompt": "Summarize the key differences between REST and GraphQL in 3 sentences.", "expected_tokens": 80},
    {"name": "long_generation", "prompt": "Write a Python function that implements binary search with detailed docstring.", "expected_tokens": 200},
]

def benchmark_provider(client, model: str, runs: int = 10) -> dict:
    """Benchmark a provider/model combination."""
    results = {}
    for prompt_config in BENCHMARK_PROMPTS:
        metrics = []
        for _ in range(runs):
            m = measure_streaming(client, prompt_config["prompt"], model)
            metrics.append(m)

        results[prompt_config["name"]] = {
            # Index-based percentile picks: crude for small run counts,
            # but adequate for relative comparisons across providers
            "ttft_p50": sorted([m.ttft_ms for m in metrics])[len(metrics)//2],
            "ttft_p95": sorted([m.ttft_ms for m in metrics])[int(len(metrics)*0.95)],
            "tps_avg": sum(m.tokens_per_second for m in metrics) / len(metrics),
            "total_p50": sorted([m.total_generation_ms for m in metrics])[len(metrics)//2],
        }
    return results

This data informs provider selection, fallback prioritization, and SLO calibration. Run benchmarks weekly to track provider performance trends.
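Tracking those weekly trends can be sketched as a simple regression check between two benchmark runs; the 20% threshold is an assumption to tune:

```python
def regressions(current: dict, baseline: dict, threshold: float = 0.20) -> list[str]:
    """Flag prompt/metric pairs where latency grew by more than `threshold`.

    Both dicts have the shape produced by benchmark_provider:
    {prompt_name: {"ttft_p50": ..., "ttft_p95": ..., "total_p50": ...}}.
    """
    flagged = []
    for name, metrics in current.items():
        base = baseline.get(name, {})
        for key in ("ttft_p50", "ttft_p95", "total_p50"):
            if key in metrics and base.get(key):
                if metrics[key] > base[key] * (1 + threshold):
                    flagged.append(f"{name}.{key}")
    return flagged
```

Wiring this into CI against last week's stored results turns provider drift into a reviewable diff instead of a surprise.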