# LLM Performance Metrics
## Why LLM Performance Is Different
LLM-backed features introduce performance characteristics that traditional load testing does not account for. Unlike a REST API that returns a complete response in one round-trip, LLM endpoints have variable response times, token-based throughput limits, cold start latency, and provider rate limiting. A QA architect who applies traditional latency percentiles to LLM endpoints will miss the metrics that actually matter to users.
## Key Metrics for LLM Performance
### Primary Metrics
| Metric | Description | Typical Range | Why It Matters |
|---|---|---|---|
| Time to First Token (TTFT) | Latency before the first token is generated | 200ms - 5s | Perceived responsiveness -- users notice when streaming starts |
| Tokens per Second (TPS) | Generation throughput after first token | 30-100 tok/s | User experience for streaming responses |
| Total Generation Time | End-to-end time including all tokens | 1s - 60s | Request timeout planning and SLO definitions |
| Cold Start Latency | First request after idle period (serverless) | 5s - 30s | Serverless deployment planning |
| Rate Limit Headroom | Distance from provider rate limit (tokens/min or requests/min) | Varies by tier | Burst handling capacity |
| Context Window Utilization | Prompt + completion token ratio | 10-100% | Cost and latency correlation |
### Secondary Metrics
| Metric | Description | Why It Matters |
|---|---|---|
| Token Budget Compliance | Percentage of responses within max_tokens | Prevents runaway costs |
| Retry Rate | Percentage of requests requiring retry (429, 500, timeout) | Service reliability |
| Provider Fallback Rate | How often the system falls back to a secondary provider | Primary provider stability |
| Cache Hit Rate | Percentage of requests served from semantic cache | Cost optimization and latency |
| Streaming Drop Rate | Percentage of streams that disconnect before completion | Network reliability |
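Most of these secondary metrics reduce to "percentage of requests with some flag set" over a time window, so they can be derived from a simple request log. A minimal sketch; the record shape and field names here are assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    status: int           # HTTP status of the final attempt
    retried: bool         # any retry occurred (429, 500, timeout)
    used_fallback: bool   # served by the secondary provider
    cache_hit: bool       # served from the semantic cache
    stream_dropped: bool  # stream disconnected before completion

def secondary_metrics(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate secondary LLM metrics as percentages over a window."""
    n = len(records)
    if n == 0:
        return {}
    def pct(flag) -> float:
        return 100.0 * sum(1 for r in records if flag(r)) / n
    return {
        "retry_rate": pct(lambda r: r.retried),
        "fallback_rate": pct(lambda r: r.used_fallback),
        "cache_hit_rate": pct(lambda r: r.cache_hit),
        "stream_drop_rate": pct(lambda r: r.stream_dropped),
    }
```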
## Understanding TTFT vs Total Generation Time
```
Request sent                                      Response complete
     |                                                    |
     |--[TTFT]-->|                                        |
     |           |---------[Token streaming]------------->|
     |           |                                        |
     |      First token                               Last token
     |                                                    |
     |---------------[Total Generation Time]------------->|
```
TTFT is what determines perceived responsiveness. A user staring at a blank screen for 3 seconds feels slow, even if the total response arrives in 5 seconds. With streaming, a 500ms TTFT followed by 4.5 seconds of token-by-token output feels much faster than a 5-second wait for a complete response.
Rule of thumb:
- TTFT < 1s: Users perceive the response as "instant"
- TTFT 1-3s: Acceptable with a loading indicator
- TTFT > 3s: Users begin to disengage
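These thresholds are easy to encode as a small classification helper for dashboards or alerting. A sketch; the bucket names are illustrative:

```python
def classify_ttft(ttft_ms: float) -> str:
    """Map a TTFT measurement to a perceived-responsiveness bucket."""
    if ttft_ms < 1000:
        return "instant"      # users perceive the response as immediate
    if ttft_ms <= 3000:
        return "acceptable"   # tolerable with a loading indicator
    return "disengaging"      # users begin to abandon the interaction
```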
## Measuring LLM Metrics
### Non-Streaming Measurement
For non-streaming endpoints, you can only measure total generation time directly. TTFT must be estimated:
```python
# llm_metrics_collector.py
import time
from dataclasses import dataclass

@dataclass
class LLMMetrics:
    ttft_ms: float
    total_generation_ms: float
    tokens_per_second: float
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    model: str

def measure_non_streaming(client, prompt: str, model: str = "gpt-4o") -> LLMMetrics:
    """Measure LLM performance for a non-streaming request."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=False,
    )
    total_ms = (time.perf_counter() - start) * 1000
    completion_tokens = response.usage.completion_tokens
    tps = completion_tokens / (total_ms / 1000) if total_ms > 0 else 0
    return LLMMetrics(
        ttft_ms=total_ms * 0.15,  # heuristic: ~15% of total time is prefill
        total_generation_ms=total_ms,
        tokens_per_second=tps,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=response.usage.total_tokens,
        model=model,
    )
```
### Streaming Measurement (Accurate TTFT)
For accurate TTFT, you must use the streaming API:
```python
import time

def measure_streaming(client, prompt: str, model: str = "gpt-4o") -> LLMMetrics:
    """Measure LLM performance for a streaming request with accurate TTFT."""
    start = time.perf_counter()
    ttft = None
    full_response = ""
    prompt_tokens = 0
    completion_tokens = 0  # stays 0 if the provider omits the usage chunk

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
        stream_options={"include_usage": True},
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = (time.perf_counter() - start) * 1000
            full_response += chunk.choices[0].delta.content
        # Usage arrives in the final chunk when stream_options is set
        if getattr(chunk, "usage", None):
            prompt_tokens = chunk.usage.prompt_tokens
            completion_tokens = chunk.usage.completion_tokens

    total_ms = (time.perf_counter() - start) * 1000
    generation_time = total_ms - (ttft or 0)
    tps = completion_tokens / (generation_time / 1000) if generation_time > 0 else 0
    return LLMMetrics(
        ttft_ms=ttft or total_ms,
        total_generation_ms=total_ms,
        tokens_per_second=tps,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        model=model,
    )
```
## Context Window Economics
The size of the prompt directly affects latency and cost. Understanding this relationship is critical for performance optimization:
| Context Usage | Typical TTFT Impact | Cost Impact | Optimization |
|---|---|---|---|
| < 1K tokens | Baseline | Baseline | None needed |
| 1-4K tokens | +100-300ms | 2-4x | Summarize context |
| 4-16K tokens | +300ms-1s | 4-16x | RAG with relevance filtering |
| 16-64K tokens | +1-5s | 16-64x | Aggressive context pruning |
| 64-128K tokens | +5-15s | 64-128x | Redesign the approach |
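Estimating context usage before sending a request lets you route or prune proactively. A rough sketch using the common ~4 characters-per-token heuristic for English text; exact counts require the provider's tokenizer (e.g. tiktoken for OpenAI models), and the tier labels below simply echo the table's optimization column:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Use the provider's tokenizer (e.g. tiktoken) for exact counts."""
    return max(1, len(text) // 4)

def context_tier(prompt: str) -> str:
    """Bucket a prompt into a context-usage tier with its suggested optimization."""
    tokens = estimate_tokens(prompt)
    if tokens < 1_000:
        return "baseline"
    if tokens < 4_000:
        return "summarize context"
    if tokens < 16_000:
        return "RAG with relevance filtering"
    if tokens < 64_000:
        return "aggressive context pruning"
    return "redesign the approach"
```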
Practical optimization strategies:
- Prompt caching. Many providers cache the prefix of repeated prompts, reducing TTFT for subsequent requests with the same system prompt.
- Context pruning. Remove irrelevant conversation history before sending to the model.
- Semantic caching. Cache responses for semantically similar queries to avoid LLM calls entirely.
- Model routing. Send simple queries to smaller, faster models. Reserve large models for complex tasks.
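As a concrete example of the last strategy, model routing can start as a simple heuristic on estimated prompt complexity. A sketch; the model names, thresholds, and complexity signals here are placeholders, not a recommendation:

```python
def route_model(prompt: str, has_code: bool = False) -> str:
    """Pick a model tier based on crude complexity signals.
    Model names and thresholds are illustrative placeholders."""
    approx_tokens = len(prompt) // 4  # ~4 chars/token heuristic
    if has_code or approx_tokens > 2_000:
        return "large-model"        # complex task: reserve the big model
    if approx_tokens > 200:
        return "medium-model"
    return "small-fast-model"       # simple query: cheapest, lowest TTFT
```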
## Setting SLOs for LLM Features
LLM SLOs should be defined separately from traditional API SLOs:
```yaml
# llm-slo-definition.yaml
service: ai-chatbot
llm_slos:
  - name: streaming_responsiveness
    metric: time_to_first_token
    target: p95 < 2000ms
    window: 7d
  - name: generation_throughput
    metric: tokens_per_second
    target: avg > 40 tok/s
    window: 7d
  - name: completion_time
    metric: total_generation_time
    target: p95 < 10000ms
    window: 7d
  - name: availability
    metric: successful_requests / total_requests
    target: 99.5%
    window: 30d
    # Note: lower than typical API SLOs because LLM providers
    # have higher baseline error rates
  - name: rate_limit_headroom
    metric: requests_used / rate_limit
    target: peak < 80%
    window: 1d
    alert_at: 70%
```
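Evaluating a percentile-based SLO against collected samples is straightforward. A sketch of a p95 compliance check using the nearest-rank percentile method; in production these numbers usually come from a metrics backend rather than raw samples:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0-100) of a non-empty sample list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def ttft_slo_met(ttft_samples_ms: list[float], target_ms: float = 2000) -> bool:
    """True if p95 TTFT is under target, as in streaming_responsiveness above."""
    return percentile(ttft_samples_ms, 95) < target_ms
```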
## Benchmarking Across Providers
When evaluating LLM providers, run standardized benchmarks:
```python
# llm_benchmark.py
# Uses measure_streaming() and LLMMetrics from the section above.
BENCHMARK_PROMPTS = [
    {"name": "short_qa", "prompt": "What is 2+2?", "expected_tokens": 10},
    {"name": "medium_summary", "prompt": "Summarize the key differences between REST and GraphQL in 3 sentences.", "expected_tokens": 80},
    {"name": "long_generation", "prompt": "Write a Python function that implements binary search with detailed docstring.", "expected_tokens": 200},
]

def benchmark_provider(client, model: str, runs: int = 10) -> dict:
    """Benchmark a provider/model combination."""
    results = {}
    for prompt_config in BENCHMARK_PROMPTS:
        metrics = [measure_streaming(client, prompt_config["prompt"], model) for _ in range(runs)]
        ttfts = sorted(m.ttft_ms for m in metrics)
        totals = sorted(m.total_generation_ms for m in metrics)
        results[prompt_config["name"]] = {
            "ttft_p50": ttfts[len(ttfts) // 2],
            # with only 10 runs this index is the slowest sample;
            # use more runs for a stable p95
            "ttft_p95": ttfts[int(len(ttfts) * 0.95)],
            "tps_avg": sum(m.tokens_per_second for m in metrics) / len(metrics),
            "total_p50": totals[len(totals) // 2],
        }
    return results
```
This data informs provider selection, fallback prioritization, and SLO calibration. Run benchmarks weekly to track provider performance trends.
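Once each provider has a benchmark result, selection and fallback ordering reduce to a ranking over those results. A sketch; ranking on short_qa p95 TTFT alone is an assumption, and a real scorer would weight throughput and cost per product requirements:

```python
def rank_providers(results_by_provider: dict[str, dict]) -> list[str]:
    """Rank providers by p95 TTFT on the short_qa benchmark, fastest first.
    Expects the result shape produced by benchmark_provider()."""
    return sorted(
        results_by_provider,
        key=lambda p: results_by_provider[p]["short_qa"]["ttft_p95"],
    )
```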