# LLM Performance Metrics
## Why LLM Performance Is Different
LLM-backed features introduce performance characteristics that traditional load testing does not account for. Unlike a REST API that returns a complete response in one round-trip, LLM endpoints have variable response times, token-based throughput limits, cold start latency, and provider rate limiting. A QA architect who applies traditional latency percentiles to LLM endpoints will miss the metrics that actually matter to users.
## Key Metrics for LLM Performance
### Primary Metrics
| Metric | Description | Typical Range | Why It Matters |
|---|---|---|---|
| Time to First Token (TTFT) | Latency before the first token is generated | 200ms - 5s | Perceived responsiveness -- users notice when streaming starts |
| Tokens per Second (TPS) | Generation throughput after first token | 30-100 tok/s | User experience for streaming responses |
| Total Generation Time | End-to-end time including all tokens | 1s - 60s | Request timeout planning and SLO definitions |
| Cold Start Latency | First request after idle period (serverless) | 5s - 30s | Serverless deployment planning |
| Rate Limit Headroom | Distance from provider rate limit (tokens/min or requests/min) | Varies by tier | Burst handling capacity |
| Context Window Utilization | Prompt + completion token ratio | 10-100% | Cost and latency correlation |
### Secondary Metrics
| Metric | Description | Why It Matters |
|---|---|---|
| Token Budget Compliance | Percentage of responses within max_tokens | Prevents runaway costs |
| Retry Rate | Percentage of requests requiring retry (429, 500, timeout) | Service reliability |
| Provider Fallback Rate | How often the system falls back to a secondary provider | Primary provider stability |
| Cache Hit Rate | Percentage of requests served from semantic cache | Cost optimization and latency |
| Streaming Drop Rate | Percentage of streams that disconnect before completion | Network reliability |
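Most of these secondary metrics reduce to "percentage of requests with some flag set" over a time window, so they can be derived from a simple request log. A minimal sketch; the record shape and field names here are assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    status: int           # HTTP status of the final attempt
    retried: bool         # any retry occurred (429, 500, timeout)
    used_fallback: bool   # served by the secondary provider
    cache_hit: bool       # served from the semantic cache
    stream_dropped: bool  # stream disconnected before completion

def secondary_metrics(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate secondary LLM metrics as percentages over a window."""
    n = len(records)
    if n == 0:
        return {}
    def pct(flag) -> float:
        return 100.0 * sum(1 for r in records if flag(r)) / n
    return {
        "retry_rate": pct(lambda r: r.retried),
        "fallback_rate": pct(lambda r: r.used_fallback),
        "cache_hit_rate": pct(lambda r: r.cache_hit),
        "stream_drop_rate": pct(lambda r: r.stream_dropped),
    }
```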
## Understanding TTFT vs Total Generation Time
```
Request sent                                      Response complete
     |                                                    |
     |--[TTFT]-->|                                        |
     |           |---------[Token streaming]------------->|
     |           |                                        |
     |      First token                               Last token
     |                                                    |
     |---------------[Total Generation Time]------------->|
```
TTFT is what determines perceived responsiveness. A user staring at a blank screen for 3 seconds feels slow, even if the total response arrives in 5 seconds. With streaming, a 500ms TTFT followed by 4.5 seconds of token-by-token output feels much faster than a 5-second wait for a complete response.
Rule of thumb:
- TTFT < 1s: Users perceive the response as "instant"
- TTFT 1-3s: Acceptable with a loading indicator
- TTFT > 3s: Users begin to disengage
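These thresholds are easy to encode as a small classification helper for dashboards or alerting. A sketch; the bucket names are illustrative:

```python
def classify_ttft(ttft_ms: float) -> str:
    """Map a TTFT measurement to a perceived-responsiveness bucket."""
    if ttft_ms < 1000:
        return "instant"      # users perceive the response as immediate
    if ttft_ms <= 3000:
        return "acceptable"   # tolerable with a loading indicator
    return "disengaging"      # users begin to abandon the interaction
```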
## Measuring LLM Metrics
### Non-Streaming Measurement
For non-streaming endpoints, you can only measure total generation time directly. TTFT must be estimated:
```python
# llm_metrics_collector.py
import time
from dataclasses import dataclass

@dataclass
class LLMMetrics:
    ttft_ms: float
    total_generation_ms: float
    tokens_per_second: float
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    model: str

def measure_non_streaming(client, prompt: str, model: str = "gpt-4o") -> LLMMetrics:
    """Measure LLM performance for a non-streaming request."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=False,
    )
    total_ms = (time.perf_counter() - start) * 1000
    completion_tokens = response.usage.completion_tokens
    tps = completion_tokens / (total_ms / 1000) if total_ms > 0 else 0
    return LLMMetrics(
        ttft_ms=total_ms * 0.15,  # heuristic: ~15% of total time is prefill
        total_generation_ms=total_ms,
        tokens_per_second=tps,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=response.usage.total_tokens,
        model=model,
    )
```
### Streaming Measurement (Accurate TTFT)
For accurate TTFT, you must use the streaming API:
```python
import time

def measure_streaming(client, prompt: str, model: str = "gpt-4o") -> LLMMetrics:
    """Measure LLM performance for a streaming request with accurate TTFT."""
    start = time.perf_counter()
    ttft = None
    full_response = ""
    prompt_tokens = 0
    completion_tokens = 0  # stays 0 if the provider omits the usage chunk

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
        stream_options={"include_usage": True},
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = (time.perf_counter() - start) * 1000
            full_response += chunk.choices[0].delta.content
        # Usage arrives in the final chunk when stream_options is set
        if getattr(chunk, "usage", None):
            prompt_tokens = chunk.usage.prompt_tokens
            completion_tokens = chunk.usage.completion_tokens

    total_ms = (time.perf_counter() - start) * 1000
    generation_time = total_ms - (ttft or 0)
    tps = completion_tokens / (generation_time / 1000) if generation_time > 0 else 0
    return LLMMetrics(
        ttft_ms=ttft or total_ms,
        total_generation_ms=total_ms,
        tokens_per_second=tps,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        model=model,
    )
```
## Context Window Economics
The size of the prompt directly affects latency and cost. Understanding this relationship is critical for performance optimization:
| Context Usage | Typical TTFT Impact | Cost Impact | Optimization |
|---|---|---|---|
| < 1K tokens | Baseline | Baseline | None needed |
| 1-4K tokens | +100-300ms | 2-4x | Summarize context |
| 4-16K tokens | +300ms-1s | 4-16x | RAG with relevance filtering |
| 16-64K tokens | +1-5s | 16-64x | Aggressive context pruning |
| 64-128K tokens | +5-15s | 64-128x | Redesign the approach |
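Estimating context usage before sending a request lets you route or prune proactively. A rough sketch using the common ~4 characters-per-token heuristic for English text; exact counts require the provider's tokenizer (e.g. tiktoken for OpenAI models), and the tier labels below simply echo the table's optimization column:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Use the provider's tokenizer (e.g. tiktoken) for exact counts."""
    return max(1, len(text) // 4)

def context_tier(prompt: str) -> str:
    """Bucket a prompt into a context-usage tier with its suggested optimization."""
    tokens = estimate_tokens(prompt)
    if tokens < 1_000:
        return "baseline"
    if tokens < 4_000:
        return "summarize context"
    if tokens < 16_000:
        return "RAG with relevance filtering"
    if tokens < 64_000:
        return "aggressive context pruning"
    return "redesign the approach"
```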
Practical optimization strategies:
- Prompt caching. Many providers cache the prefix of repeated prompts, reducing TTFT for subsequent requests with the same system prompt.
- Context pruning. Remove irrelevant conversation history before sending to the model.
- Semantic caching. Cache responses for semantically similar queries to avoid LLM calls entirely.
- Model routing. Send simple queries to smaller, faster models. Reserve large models for complex tasks.
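As a concrete example of the last strategy, model routing can start as a simple heuristic on estimated prompt complexity. A sketch; the model names, thresholds, and complexity signals here are placeholders, not a recommendation:

```python
def route_model(prompt: str, has_code: bool = False) -> str:
    """Pick a model tier based on crude complexity signals.
    Model names and thresholds are illustrative placeholders."""
    approx_tokens = len(prompt) // 4  # ~4 chars/token heuristic
    if has_code or approx_tokens > 2_000:
        return "large-model"        # complex task: reserve the big model
    if approx_tokens > 200:
        return "medium-model"
    return "small-fast-model"       # simple query: cheapest, lowest TTFT
```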
## Setting SLOs for LLM Features
LLM SLOs should be defined separately from traditional API SLOs:
```yaml
# llm-slo-definition.yaml
service: ai-chatbot
llm_slos:
  - name: streaming_responsiveness
    metric: time_to_first_token
    target: p95 < 2000ms
    window: 7d
  - name: generation_throughput
    metric: tokens_per_second
    target: avg > 40 tok/s
    window: 7d
  - name: completion_time
    metric: total_generation_time
    target: p95 < 10000ms
    window: 7d
  - name: availability
    metric: successful_requests / total_requests
    target: 99.5%
    window: 30d
    # Note: lower than typical API SLOs because LLM providers
    # have higher baseline error rates
  - name: rate_limit_headroom
    metric: requests_used / rate_limit
    target: peak < 80%
    window: 1d
    alert_at: 70%
```
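Evaluating a percentile-based SLO against collected samples is straightforward. A sketch of a p95 compliance check using the nearest-rank percentile method; in production these numbers usually come from a metrics backend rather than raw samples:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0-100) of a non-empty sample list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def ttft_slo_met(ttft_samples_ms: list[float], target_ms: float = 2000) -> bool:
    """True if p95 TTFT is under target, as in streaming_responsiveness above."""
    return percentile(ttft_samples_ms, 95) < target_ms
```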
## Benchmarking Across Providers
When evaluating LLM providers, run standardized benchmarks:
```python
# llm_benchmark.py
# Uses measure_streaming() and LLMMetrics from the section above.
BENCHMARK_PROMPTS = [
    {"name": "short_qa", "prompt": "What is 2+2?", "expected_tokens": 10},
    {"name": "medium_summary", "prompt": "Summarize the key differences between REST and GraphQL in 3 sentences.", "expected_tokens": 80},
    {"name": "long_generation", "prompt": "Write a Python function that implements binary search with detailed docstring.", "expected_tokens": 200},
]

def benchmark_provider(client, model: str, runs: int = 10) -> dict:
    """Benchmark a provider/model combination."""
    results = {}
    for prompt_config in BENCHMARK_PROMPTS:
        metrics = [measure_streaming(client, prompt_config["prompt"], model) for _ in range(runs)]
        ttfts = sorted(m.ttft_ms for m in metrics)
        totals = sorted(m.total_generation_ms for m in metrics)
        results[prompt_config["name"]] = {
            "ttft_p50": ttfts[len(ttfts) // 2],
            # with only 10 runs this index is the slowest sample;
            # use more runs for a stable p95
            "ttft_p95": ttfts[int(len(ttfts) * 0.95)],
            "tps_avg": sum(m.tokens_per_second for m in metrics) / len(metrics),
            "total_p50": totals[len(totals) // 2],
        }
    return results
```
This data informs provider selection, fallback prioritization, and SLO calibration. Run benchmarks weekly to track provider performance trends.
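Once each provider has a benchmark result, selection and fallback ordering reduce to a ranking over those results. A sketch; ranking on short_qa p95 TTFT alone is an assumption, and a real scorer would weight throughput and cost per product requirements:

```python
def rank_providers(results_by_provider: dict[str, dict]) -> list[str]:
    """Rank providers by p95 TTFT on the short_qa benchmark, fastest first.
    Expects the result shape produced by benchmark_provider()."""
    return sorted(
        results_by_provider,
        key=lambda p: results_by_provider[p]["short_qa"]["ttft_p95"],
    )
```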