# Distributed Tracing with OpenTelemetry

## Why Distributed Tracing?
In a microservices architecture, a single user request may traverse 5, 10, or 20 services. When that request is slow or fails, you need to answer: which service is the bottleneck? Distributed tracing provides the answer by tracking each request's journey across service boundaries, recording timing, status, and metadata at every hop.
## OpenTelemetry Architecture
OpenTelemetry (OTel) is the CNCF standard for collecting telemetry data. It provides vendor-neutral APIs, SDKs, and a collector for traces, metrics, and logs.
```
[Service A]       [Service B]       [Service C]
 OTel SDK          OTel SDK          OTel SDK
    |                 |                 |
    v                 v                 v
+-------------------------------------------------------+
|                OpenTelemetry Collector                 |
|     (receives, processes, exports telemetry data)      |
+--------+-----------+-----------+-----------+----------+
         |           |           |           |
         v           v           v           v
     [Jaeger]    [Grafana    [Datadog]    [Cloud
                   Tempo]                 provider]
```
### Why OpenTelemetry Specifically?
- Vendor neutral. Switch backends (Jaeger, Datadog, Grafana Tempo) without changing application code
- CNCF standard. Backed by Google, Microsoft, Splunk, and most observability vendors
- Auto-instrumentation. Instrument HTTP, database, and messaging libraries with zero code changes
- Context propagation. Automatically propagates trace context across service boundaries via HTTP headers
## How Traces, Spans, and Context Work Together
```
User clicks "Submit Order"
        |
        v
[TRACE: order-submit-abc123] -------- spans the entire request flow
        |
        +---> [Service: API Gateway]
        |       LOG:    "Received POST /orders from user_id=42"
        |       METRIC: http_requests_total{method="POST", path="/orders"} +1
        |
        +---> [Service: Order Service]
        |       LOG:    "Creating order for user_id=42, items=3, total=$149.99"
        |       METRIC: order_creation_duration_seconds = 0.045
        |       SPAN:   order-service.create_order (45ms)
        |
        +---> [Service: Payment Service]
        |       LOG:    "Charging $149.99 to card ending 4242"
        |       METRIC: payment_processing_duration_seconds = 1.2
        |       SPAN:   payment-service.charge (1200ms)  <-- slow!
        |
        +---> [Service: Inventory Service]
                LOG:    "Reserved 3 items for order_id=ORD-789"
                METRIC: inventory_reservations_total +3
                SPAN:   inventory-service.reserve (23ms)
```
The trace reveals the payment service is the bottleneck (1200ms out of ~1300ms total). The metric confirms this is a trend. The log provides specific context for debugging.
## Instrumenting a Python Service

### Automatic Instrumentation Setup
```python
# otel_setup.py -- OpenTelemetry configuration for a Flask service
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor


def setup_telemetry(service_name: str, service_version: str):
    """Initialize OpenTelemetry with OTLP export."""
    resource = Resource.create({
        "service.name": service_name,
        "service.version": service_version,
        "deployment.environment": "production",
    })

    # Traces
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(trace_provider)

    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317"),
        export_interval_millis=10000,
    )
    metrics.set_meter_provider(MeterProvider(
        resource=resource, metric_readers=[metric_reader]
    ))

    # Auto-instrument frameworks
    FlaskInstrumentor().instrument()
    RequestsInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
```
### Manual Spans for Business Logic
```python
# app.py -- Adding manual spans for business-critical operations
from opentelemetry import trace

tracer = trace.get_tracer("order-service")


def process_order(order_data):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_data["id"])
        span.set_attribute("order.item_count", len(order_data["items"]))
        span.set_attribute("order.total_cents", order_data["total"])

        with tracer.start_as_current_span("validate_order"):
            validate(order_data)

        with tracer.start_as_current_span("charge_payment") as payment_span:
            result = charge(order_data["payment"])
            payment_span.set_attribute("payment.method", result.method)
            payment_span.set_attribute("payment.processor", result.processor)

        with tracer.start_as_current_span("reserve_inventory"):
            reserve(order_data["items"])

        span.set_attribute("order.status", "completed")
```
## Trace-Based Testing
A powerful pattern is writing assertions against traces -- verifying not just that a request succeeded, but that it followed the expected path through the system:
```python
# test_trace_assertions.py
import requests
import time


def test_order_flow_creates_expected_trace(trace_client):
    """Verify the order flow hits all expected services in the correct order."""
    # Trigger the order flow
    response = requests.post("https://api.example.com/orders", json={
        "items": [{"sku": "LAPTOP-1", "qty": 1}],
        "payment": {"method": "card", "token": "tok_test_123"},
    })
    assert response.status_code == 201
    trace_id = response.headers["X-Trace-Id"]

    # Allow time for spans to propagate to the backend
    time.sleep(5)
    trace_data = trace_client.fetch_trace(trace_id)

    # Assert all expected services are present
    service_names = [span["service_name"] for span in trace_data["spans"]]
    assert "api-gateway" in service_names
    assert "order-service" in service_names
    assert "payment-service" in service_names
    assert "inventory-service" in service_names

    # Assert ordering: payment happens before inventory reservation
    payment_span = next(s for s in trace_data["spans"]
                        if s["operation"] == "charge_payment")
    inventory_span = next(s for s in trace_data["spans"]
                          if s["operation"] == "reserve_inventory")
    assert payment_span["end_time"] <= inventory_span["start_time"]

    # Assert performance: total trace duration under 3 seconds
    root_span = next(s for s in trace_data["spans"] if s["parent_id"] is None)
    assert root_span["duration_ms"] < 3000

    # Assert no error spans
    error_spans = [s for s in trace_data["spans"] if s.get("status") == "ERROR"]
    assert len(error_spans) == 0, f"Unexpected errors in trace: {error_spans}"
```
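The `trace_client` fixture is deliberately left abstract. One hypothetical implementation, backed by Jaeger's HTTP query API: the port, route, and `otel.status_code` tag name follow Jaeger's conventions, but all of them are assumptions to adapt to whatever backend stores your traces:

```python
# Sketch: a trace_client backed by Jaeger's /api/traces/{id} query route.
import requests


class JaegerTraceClient:
    def __init__(self, base_url="http://jaeger-query:16686"):
        self.base_url = base_url

    def fetch_trace(self, trace_id):
        resp = requests.get(f"{self.base_url}/api/traces/{trace_id}", timeout=10)
        resp.raise_for_status()
        return self._flatten(resp.json()["data"][0])

    def _flatten(self, raw):
        """Map Jaeger's span format onto the dicts the test asserts against."""
        processes = raw["processes"]  # processID -> {"serviceName": ...}
        spans = []
        for s in raw["spans"]:
            parent = next((r["spanID"] for r in s.get("references", [])
                           if r["refType"] == "CHILD_OF"), None)
            status = next((t["value"] for t in s.get("tags", [])
                           if t["key"] == "otel.status_code"), None)
            spans.append({
                "service_name": processes[s["processID"]]["serviceName"],
                "operation": s["operationName"],
                "start_time": s["startTime"],                  # microseconds
                "end_time": s["startTime"] + s["duration"],
                "duration_ms": s["duration"] / 1000,
                "parent_id": parent,
                "status": status,
            })
        return {"spans": spans}
```

Keeping the flattening in its own method means the field mapping can be unit-tested against a canned payload without a running Jaeger instance.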
### What Trace-Based Tests Catch
| Assertion | What It Catches |
|---|---|
| Service present in trace | Missing service call (regression in integration) |
| Span ordering | Race conditions, incorrect orchestration |
| Total trace duration | End-to-end performance regression |
| No error spans | Swallowed errors, silent failures |
| Expected attributes on spans | Missing context propagation |
| Span count | Unexpected service calls (N+1 queries, extra retries) |
## OpenTelemetry Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  # Tail-based sampling: keep all error traces, sample 10% of successful traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
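Note that the `tail_sampling` processor ships in the collector's contrib distribution, not the core image. A minimal sketch for running this configuration locally; the image tag and mount path are illustrative, not prescriptive:

```yaml
# docker-compose.yaml (sketch)
services:
  otel-collector:
    # tail_sampling requires the contrib build of the collector
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC (from the SDK exporters above)
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus scrape endpoint
```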
## Sampling Strategies
In high-traffic systems, collecting every trace is expensive. Choose a sampling strategy:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Head-based (probability) | Decide at trace start whether to sample | Simple, predictable cost | May miss rare errors |
| Tail-based | Decide after trace completes, based on outcome | Keeps all errors and slow traces | Higher memory usage in collector |
| Rate-limited | Sample N traces per second | Predictable cost | Misses bursts |
| Always-on for errors | Sample 100% of error traces | Never misses failures | Does not reduce volume of error traces |
Recommendation: Use tail-based sampling with always-on for errors and slow traces. This gives you 100% visibility into problems while keeping costs manageable for successful fast requests.
Distributed tracing is the backbone of observability in microservices. Combined with structured logging and metrics, it provides the complete picture needed for effective production testing.