# Distributed Tracing with OpenTelemetry

## Why Distributed Tracing?
In a microservices architecture, a single user request may traverse 5, 10, or 20 services. When that request is slow or fails, you need to answer: which service is the bottleneck? Distributed tracing provides the answer by tracking each request's journey across service boundaries, recording timing, status, and metadata at every hop.
## OpenTelemetry Architecture
OpenTelemetry (OTel) is the CNCF standard for collecting telemetry data. It provides vendor-neutral APIs, SDKs, and a collector for traces, metrics, and logs.
```
[Service A]       [Service B]       [Service C]
 OTel SDK          OTel SDK          OTel SDK
    |                 |                 |
    v                 v                 v
+-------------------------------------------------------+
|                OpenTelemetry Collector                 |
|     (receives, processes, exports telemetry data)      |
+--------+-----------+-----------+-----------+----------+
         |           |           |           |
         v           v           v           v
     [Jaeger]    [Grafana    [Datadog]    [Cloud
                   Tempo]                 provider]
```
### Why OpenTelemetry Specifically?
- Vendor neutral. Switch backends (Jaeger, Datadog, Grafana Tempo) without changing application code
- CNCF standard. Backed by Google, Microsoft, Splunk, and most observability vendors
- Auto-instrumentation. Instrument HTTP, database, and messaging libraries with zero code changes
- Context propagation. Automatically propagates trace context across service boundaries via HTTP headers
## How Traces, Spans, and Context Work Together
```
User clicks "Submit Order"
        |
        v
[TRACE: order-submit-abc123] -------- spans the entire request flow
        |
        +---> [Service: API Gateway]
        |       LOG:    "Received POST /orders from user_id=42"
        |       METRIC: http_requests_total{method="POST", path="/orders"} +1
        |
        +---> [Service: Order Service]
        |       LOG:    "Creating order for user_id=42, items=3, total=$149.99"
        |       METRIC: order_creation_duration_seconds = 0.045
        |       SPAN:   order-service.create_order (45ms)
        |
        +---> [Service: Payment Service]
        |       LOG:    "Charging $149.99 to card ending 4242"
        |       METRIC: payment_processing_duration_seconds = 1.2
        |       SPAN:   payment-service.charge (1200ms)  <-- slow!
        |
        +---> [Service: Inventory Service]
                LOG:    "Reserved 3 items for order_id=ORD-789"
                METRIC: inventory_reservations_total +3
                SPAN:   inventory-service.reserve (23ms)
```
The trace reveals the payment service is the bottleneck (1200ms out of ~1300ms total). The metric confirms this is a trend. The log provides specific context for debugging.
## Instrumenting a Python Service

### Automatic Instrumentation Setup
```python
# otel_setup.py -- OpenTelemetry configuration for a Flask service
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor


def setup_telemetry(service_name: str, service_version: str):
    """Initialize OpenTelemetry with OTLP export."""
    resource = Resource.create({
        "service.name": service_name,
        "service.version": service_version,
        "deployment.environment": "production",
    })

    # Traces
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(trace_provider)

    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317"),
        export_interval_millis=10000,
    )
    metrics.set_meter_provider(MeterProvider(
        resource=resource, metric_readers=[metric_reader]
    ))

    # Auto-instrument frameworks
    FlaskInstrumentor().instrument()
    RequestsInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
```
### Manual Spans for Business Logic
```python
# app.py -- Adding manual spans for business-critical operations
from opentelemetry import trace

tracer = trace.get_tracer("order-service")


def process_order(order_data):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_data["id"])
        span.set_attribute("order.item_count", len(order_data["items"]))
        span.set_attribute("order.total_cents", order_data["total"])

        with tracer.start_as_current_span("validate_order"):
            validate(order_data)

        with tracer.start_as_current_span("charge_payment") as payment_span:
            result = charge(order_data["payment"])
            payment_span.set_attribute("payment.method", result.method)
            payment_span.set_attribute("payment.processor", result.processor)

        with tracer.start_as_current_span("reserve_inventory"):
            reserve(order_data["items"])

        span.set_attribute("order.status", "completed")
```
## Trace-Based Testing
A powerful pattern is writing assertions against traces -- verifying not just that a request succeeded, but that it followed the expected path through the system:
```python
# test_trace_assertions.py
import requests
import time


def test_order_flow_creates_expected_trace(trace_client):
    """Verify the order flow hits all expected services in the correct order."""
    # Trigger the order flow
    response = requests.post("https://api.example.com/orders", json={
        "items": [{"sku": "LAPTOP-1", "qty": 1}],
        "payment": {"method": "card", "token": "tok_test_123"},
    })
    assert response.status_code == 201
    trace_id = response.headers["X-Trace-Id"]

    # Allow time for spans to propagate to the backend
    time.sleep(5)
    trace_data = trace_client.fetch_trace(trace_id)

    # Assert all expected services are present
    service_names = [span["service_name"] for span in trace_data["spans"]]
    assert "api-gateway" in service_names
    assert "order-service" in service_names
    assert "payment-service" in service_names
    assert "inventory-service" in service_names

    # Assert ordering: payment happens before inventory reservation
    payment_span = next(s for s in trace_data["spans"]
                        if s["operation"] == "charge_payment")
    inventory_span = next(s for s in trace_data["spans"]
                          if s["operation"] == "reserve_inventory")
    assert payment_span["end_time"] <= inventory_span["start_time"]

    # Assert performance: total trace duration under 3 seconds
    root_span = next(s for s in trace_data["spans"] if s["parent_id"] is None)
    assert root_span["duration_ms"] < 3000

    # Assert no error spans
    error_spans = [s for s in trace_data["spans"] if s.get("status") == "ERROR"]
    assert len(error_spans) == 0, f"Unexpected errors in trace: {error_spans}"
```
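The `trace_client` fixture is deliberately left abstract. One hypothetical implementation, backed by Jaeger's HTTP query API: the port, route, and `otel.status_code` tag name follow Jaeger's conventions, but all of them are assumptions to adapt to whatever backend stores your traces:

```python
# Sketch: a trace_client backed by Jaeger's /api/traces/{id} query route.
import requests


class JaegerTraceClient:
    def __init__(self, base_url="http://jaeger-query:16686"):
        self.base_url = base_url

    def fetch_trace(self, trace_id):
        resp = requests.get(f"{self.base_url}/api/traces/{trace_id}", timeout=10)
        resp.raise_for_status()
        return self._flatten(resp.json()["data"][0])

    def _flatten(self, raw):
        """Map Jaeger's span format onto the dicts the test asserts against."""
        processes = raw["processes"]  # processID -> {"serviceName": ...}
        spans = []
        for s in raw["spans"]:
            parent = next((r["spanID"] for r in s.get("references", [])
                           if r["refType"] == "CHILD_OF"), None)
            status = next((t["value"] for t in s.get("tags", [])
                           if t["key"] == "otel.status_code"), None)
            spans.append({
                "service_name": processes[s["processID"]]["serviceName"],
                "operation": s["operationName"],
                "start_time": s["startTime"],                  # microseconds
                "end_time": s["startTime"] + s["duration"],
                "duration_ms": s["duration"] / 1000,
                "parent_id": parent,
                "status": status,
            })
        return {"spans": spans}
```

Keeping the flattening in its own method means the field mapping can be unit-tested against a canned payload without a running Jaeger instance.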
### What Trace-Based Tests Catch
| Assertion | What It Catches |
|---|---|
| Service present in trace | Missing service call (regression in integration) |
| Span ordering | Race conditions, incorrect orchestration |
| Total trace duration | End-to-end performance regression |
| No error spans | Swallowed errors, silent failures |
| Expected attributes on spans | Missing context propagation |
| Span count | Unexpected service calls (N+1 queries, extra retries) |
## OpenTelemetry Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  # Tail-based sampling: keep all error traces, sample 10% of successful traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
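Note that the `tail_sampling` processor ships in the collector's contrib distribution, not the core image. A minimal sketch for running this configuration locally; the image tag and mount path are illustrative, not prescriptive:

```yaml
# docker-compose.yaml (sketch)
services:
  otel-collector:
    # tail_sampling requires the contrib build of the collector
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC (from the SDK exporters above)
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus scrape endpoint
```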
## Sampling Strategies
In high-traffic systems, collecting every trace is expensive. Choose a sampling strategy:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Head-based (probability) | Decide at trace start whether to sample | Simple, predictable cost | May miss rare errors |
| Tail-based | Decide after trace completes, based on outcome | Keeps all errors and slow traces | Higher memory usage in collector |
| Rate-limited | Sample N traces per second | Predictable cost | Misses bursts |
| Always-on for errors | Sample 100% of error traces | Never misses failures | Does not reduce volume of error traces |
Recommendation: Use tail-based sampling with always-on for errors and slow traces. This gives you 100% visibility into problems while keeping costs manageable for successful fast requests.
Distributed tracing is the backbone of observability in microservices. Combined with structured logging and metrics, it provides the complete picture needed for effective production testing.