Great Stack to Doesn't Work #9 — Distributed Tracing: "Why Does This Request Take 3 Seconds?"

A survival guide for when everything goes wrong in production.

A user clicks "Place Order." The spinner spins. Three seconds pass. The order completes. Three seconds. For a button click.

The product manager asks: "Why does this take 3 seconds?"

You check the API gateway. 50ms. You check the order service. 80ms. You check the payment service. 120ms. You check the inventory service. 60ms. The total is 310ms. Where's the other 2,690ms?

It's in the gaps. The network hops. The serialization. The queue wait times. The connection establishment. The TLS handshakes. The parts of the request lifecycle that no single service can see because they happen between services. Distributed tracing makes the gaps visible.

A trace is the complete journey of a request through your system, from the user's browser click to the final database write and back. One trace, one request.

A span is a single operation within that trace. "Order service: validate order" is a span. "Payment service: charge card" is a span. "Database: INSERT into orders" is a span. Spans have a start time, duration, status, and parent span.

Spans nest. The "process order" span contains "validate order," "check inventory," "charge payment," and "send confirmation" as child spans. Each child can have its own children. The full tree is the trace.

Trace context is the thread that connects spans across services. When Service A calls Service B, it passes a trace ID and a parent span ID in HTTP headers. Service B creates a new span with that trace ID and parent. Now both services' spans are part of the same trace.

Without context propagation, each service creates an isolated trace. You can see what happened inside each service, but you can't see the full request journey. The gaps between services, the missing 2,690ms, stay invisible.

OpenTelemetry (OTel) is the industry standard for instrumentation.
It provides SDKs for every major language, a collector for receiving and routing telemetry data, and semantic conventions for consistent naming.

Auto-instrumentation covers the basics without code changes:

    # Python: install the packages
    pip install opentelemetry-distro opentelemetry-exporter-otlp
    opentelemetry-bootstrap -a install

    # Run with auto-instrumentation
    opentelemetry-instrument \
        --service_name order-service \
        --traces_exporter otlp \
        --metrics_exporter otlp \
        --exporter_otlp_endpoint http://otel-collector:4317 \
        python app.py

Auto-instrumentation hooks into HTTP frameworks, database drivers, and messaging libraries. It automatically creates spans for incoming requests, outgoing HTTP calls, database queries, and message queue operations.

Manual instrumentation adds business-specific spans:

    from opentelemetry import trace

    tracer = trace.get_tracer("order-service")

    def process_order(order):
        # validate, check_inventory, and charge are the application's own functions.
        with tracer.start_as_current_span("process_order") as span:
            span.set_attribute("order.id", order.id)
            span.set_attribute("order.total", order.total)
            span.set_attribute("order.items_count", len(order.items))

            # Each nested block becomes a child span of "process_order".
            with tracer.start_as_current_span("validate_order"):
                validate(order)

            with tracer.start_as_current_span("check_inventory"):
                check_inventory(order.items)

            with tracer.start_as_current_span("charge_payment"):
                charge(order.payment_method, order.total)

The auto-instrumented spans tell you "the order service called the payment service." The manual spans tell you "inside the order service, validation took 10ms, inventory check took 50ms, and the payment charge took 200ms." Both are necessary for complete visibility.

When Service A calls Service B, the trace context travels in HTTP headers. Two standards dominate.

W3C Trace Context (the modern standard):

    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    tracestate: vendor=value

The traceparent header encodes four fields: version, trace ID (32 hex chars), parent span ID (16 hex chars), and trace flags (sampled or not).
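To make the traceparent layout concrete, here is a minimal sketch of a parser for the version 00 format described above. The names `TraceParent` and `parse_traceparent` are hypothetical, chosen for illustration; in a real service the OpenTelemetry propagators do this for you, so treat it as a debugging aid and not a replacement.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_span_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_span_id, sampled)


example = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

Logging the parsed value at a service boundary is a quick way to check whether the trace ID that arrives is the same one that left the upstream service.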
B3 (Zipkin's original format):

    X-B3-TraceId: 4bf92f3577b34da6a3ce929d0e0e4736
    X-B3-SpanId: 00f067aa0ba902b7
    X-B3-Sampled: 1

Or the compact single-header version:

    b3: 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1

If you're starting fresh: use W3C. It's the standard, it's widely supported, and it's what OpenTelemetry defaults to.

If you have existing Zipkin infrastructure: B3 works fine. OTel collectors can translate between formats.

The critical rule: every service in the request path must propagate context. If the path is Service A → B → C → D and Service C doesn't propagate headers, the trace breaks at C. You'll see A → B in one trace and D in a separate trace with no connection. This is exactly how we lost 3 weeks debugging the "where's the other 2 seconds?" problem.

At 10,000 requests per second, tracing every request generates enormous amounts of data. A single trace might have 30 spans, each with attributes and events. At 10K rps, that's 300K spans per second. Storing and indexing all of them is expensive and often unnecessary.

Head-based sampling decides at the start of the trace whether to record it. Simple and predictable.

    # OTel Collector config
    processors:
      probabilistic_sampler:
        sampling_percentage: 10  # Keep 10% of traces

The problem: you decide before knowing whether the trace is interesting. A 10% sample rate captures only 10% of errors, and if errors are 0.1% of traffic, nearly all sampled traces are successful requests you don't care about.

Tail-based sampling decides after the trace completes. It can keep all error traces, all slow traces, and a sample of normal traces.

    processors:
      tail_sampling:
        policies:
          - name: errors
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-requests
            type: latency
            latency: {threshold_ms: 1000}
          - name: normal
            type: probabilistic
            probabilistic: {sampling_percentage: 5}

This keeps 100% of errors, 100% of requests over 1 second, and 5% of everything else. The interesting traces are always captured.
The boring ones are sampled.

The trade-off: tail-based sampling requires buffering complete traces in memory before deciding. The OTel Collector needs enough memory to hold all in-flight traces. For high-throughput services, this can be significant.

Adaptive sampling adjusts the rate dynamically. Under normal conditions, sample 5%. When error rates spike, automatically increase to 50% or 100%. This captures detail when you need it and saves resources when you don't.

Jaeger: The mature choice. Built by Uber, donated to CNCF. Strong UI for trace exploration. Supports Elasticsearch and Cassandra as storage backends, with Kafka as an optional buffer in front of them. If you need a standalone tracing system with its own storage and UI, Jaeger is battle-tested.

Grafana Tempo: The cost-efficient choice. Stores traces in object storage (S3, GCS) without indexing. This makes it dramatically cheaper than Jaeger for high volumes, since object storage costs pennies per GB. The trade-off: without an index, searching by arbitrary attributes is more limited than in an indexed backend (newer Tempo releases add TraceQL for attribute queries). You search by trace ID, service name, or through Grafana's integration with logs and metrics (find the trace ID in a log, click through to the trace). If you're already in the Grafana ecosystem (Prometheus + Loki + Grafana), Tempo is the natural addition.

Zipkin: The original. Simple, lightweight, easy to deploy. Good for smaller setups. Less feature-rich than Jaeger but also less complex.

The decision: if you're running Grafana, choose Tempo. If you need standalone trace search by attributes, choose Jaeger. If you want the simplest possible setup, choose Zipkin.

The real value of distributed tracing isn't seeing individual traces. It's correlating traces with metrics and logs.

In Grafana, with Prometheus + Loki + Tempo:

- Dashboard shows a latency spike (Prometheus metric).
- Click on the spike → Grafana shows exemplar traces during that window (Prometheus exemplars link to Tempo trace IDs).
- Open the trace → See the full span tree. One span in the payment service took 2.4 seconds.
- Click on the slow span → Grafana links to Loki logs filtered by that trace ID and time window. The log shows: "connection timeout to payment provider, retry 3 of 3."

From "something is slow" to "the payment provider is timing out" in 4 clicks. No grep. No manual log correlation. No guessing.

The prerequisites:

- Metrics: Use exemplars to embed trace IDs in Prometheus metrics.
- Logs: Include trace_id and span_id in every structured log line.
- Traces: Use OpenTelemetry to generate spans with service.name and standard attributes.
- Grafana: Configure data source correlations between Prometheus, Loki, and Tempo.

A span that says "HTTP POST /api/orders 200 180ms" is useful. A span that says "HTTP POST /api/orders 200 180ms, order_id=12345, items=3, total=$299.97, customer_tier=premium, warehouse=us-east" is actionable.

Attributes are key-value pairs attached to spans:

    span.set_attribute("order.id", order_id)
    span.set_attribute("order.items_count", len(items))
    span.set_attribute("customer.tier", customer.tier)
    span.set_attribute("db.statement", "INSERT INTO orders...")

Events are timestamped messages within a span's lifetime:

    span.add_event("inventory_check_passed", {
        "warehouse": "us-east",
        "all_items_available": True
    })
    span.add_event("payment_initiated", {
        "provider": "stripe",
        "amount": 299.97
    })

Attributes describe the span. Events describe what happened during the span. Both are searchable (if your backend supports it), and both make the difference between a trace you can look at and a trace you can learn from.

Semantic conventions: OpenTelemetry defines standard attribute names. Use them.

- http.method, http.status_code, http.url
- db.system, db.statement, db.operation
- messaging.system, messaging.destination
- rpc.system, rpc.method

Newer releases of the semantic conventions rename several of these (for example, http.request.method and http.response.status_code), so check which version your SDK and instrumentation libraries emit.

Standard names mean your dashboards and alerts work across services without custom parsing.

Checkout flow. User clicks "Pay."
Seven microservices involved: API Gateway → Order Service → Inventory Service → Pricing Service → Payment Service → Notification Service → Analytics Service.

Each service reported latency under 100ms. Total measured by the user: 3.2 seconds. Distributed tracing was deployed, but nobody had looked at a full trace end-to-end.

The trace revealed:

- API Gateway → Order Service: 15ms network latency (normal).
- Order Service: 80ms internal processing. Then it calls Inventory and Pricing sequentially, not in parallel. Inventory: 90ms. Pricing: 70ms. Sequential total: 160ms, of which 70ms is avoidable, because in parallel the pair would take only as long as the slower call (90ms).
- Inventory Service → Database: 45ms. But the span showed 3 round trips: check stock, reserve stock, confirm reservation. Each was a separate database call with its own connection establishment. With connection pooling and a single transaction: 12ms.
- Order Service → Payment Service: 120ms. Normal. But the trace showed a 400ms gap between "inventory check complete" and "payment initiated." The order service was logging, synchronously writing to a file on an NFS mount. 400ms for a log write.
- Payment Service → External Payment Provider: 800ms. Expected. External API, nothing to optimize.
- Payment Service → Notification Service: 200ms. But the notification was sent synchronously. The user waited for the email to queue before seeing "Order confirmed."
- Analytics event: 150ms. Also synchronous.

Fixes:

- Parallelize Inventory and Pricing calls: saved 70ms.
- Connection pooling on Inventory's database: saved 33ms.
- Async logging (switch from synchronous file write to async buffer): saved 400ms.
- Async notification (fire-and-forget to a message queue): saved 200ms.
- Async analytics (same pattern): saved 150ms.

Total saved: ~850ms (70 + 33 + 400 + 200 + 150). New checkout time: roughly 2.35 seconds, down from 3.2. The 800ms payment provider call was the irreducible minimum.

None of this was visible without distributed tracing. Each service saw "I processed my part in under 100ms."
The trace showed "yes, but you waited 400ms for a log write and called two services sequentially that could have been parallel."

A team deployed OpenTelemetry across 12 services. Traces looked great, for 11 of them. Service #7 (a legacy Java service running an older framework) didn't propagate W3C trace headers. Every trace that passed through Service #7 broke into two fragments: spans before it and spans after it.

The team spent 3 weeks thinking their tracing setup was misconfigured. They rebuilt collectors, redeployed agents, checked network policies. The actual problem: Service #7's HTTP client library was configured with a custom interceptor that stripped unknown headers. The traceparent header was being removed at the HTTP client level.

Fix: one line. Add traceparent and tracestate to the allowed headers list.

The lesson: trace context propagation is all-or-nothing. One service that doesn't propagate breaks every trace that touches it. When deploying tracing, verify propagation at every service boundary, not just at the edges.

A high-traffic platform set sampling to 1% because storage was expensive. In normal operations, 1% sampling captured enough data for general analysis.

Then a subtle bug appeared. One in every 10,000 requests hit a code path that caused a 30-second timeout. Error rate: 0.01%. With 1% sampling and a 0.01% error rate, the probability that any given request both failed and was sampled was 0.0001%, or one in a million. On average, they had to process 1 million requests to capture a single instance of the slow trace.

For 2 weeks, users complained about random timeouts. The team could see the error rate in metrics but had zero traces showing the actual failure path. They eventually found it by adding targeted debug logging to the suspected code path, the very thing distributed tracing was supposed to eliminate.

After the incident, they switched to tail-based sampling: 100% of errors and slow requests, 1% of everything else. Storage costs increased 30%. Debugging time decreased by 90%.
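The arithmetic behind that incident is worth running against your own numbers. The sketch below uses the rates from the story (1% head sampling, 0.01% failure rate). The function name `requests_per_captured_trace` is hypothetical, and it assumes the sampling decision is independent of whether the request fails, which holds for head-based sampling.

```python
def requests_per_captured_trace(sample_rate: float, event_rate: float) -> float:
    """Average number of requests needed to capture one trace of a rare event.

    Assumes sampling is independent of the event, so the chance that a
    request is both affected and sampled is the product of the two rates.
    """
    capture_probability = sample_rate * event_rate
    return 1.0 / capture_probability


# Head-based sampling at 1%, with 0.01% of requests hitting the bug.
head_based = requests_per_captured_trace(sample_rate=0.01, event_rate=0.0001)

# Tail-based sampling that keeps 100% of failing traces.
tail_based = requests_per_captured_trace(sample_rate=1.0, event_rate=0.0001)

print(f"head-based: one captured failure per {head_based:,.0f} requests")
print(f"tail-based: one captured failure per {tail_based:,.0f} requests")
```

With head-based sampling the rare failure shows up in a trace about once per million requests. With a tail-based policy that keeps every error, it shows up every time it happens, which is once per 10,000 requests.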
Distributed tracing answers the question that logs and metrics can't: "What happened to this specific request across all the services it touched?"

Context propagation is the foundation. If one service doesn't propagate headers, the trace breaks. Verify propagation across every service boundary before trusting your traces.

Sampling strategy matters more than you think. Head-based sampling is simple but misses rare events. Tail-based sampling captures what matters but needs memory. Choose based on your traffic volume and your tolerance for missing interesting traces.

The biggest wins from tracing are always in the gaps: sequential calls that should be parallel, synchronous operations that should be async, and network overhead that shouldn't exist. No single service can see these problems. The trace reveals them instantly.

Have you found the "hidden gap" in a request's journey using distributed tracing? What was the surprise? And what sampling strategy do you use in production?

If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.

Follow me:

- LinkedIn: Mehmet TURAÇ
- X/Twitter: @TuracTheThinker

*This is part of the **Great Stack to Doesn't Work** series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.*
