Visualization
Analyze distributed traces with powerful visualization tools.
Jaeger UI
Industry-standard trace visualization.
Setup
# docker-compose.yml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # UI
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
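Once Jaeger is running, point your application's OTLP exporter at port 4318. A minimal sketch, assuming a recent OpenTelemetry Node SDK (@opentelemetry/sdk-node and @opentelemetry/exporter-trace-otlp-http); the service name user-service is just an example:

// tracing.ts — load this before the rest of the app
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'user-service',
  traceExporter: new OTLPTraceExporter({
    // Jaeger's OTLP HTTP receiver from the compose file above
    url: 'http://localhost:4318/v1/traces',
  }),
});

sdk.start();

// Flush remaining spans on shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});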
Access
http://localhost:16686

Search Traces
By Service:
Service: user-service

By Operation:
Operation: GET /api/users/:id

By Tags:
Tags:
  http.status_code = 500
  user.id = 123
  cache.hit = false

By Duration:
Min Duration: 1s
Max Duration: 10s

By Time:
Lookback: Last 1 hour
Custom: 2025-01-28 10:00 to 2025-01-28 11:00

Trace View
Timeline shows:
- Span duration (bar length)
- Parent-child relationships (indentation)
- Concurrent operations (overlapping bars)
- Critical path (longest chain)
Span Details
Click any span to see:
Overview:
Operation: redis.GET
Duration: 1.23ms
Start: 2025-01-28 10:15:23.456

Attributes:
{
  "db.system": "redis",
  "db.operation": "GET",
  "db.redis.key": "user:123",
  "redisx.plugin": "cache",
  "redisx.cache.hit": false
}

Events:
[2025-01-28 10:15:23.456] cache.lookup
[2025-01-28 10:15:23.458] cache.miss

Logs:
[10:15:23.456] INFO: Cache lookup for user:123
[10:15:23.458] WARN: Cache miss, querying database
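Attributes and events like these are written at instrumentation time. A hypothetical sketch of a cache lookup span created with @opentelemetry/api (the cache client is a placeholder and the redisx.* keys simply mirror the example above; this is not redisx's actual internal code):

import { trace, SpanStatusCode } from '@opentelemetry/api';

// Placeholder cache client, only for the sketch
declare const cache: { get(key: string): Promise<string | null> };

const tracer = trace.getTracer('redisx-example');

async function getCachedUser(id: string) {
  return tracer.startActiveSpan('redis.GET', async (span) => {
    try {
      span.setAttributes({
        'db.system': 'redis',
        'db.operation': 'GET',
        'db.redis.key': `user:${id}`,
        'redisx.plugin': 'cache',
      });
      span.addEvent('cache.lookup');

      const value = await cache.get(`user:${id}`);
      span.setAttribute('redisx.cache.hit', value !== null);
      if (value === null) span.addEvent('cache.miss');
      return value;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}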
Service Dependencies

The System Architecture view shows a dependency graph with:
- Services called
- Call frequency
- Average latency
- Error rates
Performance Analysis
Trace Comparison:
Compare similar requests to find performance regressions.
Baseline (v1.0.0): 50ms
Current (v1.1.0): 150ms
Regression: +100ms (+200%)
Slowdown in: database.query (10ms → 110ms)

Histogram:
Latency Distribution:
  < 10ms:    ████████████████████ 40%
  10-50ms:   ██████████████ 28%
  50-100ms:  ██████ 12%
  100-500ms: ████████ 15%
  > 500ms:   ███ 5%

Grafana Tempo
Store and query traces in Grafana.
Setup
# docker-compose.yml
version: '3.8'
services:
  tempo:
    image: grafana/tempo:latest
    ports:
      - "4318:4318"   # OTLP HTTP
      - "3200:3200"   # Tempo API
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml

Tempo Configuration
# tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks

query_frontend:
  search:
    max_duration: 1h

Grafana Data Source
# grafana-datasources.yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo

Access Grafana
http://localhost:3000

Explore Traces
1. Navigate to Explore
Grafana → Explore → Tempo

2. Search by Trace ID
Trace ID: 7f9c8a3b2e1d4c5a6b7e8f9a0b1c2d3e

3. Search by Tags
{
  service.name="user-service"
  http.status_code="500"
  cache.hit="false"
}

4. Search by Duration
duration > 1s

TraceQL
Query language for traces.
Find slow traces:
{ duration > 1s }

Find errors:
{ status = error }

Find cache misses:
{ span.redisx.cache.hit = false }

Complex query:
{
  resource.service.name = "user-service" &&
  span.http.method = "GET" &&
  span.http.status_code >= 500 &&
  duration > 500ms
}

Aggregate:
{ resource.service.name = "user-service" } | rate() by (span.http.status_code)
Correlate with Metrics

Link traces to metrics:
# Find slow requests
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket{
    service="user-service"
  }[5m])
) > 1

Then explore traces:
Grafana → Explore → Tempo
Search: { resource.service.name = "user-service" && duration > 1s }

Correlate with Logs
Link traces to logs:
# promtail-config.yaml
scrape_configs:
  - job_name: app-logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: app-logs
          __path__: /var/log/app/*.log   # adjust to where your app writes logs
    pipeline_stages:
      - json:
          expressions:
            trace_id: trace_id
      - labels:
          trace_id:
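The json stage above only finds a trace_id if the application writes structured logs that contain one. A sketch of one way to do that, assuming pino for logging and @opentelemetry/api for the active span context (neither is required by the promtail config; they are just an example pairing):

import pino from 'pino';
import { trace } from '@opentelemetry/api';

// Merge the active trace context into every log line
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

logger.warn({ key: 'user:123' }, 'Cache miss, querying database');
// => {"level":40,...,"trace_id":"7f9c...","span_id":"...","msg":"Cache miss, querying database"}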
View logs for trace:
Grafana → Explore → Loki
{job="app-logs", trace_id="abc123"}

Dashboards
Trace Statistics Dashboard
{
  "dashboard": {
    "title": "Trace Analytics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "sum(rate(traces_total[5m])) by (service)"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(traces_total{status=\"error\"}[5m])) by (service)"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(trace_duration_seconds_bucket[5m]))"
        }]
      }
    ]
  }
}

Plugin Performance Dashboard
Cache Performance:
# Cache hit rate
sum(rate(redisx_cache_hits_total[5m])) /
(sum(rate(redisx_cache_hits_total[5m])) + sum(rate(redisx_cache_misses_total[5m])))
# Cache stampede preventions
rate(redisx_cache_stampede_prevented_total[5m])

Lock Contention:
# Active locks
sum(redisx_locks_active)
# Lock wait time
histogram_quantile(0.95,
  rate(redisx_lock_wait_duration_seconds_bucket[5m])
)

Rate Limit:
# Block rate
sum(rate(redisx_ratelimit_requests_total{status="rejected"}[5m])) /
sum(rate(redisx_ratelimit_requests_total[5m]))

Analysis Patterns
Finding Performance Bottlenecks
1. Find slowest endpoint:
Search: duration > 1s
Group by: http.target
Sort: Duration DESC

2. Analyze trace:
GET /api/users/123 (2.5s)
├── cache.get (2ms)
├── database.query (2.4s) ← BOTTLENECK!
└── cache.set (3ms)

3. Root cause:
database.query attributes:
db.statement: "SELECT * FROM users WHERE id = $1"
db.rows_returned: 1
db.plan: "Seq Scan on users" ← Missing index!

Detecting Cache Issues
1. Find cache misses:
{ span.redisx.cache.hit = false }

2. Analyze pattern:
99% cache misses for endpoint: GET /api/products/:id
Cause: TTL too short (60s)
Solution: Increase TTL to 300s

Identifying Lock Contention
1. Find long lock waits:
{ span.redisx.lock.retry_count > 3 }

2. Analyze contention:
lock.acquire process:order-123 (5s)
├── Event: lock.retry (attempt: 1, waited: 1s)
├── Event: lock.retry (attempt: 2, waited: 2s)
├── Event: lock.retry (attempt: 3, waited: 3s)
└── Event: lock.timeout
Cause: Long-running process holding lock
Solution: Reduce lock TTL or optimize the process

Tracking Distributed Transactions
1. Search by transaction ID:
{ span.transaction.id = "txn-123" }

2. Trace flow:
Service A: POST /api/orders (500ms)
├── lock.acquire order:123 (2ms)
├── Service B: POST /internal/payment (200ms)
│ ├── lock.acquire payment:123 (2ms)
│ ├── payment.process (180ms)
│ └── lock.release payment:123 (1ms)
├── stream.publish orders (10ms)
└── lock.release order:123 (1ms)
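For a flow like this to appear as one trace, Service A must propagate the trace context on its outbound call and Service B must extract it. OpenTelemetry's HTTP instrumentations normally do this automatically; a manual sketch with @opentelemetry/api is shown below (the service-b URL is illustrative, and the default W3C traceparent propagator registered by the SDK is assumed):

import { context, propagation } from '@opentelemetry/api';

// Service A: inject the current trace context into outgoing request headers
async function callPaymentService(order: { id: string; total: number }) {
  const headers: Record<string, string> = { 'content-type': 'application/json' };
  propagation.inject(context.active(), headers); // adds traceparent (and tracestate)

  return fetch('http://service-b/internal/payment', {
    method: 'POST',
    headers,
    body: JSON.stringify(order),
  });
}

// Service B: extract the incoming context so its spans join the same trace
function handleIncoming(requestHeaders: Record<string, string>, work: () => void) {
  const extracted = propagation.extract(context.active(), requestHeaders);
  return context.with(extracted, work); // spans started inside `work` become children
}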
Alerting on Traces

Grafana Alerts
High error rate:
sum(rate(traces_total{status="error"}[5m])) /
sum(rate(traces_total[5m])) > 0.05

Slow traces:
histogram_quantile(0.95,
  rate(trace_duration_seconds_bucket[5m])
) > 1

Cache performance degradation:
sum(rate(redisx_cache_hits_total[5m])) /
(sum(rate(redisx_cache_hits_total[5m])) + sum(rate(redisx_cache_misses_total[5m]))) < 0.7

Best Practices
1. Use meaningful span names
// ✅ Good
'user.get', 'order.create', 'payment.process'
// ❌ Bad
'operation', 'function', 'handler'

2. Add contextual attributes
span.setAttributes({
  'user.id': user.id,
  'user.role': user.role,
  'order.total': order.total,
});

3. Record important events
span.addEvent('validation.started');
span.addEvent('payment.succeeded');
span.addEvent('email.sent');

4. Create dashboards for key metrics
- Request rate by service
- Error rate by endpoint
- P95/P99 latency
- Cache hit rates
- Lock contention
5. Set up alerts
- High error rate
- Slow response times
- Cache degradation
- Lock timeouts
Tools Comparison
| Feature | Jaeger | Grafana Tempo |
|---|---|---|
| UI | Dedicated trace UI | General observability |
| Search | Service, tags, duration | TraceQL |
| Storage | Cassandra, Elasticsearch | S3, local, GCS |
| Correlation | Traces only | Traces + metrics + logs |
| Cost | Lower | Higher (full stack) |
| Ease of use | Simpler to get started | More features, more setup |
Next Steps
- Configuration — Configure tracing
- Spans — Create custom spans
- Troubleshooting — Debug tracing