Observability Strategy

Effective observability requires knowing what to measure and why. This guide explains the key metrics and their significance.

Three Pillars

| Pillar | Answers | Use For |
| --- | --- | --- |
| Metrics | How much? How fast? | Dashboards, alerts |
| Logs | What happened? | Post-incident analysis |
| Traces | Where's the time spent? | Performance debugging |

Golden Signals

For any service, monitor these four:

| Signal | Metric | RedisX Example |
| --- | --- | --- |
| Latency | Response time distribution | `cache_operation_duration_seconds` |
| Traffic | Request rate | `cache_operations_total` |
| Errors | Error rate | `cache_errors_total` |
| Saturation | Resource utilization | `redis_pool_active_connections` |

Per-Plugin Metrics

Cache Metrics

| Metric | Type | What It Tells You |
| --- | --- | --- |
| `cache_hits_total` | Counter | Cache effectiveness |
| `cache_misses_total` | Counter | Database load |
| `cache_hit_rate` | Gauge | Overall efficiency |
| `cache_latency_seconds` | Histogram | User-facing impact |
| `cache_evictions_total` | Counter | Memory pressure |

Key indicator: Hit rate

Hit Rate = hits / (hits + misses)
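
The formula, together with the thresholds listed just below, can be sketched as a small helper. This is a minimal sketch rather than part of RedisX: `hitRate` and `classifyHitRate` are hypothetical names, and the inputs are assumed to be values read from `cache_hits_total` and `cache_misses_total`. The guide does not say which band the exact values 80% and 90% belong to, so the sketch assumes both fall in "Acceptable".

```typescript
type HitRateStatus = 'healthy' | 'acceptable' | 'investigate';

// Hit rate = hits / (hits + misses).
// Returns null when there was no traffic, because 0 / 0 is undefined
// and should not be reported as a 0% hit rate.
function hitRate(hits: number, misses: number): number | null {
  const total = hits + misses;
  return total === 0 ? null : hits / total;
}

// Maps a hit rate in the range 0..1 onto the three bands.
// Assumption: the boundary values 0.8 and 0.9 count as "acceptable".
function classifyHitRate(rate: number): HitRateStatus {
  if (rate > 0.9) return 'healthy';
  if (rate >= 0.8) return 'acceptable';
  return 'investigate';
}
```

Returning `null` for an idle cache keeps a quiet period from looking like a cache that misses on every request.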

> 90% = Healthy
80-90% = Acceptable
< 80% = Investigate

Lock Metrics

| Metric | Type | What It Tells You |
| --- | --- | --- |
| `lock_acquisitions_total` | Counter | Lock usage |
| `lock_timeouts_total` | Counter | Contention issues |
| `lock_wait_seconds` | Histogram | Wait time impact |
| `lock_held_seconds` | Histogram | Operation duration |

Key indicator: Timeout rate

Timeout Rate = timeouts / acquisitions
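
Both inputs are counters, so their raw values cover the whole process lifetime, and a rate over a recent window is usually more informative. The following is a minimal sketch that assumes two readings of the counters taken some interval apart. `LockCounters`, `counterIncrease`, and `timeoutRate` are hypothetical names, and the reset handling assumes counters restart from zero when the process restarts.

```typescript
// One reading of the two lock counters at a point in time.
interface LockCounters {
  acquisitions: number; // value of lock_acquisitions_total
  timeouts: number;     // value of lock_timeouts_total
}

// Increase of a monotonically increasing counter between two readings.
// If the value went down, the counter is assumed to have reset to zero,
// so the current value is taken as the whole increase since the reset.
function counterIncrease(previous: number, current: number): number {
  return current >= previous ? current - previous : current;
}

// Timeout rate = timeouts / acquisitions over the window between readings.
// Returns null when no acquisitions happened in the window.
function timeoutRate(previous: LockCounters, current: LockCounters): number | null {
  const acquisitions = counterIncrease(previous.acquisitions, current.acquisitions);
  const timeouts = counterIncrease(previous.timeouts, current.timeouts);
  return acquisitions === 0 ? null : timeouts / acquisitions;
}
```

For example, readings of 1000 acquisitions and 10 timeouts followed by 1200 and 13 give 3 timeouts over 200 acquisitions in the window, a rate of 1.5%.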

< 1% = Healthy
1-5% = Monitor
> 5% = Action needed

Rate Limit Metrics

| Metric | Type | What It Tells You |
| --- | --- | --- |
| `ratelimit_allowed_total` | Counter | Normal traffic |
| `ratelimit_rejected_total` | Counter | Throttled requests |
| `ratelimit_remaining` | Gauge | Headroom |

Key indicator: Rejection rate

Rejection Rate = rejected / (allowed + rejected)

< 5% = Normal operation
5-20% = High load or attack
> 20% = Possible attack or limit too low

Stream Metrics

| Metric | Type | What It Tells You |
| --- | --- | --- |
| `stream_messages_total` | Counter | Throughput |
| `stream_consumer_lag` | Gauge | Processing backlog |
| `stream_dlq_size` | Gauge | Accumulated processing failures |
| `stream_processing_seconds` | Histogram | Handler performance |

Key indicator: Consumer lag

Lag = number of messages waiting to be processed

< 100 = Healthy
100-1000 = Monitor
> 1000 = Scale consumers

Alert Strategy

What to Alert On

| Alert | Severity | Condition |
| --- | --- | --- |
| Redis down | Critical | `redis_up == 0` |
| Cache hit rate low | Warning | `hit_rate < 0.8` for 5m |
| Lock timeout spike | Warning | `timeout_rate > 0.05` for 5m |
| Consumer lag high | Warning | `lag > 1000` for 5m |
| DLQ not empty | Warning | `dlq_size > 0` for 1m |
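
The "for 5m" part of these conditions means the alert fires only after the condition has held without interruption for that long, which filters out brief dips. A minimal sketch of that behavior follows, similar in spirit to the `for` clause in Prometheus alerting rules. `SustainedCondition` is a hypothetical helper, not a RedisX API, and timestamps are assumed to be in milliseconds.

```typescript
// Tracks how long a condition has held and reports whether it should fire.
class SustainedCondition {
  private holdMs: number;
  private since: number | null = null; // time the condition first became true

  constructor(holdMs: number) {
    this.holdMs = holdMs;
  }

  // Call on every evaluation with the condition result and the current time.
  // Returns true once the condition has been true for at least holdMs
  // with no false evaluation in between.
  evaluate(conditionTrue: boolean, nowMs: number): boolean {
    if (!conditionTrue) {
      this.since = null; // any healthy evaluation resets the timer
      return false;
    }
    if (this.since === null) this.since = nowMs;
    return nowMs - this.since >= this.holdMs;
  }
}

// Example: "hit_rate < 0.8 for 5m"
const lowHitRate = new SustainedCondition(5 * 60 * 1000);
```

On each scrape the caller would pass something like `lowHitRate.evaluate(currentHitRate < 0.8, Date.now())` and raise the warning only when it returns `true`.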

What NOT to Alert On

| Metric | Why Not |
| --- | --- |
| Individual request errors | Too noisy, use error rate |
| Brief latency spikes | Normal variance, use percentiles |
| Single cache miss | Expected behavior |
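
To illustrate why percentiles absorb brief spikes, here is a minimal sketch of a nearest-rank percentile over raw latency samples. The `percentile` function is a hypothetical helper. In practice a histogram metric would estimate percentiles from buckets instead of raw samples; raw samples are used here only to keep the idea visible.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort on a copy
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// 99 requests at 2 ms plus a single 250 ms spike.
const latenciesMs: number[] = Array(99).fill(2).concat([250]);
const p99 = percentile(latenciesMs, 99); // the spike sits above rank 99
const worst = percentile(latenciesMs, 100); // the maximum, i.e. the spike
```

With these samples the p99 stays at the typical latency while the maximum jumps to the spike, so an alert on p99 stays quiet where an alert on the maximum would fire.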

Dashboard Design

Overview Dashboard

┌─────────────────┬─────────────────┬─────────────────┐
│  Cache Hit Rate │  Lock Timeouts  │  Stream Lag     │
│     92.5%       │     0.3%        │     45          │
├─────────────────┴─────────────────┴─────────────────┤
│                 Operations / Second                 │
│  ████████████████████████████████  2.5k ops/sec     │
├─────────────────┬─────────────────┬─────────────────┤
│ Latency (p99)   │ Error Rate      │ Redis Memory    │
│    2.3ms        │    0.01%        │   1.2GB / 4GB   │
└─────────────────┴─────────────────┴─────────────────┘

Investigation Dashboard

  • Time-series of all metrics
  • Breakdown by operation type
  • Breakdown by key pattern
  • Correlation with deployments

Tracing Strategy

What to Trace

| Operation | Trace? | Why |
| --- | --- | --- |
| Cache get/set | Yes | See L1 vs L2 latency |
| Lock acquire/release | Yes | Identify contention |
| Stream publish/consume | Yes | Follow message flow |
| Rate limit check | Optional | Usually fast |

Span Attributes

```typescript
// Useful span attributes
span.setAttribute('cache.key', key);
span.setAttribute('cache.hit', hit);
span.setAttribute('cache.layer', layer); // layer is either 'L1' or 'L2'
span.setAttribute('lock.key', key);
span.setAttribute('lock.acquired', acquired);
span.setAttribute('stream.name', stream);
span.setAttribute('stream.message_id', id);
```
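
One way to keep attribute names consistent across call sites is to wrap them in a typed helper. The sketch below assumes a span object that exposes `setAttribute(key, value)`, as OpenTelemetry-style spans do. `AttributeSpan`, `CacheLayer`, and `recordCacheLookup` are hypothetical names introduced for illustration.

```typescript
// Minimal structural type covering the one method used here, so the
// helper works with any tracer whose spans have a compatible method.
interface AttributeSpan {
  setAttribute(key: string, value: string | number | boolean): unknown;
}

// The cache layer is a type-level union; the value passed at runtime
// is one of these two strings.
type CacheLayer = 'L1' | 'L2';

// Records the cache attributes for a single lookup on the given span.
function recordCacheLookup(
  span: AttributeSpan,
  key: string,
  hit: boolean,
  layer: CacheLayer,
): void {
  span.setAttribute('cache.key', key);
  span.setAttribute('cache.hit', hit);
  span.setAttribute('cache.layer', layer);
}
```

Typing `layer` as a union lets the compiler reject any value other than `'L1'` or `'L2'` before it reaches the tracing backend.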

Sampling Strategy

| Environment | Sample Rate | Reasoning |
| --- | --- | --- |
| Development | 100% | Debug everything |
| Staging | 100% | Catch issues early |
| Production | 1-10% | Cost/storage balance |
| Production (errors) | 100% | Always capture errors |
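
The table can be read as a decision rule: keep every trace that contains an error, and otherwise keep a fraction that depends on the environment. A minimal sketch follows. `shouldSample` and `BASE_SAMPLE_RATE` are hypothetical names, and the 5% production rate is an arbitrary pick inside the 1-10% range. Because the rule needs to know whether an error occurred, it assumes the decision is made once the outcome of the operation is known.

```typescript
type Environment = 'development' | 'staging' | 'production';

// Base sample rates per environment.
// Assumption: 0.05 is one arbitrary value inside the 1-10% production range.
const BASE_SAMPLE_RATE: Record<Environment, number> = {
  development: 1.0,
  staging: 1.0,
  production: 0.05,
};

// Decides whether to keep a trace. Errors are always kept; anything else
// is kept with the base probability for the environment.
// `random` is injectable so the decision can be tested deterministically.
function shouldSample(
  env: Environment,
  isError: boolean,
  random: () => number = Math.random,
): boolean {
  if (isError) return true;
  return random() < BASE_SAMPLE_RATE[env];
}
```

Injecting `random` keeps the sampling policy separate from the source of randomness, which makes the rule easy to verify in unit tests.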

Released under the MIT License.