# Runbooks

Step-by-step procedures for responding to common incidents.
## Runbook: Redis Connection Lost

**Alert:** `RedisConnectionLost`

### Symptoms
- Application errors mentioning Redis
- Cache miss rate 100%
- Lock operations failing
### Investigation

```bash
# 1. Check Redis status
redis-cli ping
# 2. Check network connectivity
nc -zv redis-host 6379
# 3. Check Redis logs
kubectl logs statefulset/redis
# 4. Check application logs
kubectl logs -l app=my-service --tail=100 | grep -i redis
```
### Resolution

| Cause | Action |
|---|---|
| Redis crashed | Restart Redis, check memory |
| Network issue | Check security groups, DNS |
| Auth failure | Verify credentials |
| Max connections | Increase maxclients or reduce pool |
### Recovery
- Once Redis is back, verify connectivity
- Application should auto-reconnect (see the sketch after this list)
- Monitor cache hit rate returning to normal
- Check for any DLQ messages during outage
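
Reconnection is normally the client library's job. A minimal sketch of reconnect tuning, assuming the application uses ioredis; the hostname and backoff values here are illustrative, not the service's actual config:

```typescript
import Redis from "ioredis";

// ioredis reconnects on its own; retryStrategy tunes the backoff between attempts.
const redis = new Redis({
  host: "redis-host", // illustrative hostname from the connectivity check above
  port: 6379,
  // Linear backoff capped at 2s; returning a number means "retry after N ms".
  retryStrategy: (attempt) => Math.min(attempt * 200, 2000),
  // Buffer commands while disconnected so work resumes once the link is back.
  enableOfflineQueue: true,
});

redis.on("reconnecting", () => console.warn("Redis connection lost, retrying..."));
redis.on("ready", () => console.info("Redis connection restored"));
```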
## Runbook: High Cache Miss Rate

**Alert:** `CacheHitRateBelow80`

### Symptoms
- Cache hit rate < 80%
- Increased database load
- Higher latency
### Investigation

```bash
# Check cache key distribution
redis-cli --scan --pattern "cache:*" | wc -l
# Check memory usage
redis-cli INFO memory
# Check eviction rate
redis-cli INFO stats | grep evicted
```
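
To measure the hit rate directly rather than infer it, compare the `keyspace_hits` and `keyspace_misses` counters from `INFO stats`. A minimal sketch, assuming an ioredis client (connection details omitted; note the counters are cumulative since the server last restarted):

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes default localhost:6379

// Compute the cumulative hit rate from INFO stats counters.
async function cacheHitRate(): Promise<number> {
  const stats = await redis.info("stats");
  const read = (field: string): number => {
    const match = stats.match(new RegExp(`${field}:(\\d+)`));
    return match ? Number(match[1]) : 0;
  };
  const hits = read("keyspace_hits");
  const misses = read("keyspace_misses");
  return hits + misses === 0 ? 1 : hits / (hits + misses);
}

cacheHitRate().then((rate) => console.log(`hit rate: ${(rate * 100).toFixed(1)}%`));
```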
### Resolution

| Cause | Action |
|---|---|
| TTL too short | Increase TTL (see the sketch below this table) |
| Cache evictions | Increase Redis memory |
| Cold start | Wait for cache to warm |
| Tag invalidation | Review invalidation patterns |
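
When TTLs are the culprit, the fix usually lives in the caching helper rather than in Redis itself. A hedged sketch of a read-through helper with a configurable TTL; the helper name, key handling, and default TTL are illustrative, not the service's actual API:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Read-through helper: serve from cache when possible, otherwise load from
// the source of truth and store with an explicit TTL.
async function getOrSet<T>(
  key: string,
  loader: () => Promise<T>,
  ttlSeconds = 3600 // raising this is the "increase TTL" remediation
): Promise<T> {
  const cached = await redis.get(key);
  if (cached !== null) return JSON.parse(cached) as T;

  const value = await loader();
  await redis.set(key, JSON.stringify(value), "EX", ttlSeconds);
  return value;
}
```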
## Runbook: Lock Contention

**Alert:** `HighLockContention`

### Symptoms
- Lock timeout rate > 5%
- Increased request latency
- Timeouts in lock-protected operations
### Investigation

```bash
# Check active locks (SCAN avoids blocking Redis the way KEYS does)
redis-cli --scan --pattern "lock:*"
# Check lock TTLs
redis-cli TTL "lock:payment:order-123"
# Check who holds lock
redis-cli GET "lock:payment:order-123"
```
### Resolution

| Cause | Action |
|---|---|
| Long operations | Reduce lock scope (see the sketch below this table) |
| Stuck lock | Wait for TTL or manual release |
| Too many workers | Reduce concurrency |
| Lock key too broad | Make keys more specific |
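
Reducing lock scope usually means holding the lock only around the critical write, with a short TTL and a unique token. A minimal sketch, assuming locks are plain `SET NX PX` keys as the commands above suggest; the helper name and TTL are illustrative:

```typescript
import { randomUUID } from "crypto";
import Redis from "ioredis";

const redis = new Redis();

async function withLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const token = randomUUID();
  // Short TTL: the lock should outlive the critical section only slightly.
  const acquired = await redis.set(key, token, "PX", 5000, "NX");
  if (acquired !== "OK") throw new Error(`lock busy: ${key}`);
  try {
    // Keep this block small: the state mutation, not the whole request.
    return await fn();
  } finally {
    // A safer compare-and-delete variant is sketched in the emergency-release section below.
    await redis.del(key);
  }
}
```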
### Manual Lock Release (Emergency)

> **Last resort only.** Manual lock release can cause data corruption if the lock holder is still running.

```bash
redis-cli DEL "lock:stuck-key"
```
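
A safer variant deletes the lock only if it still holds the expected token, so a lock that has since been re-acquired by another worker is left alone. A sketch using a Lua compare-and-delete, assuming the lock value is the holder's token as in the sketch above; the key and token are illustrative:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Atomic compare-and-delete: only the holder's token can release the lock.
const RELEASE = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`;

async function releaseLock(key: string, token: string): Promise<boolean> {
  return (await redis.eval(RELEASE, 1, key, token)) === 1;
}

// releaseLock("lock:stuck-key", "<token read via GET above>");
```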
## Runbook: Stream Consumer Lag

**Alert:** `StreamConsumerLagHigh`

### Symptoms
- Consumer lag > 1000 messages
- Delayed job processing
- Growing stream length
### Investigation

```bash
# Check stream length
redis-cli XLEN jobs
# Check consumer group info
redis-cli XINFO GROUPS jobs
# Check pending messages
redis-cli XPENDING jobs workers
```
### Resolution

| Cause | Action |
|---|---|
| Not enough consumers | Scale up workers |
| Slow processing | Optimize handlers |
| Consumer crashed | Check for stuck pending entries (see the sketch below this table) |
| DLQ processing slow | Address DLQ issues |
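
Pending entries whose consumer died can be claimed by a live consumer with `XAUTOCLAIM` (Redis 6.2+). A minimal sketch, assuming the `jobs` stream and `workers` group from the commands above; the consumer name and idle threshold are illustrative:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Claim entries pending for over 60s and hand them to a recovery consumer
// so they get re-processed instead of sitting stuck in the PEL.
async function claimStuckEntries(): Promise<void> {
  // Reply is [nextCursor, entries] (Redis 7 appends a third element of deleted IDs).
  const [nextCursor, entries] = (await redis.call(
    "XAUTOCLAIM",
    "jobs",
    "workers",
    "recovery-worker", // illustrative consumer name
    "60000",           // min idle time in ms
    "0-0"
  )) as [string, Array<[string, string[]]>];
  console.log(`claimed ${entries.length} entries, next cursor ${nextCursor}`);
}
```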
### Scale Consumers

```bash
kubectl scale deployment job-workers --replicas=20
```

## Runbook: DLQ Growing

**Alert:** `DLQNotEmpty`

### Symptoms
- Messages in DLQ
- Job failures
- Missing functionality
### Investigation

```bash
# Check DLQ length
redis-cli XLEN jobs:dlq
# Read DLQ messages
redis-cli XRANGE jobs:dlq - + COUNT 10
```
### Resolution

1. **Analyze failures**: check the error details recorded on the DLQ messages
2. **Fix the root cause**: a code bug, an external service outage, etc.
3. **Replay messages**: move them from the DLQ back to the main stream
4. **Clear the DLQ**: only after the replayed messages have been processed
### Replay DLQ Messages

```bash
# Replay DLQ entries server-side: XRANGE output is multi-line per entry,
# so piping it through `while read` would mangle fields. The script also
# XDELs each replayed entry so reruns are safe.
redis-cli EVAL '
local entries = redis.call("XRANGE", KEYS[1], "-", "+", "COUNT", ARGV[1])
for _, entry in ipairs(entries) do
  redis.call("XADD", KEYS[2], "*", unpack(entry[2]))
  redis.call("XDEL", KEYS[1], entry[1])
end
return #entries' 2 jobs:dlq jobs 100
```
## Runbook: Rate Limit Blocking Legitimate Traffic

**Alert:** `HighRateLimitRejections`

### Symptoms
- High 429 response rate
- Customer complaints
- Normal traffic being blocked
### Investigation

```bash
# Check rate limit keys (SCAN avoids blocking Redis the way KEYS does)
redis-cli --scan --pattern "ratelimit:*" | head -20
# Check specific user/IP
redis-cli GET "ratelimit:ip:1.2.3.4"
```
### Resolution

| Cause | Action |
|---|---|
| Limit too low | Increase limit |
| Traffic spike | Temporary increase |
| Bot traffic | Block at WAF |
| Shared IP (NAT) | Key limits by user ID instead of IP (see the sketch below this table) |
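
A hedged sketch of what keying by user ID could look like, assuming a simple fixed-window counter like the `ratelimit:*` keys above; the function and key names are illustrative, not the service's actual limiter:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Fixed-window counter keyed by user ID, so users behind a shared NAT
// aren't throttled collectively by an IP-based key.
async function isAllowed(userId: string, limit = 100, windowSec = 60): Promise<boolean> {
  const key = `ratelimit:user:${userId}`;
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.expire(key, windowSec); // start the window on the first hit
  }
  return count <= limit;
}
```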
### Temporary Limit Increase

```typescript
// In emergency, deploy with increased limits
@RateLimit({ limit: 1000, window: 60 }) // Was 100
```

## Emergency Contacts
| Role | Contact |
|---|---|
| On-call engineer | PagerDuty |
| Redis admin | #redis-support |
| Application team | #app-team |
## Post-Incident
After resolving:
- Document timeline in incident ticket
- Update runbook if new learnings
- Schedule post-mortem if needed
- Create follow-up tasks for prevention