# Troubleshooting
Debug common issues with metrics collection and visualization.
## Metrics Not Appearing

### Problem: `/metrics` endpoint returns empty

**Symptoms:**
- Endpoint exists but no metrics shown
- Prometheus scrapes succeed but no data
**Solutions:**

1. Check if metrics plugin is enabled:

```typescript
// Verify plugin is registered
new MetricsPlugin({
  enabled: true, // Make sure this is true
})
```

2. Check if endpoint is configured:
```typescript
new MetricsPlugin({
  exposeEndpoint: true, // Must be true
  endpoint: '/metrics', // Verify path
})
```

3. Test endpoint manually:
```bash
curl http://localhost:3000/metrics

# Should see output like:
# # HELP redisx_commands_total Total Redis commands executed
# # TYPE redisx_commands_total counter
# redisx_commands_total{command="GET",client="default",status="success"} 123
```

4. Check application logs:

```bash
# Look for metrics plugin initialization
# Should see: "MetricsPlugin initialized"
docker logs app | grep -i metrics
```

### Problem: Some metrics missing
**Causes:**

1. Plugin metrics disabled:

```typescript
new MetricsPlugin({
  pluginMetrics: false, // ← This disables plugin metrics
})

// Fix: set it to true
new MetricsPlugin({
  pluginMetrics: true,
})
```

2. No operations performed yet:

```typescript
// Cache metrics won't appear until the cache is used
await cache.get('key'); // Now cache_hits_total appears
```

3. Labels don't match:
```promql
# ❌ Wrong - label mismatch
redisx_cache_hits_total{layer="L1"}

# ✅ Correct - check actual label value
redisx_cache_hits_total{layer="l1"}
```

## High Cardinality
### Problem: Too many time series

**Symptoms:**
- Prometheus using lots of memory
- Slow query performance
- OOM errors
**Check cardinality:**

```promql
# Find metrics with high cardinality
topk(10, count by (__name__)({__name__=~".+"}))
```

**Causes:**
1. Using high-cardinality labels:

```typescript
// ❌ Bad - Creates millions of series!
metrics.incrementCounter('requests_total', {
  userId: req.user.id,              // 1M users
  requestId: req.id,                // Infinite
  timestamp: Date.now().toString(), // Infinite
});

// ✅ Good - Low cardinality
metrics.incrementCounter('requests_total', {
  endpoint: '/api/users', // ~100 endpoints
  method: 'GET',          // 9 methods
  status: '200',          // ~50 statuses
});
```

**Solutions:**
1. Remove high-cardinality labels (see the sketch after this list):

```typescript
// Remove user-specific labels
// Use aggregation labels instead
```

2. Limit label values:
```typescript
const allowedEndpoints = ['/api/users', '/api/products', '/api/orders'];

if (allowedEndpoints.includes(endpoint)) {
  metrics.incrementCounter('requests_total', { endpoint });
} else {
  metrics.incrementCounter('requests_total', { endpoint: 'other' });
}
```

3. Increase Prometheus resources:
```yaml
# docker-compose.yml
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G
```
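The aggregation-label idea from step 1 above, as a minimal sketch: replace a per-user label with a bounded attribute. The `plan` field, the `recordRequest` helper, and the three-value union are illustrative assumptions, not part of the plugin's API:

```typescript
// Sketch: derive a low-cardinality label from a high-cardinality value.
// 'plan' is a hypothetical bounded attribute standing in for userId.
type Plan = 'free' | 'pro' | 'enterprise';

function recordRequest(user: { id: string; plan: Plan }, endpoint: string) {
  metrics.incrementCounter('requests_total', {
    endpoint,        // bounded set of endpoints
    plan: user.plan, // 3 possible values instead of one series per user
  });
}
```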
## Metrics Not Updating

### Problem: Gauge shows stale value

**Cause:** Gauge not being updated periodically

**Solution:**

```typescript
// ❌ Wrong - Set once, never updated
this.metrics.setGauge('queue_size', await this.getQueueSize());

// ✅ Correct - Update periodically
setInterval(async () => {
  const size = await this.getQueueSize();
  this.metrics.setGauge('queue_size', size);
}, 15000); // Every 15 seconds
```

### Problem: Counter not incrementing
**Check:**

```typescript
// Verify counter is actually being called
console.log('Incrementing counter');
this.metrics.incrementCounter('my_counter_total');

// Check for typos in metric name — must match registered name exactly
this.metrics.incrementCounter('my_conter_total'); // ❌ Typo!
```
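One way to make such typos harder to write, sketched under the assumption that `incrementCounter` accepts a plain name string: keep metric names in a single constants object and reference it everywhere instead of retyping literals.

```typescript
// Hypothetical helper: a central registry of metric names.
// A misspelled property is a compile error; a misspelled string literal is not.
export const METRIC_NAMES = {
  myCounter: 'my_counter_total',
  cacheHits: 'redisx_cache_hits_total',
} as const;

this.metrics.incrementCounter(METRIC_NAMES.myCounter);
```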
## Prometheus Scrape Failures

### Problem: Target down in Prometheus UI

**Check connectivity:**

```bash
# Can Prometheus reach the app?
curl http://app:3000/metrics

# Check from the Prometheus container
docker exec prometheus curl http://app:3000/metrics
```

**Check scrape config:**
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'nestjs-app'
    static_configs:
      - targets: ['app:3000'] # ← Verify hostname and port
    metrics_path: '/metrics'  # ← Verify path
```

**Check logs:**

```bash
# Prometheus logs
docker logs prometheus | grep error

# Application logs
docker logs app | grep metrics
```

### Problem: Scrape timeout
**Increase timeout:**

```yaml
scrape_configs:
  - job_name: 'nestjs-app'
    scrape_timeout: 30s # Increase from default 10s
```

**Optimize metrics endpoint:**

```typescript
// Reduce number of metrics
// Disable expensive metrics
new MetricsPlugin({
  collectDefaultMetrics: false,
})
```

## Query Performance
### Problem: Slow Prometheus queries

**Optimize queries:**

```promql
# ❌ Slow - Calculates for all time series
rate(http_requests_total[5m])

# ✅ Faster - Filter first
rate(http_requests_total{endpoint="/api/users"}[5m])
```

**Use recording rules:**
```yaml
# Pre-calculate expensive queries
groups:
  - name: redis_rules
    interval: 15s
    rules:
      - record: redis:cache:hit_rate
        expr: |
          sum(rate(myapp_redis_cache_hits_total[5m])) /
          (sum(rate(myapp_redis_cache_hits_total[5m])) + sum(rate(myapp_redis_cache_misses_total[5m])))
```

**Then query the recording rule:**

```promql
# Fast - Pre-calculated
redis:cache:hit_rate
```

## Grafana Issues
### Problem: No data in Grafana panel

1. Check data source:

   Grafana → Data Sources → Prometheus
   - URL: `http://prometheus:9090` ✓
   - Access: Proxy ✓
   - Test connection: Success ✓

2. Test query directly in Prometheus:

```promql
# Visit Prometheus UI: http://localhost:9090
# Run query there first
myapp_redis_cache_hits_total
```

3. Check time range:

   Panel → Query Options → Time Range
   - Make sure it covers the period with data

4. Check query syntax:

```promql
# ❌ Wrong - Syntax error
redisx_cache_hits_total{layer=l1}

# ✅ Correct - Quoted value
redisx_cache_hits_total{layer="l1"}
```

### Problem: Gaps in graph
**Causes:**

1. Scrape interval too long:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'app'
    scrape_interval: 60s # Too long, reduce to 15s
```

2. Application downtime:

   Gaps in the graph correspond to periods when the application was down.

3. Missing data (see the sketch after this list):

```typescript
// Operation not happening frequently enough
// Add synthetic data or reduce the scrape interval
```
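The synthetic-data option is the same pattern as the periodic gauge refresh shown earlier on this page; a minimal sketch, assuming the `setGauge` method from those examples and an illustrative `active_sessions` metric:

```typescript
// Refresh the value on a timer so every scrape returns a current sample,
// even when the underlying operation rarely runs.
setInterval(async () => {
  const sessions = await this.countActiveSessions(); // hypothetical data source
  this.metrics.setGauge('active_sessions', sessions);
}, 15000); // roughly match the scrape interval
```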
## Memory Issues

### Problem: High memory usage

**Check metric count:**

```bash
# How many active time series?
curl http://localhost:9090/api/v1/status/tsdb
# Should be < 100,000 for reasonable memory usage
```

**Reduce metrics:**
```typescript
// Disable features you don't need
new MetricsPlugin({
  collectDefaultMetrics: false, // Disable Node.js metrics
  commandMetrics: false,        // Disable per-command metrics
})
```

**Reduce histogram buckets:**

```typescript
new MetricsPlugin({
  histogramBuckets: [0.01, 0.1, 1], // Fewer buckets
})
```

**Set retention:**
```bash
# Retention is set via Prometheus startup flags (the default retention.time is 15d)
prometheus \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.retention.size=10GB
```

## Authentication Issues
### Problem: 403 Forbidden on /metrics

If the endpoint is protected (for example with basic auth), Prometheus must send matching credentials.

**Add authentication on the application side:**

```typescript
// main.ts
import { NestFactory } from '@nestjs/core';
import * as basicAuth from 'express-basic-auth';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  // Protect the metrics endpoint with basic auth
  app.use('/metrics', basicAuth({
    users: { prometheus: process.env.METRICS_PASSWORD },
    challenge: true,
  }));
  await app.listen(3000);
}
bootstrap();
```

**Configure Prometheus with matching credentials:**

```yaml
scrape_configs:
  - job_name: 'nestjs-app'
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
```

## Debugging Checklist
- [ ] MetricsPlugin is enabled
- [ ] Endpoint is accessible (`curl /metrics`)
- [ ] Prometheus can reach the application
- [ ] Scrape config is correct
- [ ] Metrics are being generated (operations happening)
- [ ] Labels match in queries
- [ ] Time range includes data
- [ ] No high cardinality issues
- [ ] Prometheus has enough resources
## Debug Commands

```bash
# Check metrics endpoint
curl http://localhost:3000/metrics

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check active time series count
curl http://localhost:9090/api/v1/status/tsdb

# Check Prometheus config
curl http://localhost:9090/api/v1/status/config

# Validate Prometheus config
promtool check config prometheus.yml

# Test query
curl 'http://localhost:9090/api/v1/query?query=up'
```

## Common Errors
| Error | Cause | Solution |
|---|---|---|
| `HELP line already seen` | Metric defined twice | Check for duplicate metric names |
| `Target down` | Can't reach endpoint | Check network, firewall, hostname |
| `Context deadline exceeded` | Scrape timeout | Increase `scrape_timeout` |
| `OOM killed` | Too much memory | Reduce cardinality, increase memory |
| `No data points` | No operations yet | Trigger operations or wait |
## Next Steps
- Testing — Test metrics collection
- Recipes — Real-world patterns
- Configuration — Review config
- Prometheus — Check Prometheus setup
- Grafana — Fix visualization issues