Metrics Implementation Deployment Checklist
Pre-Deployment (Development)
Metrics Implementation Deployment Checklist
Pre-Deployment (Development)
-
Code changes compiled without errors
- Run:
bun run type-check - Check for no TypeScript errors in metrics files
- Run:
-
Prometheus endpoint responds
- URL:
http://localhost:3001/metrics - Expected: Text format with metric definitions
- URL:
-
New metrics appear in output
- Check for:
autocomplete_result_count,parts_fitment_coverage_ratio - All 20 new metrics should be present
- Check for:
-
No metric cardinality warnings
- Expected series count: ~38
- No unbounded label values
-
Integration tests pass (if applicable)
- Run:
bun test
- Run:
Development Deployment
-
Deploy to dev/staging environment
- Build:
bun run build - Deploy Docker image with metrics registry
- Build:
-
Service starts without errors
- Check logs for metric initialization
- No startup warnings about metrics
-
Prometheus scraping configured
- Add scrape config for new service instance
- Target:
http://dev-search-service:3001/metrics - Interval: 30 seconds
-
Verify Prometheus has targets
- URL:
http://prometheus:9090/targets - Status: UP for search service
- URL:
-
Test PromQL queries
- Query:
rate(autocomplete_queries_total[5m]) - Should return data within 2 minutes
- Query:
Monitoring Setup
-
Grafana datasource configured
- Type: Prometheus
- URL: Prometheus instance
- Test connection: successful
-
Import dashboard
- File:
scripts/grafana-dashboard-autocomplete-metrics.json - Location: Dashboards → Import
- Panels load without errors
- File:
-
Dashboard displays metrics
- All 10 panels show data
- Time range: Last 6 hours
-
Create custom panels if needed
- Example: Zero result rate trend
- Verify queries execute successfully
Alerting Setup
-
AlertManager configured
- Connection to Prometheus: working
- Notification channels configured
-
Alert rules added
- Copy from
METRICS_INTEGRATION.md - Rules file:
/etc/prometheus/rules/search-alerts.yml
- Copy from
-
Alert rules validated
- Command:
amtool check-config alertmanager.yml - Reload:
curl -X POST localhost:9093/-/reload
- Command:
-
Test alerts
- Trigger: Force a metric to alert threshold
- Verify: Notification received
- Channels: Slack, PagerDuty, email
Fitment Coverage Job Setup
-
Script tested locally
- Command:
bun scripts/calculate-fitment-coverage.ts --dry-run - Output: Displays fitment statistics
- Command:
-
Elasticsearch connection verified
- Script connects successfully
- Queries execute without errors
-
Dry-run shows accurate numbers
- Compare manual ES queries with script output
- Verify calculations are correct
-
Schedule cron job (Linux/macOS)
- Add to crontab:
0 * * * * cd /path && bun scripts/calculate-fitment-coverage.ts - Verify: Job runs hourly
- Add to crontab:
-
Or create Cloud Run Job
- Image: Search service with script
- Schedule: Cloud Scheduler (hourly)
- Test execution: successful
Performance Validation
-
Metric recording overhead
- Baseline latency: measure before metrics
- After metrics: should not increase by > 5%
- Expected overhead: < 0.1ms per request
-
Monitor resource usage
- Prometheus memory: should remain stable
- CPU: no significant increase
- Disk: ensure retention policy set
-
Cardinality monitoring
- Command: Check Prometheus series count
- Expected: ~38 series
- Alert if exceeds 1000
Quality Assurance
-
Metric labels are consistent
- Review all metrics
- No typos in label names
- Fixed enum values only
-
PromQL queries in docs work
- Test each example query
- Verify expected results
- Update docs if needed
-
Dashboard is user-friendly
- Panel names are clear
- Tooltips are helpful
- Colors are intuitive
-
Documentation is complete
- All files reviewed
- No broken links
- Examples are accurate
Production Deployment
-
Code reviewed and approved
- All changes reviewed
- No breaking changes
- Documentation reviewed
-
Production Prometheus configured
- Scrape targets updated
- Retention policy set (e.g., 30 days)
- Storage configured
-
Production alerts configured
- All rules deployed
- Notification channels active
- Test alert confirmed
-
Production dashboard imported
- Grafana updated
- Dashboard accessible
- Panels displaying data
-
Fitment coverage job scheduled
- Cron job active
- Or Cloud Run Job deployed
- Logs checked for successful execution
-
Monitoring alerts active
- Team notified of deployment
- Escalation paths configured
- On-call rotation updated
Post-Deployment Monitoring (24 hours)
-
Verify all metrics collecting data
- Spot-check 5 random metrics
- All have recent timestamps
-
Review metric distributions
- Zero result rate: reasonable?
- Latency: within expected range?
- Fitment coverage: expected percentage?
-
Check alert health
- No false positives?
- Alert messages clear?
- Routing working correctly?
-
Monitor resource consumption
- Prometheus memory: stable?
- CPU utilization: normal?
- Disk growth: reasonable?
-
Review logs
- Any errors in metric recording?
- Script execution logs clean?
- No cardinality warnings?
One Week Review
-
Analyze metric trends
- Create summary report
- Identify patterns
- Note any anomalies
-
Evaluate alert effectiveness
- Which alerts have fired?
- Were they actionable?
- Any false positives?
-
Gather team feedback
- Dashboard usability
- Metric relevance
- Documentation clarity
-
Plan optimizations
- Which metrics are most valuable?
- Any metrics to remove?
- New metrics needed?
Monthly Maintenance
-
Review alert thresholds
- Based on actual data
- Adjust if needed
- Document changes
-
Update documentation
- Add new discoveries
- Clarify confusing sections
- Update examples
-
Analyze cost
- Prometheus storage usage
- Consider retention policy
- Optimize if needed
-
Plan improvements
- Query optimization
- Dashboard enhancements
- New metric proposals
Rollback Plan
If issues occur:
-
Metric collection errors
- Disable metric recording in code
- Restart service
- Metrics endpoint still available
-
High cardinality issues
- Reduce sampling from 20% to 10%
- Remove problematic metric
- Restart Prometheus
-
Alerting false positives
- Disable specific alert rule
- Adjust threshold
- Re-enable after fix
-
Dashboard issues
- Delete dashboard
- Recreate or restore backup
- Update JSON if needed
-
Complete rollback
- Revert code changes to services/search
- Remove metrics from Prometheus scrape config
- Delete Grafana dashboard
- Disable alert rules
Success Criteria
- All metrics defined and exported
- Metrics integrated into autocomplete handler
- Documentation complete and clear
- Grafana dashboard ready
- Fitment calculator script working
- Code compiles without errors
- Low cardinality design validated
- Prometheus scraping confirmed (dev)
- Alerts functioning (dev)
- Dashboard displaying data (dev)
- Fitment job running (dev)
- Performance validated (dev)
- Team trained on metrics
- Deployed to production
- Monitoring validated (24 hours)
- Weekly review completed
Sign-Off
- Developer: Code changes complete
- QA: Testing verified
- Operations: Monitoring configured
- Team Lead: Approved for production
Contact Information
Questions about metrics:
- See
METRICS.mdfor complete reference - See
METRICS_INTEGRATION.mdfor setup - See
METRICS_QUICK_REFERENCE.mdfor common queries
On-call support:
- Check Grafana dashboard for current status
- Review AlertManager for active alerts
- Query Prometheus for detailed data
Last Updated: 2025-11-12 Status: Ready for Deployment