CROP
ProjectsParts Services

Metrics Implementation Deployment Checklist

Pre-Deployment (Development)

Metrics Implementation Deployment Checklist

Pre-Deployment (Development)

  • Code changes compiled without errors

    • Run: bun run type-check
    • Check for no TypeScript errors in metrics files
  • Prometheus endpoint responds

    • URL: http://localhost:3001/metrics
    • Expected: Text format with metric definitions
  • New metrics appear in output

    • Check for: autocomplete_result_count, parts_fitment_coverage_ratio
    • All 20 new metrics should be present
  • No metric cardinality warnings

    • Expected series count: ~38
    • No unbounded label values
  • Integration tests pass (if applicable)

    • Run: bun test

Development Deployment

  • Deploy to dev/staging environment

    • Build: bun run build
    • Deploy Docker image with metrics registry
  • Service starts without errors

    • Check logs for metric initialization
    • No startup warnings about metrics
  • Prometheus scraping configured

    • Add scrape config for new service instance
    • Target: http://dev-search-service:3001/metrics
    • Interval: 30 seconds
  • Verify Prometheus has targets

    • URL: http://prometheus:9090/targets
    • Status: UP for search service
  • Test PromQL queries

    • Query: rate(autocomplete_queries_total[5m])
    • Should return data within 2 minutes

Monitoring Setup

  • Grafana datasource configured

    • Type: Prometheus
    • URL: Prometheus instance
    • Test connection: successful
  • Import dashboard

    • File: scripts/grafana-dashboard-autocomplete-metrics.json
    • Location: Dashboards → Import
    • Panels load without errors
  • Dashboard displays metrics

    • All 10 panels show data
    • Time range: Last 6 hours
  • Create custom panels if needed

    • Example: Zero result rate trend
    • Verify queries execute successfully

Alerting Setup

  • AlertManager configured

    • Connection to Prometheus: working
    • Notification channels configured
  • Alert rules added

    • Copy from METRICS_INTEGRATION.md
    • Rules file: /etc/prometheus/rules/search-alerts.yml
  • Alert rules validated

    • Command: amtool check-config alertmanager.yml
    • Reload: curl -X POST localhost:9093/-/reload
  • Test alerts

    • Trigger: Force a metric to alert threshold
    • Verify: Notification received
    • Channels: Slack, PagerDuty, email

Fitment Coverage Job Setup

  • Script tested locally

    • Command: bun scripts/calculate-fitment-coverage.ts --dry-run
    • Output: Displays fitment statistics
  • Elasticsearch connection verified

    • Script connects successfully
    • Queries execute without errors
  • Dry-run shows accurate numbers

    • Compare manual ES queries with script output
    • Verify calculations are correct
  • Schedule cron job (Linux/macOS)

    • Add to crontab: 0 * * * * cd /path && bun scripts/calculate-fitment-coverage.ts
    • Verify: Job runs hourly
  • Or create Cloud Run Job

    • Image: Search service with script
    • Schedule: Cloud Scheduler (hourly)
    • Test execution: successful

Performance Validation

  • Metric recording overhead

    • Baseline latency: measure before metrics
    • After metrics: should not increase by > 5%
    • Expected overhead: < 0.1ms per request
  • Monitor resource usage

    • Prometheus memory: should remain stable
    • CPU: no significant increase
    • Disk: ensure retention policy set
  • Cardinality monitoring

    • Command: Check Prometheus series count
    • Expected: ~38 series
    • Alert if exceeds 1000

Quality Assurance

  • Metric labels are consistent

    • Review all metrics
    • No typos in label names
    • Fixed enum values only
  • PromQL queries in docs work

    • Test each example query
    • Verify expected results
    • Update docs if needed
  • Dashboard is user-friendly

    • Panel names are clear
    • Tooltips are helpful
    • Colors are intuitive
  • Documentation is complete

    • All files reviewed
    • No broken links
    • Examples are accurate

Production Deployment

  • Code reviewed and approved

    • All changes reviewed
    • No breaking changes
    • Documentation reviewed
  • Production Prometheus configured

    • Scrape targets updated
    • Retention policy set (e.g., 30 days)
    • Storage configured
  • Production alerts configured

    • All rules deployed
    • Notification channels active
    • Test alert confirmed
  • Production dashboard imported

    • Grafana updated
    • Dashboard accessible
    • Panels displaying data
  • Fitment coverage job scheduled

    • Cron job active
    • Or Cloud Run Job deployed
    • Logs checked for successful execution
  • Monitoring alerts active

    • Team notified of deployment
    • Escalation paths configured
    • On-call rotation updated

Post-Deployment Monitoring (24 hours)

  • Verify all metrics collecting data

    • Spot-check 5 random metrics
    • All have recent timestamps
  • Review metric distributions

    • Zero result rate: reasonable?
    • Latency: within expected range?
    • Fitment coverage: expected percentage?
  • Check alert health

    • No false positives?
    • Alert messages clear?
    • Routing working correctly?
  • Monitor resource consumption

    • Prometheus memory: stable?
    • CPU utilization: normal?
    • Disk growth: reasonable?
  • Review logs

    • Any errors in metric recording?
    • Script execution logs clean?
    • No cardinality warnings?

One Week Review

  • Analyze metric trends

    • Create summary report
    • Identify patterns
    • Note any anomalies
  • Evaluate alert effectiveness

    • Which alerts have fired?
    • Were they actionable?
    • Any false positives?
  • Gather team feedback

    • Dashboard usability
    • Metric relevance
    • Documentation clarity
  • Plan optimizations

    • Which metrics are most valuable?
    • Any metrics to remove?
    • New metrics needed?

Monthly Maintenance

  • Review alert thresholds

    • Based on actual data
    • Adjust if needed
    • Document changes
  • Update documentation

    • Add new discoveries
    • Clarify confusing sections
    • Update examples
  • Analyze cost

    • Prometheus storage usage
    • Consider retention policy
    • Optimize if needed
  • Plan improvements

    • Query optimization
    • Dashboard enhancements
    • New metric proposals

Rollback Plan

If issues occur:

  1. Metric collection errors

    • Disable metric recording in code
    • Restart service
    • Metrics endpoint still available
  2. High cardinality issues

    • Reduce sampling from 20% to 10%
    • Remove problematic metric
    • Restart Prometheus
  3. Alerting false positives

    • Disable specific alert rule
    • Adjust threshold
    • Re-enable after fix
  4. Dashboard issues

    • Delete dashboard
    • Recreate or restore backup
    • Update JSON if needed
  5. Complete rollback

    • Revert code changes to services/search
    • Remove metrics from Prometheus scrape config
    • Delete Grafana dashboard
    • Disable alert rules

Success Criteria

  • All metrics defined and exported
  • Metrics integrated into autocomplete handler
  • Documentation complete and clear
  • Grafana dashboard ready
  • Fitment calculator script working
  • Code compiles without errors
  • Low cardinality design validated
  • Prometheus scraping confirmed (dev)
  • Alerts functioning (dev)
  • Dashboard displaying data (dev)
  • Fitment job running (dev)
  • Performance validated (dev)
  • Team trained on metrics
  • Deployed to production
  • Monitoring validated (24 hours)
  • Weekly review completed

Sign-Off

  • Developer: Code changes complete
  • QA: Testing verified
  • Operations: Monitoring configured
  • Team Lead: Approved for production

Contact Information

Questions about metrics:

  • See METRICS.md for complete reference
  • See METRICS_INTEGRATION.md for setup
  • See METRICS_QUICK_REFERENCE.md for common queries

On-call support:

  • Check Grafana dashboard for current status
  • Review AlertManager for active alerts
  • Query Prometheus for detailed data

Last Updated: 2025-11-12 Status: Ready for Deployment

On this page