Monitoring & Metrics

Prometheus metrics, alerting, and monitoring for the parts-services Search service, covering autocomplete quality, fitment coverage, data quality, and system health.

The Search service exposes Prometheus metrics at GET /metrics for tracking autocomplete quality, equipment fitment coverage, data quality, and system health. All metrics follow Prometheus naming conventions and use low-cardinality labels (~38 total series).

Quick Reference

Accessing Metrics

# View all metrics
curl http://search-service:3001/metrics

# Filter specific metric
curl http://search-service:3001/metrics | grep autocomplete_result_count

# Prometheus UI
http://prometheus:9090/targets

Key Metrics at a Glance

Autocomplete Quality (P1)

| Metric | Type | Target | Alert Threshold |
| --- | --- | --- | --- |
| autocomplete_zero_results_total | Counter | < 5% rate | > 10% |
| autocomplete_empty_fallback_total | Counter | < 2% rate | > 5% |
| autocomplete_result_count | Histogram | P95 >= 5 | P95 < 3 |

Fitment Coverage (P2)

| Metric | Type | Target | Alert Threshold |
| --- | --- | --- | --- |
| parts_fitment_coverage_ratio | Gauge (0-1) | >= 0.8 | < 0.6 |
| equipment_fitment_coverage_ratio | Gauge (0-1) | >= 0.8 | < 0.5 |
| fitment_data_points_total / parts_with_fitment_total | Calculated | >= 2 | < 1.5 |

Data Quality (P3)

| Metric | Type | Target | Alert Threshold |
| --- | --- | --- | --- |
| catalog_ready_ratio | Gauge (0-1) | >= 0.8 | < 0.6 |
| missing_image_ratio | Gauge (0-1) | < 0.2 | > 0.3 |

Essential PromQL Queries

# Zero result rate (5m window)
(rate(autocomplete_zero_results_total[5m]) / rate(autocomplete_queries_total[5m])) * 100

# Fallback rate
(rate(autocomplete_empty_fallback_total[5m]) / rate(autocomplete_queries_total[5m])) * 100

# Result count P95
histogram_quantile(0.95, rate(autocomplete_result_count_bucket[5m]))

# Query latency P95 (ms)
histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m])) * 1000

# Fitment coverage trend
parts_fitment_coverage_ratio

# Catalog ready percentage
catalog_ready_ratio * 100
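
For intuition, histogram_quantile() estimates a quantile by interpolating inside cumulative buckets. A rough TypeScript sketch of that calculation (not the actual Prometheus implementation; the bucket counts below are illustrative):

```typescript
// Sketch of quantile estimation over cumulative histogram buckets,
// mirroring what histogram_quantile() does. Illustrative only.
interface Bucket { le: number; count: number } // cumulative count of observations <= le

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Linear interpolation inside the bucket that contains the rank.
      const fraction = (rank - prevCount) / (b.count - prevCount || 1);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Example: the result-count buckets [0, 1, 3, 5, 10, 20, 50] with made-up cumulative counts.
const buckets: Bucket[] = [
  { le: 0, count: 5 }, { le: 1, count: 12 }, { le: 3, count: 30 },
  { le: 5, count: 60 }, { le: 10, count: 90 }, { le: 20, count: 98 },
  { le: 50, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // → 16.25 (P95 estimate)
```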

Metric Types

All metrics use three Prometheus types:

  • Counter -- Monotonically increasing (total requests, errors). Use rate() to query.
  • Gauge -- Point-in-time value (coverage ratios, document counts). Query directly.
  • Histogram -- Value distributions with buckets (latency, result counts). Use histogram_quantile().
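
A minimal in-memory sketch of the three semantics (illustrative only, not the prom-client API the service actually uses):

```typescript
// Counter: monotonically increasing; negative increments are rejected.
class Counter {
  private v = 0;
  inc(by = 1) { if (by < 0) throw new Error("counters only increase"); this.v += by; }
  get value() { return this.v; }
}

// Gauge: point-in-time value that can move in either direction.
class Gauge {
  private v = 0;
  set(v: number) { this.v = v; }
  get value() { return this.v; }
}

// Histogram: cumulative buckets; each observation increments every bucket
// whose upper bound (le) it fits under.
class Histogram {
  counts: number[];
  constructor(private buckets: number[]) { this.counts = buckets.map(() => 0); }
  observe(v: number) { this.buckets.forEach((le, i) => { if (v <= le) this.counts[i]++; }); }
}

const requests = new Counter();
requests.inc();                                      // query with rate() in PromQL
const coverage = new Gauge();
coverage.set(0.82);                                  // query directly
const resultCount = new Histogram([0, 1, 3, 5, 10, 20, 50]);
resultCount.observe(4);                              // lands in buckets le=5 and above
```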

All Metrics

Autocomplete Quality

autocomplete_result_count (Histogram) -- Number of suggestions returned per query.

  • Labels: query_type (short, pn_like, textual, equipment), mode (auto, parts, equipment)
  • Buckets: [0, 1, 3, 5, 10, 20, 50]

autocomplete_zero_results_total (Counter) -- Queries returning zero results.

  • Labels: query_length (short, medium, long), query_type

autocomplete_empty_fallback_total (Counter) -- Enhanced autocomplete falling back to legacy.

  • Labels: intent (short, pn_like, textual, equipment)

autocomplete_suggestion_types_total (Counter) -- Distribution of suggestion types returned.

  • Labels: type (sku, category, manufacturer, equipment, description, brand)
  • Sampled at 20% to reduce cardinality.

autocomplete_ranking_score (Histogram) -- Relevance scores of returned suggestions.

  • Labels: suggestion_type
  • Buckets: [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0]
  • Sampled at 20%.

Autocomplete Performance

autocomplete_elasticsearch_latency_ms (Histogram) -- ES query latency per section.

  • Labels: section_type (sku, equipment, category, name, brand)
  • Buckets: [10, 25, 50, 100, 200, 500, 1000, 2000]

autocomplete_parallel_queries_count (Histogram) -- Parallel ES queries per request.

  • Buckets: [1, 2, 3, 4, 5, 10]

autocomplete_cache_hits_total (Counter) -- Cache hits by type.

  • Labels: cache_type (query_result, section_data, suggestion_dedup)

Fitment Coverage

parts_fitment_coverage_ratio (Gauge, 0-1) -- Ratio of parts with equipment fitment data.

  • Labels: manufacturer, collection (nh_unified, mchale, hotsy)
  • 0.7+ = good coverage.

parts_with_fitment_total (Gauge) -- Absolute count of parts with fitment.

  • Labels: manufacturer, collection

fitment_data_points_total (Gauge) -- Total individual fitment entries.

  • Labels: manufacturer, collection
  • Richness indicator: fitment_data_points_total / parts_with_fitment_total = avg fitments per part.

equipment_fitment_coverage_ratio (Gauge, 0-1) -- Fitment coverage from query perspective.
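
The coverage ratio and richness indicator both derive from raw counts. A hedged sketch of the arithmetic (field names and numbers are illustrative, not actual service values):

```typescript
// Derive the coverage gauge and the richness indicator from raw counts.
interface FitmentStats {
  totalParts: number;        // all parts in the collection
  partsWithFitment: number;  // parts that have at least one fitment entry
  fitmentDataPoints: number; // total individual fitment entries
}

function coverageRatio(s: FitmentStats): number {
  return s.totalParts === 0 ? 0 : s.partsWithFitment / s.totalParts;
}

function avgFitmentsPerPart(s: FitmentStats): number {
  return s.partsWithFitment === 0 ? 0 : s.fitmentDataPoints / s.partsWithFitment;
}

const stats = { totalParts: 1000, partsWithFitment: 820, fitmentDataPoints: 2050 };
console.log(coverageRatio(stats));      // 0.82 → above the 0.8 target
console.log(avgFitmentsPerPart(stats)); // 2.5  → above the >= 2 richness target
```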

Equipment Fitment Queries

equipment_fitment_queries_total (Counter) -- Equipment fitment query volume.

  • Labels: mode (browse, search, related)

equipment_fitment_query_latency_ms (Histogram) -- Equipment query latency.

  • Buckets: [10, 50, 100, 200, 500, 1000]

Index Health

index_document_count (Gauge) -- Documents in search index.

  • Labels: index_name (parts_current)

index_size_bytes (Gauge) -- Index size in bytes.

  • Labels: index_name

Data Quality

quality_score_distribution (Histogram) -- Part quality scores (0-100).

  • Labels: manufacturer
  • Buckets: [0, 20, 40, 60, 80, 100]
  • Ranges: 0-20 Poor, 21-40 Fair, 41-60 Good, 61-80 Very Good, 81-100 Excellent.
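
When reading this histogram, the bands above can be mapped from a raw score with a simple lookup, sketched here (the band names follow the ranges listed above):

```typescript
// Map a 0-100 quality score to its band, per the ranges for
// quality_score_distribution. Illustrative helper, not service code.
function qualityBand(score: number): string {
  if (score <= 20) return "Poor";
  if (score <= 40) return "Fair";
  if (score <= 60) return "Good";
  if (score <= 80) return "Very Good";
  return "Excellent";
}

console.log(qualityBand(85)); // Excellent
console.log(qualityBand(35)); // Fair
```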

catalog_ready_ratio (Gauge, 0-1) -- Parts with quality_score >= 80.

  • Labels: manufacturer

missing_image_ratio (Gauge, 0-1) -- Parts missing primary images.

  • Labels: manufacturer

Alert Rules

Critical (P1) -- 5 minute evaluation

groups:
  - name: search-service
    interval: 30s
    rules:
      - alert: AutocompleteZeroResultsHigh
        expr: >
          (sum(rate(autocomplete_zero_results_total[5m])) /
           sum(rate(autocomplete_queries_total[5m]))) > 0.1
        for: 5m
        annotations:
          summary: "Zero result rate > 10%"

      - alert: AutocompleteFallbackHigh
        expr: >
          (sum(rate(autocomplete_empty_fallback_total[5m])) /
           sum(rate(autocomplete_queries_total[5m]))) > 0.05
        for: 5m
        annotations:
          summary: "Fallback rate > 5%"

      - alert: AutocompleteLatencyHigh
        expr: >
          histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        annotations:
          summary: "P95 latency > 500ms"

Warning (P2) -- 10 minute evaluation

      - alert: FitmentCoverageLow
        expr: parts_fitment_coverage_ratio{manufacturer="nh"} < 0.6
        for: 10m

      - alert: CatalogQualityLow
        expr: catalog_ready_ratio{manufacturer="nh"} < 0.6
        for: 10m

      - alert: MissingImageRatioHigh
        expr: missing_image_ratio{manufacturer="nh"} > 0.3
        for: 10m

Grafana Dashboard

Import: scripts/grafana-dashboard-autocomplete-metrics.json

Panels:

  1. Query Volume -- QPS breakdown (success, error, aborted)
  2. Zero Result Rate -- Percentage gauge
  3. Result Count Distribution -- P50, P95, P99
  4. Query Latency -- Performance by percentile
  5. Suggestion Type Distribution -- Breakdown by type
  6. Fallback Rate -- Enhanced mode fallback percentage
  7. Fitment Coverage (NH) -- Manufacturer coverage gauge
  8. Equipment Fitment Coverage -- Query-level gauge
  9. Catalog Ready -- Data quality gauge
  10. Missing Images Ratio -- Image coverage gauge

Adding New Metrics

1. Define in registry

// src/metrics/registry.ts
import { Counter, register } from 'prom-client';

export const myNewMetric = new Counter({
  name: 'my_new_metric_total',
  help: 'Description of what this measures',
  labelNames: ['label1', 'label2'],
  registers: [register],
});

2. Export from routes

// src/routes/metrics.ts
import { myNewMetric } from '../metrics/registry';
export { myNewMetric };

3. Record in handler

myNewMetric.inc({ label1: 'value', label2: 'value' });

4. Verify

curl http://localhost:3001/metrics | grep my_new_metric

Cardinality Guidelines

  • Use fixed label values (enum-like) only -- never part_id, query_hash, etc.
  • Group high-cardinality values (e.g., query_length: short/medium/long).
  • Sample high-frequency metrics at 20%:
if (Math.random() < 0.2) {
  autocompleteSuggestionTypes.inc({ type: suggestion.type });
}
  • Target: ~38 total series. Alert if exceeding 1000.
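
The grouping rule can be sketched as a small helper that maps a raw query to a fixed label value instead of labeling with the query itself (the length cutoffs here are illustrative assumptions, not the service's actual thresholds):

```typescript
// Bucket a raw query into one of three fixed query_length label values,
// keeping label cardinality constant. Cutoffs are illustrative.
type QueryLength = "short" | "medium" | "long";

function queryLengthBucket(query: string): QueryLength {
  if (query.length <= 3) return "short";
  if (query.length <= 10) return "medium";
  return "long";
}

console.log(queryLengthBucket("oil"));                      // short
console.log(queryLengthBucket("oil filter"));               // medium (10 chars)
console.log(queryLengthBucket("hydraulic pump seal kit"));  // long
```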

Fitment Coverage Calculator

Script: scripts/calculate-fitment-coverage.ts

# Dry run
bun scripts/calculate-fitment-coverage.ts --dry-run

# Specific manufacturer
bun scripts/calculate-fitment-coverage.ts --manufacturer=nh

# Full update
bun scripts/calculate-fitment-coverage.ts

Schedule hourly via cron or Cloud Run Job:

# Cron
0 * * * * cd /path/to/microservices && bun scripts/calculate-fitment-coverage.ts

# Cloud Run Job
gcloud run jobs create calculate-fitment-coverage \
  --image gcr.io/PROJECT/search-service:TAG \
  --command bun \
  --args scripts/calculate-fitment-coverage.ts \
  --set-env-vars ELASTICSEARCH_URL=https://elasticsearch:9200

Deployment Checklist

Pre-Deploy

  • bun run type-check passes
  • curl http://localhost:3001/metrics returns all 20 metrics
  • No cardinality warnings (expected ~38 series)
  • bun test passes

Monitoring Setup

  • Prometheus scrape target added (30s interval)
  • Grafana dashboard imported from scripts/grafana-dashboard-autocomplete-metrics.json
  • Alert rules deployed to AlertManager
  • Fitment coverage job scheduled (hourly)

Post-Deploy (24h)

  • All metrics collecting data
  • Zero result rate within expected range
  • Latency within expected range
  • No false-positive alerts
  • Resource consumption stable (Prometheus memory, CPU, disk)

Rollback Plan

  1. Metric errors -- Disable recording in code, restart service
  2. High cardinality -- Reduce sampling from 20% to 10%, remove problematic metric
  3. Alert false positives -- Disable specific rule, adjust threshold
  4. Complete rollback -- Revert code, remove scrape config, delete dashboard, disable alerts

Performance

Metric recording overhead is negligible:

| Operation | Cost |
| --- | --- |
| Counter increment | ~0.01ms |
| Histogram observation | ~0.05ms |
| Gauge update | ~0.01ms |
| 20% sampling check | ~0.001ms |
| Total per request | < 0.1ms |

Memory: ~15 KB for all custom metrics (500 bytes per histogram, 200 bytes per counter/gauge).


Troubleshooting

Metrics not in Prometheus:

  1. Verify service: curl http://localhost:3001/metrics
  2. Check scrape config targets: http://prometheus:9090/targets
  3. Confirm scrape interval and metrics_path in Prometheus config
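
A minimal scrape config sketch for reference when checking step 3 (the job name and target host are assumptions; adjust to your environment):

```yaml
scrape_configs:
  - job_name: search-service
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["search-service:3001"]
```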

High cardinality warnings: All metrics use fixed enum labels. If series count exceeds 1000, reduce sampling or remove unnecessary labels.

Latency spikes: Use autocomplete_elasticsearch_latency_ms by section_type to isolate which ES section is slow. Check autocomplete_parallel_queries_count for query complexity.


File Structure

microservices/services/search/
├── src/
│   ├── metrics/
│   │   └── registry.ts              # 20 metric definitions
│   └── routes/
│       ├── metrics.ts               # Metric exports
│       └── autocomplete.ts          # Recording logic
└── scripts/
    ├── calculate-fitment-coverage.ts # Hourly fitment job
    └── grafana-dashboard-autocomplete-metrics.json
