Metrics Integration Guide

This guide explains how to integrate metrics into the Search service and use them for monitoring autocomplete quality and fitment coverage.

Quick Start

1. Metrics are Already Defined

All metrics are pre-defined in src/metrics/registry.ts:

import {
  autocompleteResultCount,
  autocompleteZeroResults,
  fitmentCoverageRatio,
  // ... more metrics
} from '../metrics/registry';

2. Recording Metrics in Autocomplete Handler

The autocomplete route (src/routes/autocomplete.ts) already records the following (a condensed sketch appears after this list):

  • Query volume: autocompleteQueriesTotal.inc({ status })
  • Query latency: autocompleteQueryDuration.observe({ intent }, durationSec)
  • Query intent: autocompleteQueryIntent.inc({ intent })
  • Result count: autocompleteResultCount.observe({ query_type, mode }, count)
  • Zero results: autocompleteZeroResults.inc({ query_length, query_type })
  • Fallback events: autocompleteEmptyFallback.inc({ intent })
  • Suggestion types: autocompleteSuggestionTypes.inc({ type })
  • Ranking scores: autocompleteRankingScore.observe({ suggestion_type }, score)
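
A condensed sketch of that recording pattern is shown below. It is an illustrative TypeScript helper, not the actual code in src/routes/autocomplete.ts; the argument names and label values are assumptions.

import {
  autocompleteQueriesTotal,
  autocompleteQueryDuration,
  autocompleteResultCount,
  autocompleteZeroResults,
} from '../metrics/registry';

// Illustrative only: argument names and label values are assumptions
export function recordAutocompleteMetrics(opts: {
  intent: string;
  queryType: string;
  mode: string;
  queryLengthBucket: string; // e.g. 'short' | 'medium' | 'long'
  resultCount: number;
  durationSec: number;
  status: 'success' | 'error';
}) {
  // Headline counter and latency histogram for every request
  autocompleteQueriesTotal.inc({ status: opts.status });
  autocompleteQueryDuration.observe({ intent: opts.intent }, opts.durationSec);

  // Result-count histogram, plus the zero-result counter when nothing came back
  autocompleteResultCount.observe(
    { query_type: opts.queryType, mode: opts.mode },
    opts.resultCount,
  );
  if (opts.resultCount === 0) {
    autocompleteZeroResults.inc({
      query_length: opts.queryLengthBucket,
      query_type: opts.queryType,
    });
  }
}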

3. Accessing Metrics

All metrics are automatically exposed at:

GET /metrics

Example response:

# HELP autocomplete_result_count Number of suggestions returned per query
# TYPE autocomplete_result_count histogram
autocomplete_result_count_bucket{query_type="textual",mode="auto",le="3"} 1234
autocomplete_result_count_bucket{query_type="textual",mode="auto",le="5"} 2145
...
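
If you need to expose the same registry from another service, the endpoint is typically a thin wrapper around prom-client's register. The sketch below assumes Express and the default registry; the Search service's own wiring may differ.

import express from 'express';
import { register } from 'prom-client';

// Minimal sketch of a /metrics endpoint (Express is an assumption here)
const app = express();

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

app.listen(3001);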

Adding New Metrics

To add a new metric:

1. Define in registry.ts

import { Gauge, Counter, Histogram } from 'prom-client';

// Add to src/metrics/registry.ts ('register' below is the registry instance already defined in that file)
export const myNewMetric = new Counter({
  name: 'my_new_metric_total',
  help: 'Description of what this metric measures',
  labelNames: ['label1', 'label2'],
  registers: [register],
});

2. Export from routes/metrics.ts

import {
  myNewMetric,
  // ... other imports
} from '../metrics/registry';

// In export statement
export {
  myNewMetric,
  // ... other exports
};

3. Record Metric in Handler

// In your route handler
myNewMetric.inc({ label1: 'value', label2: 'value' });

4. Test with Prometheus

curl http://localhost:3001/metrics | grep my_new_metric
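
After the handler has incremented the metric at least once, the grep should return Prometheus exposition lines roughly like these (values will differ):

# HELP my_new_metric_total Description of what this metric measures
# TYPE my_new_metric_total counter
my_new_metric_total{label1="value",label2="value"} 1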

Fitment Coverage Calculation

Manual Calculation

Run the fitment coverage script manually:

# Calculate and display (dry-run mode)
bun scripts/calculate-fitment-coverage.ts --dry-run

# Calculate for specific manufacturer
bun scripts/calculate-fitment-coverage.ts --manufacturer=nh --dry-run
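
Conceptually, the script reduces to counting parts documents that carry fitment data and dividing by the total. The sketch below shows that calculation with the Elasticsearch v8 TypeScript client; the fitment and manufacturer field names are illustrative assumptions and may not match the script's actual query.

import { Client } from '@elastic/elasticsearch';

// Sketch only: assumes the v8 client; 'fitment' and 'manufacturer' are
// hypothetical field names used for illustration
const client = new Client({ node: process.env.ELASTICSEARCH_URL ?? 'http://localhost:9200' });
const index = process.env.SEARCH_INDEX_NAME ?? 'parts_current';

async function fitmentCoverage(manufacturer: string): Promise<number> {
  const filter = [{ term: { manufacturer } }];
  const total = await client.count({ index, query: { bool: { filter } } });
  const withFitment = await client.count({
    index,
    query: { bool: { filter, must: [{ exists: { field: 'fitment' } }] } },
  });
  return total.count === 0 ? 0 : withFitment.count / total.count;
}

console.log(`nh coverage: ${((await fitmentCoverage('nh')) * 100).toFixed(1)}%`);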

Scheduled Job (cron)

Add to your cron schedule:

# Update fitment metrics every hour
0 * * * * cd /path/to/microservices && bun scripts/calculate-fitment-coverage.ts

Or use Kubernetes CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: calculate-fitment-coverage
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: fitment-calculator
            image: search-service:latest
            command:
            - bun
            - scripts/calculate-fitment-coverage.ts
            env:
            - name: ELASTICSEARCH_URL
              value: "https://elasticsearch:9200"
            - name: SEARCH_INDEX_NAME
              value: "parts_current"
          restartPolicy: OnFailure

Cloud Run Job

Create and execute via Cloud Run:

# First time: create the job
gcloud run jobs create calculate-fitment-coverage \
  --image gcr.io/PROJECT/search-service:TAG \
  --command bun \
  --args scripts/calculate-fitment-coverage.ts \
  --set-env-vars ELASTICSEARCH_URL=https://elasticsearch:9200

# Create a scheduler trigger (the topic, or an HTTP target, must then be wired to execute the job)
gcloud scheduler jobs create pubsub calculate-fitment-coverage \
  --location us-central1 \
  --schedule "0 * * * *" \
  --topic calculate-fitment-coverage \
  --message-body "{}"

# Update the job's configuration later if needed (e.g. environment variables)
gcloud run jobs update calculate-fitment-coverage \
  --update-env-vars ELASTICSEARCH_URL=https://elasticsearch:9200

Grafana Integration

1. Import Dashboard

The dashboard JSON is provided at scripts/grafana-dashboard-autocomplete-metrics.json.

Import via UI:

  1. Go to Grafana → Dashboards → Import
  2. Upload grafana-dashboard-autocomplete-metrics.json
  3. Select Prometheus datasource
  4. Click Import

Import via API:

curl -X POST \
  -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @scripts/grafana-dashboard-autocomplete-metrics.json \
  http://grafana:3000/api/dashboards/db

2. Dashboard Panels

The dashboard includes:

  1. Query Volume - QPS breakdown (success, error, aborted)
  2. Zero Result Rate - Percentage of queries with no results
  3. Result Distribution - P50, P95, P99 of result counts
  4. Query Latency - Performance by percentile
  5. Suggestion Types - Distribution of returned types
  6. Fallback Rate - Enhanced mode fallback percentage
  7. Fitment Coverage - Parts with equipment data
  8. Equipment Coverage - Query coverage percentage
  9. Catalog Ready - Data quality percentage
  10. Missing Images - Image coverage metric

3. Create Custom Panels

Example: "Average Results by Query Type"

sum by (query_type) (rate(autocomplete_result_count_sum[5m]))
  / sum by (query_type) (rate(autocomplete_result_count_count[5m]))

Example: "Zero Result Rate Over Time"

sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m]))

Example: "Fitment Coverage Trend"

avg by (manufacturer) (parts_fitment_coverage_ratio)

Alerting Rules

Prometheus AlertManager Configuration

Create prometheus/alerts.yml:

groups:
  - name: search-service
    interval: 30s
    rules:
      # Autocomplete Quality Alerts
      - alert: AutocompleteZeroResultsHigh
        expr: |
          (sum(rate(autocomplete_zero_results_total[5m])) /
           sum(rate(autocomplete_queries_total[5m]))) > 0.1
        for: 5m
        annotations:
          summary: "High zero result rate (> 10%)"
          description: "{{ value | humanizePercentage }} of autocomplete queries return zero results"

      - alert: AutocompleteFallbackHigh
        expr: |
          (sum(rate(autocomplete_empty_fallback_total[5m])) /
           sum(rate(autocomplete_queries_total[5m]))) > 0.05
        for: 5m
        annotations:
          summary: "Enhanced autocomplete fallback rate high (> 5%)"
          description: "{{ value | humanizePercentage }} of enhanced queries fall back to legacy"

      - alert: AutocompleteLatencyHigh
        expr: |
          histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        annotations:
          summary: "High autocomplete latency (P95 > 500ms)"
          description: "P95 latency: {{ value | humanizeDuration }}"

      # Fitment Coverage Alerts
      - alert: FitmentCoverageLow
        expr: parts_fitment_coverage_ratio{manufacturer="nh"} < 0.6
        for: 10m
        annotations:
          summary: "Fitment coverage below threshold (< 60%)"
          description: "NH fitment coverage: {{ value | humanizePercentage }}"

      - alert: EquipmentCoverageLow
        expr: equipment_fitment_coverage_ratio < 0.5
        for: 10m
        annotations:
          summary: "Equipment fitment coverage below threshold (< 50%)"
          description: "Coverage: {{ value | humanizePercentage }}"

      # Data Quality Alerts
      - alert: CatalogQualityLow
        expr: catalog_ready_ratio{manufacturer="nh"} < 0.6
        for: 10m
        annotations:
          summary: "Catalog ready ratio below threshold (< 60%)"
          description: "NH catalog ready: {{ value | humanizePercentage }}"

      - alert: MissingImageRatioHigh
        expr: missing_image_ratio{manufacturer="nh"} > 0.3
        for: 10m
        annotations:
          summary: "Missing image ratio above threshold (> 30%)"
          description: "NH missing images: {{ value | humanizePercentage }}"

Alertmanager Routing

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: slack-search-team
  group_by: ['alertname', 'service']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match:
        service: search-service
      receiver: slack-search-team

receivers:
  - name: slack-search-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#search-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}: {{ .Annotations.description }}{{ end }}'

Querying Metrics Programmatically

Using curl

# Get current autocomplete query rate
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(autocomplete_queries_total[5m])'

# Get zero result percentage over last hour
curl -G http://prometheus:9090/api/v1/query_range \
  --data-urlencode 'query=(sum(rate(autocomplete_zero_results_total[1h])) / sum(rate(autocomplete_queries_total[1h])))' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'

Using Python

import requests

# Query Prometheus
prometheus_url = "http://prometheus:9090"

def query_metric(query):
    """Query the current value of a PromQL expression"""
    response = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": query}
    )
    return response.json()["data"]["result"]

# Get zero result rate
results = query_metric(
    '(sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m])))'
)
print(f"Zero result rate: {float(results[0]['value'][1]) * 100:.2f}%")

# Get fitment coverage
results = query_metric("parts_fitment_coverage_ratio")
for result in results:
    manufacturer = result["metric"]["manufacturer"]
    coverage = float(result["value"][1])
    print(f"{manufacturer}: {coverage * 100:.2f}%")

Metric Cardinality Management

To prevent Prometheus memory issues, maintain low cardinality:

Good Practices

  • Use fixed label values (enums)
  • Group high-cardinality values into fixed buckets (e.g., query_length: short/medium/long; see the helper sketch below)
  • Sample high-frequency metrics at 20%
  • Avoid labels with unbounded values
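
Example: Grouping Query Length

For the grouping point above, a small helper can map raw query lengths onto the fixed label values short/medium/long. The thresholds below are illustrative assumptions, not values taken from the service.

// Illustrative helper; the length thresholds are arbitrary examples
function bucketQueryLength(query: string): 'short' | 'medium' | 'long' {
  if (query.length <= 3) return 'short';
  if (query.length <= 10) return 'medium';
  return 'long';
}

// Usage when recording, e.g.:
// autocompleteZeroResults.inc({ query_length: bucketQueryLength(query), query_type: 'textual' });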

Example: Sampling

// Always record the headline counter
autocompleteQueriesTotal.inc({ status: 'success' });

// But sample the high-detail metrics at ~20%
if (Math.random() < 0.2) {
  autocompleteSuggestionTypes.inc({ type: suggestion.type });
  autocompleteRankingScore.observe({ suggestion_type: suggestion.type }, score);
}

Cardinality Check

# Check cardinality in Prometheus
curl -s http://prometheus:9090/api/v1/label/__name__/values | \
  jq '.data | map(select(startswith("autocomplete_"))) | .[]'

Troubleshooting

Metrics Not Appearing

  1. Check service is running:

    curl http://localhost:3001/metrics
  2. Verify Prometheus scrape config:

    - job_name: 'search-service'
      static_configs:
        - targets: ['localhost:3001']
      metrics_path: '/metrics'
  3. Check Prometheus targets:

    http://prometheus:9090/targets

High Cardinality Issues

# Find high-cardinality metrics
curl -s http://prometheus:9090/api/v1/series \
  --data-urlencode 'match[]=autocomplete_result_count' | \
  jq '.data | length'

If > 1000 series, reduce cardinality by:

  • Removing unnecessary labels
  • Lowering the sampling rate (e.g., from 20% to 10%)
  • Grouping label values

Performance Impact

Metric recording should be < 1ms per request. If slower:

  • Reduce histogram bucket count (see the sketch after this list)
  • Lower the sampling rate so fewer observations are recorded
  • Profile with --inspect flag
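
For the bucket-count point, prom-client accepts explicit buckets when a histogram is defined, trading resolution for cheaper recording and fewer time series. The metric name and boundaries below are illustrative, not values taken from registry.ts.

import { Histogram } from 'prom-client';

// Coarser buckets than the prom-client defaults (boundaries are examples);
// without a 'registers' option the metric goes to the default registry
export const autocompleteQueryDurationCoarse = new Histogram({
  name: 'autocomplete_query_duration_coarse_seconds',
  help: 'Autocomplete latency with a reduced bucket count',
  labelNames: ['intent'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1],
});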

Maintenance

Monthly Tasks

  1. Review Alert Thresholds

    • Check if alerts are firing appropriately
    • Adjust thresholds based on actual values
  2. Prune Old Metrics

    • Prometheus retention: default 15 days
    • Adjust if needed: --storage.tsdb.retention.time=30d
  3. Dashboard Updates

    • Verify all panels are querying correctly
    • Add new metrics as they're created

Quarterly Reviews

  1. Analyze metrics trends
  2. Identify optimization opportunities
  3. Plan for metric additions/removals
  4. Update alerting rules based on SLO changes
