# Metrics Integration Guide

This guide explains how to integrate metrics into the Search service and use them for monitoring autocomplete quality and fitment coverage.
## Quick Start

### 1. Metrics Are Already Defined

All metrics are pre-defined in `src/metrics/registry.ts`:
```typescript
import {
  autocompleteResultCount,
  autocompleteZeroResults,
  fitmentCoverageRatio,
  // ... more metrics
} from '../metrics/registry';
```

### 2. Recording Metrics in the Autocomplete Handler
The autocomplete route (`src/routes/autocomplete.ts`) already records:

- Query volume: `autocompleteQueriesTotal.inc({ status })`
- Query latency: `autocompleteQueryDuration.observe({ intent }, durationSec)`
- Query intent: `autocompleteQueryIntent.inc({ intent })`
- Result count: `autocompleteResultCount.observe({ query_type, mode }, count)`
- Zero results: `autocompleteZeroResults.inc({ query_length, query_type })`
- Fallback events: `autocompleteEmptyFallback.inc({ intent })`
- Suggestion types: `autocompleteSuggestionTypes.inc({ type })`
- Ranking scores: `autocompleteRankingScore.observe({ suggestion_type }, score)`

A sketch of how these calls fit together in a handler is shown below.
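The following is an illustrative sketch only: the metric objects come from `src/metrics/registry.ts`, but the handler shape, the `fetchSuggestions` stub, and the label values are assumptions, not the actual route implementation.

```typescript
import {
  autocompleteQueriesTotal,
  autocompleteQueryDuration,
  autocompleteResultCount,
  autocompleteZeroResults,
} from '../metrics/registry';

// Stand-in for the real suggestion lookup (assumption).
async function fetchSuggestions(query: string, mode: string): Promise<{ type: string }[]> {
  return [];
}

export async function handleAutocomplete(query: string, intent: string, mode: string) {
  const start = Date.now();
  try {
    const suggestions = await fetchSuggestions(query, mode);

    // Result count and zero-result tracking
    autocompleteResultCount.observe({ query_type: 'textual', mode }, suggestions.length);
    if (suggestions.length === 0) {
      autocompleteZeroResults.inc({
        query_length: query.length <= 3 ? 'short' : 'long', // bucketed, not the raw length
        query_type: 'textual',
      });
    }

    autocompleteQueriesTotal.inc({ status: 'success' });
    return suggestions;
  } catch (err) {
    autocompleteQueriesTotal.inc({ status: 'error' });
    throw err;
  } finally {
    // Latency in seconds, recorded for both success and error paths
    autocompleteQueryDuration.observe({ intent }, (Date.now() - start) / 1000);
  }
}
```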
### 3. Accessing Metrics

All metrics are automatically exposed at `GET /metrics`. Example response:
```
# HELP autocomplete_result_count Number of suggestions returned per query
# TYPE autocomplete_result_count histogram
autocomplete_result_count_bucket{query_type="textual",mode="auto",le="3"} 1234
autocomplete_result_count_bucket{query_type="textual",mode="auto",le="5"} 2145
...
```
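For reference, this is a minimal sketch of how a prom-client registry is typically exposed on such an endpoint, assuming an Express-style app and that `registry.ts` exports the `register` it attaches metrics to; the service's actual wiring may differ.

```typescript
import express from 'express';
import { register } from '../metrics/registry';

const app = express();

// Serve all registered metrics in the Prometheus exposition format.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics()); // metrics() returns a Promise in prom-client v13+
});
```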
## Adding New Metrics

To add a new metric:

### 1. Define It in `registry.ts`
```typescript
import { Gauge, Counter, Histogram } from 'prom-client';

// Add to src/metrics/registry.ts
export const myNewMetric = new Counter({
  name: 'my_new_metric_total',
  help: 'Description of what this metric measures',
  labelNames: ['label1', 'label2'],
  registers: [register],
});
```
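Histogram definitions follow the same pattern. The metric name and bucket list below are purely illustrative, not existing metrics; keeping the bucket list short limits the number of series created per label combination.

```typescript
// Illustrative only: name and buckets are examples, not part of the registry.
export const myNewDuration = new Histogram({
  name: 'my_new_duration_seconds',
  help: 'Duration of the new operation in seconds',
  labelNames: ['operation'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1],
  registers: [register],
});
```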
### 2. Export from `routes/metrics.ts`

```typescript
import {
  myNewMetric,
  // ... other imports
} from '../metrics/registry';

// In the export statement
export {
  myNewMetric,
  // ... other exports
};
```

### 3. Record Metric in Handler
```typescript
// In your route handler
myNewMetric.inc({ label1: 'value', label2: 'value' });
```

### 4. Test with Prometheus

```bash
curl http://localhost:3001/metrics | grep my_new_metric
```

## Fitment Coverage Calculation
### Manual Calculation

Run the fitment coverage script manually:

```bash
# Calculate and display (dry-run mode)
bun scripts/calculate-fitment-coverage.ts --dry-run

# Calculate for a specific manufacturer
bun scripts/calculate-fitment-coverage.ts --manufacturer=nh --dry-run
```

### Scheduled Job (cron)
Add to your cron schedule:

```bash
# Update fitment metrics every hour
0 * * * * cd /path/to/microservices && bun scripts/calculate-fitment-coverage.ts
```

Or use a Kubernetes CronJob:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: calculate-fitment-coverage
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: fitment-calculator
              image: search-service:latest
              command:
                - bun
                - scripts/calculate-fitment-coverage.ts
              env:
                - name: ELASTICSEARCH_URL
                  value: "https://elasticsearch:9200"
                - name: SEARCH_INDEX_NAME
                  value: "parts_current"
          restartPolicy: OnFailure
```

### Cloud Run Job
Create and execute via Cloud Run:

```bash
# First time: create the job
gcloud run jobs create calculate-fitment-coverage \
  --image gcr.io/PROJECT/search-service:TAG \
  --command bun \
  --args scripts/calculate-fitment-coverage.ts \
  --set-env-vars ELASTICSEARCH_URL=https://elasticsearch:9200

# Execute on a schedule
gcloud scheduler jobs create pubsub calculate-fitment-coverage \
  --location us-central1 \
  --schedule "0 * * * *" \
  --topic calculate-fitment-coverage \
  --message-body "{}"

# Update job configuration later as needed (e.g., environment variables)
gcloud run jobs update calculate-fitment-coverage \
  --update-env-vars ELASTICSEARCH_URL=https://elasticsearch:9200
```

## Grafana Integration
### 1. Import Dashboard

The dashboard JSON is provided at `scripts/grafana-dashboard-autocomplete-metrics.json`.

Import via UI:

- Go to Grafana → Dashboards → Import
- Upload `grafana-dashboard-autocomplete-metrics.json`
- Select the Prometheus datasource
- Click Import
Import via API:

```bash
curl -X POST \
  -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @scripts/grafana-dashboard-autocomplete-metrics.json \
  http://grafana:3000/api/dashboards/db
```

### 2. Dashboard Panels
The dashboard includes:
- Query Volume - QPS breakdown (success, error, aborted)
- Zero Result Rate - Percentage of queries with no results
- Result Distribution - P50, P95, P99 of result counts
- Query Latency - Performance by percentile
- Suggestion Types - Distribution of returned types
- Fallback Rate - Enhanced mode fallback percentage
- Fitment Coverage - Parts with equipment data
- Equipment Coverage - Query coverage percentage
- Catalog Ready - Data quality percentage
- Missing Images - Image coverage metric
### 3. Create Custom Panels

Example: "Average Results by Query Type"

```promql
sum by (query_type) (rate(autocomplete_result_count_sum[5m]))
  / sum by (query_type) (rate(autocomplete_result_count_count[5m]))
```

Example: "Zero Result Rate Over Time"

```promql
sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m]))
```

Example: "Fitment Coverage Trend" (one series per `manufacturer` label)

```promql
parts_fitment_coverage_ratio
```

## Alerting Rules
### Prometheus Alert Rules

Create `prometheus/alerts.yml`:

```yaml
groups:
  - name: search-service
    interval: 30s
    rules:
      # Autocomplete Quality Alerts
      - alert: AutocompleteZeroResultsHigh
        expr: |
          (sum(rate(autocomplete_zero_results_total[5m])) /
           sum(rate(autocomplete_queries_total[5m]))) > 0.1
        for: 5m
        annotations:
          summary: "High zero result rate (> 10%)"
          description: "{{ $value | humanizePercentage }} of autocomplete queries return zero results"

      - alert: AutocompleteFallbackHigh
        expr: |
          (sum(rate(autocomplete_empty_fallback_total[5m])) /
           sum(rate(autocomplete_queries_total[5m]))) > 0.05
        for: 5m
        annotations:
          summary: "Enhanced autocomplete fallback rate high (> 5%)"
          description: "{{ $value | humanizePercentage }} of enhanced queries fall back to legacy"

      - alert: AutocompleteLatencyHigh
        expr: |
          histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        annotations:
          summary: "High autocomplete latency (P95 > 500ms)"
          description: "P95 latency: {{ $value | humanizeDuration }}"

      # Fitment Coverage Alerts
      - alert: FitmentCoverageLow
        expr: parts_fitment_coverage_ratio{manufacturer="nh"} < 0.6
        for: 10m
        annotations:
          summary: "Fitment coverage below threshold (< 60%)"
          description: "NH fitment coverage: {{ $value | humanizePercentage }}"

      - alert: EquipmentCoverageLow
        expr: equipment_fitment_coverage_ratio < 0.5
        for: 10m
        annotations:
          summary: "Equipment fitment coverage below threshold (< 50%)"
          description: "Coverage: {{ $value | humanizePercentage }}"

      # Data Quality Alerts
      - alert: CatalogQualityLow
        expr: catalog_ready_ratio{manufacturer="nh"} < 0.6
        for: 10m
        annotations:
          summary: "Catalog ready ratio below threshold (< 60%)"
          description: "NH catalog ready: {{ $value | humanizePercentage }}"

      - alert: MissingImageRatioHigh
        expr: missing_image_ratio{manufacturer="nh"} > 0.3
        for: 10m
        annotations:
          summary: "Missing image ratio above threshold (> 30%)"
          description: "NH missing images: {{ $value | humanizePercentage }}"
```

### Alertmanager Routing
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: slack-search-team
  group_by: ['alertname', 'service']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match:
        service: search-service
      receiver: slack-search-team

receivers:
  - name: slack-search-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#search-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}: {{ .Annotations.description }}{{ end }}'
```

## Querying Metrics Programmatically
### Using curl

```bash
# Get current autocomplete query rate
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(autocomplete_queries_total[5m])'

# Get zero result percentage over the last hour
curl -G http://prometheus:9090/api/v1/query_range \
  --data-urlencode 'query=(sum(rate(autocomplete_zero_results_total[1h])) / sum(rate(autocomplete_queries_total[1h])))' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'
```

### Using Python
```python
import requests

# Query Prometheus
prometheus_url = "http://prometheus:9090"

def query_metric(query):
    """Query the current value of a metric."""
    response = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": query},
    )
    return response.json()["data"]["result"]

# Get zero result rate
results = query_metric(
    '(sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m])))'
)
print(f"Zero result rate: {float(results[0]['value'][1]) * 100:.2f}%")

# Get fitment coverage
results = query_metric("parts_fitment_coverage_ratio")
for result in results:
    manufacturer = result["metric"]["manufacturer"]
    coverage = float(result["value"][1])
    print(f"{manufacturer}: {coverage * 100:.2f}%")
```

## Metric Cardinality Management
To prevent Prometheus memory issues, maintain low cardinality:

### Good Practices

- Use fixed label values (enums)
- Group high-cardinality values (e.g., `query_length`: short/medium/long; see the sketch below)
- Sample high-frequency metrics (e.g., at 20%)
- Avoid labels with unbounded values
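The grouping advice can be implemented with a small helper. This is an illustrative sketch; the helper name and bucket boundaries are assumptions, not part of the existing codebase.

```typescript
// Collapse raw query lengths into a fixed set of label values so the
// query_length label only ever takes three values instead of dozens.
export function bucketQueryLength(query: string): 'short' | 'medium' | 'long' {
  if (query.length <= 3) return 'short';
  if (query.length <= 10) return 'medium';
  return 'long';
}

// Usage (illustrative):
// autocompleteZeroResults.inc({ query_length: bucketQueryLength(q), query_type: 'textual' });
```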
### Example: Sampling

```typescript
// Record the core counter on every request
autocompleteQueriesTotal.inc({ status: 'success' });

// But sample detailed metrics at 20%
if (Math.random() < 0.2) {
  autocompleteSuggestionTypes.inc({ type: suggestion.type });
  autocompleteRankingScore.observe({ suggestion_type }, score);
}
```

### Cardinality Check
```bash
# List autocomplete metric names known to Prometheus
curl http://prometheus:9090/api/v1/label/__name__/values | \
  jq '.data | map(select(startswith("autocomplete_"))) | .[]'
```

## Troubleshooting
### Metrics Not Appearing

1. Check the service is running:

   ```bash
   curl http://localhost:3001/metrics
   ```

2. Verify the Prometheus scrape config:

   ```yaml
   - job_name: 'search-service'
     static_configs:
       - targets: ['localhost:3001']
     metrics_path: '/metrics'
   ```

3. Check Prometheus targets: http://prometheus:9090/targets
### High Cardinality Issues

```bash
# Count series for the result-count histogram
curl -s http://prometheus:9090/api/v1/series \
  --data-urlencode 'match[]=autocomplete_result_count_bucket' | \
  jq '.data | length'
```

If there are more than 1000 series, reduce cardinality by:
- Removing unnecessary labels
- Lowering the sampling rate (e.g., from 20% to 10%)
- Grouping label values
### Performance Impact

Metric recording should take < 1 ms per request. If it is slower:

- Reduce the histogram bucket count
- Lower the sampling rate
- Profile with the `--inspect` flag
## Maintenance

### Monthly Tasks

- Review Alert Thresholds
  - Check that alerts are firing appropriately
  - Adjust thresholds based on observed values
- Prune Old Metrics
  - Prometheus retention defaults to 15 days
  - Adjust if needed: `--storage.tsdb.retention.time=30d`
- Dashboard Updates
  - Verify all panels are querying correctly
  - Add new metrics as they're created
### Quarterly Reviews
- Analyze metrics trends
- Identify optimization opportunities
- Plan for metric additions/removals
- Update alerting rules based on SLO changes
## Related Documentation

- `METRICS.md` - Complete metrics reference
- `AUTOCOMPLETE_ARCHITECTURE.md` - Autocomplete implementation
- `EQUIPMENT_API.md` - Equipment fitment API
- Prometheus: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Grafana: https://grafana.com/docs/grafana/latest/dashboards/