Monitoring & Metrics
Prometheus metrics, alerting, and monitoring for the parts-services Search service, covering autocomplete quality, fitment coverage, data quality, and system health.
The Search service exposes Prometheus metrics at GET /metrics for tracking autocomplete quality, equipment fitment coverage, data quality, and system health. All metrics follow Prometheus naming conventions and use low-cardinality labels (~38 total series).
Quick Reference
Accessing Metrics
# View all metrics
curl http://search-service:3001/metrics
# Filter specific metric
curl http://search-service:3001/metrics | grep autocomplete_result_count
# Prometheus UI
http://prometheus:9090/targets
Key Metrics at a Glance
Autocomplete Quality (P1)
| Metric | Type | Target | Alert Threshold |
|---|---|---|---|
| autocomplete_zero_results_total | Counter | < 5% rate | > 10% |
| autocomplete_empty_fallback_total | Counter | < 2% rate | > 5% |
| autocomplete_result_count | Histogram | P95 >= 5 | P95 < 3 |
Fitment Coverage (P2)
| Metric | Type | Target | Alert Threshold |
|---|---|---|---|
| parts_fitment_coverage_ratio | Gauge (0-1) | >= 0.8 | < 0.6 |
| equipment_fitment_coverage_ratio | Gauge (0-1) | >= 0.8 | < 0.5 |
| fitment_data_points_total / parts_with_fitment_total | Calculated | >= 2 | < 1.5 |
Data Quality (P3)
| Metric | Type | Target | Alert Threshold |
|---|---|---|---|
| catalog_ready_ratio | Gauge (0-1) | >= 0.8 | < 0.6 |
| missing_image_ratio | Gauge (0-1) | < 0.2 | > 0.3 |
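The target and alert columns above leave a band between "on target" and "alerting". A minimal sketch of how a reading could be classified against those two columns follows; the `classify` helper, the `Threshold` shape, and the `degraded` status name are hypothetical and not part of the service.

```typescript
// Hypothetical helper: classify a metric reading against the target and
// alert-threshold columns of the quick-reference tables.
type Direction = 'lower_is_better' | 'higher_is_better';
type Status = 'ok' | 'degraded' | 'alert';

interface Threshold {
  target: number; // value that meets the goal
  alert: number; // value beyond which an alert would fire
  direction: Direction;
}

function classify(value: number, t: Threshold): Status {
  if (t.direction === 'lower_is_better') {
    if (value > t.alert) return 'alert';
    return value < t.target ? 'ok' : 'degraded';
  }
  if (value < t.alert) return 'alert';
  return value >= t.target ? 'ok' : 'degraded';
}

// Zero result rate: target < 5%, alert > 10% (expressed as fractions).
const zeroResultRate: Threshold = { target: 0.05, alert: 0.1, direction: 'lower_is_better' };
// Parts fitment coverage: target >= 0.8, alert < 0.6.
const partsFitmentCoverage: Threshold = { target: 0.8, alert: 0.6, direction: 'higher_is_better' };

console.log(classify(0.07, zeroResultRate)); // degraded
console.log(classify(0.55, partsFitmentCoverage)); // alert
```

A reading in the `degraded` band misses the target without crossing the alert threshold, which is the range worth watching on a dashboard before an alert fires.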
Essential PromQL Queries
# Zero result rate (5m window)
(sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) * 100
# Fallback rate
(sum(rate(autocomplete_empty_fallback_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) * 100
# Result count P95
histogram_quantile(0.95, rate(autocomplete_result_count_bucket[5m]))
# Query latency P95 (ms)
histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m])) * 1000
# Fitment coverage trend
parts_fitment_coverage_ratio
# Catalog ready percentage
catalog_ready_ratio * 100
Metric Types
All metrics use three Prometheus types:
- Counter -- Monotonically increasing (total requests, errors). Use rate() to query.
- Gauge -- Point-in-time value (coverage ratios, document counts). Query directly.
- Histogram -- Value distributions with buckets (latency, result counts). Use histogram_quantile().
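To make the query functions above concrete, here is a simplified sketch of what rate() and histogram_quantile() compute. It is an illustration, not Prometheus's implementation: it ignores extrapolation at window edges and handles only a single counter reset. The function names, the `Sample` and `Bucket` shapes, and the bucket counts are made up for the example; the bucket bounds are the ones listed for autocomplete_result_count.

```typescript
interface Sample { timestampSec: number; value: number; }

// Per-second increase between two counter samples. A drop in value is
// treated as a counter reset (the counter restarted from zero).
function counterRate(first: Sample, last: Sample): number {
  const elapsed = last.timestampSec - first.timestampSec;
  if (elapsed <= 0) throw new Error('samples must be in time order');
  const increase = last.value >= first.value ? last.value - first.value : last.value;
  return increase / elapsed;
}

interface Bucket { le: number; cumulativeCount: number; } // le = upper bound

// Estimate quantile q by linear interpolation inside the bucket that holds
// the target rank. Buckets must be sorted by le, with +Inf as the last one.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) throw new RangeError('q must be within [0, 1]');
  if (buckets.length === 0) return NaN;
  const total = buckets[buckets.length - 1].cumulativeCount;
  if (total === 0) return NaN;
  const rank = q * total;
  let i = 0;
  while (buckets[i].cumulativeCount < rank) i++;
  const upper = buckets[i].le;
  // Rank lands in the +Inf bucket: fall back to the highest finite bound.
  if (!Number.isFinite(upper)) return i > 0 ? buckets[i - 1].le : NaN;
  const lower = i > 0 ? buckets[i - 1].le : 0;
  const below = i > 0 ? buckets[i - 1].cumulativeCount : 0;
  const inBucket = buckets[i].cumulativeCount - below;
  if (inBucket === 0) return upper;
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

// Illustrative cumulative counts (not real data).
const exampleBuckets: Bucket[] = [
  { le: 0, cumulativeCount: 5 }, { le: 1, cumulativeCount: 10 },
  { le: 3, cumulativeCount: 30 }, { le: 5, cumulativeCount: 60 },
  { le: 10, cumulativeCount: 90 }, { le: 20, cumulativeCount: 98 },
  { le: 50, cumulativeCount: 100 }, { le: Infinity, cumulativeCount: 100 },
];

// 60 new queries over 300 seconds is 0.2 queries per second.
console.log(counterRate({ timestampSec: 0, value: 100 }, { timestampSec: 300, value: 160 }));
// P95 falls in the (10, 20] bucket and interpolates to roughly 16.25.
console.log(histogramQuantile(0.95, exampleBuckets));
```

Because the quantile is interpolated within a bucket, its precision is limited by the bucket boundaries. That is why bucket lists are usually chosen around the thresholds that matter.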
All Metrics
Autocomplete Quality
autocomplete_result_count (Histogram) -- Number of suggestions returned per query.
- Labels: query_type (short, pn_like, textual, equipment), mode (auto, parts, equipment)
- Buckets: [0, 1, 3, 5, 10, 20, 50]
autocomplete_zero_results_total (Counter) -- Queries returning zero results.
- Labels: query_length (short, medium, long), query_type
autocomplete_empty_fallback_total (Counter) -- Enhanced autocomplete falling back to legacy.
- Labels: intent (short, pn_like, textual, equipment)
autocomplete_suggestion_types_total (Counter) -- Distribution of suggestion types returned.
- Labels: type (sku, category, manufacturer, equipment, description, brand)
- Sampled at 20% to reduce recording overhead.
autocomplete_ranking_score (Histogram) -- Relevance scores of returned suggestions.
- Labels: suggestion_type
- Buckets: [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0]
- Sampled at 20%.
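Since two of the metrics above are sampled at 20%, their raw values reflect roughly one in five events. The sketch below shows one way to wrap the sampling decision so it can be tested deterministically and so sampled counts can be scaled back to an estimate. `makeSampler`, `maybeRecord`, and `estimateTotal` are hypothetical names, and `record` stands in for a real counter's inc() call.

```typescript
type Rng = () => number; // returns a value in [0, 1)

// Hypothetical sampling wrapper around a metric-recording callback.
function makeSampler(sampleRate: number, rng: Rng = Math.random) {
  if (sampleRate <= 0 || sampleRate > 1) {
    throw new RangeError('sampleRate must be in (0, 1]');
  }
  return {
    // Runs `record` for roughly `sampleRate` of calls; returns whether it ran.
    maybeRecord(record: () => void): boolean {
      if (rng() < sampleRate) {
        record();
        return true;
      }
      return false;
    },
    // Scales a sampled count back to an estimate of the true event count.
    estimateTotal(sampledCount: number): number {
      return sampledCount / sampleRate;
    },
  };
}

// Deterministic RNG for the demo: cycles through five fixed values,
// only one of which is below 0.2.
const fixedValues = [0.1, 0.3, 0.5, 0.7, 0.9];
let cursor = 0;
const cyclingRng: Rng = () => fixedValues[cursor++ % fixedValues.length];

const sampler = makeSampler(0.2, cyclingRng);
let recorded = 0;
for (let i = 0; i < 10; i++) {
  sampler.maybeRecord(() => { recorded += 1; });
}
console.log(recorded, sampler.estimateTotal(recorded)); // 2 recorded, estimate of about 10
```

When comparing a sampled counter against an unsampled one (for example, suggestion types against total queries), the sampled series needs the same scaling in the query or dashboard.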
Autocomplete Performance
autocomplete_elasticsearch_latency_ms (Histogram) -- ES query latency per section.
- Labels: section_type (sku, equipment, category, name, brand)
- Buckets: [10, 25, 50, 100, 200, 500, 1000, 2000]
autocomplete_parallel_queries_count (Histogram) -- Parallel ES queries per request.
- Buckets: [1, 2, 3, 4, 5, 10]
autocomplete_cache_hits_total (Counter) -- Cache hits by type.
- Labels: cache_type (query_result, section_data, suggestion_dedup)
Fitment Coverage
parts_fitment_coverage_ratio (Gauge, 0-1) -- Ratio of parts with equipment fitment data.
- Labels: manufacturer, collection (nh_unified, mchale, hotsy)
- 0.7+ = good coverage.
parts_with_fitment_total (Gauge) -- Absolute count of parts with fitment.
- Labels: manufacturer, collection
fitment_data_points_total (Gauge) -- Total individual fitment entries.
- Labels: manufacturer, collection
- Richness indicator: fitment_data_points_total / parts_with_fitment_total = avg fitments per part.
equipment_fitment_coverage_ratio (Gauge, 0-1) -- Fitment coverage from query perspective.
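The fitment gauges above relate to each other through two simple ratios. The sketch below shows that arithmetic under an assumption the source does not spell out: that the coverage ratio is parts with fitment divided by all parts in the same manufacturer/collection. The `FitmentCounts` shape, the `fitmentSummary` function, and the sample numbers are illustrative only.

```typescript
interface FitmentCounts {
  totalParts: number; // assumed denominator: all parts in the group
  partsWithFitment: number; // corresponds to parts_with_fitment_total
  fitmentDataPoints: number; // corresponds to fitment_data_points_total
}

interface FitmentSummary {
  coverageRatio: number; // 0-1, as in parts_fitment_coverage_ratio
  avgFitmentsPerPart: number; // richness indicator
}

function fitmentSummary(c: FitmentCounts): FitmentSummary {
  // Guard against empty groups so the gauges never become NaN.
  const coverageRatio = c.totalParts > 0 ? c.partsWithFitment / c.totalParts : 0;
  const avgFitmentsPerPart =
    c.partsWithFitment > 0 ? c.fitmentDataPoints / c.partsWithFitment : 0;
  return { coverageRatio, avgFitmentsPerPart };
}

// Made-up numbers: 820 of 1000 parts have fitment, 2050 fitment entries in total.
const summary = fitmentSummary({
  totalParts: 1000,
  partsWithFitment: 820,
  fitmentDataPoints: 2050,
});
console.log(summary); // coverageRatio 0.82, avgFitmentsPerPart 2.5
```

With these numbers both values clear the quick-reference targets (coverage >= 0.8, richness >= 2).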
Equipment Fitment Queries
equipment_fitment_queries_total (Counter) -- Equipment fitment query volume.
- Labels: mode (browse, search, related)
equipment_fitment_query_latency_ms (Histogram) -- Equipment query latency.
- Buckets: [10, 50, 100, 200, 500, 1000]
Index Health
index_document_count (Gauge) -- Documents in search index.
- Labels: index_name (parts_current)
index_size_bytes (Gauge) -- Index size in bytes.
- Labels: index_name
Data Quality
quality_score_distribution (Histogram) -- Part quality scores (0-100).
- Labels: manufacturer
- Buckets: [0, 20, 40, 60, 80, 100]
- Ranges: 0-20 Poor, 21-40 Fair, 41-60 Good, 61-80 Very Good, 81-100 Excellent.
catalog_ready_ratio (Gauge, 0-1) -- Parts with quality_score >= 80.
- Labels: manufacturer
missing_image_ratio (Gauge, 0-1) -- Parts missing primary images.
- Labels: manufacturer
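The quality score ranges and the catalog-ready cutoff described above can be expressed as two small functions. This is a sketch of the definitions as written in this section, not the service's scoring code; `qualityBand` and `catalogReadyRatio` are hypothetical names, and fractional scores are assigned to the band whose upper bound they do not exceed.

```typescript
type QualityBand = 'Poor' | 'Fair' | 'Good' | 'Very Good' | 'Excellent';

// Maps a 0-100 quality score to the named ranges:
// 0-20 Poor, 21-40 Fair, 41-60 Good, 61-80 Very Good, 81-100 Excellent.
function qualityBand(score: number): QualityBand {
  if (score < 0 || score > 100) throw new RangeError('score must be within 0-100');
  if (score <= 20) return 'Poor';
  if (score <= 40) return 'Fair';
  if (score <= 60) return 'Good';
  if (score <= 80) return 'Very Good';
  return 'Excellent';
}

// Share of parts with quality_score >= 80, matching the description of
// catalog_ready_ratio. Returns 0 for an empty list.
function catalogReadyRatio(scores: number[]): number {
  if (scores.length === 0) return 0;
  const ready = scores.filter((s) => s >= 80).length;
  return ready / scores.length;
}

console.log(qualityBand(80)); // Very Good
console.log(catalogReadyRatio([50, 80, 90, 100])); // 0.75
```

Note the boundary: a score of exactly 80 sits in the "Very Good" range yet still counts as catalog ready under the >= 80 rule.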
Alert Rules
Critical (P1) -- 5 minute evaluation
groups:
  - name: search-service
    interval: 30s
    rules:
      - alert: AutocompleteZeroResultsHigh
        expr: >
          (sum(rate(autocomplete_zero_results_total[5m])) /
          sum(rate(autocomplete_queries_total[5m]))) > 0.1
        for: 5m
        annotations:
          summary: "Zero result rate > 10%"
      - alert: AutocompleteFallbackHigh
        expr: >
          (sum(rate(autocomplete_empty_fallback_total[5m])) /
          sum(rate(autocomplete_queries_total[5m]))) > 0.05
        for: 5m
        annotations:
          summary: "Fallback rate > 5%"
      - alert: AutocompleteLatencyHigh
        expr: >
          histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        annotations:
          summary: "P95 latency > 500ms"
Warning (P2) -- 10 minute evaluation
      - alert: FitmentCoverageLow
        expr: parts_fitment_coverage_ratio{manufacturer="nh"} < 0.6
        for: 10m
      - alert: CatalogQualityLow
        expr: catalog_ready_ratio{manufacturer="nh"} < 0.6
        for: 10m
      - alert: MissingImageRatioHigh
        expr: missing_image_ratio{manufacturer="nh"} > 0.3
        for: 10m
Grafana Dashboard
Import: scripts/grafana-dashboard-autocomplete-metrics.json
Panels:
- Query Volume -- QPS breakdown (success, error, aborted)
- Zero Result Rate -- Percentage gauge
- Result Count Distribution -- P50, P95, P99
- Query Latency -- Performance by percentile
- Suggestion Type Distribution -- Breakdown by type
- Fallback Rate -- Enhanced mode fallback percentage
- Fitment Coverage (NH) -- Manufacturer coverage gauge
- Equipment Fitment Coverage -- Query-level gauge
- Catalog Ready -- Data quality gauge
- Missing Images Ratio -- Image coverage gauge
Adding New Metrics
1. Define in registry
// src/metrics/registry.ts
import { Counter } from 'prom-client';
// `register` is assumed to be the Registry instance already defined in this file.
export const myNewMetric = new Counter({
  name: 'my_new_metric_total',
  help: 'Description of what this measures',
  labelNames: ['label1', 'label2'],
  registers: [register],
});
2. Export from routes
// src/routes/metrics.ts
import { myNewMetric } from '../metrics/registry';
export { myNewMetric };
3. Record in handler
myNewMetric.inc({ label1: 'value', label2: 'value' });
4. Verify
curl http://localhost:3001/metrics | grep my_new_metric
Cardinality Guidelines
- Use fixed label values (enum-like) only -- never part_id, query_hash, etc.
- Group high-cardinality values (e.g., query_length: short/medium/long).
- Sample high-frequency metrics at 20%:
if (Math.random() < 0.2) {
  autocompleteSuggestionTypes.inc({ type: suggestion.type });
}
- Target: ~38 total series. Alert if exceeding 1000.
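The grouping guideline above can be sketched as a small function that maps a free-form query to one of three fixed label values. The thresholds (3 and 10 characters) and the `queryLengthLabel` name are hypothetical; the source names the short/medium/long groups but not their boundaries.

```typescript
type QueryLength = 'short' | 'medium' | 'long';

// Collapses an unbounded value (query text) into a fixed set of label values.
// shortMax and mediumMax are assumed cutoffs, not the service's actual ones.
function queryLengthLabel(query: string, shortMax = 3, mediumMax = 10): QueryLength {
  const length = query.trim().length;
  if (length <= shortMax) return 'short';
  if (length <= mediumMax) return 'medium';
  return 'long';
}

console.log(queryLengthLabel('ab')); // short
console.log(queryLengthLabel('filter')); // medium
console.log(queryLengthLabel('hydraulic filter')); // long
```

Whatever the cutoffs, the label can only ever take three values, so the series count stays bounded no matter what users type.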
Fitment Coverage Calculator
Script: scripts/calculate-fitment-coverage.ts
# Dry run
bun scripts/calculate-fitment-coverage.ts --dry-run
# Specific manufacturer
bun scripts/calculate-fitment-coverage.ts --manufacturer=nh
# Full update
bun scripts/calculate-fitment-coverage.ts
Schedule hourly via cron or Cloud Run Job:
# Cron
0 * * * * cd /path/to/microservices && bun scripts/calculate-fitment-coverage.ts
# Cloud Run Job
gcloud run jobs create calculate-fitment-coverage \
  --image gcr.io/PROJECT/search-service:TAG \
  --command bun \
  --args scripts/calculate-fitment-coverage.ts \
  --set-env-vars ELASTICSEARCH_URL=https://elasticsearch:9200
Deployment Checklist
Pre-Deploy
- bun run type-check passes
- curl http://localhost:3001/metrics returns all 20 metrics
- No cardinality warnings (expected ~38 series)
- bun test passes
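The metric-count and series-count checks above could be automated by parsing the text returned from /metrics. The following is a rough sketch under stated assumptions: `summarizeExposition` is a hypothetical helper, the sample payload is invented, and every sample line is counted as a series, so each histogram `_bucket`, `_sum`, and `_count` line counts separately. The ~38 figure in this document may be counted differently.

```typescript
interface ExpositionSummary {
  metricFamilies: string[]; // names declared via "# TYPE" lines
  seriesCount: number; // distinct name + label-set combinations
}

// Summarizes a Prometheus text-format payload (as returned by GET /metrics).
function summarizeExposition(text: string): ExpositionSummary {
  const families = new Set<string>();
  const series = new Set<string>();
  for (const raw of text.split('\n')) {
    const line = raw.trim();
    if (line === '') continue;
    if (line.startsWith('#')) {
      // "# TYPE <name> <type>" declares a metric family; HELP lines are skipped.
      const parts = line.split(/\s+/);
      if (parts[1] === 'TYPE' && parts[2]) families.add(parts[2]);
      continue;
    }
    // Sample line: name{labels} value [timestamp]. The series key is the
    // name plus its label set, or just the name when there are no labels.
    const close = line.lastIndexOf('}');
    const key = close >= 0 ? line.slice(0, close + 1) : line.split(/\s+/)[0];
    series.add(key);
  }
  return { metricFamilies: Array.from(families).sort(), seriesCount: series.size };
}

// Invented sample payload for illustration.
const samplePayload = [
  '# HELP autocomplete_zero_results_total Queries returning zero results',
  '# TYPE autocomplete_zero_results_total counter',
  'autocomplete_zero_results_total{query_length="short",query_type="textual"} 12',
  'autocomplete_zero_results_total{query_length="long",query_type="pn_like"} 3',
  '# TYPE index_document_count gauge',
  'index_document_count{index_name="parts_current"} 150000',
].join('\n');

const expositionSummary = summarizeExposition(samplePayload);
console.log(expositionSummary.metricFamilies.length, expositionSummary.seriesCount); // 2 3
```

In a pre-deploy script, the payload would come from the service's /metrics response, and the two numbers could be compared against the expected metric count and the 1000-series ceiling.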
Monitoring Setup
- Prometheus scrape target added (30s interval)
- Grafana dashboard imported from scripts/grafana-dashboard-autocomplete-metrics.json
- Alert rules loaded into Prometheus, with notifications routed through Alertmanager
- Fitment coverage job scheduled (hourly)
Post-Deploy (24h)
- All metrics collecting data
- Zero result rate within expected range
- Latency within expected range
- No false-positive alerts
- Resource consumption stable (Prometheus memory, CPU, disk)
Rollback Plan
- Metric errors -- Disable recording in code, restart service
- High cardinality -- Remove the problematic label or metric (reducing sampling from 20% to 10% lowers recording overhead but not series count)
- Alert false positives -- Disable specific rule, adjust threshold
- Complete rollback -- Revert code, remove scrape config, delete dashboard, disable alerts
Performance
Metric recording overhead is negligible:
| Operation | Cost |
|---|---|
| Counter increment | ~0.01ms |
| Histogram observation | ~0.05ms |
| Gauge update | ~0.01ms |
| 20% sampling check | ~0.001ms |
| Total per request | < 0.1ms |
Memory: ~15 KB for all custom metrics (500 bytes per histogram, 200 bytes per counter/gauge).
Troubleshooting
Metrics not in Prometheus:
- Verify service: curl http://localhost:3001/metrics
- Check scrape config targets: http://prometheus:9090/targets
- Confirm scrape interval and metrics_path in Prometheus config
High cardinality warnings: All metrics use fixed enum labels. If series count exceeds 1000, remove unnecessary labels or collapse label values into fewer groups; changing the sampling rate does not reduce the number of series.
Latency spikes:
Use autocomplete_elasticsearch_latency_ms by section_type to isolate which ES section is slow. Check autocomplete_parallel_queries_count for query complexity.
File Structure
microservices/services/search/
├── src/
│   ├── metrics/
│   │   └── registry.ts                  # 20 metric definitions
│   └── routes/
│       ├── metrics.ts                   # Metric exports
│       └── autocomplete.ts              # Recording logic
└── scripts/
    ├── calculate-fitment-coverage.ts    # Hourly fitment job
    └── grafana-dashboard-autocomplete-metrics.json