Prometheus Metrics Documentation
This document describes all custom Prometheus metrics collected by the Search service for monitoring autocomplete quality, fitment coverage, and system health.
Overview
All metrics follow Prometheus naming conventions and use one of three metric types:
- Counter: Monotonically increasing value (total requests, errors, etc.)
- Gauge: Point-in-time measurement (document count, ratios, etc.)
- Histogram: Distribution of values with quantile aggregation (latency, sizes, etc.)
Metrics endpoint: GET /metrics
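As a rough illustration of what a scrape of this endpoint returns, the sketch below renders one counter in the Prometheus text exposition format. The helper name renderCounter and the sample values are hypothetical, and label-value escaping is left out for brevity; in practice a client library produces this output.

```javascript
// Render one counter family in the Prometheus text exposition format.
// `samples` is a list of { labels, value } pairs; names and values are illustrative.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${val}"`)
      .join(',');
    // A series without labels is written without the braces.
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const exposition = renderCounter(
  'autocomplete_zero_results_total',
  'Total queries returning zero results',
  [{ labels: { query_length: 'short', query_type: 'pn_like' }, value: 12 }]
);
console.log(exposition);
```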
Autocomplete Quality Metrics
autocomplete_result_count (Histogram)
Type: Histogram
Labels: query_type, mode
Buckets: [0, 1, 3, 5, 10, 20, 50]
Number of suggestions returned per query. Tracks distribution to identify queries with poor or excessive results.
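Prometheus histogram buckets are cumulative: each observation increments every bucket whose upper bound (le) is at least the observed value, plus the implicit +Inf bucket. The sketch below shows this for the bucket list above; cumulativeBucketCounts is a hypothetical helper for illustration, not the service's recording code.

```javascript
const RESULT_COUNT_BUCKETS = [0, 1, 3, 5, 10, 20, 50];

// Count observations per cumulative bucket, keyed by the bucket's upper bound.
function cumulativeBucketCounts(buckets, observations) {
  const counts = new Map(buckets.map((le) => [String(le), 0]));
  counts.set('+Inf', 0);
  for (const value of observations) {
    for (const le of buckets) {
      if (value <= le) {
        counts.set(String(le), counts.get(String(le)) + 1);
      }
    }
    // Every observation lands in +Inf, which always equals the total count.
    counts.set('+Inf', counts.get('+Inf') + 1);
  }
  return counts;
}

// Four queries returning 0, 2, 7, and 60 suggestions.
const bucketCounts = cumulativeBucketCounts(RESULT_COUNT_BUCKETS, [0, 2, 7, 60]);
console.log(Object.fromEntries(bucketCounts));
```

A query returning more than 50 suggestions is only visible in the +Inf bucket, so quantile estimates above the largest bound are capped at 50.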
Query Type Values:
- short: Queries of 3 characters or fewer
- pn_like: Part number pattern queries
- textual: Natural language queries
- equipment: Equipment-focused queries
Mode Values:
- auto: Automatic mode selection
- parts: Parts-only search
- equipment: Equipment-centric search
Example Queries:
# Mean result count
rate(autocomplete_result_count_sum[5m]) / rate(autocomplete_result_count_count[5m])
# P95 result count distribution
histogram_quantile(0.95, sum by (le) (rate(autocomplete_result_count_bucket[5m])))
# Percentage of queries returning 0-1 results
(sum(rate(autocomplete_result_count_bucket{le="1"}[5m])) / sum(rate(autocomplete_result_count_count[5m]))) * 100
autocomplete_zero_results_total (Counter)
Type: Counter
Labels: query_length, query_type
Total queries returning zero results. Indicates potential quality issues or gaps in autocomplete coverage.
Query Length Values:
- short: 1-3 characters
- medium: 4-10 characters
- long: 11+ characters
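The length ranges above can be mapped to label values with a small helper. This is a minimal sketch; classifyQueryLength is a hypothetical name, and it assumes empty queries are rejected before metrics are recorded (here they would fall into short).

```javascript
// Map a raw query string to the query_length label value.
function classifyQueryLength(query) {
  const length = query.length;
  if (length <= 3) return 'short';   // 1-3 characters
  if (length <= 10) return 'medium'; // 4-10 characters
  return 'long';                     // 11+ characters
}

console.log(classifyQueryLength('oil'));
console.log(classifyQueryLength('oil filter'));
console.log(classifyQueryLength('hydraulic pump seal'));
```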
Example Queries:
# Zero result rate (5-minute window)
rate(autocomplete_zero_results_total[5m])
# Zero result rate by query length
rate(autocomplete_zero_results_total{query_length="short"}[5m])
# Percentage of queries with zero results
(sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) * 100
autocomplete_empty_fallback_total (Counter)
Type: Counter
Labels: intent
Times enhanced autocomplete returned empty results and triggered legacy fallback. Lower values indicate stable enhanced mode.
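The fallback flow this counter tracks could look roughly like the sketch below. It is synchronous for brevity, and enhancedSearch, legacySearch, and onFallback are hypothetical injected dependencies rather than the service's actual functions.

```javascript
// Try enhanced autocomplete first; fall back to legacy when it returns nothing.
function autocompleteWithFallback(query, intent, deps) {
  const enhanced = deps.enhancedSearch(query);
  if (enhanced.length > 0) {
    return enhanced;
  }
  // Enhanced mode came back empty: count the fallback, then use the legacy path.
  // In the service this hook would increment the fallback counter with { intent }.
  deps.onFallback(intent);
  return deps.legacySearch(query);
}

// Stand-in dependencies that record fallbacks in a plain object.
const fallbacksByIntent = {};
const demoDeps = {
  enhancedSearch: (query) => (query.startsWith('bolt') ? ['bolt m8'] : []),
  legacySearch: (query) => [`legacy:${query}`],
  onFallback: (intent) => {
    fallbacksByIntent[intent] = (fallbacksByIntent[intent] || 0) + 1;
  },
};

console.log(autocompleteWithFallback('bolt', 'textual', demoDeps)); // enhanced path
console.log(autocompleteWithFallback('zz9', 'pn_like', demoDeps));  // legacy path
```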
Intent Values:
- short: Short queries
- pn_like: Part number queries
- textual: Text queries
- equipment: Equipment queries
Example Queries:
# Fallback rate
rate(autocomplete_empty_fallback_total[5m])
# Fallback percentage of total queries
(sum(rate(autocomplete_empty_fallback_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) * 100
# Alert on high fallback rate
rate(autocomplete_empty_fallback_total[5m]) > 0.1
autocomplete_suggestion_types_total (Counter)
Type: Counter
Labels: type
Distribution of suggestion types returned. Indicates which fields provide results and their relative importance.
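One way to feed this counter is to tally the types in a response first and then increment once per type. The sketch below assumes each suggestion object carries a type field; tallySuggestionTypes is a hypothetical helper.

```javascript
// Count how many suggestions of each type a single response contains.
function tallySuggestionTypes(suggestions) {
  const counts = {};
  for (const { type } of suggestions) {
    counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}

const typeCounts = tallySuggestionTypes([
  { type: 'sku', text: '84217229' },
  { type: 'sku', text: '84217230' },
  { type: 'category', text: 'Filters' },
]);
console.log(typeCounts);
```

With a prom-client style counter, each entry could then be recorded with a single call per type, passing the tally as the increment amount.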
Type Values:
- sku: Part number/SKU matches
- category: Product category matches
- manufacturer: Manufacturer matches
- equipment: Equipment model matches
- description: Description field matches
- brand: Brand matches
Example Queries:
# Top suggestion types by count
topk(5, rate(autocomplete_suggestion_types_total[5m]))
# SKU suggestion percentage
(sum(rate(autocomplete_suggestion_types_total{type="sku"}[5m])) /
sum(rate(autocomplete_suggestion_types_total[5m]))) * 100
# Equipment suggestion availability
rate(autocomplete_suggestion_types_total{type="equipment"}[5m])
autocomplete_ranking_score (Histogram)
Type: Histogram
Labels: suggestion_type
Buckets: [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0]
Relevance/ranking scores of returned suggestions. Higher scores indicate better ranking quality.
Note: Observations are sampled at 20% to reduce recording overhead, so the histogram's _count reflects roughly one fifth of the suggestions returned.
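Uniform sampling leaves averages and quantiles roughly unchanged, but raw counts need to be scaled back up by the sampling rate. The sketch below shows both halves; shouldSample and estimateTotal are hypothetical helpers, and the random source is injectable so the behaviour can be tested deterministically.

```javascript
const SAMPLE_RATE = 0.2;

// Decide whether to record this observation; `random` returns a value in [0, 1).
function shouldSample(random = Math.random) {
  return random() < SAMPLE_RATE;
}

// Estimate the true number of events from a sampled count.
function estimateTotal(sampledCount, sampleRate = SAMPLE_RATE) {
  return sampledCount / sampleRate;
}

if (shouldSample()) {
  // A histogram observation for the ranking score would be recorded here.
}
console.log(estimateTotal(40));
```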
Example Queries:
# Average score per suggestion type
rate(autocomplete_ranking_score_sum[5m]) / rate(autocomplete_ranking_score_count[5m])
# P95 score by suggestion type
histogram_quantile(0.95, rate(autocomplete_ranking_score_bucket[5m]))
# Percentage of suggestions with score above 0.7 (good)
(1 - sum(rate(autocomplete_ranking_score_bucket{le="0.7"}[5m])) / sum(rate(autocomplete_ranking_score_count[5m]))) * 100
Autocomplete Performance Metrics
autocomplete_elasticsearch_latency_ms (Histogram)
Type: Histogram
Labels: section_type
Buckets: [10, 25, 50, 100, 200, 500, 1000, 2000]
Time spent in Elasticsearch queries for each section type (milliseconds).
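A per-section latency measurement could be taken with a small start/stop timer like the sketch below. startTimerMs is a hypothetical helper; it defaults to Date.now, which has millisecond resolution, and accepts another clock for finer timing or for tests.

```javascript
// Start a timer and return a function that reports elapsed milliseconds.
function startTimerMs(now = Date.now) {
  const startedAt = now();
  return () => now() - startedAt;
}

const endTimer = startTimerMs();
// ... the Elasticsearch query for one section would run here ...
const elapsedMs = endTimer();
// The elapsed value would then be observed on the histogram with a
// section_type label such as 'sku'.
console.log(`section took ${elapsedMs} ms`);
```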
Section Type Values:
- sku: Part number/SKU search
- equipment: Equipment model search
- category: Category search
- name: Description/name search
- brand: Brand/manufacturer search
Example Queries:
# P95 latency by section
histogram_quantile(0.95, rate(autocomplete_elasticsearch_latency_ms_bucket[5m]))
# Average latency per section
rate(autocomplete_elasticsearch_latency_ms_sum[5m]) / rate(autocomplete_elasticsearch_latency_ms_count[5m])
# Alert on high SKU latency
histogram_quantile(0.95, rate(autocomplete_elasticsearch_latency_ms_bucket{section_type="sku"}[5m])) > 100
autocomplete_parallel_queries_count (Histogram)
Type: Histogram
Buckets: [1, 2, 3, 4, 5, 10]
Number of parallel Elasticsearch queries in a single autocomplete request. Tracks query efficiency and concurrency.
Example Queries:
# Average number of parallel queries
rate(autocomplete_parallel_queries_count_sum[5m]) / rate(autocomplete_parallel_queries_count_count[5m])
# P95 parallel query count
histogram_quantile(0.95, rate(autocomplete_parallel_queries_count_bucket[5m]))
autocomplete_cache_hits_total (Counter)
Type: Counter
Labels: cache_type
Cache hits for autocomplete queries. Currently informational; used for future optimization planning.
Cache Type Values:
- query_result: Full query result cache
- section_data: Individual section data cache
- suggestion_dedup: Deduplication cache
Example Queries:
# Cache hit rate
(sum(rate(autocomplete_cache_hits_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) * 100
Fitment Coverage Metrics
parts_fitment_coverage_ratio (Gauge)
Type: Gauge
Labels: manufacturer, collection
Range: 0-1 (ratio)
Ratio of parts with equipment fitment data. Calculated as: parts_with_fitment / total_parts
Collection Values:
- nh_unified: New Holland unified collection
- mchale: McHale collection
- hotsy: HOTSY collection
Interpretation:
- 0.0 = No parts have fitment data
- 1.0 = All parts have fitment data
- 0.7+ = Good coverage
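The ratio and its interpretation can be expressed as two small functions. This is a sketch under the definitions above; coverageRatio and isGoodCoverage are hypothetical names, and returning 0 for an empty collection is an assumption that matches the guard used in the periodic job example later in this document.

```javascript
// parts_with_fitment / total_parts, guarded against an empty collection.
function coverageRatio(partsWithFitment, totalParts) {
  if (totalParts <= 0) return 0;
  return partsWithFitment / totalParts;
}

// 0.7 and above counts as good coverage.
function isGoodCoverage(ratio) {
  return ratio >= 0.7;
}

const nhRatio = coverageRatio(700, 1000);
console.log(nhRatio, isGoodCoverage(nhRatio));
```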
Example Queries:
# NH fitment coverage
parts_fitment_coverage_ratio{manufacturer="nh"}
# Coverage by all manufacturers
parts_fitment_coverage_ratio
# Alert on coverage drop
(parts_fitment_coverage_ratio < 0.6)
# Comparison between manufacturers
avg by (manufacturer) (parts_fitment_coverage_ratio)
parts_with_fitment_total (Gauge)
Type: Gauge
Labels: manufacturer, collection
Absolute count of parts with equipment fitment data.
Example Queries:
# NH parts with fitment
parts_with_fitment_total{manufacturer="nh"}
# Total across all
sum(parts_with_fitment_total)
# Track changes over time
delta(parts_with_fitment_total[1h])
fitment_data_points_total (Gauge)
Type: Gauge
Labels: manufacturer, collection
Total equipment fitment data points (individual fitment entries) across all parts.
Interpretation:
- Indicates richness of fitment data
- Higher values = more detailed fitment coverage
- Use ratio: fitment_data_points_total / parts_with_fitment_total = avg fitments per part
Example Queries:
# Average fitments per part
fitment_data_points_total / parts_with_fitment_total
# Total fitment data points
sum(fitment_data_points_total)
# Alert on data loss
(fitment_data_points_total / parts_with_fitment_total) < 2
equipment_fitment_coverage_ratio (Gauge)
Type: Gauge
Range: 0-1 (ratio)
Ratio of search queries with equipment fitment data available. Measures coverage from query perspective.
Interpretation:
- 0.0 = No queries return parts with fitment data
- 1.0 = All queries return parts with fitment data
- 0.5+ = Reasonable coverage
Example Queries:
# Current fitment coverage
equipment_fitment_coverage_ratio
# Alert on coverage below threshold
(equipment_fitment_coverage_ratio < 0.5)
# Track coverage over time
equipment_fitment_coverage_ratio
Equipment Fitment Query Metrics
equipment_fitment_queries_total (Counter)
Type: Counter
Labels: mode
Total equipment fitment queries (part -> equipment search operations).
Mode Values:
- browse: Equipment browsing queries
- search: Search-based equipment queries
- related: Related equipment queries
Example Queries:
# Equipment query rate
rate(equipment_fitment_queries_total[5m])
# Query volume by mode
sum by (mode) (rate(equipment_fitment_queries_total[5m]))
equipment_fitment_query_latency_ms (Histogram)
Type: Histogram
Labels: None
Buckets: [10, 50, 100, 200, 500, 1000]
Latency of equipment fitment queries (milliseconds).
Example Queries:
# P95 equipment query latency
histogram_quantile(0.95, rate(equipment_fitment_query_latency_ms_bucket[5m]))
# Average latency
rate(equipment_fitment_query_latency_ms_sum[5m]) / rate(equipment_fitment_query_latency_ms_count[5m])
# Alert on slow queries
histogram_quantile(0.95, rate(equipment_fitment_query_latency_ms_bucket[5m])) > 500
Index Health Metrics
index_document_count (Gauge)
Type: Gauge
Labels: index_name
Total number of documents in search index.
Index Name Values:
- parts_current: Current parts index (via alias)
Example Queries:
# Document count
index_document_count
# Alert on unexpected count drop
(index_document_count < 5000)
index_size_bytes (Gauge)
Type: Gauge
Labels: index_name
Total size of index in bytes.
Example Queries:
# Index size in GB
index_size_bytes / 1024 / 1024 / 1024
# Alert on excessive growth
(index_size_bytes / 1024 / 1024 / 1024) > 100
Data Quality Metrics
quality_score_distribution (Histogram)
Type: Histogram
Labels: manufacturer
Buckets: [0, 20, 40, 60, 80, 100]
Distribution of part quality scores (0-100).
Score Ranges:
- 0-20: Poor
- 21-40: Fair
- 41-60: Good
- 61-80: Very Good
- 81-100: Excellent
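The score ranges map directly onto the histogram's bucket bounds, so the share of parts in a band can be read from the difference between adjacent cumulative buckets (for example, Fair is le="40" minus le="20"). The sketch below shows the same mapping in code; qualityBand is a hypothetical helper.

```javascript
// Map a 0-100 quality score to its named band.
function qualityBand(score) {
  if (score <= 20) return 'Poor';
  if (score <= 40) return 'Fair';
  if (score <= 60) return 'Good';
  if (score <= 80) return 'Very Good';
  return 'Excellent';
}

console.log(qualityBand(20));
console.log(qualityBand(21));
console.log(qualityBand(95));
```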
Example Queries:
# Average quality score
rate(quality_score_distribution_sum[5m]) / rate(quality_score_distribution_count[5m])
# Average by manufacturer
sum by (manufacturer) (rate(quality_score_distribution_sum[5m])) / sum by (manufacturer) (rate(quality_score_distribution_count[5m]))
# P95 score
histogram_quantile(0.95, rate(quality_score_distribution_bucket[5m]))
catalog_ready_ratio (Gauge)
Type: Gauge
Labels: manufacturer
Range: 0-1 (ratio)
Ratio of parts with catalogReady=true (quality_score >= 80).
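A periodic job could derive this gauge value from a list of parts as sketched below. The qualityScore field name and the catalogReadyRatio helper are assumptions for illustration; only the threshold of 80 comes from the definition above.

```javascript
const CATALOG_READY_THRESHOLD = 80;

// Share of parts whose quality score meets the catalog-ready threshold.
function catalogReadyRatio(parts) {
  if (parts.length === 0) return 0;
  const ready = parts.filter(
    (part) => part.qualityScore >= CATALOG_READY_THRESHOLD
  ).length;
  return ready / parts.length;
}

const sampleRatio = catalogReadyRatio([
  { qualityScore: 85 },
  { qualityScore: 80 },
  { qualityScore: 79 },
  { qualityScore: 10 },
]);
console.log(sampleRatio);
```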
Interpretation:
- Percentage of production-ready parts
- 0.8+ = Strong data quality
- < 0.5 = Quality concerns
Example Queries:
# Catalog ready by manufacturer
catalog_ready_ratio
# Alert on quality drop
(catalog_ready_ratio < 0.6)
# Track over time
catalog_ready_ratio
missing_image_ratio (Gauge)
Type: Gauge
Labels: manufacturer
Range: 0-1 (ratio)
Ratio of parts missing primary images.
Interpretation:
- 0.0 = All parts have images
- 1.0 = No parts have images
- < 0.2 = Good image coverage
Example Queries:
# Missing image ratio
missing_image_ratio
# Alert on high missing rate
(missing_image_ratio > 0.3)
# Comparison across manufacturers
missing_image_ratio
Common Alert Patterns
Autocomplete Quality
# Zero result rate > 10%
- alert: AutocompleteZeroResultsHigh
  expr: (sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) > 0.1
# Fallback rate > 5%
- alert: AutocompleteFallbackHigh
  expr: (sum(rate(autocomplete_empty_fallback_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) > 0.05
# P95 latency > 500ms
- alert: AutocompleteSlowLatency
  expr: histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m])) > 0.5
Fitment Coverage
# Fitment coverage drop
- alert: FitmentCoverageLow
  expr: parts_fitment_coverage_ratio < 0.6
# Equipment coverage drop
- alert: EquipmentFitmentCoverageLow
  expr: equipment_fitment_coverage_ratio < 0.5
Data Quality
# Catalog ready ratio < 60%
- alert: CatalogQualityLow
  expr: catalog_ready_ratio < 0.6
# Missing images > 30%
- alert: MissingImageRatioHigh
  expr: missing_image_ratio > 0.3
Dashboard Queries
Autocomplete Quality Dashboard
# Top panel: Query volume and success rate
sum(rate(autocomplete_queries_total[5m])) by (status)
# Second panel: Result distribution
histogram_quantile(0.95, rate(autocomplete_result_count_bucket[5m]))
# Third panel: Zero result rate
(sum(rate(autocomplete_zero_results_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) * 100
# Fourth panel: Suggestion type distribution
sum by (type) (rate(autocomplete_suggestion_types_total[5m]))
# Fifth panel: Query latency
histogram_quantile(0.95, rate(autocomplete_query_duration_seconds_bucket[5m]))
# Sixth panel: Fallback rate
(sum(rate(autocomplete_empty_fallback_total[5m])) / sum(rate(autocomplete_queries_total[5m]))) * 100
Fitment Coverage Dashboard
# Top panel: Fitment coverage by manufacturer
avg by (manufacturer) (parts_fitment_coverage_ratio)
# Second panel: Parts with fitment
sum by (manufacturer) (parts_with_fitment_total)
# Third panel: Equipment fitment coverage
equipment_fitment_coverage_ratio
# Fourth panel: Fitments per part (richness)
fitment_data_points_total / parts_with_fitment_total
# Fifth panel: Equipment query latency P95
histogram_quantile(0.95, rate(equipment_fitment_query_latency_ms_bucket[5m]))
Data Quality Dashboard
# Top panel: Catalog ready by manufacturer
avg by (manufacturer) (catalog_ready_ratio)
# Second panel: Average quality score
sum by (manufacturer) (rate(quality_score_distribution_sum[5m])) / sum by (manufacturer) (rate(quality_score_distribution_count[5m]))
# Third panel: Missing image ratio
avg by (manufacturer) (missing_image_ratio)
# Fourth panel: Index health
index_document_count
# Fifth panel: Index size
index_size_bytes / 1024 / 1024 / 1024
Metric Recording Guidelines
Low Cardinality
Keep label cardinality low to prevent memory issues:
- Use fixed set of label values (enum-like)
- Avoid high-cardinality labels like part_id, query_hash
- Group similar values (query_length: short/medium/long)
- Sample high-frequency metrics (20% for suggestion types)
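One common way to enforce a fixed set of label values is to normalize anything unexpected into a single catch-all value before recording. The sketch below assumes a catch-all named other, which is not part of the documented label values; normalizeLabel is a hypothetical helper.

```javascript
// Documented values for the `type` label of autocomplete_suggestion_types_total.
const ALLOWED_SUGGESTION_TYPES = new Set([
  'sku', 'category', 'manufacturer', 'equipment', 'description', 'brand',
]);

// Return the value unchanged if it is allowed, otherwise the fallback.
// The 'other' fallback is an assumption for this sketch.
function normalizeLabel(value, allowed, fallback = 'other') {
  return allowed.has(value) ? value : fallback;
}

console.log(normalizeLabel('sku', ALLOWED_SUGGESTION_TYPES));
console.log(normalizeLabel('part-84217229', ALLOWED_SUGGESTION_TYPES));
```

This keeps the number of time series bounded by the size of the allowed set plus one, whatever values reach the recording code.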
Sampling
To reduce overhead on frequent operations:
// Sample 20% of requests
if (Math.random() < 0.2) {
  // Record detailed metric
  autocompleteSuggestionTypes.inc({ type });
}
Performance
Metric recording should add less than 1 ms per request:
// Fast path - aggregate only totals
autocompleteQueriesTotal.inc({ status: 'success' });
// Slower path - histograms (reserve for important metrics)
autocompleteQueryDuration.observe({ intent }, durationSec);
Implementation Examples
Recording Result Count
autocompleteResultCount.observe(
  { query_type: intent, mode: resolvedMode },
  suggestions.length
);
Recording Zero Results
if (suggestions.length === 0) {
  // Thresholds follow the documented query_length values: 1-3, 4-10, 11+
  const queryLength = query.length <= 3 ? 'short' : query.length <= 10 ? 'medium' : 'long';
  autocompleteZeroResults.inc({ query_length: queryLength, query_type: intent });
}
Recording Fitment Coverage (periodic job)
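const total = await countTotalParts();
const withFitment = await countPartsWithFitment();
// Hypothetical helper, assumed to return the number of fitment entries across all parts
const totalDataPoints = await countFitmentDataPoints();
const ratio = total > 0 ? withFitment / total : 0;
// Both declared labels are set; the collection value is assumed from the list above
const labels = { manufacturer: 'nh', collection: 'nh_unified' };
fitmentCoverageRatio.set(labels, ratio);
fitmentPartsTotal.set(labels, withFitment);
fitmentDataPointsTotal.set(labels, totalDataPoints);
Future Improvements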
- Query Result Caching: Track autocomplete_cache_hits once cache is implemented
- A/B Testing Metrics: Track variant-specific metrics for ranking experiments
- User Behavior: Session-level metrics (session conversion, suggestion click-through)
- Index Performance: Shard-level metrics, query complexity tracking
- Automated Alerts: Machine learning-based anomaly detection
Related Documentation
- AUTOCOMPLETE_ARCHITECTURE.md: Autocomplete implementation details
- EQUIPMENT_API.md: Equipment fitment API documentation
- UNIFIED_SCHEMA.md: IndexedPart schema reference
- Prometheus docs: https://prometheus.io/docs/practices/instrumentation/
- Grafana docs: https://grafana.com/docs/grafana/latest/