Search Index Improvement
Analysis of search index data integrity issues and improvement plan for Elasticsearch sync.
Search Index Analysis & Improvement Plan
[!NOTE] Discussion: Open an issue to comment on this plan. Attach supporting documents to the issue or link them here.
Date: 2026-01-21
Status: Critical issues identified, action required
Executive Summary
The current search index has critical data integrity issues:
- Only 2,000 of 7,091 K&M tires are indexed (72% missing)
- Index contains stale data from old syncs (KUH, HOT, HAR, KIN, MAR)
- Sync workflow is incomplete: only 5 of 10 collections are in the sync config
- No separation between parts and tires indices
Current State Analysis
MongoDB Data (crop_dev)
| Collection | Documents | Type | In Sync Config? |
|---|---|---|---|
| parts_nhl | 1 | Parts | Yes |
| parts_bns | 1,307 | Parts | Yes |
| parts_vnt | 211 | Parts | Yes |
| parts_mch | 63 | Parts | Yes |
| parts_kuh | 495 | Parts | NO |
| parts_hot | 190 | Parts | NO |
| parts_har | 147 | Parts | NO |
| parts_kin | 101 | Parts | NO |
| parts_mar | 66 | Parts | NO |
| Parts Subtotal | 2,581 | | |
| parts_kmt | 7,091 | Tires | Yes |
| Grand Total | 9,672 | | |
Elasticsearch Index (parts_current)
| Manufacturer | ES Docs | MongoDB | Delta | Issue |
|---|---|---|---|---|
| KMT | 2,000 | 7,091 | -5,091 | SYNC INCOMPLETE |
| BNS | 1,307 | 1,307 | 0 | OK |
| KUH | 447 | 495 | -48 | Stale (not in sync) |
| VNT | 211 | 211 | 0 | OK |
| HOT | 190 | 190 | 0 | Stale (not in sync) |
| HAR | 123 | 147 | -24 | Stale (not in sync) |
| KIN | 101 | 101 | 0 | Stale (not in sync) |
| MAR | 66 | 66 | 0 | Stale (not in sync) |
| MCH | 63 | 63 | 0 | OK |
| NHL | 0 | 1 | -1 | MISSING |
| Total | 4,508 | 9,672 | -5,164 | |
Critical Issues
Issue #1: K&M Tire Sync Incomplete (CRITICAL)
- Expected: 7,091 tires
- Actual: 2,000 tires (72% missing)
- Root Cause: Unknown — likely Cloud Run job timeout or memory issue
Issue #2: Stale Data in Index
- Collections KUH, HOT, HAR, KIN, MAR exist in ES from OLD syncs
- These are NOT in current sync config, so data never updates
Issue #3: Missing Collections in Sync
- Workflow only syncs 5 collections, should sync all 10
Issue #4: No Index Separation
- Parts and Tires share same index
- Need separate parts_current and tires_current indices
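One possible way to express the parts/tires split is a small routing rule that maps each MongoDB collection to its target index. This is only a sketch: the function and constant names are hypothetical, and it assumes parts_kmt remains the only tire collection, as in the MongoDB table above.

```python
# Hypothetical routing rule; names are illustrative, not taken from the sync code.
# Assumption: parts_kmt is the only collection holding tires.
TIRE_COLLECTIONS = {"parts_kmt"}


def target_index(collection: str) -> str:
    """Return the index a MongoDB collection would be synced into."""
    if collection in TIRE_COLLECTIONS:
        return "tires_current"
    return "parts_current"


for name in ("parts_kmt", "parts_bns", "parts_kuh"):
    print(name, "->", target_index(name))
```

Keeping the rule in one place means a new tire collection only needs to be added to the set, not to every sync job.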
Issue #5: No Incremental Sync
- Full re-sync on every deployment, no delta/change detection
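For illustration, delta detection could be built on a stored watermark: each run asks MongoDB only for documents modified after the previous run, then advances the watermark. The sketch below assumes every document carries an updatedAt timestamp; the helper names are hypothetical, and `$gt` is the standard MongoDB comparison operator.

```python
from datetime import datetime, timezone


def delta_filter(last_sync: datetime) -> dict:
    """MongoDB query filter matching documents modified after last_sync."""
    return {"updatedAt": {"$gt": last_sync}}


def next_watermark(docs: list, last_sync: datetime) -> datetime:
    """Newest updatedAt seen in this batch, or the old watermark if the batch is empty."""
    return max([d["updatedAt"] for d in docs] + [last_sync])


# Example values only; real watermarks would be persisted between runs.
last_sync = datetime(2026, 1, 20, tzinfo=timezone.utc)
batch = [
    {"_id": "a", "updatedAt": datetime(2026, 1, 21, 8, 0, tzinfo=timezone.utc)},
    {"_id": "b", "updatedAt": datetime(2026, 1, 21, 9, 30, tzinfo=timezone.utc)},
]

print(delta_filter(last_sync))
print(next_watermark(batch, last_sync))
```

A timestamp watermark does not detect deletions, so a periodic full re-sync or a separate delete signal would still be needed.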
Recommendations
Immediate Actions (P0)
- Fix K&M sync — increase Cloud Run job memory to 4Gi, timeout to 3600s
- Add all collections to sync — update search-deploy.yml
- Create fresh index — delete stale data, run full sync
Short-term (P1)
- Separate indices for parts vs tires
- Add sync validation — compare MongoDB vs ES counts after sync
- Add monitoring — Datadog metric for index doc count
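The sync validation step could be as simple as comparing per-manufacturer counts and failing the job on any mismatch. The sketch below is a minimal version that takes the counts as plain dictionaries (how they are fetched from MongoDB and Elasticsearch is left out) and uses the figures from the tables above; the function name is hypothetical.

```python
def compare_counts(mongo_counts: dict, es_counts: dict) -> dict:
    """Return {key: (es, mongo, delta)} for every key whose counts differ."""
    mismatches = {}
    for key in sorted(set(mongo_counts) | set(es_counts)):
        mongo = mongo_counts.get(key, 0)
        es = es_counts.get(key, 0)
        if es != mongo:
            mismatches[key] = (es, mongo, es - mongo)
    return mismatches


# Counts copied from the Current State Analysis tables.
mongo_counts = {"KMT": 7091, "BNS": 1307, "KUH": 495, "VNT": 211, "HOT": 190,
                "HAR": 147, "KIN": 101, "MAR": 66, "MCH": 63, "NHL": 1}
es_counts = {"KMT": 2000, "BNS": 1307, "KUH": 447, "VNT": 211, "HOT": 190,
             "HAR": 123, "KIN": 101, "MAR": 66, "MCH": 63, "NHL": 0}

mismatches = compare_counts(mongo_counts, es_counts)
for key, (es, mongo, delta) in mismatches.items():
    print(f"{key}: ES={es} MongoDB={mongo} delta={delta}")
```

A count match does not prove the documents are current (HOT, KIN, and MAR match today but are stale), so this check complements rather than replaces fixing the sync config.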
Medium-term (P2)
- Incremental sync: track updatedAt, only sync changes
- Blue-green index deployment: validate before switching alias
- Separate sync jobs — independent jobs for parts and tires
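As a rough sketch of the blue-green step, the cutover can be reduced to two parts: refuse to switch unless the new index passes validation, then build a single request body for Elasticsearch's `_aliases` endpoint so the remove and add are applied together. This assumes parts_current is (or becomes) an alias rather than a concrete index; the versioned index names and function names are hypothetical.

```python
from typing import Optional


def alias_swap_actions(alias: str, old_index: Optional[str], new_index: str) -> dict:
    """Build a request body for POST /_aliases that repoints alias to new_index."""
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


def plan_cutover(alias: str, old_index: Optional[str], new_index: str,
                 expected_docs: int, actual_docs: int) -> dict:
    """Return the alias swap body only if the new index has the expected doc count."""
    if actual_docs != expected_docs:
        raise ValueError(
            f"{new_index}: expected {expected_docs} docs, found {actual_docs}; alias not switched"
        )
    return alias_swap_actions(alias, old_index, new_index)


# Hypothetical index names; 2,581 is the parts subtotal from the MongoDB table.
body = plan_cutover("parts_current", "parts_v1", "parts_v2", 2581, 2581)
print(body)
```

If validation fails, the alias keeps pointing at the old index, so search continues to serve the previous data while the sync is investigated.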
Success Criteria
After Phase 1:
- ES doc count = 9,672
- KMT docs = 7,091
- Parts docs = 2,581
After Phase 2:
- Separate indices: parts_current, tires_current
- Independent sync jobs
- Validation step passes
Related
- Search Service — Search Service overview and current architecture