Search Index Analysis & Improvement Plan
Date: 2026-01-21 Status: Critical issues identified, action required
Search Index Analysis & Improvement Plan
Date: 2026-01-21 Status: Critical issues identified, action required
Executive Summary
The current search index has critical data integrity issues:
- Only 2,000 of 7,091 K&M tires are indexed (72% missing)
- Index contains stale data from old syncs
- Sync workflow is incomplete - missing collections
- No separation between parts and tires indices
1. Current State Analysis
1.1 MongoDB Data (crop_dev)
| Collection | Documents | Type | In Sync Config? |
|---|---|---|---|
| parts_nhl | 1 | Parts | Yes |
| parts_bns | 1,307 | Parts | Yes |
| parts_vnt | 211 | Parts | Yes |
| parts_mch | 63 | Parts | Yes |
| parts_kuh | 495 | Parts | NO |
| parts_hot | 190 | Parts | NO |
| parts_har | 147 | Parts | NO |
| parts_kin | 101 | Parts | NO |
| parts_mar | 66 | Parts | NO |
| Parts Subtotal | 2,581 | ||
| parts_kmt | 7,091 | Tires | Yes |
| Grand Total | 9,672 |
1.2 Elasticsearch Index (parts_current → parts_v2026_01_20)
| Manufacturer | ES Docs | MongoDB | Delta | Issue |
|---|---|---|---|---|
| KMT | 2,000 | 7,091 | -5,091 | SYNC INCOMPLETE |
| BNS | 1,307 | 1,307 | 0 | OK |
| KUH | 447 | 495 | -48 | Stale (not in sync) |
| VNT | 211 | 211 | 0 | OK |
| HOT | 190 | 190 | 0 | Stale (not in sync) |
| HAR | 123 | 147 | -24 | Stale (not in sync) |
| KIN | 101 | 101 | 0 | Stale (not in sync) |
| MAR | 66 | 66 | 0 | Stale (not in sync) |
| MCH | 63 | 63 | 0 | OK |
| NHL | 0 | 1 | -1 | MISSING |
| Total | 4,508 | 9,672 | -5,164 |
1.3 Current Sync Configuration
File: .github/workflows/search-deploy.yml (line 389)
bash services/search/scripts/gcp-sync-data.sh "parts_nhl,parts_bns,parts_vnt,parts_mch,parts_kmt"Missing collections: parts_kuh,parts_hot,parts_har,parts_kin,parts_mar
2. Critical Issues Identified
Issue #1: K&M Tire Sync Incomplete (CRITICAL)
- Expected: 7,091 tires
- Actual: 2,000 tires
- Missing: 5,091 tires (72%)
- Root Cause: Unknown - likely Cloud Run job timeout or memory issue
- Impact: Website shows only 28% of available tires
Issue #2: Stale Data in Index
- Collections KUH, HOT, HAR, KIN, MAR exist in ES from OLD syncs
- These are NOT in current sync config, so data never updates
- Impact: Outdated product information
Issue #3: Missing Collections in Sync
- Workflow only syncs 5 collections
- Should sync all 10 parts collections
- Impact: Inconsistent data, some manufacturers missing
Issue #4: No Index Separation
- Parts and Tires share same index
- User wants:
parts_current→ ~2,500 parts onlytires_current→ 7,091 K&M tires
- Impact: Cannot manage catalogs independently
Issue #5: No Incremental Sync
- Full re-sync on every deployment
- No delta/change detection
- Impact: Slow deployments, potential data loss during sync
3. Constructive Criticism
What We Did Wrong
-
Incomplete testing of K&M sync
- Added K&M to sync without verifying full dataset syncs
- Should have tested with production data volume locally
-
No validation after deployment
- Didn't check document counts after sync completed
- Health endpoint showed 4,508 docs but we didn't verify this was correct
-
Mixed old and new data
- Index has data from multiple syncs with different collection sets
- Should have created fresh index or cleaned up old data
-
No monitoring/alerting
- No alerts for sync failures or document count mismatches
- Should have Datadog/Sentry alerts for data integrity
-
Documentation gap
- DATA_FLOWS.md created but didn't catch these issues
- Need runbooks for common sync problems
What Worked Well
- Transformer registration for KMT worked correctly
- Manufacturer aliases configured properly
- Website correctly displays data from ES
- GitHub Actions workflow structure is solid
4. Recommendations
Immediate Actions (P0 - Today)
-
Fix K&M sync - full 7,091 tires
# Increase Cloud Run job memory and timeout --memory=4Gi --task-timeout=3600s # 1 hour -
Add all collections to sync
# Update search-deploy.yml line 389 bash services/search/scripts/gcp-sync-data.sh \ "parts_nhl,parts_bns,parts_vnt,parts_mch,parts_kuh,parts_hot,parts_har,parts_kin,parts_mar,parts_kmt" -
Create fresh index
- Delete stale
parts_v2026_01_20 - Run full sync to new
parts_v2026_01_21 - Switch alias atomically
- Delete stale
Short-term Improvements (P1 - This Week)
-
Separate indices for parts vs tires
parts_current → parts_v{date} (2,581 docs) tires_current → tires_v{date} (7,091 docs) -
Add sync validation
- Compare MongoDB count vs ES count after sync
- Fail deployment if delta > 1%
-
Add monitoring
- Datadog metric:
search.index.doc_count - Alert if count drops by > 5%
- Datadog metric:
Medium-term Improvements (P2 - This Month)
-
Incremental sync
- Track
updatedAtin MongoDB - Only sync changed documents
- Reduces sync time from 30min to ~5min
- Track
-
Blue-green index deployment
- Create new index alongside old
- Validate new index before switching alias
- Keep old index for instant rollback
-
Separate sync jobs
sync-partsjob for DIS collectionssync-tiresjob for K&M collection- Can run independently
5. Implementation Plan
Phase 1: Emergency Fix (2 hours)
graph LR
A[Update sync config] --> B[Increase job resources]
B --> C[Trigger manual sync]
C --> D[Validate counts]
D --> E[Switch alias]Steps:
- Update
search-deploy.ymlwith all collections - Increase Cloud Run job memory to 4Gi
- Manually trigger workflow
- Verify ES doc count = 9,672
- Verify website shows correct data
Phase 2: Index Separation (1 day)
New architecture:
MongoDB (crop_dev)
├── parts_* (9 collections) → parts_current alias
└── parts_kmt (tires) → tires_current alias
Elasticsearch
├── parts_current → parts_v{date}
│ └── 2,581 documents (DIS parts only)
└── tires_current → tires_v{date}
└── 7,091 documents (K&M tires only)Required changes:
- Create
tires_currentalias and index - Modify sync script to support multiple target indices
- Update search service to query correct index based on category
- Update website routing
Phase 3: Monitoring & Validation (1 day)
- Add post-sync validation step in workflow
- Create Datadog dashboard for index health
- Add Slack alerts for sync failures
- Create runbook for common issues
6. Refactoring Needs
sync-mongodb-to-es.ts
Current issues:
- Single index target hardcoded
- No progress reporting to CI
- Silent failures possible
Proposed changes:
// Add index targeting
const INDEX_TARGETS = {
parts_kmt: 'tires_current',
default: 'parts_current'
};
// Add validation
async function validateSync(collection: string, expected: number) {
const actual = await es.count({ index: getIndex(collection) });
if (Math.abs(actual - expected) / expected > 0.01) {
throw new Error(`Sync validation failed: expected ${expected}, got ${actual}`);
}
}
// Add CI-friendly output
console.log(`::set-output name=docs_synced::${total}`);gcp-sync-data.sh
Current issues:
- Fixed memory allocation
- No retry on failure
- Log parsing fragile
Proposed changes:
# Dynamic resource allocation based on collection size
if [[ "$COLLECTION" == *"kmt"* ]]; then
MEMORY="4Gi"
TIMEOUT="3600s"
else
MEMORY="2Gi"
TIMEOUT="1800s"
fi
# Retry logic
MAX_RETRIES=3
for i in $(seq 1 $MAX_RETRIES); do
if gcloud run jobs execute ...; then
break
fi
echo "Retry $i of $MAX_RETRIES"
sleep 30
done7. Success Criteria
After implementing Phase 1:
- ES doc count = 9,672
- KMT docs = 7,091
- Parts docs = 2,581
- Website /parts/tires shows 7,091 tires
- Website /parts/catalog shows 2,581 parts
After implementing Phase 2:
- Separate indices: parts_current, tires_current
- Independent sync jobs
- Validation step passes
After implementing Phase 3:
- Monitoring dashboard live
- Alerts configured
- Runbook documented
8. Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Sync timeout again | Medium | High | Increase resources, add retry |
| Data loss during reindex | Low | Critical | Blue-green deployment |
| Website downtime | Low | High | Alias atomic switch |
| Memory exhaustion | Medium | Medium | Monitor and tune |
Appendix: Commands
Check current state
# ES document count
curl -s "https://search-service-atife5uvka-ue.a.run.app/health" | jq '.docCount_es'
# ES by manufacturer
curl -s "https://search-service-atife5uvka-ue.a.run.app/api/filters" | jq '.facets.manufacturer'
# MongoDB counts (via MCP)
# Use mcp__mongodb__count toolTrigger manual sync
gh workflow run search-deploy.yml
gh run watchCheck sync job logs
gcloud logging read \
"resource.type=cloud_run_job AND resource.labels.job_name=sync-data-auto" \
--limit=100 \
--project="noted-bliss-466410-q6"