Data Model Consolidation: MongoDB as Single Database
Detailed data model for consolidating 3 databases (MongoDB + Elasticsearch + Weaviate) into a single MongoDB Atlas cluster with Atlas Search and Vector Search.
Current Architecture: 3 Databases
graph TD
subgraph Services
Agents["CROP-Agents"]
RAG["CROP-RAG-Embedding"]
Search["CROP-RAG-Search"]
SEO["CROP-AI-SEO"]
PDF["CROP-pdf-parser"]
Delivery["delivery"]
end
subgraph MongoDB["MongoDB Atlas"]
parts_m["parts"]
fitment["equipment_fitment"]
convos["conversations"]
vapi["vapi_calls"]
pl_docs["pl_documents"]
pl_pages["pl_document_pages"]
shipments["shipments"]
rates["rate_queries"]
tracking["tracking"]
end
subgraph ES["Elasticsearch"]
parts_es["parts_current<br/><small>mirrors MongoDB parts</small>"]
manual_es["manual_parts"]
end
subgraph Weaviate
doc_chunk["DocumentChunk<br/><small>1024d BGE</small>"]
part_doc["PartDocument"]
part_img["PartImage<br/><small>CLIP</small>"]
prod_img["ProductImages<br/><small>512d</small>"]
prod_emb["ProductEmbedding<br/><small>512d</small>"]
conv_turns["ConversationTurns<br/><small>1024d</small>"]
end
Agents --> parts_m & parts_es & doc_chunk & prod_emb & conv_turns & fitment & convos & vapi
RAG --> parts_m & manual_es & doc_chunk
Search --> doc_chunk
SEO --> parts_m & prod_img
PDF --> parts_m & part_doc & part_img & manual_es & pl_docs & pl_pages
Delivery --> shipments & rates & tracking
MongoDB Atlas (crop_prod / crop_stage)
| Collection | Used By | Purpose |
|---|---|---|
| parts | Agents, AI-SEO, RAG-Embedding, pdf-parser | Product catalog (partNumber, title, pricing, inventory, manufacturer, media, ai_seo, equipmentFitment) |
| equipment_fitment | Agents | Parts-to-equipment compatibility mapping |
| conversations | Agents | Chatbot conversation history |
| vapi_calls | Agents | Voice API call logs |
| pl_documents | parse-pdf-api | PDF document metadata (documentId, provider, modelNumber, processingStatus) |
| pl_document_pages | parse-pdf-api | PDF page-level data (documentId, pageNumber, pageType, textContent) |
| shipments | delivery | Shipment records (tracking, addresses, packages, cost) |
| rate_queries | delivery | Delivery rate calculation queries |
| tracking | delivery | Package tracking events |
Elasticsearch (34.138.41.64:9200)
| Index | Used By | Purpose |
|---|---|---|
| parts_current | Agents | Full-text search (title^3, partNumber^4, sku^2, description^2, manufacturer^2). Fuzzy matching + brand/stock filters. Mirrors MongoDB parts. |
| manual_parts | Agents, RAG-Embedding | Part numbers from PDFs (part_number, provider, pdf_name, page_number, context) |
Weaviate (34.59.145.247:8080)
| Collection | Used By | Dims | Purpose |
|---|---|---|---|
| DocumentChunk | RAG-Embedding, Agents, RAG-Search | 1024 | RAG chunks from PDF manuals. Hybrid search (BM25 + vector). |
| PartDocument | pdf-parser | varies | Parsed PDF content |
| PartImage | RAG-Embedding, pdf-parser | CLIP | Images from PDFs |
| ProductImages | AI-SEO, Agents | 512 | Product image embeddings for visual similarity |
| ProductEmbedding | Agents | 512 | Semantic product search |
| ConversationTurns | Agents | 1024 | Semantic retrieval of conversation history |
Vector config: HNSW index, cosine distance, ef=100, efConstruction=128, maxConnections=64
Embeddings: BGE-large-en-v1.5 (1024d), CLIP-ViT (512d)
Target Architecture: Single MongoDB with Atlas Search + Vector Search
graph TD
subgraph Services
Agents["CROP-Agents"]
RAG["CROP-RAG-Embedding"]
Search["CROP-RAG-Search"]
SEO["CROP-AI-SEO"]
PDF["CROP-pdf-parser"]
Delivery["delivery"]
end
subgraph MongoDB["MongoDB Atlas — single cluster"]
parts["parts<br/><small>+ Atlas Search + Vector Search (512d)</small>"]
chunks["document_chunks<br/><small>+ Vector Search (1024d) + Atlas Search</small>"]
manual["manual_parts<br/><small>+ Atlas Search</small>"]
images["part_images<br/><small>+ Vector Search (CLIP)</small>"]
convos["conversations<br/><small>+ Vector Search (1024d)</small>"]
fitment["equipment_fitment"]
pl_docs["pl_documents"]
pl_pages["pl_document_pages"]
shipments["shipments"]
rates["rate_queries"]
tracking["tracking"]
vapi["vapi_calls"]
end
Agents --> parts & chunks & manual & convos & fitment & vapi
RAG --> parts & chunks & manual
Search --> chunks
SEO --> parts
PDF --> parts & chunks & images & manual & pl_docs & pl_pages
Delivery --> shipments & rates & tracking
Collection Schemas
parts (existing, extended)
Adds embedding_512 and image_embeddings fields to the existing collection.
{
"_id": "ObjectId",
"partNumber": "CT-12345 (unique index)",
"sku": "CT12345",
"title": "Hydraulic Pump Assembly",
"slug": "hydraulic-pump-assembly-ct-12345",
"description": { "text": "High-performance hydraulic pump..." },
"pricing": {
"msrp": 249.99,
"dis": { "price": 199.99 }
},
"inventory": { "inStock": true, "availability": "In Stock" },
"manufacturer": { "name": "Parker Hannifin" },
"category": { "main": "Hydraulics", "sub": "Pumps" },
"media": {
"primaryImage": "https://media.crop.com/...",
"images": [{ "url": "https://media.crop.com/..." }]
},
"equipmentFitment": ["John Deere 310SL", "CAT 420F2"],
"youtube": { "videos": [] },
"ai_seo": {
"short_description": "...",
"product_description": "...",
"seo_keywords": "hydraulic pump, parker hannifin, 310SL"
},
"embedding_512": "[512-dim float array — replaces Weaviate ProductEmbedding]",
"image_embeddings": [
{
"image_index": 0,
"embedding_512": "[512-dim CLIP float array — replaces Weaviate ProductImages]"
}
]
}
Indexes:
| Type | Fields | Replaces |
|---|---|---|
| Atlas Search | title, partNumber, sku, description.text, manufacturer.name, `category.*`, `ai_seo.*` | ES parts_current |
| Atlas Vector Search | embedding_512 (cosine, 512d) | Weaviate ProductEmbedding |
| Atlas Vector Search | image_embeddings.embedding_512 (cosine, 512d) | Weaviate ProductImages |
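As a rough illustration of how the ES parts_current query could map onto the Atlas Search index above, the sketch below builds a `$search` pipeline with the same boost weights, fuzzy matching, and brand/stock filters. The index name `parts_search`, the function name, and the `maxEdits` value are assumptions made for the example, not names taken from the existing services.

```python
from typing import Any, Optional

# Boosts copied from the ES parts_current description
# (title^3, partNumber^4, sku^2, description^2, manufacturer^2),
# mapped onto the field paths listed for the Atlas Search index.
FIELD_BOOSTS = {
    "title": 3,
    "partNumber": 4,
    "sku": 2,
    "description.text": 2,
    "manufacturer.name": 2,
}


def build_parts_search_pipeline(
    query: str,
    brand: Optional[str] = None,
    in_stock_only: bool = False,
    limit: int = 20,
    index_name: str = "parts_search",  # hypothetical index name
) -> list[dict[str, Any]]:
    """Build a $search aggregation pipeline for the parts collection."""
    # One boosted, fuzzy text clause per searchable field.
    should = [
        {
            "text": {
                "query": query,
                "path": path,
                "fuzzy": {"maxEdits": 1},  # assumed edit distance
                "score": {"boost": {"value": boost}},
            }
        }
        for path, boost in FIELD_BOOSTS.items()
    ]

    # Filters restrict the result set without affecting the score.
    filters: list[dict[str, Any]] = []
    if brand:
        filters.append({"phrase": {"query": brand, "path": "manufacturer.name"}})
    if in_stock_only:
        filters.append({"equals": {"path": "inventory.inStock", "value": True}})

    compound: dict[str, Any] = {"should": should, "minimumShouldMatch": 1}
    if filters:
        compound["filter"] = filters

    return [
        {"$search": {"index": index_name, "compound": compound}},
        {"$limit": limit},
        {"$project": {"partNumber": 1, "title": 1, "score": {"$meta": "searchScore"}}},
    ]
```

With pymongo, the result would be passed to `db.parts.aggregate(...)`. Building the pipeline as plain data keeps it testable without a live cluster.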
document_chunks (replaces Weaviate DocumentChunk + PartDocument)
{
"_id": "ObjectId",
"document_id": "doc_abc123",
"chunk_index": 3,
"text": "The hydraulic pump assembly (CT-12345) requires...",
"provider_name": "Parker Hannifin",
"pdf_name": "PH-300-manual.pdf",
"page_number": 42,
"page_type": "parts_list",
"chunk_type": "table_row",
"part_numbers": ["CT-12345", "CT-12346"],
"media_urls": ["gs://crop-pdfs/PH-300/page42-fig1.png"],
"gcs_path": "gs://crop-pdfs/PH-300-manual.pdf",
"pdf_hash": "sha256:abc123...",
"embedding_1024": "[1024-dim BGE float array]",
"ingestion_timestamp": "2025-01-15T10:30:00Z"
}
Indexes:
| Type | Fields | Replaces |
|---|---|---|
| Atlas Vector Search | embedding_1024 (cosine, 1024d) | Weaviate DocumentChunk vector |
| Atlas Search | text (full-text) | Weaviate DocumentChunk BM25 |
| Compound unique | { document_id: 1, chunk_index: 1 } | — |
| Filter | { provider_name: 1 } | — |
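The vector index and filter rows above might translate into a definition and query along these lines. This is a minimal sketch: the index name `document_chunks_vector`, the function name, and the `numCandidates` default are illustrative assumptions, and `provider_name` is declared as a filter field inside the vector index so it can be used for pre-filtering in `$vectorSearch`.

```python
from typing import Any, Optional, Sequence

EMBEDDING_DIMS = 1024  # BGE-large-en-v1.5, per the schema above

# Definition for an index of type "vectorSearch" on document_chunks.
DOCUMENT_CHUNKS_VECTOR_INDEX: dict[str, Any] = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding_1024",
            "numDimensions": EMBEDDING_DIMS,
            "similarity": "cosine",
        },
        {"type": "filter", "path": "provider_name"},
    ]
}


def build_chunk_vector_pipeline(
    query_vector: Sequence[float],
    provider_name: Optional[str] = None,
    limit: int = 10,
    num_candidates: int = 200,  # assumed default; tune against recall
    index_name: str = "document_chunks_vector",  # hypothetical index name
) -> list[dict[str, Any]]:
    """Build a $vectorSearch pipeline over document_chunks.embedding_1024."""
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS} dims, got {len(query_vector)}")
    if num_candidates < limit:
        raise ValueError("num_candidates must be >= limit")

    stage: dict[str, Any] = {
        "index": index_name,
        "path": "embedding_1024",
        "queryVector": list(query_vector),
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if provider_name is not None:
        # Pre-filter on the field declared as "filter" in the index definition.
        stage["filter"] = {"provider_name": {"$eq": provider_name}}

    return [
        {"$vectorSearch": stage},
        {
            "$project": {
                "document_id": 1,
                "chunk_index": 1,
                "text": 1,
                "page_number": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

The projection deliberately leaves out `embedding_1024`, so the 1024-float array is not shipped back to the caller with every hit.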
manual_parts (replaces ES manual_parts index)
{
"_id": "ObjectId",
"part_number": "ct-12345",
"provider": "Parker Hannifin",
"pdf_name": "PH-300-manual.pdf",
"document_id": "doc_abc123",
"page_number": 42,
"context": "Item 7 — Hydraulic Pump Assembly, replaces CT-12344",
"gcs_path": "gs://crop-pdfs/PH-300-manual.pdf",
"page_type": "parts_list"
}
Indexes:
| Type | Fields | Replaces |
|---|---|---|
| Standard | { part_number: 1 } | ES lowercase_ascii normalizer |
| Atlas Search | part_number, context | ES bool/should query |
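A standard MongoDB index does not normalize values the way an ES normalizer does, so one option is to normalize `part_number` in application code before both writes and lookups. The sketch below assumes the ES `lowercase_ascii` normalizer amounted to lowercasing plus ASCII folding (inferred from its name, not confirmed by the source); the function name is hypothetical.

```python
import unicodedata


def normalize_part_number(raw: str) -> str:
    """Lowercase and ASCII-fold a part number for exact-match lookups.

    Approximates ES lowercase + asciifolding. NFKD decomposition only strips
    combining marks, so it covers accented Latin letters but not every
    character that the ES asciifolding filter handles.
    """
    decomposed = unicodedata.normalize("NFKD", raw.strip())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()
```

Applied consistently, a query such as `{"part_number": normalize_part_number(user_input)}` can use the standard `{ part_number: 1 }` index directly, which matches the lowercase `"ct-12345"` value in the sample document.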
part_images (replaces Weaviate PartImage)
{
"_id": "ObjectId",
"source_pdf": "PH-300-manual.pdf",
"page": 42,
"image_type": "exploded_diagram",
"bbox": { "x": 50.0, "y": 120.0, "w": 400.0, "h": 300.0 },
"width": 800,
"height": 600,
"format": "png",
"gcs_path": "gs://crop-pdfs/PH-300/page42-fig1.png",
"part_number": "CT-12345",
"item_number": "7",
"related_document_ids": ["ObjectId(...)"],
"embedding": "[CLIP float array]"
}
Indexes:
| Type | Fields | Replaces |
|---|---|---|
| Atlas Vector Search | embedding (cosine) | Weaviate PartImage |
| Compound | { source_pdf: 1, page: 1 } | — |
conversations (existing, extended)
Adds embedding_1024 to each turn, replacing the separate Weaviate ConversationTurns collection.
{
"_id": "ObjectId",
"session_id": "sess_xyz789",
"turns": [
{
"role": "user",
"content": "What hydraulic pump fits a John Deere 310SL?",
"agent_id": "parts-agent",
"timestamp": "2025-01-15T10:30:00Z",
"embedding_1024": "[1024-dim float array]"
},
{
"role": "assistant",
"content": "The CT-12345 Hydraulic Pump Assembly is compatible...",
"agent_id": "parts-agent",
"timestamp": "2025-01-15T10:30:05Z",
"embedding_1024": "[1024-dim float array]"
}
]
}
Unchanged Collections
These collections are already MongoDB-only and require no changes:
- pl_documents: PDF document metadata
- pl_document_pages: PDF page-level data
- shipments: shipment records
- rate_queries: delivery rate queries
- tracking: package tracking
- vapi_calls: voice API logs
- equipment_fitment: parts-to-equipment mapping
Migration Mapping
flowchart LR
subgraph "Decommission"
ES_parts["ES parts_current"]
ES_manual["ES manual_parts"]
W_DocChunk["Weaviate DocumentChunk"]
W_PartDoc["Weaviate PartDocument"]
W_PartImg["Weaviate PartImage"]
W_ProdEmb["Weaviate ProductEmbedding"]
W_ProdImg["Weaviate ProductImages"]
W_ConvTrn["Weaviate ConversationTurns"]
end
subgraph "MongoDB Atlas"
M_parts["parts<br/>+ Atlas Search<br/>+ embedding_512<br/>+ image_embeddings"]
M_chunks["document_chunks<br/>+ embedding_1024"]
M_manual["manual_parts<br/>+ Atlas Search"]
M_images["part_images<br/>+ CLIP embedding"]
M_convos["conversations<br/>+ turn embeddings"]
end
ES_parts -->|"Atlas Search replaces<br/>full-text"| M_parts
W_ProdEmb -->|"embedding_512<br/>field"| M_parts
W_ProdImg -->|"image_embeddings<br/>array"| M_parts
W_DocChunk -->|"unified schema"| M_chunks
W_PartDoc -->|"merged"| M_chunks
ES_manual -->|"Atlas Search<br/>replaces ES"| M_manual
W_PartImg -->|"CLIP vectors"| M_images
W_ConvTrn -->|"embedded in<br/>turns array"| M_convos
| Source | Target | Strategy |
|---|---|---|
| ES parts_current | Atlas Search on parts | Full-text search with fuzzy, boost weights, brand/stock filters |
| ES manual_parts | MongoDB manual_parts + Atlas Search | Keyword + full-text on part_number and context |
| Weaviate DocumentChunk | document_chunks + Vector Search | $vectorSearch (cosine, 1024d) + Atlas Search for hybrid |
| Weaviate PartDocument | Merged into document_chunks | Same data, unified schema |
| Weaviate PartImage | part_images + Vector Search | CLIP vectors stored in Mongo |
| Weaviate ProductEmbedding | parts.embedding_512 | Vector Search on parts collection |
| Weaviate ProductImages | parts.image_embeddings | Vector Search on nested field |
| Weaviate ConversationTurns | conversations.turns | Vector Search on nested field |
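The DocumentChunk row calls for combining `$vectorSearch` with Atlas Search to reproduce hybrid retrieval. One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below in application code. RRF is a stand-in chosen for illustration and will not necessarily reproduce Weaviate's hybrid scoring; the function name and the constant `k = 60` are assumptions.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,  # conventional damping constant; an assumption here
) -> list[tuple[Hashable, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Example: IDs returned by a vector query and a full-text query.
vector_hits = ["a", "b", "c"]
text_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, text_hits])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Documents that appear in both lists ("a" and "b") rise above those found by only one retriever, which is the behavior hybrid search is meant to provide.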
Benefits
- 1 database instead of 3: less infrastructure, fewer failure points
- Atomic operations: vectors and metadata live in one document, so no cross-database sync is needed
- Atlas Search replaces ES: fuzzy matching, boost weights, and aggregations are all built in
- Atlas Vector Search replaces Weaviate: cosine similarity over an HNSW index, with comparable performance expected at current scale
- Single connection string for all services
- No data duplication: the ES parts_current index was already a mirror of the MongoDB parts collection
Infrastructure to Decommission
| Resource | Address | Action |
|---|---|---|
| Elasticsearch VM | 34.138.41.64:9200 | Shut down after migration validation |
| Weaviate VM (primary) | 34.59.145.247:8080 | Shut down after migration validation |
| Weaviate VM (secondary) | 34.139.6.131:8080 | Shut down after migration validation |
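"Migration validation" is not defined in this document. One possible check, sketched here under that caveat, is to run the same query against the old and new backends and measure how much the top-k result IDs overlap. The function name, the default `k`, and any pass threshold are assumptions for illustration.

```python
from typing import Hashable, Sequence


def topk_overlap(
    old_ids: Sequence[Hashable],
    new_ids: Sequence[Hashable],
    k: int = 10,
) -> float:
    """Fraction of the old backend's top-k IDs also present in the new top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    old_top = set(old_ids[:k])
    new_top = set(new_ids[:k])
    if not old_top:
        # Nothing to compare against: agree only if the new side is empty too.
        return 1.0 if not new_top else 0.0
    return len(old_top & new_top) / len(old_top)
```

Averaging this score over a sample of real queries gives a single number to track per index before each VM is shut down; order within the top k is ignored on purpose, since the two engines score differently.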
Services That Need Code Changes
| Service | Current DB Clients | Changes Needed |
|---|---|---|
| CROP-Agents | pymongo + elasticsearch + weaviate | Replace ES/Weaviate calls with Atlas Search/Vector Search |
| CROP-RAG-Embedding | motor + weaviate + elasticsearch | Replace Weaviate ingestion + ES manual_parts indexer |
| CROP-RAG-Search | weaviate | Full rewrite to MongoDB Vector Search |
| CROP-AI-SEO | motor + weaviate | Replace Weaviate ProductImages with MongoDB |
| CROP-pdf-parser | pymongo + weaviate + elasticsearch | Replace Weaviate/ES clients with MongoDB |
| parse-pdf-api | motor | No changes (already MongoDB only) |
| delivery | pymongo | No changes (already MongoDB only) |