CROP

Data Model Consolidation: MongoDB as Single Database

This document describes the data model for consolidating three databases (MongoDB, Elasticsearch, and Weaviate) into a single MongoDB Atlas cluster that uses Atlas Search and Atlas Vector Search.

Current Architecture: 3 Databases

graph TD
    subgraph Services
        Agents["CROP-Agents"]
        RAG["CROP-RAG-Embedding"]
        Search["CROP-RAG-Search"]
        SEO["CROP-AI-SEO"]
        PDF["CROP-pdf-parser"]
        Delivery["delivery"]
    end

    subgraph MongoDB["MongoDB Atlas"]
        parts_m["parts"]
        fitment["equipment_fitment"]
        convos["conversations"]
        vapi["vapi_calls"]
        pl_docs["pl_documents"]
        pl_pages["pl_document_pages"]
        shipments["shipments"]
        rates["rate_queries"]
        tracking["tracking"]
    end

    subgraph ES["Elasticsearch"]
        parts_es["parts_current<br/><small>mirrors MongoDB parts</small>"]
        manual_es["manual_parts"]
    end

    subgraph Weaviate
        doc_chunk["DocumentChunk<br/><small>1024d BGE</small>"]
        part_doc["PartDocument"]
        part_img["PartImage<br/><small>CLIP</small>"]
        prod_img["ProductImages<br/><small>512d</small>"]
        prod_emb["ProductEmbedding<br/><small>512d</small>"]
        conv_turns["ConversationTurns<br/><small>1024d</small>"]
    end

    Agents --> parts_m & parts_es & doc_chunk & prod_emb & conv_turns & fitment & convos & vapi
    RAG --> parts_m & manual_es & doc_chunk
    Search --> doc_chunk
    SEO --> parts_m & prod_img
    PDF --> parts_m & part_doc & part_img & manual_es & pl_docs & pl_pages
    Delivery --> shipments & rates & tracking

MongoDB Atlas (crop_prod / crop_stage)

| Collection | Used By | Purpose |
| --- | --- | --- |
| parts | Agents, AI-SEO, RAG-Embedding, pdf-parser | Product catalog (partNumber, title, pricing, inventory, manufacturer, media, ai_seo, equipmentFitment) |
| equipment_fitment | Agents | Parts-to-equipment compatibility mapping |
| conversations | Agents | Chat bot conversation history |
| vapi_calls | Agents | Voice API call logs |
| pl_documents | parse-pdf-api | PDF document metadata (documentId, provider, modelNumber, processingStatus) |
| pl_document_pages | parse-pdf-api | PDF page-level data (documentId, pageNumber, pageType, textContent) |
| shipments | delivery | Shipment records (tracking, addresses, packages, cost) |
| rate_queries | delivery | Delivery rate calculation queries |
| tracking | delivery | Package tracking events |

Elasticsearch (34.138.41.64:9200)

| Index | Used By | Purpose |
| --- | --- | --- |
| parts_current | Agents | Full-text search (title^3, partNumber^4, sku^2, description^2, manufacturer^2). Fuzzy matching + brand/stock filters. Mirrors MongoDB parts. |
| manual_parts | Agents, RAG-Embedding | Part numbers from PDFs (part_number, provider, pdf_name, page_number, context) |

Weaviate (34.59.145.247:8080)

| Collection | Used By | Dims | Purpose |
| --- | --- | --- | --- |
| DocumentChunk | RAG-Embedding, Agents, RAG-Search | 1024 | RAG chunks from PDF manuals. Hybrid search (BM25 + vector). |
| PartDocument | pdf-parser | varies | Parsed PDF content |
| PartImage | RAG-Embedding, pdf-parser | CLIP | Images from PDFs |
| ProductImages | AI-SEO, Agents | 512 | Product image embeddings for visual similarity |
| ProductEmbedding | Agents | 512 | Semantic product search |
| ConversationTurns | Agents | 1024 | Semantic retrieval of conversation history |

Vector config: HNSW index, cosine distance, ef=100, efConstruction=128, maxConnections=64

Embeddings: BGE-large-en-v1.5 (1024d), CLIP-ViT (512d)


Target Architecture: 1 Database

graph TD
    subgraph Services
        Agents["CROP-Agents"]
        RAG["CROP-RAG-Embedding"]
        Search["CROP-RAG-Search"]
        SEO["CROP-AI-SEO"]
        PDF["CROP-pdf-parser"]
        Delivery["delivery"]
    end

    subgraph MongoDB["MongoDB Atlas — single cluster"]
        parts["parts<br/><small>+ Atlas Search + Vector Search (512d)</small>"]
        chunks["document_chunks<br/><small>+ Vector Search (1024d) + Atlas Search</small>"]
        manual["manual_parts<br/><small>+ Atlas Search</small>"]
        images["part_images<br/><small>+ Vector Search (CLIP)</small>"]
        convos["conversations<br/><small>+ Vector Search (1024d)</small>"]
        fitment["equipment_fitment"]
        pl_docs["pl_documents"]
        pl_pages["pl_document_pages"]
        shipments["shipments"]
        rates["rate_queries"]
        tracking["tracking"]
        vapi["vapi_calls"]
    end

    Agents --> parts & chunks & manual & convos & fitment & vapi
    RAG --> parts & chunks & manual
    Search --> chunks
    SEO --> parts
    PDF --> parts & chunks & images & manual & pl_docs & pl_pages
    Delivery --> shipments & rates & tracking

Collection Schemas

parts (existing, extended)

Adds embedding_512 and image_embeddings fields to the existing collection.

{
  "_id": "ObjectId",
  "partNumber": "CT-12345 (unique index)",
  "sku": "CT12345",
  "title": "Hydraulic Pump Assembly",
  "slug": "hydraulic-pump-assembly-ct-12345",
  "description": { "text": "High-performance hydraulic pump..." },
  "pricing": {
    "msrp": 249.99,
    "dis": { "price": 199.99 }
  },
  "inventory": { "inStock": true, "availability": "In Stock" },
  "manufacturer": { "name": "Parker Hannifin" },
  "category": { "main": "Hydraulics", "sub": "Pumps" },
  "media": {
    "primaryImage": "https://media.crop.com/...",
    "images": [{ "url": "https://media.crop.com/..." }]
  },
  "equipmentFitment": ["John Deere 310SL", "CAT 420F2"],
  "youtube": { "videos": [] },
  "ai_seo": {
    "short_description": "...",
    "product_description": "...",
    "seo_keywords": "hydraulic pump, parker hannifin, 310SL"
  },

  "embedding_512":    "[512-dim float array — replaces Weaviate ProductEmbedding]",
  "image_embeddings": [
    {
      "image_index": 0,
      "embedding_512": "[512-dim CLIP float array — replaces Weaviate ProductImages]"
    }
  ]
}
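
The two new fields could be backfilled from the Weaviate exports with one `$set` per part. The sketch below only builds the filter and update documents, so it runs without a database. Matching on `partNumber` and the helper name `build_backfill_update` are assumptions for illustration; the source does not say how Weaviate objects are keyed.

```python
from typing import Any, Sequence

EMBEDDING_DIMS = 512  # both new fields are 512-dimensional in the schema above


def _check_dims(vector: Sequence[float], label: str) -> list[float]:
    """Return the vector as a list of floats, rejecting wrong dimensions."""
    if len(vector) != EMBEDDING_DIMS:
        raise ValueError(f"{label}: expected {EMBEDDING_DIMS} dims, got {len(vector)}")
    return [float(x) for x in vector]


def build_backfill_update(
    part_number: str,
    text_embedding: Sequence[float],
    image_vectors: Sequence[Sequence[float]] = (),
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Build (filter, update) documents that add both embedding fields to one part."""
    images = [
        {"image_index": i, "embedding_512": _check_dims(vec, f"image {i}")}
        for i, vec in enumerate(image_vectors)
    ]
    update = {
        "$set": {
            "embedding_512": _check_dims(text_embedding, "embedding_512"),
            "image_embeddings": images,
        }
    }
    # Assumption: partNumber (unique index) is the join key between the stores.
    return {"partNumber": part_number}, update
```

With pymongo, the returned pair would be passed to `update_one(filter_doc, update_doc)` on the parts collection.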

Indexes:

| Type | Fields | Replaces |
| --- | --- | --- |
| Atlas Search | title, partNumber, sku, description.text, manufacturer.name, category.*, ai_seo.* | ES parts_current |
| Atlas Vector Search | embedding_512 (cosine, 512d) | Weaviate ProductEmbedding |
| Atlas Vector Search | image_embeddings.embedding_512 (cosine, 512d) | Weaviate ProductImages |
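
As a rough illustration, the index rows above might translate into definitions like the following, written as plain Python dictionaries so they can be reviewed before being sent to Atlas. The `inventory.inStock` mapping is an assumption added to support the stock filter mentioned for ES. Whether Atlas can index vectors stored inside an array of subdocuments (`image_embeddings.embedding_512`) is not confirmed by this document and should be checked against the Atlas documentation.

```python
from typing import Any, Sequence


def _string() -> dict[str, str]:
    """Fresh mapping for an analyzed string field."""
    return {"type": "string"}


def parts_search_index_definition() -> dict[str, Any]:
    """Atlas Search mapping covering the fields listed in the index table."""
    return {
        "mappings": {
            "dynamic": False,
            "fields": {
                "title": _string(),
                "partNumber": _string(),
                "sku": _string(),
                "description": {"type": "document", "fields": {"text": _string()}},
                "manufacturer": {"type": "document", "fields": {"name": _string()}},
                # category.* and ai_seo.*: index every subfield dynamically.
                "category": {"type": "document", "dynamic": True},
                "ai_seo": {"type": "document", "dynamic": True},
                # Assumption: needed only if the stock filter is kept.
                "inventory": {
                    "type": "document",
                    "fields": {"inStock": {"type": "boolean"}},
                },
            },
        }
    }


def vector_index_definition(
    path: str, dims: int, filter_paths: Sequence[str] = ()
) -> dict[str, Any]:
    """Atlas Vector Search definition: one cosine vector field plus optional filters."""
    fields: list[dict[str, Any]] = [
        {"type": "vector", "path": path, "numDimensions": dims, "similarity": "cosine"}
    ]
    fields += [{"type": "filter", "path": p} for p in filter_paths]
    return {"fields": fields}


product_embedding_index = vector_index_definition("embedding_512", 512)
```

In recent pymongo versions these dictionaries can be wrapped in `pymongo.operations.SearchIndexModel` (with `type="vectorSearch"` for the vector definition) and passed to `Collection.create_search_index`; the index names are left to the caller.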

document_chunks (replaces Weaviate DocumentChunk + PartDocument)

{
  "_id": "ObjectId",
  "document_id": "doc_abc123",
  "chunk_index": 3,
  "text": "The hydraulic pump assembly (CT-12345) requires...",
  "provider_name": "Parker Hannifin",
  "pdf_name": "PH-300-manual.pdf",
  "page_number": 42,
  "page_type": "parts_list",
  "chunk_type": "table_row",
  "part_numbers": ["CT-12345", "CT-12346"],
  "media_urls": ["gs://crop-pdfs/PH-300/page42-fig1.png"],
  "gcs_path": "gs://crop-pdfs/PH-300-manual.pdf",
  "pdf_hash": "sha256:abc123...",
  "embedding_1024": "[1024-dim BGE float array]",
  "ingestion_timestamp": "2025-01-15T10:30:00Z"
}

Indexes:

| Type | Fields | Replaces |
| --- | --- | --- |
| Atlas Vector Search | embedding_1024 (cosine, 1024d) | Weaviate DocumentChunk vector |
| Atlas Search | text (full-text) | Weaviate DocumentChunk BM25 |
| Compound unique | { document_id: 1, chunk_index: 1 } | |
| Filter | { provider_name: 1 } | |
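
One way to reproduce hybrid (BM25 + vector) retrieval on this collection is to run a `$vectorSearch` pipeline and a `$search` pipeline separately and merge the two ranked lists in the application. The sketch below uses reciprocal rank fusion for the merge; that is a swapped-in technique chosen for simplicity, not necessarily the fusion Weaviate applies. The index names `chunks_vector` and `chunks_text`, the oversampling factor, and `k=60` are assumptions.

```python
from typing import Any, Optional, Sequence


def chunk_vector_pipeline(
    query_vector: Sequence[float], limit: int = 10, provider: Optional[str] = None
) -> list[dict[str, Any]]:
    """Build a $vectorSearch pipeline over document_chunks.embedding_1024."""
    stage: dict[str, Any] = {
        "index": "chunks_vector",      # hypothetical index name
        "path": "embedding_1024",
        "queryVector": list(query_vector),
        "numCandidates": limit * 10,   # assumed oversampling factor
        "limit": limit,
    }
    if provider is not None:
        # Pre-filtering requires provider_name to be declared as a
        # filter field in the vector index definition.
        stage["filter"] = {"provider_name": {"$eq": provider}}
    return [
        {"$vectorSearch": stage},
        {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


def chunk_text_pipeline(query: str, limit: int = 10) -> list[dict[str, Any]]:
    """Build an Atlas Search full-text pipeline over document_chunks.text."""
    return [
        {"$search": {"index": "chunks_text", "text": {"query": query, "path": "text"}}},
        {"$limit": limit},
        {"$project": {"text": 1, "score": {"$meta": "searchScore"}}},
    ]


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Each pipeline would be passed to `aggregate` on the document_chunks collection, and the `_id` values of the two result lists (as strings) fed to `reciprocal_rank_fusion`.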

manual_parts (replaces ES manual_parts index)

{
  "_id": "ObjectId",
  "part_number": "ct-12345",
  "provider": "Parker Hannifin",
  "pdf_name": "PH-300-manual.pdf",
  "document_id": "doc_abc123",
  "page_number": 42,
  "context": "Item 7 — Hydraulic Pump Assembly, replaces CT-12344",
  "gcs_path": "gs://crop-pdfs/PH-300-manual.pdf",
  "page_type": "parts_list"
}

Indexes:

| Type | Fields | Replaces |
| --- | --- | --- |
| Standard | { part_number: 1 } | ES lowercase_ascii normalizer |
| Atlas Search | part_number, context | ES bool/should query |
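
Because a standard index matches exact values, replacing the ES normalizer implies normalizing `part_number` in application code, both when writing documents and when querying (the example document stores `ct-12345`). A minimal sketch, assuming `lowercase_ascii` means "lowercase plus ASCII folding"; the exact ES normalizer definition is not given here, and the function names are hypothetical.

```python
import unicodedata
from typing import Any


def normalize_part_number(raw: str) -> str:
    """Approximate a lowercase + ASCII-folding normalizer.

    NFKD splits accented letters into base letter + combining mark, and the
    ASCII encode step drops the marks. Characters with no decomposition
    (for example "ø") are dropped rather than folded, so this is only an
    approximation of ES asciifolding.
    """
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


def manual_part_filter(raw_part_number: str) -> dict[str, Any]:
    """Exact-match filter that can use the { part_number: 1 } index."""
    return {"part_number": normalize_part_number(raw_part_number)}
```

The same function would need to run in the indexer that writes manual_parts, otherwise lookups for mixed-case input would miss.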

part_images (replaces Weaviate PartImage)

{
  "_id": "ObjectId",
  "source_pdf": "PH-300-manual.pdf",
  "page": 42,
  "image_type": "exploded_diagram",
  "bbox": { "x": 50.0, "y": 120.0, "w": 400.0, "h": 300.0 },
  "width": 800,
  "height": 600,
  "format": "png",
  "gcs_path": "gs://crop-pdfs/PH-300/page42-fig1.png",
  "part_number": "CT-12345",
  "item_number": "7",
  "related_document_ids": ["ObjectId(...)"],
  "embedding": "[CLIP float array]"
}

Indexes:

| Type | Fields | Replaces |
| --- | --- | --- |
| Atlas Vector Search | embedding (cosine) | Weaviate PartImage |
| Compound | { source_pdf: 1, page: 1 } | |

conversations (existing, extended)

Adds embedding_1024 to each turn, replacing the separate Weaviate ConversationTurns collection.

{
  "_id": "ObjectId",
  "session_id": "sess_xyz789",
  "turns": [
    {
      "role": "user",
      "content": "What hydraulic pump fits a John Deere 310SL?",
      "agent_id": "parts-agent",
      "timestamp": "2025-01-15T10:30:00Z",
      "embedding_1024": "[1024-dim float array]"
    },
    {
      "role": "assistant",
      "content": "The CT-12345 Hydraulic Pump Assembly is compatible...",
      "agent_id": "parts-agent",
      "timestamp": "2025-01-15T10:30:05Z",
      "embedding_1024": "[1024-dim float array]"
    }
  ]
}
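
With the embedding stored on the turn itself, appending a turn becomes a single `$push`, so the text and its vector are written together. The sketch below builds the filter and update documents only; the helper name is hypothetical, and the timestamp is emitted as an ISO-8601 string to match the example above (a native date type may be preferable in practice).

```python
from datetime import datetime, timezone
from typing import Any, Optional, Sequence

TURN_EMBEDDING_DIMS = 1024  # per the schema above


def build_append_turn(
    session_id: str,
    role: str,
    content: str,
    agent_id: str,
    embedding: Sequence[float],
    now: Optional[datetime] = None,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Build (filter, update) documents that append one embedded turn."""
    if len(embedding) != TURN_EMBEDDING_DIMS:
        raise ValueError(
            f"expected {TURN_EMBEDDING_DIMS} dims, got {len(embedding)}"
        )
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    turn = {
        "role": role,
        "content": content,
        "agent_id": agent_id,
        "timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "embedding_1024": [float(x) for x in embedding],
    }
    return {"session_id": session_id}, {"$push": {"turns": turn}}
```

With pymongo, the pair would be passed to `update_one(filter_doc, update_doc, upsert=True)` on the conversations collection so that the first turn also creates the session document.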

Unchanged Collections

These collections are already MongoDB-only and require no changes:

  • pl_documents — PDF document metadata
  • pl_document_pages — PDF page-level data
  • shipments — shipment records
  • rate_queries — delivery rate queries
  • tracking — package tracking
  • vapi_calls — voice API logs
  • equipment_fitment — parts-to-equipment mapping

Migration Mapping

flowchart LR
    subgraph "Decommission"
        ES_parts["ES parts_current"]
        ES_manual["ES manual_parts"]
        W_DocChunk["Weaviate DocumentChunk"]
        W_PartDoc["Weaviate PartDocument"]
        W_PartImg["Weaviate PartImage"]
        W_ProdEmb["Weaviate ProductEmbedding"]
        W_ProdImg["Weaviate ProductImages"]
        W_ConvTrn["Weaviate ConversationTurns"]
    end

    subgraph "MongoDB Atlas"
        M_parts["parts<br/>+ Atlas Search<br/>+ embedding_512<br/>+ image_embeddings"]
        M_chunks["document_chunks<br/>+ embedding_1024"]
        M_manual["manual_parts<br/>+ Atlas Search"]
        M_images["part_images<br/>+ CLIP embedding"]
        M_convos["conversations<br/>+ turn embeddings"]
    end

    ES_parts -->|"Atlas Search replaces<br/>full-text"| M_parts
    W_ProdEmb -->|"embedding_512<br/>field"| M_parts
    W_ProdImg -->|"image_embeddings<br/>array"| M_parts
    W_DocChunk -->|"unified schema"| M_chunks
    W_PartDoc -->|"merged"| M_chunks
    ES_manual -->|"Atlas Search<br/>replaces ES"| M_manual
    W_PartImg -->|"CLIP vectors"| M_images
    W_ConvTrn -->|"embedded in<br/>turns array"| M_convos

| Source | Target | Strategy |
| --- | --- | --- |
| ES parts_current | Atlas Search on parts | Full-text search with fuzzy, boost weights, brand/stock filters |
| ES manual_parts | MongoDB manual_parts + Atlas Search | Keyword + full-text on part_number and context |
| Weaviate DocumentChunk | document_chunks + Vector Search | $vectorSearch (cosine, 1024d) + Atlas Search for hybrid |
| Weaviate PartDocument | Merged into document_chunks | Same data, unified schema |
| Weaviate PartImage | part_images + Vector Search | CLIP vectors stored in Mongo |
| Weaviate ProductEmbedding | parts.embedding_512 | Vector Search on parts collection |
| Weaviate ProductImages | parts.image_embeddings | Vector Search on nested field |
| Weaviate ConversationTurns | conversations.turns | Vector Search on nested field |
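
For the first row, the ES boost weights (title^3, partNumber^4, sku^2, description^2, manufacturer^2) could be carried over to an Atlas Search `compound` query roughly as follows. This is a sketch under assumptions: the index name `parts_search`, `maxEdits: 1` for fuzziness, and the `inventory.inStock` filter path are not specified in this document, and the ES field names are mapped to `description.text` and `manufacturer.name` following the index table.

```python
from typing import Any, Optional

# ES boosts from parts_current, keyed by the MongoDB field path.
FIELD_BOOSTS = {
    "partNumber": 4,
    "title": 3,
    "sku": 2,
    "description.text": 2,
    "manufacturer.name": 2,
}


def parts_search_pipeline(
    query: str,
    brand: Optional[str] = None,
    in_stock_only: bool = False,
    limit: int = 20,
) -> list[dict[str, Any]]:
    """Build a boosted, fuzzy Atlas Search pipeline with optional filters."""
    should = [
        {
            "text": {
                "query": query,
                "path": path,
                "fuzzy": {"maxEdits": 1},            # assumed fuzziness
                "score": {"boost": {"value": boost}},
            }
        }
        for path, boost in FIELD_BOOSTS.items()
    ]
    compound: dict[str, Any] = {"should": should, "minimumShouldMatch": 1}

    filters: list[dict[str, Any]] = []
    if brand:
        filters.append({"phrase": {"query": brand, "path": "manufacturer.name"}})
    if in_stock_only:
        # Requires inventory.inStock to be mapped as boolean in the index.
        filters.append({"equals": {"path": "inventory.inStock", "value": True}})
    if filters:
        compound["filter"] = filters  # filter clauses do not affect the score

    return [
        {"$search": {"index": "parts_search", "compound": compound}},
        {"$limit": limit},
        {"$project": {"partNumber": 1, "title": 1, "score": {"$meta": "searchScore"}}},
    ]
```

Relevance scores from Atlas Search and ES are computed differently, so matching boost values does not guarantee identical rankings; the result order would still need to be compared during validation.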

Benefits

  • 1 database instead of 3 — less infrastructure, fewer failure points
  • Atomic operations — vectors + metadata in one document, no cross-db sync
  • Atlas Search replaces ES — fuzzy matching, boost weights, aggregations, all built-in
  • Atlas Vector Search replaces Weaviate — cosine similarity, HNSW index, same performance at current scale
  • Single connection string for all services
  • No data duplication: the parts_current ES index was only a mirror of MongoDB parts

Infrastructure to Decommission

| Resource | Address | Action |
| --- | --- | --- |
| Elasticsearch VM | 34.138.41.64:9200 | Shut down after migration validation |
| Weaviate VM (primary) | 34.59.145.247:8080 | Shut down after migration validation |
| Weaviate VM (secondary) | 34.139.6.131:8080 | Shut down after migration validation |
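
"Migration validation" is not defined further in this document. One possible check, sketched below, is to run the same queries against the legacy store and against Atlas and compare the top-k result ids. The metric, the `k` value, and the 0.8 threshold are illustrative assumptions, not agreed acceptance criteria.

```python
from typing import Mapping, Sequence, Tuple


def overlap_at_k(
    legacy_ids: Sequence[str], atlas_ids: Sequence[str], k: int = 10
) -> float:
    """Fraction of the legacy top-k ids that also appear in the Atlas top-k."""
    legacy_top = set(legacy_ids[:k])
    if not legacy_top:
        return 1.0  # nothing to reproduce for this query
    return len(legacy_top & set(atlas_ids[:k])) / len(legacy_top)


def failing_queries(
    results: Mapping[str, Tuple[Sequence[str], Sequence[str]]],
    k: int = 10,
    threshold: float = 0.8,
) -> list[str]:
    """Return the queries whose overlap falls below the threshold, sorted."""
    return sorted(
        query
        for query, (legacy, atlas) in results.items()
        if overlap_at_k(legacy, atlas, k) < threshold
    )
```

For example, `overlap_at_k(["a", "b", "c", "d"], ["b", "a", "x", "d"], k=4)` evaluates to 0.75, which would flag that query under the assumed threshold.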

Services That Need Code Changes

| Service | Current DB Clients | Changes Needed |
| --- | --- | --- |
| CROP-Agents | pymongo + elasticsearch + weaviate | Replace ES/Weaviate calls with Atlas Search/Vector Search |
| CROP-RAG-Embedding | motor + weaviate + elasticsearch | Replace Weaviate ingestion + ES manual_parts indexer |
| CROP-RAG-Search | weaviate | Full rewrite to MongoDB Vector Search |
| CROP-AI-SEO | motor + weaviate | Replace Weaviate ProductImages with MongoDB |
| CROP-pdf-parser | pymongo + weaviate + elasticsearch | Replace Weaviate/ES clients with MongoDB |
| parse-pdf-api | motor | No changes (already MongoDB only) |
| delivery | pymongo | No changes (already MongoDB only) |
