MongoDB Integration for Training Data

This document describes how MongoDB is integrated to enrich training data with additional part information.

Overview

MongoDB integration allows the training data preparation process to:

  1. Fetch additional part information from a MongoDB database
  2. Enrich RAG documents with database data
  3. Create more comprehensive training examples

Configuration

Environment Variables

Set these environment variables to configure the MongoDB connection:

# MongoDB connection string
export MONGODB_CONNECTION_STRING="mongodb://localhost:27017/"
# Or for MongoDB Atlas:
# export MONGODB_CONNECTION_STRING="mongodb+srv://user:password@cluster.mongodb.net/"

# Database and collection names
export MONGODB_DATABASE="parts_db"
export MONGODB_COLLECTION="parts"

# Enable/disable MongoDB enrichment
export USE_MONGODB="true"
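
As a rough illustration of how a script might read these variables, here is a minimal sketch. The `MongoSettings` class, the `load_mongo_settings` helper, the fallback defaults, and the rule that only the string "true" (in any case) enables enrichment are all assumptions made for this example, not the actual behaviour of prepare_training_data.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MongoSettings:
    """Hypothetical container for the four settings described above."""
    connection_string: str
    database: str
    collection: str
    enabled: bool


def load_mongo_settings(env=None):
    """Read MongoDB settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return MongoSettings(
        # Fallback values mirror the examples in this document (assumed defaults).
        connection_string=env.get("MONGODB_CONNECTION_STRING", "mongodb://localhost:27017/"),
        database=env.get("MONGODB_DATABASE", "parts_db"),
        collection=env.get("MONGODB_COLLECTION", "parts"),
        # Assumption: anything other than "true" (case-insensitive) disables enrichment.
        enabled=env.get("USE_MONGODB", "false").strip().lower() == "true",
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise in tests.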

MongoDB Document Structure

The connector looks for the part number under any of the following field names:

  • part_number
  • partNumber
  • part_no
  • partNo
  • number
  • sku
  • code
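
One way to search all of these fields at once is a single MongoDB `$or` filter, which matches the query shape shown under Troubleshooting. The sketch below only builds the filter dictionary; the names `PART_NUMBER_FIELDS` and `build_part_query` are hypothetical, and the real connector may construct its query differently.

```python
# Field names listed above, in the order they are documented.
PART_NUMBER_FIELDS = (
    "part_number", "partNumber", "part_no", "partNo", "number", "sku", "code",
)


def build_part_query(part_number, fields=PART_NUMBER_FIELDS):
    """Return a filter that matches the part number in any candidate field."""
    return {"$or": [{field: part_number} for field in fields]}


query = build_part_query("10")
# The resulting dict could be passed to a PyMongo find_one() call.
```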

Example MongoDB document:

{
  "_id": "...",
  "part_number": "10",
  "name": "WASHER SPRING 10MM H.D Z/P",
  "description": "High-duty washer spring",
  "category": "Hardware",
  "manufacturer": "McHale",
  "price": 2.50,
  "stock": 150,
  "specifications": {
    "size": "10mm",
    "material": "Zinc-plated"
  }
}

Usage

In Training Data Preparation

MongoDB data is automatically included when running:

python prepare_training_data.py

The script will:

  1. Connect to MongoDB (if configured)
  2. For each part number in RAG documents, fetch MongoDB data
  3. Enrich training examples with database information
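
The three steps above can be sketched roughly as follows. This is an illustration under assumptions: RAG documents are taken to be dictionaries with a `part_number` key, and `fetch_part` stands in for whatever lookup the connector provides. Neither detail comes from the actual script.

```python
def enrich_documents(rag_documents, fetch_part):
    """Attach database data to each RAG document that has a part number.

    fetch_part is any callable that takes a part number and returns a
    dict of database fields, or None when the part is not found.
    """
    enriched = []
    for doc in rag_documents:
        part_number = doc.get("part_number")
        # Skip the lookup entirely when the document has no part number.
        db_record = fetch_part(part_number) if part_number else None
        # Copy the document so the caller's input is left unmodified.
        enriched.append({**doc, "mongodb": db_record})
    return enriched


# Usage with an in-memory stand-in for the database.
fake_db = {"10": {"name": "WASHER SPRING 10MM H.D Z/P", "category": "Hardware"}}
docs = [{"part_number": "10", "page": 6}, {"page": 7}]
result = enrich_documents(docs, fake_db.get)
```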

Example Training Example with MongoDB

Without MongoDB:

Context from RAG:
Part: 10
Part Number: 10
Description: WASHER SPRING 10MM H.D Z/P
Page: 6

User query: Find part number 10

With MongoDB:

Context from RAG:
Part: 10
Part Number: 10
Description: WASHER SPRING 10MM H.D Z/P
Page: 6

Additional Information from Database:
Name: WASHER SPRING 10MM H.D Z/P
Category: Hardware
Manufacturer: McHale
Price: 2.50
Stock: 150
Specifications: {"size": "10mm", "material": "Zinc-plated"}

User query: Find part number 10
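
The "Additional Information from Database" block could be rendered from a MongoDB document along these lines. The function name `format_database_section`, the choice of fields, and the two-decimal price formatting are assumptions chosen to reproduce the example above; the real formatting code may differ.

```python
import json


def format_database_section(record):
    """Render selected database fields as plain-text context lines."""
    lines = ["Additional Information from Database:"]
    for key in ("name", "category", "manufacturer"):
        if key in record:
            lines.append(f"{key.capitalize()}: {record[key]}")
    if "price" in record:
        # Two decimals, so 2.5 is shown as 2.50 like in the example.
        lines.append(f"Price: {record['price']:.2f}")
    if "stock" in record:
        lines.append(f"Stock: {record['stock']}")
    if "specifications" in record:
        # Nested data is serialized as compact JSON on one line.
        lines.append("Specifications: " + json.dumps(record["specifications"]))
    return "\n".join(lines)


record = {
    "part_number": "10",
    "name": "WASHER SPRING 10MM H.D Z/P",
    "category": "Hardware",
    "manufacturer": "McHale",
    "price": 2.50,
    "stock": 150,
    "specifications": {"size": "10mm", "material": "Zinc-plated"},
}
section = format_database_section(record)
```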

Cloud Build Integration

When running via Cloud Build, pass the MongoDB settings as substitutions:

gcloud builds submit --config cloudbuild_training.yaml . \
  --substitutions=_MONGODB_CONNECTION_STRING="mongodb+srv://...",_MONGODB_DATABASE="parts_db",_MONGODB_COLLECTION="parts",_USE_MONGODB="true"

MongoDB Connector Features

Automatic Field Detection

The connector automatically tries multiple field names to find part numbers:

  • Supports different naming conventions
  • Case-insensitive matching
  • Flexible schema support

Batch Queries

For efficiency, the connector supports batch queries:

from mongodb_connector import get_mongodb_connector

connector = get_mongodb_connector()
parts = connector.get_parts_batch(["10", "20", "30"])
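
How `get_parts_batch` works internally is not shown here. One plausible approach, sketched below as an assumption, is to combine MongoDB's `$in` operator with the multi-field `$or` filter so that many part numbers are fetched in a single round trip. The helper name `build_batch_query` is hypothetical.

```python
def build_batch_query(part_numbers, fields=("part_number", "partNumber", "part_no",
                                            "partNo", "number", "sku", "code")):
    """Build one filter that matches any of the part numbers in any field."""
    # dict.fromkeys removes duplicates while keeping the original order.
    unique_numbers = list(dict.fromkeys(part_numbers))
    return {"$or": [{field: {"$in": unique_numbers}} for field in fields]}


batch_query = build_batch_query(["10", "20", "10", "30"])
```

A single filter like this replaces one query per part number, which is where most of the saving would come from.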

Error Handling

  • Graceful fallback if MongoDB is unavailable
  • Training data generation continues without MongoDB
  • Connection errors are logged but don't stop the process
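
The fallback behaviour described above can be pictured as a thin wrapper around each lookup. This is a sketch only: `fetch_part_safely` is a hypothetical name, and catching every `Exception` is a simplification; the actual connector may catch narrower error types.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_part_safely(fetch_part, part_number):
    """Return the database record, or None if the lookup fails."""
    try:
        return fetch_part(part_number)
    except Exception as exc:  # Simplification: real code may be more selective.
        # Log the problem and carry on so data generation is not interrupted.
        logger.warning("MongoDB lookup failed for part %s: %s", part_number, exc)
        return None


def unavailable(part_number):
    """Stand-in for a lookup against an unreachable server."""
    raise ConnectionError("server not reachable")


record_or_none = fetch_part_safely(unavailable, "10")
```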

Benefits

  1. Richer Training Data: More information per example
  2. Better Model Understanding: Model learns to use database information
  3. Real-world Context: Training examples closer to production scenarios
  4. Flexible: Works with or without MongoDB

Troubleshooting

Connection Issues

# Test MongoDB connection
python -c "from mongodb_connector import get_mongodb_connector; get_mongodb_connector()"

Field Name Mismatch

If part numbers are stored in a different field, update mongodb_connector.py:

query = {
    "$or": [
        {"your_field_name": part_number},
        # ... other fields
    ]
}

Performance

For large datasets, consider:

  • Indexing part number fields in MongoDB
  • Using batch queries instead of individual lookups
  • Caching MongoDB results
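
For the caching suggestion, a small in-process cache built on the standard library's `functools.lru_cache` is often enough. The sketch below wraps an arbitrary lookup callable; `make_cached_lookup` and the cache size are illustrative assumptions, not part of the connector.

```python
from functools import lru_cache


def make_cached_lookup(fetch_part, maxsize=1024):
    """Wrap a lookup so repeated part numbers hit the database only once."""

    @lru_cache(maxsize=maxsize)
    def cached(part_number):
        return fetch_part(part_number)

    return cached


# Usage with a stand-in lookup that records each real call.
calls = []


def fake_fetch(part_number):
    calls.append(part_number)
    return {"part_number": part_number}


lookup = make_cached_lookup(fake_fetch)
lookup("10")
lookup("10")  # Served from the cache; fake_fetch is not called again.
lookup("20")
```

Cached results are shared objects, so callers should treat the returned dictionaries as read-only. For indexing, PyMongo's `create_index` on the collection is the usual entry point, for example on the `part_number` field.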

Security

Connection Strings

  • Store connection strings in environment variables; never hardcode them
  • For MongoDB Atlas, use connection strings with authentication
  • Consider using GCP Secret Manager for production

Example with Secret Manager

# Get connection string from Secret Manager
MONGODB_CONNECTION_STRING=$(gcloud secrets versions access latest --secret="mongodb-connection-string")
export MONGODB_CONNECTION_STRING

Disabling MongoDB

To disable MongoDB enrichment:

export USE_MONGODB="false"
python prepare_training_data.py

Or in Cloud Build:

substitutions:
  _USE_MONGODB: 'false'

Training data will be generated using only RAG documents.
