MongoDB Integration for Training Data
This document describes how MongoDB is integrated to enrich training data with additional part information.
Overview
MongoDB integration allows the training data preparation process to:
- Fetch additional part information from a MongoDB database
- Enrich RAG documents with database data
- Create more comprehensive training examples
Configuration
Environment Variables
Set these environment variables to configure the MongoDB connection:
# MongoDB connection string
export MONGODB_CONNECTION_STRING="mongodb://localhost:27017/"
# Or for MongoDB Atlas:
# export MONGODB_CONNECTION_STRING="mongodb+srv://user:password@cluster.mongodb.net/"
# Database and collection names
export MONGODB_DATABASE="parts_db"
export MONGODB_COLLECTION="parts"
# Enable/disable MongoDB enrichment
export USE_MONGODB="true"
MongoDB Document Structure
The connector looks for part numbers in various field names:
- part_number
- partNumber
- part_no
- partNo
- number
- sku
- code
Example MongoDB document:
{
  "_id": "...",
  "part_number": "10",
  "name": "WASHER SPRING 10MM H.D Z/P",
  "description": "High-duty washer spring",
  "category": "Hardware",
  "manufacturer": "McHale",
  "price": 2.50,
  "stock": 150,
  "specifications": {
    "size": "10mm",
    "material": "Zinc-plated"
  }
}
Usage
In Training Data Preparation
MongoDB data is automatically included when running:
python prepare_training_data.py
The script will:
- Connect to MongoDB (if configured)
- For each part number in RAG documents, fetch MongoDB data
- Enrich training examples with database information
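The enrichment step above can be sketched roughly as follows. This is a minimal illustration, not the code in prepare_training_data.py: the function name `build_context` and the `DB_FIELDS` mapping are hypothetical, and the field labels are taken from the sample document in this guide.

```python
import json
from typing import Mapping, Optional

# Hypothetical mapping of database keys to display labels, mirroring the
# sample document in this guide; the real script may use different fields.
DB_FIELDS = [
    ("name", "Name"),
    ("category", "Category"),
    ("manufacturer", "Manufacturer"),
    ("price", "Price"),
    ("stock", "Stock"),
    ("specifications", "Specifications"),
]


def build_context(rag_text: str, db_doc: Optional[Mapping] = None) -> str:
    """Return the RAG context, with a database section appended when available."""
    if not db_doc:
        # No MongoDB data: the context is the RAG text alone.
        return rag_text
    lines = [rag_text, "Additional Information from Database:"]
    for key, label in DB_FIELDS:
        if key in db_doc:
            value = db_doc[key]
            # Nested values are serialized as JSON so they stay on one line.
            if isinstance(value, (dict, list)):
                value = json.dumps(value)
            lines.append(f"{label}: {value}")
    return "\n".join(lines)
```

Passing `None` or an empty document returns the RAG text unchanged, which matches the "works with or without MongoDB" behaviour described in this guide.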
Example Training Example with MongoDB
Without MongoDB:
Context from RAG:
Part: 10
Part Number: 10
Description: WASHER SPRING 10MM H.D Z/P
Page: 6
User query: Find part number 10
With MongoDB:
Context from RAG:
Part: 10
Part Number: 10
Description: WASHER SPRING 10MM H.D Z/P
Page: 6
Additional Information from Database:
Name: WASHER SPRING 10MM H.D Z/P
Category: Hardware
Manufacturer: McHale
Price: 2.50
Stock: 150
Specifications: {"size": "10mm", "material": "Zinc-plated"}
User query: Find part number 10
Cloud Build Integration
When running via Cloud Build, set substitutions:
gcloud builds submit --config cloudbuild_training.yaml . \
  --substitutions=_MONGODB_CONNECTION_STRING="mongodb+srv://...",_MONGODB_DATABASE="parts_db",_MONGODB_COLLECTION="parts",_USE_MONGODB="true"
MongoDB Connector Features
Automatic Field Detection
The connector automatically tries multiple field names to find part numbers:
- Supports different naming conventions
- Case-insensitive matching
- Flexible schema support
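One way such field detection could work is to build a single `$or` filter over the candidate field names. The sketch below only constructs the filter dictionary. The helper `build_part_query` is hypothetical, and using an anchored `$regex` for case-insensitive matching is an assumption about how the connector might do it, not a description of mongodb_connector.py.

```python
import re
from typing import Any, Dict

# Candidate field names taken from the list in "MongoDB Document Structure".
PART_NUMBER_FIELDS = [
    "part_number", "partNumber", "part_no", "partNo", "number", "sku", "code",
]


def build_part_query(part_number: str, case_insensitive: bool = True) -> Dict[str, Any]:
    """Build a filter that matches the part number in any candidate field."""
    if case_insensitive:
        # Anchored and escaped so that "10" does not also match "100" or "A10".
        value: Any = {"$regex": f"^{re.escape(part_number)}$", "$options": "i"}
    else:
        value = part_number
    return {"$or": [{field: value} for field in PART_NUMBER_FIELDS]}
```

The resulting dictionary can be passed as the filter to a find call. Case-insensitive regex filters generally make poor use of indexes, so exact matching (`case_insensitive=False`) may be preferable when the stored values have consistent casing.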
Batch Queries
For efficiency, the connector supports batch queries:
from mongodb_connector import get_mongodb_connector
connector = get_mongodb_connector()
parts = connector.get_parts_batch(["10", "20", "30"])
Error Handling
- Graceful fallback if MongoDB is unavailable
- Training data generation continues without MongoDB
- Connection errors are logged but don't stop the process
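The fallback behaviour listed above might look something like this sketch. The wrapper `fetch_or_none` is a hypothetical name, and `fetch` stands in for whatever lookup call the connector exposes; nothing here is taken from the actual implementation.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)


def fetch_or_none(
    fetch: Callable[[str], Optional[Dict]], part_number: str
) -> Optional[Dict]:
    """Call fetch(part_number); log and return None if it raises."""
    try:
        return fetch(part_number)
    except Exception as exc:  # broad on purpose: enrichment is optional
        logger.warning("MongoDB lookup failed for part %s: %s", part_number, exc)
        return None
```

A caller that receives `None` simply builds the training example from the RAG document alone, so a database outage degrades the data rather than stopping the run.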
Benefits
- Richer Training Data: More information per example
- Better Model Understanding: Model learns to use database information
- Real-world Context: Training examples closer to production scenarios
- Flexible: Works with or without MongoDB
Troubleshooting
Connection Issues
# Test MongoDB connection
python -c "from mongodb_connector import get_mongodb_connector; get_mongodb_connector()"
Field Name Mismatch
If part numbers are stored in a different field, update mongodb_connector.py:
query = {
    "$or": [
        {"your_field_name": part_number},
        # ... other fields
    ]
}
Performance
For large datasets, consider:
- Indexing part number fields in MongoDB
- Using batch queries instead of individual lookups
- Caching MongoDB results
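The batching and caching suggestions can be combined, as in the sketch below. `CachedPartLookup` is a hypothetical helper, and it assumes a batch function that takes a list of part numbers and returns a dictionary keyed by part number. The real return shape of `get_parts_batch` is not documented here, so an adapter may be needed.

```python
from typing import Callable, Dict, Iterable, List, Set


class CachedPartLookup:
    """Cache part documents and fetch only the missing ones in one batch."""

    def __init__(self, fetch_batch: Callable[[List[str]], Dict[str, dict]]):
        # fetch_batch is an assumed callable: part numbers -> {part_number: doc}
        self._fetch_batch = fetch_batch
        self._cache: Dict[str, dict] = {}
        self._misses: Set[str] = set()  # part numbers known to be absent

    def get_many(self, part_numbers: Iterable[str]) -> Dict[str, dict]:
        wanted = list(dict.fromkeys(part_numbers))  # de-duplicate, keep order
        missing = [
            pn for pn in wanted
            if pn not in self._cache and pn not in self._misses
        ]
        if missing:
            found = self._fetch_batch(missing)
            self._cache.update(found)
            # Remember misses too, so absent parts are not re-queried.
            self._misses.update(pn for pn in missing if pn not in found)
        return {pn: self._cache[pn] for pn in wanted if pn in self._cache}
```

Each call issues at most one batch request, and repeated part numbers across RAG documents are served from memory.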
Security
Connection Strings
- Use environment variables, never hardcode
- For MongoDB Atlas, use connection strings with authentication
- Consider using GCP Secret Manager for production
Example with Secret Manager
# Get connection string from Secret Manager
MONGODB_CONNECTION_STRING=$(gcloud secrets versions access latest --secret="mongodb-connection-string")
export MONGODB_CONNECTION_STRING
Disabling MongoDB
To disable MongoDB enrichment:
export USE_MONGODB="false"
python prepare_training_data.py
Or in Cloud Build:
substitutions:
  _USE_MONGODB: 'false'
Training data will be generated using only RAG documents.
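For reference, a script could interpret these environment variables along the following lines. This is a sketch under stated assumptions: `MongoSettings` and `load_settings` are hypothetical names, and both the default values and the set of strings accepted as "true" are guesses rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MongoSettings:
    connection_string: Optional[str]
    database: str
    collection: str
    enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> MongoSettings:
    """Read the MongoDB settings from the given environment mapping."""
    conn = env.get("MONGODB_CONNECTION_STRING") or None
    # Assumed default: enrichment is on unless USE_MONGODB says otherwise.
    flag = env.get("USE_MONGODB", "true").strip().lower()
    return MongoSettings(
        connection_string=conn,
        database=env.get("MONGODB_DATABASE", "parts_db"),
        collection=env.get("MONGODB_COLLECTION", "parts"),
        # Enrichment needs both the flag and a connection string.
        enabled=flag in {"1", "true", "yes"} and conn is not None,
    )
```

With `USE_MONGODB="false"`, or with no connection string set, `enabled` is `False` and the caller can skip the connector entirely.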