# Testing Data Preparation Pipeline

## Quick Test on GCP

Test the pipeline on a few files without processing everything or keeping the service running constantly.
### Prerequisites

- Service deployed to GCP Cloud Run
- Required services deployed and running:
  - PDF Parser Service (REQUIRED)
  - Weaviate Service (REQUIRED - typically on GPU VM)
  - CLIP Service (REQUIRED - typically on GPU VM)
- Environment variables configured in `.env.deploy`:
  - `PROJECT_ID`
  - `REGION`
  - `GCS_BUCKET_NAME=crop-documents`
  - `PDF_PARSER_API_URL` - REQUIRED: URL to PDF Parser Service
  - `WEAVIATE_API_URL` - REQUIRED: URL to Weaviate Service (e.g., `http://weaviate-service-vm-ip:8002`)
  - `CLIP_API_URL` - REQUIRED: URL to CLIP Service (e.g., `http://clip-service-vm-ip:8002`)
  - `MLFLOW_TRACKING_URI`
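A minimal `.env.deploy` might look like the sketch below. The project ID, region, hostnames, and the PDF parser port are placeholders, not values from a real deployment:

```shell
# .env.deploy - illustrative placeholder values only
PROJECT_ID=my-gcp-project                          # placeholder
REGION=us-central1                                 # placeholder
GCS_BUCKET_NAME=crop-documents
PDF_PARSER_API_URL=http://pdf-parser-vm-ip:8001    # placeholder host and port
WEAVIATE_API_URL=http://weaviate-service-vm-ip:8002
CLIP_API_URL=http://clip-service-vm-ip:8002
MLFLOW_TRACKING_URI=http://mlflow-host:5000        # placeholder
```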
## Test Pipeline

### Option 1: Using Test Script (Recommended)
```bash
cd data_preparation_service

# Test on 5 files (default)
./test_pipeline.sh 5

# Test on 10 files
./test_pipeline.sh 10
```

The script will:

- ✅ Check if service is deployed (deploy if needed)
- ✅ Run health check
- ✅ Start pipeline with specified limit
- ✅ Show run ID for monitoring
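Under the hood, the script's flow could be sketched roughly as follows. This is a simplified illustration of the four steps above, not the actual `test_pipeline.sh`; the service name and endpoints mirror the manual steps in Option 2:

```shell
# Simplified sketch of test_pipeline.sh (illustrative, not the real script)

get_service_url() {
  # Resolve the deployed Cloud Run URL; fails if the service is absent
  gcloud run services describe data-preparation-service \
    --region "$REGION" --project "$PROJECT_ID" \
    --format 'value(status.url)'
}

run_test() {
  local limit="${1:-5}"                 # number of files, default 5
  local url
  if ! url=$(get_service_url); then
    ./deploy.sh                         # deploy if needed
    url=$(get_service_url)
  fi
  curl -fsS "$url/health"               # health check
  # Start the pipeline; the response contains the parent_run_id
  curl -fsS -X POST "$url/process-from-gcs?limit=$limit"
}
```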
### Option 2: Manual Testing
```bash
# 1. Get service URL
source ../.env.deploy
SERVICE_URL=$(gcloud run services describe data-preparation-service \
  --region $REGION \
  --project $PROJECT_ID \
  --format 'value(status.url)')

# 2. Health check
curl "$SERVICE_URL/health"

# 3. Start pipeline (5 files)
curl -X POST "$SERVICE_URL/process-from-gcs?limit=5" \
  -H "Content-Type: application/json"

# 4. Check status (use parent_run_id from response)
curl "$SERVICE_URL/runs/{parent_run_id}"
```
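For scripting, you can capture the run ID directly instead of copying it from the response. The helper below is a sketch; it assumes the response is JSON with a `parent_run_id` field, as step 4 implies, so verify the field name against your actual response:

```shell
start_pipeline() {
  # Start a limited run and print only the parent_run_id
  # ("parent_run_id" key assumed from the response shape described above)
  curl -fsS -X POST "$SERVICE_URL/process-from-gcs?limit=${1:-5}" \
    -H "Content-Type: application/json" \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["parent_run_id"])'
}

# Usage: PARENT_RUN_ID=$(start_pipeline 5)
```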
## Monitor Progress

### Check Run Status
```bash
# Get run status
curl "$SERVICE_URL/runs/{parent_run_id}"

# List recent runs
curl "$SERVICE_URL/runs?limit=10"
```
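For longer runs it can be convenient to poll the status endpoint until the run completes. The loop below is a sketch; it assumes the `/runs/{parent_run_id}` response is JSON with a `status` field taking values such as `RUNNING` or `FINISHED`, which may differ in your deployment:

```shell
# Poll a run until it leaves the RUNNING state (status values assumed)
wait_for_run() {
  local run_id="$1" status
  while true; do
    # Extract "status" from the JSON response (field name is an assumption)
    status=$(curl -fsS "$SERVICE_URL/runs/$run_id" \
      | python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",""))')
    echo "run $run_id: $status"
    [ "$status" != "RUNNING" ] && break
    sleep 15                 # avoid hammering the service
  done
}

# Usage: wait_for_run "$PARENT_RUN_ID"
```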
### Check MLflow UI

If MLflow is configured:

```bash
# Open MLflow UI
mlflow ui --backend-store-uri $MLFLOW_TRACKING_URI
```
### Check Weaviate

Verify documents were stored:

```bash
# Search in Weaviate
curl "$WEAVIATE_API_URL/v1/objects?class=PartDocument&limit=10"
```
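To get a quick count instead of raw JSON, pipe the response through a small filter. This assumes the endpoint returns the standard Weaviate REST shape with a top-level `objects` array, which may not hold if your Weaviate service wraps responses differently:

```shell
# Count returned PartDocument objects (assumes a top-level "objects" array)
count_part_documents() {
  curl -fsS "$WEAVIATE_API_URL/v1/objects?class=PartDocument&limit=100" \
    | python3 -c 'import json,sys; print(len(json.load(sys.stdin).get("objects",[])))'
}

# Usage: count_part_documents   # prints the number of objects returned
```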
## Stop Service After Testing

To save costs, stop the service when not in use:

```bash
cd data_preparation_service
./stop_service.sh
```

This scales the service to 0 instances (stops billing).
## Restart Service
When you need to use it again:
```bash
# Option 1: Just make a request - Cloud Run auto-scales
curl "$SERVICE_URL/health"

# Option 2: Manually scale up
gcloud run services update data-preparation-service \
  --region $REGION \
  --project $PROJECT_ID \
  --min-instances 0 \
  --max-instances 10
```
## Expected Results

After running the pipeline, you should see:

- MLflow Run with:
  - Parameters: PDF paths, document numbers
  - Metrics: pages processed, documents created
  - Status: FINISHED
- Weaviate with:
  - New documents in the `PartDocument` class
  - Embeddings generated
  - Metadata (source_pdf, page, bbox, etc.)
- Service Logs in GCP Console:
  - Processing progress
  - Any errors or warnings
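These three checks can be bundled into one helper for repeated use. This sketch only reuses commands that appear elsewhere in this document:

```shell
# Quick post-run sanity check (illustrative helper)
verify_run() {
  local run_id="$1"
  # 1. Run status from the service
  curl -fsS "$SERVICE_URL/runs/$run_id"
  # 2. Confirm at least one document landed in Weaviate
  curl -fsS "$WEAVIATE_API_URL/v1/objects?class=PartDocument&limit=1"
  # 3. Recent service logs from Cloud Run
  gcloud run services logs read data-preparation-service \
    --region "$REGION" --project "$PROJECT_ID" --limit 20
}

# Usage: verify_run "$PARENT_RUN_ID"
```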
## Troubleshooting
### Service Not Found

```bash
# Deploy service first
cd data_preparation_service
./deploy.sh
```
### Pipeline Fails

Check logs:

```bash
gcloud run services logs read data-preparation-service \
  --region $REGION \
  --project $PROJECT_ID \
  --limit 50
```
### No PDFs Found

Verify GCS bucket:

```bash
gsutil ls gs://crop-documents/
```
### Weaviate Connection Issues

**Important:** Weaviate service is required for storing processed documents.

Check environment variables:

```bash
echo $WEAVIATE_API_URL
echo $WEAVIATE_API_KEY
```

Check Weaviate service:

```bash
# Verify Weaviate service URL is set
echo $WEAVIATE_API_URL

# Test Weaviate service health
curl "$WEAVIATE_API_URL/health"

# If Weaviate is on a VM, check VM status
gcloud compute instances list --filter="name:weaviate-service"
```

If Weaviate service is not deployed:

```bash
# Deploy Weaviate service to GPU VM
cd ../weaviate_service
./deploy_gpu_vm.sh
```
### CLIP Service Not Available

**Important:** CLIP service is required for the pipeline to start.

Check CLIP service:

```bash
# Verify CLIP service URL is set
echo $CLIP_API_URL

# Test CLIP service health
curl "$CLIP_API_URL/health"

# If CLIP is on a VM, check VM status
gcloud compute instances list --filter="name:clip-service"
```

If CLIP service is not deployed:

```bash
# Deploy CLIP service to GPU VM
cd ../clip_service
./deploy_gpu_vm.sh
```
## Cost Optimization

- ✅ Scale to 0 when not in use (`./stop_service.sh`)
- ✅ Test on small batches (5-10 files) first
- ✅ Use Cloud Run (pay per request, not 24/7)
- ✅ Monitor usage in GCP Console
Cloud Run charges only for:
- CPU time during requests
- Memory allocated
- Requests processed
When scaled to 0, you pay $0 for idle time.
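To confirm the service is actually in that pay-zero state, you can check that no minimum instance count is pinned. A sketch: it greps the service description for the Knative `minScale` annotation, which is where Cloud Run records `--min-instances`; the exact YAML location may vary by gcloud version:

```shell
# Confirm no minimum instances are pinned (so the service can scale to 0)
check_scaled_to_zero() {
  gcloud run services describe data-preparation-service \
    --region "$REGION" --project "$PROJECT_ID" --format yaml \
    | grep -i 'minScale' \
    || echo "no minScale set - service can scale to 0"
}

# Usage: check_scaled_to_zero
```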