Testing Data Preparation Pipeline

Quick Test on GCP

Test the pipeline on a few files without processing everything or keeping the service running constantly.

Prerequisites

Service deployed to GCP Cloud Run
Required services deployed and running:
- PDF Parser Service (REQUIRED)
- Weaviate Service (REQUIRED - typically on GPU VM)
- CLIP Service (REQUIRED - typically on GPU VM)
Environment variables configured in .env.deploy:
- PROJECT_ID
- REGION
- GCS_BUCKET_NAME=crop-documents
- PDF_PARSER_API_URL - REQUIRED: URL to PDF Parser Service
- WEAVIATE_API_URL - REQUIRED: URL to Weaviate Service (e.g., http://weaviate-service-vm-ip:8002)
- CLIP_API_URL - REQUIRED: URL to CLIP Service (e.g., http://clip-service-vm-ip:8002)
- MLFLOW_TRACKING_URI
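
Before starting a test run, it can help to confirm that the required variables from the list above are set. The following is a minimal sketch, not part of the service: `missing_env_vars` is a hypothetical helper name, and it assumes `.env.deploy` contains plain `KEY=value` lines that the shell can source. `MLFLOW_TRACKING_URI` is left out because the list above does not mark it as required.

```shell
# Print the name of every required variable that is unset or empty.
# The names are taken from the prerequisites list; the helper itself is hypothetical.
missing_env_vars() {
    for name in PROJECT_ID REGION GCS_BUCKET_NAME \
                PDF_PARSER_API_URL WEAVIATE_API_URL CLIP_API_URL; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Usage:
#   source ../.env.deploy
#   missing=$(missing_env_vars)
#   [ -z "$missing" ] || { echo "Missing variables: $missing" >&2; exit 1; }
```

The function only reports names, so the caller decides whether a missing variable should abort the run.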

Test Pipeline

Option 1: Using Test Script (Recommended)

cd data_preparation_service

# Test on 5 files (default)
./test_pipeline.sh 5

# Test on 10 files
./test_pipeline.sh 10

The script will:

✅ Check if service is deployed (deploy if needed)
✅ Run health check
✅ Start pipeline with specified limit
✅ Show run ID for monitoring

Option 2: Manual Testing

# 1. Get service URL
source ../.env.deploy
SERVICE_URL=$(gcloud run services describe data-preparation-service \
    --region $REGION \
    --project $PROJECT_ID \
    --format 'value(status.url)')

# 2. Health check
curl "$SERVICE_URL/health"

# 3. Start pipeline (5 files)
curl -X POST "$SERVICE_URL/process-from-gcs?limit=5" \
    -H "Content-Type: application/json"

# 4. Check status (set PARENT_RUN_ID to the parent_run_id value from the response)
curl "$SERVICE_URL/runs/$PARENT_RUN_ID"
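
Step 4 needs the `parent_run_id` returned by step 3. A rough way to capture it in a script is sketched below. It assumes the response is flat JSON with a top-level string field named `parent_run_id`; the exact response shape is not documented here. `json_string_field` is a hypothetical helper, and the `sed` pattern is deliberately simple, so `jq` is the more robust choice if it is installed.

```shell
# Print the value of a string field from a flat JSON object read on stdin.
# $1 is the field name. Handles {"key": "value"} style responses only.
json_string_field() {
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (assumes the response format described above):
#   RESPONSE=$(curl -fsS -X POST "$SERVICE_URL/process-from-gcs?limit=5" \
#       -H "Content-Type: application/json")
#   PARENT_RUN_ID=$(printf '%s\n' "$RESPONSE" | json_string_field parent_run_id)
#   curl "$SERVICE_URL/runs/$PARENT_RUN_ID"
```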

Monitor Progress

Check Run Status

# Get run status (PARENT_RUN_ID holds the parent_run_id returned when the pipeline started)
curl "$SERVICE_URL/runs/$PARENT_RUN_ID"

# List recent runs
curl "$SERVICE_URL/runs?limit=10"
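
Instead of re-running the status request by hand, a small polling loop can wait for the run to finish. The sketch below assumes the run endpoint reports MLflow-style statuses (`FINISHED`, `FAILED`, `KILLED`); only `FINISHED` is confirmed by this page. `wait_for_run` is a hypothetical helper that takes the status-fetching command as an argument, so the HTTP details stay outside the loop.

```shell
# Poll until the run reaches a terminal status or the poll budget is used up.
# $1: command that prints the current status
# $2: maximum number of polls (default 60)
# $3: seconds to sleep between polls (default 10)
# Exit codes: 0 = FINISHED, 1 = FAILED or KILLED, 2 = gave up waiting.
wait_for_run() {
    fetch_cmd=$1
    max_polls=${2:-60}
    interval=${3:-10}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        run_status=$($fetch_cmd)
        case "$run_status" in
            FINISHED) echo "$run_status"; return 0 ;;
            FAILED|KILLED) echo "$run_status"; return 1 ;;
        esac
        polls=$((polls + 1))
        sleep "$interval"
    done
    echo "TIMEOUT"
    return 2
}

# Example fetcher (assumes the endpoint returns JSON with a "status" string,
# and that PARENT_RUN_ID holds the run ID from the start request):
#   fetch_status() {
#       curl -fsS "$SERVICE_URL/runs/$PARENT_RUN_ID" |
#           sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
#   }
#   wait_for_run fetch_status 60 10
```

Bounding the number of polls keeps a stuck run from blocking a test script indefinitely.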

Check MLflow UI

If MLflow is configured:

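# Open a local MLflow UI (only when MLFLOW_TRACKING_URI is a file path or database URI;
# if it points to an http(s):// tracking server, open that URL in a browser instead)
mlflow ui --backend-store-uri "$MLFLOW_TRACKING_URI"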

Check Weaviate

Verify documents were stored:

# Search in Weaviate
curl "$WEAVIATE_API_URL/v1/objects?class=PartDocument&limit=10"

Stop Service After Testing

To save costs, stop the service when not in use:

cd data_preparation_service
./stop_service.sh

This scales the service to 0 instances, which stops instance billing while it is idle.

Restart Service

When you need to use it again:

# Option 1: Just make a request - Cloud Run auto-scales
curl "$SERVICE_URL/health"

# Option 2: Reset the scaling limits (use --min-instances 1 instead to keep one instance warm)
gcloud run services update data-preparation-service \
    --region $REGION \
    --project $PROJECT_ID \
    --min-instances 0 \
    --max-instances 10

Expected Results

After running the pipeline, you should see:

MLflow Run with:
- Parameters: PDF paths, document numbers
- Metrics: pages processed, documents created
- Status: FINISHED
Weaviate with:
- New documents in PartDocument class
- Embeddings generated
- Metadata (source_pdf, page, bbox, etc.)
Service Logs in GCP Console:
- Processing progress
- Any errors or warnings

Troubleshooting

Service Not Found

# Deploy service first
cd data_preparation_service
./deploy.sh

Pipeline Fails

Check logs:

gcloud run services logs read data-preparation-service \
    --region $REGION \
    --project $PROJECT_ID \
    --limit 50

No PDFs Found

Verify GCS bucket:

gsutil ls gs://crop-documents/
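
A bare listing shows whether the bucket is reachable but not whether it contains PDFs. The sketch below counts them. `count_pdfs` is a hypothetical helper that reads a listing from stdin, and the recursive `**` wildcard is used on the assumption that PDFs may sit in subfolders; whether the pipeline reads subfolders is not stated here.

```shell
# Count the entries ending in .pdf (case-insensitive) in a listing read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit code clean.
count_pdfs() {
    grep -ci '\.pdf$' || true
}

# Usage:
#   gsutil ls "gs://crop-documents/**" | count_pdfs
```

A count of 0 with a non-empty listing suggests the files are present but do not have a `.pdf` extension.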

Weaviate Connection Issues

Important: Weaviate service is required for storing processed documents.

Check environment variables:

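echo "$WEAVIATE_API_URL"
# Confirm the key is set without printing the secret
[ -n "${WEAVIATE_API_KEY:-}" ] && echo "WEAVIATE_API_KEY is set" || echo "WEAVIATE_API_KEY is not set"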

Check Weaviate service:

# Verify Weaviate service URL is set
echo $WEAVIATE_API_URL

# Test Weaviate service health
curl "$WEAVIATE_API_URL/health"

# If Weaviate is on a VM, check VM status
gcloud compute instances list --filter="name:weaviate-service"

If Weaviate service is not deployed:

# Deploy Weaviate service to GPU VM
cd ../weaviate_service
./deploy_gpu_vm.sh

CLIP Service Not Available

Important: CLIP service is required for the pipeline to start.

Check CLIP service:

# Verify CLIP service URL is set
echo $CLIP_API_URL

# Test CLIP service health
curl "$CLIP_API_URL/health"

# If CLIP is on a VM, check VM status
gcloud compute instances list --filter="name:clip-service"

If CLIP service is not deployed:

# Deploy CLIP service to GPU VM
cd ../clip_service
./deploy_gpu_vm.sh
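
The Weaviate and CLIP health checks above follow the same pattern, so they can be combined into one pre-flight check. This is a sketch under the assumption, taken from the commands above, that both services answer `GET /health`. `service_healthy` and `check_dependencies` are hypothetical helper names.

```shell
# Succeeds unless the request fails to connect, times out after 5 seconds,
# or the server answers with an HTTP status of 400 or higher.
service_healthy() {
    curl -fsS --max-time 5 -o /dev/null "$1/health"
}

# Print one OK/FAIL line per dependency; return non-zero if any check fails.
check_dependencies() {
    failures=0
    for name in WEAVIATE_API_URL CLIP_API_URL; do
        # Look up the URL stored in the variable named by $name.
        eval "url=\${$name:-}"
        if [ -n "$url" ] && service_healthy "$url" 2>/dev/null; then
            echo "OK   $name"
        else
            echo "FAIL $name"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}

# Usage:
#   source ../.env.deploy
#   check_dependencies || echo "Fix the failing services before starting the pipeline" >&2
```

An unset URL is reported as a failure too, which covers both the "variable not set" and "service not reachable" cases in one pass.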

Cost Optimization

✅ Scale to 0 when not in use (./stop_service.sh)
✅ Test on small batches (5-10 files) first
✅ Use Cloud Run (pay per request, not 24/7)
✅ Monitor usage in GCP Console

With request-based billing, Cloud Run charges only for:

- CPU time while requests are being handled
- Memory allocated while requests are being handled
- Number of requests processed

When scaled to 0, you pay $0 for idle time.
