Testing Data Preparation Pipeline

Test the pipeline on a few files without processing everything or keeping the service running constantly.

Quick Test on GCP

Prerequisites

  1. Data Preparation Service (data-preparation-service) deployed to GCP Cloud Run
  2. Required services deployed and running:
    • PDF Parser Service (REQUIRED)
    • Weaviate Service (REQUIRED - typically on GPU VM)
    • CLIP Service (REQUIRED - typically on GPU VM)
  3. Environment variables configured in .env.deploy:
    • PROJECT_ID
    • REGION
    • GCS_BUCKET_NAME=crop-documents
    • PDF_PARSER_API_URL - REQUIRED: URL to PDF Parser Service
    • WEAVIATE_API_URL - REQUIRED: URL to Weaviate Service (e.g., http://weaviate-service-vm-ip:8002)
    • CLIP_API_URL - REQUIRED: URL to CLIP Service (e.g., http://clip-service-vm-ip:8002)
    • MLFLOW_TRACKING_URI
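
Before running a test, it can help to confirm that the variables listed above are actually set after sourcing .env.deploy. The following is a minimal sketch, not part of the repository: the function name check_required_vars is hypothetical, and it assumes MLFLOW_TRACKING_URI is optional because the list does not mark it as required.

```shell
# Hypothetical helper (not part of the repo): report unset variables.
check_required_vars() {
    local name value missing
    missing=0
    for name in PROJECT_ID REGION GCS_BUCKET_NAME \
                PDF_PARSER_API_URL WEAVIATE_API_URL CLIP_API_URL; do
        # Indirect lookup of the variable named by $name. The names are
        # fixed above, so eval never sees untrusted input.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Assumption: MLflow is optional, so only print a note about it.
    if [ -z "${MLFLOW_TRACKING_URI:-}" ]; then
        echo "note: MLFLOW_TRACKING_URI is not set" >&2
    fi
    return "$missing"
}

# Usage (from data_preparation_service):
#   source ../.env.deploy && check_required_vars
```

The function returns non-zero when any required variable is empty, so it can gate the test script in a one-liner.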

Option 1: Test Pipeline Script

cd data_preparation_service

# Test on 5 files (default)
./test_pipeline.sh 5

# Test on 10 files
./test_pipeline.sh 10

The script will:

  1. ✅ Check if service is deployed (deploy if needed)
  2. ✅ Run health check
  3. ✅ Start pipeline with specified limit
  4. ✅ Show run ID for monitoring

Option 2: Manual Testing

# 1. Get service URL
source ../.env.deploy
SERVICE_URL=$(gcloud run services describe data-preparation-service \
    --region $REGION \
    --project $PROJECT_ID \
    --format 'value(status.url)')

# 2. Health check
curl "$SERVICE_URL/health"

# 3. Start pipeline (5 files)
curl -X POST "$SERVICE_URL/process-from-gcs?limit=5" \
    -H "Content-Type: application/json"

# 4. Check status (set PARENT_RUN_ID to the parent_run_id from the response)
curl "$SERVICE_URL/runs/$PARENT_RUN_ID"
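
Step 4 needs the parent_run_id returned by step 3. A minimal sketch of capturing it, assuming the response is JSON with a top-level string field named parent_run_id (the exact response shape is not shown here, so treat this as an assumption). The helper name extract_parent_run_id is hypothetical.

```shell
# Hypothetical helper: read a JSON response on stdin and print the
# parent_run_id value. Assumes a field like "parent_run_id": "abc123".
extract_parent_run_id() {
    sed -n 's/.*"parent_run_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (starts a real run, so it is left commented out here):
#   PARENT_RUN_ID=$(curl -fsS -X POST "$SERVICE_URL/process-from-gcs?limit=5" \
#       -H "Content-Type: application/json" | extract_parent_run_id)
#   curl "$SERVICE_URL/runs/$PARENT_RUN_ID"
```

If jq is installed, `jq -r '.parent_run_id'` is a more robust alternative to the sed expression.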

Monitor Progress

Check Run Status

# Get run status (PARENT_RUN_ID holds the parent_run_id returned when the pipeline started)
curl "$SERVICE_URL/runs/$PARENT_RUN_ID"

# List recent runs
curl "$SERVICE_URL/runs?limit=10"
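
Rather than re-running the status request by hand, the check can be polled until the run ends. This is a sketch under stated assumptions: it assumes the response body contains an MLflow-style status string (FINISHED, FAILED, or KILLED), which is not confirmed by this page, and wait_for_run is a hypothetical helper name.

```shell
# Hypothetical helper: poll a run until it reaches a terminal state.
# Usage: wait_for_run RUN_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_run() {
    local run_id max_attempts delay attempt body
    run_id=$1
    max_attempts=${2:-30}
    delay=${3:-10}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -f fails on HTTP errors; -sS hides progress but keeps error messages
        body=$(curl -fsS "$SERVICE_URL/runs/$run_id") || return 2
        case $body in
            # Assumption: the status text appears verbatim in the response
            *FAILED*|*KILLED*) echo "run $run_id did not finish cleanly" >&2; return 1 ;;
            *FINISHED*) echo "run $run_id finished"; return 0 ;;
        esac
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "gave up after $max_attempts checks" >&2
    return 3
}

# Usage:
#   wait_for_run "$PARENT_RUN_ID" 60 15
```

The distinct return codes (1 for a failed run, 2 for a request error, 3 for a timeout) make it easier to tell a pipeline failure from a connectivity problem.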

Check MLflow UI

If MLflow is configured:

# Open MLflow UI (if MLFLOW_TRACKING_URI is an http(s) tracking server, open that URL in a browser instead)
mlflow ui --backend-store-uri "$MLFLOW_TRACKING_URI"

Check Weaviate

Verify documents were stored:

# List stored objects in Weaviate
curl "$WEAVIATE_API_URL/v1/objects?class=PartDocument&limit=10"

Stop Service After Testing

To save costs, stop the service when not in use:

cd data_preparation_service
./stop_service.sh

This scales the service to 0 instances (stops billing).

Restart Service

When you need to use it again:

# Option 1: Just make a request - Cloud Run auto-scales
curl "$SERVICE_URL/health"

# Option 2: Reset the scaling limits (instances still start on demand)
gcloud run services update data-preparation-service \
    --region $REGION \
    --project $PROJECT_ID \
    --min-instances 0 \
    --max-instances 10

Expected Results

After running the pipeline, you should see:

  1. MLflow Run with:
    • Parameters: PDF paths, document numbers
    • Metrics: pages processed, documents created
    • Status: FINISHED
  2. Weaviate with:
    • New documents in PartDocument class
    • Embeddings generated
    • Metadata (source_pdf, page, bbox, etc.)
  3. Service Logs in GCP Console:
    • Processing progress
    • Any errors or warnings

Troubleshooting

Service Not Found

# Deploy service first
cd data_preparation_service
./deploy.sh

Pipeline Fails

Check logs:

gcloud run services logs read data-preparation-service \
    --region $REGION \
    --project $PROJECT_ID \
    --limit 50
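
When the log output is long, filtering it down to likely problems can speed up diagnosis. A minimal sketch: filter_problems is a hypothetical helper, and the keyword list is an assumption about how the service logs errors, so adjust it to the actual log format.

```shell
# Hypothetical helper: keep only log lines that look like problems.
# The keyword list is an assumption, not taken from the service's code.
filter_problems() {
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -iE 'error|warning|exception|traceback' || true
}

# Usage:
#   gcloud run services logs read data-preparation-service \
#       --region "$REGION" --project "$PROJECT_ID" --limit 200 | filter_problems
```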

No PDFs Found

Verify GCS bucket:

gsutil ls gs://crop-documents/
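
A plain listing shows what is in the bucket, but a count of PDF objects answers the question more directly. The sketch below assumes PDFs are identified by a .pdf suffix (matched case-insensitively) and that objects under nested prefixes count. Whether the pipeline itself reads nested prefixes is not stated here. count_pdfs is a hypothetical helper name.

```shell
# Hypothetical helper: count lines on stdin that end in .pdf (any case).
count_pdfs() {
    # grep -c prints 0 and exits non-zero on no match; "|| true" masks that
    grep -ci '\.pdf$' || true
}

# Usage (-r lists objects under nested prefixes as well):
#   gsutil ls -r gs://crop-documents/ | count_pdfs
```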

Weaviate Connection Issues

Important: The Weaviate service is required for storing processed documents.

Check environment variables:

echo "$WEAVIATE_API_URL"
# Print only whether the key is set, not its value
echo "${WEAVIATE_API_KEY:+WEAVIATE_API_KEY is set}"

Check Weaviate service:

# Verify Weaviate service URL is set
echo $WEAVIATE_API_URL

# Test Weaviate service health
curl "$WEAVIATE_API_URL/health"

# If Weaviate is on a VM, check VM status
gcloud compute instances list --filter="name:weaviate-service"

If Weaviate service is not deployed:

# Deploy Weaviate service to GPU VM
cd ../weaviate_service
./deploy_gpu_vm.sh

CLIP Service Not Available

Important: The CLIP service is required for the pipeline to start.

Check CLIP service:

# Verify CLIP service URL is set
echo $CLIP_API_URL

# Test CLIP service health
curl "$CLIP_API_URL/health"

# If CLIP is on a VM, check VM status
gcloud compute instances list --filter="name:clip-service"

If CLIP service is not deployed:

# Deploy CLIP service to GPU VM
cd ../clip_service
./deploy_gpu_vm.sh
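
The Weaviate and CLIP health checks above can be combined into one pre-flight check. This is a sketch, not part of the repository: check_health and check_dependencies are hypothetical names, and the only endpoint used is the /health path already shown for both services.

```shell
# Hypothetical helper: probe one service. $1 = label, $2 = base URL.
check_health() {
    if [ -z "$2" ]; then
        echo "$1: URL not set"
        return 1
    fi
    # --max-time bounds the wait if the VM is stopped or unreachable
    if curl -fsS --max-time 10 "$2/health" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: unreachable"
        return 1
    fi
}

# Check both required GPU-VM services and fail if either is down.
check_dependencies() {
    local failed
    failed=0
    check_health "Weaviate" "${WEAVIATE_API_URL:-}" || failed=1
    check_health "CLIP" "${CLIP_API_URL:-}" || failed=1
    return "$failed"
}

# Usage:
#   source ../.env.deploy && check_dependencies
```

The PDF Parser Service is left out because this page does not show a health endpoint for it. If it exposes one, a third check_health line would cover it.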

Cost Optimization

  • Scale to 0 when not in use (./stop_service.sh)
  • Test on small batches (5-10 files) first
  • Use Cloud Run (pay per request, not 24/7)
  • Monitor usage in GCP Console

Cloud Run charges only for:

  • CPU time during requests
  • Memory allocated while requests are being handled
  • Requests processed

When scaled to 0, you pay $0 for idle time.
