PDF Parser Services

Overview of all microservices in the CROP PDF Parser platform

The PDF Parser platform is a microservice architecture for extracting, processing, and searching parts data from PDF equipment manuals. Services communicate via REST APIs and are deployed on GCP (Cloud Run and GPU VMs).


Core Services

PDF Parser Service

The foundation service that handles raw PDF processing. It extracts text spans with coordinates using PyMuPDF, tables using Camelot, and schema/diagram information (part numbers overlaid on diagrams). All downstream services depend on its output.

  • Tech stack: FastAPI, PyMuPDF (fitz), Camelot, Pandas
  • Port: 8000

Key endpoints:

| Endpoint | Description |
| --- | --- |
| POST /parse | Extract text spans with bounding box coordinates from a PDF |
| POST /parse-table | Extract structured tables from a specific PDF page |
| POST /prepare-rag | Convert parsed tables + schemas into RAG-ready documents for Weaviate |
| POST /split-by-documents | Split a multi-document PDF by document number (SPD pattern) |
| POST /extract-images | Extract images from PDF pages for CLIP embedding |
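
A client call to /parse might look like the following minimal sketch. The localhost URL, the multipart field name, and the response shape are assumptions; check the service's auto-generated OpenAPI docs for the real contract.

```python
import requests

# Assumption: the service is running locally on the port listed above.
PARSER_URL = "http://localhost:8000"

def parse_pdf(path: str) -> list[dict]:
    """Upload a PDF to /parse and return its extracted text spans."""
    with open(path, "rb") as f:
        resp = requests.post(f"{PARSER_URL}/parse", files={"file": f})
    resp.raise_for_status()
    # Assumed response shape: {"spans": [{"text": ..., "page": ..., "bbox": [...]}]}
    return resp.json().get("spans", [])

def spans_on_page(spans: list[dict], page: int) -> list[dict]:
    """Filter spans down to a single page using the coordinate metadata."""
    return [s for s in spans if s.get("page") == page]
```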

Deploy: ./pdf_parser_service/deploy.sh (Cloud Run)


AI Service (McHale Parts Co-Pilot)

LangChain/LangGraph-based AI assistant for parts lookup. Acts as a professional parts search co-pilot with a defined business personality (McHale Vendor Agent Specialist). Uses RAG with Weaviate for grounded answers and offers store links when part numbers are found.

  • Tech stack: LangChain, LangGraph, LLaMA 3.1 8B (vLLM on GCP), BGE-Large embeddings
  • Port: 8001

Key features:

  • Main Agent: conversational parts search with store link integration
  • Expert Agent: specialized schema/table queries
  • RAG pipeline with Weaviate vector store
  • Fine-tuning support (LoRA adapters) combined with RAG
  • Structured JSON responses with part numbers, descriptions, coordinates, confidence scores

Key endpoints:

| Endpoint | Description |
| --- | --- |
| POST /query | Query the main co-pilot agent (supports include_store_link flag) |
| POST /query-expert | Query the expert agent for schema/table lookups |
| GET /health | Health check |

Business rules: always offer store links when a part number is found; keep the response format consistent; maintain a professional tone.
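
Querying the co-pilot could look like this sketch. The JSON field names and the structured-response shape are assumptions based on the features listed above, not the service's actual request model.

```python
import requests

# Assumption: the service is running locally on the port listed above.
AI_URL = "http://localhost:8001"

def ask_copilot(question: str, include_store_link: bool = True) -> dict:
    """Query the main co-pilot agent (field names are assumptions)."""
    resp = requests.post(
        f"{AI_URL}/query",
        json={"query": question, "include_store_link": include_store_link},
    )
    resp.raise_for_status()
    return resp.json()

def part_numbers(answer: dict) -> list[str]:
    """Pull part numbers out of the structured JSON response.
    Assumed shape: {"parts": [{"part_number": ..., "confidence": ...}]}."""
    return [p["part_number"] for p in answer.get("parts", [])]
```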

Deploy: ./deploy.sh or gcloud builds submit --config cloudbuild.yaml (Cloud Run)


Data Preparation Service

MLflow-tracked pipeline that orchestrates the full data ingestion flow: download PDFs from GCS, process them through the PDF Parser, generate CLIP embeddings for extracted images, and store everything in Weaviate. Processes PDFs sequentially (one at a time) to minimize memory usage.

  • Tech stack: FastAPI, MLflow, Google Cloud Storage client
  • Port: 8003
  • Required dependencies: pdf_parser_service, weaviate_service, clip_service (all three must be running)

Pipeline steps (per PDF):

  1. Download PDF from GCS bucket (crop-documents)
  2. Split by document number via /split-by-documents
  3. Parse tables and schemas via /parse-table and /parse
  4. Prepare RAG documents via /prepare-rag
  5. Extract images and generate CLIP embeddings via /extract-images + /embed/batch
  6. Store text + image documents in Weaviate via /store

Key endpoints:

| Endpoint | Description |
| --- | --- |
| POST /process-pdf | Process a single PDF through the full pipeline |
| POST /process-batch | Process multiple PDFs sequentially |
| POST /process-from-gcs | Auto-discover and process all PDFs from a GCS bucket prefix |
| GET /runs/{run_id} | Get MLflow run status and metrics |
| GET /runs | List recent MLflow runs |

Deploy: ./data_preparation_service/deploy.sh (Cloud Run, 2GB memory, 10min timeout)


Search and Embeddings

CLIP Service

Generates image embeddings using OpenAI's CLIP model (Contrastive Language-Image Pre-training). Used by the data preparation pipeline to create vector representations of images extracted from PDF manuals, enabling visual similarity search in Weaviate.

  • Tech stack: FastAPI, Hugging Face Transformers, PyTorch
  • Port: 8002
  • Model: openai/clip-vit-base-patch32 (configurable; large variant available)
  • Requires GPU: deployment enforces GPU availability (T4 recommended, ~$0.35/hr)

Key features:

  • Single image, batch, and URL-based embedding generation
  • ~10-20ms per image on T4 GPU (vs ~100-200ms on CPU)
  • Metadata passthrough for enriched vectors

Key endpoints:

| Endpoint | Description |
| --- | --- |
| POST /embed | Generate embedding for a single uploaded image |
| POST /embed/batch | Batch-embed multiple images |
| POST /embed/url | Generate embedding from an image URL |
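
A minimal client sketch, plus the cosine similarity that Weaviate-style visual search computes over the resulting vectors. The multipart field name and the "embedding" response key are assumptions.

```python
import math

import requests

# Assumption: the service is running locally on the port listed above.
CLIP_URL = "http://localhost:8002"

def embed_image(path: str) -> list[float]:
    """POST an image to /embed and return its CLIP vector."""
    with open(path, "rb") as f:
        resp = requests.post(f"{CLIP_URL}/embed", files={"file": f})
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """The similarity measure used to rank visually similar images."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```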

Deploy: ./clip_service/deploy_gpu_vm.sh (GCP GPU VM with T4; reuses existing VM on subsequent deploys)


Weaviate Service

FastAPI wrapper around Weaviate vector database for storing and retrieving RAG documents. Generates text embeddings using BGE-Large and provides both raw search and LLM-optimized retrieval endpoints. Stores both text documents (from PDF parsing) and image documents (with CLIP embeddings).

  • Tech stack: FastAPI, Weaviate Python client, sentence-transformers (BGE-Large)
  • Port: 8002
  • Embedding model: BAAI/bge-large-en-v1.5 (~1.5GB RAM)

Key features:

  • Batch storage API (10-100x faster than individual inserts; 100 docs/batch, 2 parallel workers)
  • Semantic search with metadata filtering (part number, source PDF, page)
  • /retrieve endpoint returns pre-formatted context string ready for LLM injection
  • Automatic schema creation on startup

Key endpoints:

| Endpoint | Description |
| --- | --- |
| POST /store | Batch-store RAG documents with auto-generated embeddings |
| POST /search | Full semantic search with Weaviate filters |
| POST /retrieve | RAG retrieval optimized for LLM agents (returns context string) |
| GET /schema | Inspect current Weaviate schema |
| DELETE /schema | Drop Weaviate schema (destructive) |
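
Using /retrieve from an agent might look like this sketch; the request fields and the "context" response key are assumptions, and the prompt template is purely illustrative of how the pre-formatted context string gets injected ahead of the LLM call.

```python
import requests

# Assumption: the service is running locally on the port listed above.
WEAVIATE_URL = "http://localhost:8002"

def retrieve_context(question: str, limit: int = 5) -> str:
    """Call /retrieve, which returns a pre-formatted context string."""
    resp = requests.post(
        f"{WEAVIATE_URL}/retrieve",
        json={"query": question, "limit": limit},
    )
    resp.raise_for_status()
    return resp.json()["context"]

def build_prompt(question: str, context: str) -> str:
    """Inject retrieved context ahead of the user question, the way a
    RAG pipeline grounds its answers."""
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```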

Deploy: ./weaviate_service/deploy.sh (Cloud Run)


Supporting Services

Barcode Service

Detects and extracts barcodes and QR codes from images. Used to read part numbers from scanned labels or photos.

  • Tech stack: FastAPI, pyzbar, OpenCV
  • Port: 8003
  • Supported formats: EAN-13, EAN-8, UPC-A/E, Code 128, Code 39, ITF, QR Code, Micro QR

Key endpoints:

| Endpoint | Description |
| --- | --- |
| POST /detect | Detect barcodes/QR codes in an uploaded image |
| POST /detect/url | Detect from an image URL |
| POST /detect/batch | Batch detection across multiple images |

Response includes barcode type, decoded data, format, and bounding box coordinates.
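
A sketch of calling /detect and extracting the decoded values for a parts lookup. The multipart field name and the exact response keys are assumptions; the doc only guarantees that type, decoded data, format, and bounding box are present.

```python
import requests

# Assumption: the service is running locally on the port listed above.
BARCODE_URL = "http://localhost:8003"

def detect_barcodes(path: str) -> list[dict]:
    """POST an image to /detect and return the raw detections."""
    with open(path, "rb") as f:
        resp = requests.post(f"{BARCODE_URL}/detect", files={"file": f})
    resp.raise_for_status()
    # Assumed shape: [{"type": ..., "data": ..., "format": ..., "bbox": [...]}]
    return resp.json()

def decoded_part_numbers(detections: list[dict]) -> list[str]:
    """Keep only non-empty decoded payloads, e.g. to feed a parts search."""
    return [d["data"] for d in detections if d.get("data")]
```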

Deploy: ./barcode_service/deploy.sh (Cloud Run)


GCS Service

General-purpose FastAPI wrapper for Google Cloud Storage operations. Provides a REST API for listing, uploading, downloading, copying, and deleting files in GCS buckets. Supports automatic gzip compression/decompression.

  • Tech stack: FastAPI, Google Cloud Storage client
  • Port: 8006
  • Auth: service account JSON or application default credentials

Key endpoints:

| Endpoint | Description |
| --- | --- |
| GET /buckets | List all buckets in the project |
| GET /buckets/{name}/files | List files with optional prefix filter (max 1000) |
| POST /buckets/{name}/files/{path} | Upload a file (optional gzip compression) |
| GET /buckets/{name}/files/{path} | Download a file (optional gzip decompression) |
| DELETE /buckets/{name}/files/{path} | Delete a file |
| POST /buckets/{name}/files/{path}/copy | Copy a file to another bucket/path |
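
An upload with client-side gzip might look like this sketch. Whether the service expects a pre-compressed body with a Content-Encoding header or a query flag is an assumption; check its own docs.

```python
import gzip

import requests

# Assumption: the service is running locally on the port listed above.
GCS_SERVICE_URL = "http://localhost:8006"

def upload_compressed(bucket: str, path: str, data: bytes) -> None:
    """Upload a file, gzip-compressing the body on the client side."""
    resp = requests.post(
        f"{GCS_SERVICE_URL}/buckets/{bucket}/files/{path}",
        data=gzip.compress(data),
        headers={"Content-Encoding": "gzip"},
    )
    resp.raise_for_status()

def gzip_roundtrip(data: bytes) -> bytes:
    """Sanity check that the compression step is lossless."""
    return gzip.decompress(gzip.compress(data))
```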

MLflow Service

Centralized MLflow Tracking Server for monitoring pipeline runs, model experiments, and artifacts across the CROP AI ecosystem.

  • Tech stack: MLflow 2.8.1, SQLite/Cloud SQL backend, GCS artifact storage
  • Port: 5000
  • Production URL: https://mlflow-service-atife5uvka-ue.a.run.app
  • Artifact bucket: gs://mlflow-artifacts-noted-bliss-466410-q6/mlflow-artifacts/

Tracked data:

  • Pipeline run parameters (PDF path, document number)
  • Processing metrics (pages processed, documents created, images stored)
  • Model experiment artifacts (checkpoints, configs)
  • Run status and error logs

UI available at the production URL with experiment search, run comparison, and artifact browsing.

Deploy: Cloud Run (us-east1), gcloud builds submit --tag gcr.io/$PROJECT_ID/mlflow-service


Grafana Service

Grafana instance pre-configured with dashboards for monitoring Event Sourcing data from the Weaviate Service.

  • Tech stack: Grafana, REST API datasource
  • Datasources: Weaviate Service API (auto-configured), optional MongoDB Atlas

Dashboard panels: Total Events, Events by Type (chart), Events Timeline, Recent Events table.

Environment variables: WEAVIATE_SERVICE_URL, GRAFANA_ADMIN_PASSWORD, GRAFANA_ROOT_URL

Deploy: ./grafana_service/deploy_grafana.sh or gcloud builds submit --config cloudbuild.yaml (Cloud Run)


Linear Service

FastAPI integration with Linear.app for programmatic issue management via GraphQL API. Used to create and track issues from within the pipeline (e.g., flagging PDFs that fail processing).

  • Tech stack: FastAPI, Linear GraphQL API, Pydantic
  • Port: 8003
  • Auth: Linear Personal API Key (LINEAR_API_KEY)

Key endpoints:

| Endpoint | Description |
| --- | --- |
| POST /create-issue | Create a new Linear issue (title, description, team, priority, assignee) |
| POST /update-issue/{id} | Update an existing issue |
| GET /teams | List workspace teams |
| GET /projects | List workspace projects |
| GET /users | List workspace users |
| GET /states | List workflow states for a team |
| GET /issue/{id} | Get issue details |

Priority levels: 0 = None, 1 = Urgent, 2 = High, 3 = Medium, 4 = Low
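
Flagging a failed PDF from the pipeline could look like this sketch. The priority mapping comes from this doc; the JSON field names are assumptions about the service's request model.

```python
import requests

# Assumption: the service is running locally on the port listed above.
LINEAR_SERVICE_URL = "http://localhost:8003"

# Priority levels as documented above.
PRIORITY = {"none": 0, "urgent": 1, "high": 2, "medium": 3, "low": 4}

def flag_failed_pdf(pdf_path: str, error: str) -> dict:
    """File a Linear issue for a PDF that failed processing."""
    payload = {
        "title": f"PDF processing failed: {pdf_path}",
        "description": error,
        "priority": PRIORITY["high"],
    }
    resp = requests.post(f"{LINEAR_SERVICE_URL}/create-issue", json=payload)
    resp.raise_for_status()
    return resp.json()
```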

Deploy: gcloud builds submit --config cloudbuild.yaml (Cloud Run)


Frontend

Weaviate Event Sourcing Dashboard

React dashboard for viewing and analyzing Event Sourcing data from the Weaviate Service. Provides event history, statistics, and time-travel state reconstruction. Works with both in-memory and persistent (MongoDB/PostgreSQL) event backends.

  • Tech stack: React, Vite
  • Port: 3000
  • API dependency: Weaviate Service (VITE_API_URL)

Key views:

  • /events -- event history with filtering and detail view
  • /stats -- event analytics and charts
  • /events/{aggregateId}/state -- reconstruct aggregate state at a point in time

Note: Grafana requires persistent storage; this custom dashboard works regardless of the event backend (including in-memory).


Service Communication Map

                    ┌──────────────┐
                    │   Frontend   │
                    │  (Dashboard) │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │  AI Service  │◄──── User queries
                    │  (Co-Pilot)  │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
         ┌──────────│   Weaviate   │◄──────────┐
         │          │   Service    │           │
         │          └──────────────┘           │
         │                                     │
  ┌──────▼───────┐                  ┌──────────▼──────┐
  │  PDF Parser  │◄─────────────────│ Data Preparation│
  │   Service    │                  │     Service     │
  └──────────────┘                  └────┬─────┬──────┘
                                         │     │
                               ┌─────────▼┐  ┌─▼────────┐
                               │   CLIP   │  │   GCS    │
                               │ Service  │  │  Service │
                               └──────────┘  └──────────┘

Monitoring layer: MLflow (pipeline tracking), Grafana (event visualization)
External integration: Linear Service (issue tracking)
