PDF Parser Services

The PDF Parser platform is a microservice architecture for extracting, processing, and searching parts data from PDF equipment manuals. Services communicate via REST APIs and are deployed on GCP (Cloud Run and GPU VMs).

Core Services

PDF Parser Service

The foundation service that handles raw PDF processing. It extracts text spans with coordinates using PyMuPDF, tables using Camelot, and schema/diagram information (part numbers overlaid on diagrams). All downstream services depend on its output.

Tech stack: FastAPI, PyMuPDF (fitz), Camelot, Pandas
Port: 8000

Key endpoints:

Endpoint	Description
`POST /parse`	Extract text spans with bounding box coordinates from a PDF
`POST /parse-table`	Extract structured tables from a specific PDF page
`POST /prepare-rag`	Convert parsed tables + schemas into RAG-ready documents for Weaviate
`POST /split-by-documents`	Split a multi-document PDF by document number (SPD pattern)
`POST /extract-images`	Extract images from PDF pages for CLIP embedding
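
The response format of `/parse` is not shown here, so the following is only a sketch of what "text spans with bounding box coordinates" can look like. It assumes the nested blocks/lines/spans layout that PyMuPDF's `page.get_text("dict")` produces and flattens it into simple records. The function name `flatten_spans`, the output keys, and the sample part number are illustrative, not taken from the service.

```python
from typing import Any, Dict, List


def flatten_spans(page_dict: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a PyMuPDF-style page dict into a list of {text, bbox} records."""
    spans = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if text:  # drop whitespace-only spans
                    spans.append({"text": text, "bbox": tuple(span["bbox"])})
    return spans


# Hand-written sample in the shape of page.get_text("dict"); values are made up.
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "CBL-1042 ", "bbox": (72.0, 100.0, 130.5, 112.0)},
            {"text": "  ", "bbox": (130.5, 100.0, 134.0, 112.0)},
        ]}]},
        {"type": 1, "bbox": (0, 0, 10, 10)},  # image block without text lines
    ]
}

spans = flatten_spans(sample_page)
```

Each bounding box is an `(x0, y0, x1, y1)` tuple in page coordinates, which is what lets downstream code match a part number to its position on a diagram.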

Deploy: ./pdf_parser_service/deploy.sh (Cloud Run)

AI Service (McHale Parts Co-Pilot)

A LangChain/LangGraph-based AI assistant for parts lookup. It acts as a professional parts-search co-pilot with a defined business persona (McHale Vendor Agent Specialist), uses RAG over Weaviate to ground its answers, and offers store links when part numbers are found.

Tech stack: LangChain, LangGraph, LLaMA 3.1 8B (vLLM on GCP), BGE-Large embeddings
Port: 8001

Key features:

Main Agent: conversational parts search with store link integration
Expert Agent: specialized schema/table queries
RAG pipeline with Weaviate vector store
Fine-tuning support (LoRA adapters) combined with RAG
Structured JSON responses with part numbers, descriptions, coordinates, confidence scores

Key endpoints:

Endpoint	Description
`POST /query`	Query the main co-pilot agent (supports `include_store_link` flag)
`POST /query-expert`	Query the expert agent for schema/table lookups
`GET /health`	Health check

Business rules: always offer store links when a part number is found; maintain consistent response format; professional tone.
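
As a rough illustration of the store-link rule, the sketch below attaches a link to an answer whenever a part number is present and the `include_store_link` flag is on. The response keys (`part_number`, `store_link`), the function name, and the store URL are all placeholders; the real response schema and store address are not given in this document.

```python
from typing import Any, Dict
from urllib.parse import quote

# Placeholder base URL for illustration only; not the real store address.
STORE_BASE_URL = "https://store.example.com/parts"


def apply_store_link_rule(answer: Dict[str, Any],
                          include_store_link: bool = True) -> Dict[str, Any]:
    """Return a copy of the answer with a store link when a part number is present."""
    result = dict(answer)  # avoid mutating the caller's dict
    part_number = result.get("part_number")
    if include_store_link and part_number:
        # Percent-encode the part number so it is safe as a URL path segment.
        result["store_link"] = f"{STORE_BASE_URL}/{quote(str(part_number), safe='')}"
    return result


found = apply_store_link_rule({"part_number": "CBL-1042", "description": "Drive belt"})
missing = apply_store_link_rule({"part_number": None, "description": "No match"})
disabled = apply_store_link_rule({"part_number": "CBL-1042"}, include_store_link=False)
```

Keeping the rule in one small function makes it easy to apply the same behaviour to both the main and expert agents, if that is desired.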

Deploy: ./deploy.sh or gcloud builds submit --config cloudbuild.yaml (Cloud Run)

Data Preparation Service

MLflow-tracked pipeline that orchestrates the full data ingestion flow: download PDFs from GCS, process them through the PDF Parser, generate CLIP embeddings for extracted images, and store everything in Weaviate. Processes PDFs sequentially (one at a time) to minimize memory usage.

Tech stack: FastAPI, MLflow, Google Cloud Storage client
Port: 8003
Required dependencies: pdf_parser_service, weaviate_service, clip_service (all three must be running)

Pipeline steps (per PDF):

Download PDF from GCS bucket (crop-documents)
Split by document number via /split-by-documents
Parse tables and schemas via /parse-table and /parse
Prepare RAG documents via /prepare-rag
Extract images and generate CLIP embeddings via /extract-images + /embed/batch
Store text + image documents in Weaviate via /store
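
The steps above can be sketched as an ordered list of service calls. This is a simplified outline, not the service's implementation: the `call` function is an injected stand-in for an HTTP client, the GCS download is represented only by passing the path in, and feeding each response into the next request is an assumption about data flow that the real pipeline may not follow.

```python
from typing import Any, Callable, Dict, List, Tuple

# (service, endpoint) pairs in the order listed above.
PIPELINE_STEPS: List[Tuple[str, str]] = [
    ("pdf_parser", "/split-by-documents"),
    ("pdf_parser", "/parse-table"),
    ("pdf_parser", "/parse"),
    ("pdf_parser", "/prepare-rag"),
    ("pdf_parser", "/extract-images"),
    ("clip", "/embed/batch"),
    ("weaviate", "/store"),
]

# Hypothetical caller signature: (service, endpoint, payload) -> response dict.
Caller = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def process_pdf(gcs_path: str, call: Caller) -> List[str]:
    """Run one PDF through every step and return the endpoints visited."""
    visited = []
    state: Dict[str, Any] = {"pdf": gcs_path}
    for service, endpoint in PIPELINE_STEPS:
        state = call(service, endpoint, state)  # assumed: output feeds the next step
        visited.append(f"{service}{endpoint}")
    return visited


def process_batch(gcs_paths: List[str], call: Caller) -> Dict[str, List[str]]:
    """Process PDFs strictly one at a time, mirroring the sequential design."""
    return {path: process_pdf(path, call) for path in gcs_paths}
```

Injecting `call` keeps the sketch testable without any of the three dependent services running; a real caller would map each service name to its base URL and issue a POST.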

Key endpoints:

Endpoint	Description
`POST /process-pdf`	Process a single PDF through the full pipeline
`POST /process-batch`	Process multiple PDFs sequentially
`POST /process-from-gcs`	Auto-discover and process all PDFs from a GCS bucket prefix
`GET /runs/{run_id}`	Get MLflow run status and metrics
`GET /runs`	List recent MLflow runs

Deploy: ./data_preparation_service/deploy.sh (Cloud Run, 2GB memory, 10min timeout)

Search and Embeddings

CLIP Service

Generates image embeddings using OpenAI's CLIP model (Contrastive Language-Image Pre-training). Used by the data preparation pipeline to create vector representations of images extracted from PDF manuals, enabling visual similarity search in Weaviate.

Tech stack: FastAPI, Hugging Face Transformers, PyTorch
Port: 8002
Model: openai/clip-vit-base-patch32 (configurable; large variant available)
Requires GPU: deployment enforces GPU availability (T4 recommended, ~$0.35/hr)

Key features:

Single image, batch, and URL-based embedding generation
~10-20ms per image on T4 GPU (vs ~100-200ms on CPU)
Metadata passthrough for enriched vectors
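
To put the latency figures in perspective, here is a back-of-the-envelope estimate using only the numbers quoted above (10-20ms per image on a T4, 100-200ms on CPU, roughly $0.35/hr). The image count of 10,000 is an arbitrary example, and the estimate ignores image decoding, network transfer, batching effects, and VM idle time.

```python
def embedding_time_seconds(num_images: int, ms_per_image: float) -> float:
    """Total embedding time if every image takes ms_per_image milliseconds."""
    return num_images * ms_per_image / 1000.0


def gpu_cost_usd(seconds: float, usd_per_hour: float = 0.35) -> float:
    """Cost of keeping the GPU VM busy for the given number of seconds."""
    return seconds / 3600.0 * usd_per_hour


n_images = 10_000  # arbitrary example size
gpu_slowest = embedding_time_seconds(n_images, 20)    # T4 at 20 ms/image -> 200 s
cpu_fastest = embedding_time_seconds(n_images, 100)   # CPU at 100 ms/image -> 1000 s
gpu_cost = gpu_cost_usd(gpu_slowest)                  # about $0.02
```

Even comparing the slowest GPU figure with the fastest CPU figure, the GPU finishes in a fifth of the time.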

Key endpoints:

Endpoint	Description
`POST /embed`	Generate embedding for a single uploaded image
`POST /embed/batch`	Batch-embed multiple images
`POST /embed/url`	Generate embedding from an image URL

Deploy: ./clip_service/deploy_gpu_vm.sh (GCP GPU VM with T4; reuses existing VM on subsequent deploys)

Weaviate Service

A FastAPI wrapper around the Weaviate vector database for storing and retrieving RAG documents. It generates text embeddings using BGE-Large and provides both raw search and LLM-optimized retrieval endpoints. It stores both text documents (from PDF parsing) and image documents (with CLIP embeddings).

Tech stack: FastAPI, Weaviate Python client, sentence-transformers (BGE-Large)
Port: 8002
Embedding model: BAAI/bge-large-en-v1.5 (~1.5GB RAM)

Key features:

Batch storage API (10-100x faster than individual inserts; 100 docs/batch, 2 parallel workers)
Semantic search with metadata filtering (part number, source PDF, page)
/retrieve endpoint returns pre-formatted context string ready for LLM injection
Automatic schema creation on startup
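
The batching figures above (100 docs per batch, 2 parallel workers) can be illustrated with a small sketch. Whether the service batches internally or expects callers to do so is not stated, so treat this as one possible arrangement; `chunk_documents`, `store_all`, and the `store_batch` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk_documents(items: Sequence[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def store_all(docs: Sequence[T],
              store_batch: Callable[[List[T]], int],
              batch_size: int = 100,
              workers: int = 2) -> int:
    """Send batches through a small thread pool; return the total count stored."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(store_batch, chunk_documents(docs, batch_size)))


docs = [{"id": i} for i in range(250)]
batch_sizes = [len(batch) for batch in chunk_documents(docs)]
stored = store_all(docs, store_batch=len)  # len stands in for a real store call
```

In real use `store_batch` would POST one batch to `/store` and return how many documents were accepted.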

Key endpoints:

Endpoint	Description
`POST /store`	Batch-store RAG documents with auto-generated embeddings
`POST /search`	Full semantic search with Weaviate filters
`POST /retrieve`	RAG retrieval optimized for LLM agents (returns `context` string)
`GET /schema`	Inspect current Weaviate schema
`DELETE /schema`	Drop Weaviate schema (destructive)
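
The exact layout of the `context` string returned by `/retrieve` is not documented here. As a hedged illustration of the idea, the sketch below turns a list of search hits into a numbered, size-limited block of text suitable for pasting into a prompt. The hit keys (`source_pdf`, `page`, `text`), the line format, and the character budget are assumptions.

```python
from typing import Any, Dict, List


def build_context(hits: List[Dict[str, Any]], max_chars: int = 2000) -> str:
    """Join hits into numbered lines, stopping before the character budget is exceeded."""
    lines = []
    used = 0
    for rank, hit in enumerate(hits, start=1):
        line = f"[{rank}] {hit['source_pdf']} p.{hit['page']}: {hit['text']}"
        if used + len(line) + 1 > max_chars:  # +1 accounts for the newline
            break
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)


# Made-up hits for illustration.
hits = [
    {"source_pdf": "manual.pdf", "page": 12, "text": "Part CBL-1042: drive belt"},
    {"source_pdf": "manual.pdf", "page": 13, "text": "Part CBL-1043: tensioner"},
]
context = build_context(hits)
```

Dropping whole entries rather than truncating mid-line keeps every citation in the context intact.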

Deploy: ./weaviate_service/deploy.sh (Cloud Run)

Supporting Services

Barcode Service

Detects and extracts barcodes and QR codes from images. Used to read part numbers from scanned labels or photos.

Tech stack: FastAPI, pyzbar, OpenCV
Port: 8003
Supported formats: EAN-13, EAN-8, UPC-A/E, Code 128, Code 39, ITF, QR Code, Micro QR

Key endpoints:

Endpoint	Description
`POST /detect`	Detect barcodes/QR codes in an uploaded image
`POST /detect/url`	Detect from an image URL
`POST /detect/batch`	Batch detection across multiple images

Response includes barcode type, decoded data, format, and bounding box coordinates.
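
A client might model that response along the following lines. The JSON key names (`detections`, `type`, `data`, `format`, `bbox`) and the `(x, y, width, height)` box layout are assumptions made for this sketch; check the service's actual schema before relying on them.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    type: str                         # e.g. "barcode" or "qr" (assumed values)
    data: str                         # decoded payload, such as a part number
    format: str                       # symbology, e.g. "EAN-13"
    bbox: Tuple[int, int, int, int]   # assumed (x, y, width, height) in pixels


def parse_detections(body: str) -> List[Detection]:
    """Parse a JSON response body into Detection records."""
    payload = json.loads(body)
    return [
        Detection(type=d["type"], data=d["data"], format=d["format"],
                  bbox=tuple(d["bbox"]))
        for d in payload.get("detections", [])
    ]


# Made-up response body for illustration.
sample_body = json.dumps({"detections": [
    {"type": "barcode", "data": "4006381333931", "format": "EAN-13",
     "bbox": [10, 20, 200, 80]},
]})
detections = parse_detections(sample_body)
```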

Deploy: ./barcode_service/deploy.sh (Cloud Run)

GCS Service

General-purpose FastAPI wrapper for Google Cloud Storage operations. Provides a REST API for listing, uploading, downloading, copying, and deleting files in GCS buckets. Supports automatic gzip compression/decompression.

Tech stack: FastAPI, Google Cloud Storage client
Port: 8006
Auth: service account JSON or application default credentials

Key endpoints:

Endpoint	Description
`GET /buckets`	List all buckets in the project
`GET /buckets/{name}/files`	List files with optional prefix filter (max 1000)
`POST /buckets/{name}/files/{path}`	Upload a file (optional gzip compression)
`GET /buckets/{name}/files/{path}`	Download a file (optional gzip decompression)
`DELETE /buckets/{name}/files/{path}`	Delete a file
`POST /buckets/{name}/files/{path}/copy`	Copy a file to another bucket/path
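
Because object paths can contain spaces and other reserved characters, clients need to percent-encode them when building these URLs. The sketch below shows one way to do that, plus a gzip round trip with the standard library. The base URL assumes a local instance on the documented port, and the `prefix` query parameter name is a guess based on the table description.

```python
import gzip
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8006"  # assumes a local instance on the documented port


def file_url(bucket: str, path: str) -> str:
    """URL for a single object; slashes in the object path are preserved."""
    return f"{BASE_URL}/buckets/{quote(bucket, safe='')}/files/{quote(path, safe='/')}"


def list_url(bucket: str, prefix: Optional[str] = None) -> str:
    """URL for listing objects, with an optional (assumed) prefix parameter."""
    url = f"{BASE_URL}/buckets/{quote(bucket, safe='')}/files"
    if prefix:
        url += "?" + urlencode({"prefix": prefix})
    return url


upload_target = file_url("crop-documents", "manuals/baler v2.pdf")
listing = list_url("crop-documents", prefix="manuals/")

# gzip round trip: what compression on upload and decompression on download amount to.
original = b"part_number,description\n"
restored = gzip.decompress(gzip.compress(original))
```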

MLflow Service

Centralized MLflow Tracking Server for monitoring pipeline runs, model experiments, and artifacts across the CROP AI ecosystem.

Tech stack: MLflow 2.8.1, SQLite/Cloud SQL backend, GCS artifact storage
Port: 5000
Production URL: https://mlflow-service-atife5uvka-ue.a.run.app
Artifact bucket: gs://mlflow-artifacts-noted-bliss-466410-q6/mlflow-artifacts/

Tracked data:

Pipeline run parameters (PDF path, document number)
Processing metrics (pages processed, documents created, images stored)
Model experiment artifacts (checkpoints, configs)
Run status and error logs

UI available at the production URL with experiment search, run comparison, and artifact browsing.

Deploy: Cloud Run (us-east1), gcloud builds submit --tag gcr.io/$PROJECT_ID/mlflow-service

Grafana Service

Grafana instance pre-configured with dashboards for monitoring Event Sourcing data from the Weaviate Service.

Tech stack: Grafana, REST API datasource
Datasources: Weaviate Service API (auto-configured), optional MongoDB Atlas

Dashboard panels: Total Events, Events by Type (chart), Events Timeline, Recent Events table.

Environment variables: WEAVIATE_SERVICE_URL, GRAFANA_ADMIN_PASSWORD, GRAFANA_ROOT_URL

Deploy: ./grafana_service/deploy_grafana.sh or gcloud builds submit --config cloudbuild.yaml (Cloud Run)

Linear Service

FastAPI integration with Linear.app for programmatic issue management via GraphQL API. Used to create and track issues from within the pipeline (e.g., flagging PDFs that fail processing).

Tech stack: FastAPI, Linear GraphQL API, Pydantic
Port: 8003
Auth: Linear Personal API Key (LINEAR_API_KEY)

Key endpoints:

Endpoint	Description
`POST /create-issue`	Create a new Linear issue (title, description, team, priority, assignee)
`POST /update-issue/{id}`	Update an existing issue
`GET /teams`	List workspace teams
`GET /projects`	List workspace projects
`GET /users`	List workspace users
`GET /states`	List workflow states for a team
`GET /issue/{id}`	Get issue details

Priority levels: 0 = None, 1 = Urgent, 2 = High, 3 = Medium, 4 = Low
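
Since the numeric scale is easy to misread (0 means no priority, while 1 is the most urgent), a client may prefer named constants. The sketch below encodes the levels above as an enum and builds a request body for `/create-issue`. The payload field names follow the parameters listed in the endpoint table but are otherwise assumed.

```python
from enum import IntEnum
from typing import Any, Dict, Optional


class Priority(IntEnum):
    NONE = 0
    URGENT = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


def build_issue_payload(title: str, description: str, team: str,
                        priority: Priority = Priority.NONE,
                        assignee: Optional[str] = None) -> Dict[str, Any]:
    """Assemble a request body for POST /create-issue (field names assumed)."""
    payload: Dict[str, Any] = {
        "title": title,
        "description": description,
        "team": team,
        "priority": int(priority),
    }
    if assignee is not None:
        payload["assignee"] = assignee
    return payload


payload = build_issue_payload(
    title="PDF failed processing",
    description="Parsing failed for a manual; see the pipeline run logs.",
    team="ENG",  # hypothetical team key
    priority=Priority.URGENT,
)
```

Note that sorting by the raw value puts unprioritized issues (0) ahead of urgent ones (1), so any ordering logic should treat `NONE` as a special case.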

Deploy: gcloud builds submit --config cloudbuild.yaml (Cloud Run)

Frontend

Weaviate Event Sourcing Dashboard

React dashboard for viewing and analyzing Event Sourcing data from the Weaviate Service. Provides event history, statistics, and time-travel state reconstruction. Works with both in-memory and persistent (MongoDB/PostgreSQL) event backends.

Tech stack: React, Vite
Port: 3000
API dependency: Weaviate Service (VITE_API_URL)

Key views:

/events -- event history with filtering and detail view
/stats -- event analytics and charts
/events/{aggregateId}/state -- reconstruct aggregate state at a point in time
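
The state view relies on the core event-sourcing idea: an aggregate's state at time T is whatever results from replaying its events up to T in order. A minimal sketch follows, assuming each event carries `aggregate_id`, `timestamp`, `type`, and `data` fields and that timestamps are ISO-8601 UTC strings in a uniform format (so they sort correctly as text). The field names and the `deleted` event type are illustrative, not taken from the Weaviate Service.

```python
from typing import Any, Dict, Iterable


def reconstruct_state(events: Iterable[Dict[str, Any]],
                      aggregate_id: str,
                      as_of: str) -> Dict[str, Any]:
    """Replay one aggregate's events up to and including as_of."""
    relevant = [
        e for e in events
        if e["aggregate_id"] == aggregate_id and e["timestamp"] <= as_of
    ]
    state: Dict[str, Any] = {}
    for event in sorted(relevant, key=lambda e: e["timestamp"]):
        if event["type"] == "deleted":
            state = {}                    # a deletion clears the aggregate
        else:
            state.update(event["data"])   # later events overwrite earlier fields
    return state


# Made-up event log for illustration.
events = [
    {"aggregate_id": "doc-1", "timestamp": "2024-01-01T10:00:00Z",
     "type": "created", "data": {"page": 1, "part_number": "CBL-1042"}},
    {"aggregate_id": "doc-1", "timestamp": "2024-01-02T10:00:00Z",
     "type": "updated", "data": {"page": 2}},
    {"aggregate_id": "doc-2", "timestamp": "2024-01-01T12:00:00Z",
     "type": "created", "data": {"page": 9}},
]
day_one = reconstruct_state(events, "doc-1", "2024-01-01T23:59:59Z")
latest = reconstruct_state(events, "doc-1", "2024-01-03T00:00:00Z")
```

Because state is derived purely from the event log, the same replay works whether events live in memory or in a persistent backend.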

Note: Grafana requires persistent storage; this custom dashboard works regardless of the event backend (including in-memory).

Service Communication Map

                    ┌──────────────┐
                    │   Frontend   │
                    │  (Dashboard) │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │  AI Service  │◄──── User queries
                    │  (Co-Pilot)  │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
         ┌─────────│   Weaviate   │◄─────────┐
         │         │   Service    │           │
         │         └──────────────┘           │
         │                                    │
  ┌──────▼───────┐                  ┌─────────▼──────┐
  │  PDF Parser  │◄─────────────────│ Data Preparation│
  │   Service    │                  │    Service      │
  └──────────────┘                  └───┬────┬────────┘
                                        │    │
                              ┌─────────▼┐  ┌▼─────────┐
                              │   CLIP   │  │    GCS    │
                              │  Service │  │  Service  │
                              └──────────┘  └──────────┘

Monitoring layer: MLflow (pipeline tracking), Grafana (event visualization)
External integration: Linear Service (issue tracking)
