PDF Parser Services
Overview of all microservices in the CROP PDF Parser platform
The PDF Parser platform is a microservice architecture for extracting, processing, and searching parts data from PDF equipment manuals. Services communicate via REST APIs and are deployed on GCP (Cloud Run and GPU VMs).
Core Services
PDF Parser Service
The foundation service that handles raw PDF processing. It extracts text spans with coordinates using PyMuPDF, tables using Camelot, and schema/diagram information (part numbers overlaid on diagrams). All downstream services depend on its output.
- Tech stack: FastAPI, PyMuPDF (fitz), Camelot, Pandas
- Port: 8000
Key endpoints:
| Endpoint | Description |
|---|---|
| POST /parse | Extract text spans with bounding box coordinates from a PDF |
| POST /parse-table | Extract structured tables from a specific PDF page |
| POST /prepare-rag | Convert parsed tables + schemas into RAG-ready documents for Weaviate |
| POST /split-by-documents | Split a multi-document PDF by document number (SPD pattern) |
| POST /extract-images | Extract images from PDF pages for CLIP embedding |
Deploy: ./pdf_parser_service/deploy.sh (Cloud Run)
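As a rough illustration of calling the parser, the sketch below builds (but does not send) an upload request for POST /parse using only the standard library. The request format is not documented on this page: the multipart field name `file`, the localhost host, and the helper name `build_parse_request` are all assumptions. Only the port and the endpoint path come from the text above.

```python
import urllib.request
import uuid

# Port 8000 is from this page; the host is an assumption for local testing.
BASE_URL = "http://localhost:8000"


def build_parse_request(pdf_bytes: bytes, filename: str = "manual.pdf") -> urllib.request.Request:
    """Build a multipart POST to /parse without sending it.

    Assumption: the service accepts the PDF as a multipart field named "file".
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/parse",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# To send the request: urllib.request.urlopen(build_parse_request(pdf_bytes))
```

Keeping request construction separate from sending makes the client easy to check without a running service.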
AI Service (McHale Parts Co-Pilot)
LangChain/LangGraph-based AI assistant for parts lookup. Acts as a professional parts search co-pilot with a defined business personality (McHale Vendor Agent Specialist). Uses RAG with Weaviate for grounded answers and offers store links when part numbers are found.
- Tech stack: LangChain, LangGraph, LLaMA 3.1 8B (vLLM on GCP), BGE-Large embeddings
- Port: 8001
Key features:
- Main Agent: conversational parts search with store link integration
- Expert Agent: specialized schema/table queries
- RAG pipeline with Weaviate vector store
- Fine-tuning support (LoRA adapters) combined with RAG
- Structured JSON responses with part numbers, descriptions, coordinates, confidence scores
Key endpoints:
| Endpoint | Description |
|---|---|
| POST /query | Query the main co-pilot agent (supports include_store_link flag) |
| POST /query-expert | Query the expert agent for schema/table lookups |
| GET /health | Health check |
Business rules: always offer store links when a part number is found; maintain consistent response format; professional tone.
Deploy: ./deploy.sh or gcloud builds submit --config cloudbuild.yaml (Cloud Run)
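A query to the main agent might be assembled along these lines. Only the endpoint path, the port, and the `include_store_link` flag come from this page. The `query` field name, the localhost host, and the helper name are hypothetical, so treat this as a sketch rather than the service's actual request schema.

```python
import json
import urllib.request

# Port 8001 is from this page; the host is an assumption for local testing.
AI_SERVICE_URL = "http://localhost:8001"


def build_query_request(question: str, include_store_link: bool = True) -> urllib.request.Request:
    """Build a JSON POST to /query without sending it."""
    payload = {
        "query": question,                         # hypothetical field name
        "include_store_link": include_store_link,  # flag named in the endpoint table
    }
    return urllib.request.Request(
        f"{AI_SERVICE_URL}/query",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```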
Data Preparation Service
MLflow-tracked pipeline that orchestrates the full data ingestion flow: download PDFs from GCS, process them through the PDF Parser, generate CLIP embeddings for extracted images, and store everything in Weaviate. Processes PDFs sequentially (one at a time) to minimize memory usage.
- Tech stack: FastAPI, MLflow, Google Cloud Storage client
- Port: 8003
- Required dependencies: pdf_parser_service, weaviate_service, clip_service (all three must be running)
Pipeline steps (per PDF):
- Download PDF from GCS bucket (crop-documents)
- Split by document number via /split-by-documents
- Parse tables and schemas via /parse-table and /parse
- Prepare RAG documents via /prepare-rag
- Extract images and generate CLIP embeddings via /extract-images + /embed/batch
- Store text + image documents in Weaviate via /store
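The per-PDF flow can be sketched as a plain function that calls each endpoint in the order listed. The `call` parameter stands in for an HTTP client, and the payload shapes passed between steps are hypothetical. The real service's data flow (for example, per-page table parsing and the GCS download) is simplified away here.

```python
from typing import Any, Callable, Dict, Iterable, List

# call(service, endpoint, payload) -> response; a stand-in for an HTTP client.
Call = Callable[[str, str, Any], Any]


def process_pdf(pdf_bytes: bytes, call: Call) -> Dict[str, int]:
    """Run one PDF through the steps in the order listed above."""
    documents = call("pdf_parser", "/split-by-documents", pdf_bytes)
    stored = 0
    for doc in documents:
        tables = call("pdf_parser", "/parse-table", doc)
        spans = call("pdf_parser", "/parse", doc)
        # Payload shapes below are hypothetical.
        rag_docs = call("pdf_parser", "/prepare-rag", {"tables": tables, "spans": spans})
        images = call("pdf_parser", "/extract-images", doc)
        embeddings = call("clip", "/embed/batch", images)
        call("weaviate", "/store", {"text": rag_docs, "images": embeddings})
        stored += 1
    return {"documents": stored}


def process_batch(pdfs: Iterable[bytes], call: Call) -> List[Dict[str, int]]:
    """Process PDFs one at a time so only one is held in memory at once."""
    return [process_pdf(pdf, call) for pdf in pdfs]
```

Passing the client in as a function keeps the ordering logic testable with a fake, without any of the three dependent services running.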
Key endpoints:
| Endpoint | Description |
|---|---|
| POST /process-pdf | Process a single PDF through the full pipeline |
| POST /process-batch | Process multiple PDFs sequentially |
| POST /process-from-gcs | Auto-discover and process all PDFs from a GCS bucket prefix |
| GET /runs/{run_id} | Get MLflow run status and metrics |
| GET /runs | List recent MLflow runs |
Deploy: ./data_preparation_service/deploy.sh (Cloud Run, 2GB memory, 10min timeout)
Search and Embeddings
CLIP Service
Generates image embeddings using OpenAI's CLIP model (Contrastive Language-Image Pre-training). Used by the data preparation pipeline to create vector representations of images extracted from PDF manuals, enabling visual similarity search in Weaviate.
- Tech stack: FastAPI, Hugging Face Transformers, PyTorch
- Port: 8002
- Model: openai/clip-vit-base-patch32 (configurable; large variant available)
- Requires GPU: deployment enforces GPU availability (T4 recommended, ~$0.35/hr)
Key features:
- Single image, batch, and URL-based embedding generation
- ~10-20ms per image on T4 GPU (vs ~100-200ms on CPU)
- Metadata passthrough for enriched vectors
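To put the latency and price figures in perspective, the following back-of-the-envelope helper estimates GPU time and cost for an embedding job. It uses only the numbers quoted above and ignores VM startup, image download, and batching overhead, so the result is a lower bound rather than a measured cost. The function names are illustrative.

```python
def embedding_hours(num_images: int, ms_per_image: float) -> float:
    """Pure compute time in hours, ignoring I/O and startup overhead."""
    return num_images * ms_per_image / 1000 / 3600


def embedding_cost_usd(num_images: int, ms_per_image: float, usd_per_hour: float = 0.35) -> float:
    """Estimated GPU cost at the quoted T4 hourly rate."""
    return embedding_hours(num_images, ms_per_image) * usd_per_hour


# One million images at the midpoint of the quoted T4 range (15 ms each)
# is roughly 4.2 hours of compute, or about $1.46.
print(round(embedding_hours(1_000_000, 15), 2), round(embedding_cost_usd(1_000_000, 15), 2))
```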
Key endpoints:
| Endpoint | Description |
|---|---|
| POST /embed | Generate embedding for a single uploaded image |
| POST /embed/batch | Batch-embed multiple images |
| POST /embed/url | Generate embedding from an image URL |
Deploy: ./clip_service/deploy_gpu_vm.sh (GCP GPU VM with T4; reuses existing VM on subsequent deploys)
Weaviate Service
FastAPI wrapper around Weaviate vector database for storing and retrieving RAG documents. Generates text embeddings using BGE-Large and provides both raw search and LLM-optimized retrieval endpoints. Stores both text documents (from PDF parsing) and image documents (with CLIP embeddings).
- Tech stack: FastAPI, Weaviate Python client, sentence-transformers (BGE-Large)
- Port: 8002
- Embedding model: BAAI/bge-large-en-v1.5 (~1.5GB RAM)
Key features:
- Batch storage API (10-100x faster than individual inserts; 100 docs/batch, 2 parallel workers)
- Semantic search with metadata filtering (part number, source PDF, page)
- /retrieve endpoint returns pre-formatted context string ready for LLM injection
- Automatic schema creation on startup
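The batching pattern behind the first feature (100 documents per batch, 2 parallel workers) can be illustrated as follows. This page does not say how the service implements it internally, so this is a generic sketch of the technique. `store_batch` is a hypothetical callable that stores one batch and returns how many documents it stored.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 100  # documents per batch, from the feature list
MAX_WORKERS = 2   # parallel workers, from the feature list


def chunked(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def store_in_batches(docs: Sequence[dict], store_batch: Callable[[Sequence[dict]], int]) -> int:
    """Store docs batch by batch on a small thread pool; return the total stored."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return sum(pool.map(store_batch, chunked(docs, BATCH_SIZE)))
```

Batching amortizes per-request overhead across many documents, which is where a speedup over individual inserts would come from.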
Key endpoints:
| Endpoint | Description |
|---|---|
| POST /store | Batch-store RAG documents with auto-generated embeddings |
| POST /search | Full semantic search with Weaviate filters |
| POST /retrieve | RAG retrieval optimized for LLM agents (returns context string) |
| GET /schema | Inspect current Weaviate schema |
| DELETE /schema | Drop Weaviate schema (destructive) |
Deploy: ./weaviate_service/deploy.sh (Cloud Run)
Supporting Services
Barcode Service
Detects and extracts barcodes and QR codes from images. Used to read part numbers from scanned labels or photos.
- Tech stack: FastAPI, pyzbar, OpenCV
- Port: 8003
- Supported formats: EAN-13, EAN-8, UPC-A/E, Code 128, Code 39, ITF, QR Code, Micro QR
Key endpoints:
| Endpoint | Description |
|---|---|
| POST /detect | Detect barcodes/QR codes in an uploaded image |
| POST /detect/url | Detect from an image URL |
| POST /detect/batch | Batch detection across multiple images |
Response includes barcode type, decoded data, format, and bounding box coordinates.
Deploy: ./barcode_service/deploy.sh (Cloud Run)
GCS Service
General-purpose FastAPI wrapper for Google Cloud Storage operations. Provides a REST API for listing, uploading, downloading, copying, and deleting files in GCS buckets. Supports automatic gzip compression/decompression.
- Tech stack: FastAPI, Google Cloud Storage client
- Port: 8006
- Auth: service account JSON or application default credentials
Key endpoints:
| Endpoint | Description |
|---|---|
| GET /buckets | List all buckets in the project |
| GET /buckets/{name}/files | List files with optional prefix filter (max 1000) |
| POST /buckets/{name}/files/{path} | Upload a file (optional gzip compression) |
| GET /buckets/{name}/files/{path} | Download a file (optional gzip decompression) |
| DELETE /buckets/{name}/files/{path} | Delete a file |
| POST /buckets/{name}/files/{path}/copy | Copy a file to another bucket/path |
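Because object paths often contain spaces and slashes, a client needs to encode them when filling in the `{path}` segment. The helper below is a sketch under several assumptions: the localhost host, that the service accepts slashes inside `{path}`, and that the prefix filter is a query parameter named `prefix`. None of these is confirmed on this page.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Port 8006 is from this page; the host is an assumption for local testing.
GCS_SERVICE_URL = "http://localhost:8006"


def file_url(bucket: str, path: str) -> str:
    """URL for upload, download, or delete of one object.

    Slashes in the object path are kept; other reserved characters are escaped.
    """
    return f"{GCS_SERVICE_URL}/buckets/{quote(bucket, safe='')}/files/{quote(path, safe='/')}"


def list_url(bucket: str, prefix: Optional[str] = None) -> str:
    """URL for listing files; the 'prefix' parameter name is hypothetical."""
    url = f"{GCS_SERVICE_URL}/buckets/{quote(bucket, safe='')}/files"
    if prefix:
        url += "?" + urlencode({"prefix": prefix})
    return url
```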
MLflow Service
Centralized MLflow Tracking Server for monitoring pipeline runs, model experiments, and artifacts across the CROP AI ecosystem.
- Tech stack: MLflow 2.8.1, SQLite/Cloud SQL backend, GCS artifact storage
- Port: 5000
- Production URL: https://mlflow-service-atife5uvka-ue.a.run.app
- Artifact bucket: gs://mlflow-artifacts-noted-bliss-466410-q6/mlflow-artifacts/
Tracked data:
- Pipeline run parameters (PDF path, document number)
- Processing metrics (pages processed, documents created, images stored)
- Model experiment artifacts (checkpoints, configs)
- Run status and error logs
UI available at the production URL with experiment search, run comparison, and artifact browsing.
Deploy: Cloud Run (us-east1), gcloud builds submit --tag gcr.io/$PROJECT_ID/mlflow-service
Grafana Service
Grafana instance pre-configured with dashboards for monitoring Event Sourcing data from the Weaviate Service.
- Tech stack: Grafana, REST API datasource
- Datasources: Weaviate Service API (auto-configured), optional MongoDB Atlas
Dashboard panels: Total Events, Events by Type (chart), Events Timeline, Recent Events table.
Environment variables: WEAVIATE_SERVICE_URL, GRAFANA_ADMIN_PASSWORD, GRAFANA_ROOT_URL
Deploy: ./grafana_service/deploy_grafana.sh or gcloud builds submit --config cloudbuild.yaml (Cloud Run)
Linear Service
FastAPI integration with Linear.app for programmatic issue management via GraphQL API. Used to create and track issues from within the pipeline (e.g., flagging PDFs that fail processing).
- Tech stack: FastAPI, Linear GraphQL API, Pydantic
- Port: 8003
- Auth: Linear Personal API Key (LINEAR_API_KEY)
Key endpoints:
| Endpoint | Description |
|---|---|
| POST /create-issue | Create a new Linear issue (title, description, team, priority, assignee) |
| POST /update-issue/{id} | Update an existing issue |
| GET /teams | List workspace teams |
| GET /projects | List workspace projects |
| GET /users | List workspace users |
| GET /states | List workflow states for a team |
| GET /issue/{id} | Get issue details |
Priority levels: 0 = None, 1 = Urgent, 2 = High, 3 = Medium, 4 = Low
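The priority scale is easy to get backwards, since 1 is the most urgent and 0 means no priority at all. One way to keep call sites readable is an enum, sketched below with a payload builder for /create-issue. The payload key names (`team_id` in particular) and the helper name are hypothetical. Only the numeric levels and the listed issue fields come from this page.

```python
from enum import IntEnum
from typing import Any, Dict


class LinearPriority(IntEnum):
    """Numeric priority levels as listed above."""
    NONE = 0
    URGENT = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


def build_issue_payload(title: str, description: str, team_id: str,
                        priority: LinearPriority = LinearPriority.MEDIUM) -> Dict[str, Any]:
    """Assemble a request body for /create-issue (key names are hypothetical)."""
    return {
        "title": title,
        "description": description,
        "team_id": team_id,
        "priority": int(priority),
    }


# Example: flag a PDF that failed processing as a high-priority issue.
payload = build_issue_payload("PDF failed to parse", "Table extraction raised an error.",
                              team_id="TEAM", priority=LinearPriority.HIGH)
```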
Deploy: gcloud builds submit --config cloudbuild.yaml (Cloud Run)
Frontend
Weaviate Event Sourcing Dashboard
React dashboard for viewing and analyzing Event Sourcing data from the Weaviate Service. Provides event history, statistics, and time-travel state reconstruction. Works with both in-memory and persistent (MongoDB/PostgreSQL) event backends.
- Tech stack: React, Vite
- Port: 3000
- API dependency: Weaviate Service (VITE_API_URL)
Key views:
- /events: event history with filtering and detail view
- /stats: event analytics and charts
- /events/{aggregateId}/state: reconstruct aggregate state at a point in time
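Time-travel reconstruction in an event-sourced system generally means replaying an aggregate's events, in order, up to the requested moment. The sketch below shows that idea in isolation. The `Event` fields and the event type names ("created", "updated", "deleted") are hypothetical and are not taken from the Weaviate Service's actual event schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    aggregate_id: str
    timestamp: datetime
    event_type: str        # hypothetical: "created", "updated", or "deleted"
    data: Dict[str, Any]


def state_at(events: Iterable[Event], aggregate_id: str, as_of: datetime) -> Optional[Dict[str, Any]]:
    """Replay one aggregate's events up to `as_of` and return the resulting state."""
    state: Optional[Dict[str, Any]] = None
    relevant = [e for e in events if e.aggregate_id == aggregate_id and e.timestamp <= as_of]
    for event in sorted(relevant, key=lambda e: e.timestamp):
        if event.event_type == "deleted":
            state = None
        elif event.event_type == "created":
            state = dict(event.data)
        else:
            # Any other event type is treated as a partial update.
            state = {**(state or {}), **event.data}
    return state
```

Because state is derived purely from the event list, the same function works whether the events come from an in-memory store or a persistent backend.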
Note: Grafana requires persistent storage; this custom dashboard works regardless of the event backend (including in-memory).
Service Communication Map
                 ┌──────────────┐
                 │   Frontend   │
                 │  (Dashboard) │
                 └──────┬───────┘
                        │
                 ┌──────▼───────┐
                 │  AI Service  │◄──── User queries
                 │  (Co-Pilot)  │
                 └──────┬───────┘
                        │
                 ┌──────▼───────┐
       ┌─────────│   Weaviate   │◄─────────┐
       │         │   Service    │          │
       │         └──────────────┘          │
       │                                   │
┌──────▼───────┐                 ┌─────────▼────────┐
│  PDF Parser  │◄────────────────│ Data Preparation │
│   Service    │                 │     Service      │
└──────────────┘                 └─┬─────────────┬──┘
                                   │             │
                              ┌────▼─────┐  ┌────▼─────┐
                              │   CLIP   │  │   GCS    │
                              │ Service  │  │ Service  │
                              └──────────┘  └──────────┘

Monitoring layer: MLflow (pipeline tracking), Grafana (event visualization)
External integration: Linear Service (issue tracking)