FastAPI service for parsing PDF manuals and extracting text spans, tables, and schemas.

PDF Parser Service

Features

Extract text spans with coordinates using PyMuPDF
Extract tables using Camelot
Extract schema information (part numbers on diagrams)
RESTful API with Swagger documentation
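
As a rough illustration of the first feature, the sketch below flattens the nested structure that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans) into a flat list of text spans with bounding boxes. The helper names `spans_from_page_dict` and `extract_spans`, the output keys, and the rounding to two decimals are assumptions made for this example and are not taken from the service's code.

```python
def spans_from_page_dict(page_dict, page_number):
    """Flatten a PyMuPDF page dict into a list of {page, text, bbox} spans."""
    spans = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so .get() skips them.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if text:  # drop whitespace-only spans
                    spans.append({
                        "page": page_number,
                        "text": text,
                        "bbox": [round(v, 2) for v in span["bbox"]],
                    })
    return spans


def extract_spans(pdf_path):
    """Open a PDF with PyMuPDF and collect spans from every page."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    spans = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc):
            spans.extend(spans_from_page_dict(page.get_text("dict"), page_number))
    return spans
```

Keeping the flattening step separate from the file handling makes it possible to exercise the logic on a hand-written dict without a PDF on disk.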

Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Run the service
uvicorn main:app --reload

The service will be available at http://localhost:8000, with the Swagger documentation at http://localhost:8000/docs.

API Endpoints

GET / - Health check
GET /docs - Swagger documentation
POST /parse - Parse PDF file and extract text spans
POST /parse-table - Extract tables from PDF
POST /prepare-rag - Prepare parsed PDF data for RAG system (Weaviate)

Example Usage

Parse PDF:

curl -X POST "http://localhost:8000/parse" \
  -H "accept: application/json" \
  -F "file=@example.pdf"

Extract Tables:

curl -X POST "http://localhost:8000/parse-table" \
  -H "accept: application/json" \
  -F "file=@example.pdf" \
  -F "page=0"
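
The two upload calls above can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the endpoints accept the same multipart form fields as the curl commands (`file`, plus `page` for /parse-table). The helper names `encode_multipart` and `build_upload_request` are hypothetical.

```python
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, file_bytes,
                     content_type="application/pdf"):
    """Encode form fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'.encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(url, pdf_bytes, filename, fields=None):
    """Build (but do not send) a POST request mirroring the curl calls."""
    body, header = encode_multipart(fields or {}, "file", filename, pdf_bytes)
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": header, "Accept": "application/json"},
        method="POST",
    )
```

To send the request, read example.pdf as bytes, call `build_upload_request("http://localhost:8000/parse-table", pdf_bytes, "example.pdf", {"page": 0})`, and pass the result to `urllib.request.urlopen`. For /parse, omit the `fields` argument.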

Prepare RAG Data:

curl -X POST "http://localhost:8000/prepare-rag" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": [{"page": 6, "tables": [{"rows": [["Part", "Description"], ["10", "WASHER"]], "bbox": [10, 20, 90, 80]}]}],
    "schemas": [{"page": 6, "spans": [{"text": "10", "bbox": [11.76, 9.09, 29.41, 9.85]}]}],
    "source_pdf": "example.pdf"
  }'

The /prepare-rag endpoint converts parsed PDF data into a RAG-ready format for ingestion into the Weaviate vector database. It takes the tables and schemas produced by the /parse and /parse-table endpoints and returns documents, each with content and metadata.
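
A Python client might assemble the /prepare-rag request body as sketched below. The payload keys (`tables`, `schemas`, `source_pdf`, `bbox`) follow the curl example above. The bounding-box check and the helper names `build_prepare_rag_payload` and `build_prepare_rag_request` are assumptions for illustration, not a published schema of the service.

```python
import json
import urllib.request


def _check_bbox(bbox):
    """Reject anything that is not a sequence of four numbers."""
    if len(bbox) != 4 or not all(isinstance(v, (int, float)) for v in bbox):
        raise ValueError(f"bbox must be four numbers, got {bbox!r}")


def build_prepare_rag_payload(tables, schemas, source_pdf):
    """Assemble the JSON body in the shape used by the curl example."""
    for entry in tables:
        for table in entry["tables"]:
            _check_bbox(table["bbox"])
    for entry in schemas:
        for span in entry["spans"]:
            _check_bbox(span["bbox"])
    return {"tables": tables, "schemas": schemas, "source_pdf": source_pdf}


def build_prepare_rag_request(payload, base_url="http://localhost:8000"):
    """Build (but do not send) the POST request for /prepare-rag."""
    return urllib.request.Request(
        base_url + "/prepare-rag",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. Validating locally before the call surfaces malformed bounding boxes without a round trip to the service.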

Deployment

Quick Deploy

cd pdf_parser_service
./deploy.sh

Manual Deploy

cd pdf_parser_service
gcloud builds submit --config cloudbuild.yaml .

Using Root Deploy Script

./deploy.sh pdf-parser

See the deployment section of README.md for full deployment instructions.

Dependencies

FastAPI - Web framework
PyMuPDF (fitz) - PDF parsing
Camelot - Table extraction
Pandas - Data processing
