PDF Parser Service

FastAPI service for parsing PDF manuals and extracting text spans, tables, and schemas.

Features

  • Extract text spans with coordinates using PyMuPDF
  • Extract tables using Camelot
  • Extract schema information (part numbers on diagrams)
  • RESTful API with Swagger documentation
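
As a rough illustration of the first feature, the sketch below flattens the dictionary returned by PyMuPDF's page.get_text("dict") (blocks, then lines, then spans) into a flat list of text spans with bounding boxes. The function names flatten_spans and extract_spans and the output keys (page, text, bbox) are assumptions made for this example; the service's actual implementation and response shape may differ.

```python
from typing import Any


def flatten_spans(page_dict: dict[str, Any], page_number: int) -> list[dict[str, Any]]:
    """Flatten a PyMuPDF page dict (blocks -> lines -> spans) into a flat list."""
    spans = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if not text:
                    continue  # skip whitespace-only spans
                spans.append(
                    {"page": page_number, "text": text, "bbox": list(span["bbox"])}
                )
    return spans


def extract_spans(pdf_path: str) -> list[dict[str, Any]]:
    """Open a PDF with PyMuPDF and collect spans from every page."""
    import fitz  # PyMuPDF; imported lazily so flatten_spans works without it

    results = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc):  # zero-based page index (assumed)
            results.extend(flatten_spans(page.get_text("dict"), page_number))
    return results
```

Keeping the flattening step separate from file access makes it easy to check against a hand-built dictionary without a real PDF.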

Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Run the service
uvicorn main:app --reload

The service will be available at http://localhost:8000

API Endpoints

  • GET / - Health check
  • GET /docs - Swagger documentation
  • POST /parse - Parse a PDF file and extract text spans
  • POST /parse-table - Extract tables from a PDF file
  • POST /prepare-rag - Prepare parsed PDF data for a RAG system (Weaviate)

Example Usage

Parse PDF:

curl -X POST "http://localhost:8000/parse" \
  -H "accept: application/json" \
  -F "file=@example.pdf"

Extract Tables:

curl -X POST "http://localhost:8000/parse-table" \
  -H "accept: application/json" \
  -F "file=@example.pdf" \
  -F "page=0"
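
The two curl calls above can also be made from Python. The following is a minimal sketch that uses only the standard library; the helper names encode_multipart and post_pdf are hypothetical, and the structure of the JSON the service returns is not specified here, so the result is passed back undecorated.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(fields: dict[str, str], filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode form fields plus one PDF under the "file" field as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def post_pdf(base_url: str, endpoint: str, pdf_path: str, **fields: str):
    """POST a PDF to /parse or /parse-table and return the decoded JSON response."""
    with open(pdf_path, "rb") as fh:
        body, content_type = encode_multipart(fields, os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        base_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the service running locally, post_pdf("http://localhost:8000", "/parse", "example.pdf") mirrors the first curl call, and post_pdf("http://localhost:8000", "/parse-table", "example.pdf", page="0") mirrors the second.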

Prepare RAG Data:

curl -X POST "http://localhost:8000/prepare-rag" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": [{"page": 6, "tables": [{"rows": [["Part", "Description"], ["10", "WASHER"]], "bbox": [10, 20, 90, 80]}]}],
    "schemas": [{"page": 6, "spans": [{"text": "10", "bbox": [11.76, 9.09, 29.41, 9.85]}]}],
    "source_pdf": "example.pdf"
  }'

The /prepare-rag endpoint converts parsed PDF data into a RAG-ready format for ingestion into the Weaviate vector database. It takes the tables and schemas produced by the /parse and /parse-table endpoints and returns a list of documents, each with content and metadata.
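
To make the idea of "documents with content and metadata" concrete, here is a hypothetical sketch of how table rows from the request body above might be turned into such documents. It assumes that the first row of each table is a header and that one document is produced per data row; the function name tables_to_documents and the metadata keys are illustrative guesses, not the service's confirmed output format.

```python
from typing import Any


def tables_to_documents(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn each data row of each table into one document (hypothetical layout)."""
    documents = []
    for page_entry in payload.get("tables", []):
        for table in page_entry.get("tables", []):
            rows = table.get("rows", [])
            if len(rows) < 2:
                continue  # only a header row, or an empty table
            header, *data_rows = rows  # assumption: first row holds column names
            for row in data_rows:
                content = "; ".join(f"{key}: {value}" for key, value in zip(header, row))
                documents.append(
                    {
                        "content": content,
                        "metadata": {
                            "page": page_entry.get("page"),
                            "source_pdf": payload.get("source_pdf"),
                            "bbox": table.get("bbox"),
                        },
                    }
                )
    return documents
```

Applied to the example request body, this yields a single document whose content is "Part: 10; Description: WASHER", with the page number, source file, and table bounding box carried in its metadata.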

Deployment

Quick Deploy

cd pdf_parser_service
./deploy.sh

Manual Deploy

cd pdf_parser_service
gcloud builds submit --config cloudbuild.yaml .

Using Root Deploy Script

./deploy.sh pdf-parser

See the deployment section in README.md for full deployment instructions.

Dependencies

  • FastAPI - Web framework
  • PyMuPDF (fitz) - PDF parsing
  • Camelot - Table extraction
  • Pandas - Data processing
