PDF Parser Service
FastAPI service for parsing PDF manuals and extracting text spans, tables, and schemas.
Features
- Extract text spans with coordinates using PyMuPDF
- Extract tables using Camelot
- Extract schema information (part numbers on diagrams)
- RESTful API with Swagger documentation
Quick Start
Local Development
# Install dependencies
pip install -r requirements.txt
# Run the service
uvicorn main:app --reload
The service will be available at http://localhost:8000
API Endpoints
- GET / - Health check
- GET /docs - Swagger documentation
- POST /parse - Parse PDF file and extract text spans
- POST /parse-table - Extract tables from PDF
- POST /prepare-rag - Prepare parsed PDF data for RAG system (Weaviate)
Example Usage
Parse PDF:
curl -X POST "http://localhost:8000/parse" \
-H "accept: application/json" \
-F "file=@example.pdf"
Extract Tables:
curl -X POST "http://localhost:8000/parse-table" \
-H "accept: application/json" \
-F "file=@example.pdf" \
-F "page=0"
Prepare RAG Data:
curl -X POST "http://localhost:8000/prepare-rag" \
-H "Content-Type: application/json" \
-d '{
"tables": [{"page": 6, "tables": [{"rows": [["Part", "Description"], ["10", "WASHER"]], "bbox": [10, 20, 90, 80]}]}],
"schemas": [{"page": 6, "spans": [{"text": "10", "bbox": [11.76, 9.09, 29.41, 9.85]}]}],
"source_pdf": "example.pdf"
}'
The /prepare-rag endpoint converts parsed PDF data into a RAG-ready format for ingestion into the Weaviate vector database. It takes the tables and schemas produced by the /parse and /parse-table endpoints and returns documents structured as content plus metadata.
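To make the "content plus metadata" idea concrete, here is a minimal Python sketch of how a payload like the one above could be flattened into documents. It is an illustration only and not the service's actual implementation: the function name `tables_to_documents`, the `kind` field, and the way each row is joined into a content string are all assumptions. Only the input shape is taken from the curl example.

```python
from typing import Any


def tables_to_documents(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a /prepare-rag style payload into content + metadata documents.

    Hypothetical sketch: the output field names ("content", "metadata",
    "kind") are assumptions, not the service's documented schema.
    """
    source = payload.get("source_pdf", "")
    documents: list[dict[str, Any]] = []

    # One document per table data row, pairing each cell with its header.
    for page_entry in payload.get("tables", []):
        for table in page_entry.get("tables", []):
            rows = table.get("rows", [])
            if len(rows) < 2:
                continue  # need a header row plus at least one data row
            header, *data_rows = rows
            for row in data_rows:
                content = "; ".join(f"{h}: {v}" for h, v in zip(header, row))
                documents.append({
                    "content": content,
                    "metadata": {
                        "kind": "table_row",  # assumed label
                        "page": page_entry.get("page"),
                        "bbox": table.get("bbox"),
                        "source_pdf": source,
                    },
                })

    # One document per schema span (e.g. a part number on a diagram).
    for page_entry in payload.get("schemas", []):
        for span in page_entry.get("spans", []):
            documents.append({
                "content": span.get("text", ""),
                "metadata": {
                    "kind": "schema_span",  # assumed label
                    "page": page_entry.get("page"),
                    "bbox": span.get("bbox"),
                    "source_pdf": source,
                },
            })
    return documents


# Same payload as the curl example above.
example = {
    "tables": [{"page": 6, "tables": [{
        "rows": [["Part", "Description"], ["10", "WASHER"]],
        "bbox": [10, 20, 90, 80],
    }]}],
    "schemas": [{"page": 6, "spans": [
        {"text": "10", "bbox": [11.76, 9.09, 29.41, 9.85]},
    ]}],
    "source_pdf": "example.pdf",
}

for doc in tables_to_documents(example):
    print(doc["content"], doc["metadata"]["kind"])
```

Keeping the page number, bounding box, and source file in the metadata means a retrieved document can be traced back to its location in the original PDF. Check the Swagger documentation at /docs for the response shape the service really returns.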
Deployment
Quick Deploy
cd pdf_parser_service
./deploy.sh
Manual Deploy
cd pdf_parser_service
gcloud builds submit --config cloudbuild.yaml .
Using Root Deploy Script
./deploy.sh pdf-parser
See the deployment section in README.md for full deployment instructions.
Dependencies
- FastAPI - Web framework
- PyMuPDF (fitz) - PDF parsing
- Camelot - Table extraction
- Pandas - Data processing