PDF Parser Service Architecture
Project Structure
pdf_parser_service/
├── main.py # FastAPI application entry point
├── api/ # API routes
│ ├── __init__.py
│ └── routes.py # All API endpoints
├── parsers/ # PDF parsing modules
│ ├── __init__.py
│ ├── spans_parser.py # Extract text spans from PDF
│ └── tables_parser.py # Extract tables from PDF
├── rag/ # RAG document preparation
│ ├── __init__.py
│ ├── table_row_parser.py # Parse table rows to extract part info
│ ├── schema_index.py # Build schema index for linking
│ └── rag_preparer.py # Prepare RAG documents
├── models/ # Pydantic models
│ ├── __init__.py
│ └── schemas.py # Request/response models
├── utils/ # Utility functions
│ ├── __init__.py
│ └── bbox.py # Bounding box coordinate normalization
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
├── cloudbuild.yaml # GCP Cloud Build configuration
├── deploy.sh # Deployment script
└── README.md # Service documentation
Module Responsibilities
main.py
- FastAPI application initialization
- CORS middleware configuration
- Router registration
api/routes.py
- POST /parse: Extract text spans from PDF
- POST /parse-table: Extract tables from PDF
- POST /prepare-rag: Prepare RAG documents from parsed data
- GET /: Health check endpoint
parsers/
- spans_parser.py: Extracts text spans with coordinates from PDF pages
  - Filters spans by font (MicrosoftSansSerif) and digits
  - Normalizes coordinates to percentages (0-100)
- tables_parser.py: Extracts tables using Camelot
  - Supports both lattice and stream flavors
  - Returns raw table data with normalized coordinates
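The span filtering described for spans_parser.py can be sketched roughly as follows. This is a minimal illustration, not the service's code: the function name `filter_spans`, the dict layout with "text" and "font" keys, and the reading of "by font and digits" as "keep spans in MicrosoftSansSerif whose text is all digits" are assumptions.

```python
from typing import Dict, List

TARGET_FONT = "MicrosoftSansSerif"  # font name taken from the module description


def filter_spans(spans: List[Dict]) -> List[Dict]:
    """Keep spans set in TARGET_FONT whose text consists only of digits.

    Assumption: each span is a dict with "text" and "font" keys. The real
    span structure used by spans_parser.py may differ.
    """
    kept = []
    for span in spans:
        text = str(span.get("text", "")).strip()
        font = str(span.get("font", ""))
        if TARGET_FONT in font and text.isdigit():
            kept.append({**span, "text": text})
    return kept


# Hypothetical sample data: only the first span passes both checks.
sample_spans = [
    {"text": "12", "font": "MicrosoftSansSerif"},
    {"text": "Bolt", "font": "MicrosoftSansSerif"},
    {"text": "7", "font": "Arial"},
]
filtered_spans = filter_spans(sample_spans)
```

Spans that survive a filter like this would typically be the numeric callouts in a diagram, which is what makes them usable as item numbers later on.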
rag/
- table_row_parser.py: Parses table rows to extract:
  - Item number (No)
  - Part number
  - Description
  - Quantity
  - Notes
- schema_index.py: Builds index of schema locations by item number
  - Enables quick lookup of diagram coordinates
- rag_preparer.py: Combines table and schema data into RAG documents
  - Links table entries with schema locations
  - Creates comprehensive documents for Weaviate indexing
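How the three rag/ modules fit together can be sketched as below. All function names, the column order of a table row, and the shape of the span and document dicts are hypothetical choices made for illustration. The actual modules may structure this differently.

```python
from typing import Dict, List, Optional

# Assumed column order for a parts-table row.
FIELDS = ("no", "part_number", "description", "quantity", "notes")


def parse_table_row(row: List[str]) -> Optional[Dict[str, str]]:
    """Map a raw row onto FIELDS; return None if the item number is not numeric."""
    cells = [cell.strip() for cell in row]
    cells += [""] * (len(FIELDS) - len(cells))  # pad short rows with empty cells
    record = dict(zip(FIELDS, cells))
    if not record["no"].isdigit():
        return None  # header row or malformed row
    return record


def build_schema_index(spans: List[Dict]) -> Dict[str, List[Dict]]:
    """Group diagram locations by the item number found at each location."""
    index: Dict[str, List[Dict]] = {}
    for span in spans:
        index.setdefault(span["text"], []).append(span["bbox"])
    return index


def prepare_rag_documents(rows: List[List[str]], spans: List[Dict]) -> List[Dict]:
    """Link each parsed table row with its diagram locations."""
    index = build_schema_index(spans)
    documents = []
    for row in rows:
        record = parse_table_row(row)
        if record is None:
            continue
        documents.append({
            **record,
            "schema_locations": index.get(record["no"], []),
            # Flat text a vector store could embed; empty fields are skipped.
            "content": " | ".join(value for value in record.values() if value),
        })
    return documents


# Hypothetical input: one header row, two part rows, two callouts for item 1.
rows = [
    ["No", "Part number", "Description", "Qty", "Notes"],
    ["1", "A-100", "Hex bolt", "4", ""],
    ["2", "B-200", "Washer", "8", "Stainless"],
]
spans = [
    {"text": "1", "bbox": {"x": 10.0, "y": 20.0}},
    {"text": "1", "bbox": {"x": 55.0, "y": 20.0}},
]
docs = prepare_rag_documents(rows, spans)
```

Keying the index by item number keeps the lookup in `prepare_rag_documents` constant-time per row, and an item with no callout in the diagram simply gets an empty location list.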
models/schemas.py
- RAGPreparationRequest: Pydantic model for RAG preparation endpoint
  - Validates input data
  - Provides example in OpenAPI schema
utils/bbox.py
- normalize_bbox(): Normalizes bounding box coordinates
  - Converts PDF coordinates to canvas coordinates
  - Flips Y axis (PDF: bottom=0, Canvas: top=0)
  - Returns percentages (0-100)
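The coordinate conversion can be illustrated with a short sketch. The signature, parameter names, and returned keys here are assumptions. Only the function name, the Y-axis flip, and the 0-100 percentage range come from the description above.

```python
from typing import Dict


def normalize_bbox(
    x0: float, y0: float, x1: float, y1: float,
    page_width: float, page_height: float,
) -> Dict[str, float]:
    """Convert a PDF bbox (origin bottom-left) to canvas percentages (origin top-left)."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    left, right = sorted((x0, x1))
    bottom, top = sorted((y0, y1))
    return {
        "x": 100.0 * left / page_width,
        # Flip Y: the PDF top edge becomes the distance from the canvas top.
        "y": 100.0 * (page_height - top) / page_height,
        "width": 100.0 * (right - left) / page_width,
        "height": 100.0 * (top - bottom) / page_height,
    }


# Hypothetical 500 x 1000 pt page with a box near the top-left corner.
box = normalize_bbox(50, 800, 150, 900, page_width=500, page_height=1000)
```

Expressing the result as percentages lets a front end position overlays without knowing the page's size in points.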
Data Flow
- PDF Upload → api/routes.py
- Parsing → parsers/spans_parser.py or parsers/tables_parser.py
- RAG Preparation → rag/rag_preparer.py
  - Uses rag/table_row_parser.py to parse rows
  - Uses rag/schema_index.py to link with schemas
- Response → JSON with RAG documents ready for Weaviate
Benefits of This Architecture
- Separation of Concerns: Each module has a single responsibility
- Testability: Modules can be tested independently
- Maintainability: Easy to locate and modify specific functionality
- Scalability: Easy to add new parsers or RAG processors
- Reusability: Modules can be imported and used in other projects
Adding New Features
Adding a New Parser
- Create new file in parsers/
- Implement parsing function
- Export in parsers/__init__.py
- Add route in api/routes.py
Adding a New RAG Processor
- Create new file in rag/
- Implement processing function
- Export in rag/__init__.py
- Use in rag/rag_preparer.py or create new endpoint
Adding a New Model
- Create new file in models/ or add to models/schemas.py
- Export in models/__init__.py
- Use in api/routes.py