PDF Parser Service Architecture

Project Structure

pdf_parser_service/
├── main.py                 # FastAPI application entry point
├── api/                    # API routes
│   ├── __init__.py
│   └── routes.py          # All API endpoints
├── parsers/               # PDF parsing modules
│   ├── __init__.py
│   ├── spans_parser.py    # Extract text spans from PDF
│   └── tables_parser.py   # Extract tables from PDF
├── rag/                   # RAG document preparation
│   ├── __init__.py
│   ├── table_row_parser.py    # Parse table rows to extract part info
│   ├── schema_index.py        # Build schema index for linking
│   └── rag_preparer.py        # Prepare RAG documents
├── models/                # Pydantic models
│   ├── __init__.py
│   └── schemas.py         # Request/response models
├── utils/                 # Utility functions
│   ├── __init__.py
│   └── bbox.py            # Bounding box coordinate normalization
├── requirements.txt       # Python dependencies
├── Dockerfile             # Docker configuration
├── cloudbuild.yaml        # GCP Cloud Build configuration
├── deploy.sh              # Deployment script
└── README.md              # Service documentation

Module Responsibilities

`main.py`

FastAPI application initialization
CORS middleware configuration
Router registration

`api/routes.py`

POST /parse: Extract text spans from PDF
POST /parse-table: Extract tables from PDF
POST /prepare-rag: Prepare RAG documents from parsed data
GET /: Health check endpoint

`parsers/`

spans_parser.py: Extracts text spans with coordinates from PDF pages
- Filters spans by font (MicrosoftSansSerif) and digits
- Normalizes coordinates to percentages (0-100)
tables_parser.py: Extracts tables using Camelot
- Supports both lattice and stream flavors
- Returns raw table data with normalized coordinates
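
The span filter described for spans_parser.py can be sketched as follows. This is a minimal illustration, not the service's actual code: the span dictionary shape (`text` and `font` keys), the helper name `keep_span`, and the reading of "digits" as "the text contains at least one digit" are all assumptions.

```python
from typing import Any


def keep_span(span: dict[str, Any], font_name: str = "MicrosoftSansSerif") -> bool:
    """Return True for spans in the expected font whose text contains a digit.

    Hypothetical helper: the real filter in spans_parser.py may differ.
    """
    text = span.get("text", "")
    return font_name in span.get("font", "") and any(ch.isdigit() for ch in text)


# Illustrative input only; real spans would come from the PDF pages.
spans = [
    {"text": "12", "font": "MicrosoftSansSerif"},
    {"text": "Fig. A", "font": "MicrosoftSansSerif"},
    {"text": "7", "font": "Arial"},
]
kept = [s for s in spans if keep_span(s)]
```

Keeping the predicate separate from the page loop makes it easy to test without a PDF file.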

`rag/`

table_row_parser.py: Parses table rows to extract:
- Item number (No)
- Part number
- Description
- Quantity
- Notes
schema_index.py: Builds index of schema locations by item number
- Enables quick lookup of diagram coordinates
rag_preparer.py: Combines table and schema data into RAG documents
- Links table entries with schema locations
- Creates comprehensive documents for Weaviate indexing
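
How the three rag/ modules might fit together is sketched below. The function names (`parse_row`, `build_schema_index`, `prepare_documents`), the five-column row order, and the span and document shapes are assumptions made for illustration; the service's real signatures may differ.

```python
from typing import Any


def parse_row(row: list[str]) -> dict[str, Any]:
    """Map one table row to named fields (column order is an assumption)."""
    cells = [c.strip() for c in row] + [""] * 5  # pad short rows
    no, part_number, description, quantity, notes = cells[:5]
    return {
        "no": no,
        "part_number": part_number,
        "description": description,
        "quantity": int(quantity) if quantity.isdigit() else None,
        "notes": notes,
    }


def build_schema_index(spans: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Group diagram span locations by the item number each span displays."""
    index: dict[str, list[Any]] = {}
    for span in spans:
        index.setdefault(span["text"].strip(), []).append(span["bbox"])
    return index


def prepare_documents(rows: list[list[str]], spans: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Attach the schema locations for each item number to its table entry."""
    index = build_schema_index(spans)
    docs = []
    for row in rows:
        part = parse_row(row)
        docs.append({**part, "schema_locations": index.get(part["no"], [])})
    return docs


# Illustrative data only.
rows = [["1", "A-100", "Bolt M8", "4", ""], ["2", "B-200", "Washer", "x", "optional"]]
spans = [
    {"text": "1", "bbox": {"x": 10.0, "y": 12.5}},
    {"text": "1", "bbox": {"x": 40.0, "y": 60.0}},
    {"text": "3", "bbox": {"x": 70.0, "y": 20.0}},
]
docs = prepare_documents(rows, spans)
```

Building the index once and looking up each item number keeps the linking step linear in the number of rows and spans.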

`models/schemas.py`

RAGPreparationRequest: Pydantic model for RAG preparation endpoint
- Validates input data
- Provides example in OpenAPI schema

`utils/bbox.py`

normalize_bbox(): Normalizes bounding box coordinates
- Converts PDF coordinates to canvas coordinates
- Flips Y axis (PDF: bottom=0, Canvas: top=0)
- Returns percentages (0-100)
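
A minimal sketch of the conversion described above. The signature and the returned keys (`x`, `y`, `width`, `height`) are assumptions; the actual normalize_bbox() in utils/bbox.py may take and return different shapes.

```python
def normalize_bbox(x0, y0, x1, y1, page_width, page_height):
    """Convert a PDF bbox (origin bottom-left) to canvas percentages (origin top-left).

    Assumed signature, for illustration only.
    """
    left = min(x0, x1) * 100 / page_width
    right = max(x0, x1) * 100 / page_width
    # Flip the Y axis: the PDF top edge (larger y) becomes the canvas top (smaller y).
    top = (page_height - max(y0, y1)) * 100 / page_height
    bottom = (page_height - min(y0, y1)) * 100 / page_height
    return {"x": left, "y": top, "width": right - left, "height": bottom - top}


# A box on a 600 x 800 page, given in PDF coordinates.
box = normalize_bbox(60, 600, 180, 700, page_width=600, page_height=800)
```

Because the result is expressed in percentages, the same box can be drawn on a canvas of any pixel size by multiplying by the canvas width and height.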

Data Flow

PDF Upload → api/routes.py
Parsing → parsers/spans_parser.py or parsers/tables_parser.py
RAG Preparation → rag/rag_preparer.py
- Uses rag/table_row_parser.py to parse rows
- Uses rag/schema_index.py to link with schemas
Response → JSON with RAG documents ready for Weaviate

Benefits of This Architecture

Separation of Concerns: Each module has a single responsibility
Testability: Modules can be tested independently
Maintainability: Easy to locate and modify specific functionality
Scalability: Easy to add new parsers or RAG processors
Reusability: Modules can be imported and used in other projects
