PDF Parser Service Architecture

Project Structure

pdf_parser_service/
├── main.py                 # FastAPI application entry point
├── api/                    # API routes
│   ├── __init__.py
│   └── routes.py          # All API endpoints
├── parsers/               # PDF parsing modules
│   ├── __init__.py
│   ├── spans_parser.py    # Extract text spans from PDF
│   └── tables_parser.py   # Extract tables from PDF
├── rag/                   # RAG document preparation
│   ├── __init__.py
│   ├── table_row_parser.py    # Parse table rows to extract part info
│   ├── schema_index.py        # Build schema index for linking
│   └── rag_preparer.py        # Prepare RAG documents
├── models/                # Pydantic models
│   ├── __init__.py
│   └── schemas.py         # Request/response models
├── utils/                 # Utility functions
│   ├── __init__.py
│   └── bbox.py            # Bounding box coordinate normalization
├── requirements.txt       # Python dependencies
├── Dockerfile             # Docker configuration
├── cloudbuild.yaml        # GCP Cloud Build configuration
├── deploy.sh              # Deployment script
└── README.md              # Service documentation

Module Responsibilities

main.py

  • FastAPI application initialization
  • CORS middleware configuration
  • Router registration

api/routes.py

  • POST /parse: Extract text spans from PDF
  • POST /parse-table: Extract tables from PDF
  • POST /prepare-rag: Prepare RAG documents from parsed data
  • GET /: Health check endpoint

parsers/

  • spans_parser.py: Extracts text spans with coordinates from PDF pages
    • Filters spans by font (MicrosoftSansSerif) and digits
    • Normalizes coordinates to percentages (0-100)
  • tables_parser.py: Extracts tables using Camelot
    • Supports both lattice and stream flavors
    • Returns raw table data with normalized coordinates
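The span filtering described for spans_parser.py can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function name filter_spans, the span dictionary shape (text, font, bbox keys), and the reading of "digits" as "text made up only of digits" are all assumptions. Coordinate normalization is left out here so the sketch stays focused on the filter.

```python
from typing import Any, Dict, List

# Font named in the module description above.
TARGET_FONT = "MicrosoftSansSerif"


def filter_spans(spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep spans set in TARGET_FONT whose text is purely numeric.

    Each span is assumed (hypothetically) to look like:
        {"text": "12", "font": "MicrosoftSansSerif", "bbox": (x0, y0, x1, y1)}
    """
    kept = []
    for span in spans:
        text = span["text"].strip()
        # Assumption: "filter by digits" means the label is all digits.
        if span["font"] == TARGET_FONT and text.isdigit():
            kept.append({"text": text, "bbox": span["bbox"]})
    return kept


if __name__ == "__main__":
    sample = [
        {"text": " 12 ", "font": "MicrosoftSansSerif", "bbox": (0, 0, 10, 10)},
        {"text": "A1", "font": "MicrosoftSansSerif", "bbox": (0, 0, 10, 10)},
        {"text": "7", "font": "Arial", "bbox": (0, 0, 10, 10)},
    ]
    print(filter_spans(sample))
```

Stripping whitespace before the digit check keeps labels such as " 12 " that would otherwise be rejected.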

rag/

  • table_row_parser.py: Parses table rows to extract:
    • Item number (No)
    • Part number
    • Description
    • Quantity
    • Notes
  • schema_index.py: Builds index of schema locations by item number
    • Enables quick lookup of diagram coordinates
  • rag_preparer.py: Combines table and schema data into RAG documents
    • Links table entries with schema locations
    • Creates comprehensive documents for Weaviate indexing
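To illustrate how the three rag/ modules could fit together, here is a small sketch under stated assumptions. The function names, the column order (No, Part number, Description, Quantity, Notes), the rule that a data row starts with a numeric item number, and the output document shape are all hypothetical; the real modules may differ.

```python
from typing import Any, Dict, List, Optional

# Assumed column order, taken from the field list above.
COLUMNS = ("no", "part_number", "description", "quantity", "notes")


def parse_table_row(row: List[str]) -> Optional[Dict[str, str]]:
    """Map one raw table row to named fields, or None for non-data rows."""
    cells = [cell.strip() for cell in row]
    cells += [""] * (len(COLUMNS) - len(cells))  # pad short rows
    record = dict(zip(COLUMNS, cells))
    # Assumption: header and blank rows lack a numeric item number.
    return record if record["no"].isdigit() else None


def build_schema_index(spans: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group diagram locations by the item number printed on the schema."""
    index: Dict[str, List[Any]] = {}
    for span in spans:
        index.setdefault(span["text"], []).append(span["bbox"])
    return index


def prepare_rag_documents(
    rows: List[List[str]], spans: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Link each parsed table row to its schema locations."""
    index = build_schema_index(spans)
    documents = []
    for row in rows:
        part = parse_table_row(row)
        if part is None:
            continue
        documents.append({
            **part,
            "content": f"Item {part['no']}: {part['description']} "
                       f"(part {part['part_number']}, qty {part['quantity']})",
            "schema_locations": index.get(part["no"], []),
        })
    return documents
```

Keying the index by item number makes each lookup a dictionary access, which matches the "quick lookup" goal stated for schema_index.py. Rows with no matching span simply get an empty list of locations.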

models/schemas.py

  • RAGPreparationRequest: Pydantic model for RAG preparation endpoint
    • Validates input data
    • Provides example in OpenAPI schema

utils/bbox.py

  • normalize_bbox(): Normalizes bounding box coordinates
    • Converts PDF coordinates to canvas coordinates
    • Flips Y axis (PDF: bottom=0, Canvas: top=0)
    • Returns percentages (0-100)
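The normalization steps listed above can be sketched as follows. The signature and the returned dictionary shape (x, y, width, height) are assumptions made for illustration; the input is assumed to be an (x0, y0, x1, y1) box in PDF units with the origin at the bottom-left of the page and y1 above y0.

```python
from typing import Dict, Tuple


def normalize_bbox(
    bbox: Tuple[float, float, float, float],
    page_width: float,
    page_height: float,
) -> Dict[str, float]:
    """Convert a PDF-space box to canvas-space percentages (0-100)."""
    x0, y0, x1, y1 = bbox
    left = x0 / page_width * 100
    right = x1 / page_width * 100
    # Flip Y: the PDF top edge (y1) becomes the canvas top offset.
    top = (page_height - y1) / page_height * 100
    bottom = (page_height - y0) / page_height * 100
    return {
        "x": left,
        "y": top,
        "width": right - left,
        "height": bottom - top,
    }


if __name__ == "__main__":
    # A box in the top band of a 200 x 400 page.
    print(normalize_bbox((50, 300, 150, 400), 200, 400))
    # {'x': 25.0, 'y': 0.0, 'width': 50.0, 'height': 25.0}
```

Because the output is expressed in percentages, a front end can position overlays on a rendered page of any pixel size without knowing the original PDF dimensions.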

Data Flow

  1. PDF Upload → api/routes.py
  2. Parsing → parsers/spans_parser.py or parsers/tables_parser.py
  3. RAG Preparation → rag/rag_preparer.py
    • Uses rag/table_row_parser.py to parse rows
    • Uses rag/schema_index.py to link with schemas
  4. Response → JSON with RAG documents ready for Weaviate

Benefits of This Architecture

  1. Separation of Concerns: Each module has a single responsibility
  2. Testability: Modules can be tested independently
  3. Maintainability: Easy to locate and modify specific functionality
  4. Scalability: Easy to add new parsers or RAG processors
  5. Reusability: Modules can be imported and used in other projects

Adding New Features

Adding a New Parser

  1. Create new file in parsers/
  2. Implement parsing function
  3. Export in parsers/__init__.py
  4. Add route in api/routes.py

Adding a New RAG Processor

  1. Create new file in rag/
  2. Implement processing function
  3. Export in rag/__init__.py
  4. Use in rag/rag_preparer.py or create new endpoint

Adding a New Model

  1. Create new file in models/ or add to models/schemas.py
  2. Export in models/__init__.py
  3. Use in api/routes.py
