RAG Pipelines

Queue-based PDF processing service with RabbitMQ workers for parts extraction.

Repository: CT-CROP/CROP-RAG-Pipelines
Last updated: 2026-02-18
Last synced to docs: 2026-03-10

Queue-based PDF processing service for agricultural equipment manuals.

Features

  • PDF Parsing: Extract text, tables, part numbers (13+ patterns), and images
  • Queue Processing: RabbitMQ for reliable task distribution
  • Job Tracking: Redis with 7-day TTL
  • GCS Storage: Save results to Google Cloud Storage
  • Scalable: 10+ workers with automatic load balancing
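
As a rough illustration of the job-tracking feature, the sketch below writes a job status record with a 7-day expiry. The key scheme (`job:<batch_id>`), the payload fields, and the helper names are assumptions made for this example, not the service's actual schema. The `client` argument is expected to behave like a redis-py `Redis` instance, whose `set` method accepts an `ex` expiry in seconds.

```python
import json
from datetime import timedelta

# 7-day TTL, as stated in the feature list.
JOB_TTL = timedelta(days=7)


def job_key(batch_id: str) -> str:
    """Build the Redis key for a job (hypothetical key scheme)."""
    return f"job:{batch_id}"


def save_job_status(client, batch_id: str, status: str,
                    processed: int = 0, total: int = 0) -> str:
    """Store a JSON status record that expires after JOB_TTL.

    `client` is any object with a redis-py style `set(name, value, ex=...)`.
    The payload fields are illustrative only.
    """
    payload = json.dumps(
        {"batch_id": batch_id, "status": status,
         "processed": processed, "total": total}
    )
    client.set(job_key(batch_id), payload, ex=int(JOB_TTL.total_seconds()))
    return payload
```

With redis-py installed, this could be called as `save_job_status(redis.Redis(), "batch-1", "processing", 2, 10)`; Redis then drops the key on its own once the TTL elapses.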

Architecture

Client → API (8080) → RabbitMQ → Batch Worker → RabbitMQ → PDF Workers (×10) → GCS
                          ↓                                        ↓
                       Redis (job status)                   Redis (progress)
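
The two-stage fan-out in the diagram can be sketched without any infrastructure. In the simulation below, standard-library `queue.Queue` objects stand in for the two RabbitMQ queues and a plain dict stands in for Redis; the message fields and function names are assumptions for illustration, not the service's real message format.

```python
import queue
import threading


def batch_worker(batch_queue, pdf_queue, status):
    """Stage 1: split each batch message into one task per PDF."""
    while True:
        batch = batch_queue.get()
        if batch is None:  # sentinel: no more batches
            break
        status[batch["batch_id"]] = {"total": len(batch["pdfs"]), "done": 0}
        for path in batch["pdfs"]:
            pdf_queue.put({"batch_id": batch["batch_id"], "path": path})


def pdf_worker(pdf_queue, status, lock, results):
    """Stage 2: process one PDF task at a time and record progress."""
    while True:
        task = pdf_queue.get()
        if task is None:  # sentinel: shut this worker down
            break
        with lock:
            results.append(f"parsed:{task['path']}")  # placeholder for parsing
            status[task["batch_id"]]["done"] += 1


def run_demo(pdfs, n_workers=3):
    """Push one batch through both stages and return (status, results)."""
    batch_queue, pdf_queue = queue.Queue(), queue.Queue()
    status, results, lock = {}, [], threading.Lock()

    batch_queue.put({"batch_id": "demo", "pdfs": pdfs})
    batch_queue.put(None)
    batch_worker(batch_queue, pdf_queue, status)  # run stage 1 to completion

    workers = [
        threading.Thread(target=pdf_worker,
                         args=(pdf_queue, status, lock, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for _ in workers:
        pdf_queue.put(None)  # one sentinel per worker, queued after all tasks
    for w in workers:
        w.join()
    return status, sorted(results)
```

Because every PDF becomes its own queue message, idle workers simply pick up the next task, which is how a shared queue balances load across workers in this kind of design.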

Quick Start

docker-compose up --build
# API: http://localhost:8080
# RabbitMQ UI: http://localhost:15672 (admin/admin)

API Endpoints

POST /api/upload-pdf                    # Upload single PDF
POST /api/process-bucket-pdfs           # Batch process from GCS
GET /api/jobs/{batch_id}                # Job status
GET /api/jobs                           # List all jobs
GET /health                             # Health check
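
A minimal client sketch for the read-only endpoints, using only the Python standard library. The base URL comes from the Quick Start; the assumption that responses are JSON, and the helper names, are not confirmed by this page.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8080"  # API address from the Quick Start


def job_status_url(batch_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /api/jobs/{batch_id}, escaping the ID."""
    return f"{base_url.rstrip('/')}/api/jobs/{quote(batch_id, safe='')}"


def get_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the stack running, `get_json(f"{BASE_URL}/health")` checks the service and `get_json(job_status_url("<batch_id>"))` fetches one job's status; the shape of the returned data depends on the service.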

Tech Stack

FastAPI, RabbitMQ, Redis, PyMuPDF, Pillow, Google Cloud Storage, Docker Compose.
