Weaviate Service

FastAPI service for storing and retrieving RAG documents in a Weaviate vector database.

Overview

This service receives RAG documents from the /prepare-rag endpoint of pdf-parser-service, generates embeddings with the BGE-Large model, and stores them in Weaviate for semantic search.

Features

  • Store RAG Documents: Receive documents from PDF parser and store in Weaviate
  • Semantic Search: Search for similar parts using natural language queries
  • Embedding Generation: Automatic embedding generation using the BGE-Large model
  • Metadata Filtering: Filter results by part number, source PDF, page, etc.

Architecture

pdf-parser-service (/prepare-rag)
    ↓ JSON documents
weaviate-service (/store)
    ↓ Generate embeddings
    ↓ Store in Weaviate
Weaviate Vector Database

Local Development

Prerequisites

  • Python 3.11+
  • Weaviate instance running (local or remote)
  • BGE-Large model (downloaded automatically on first run)

Installation

cd weaviate_service
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Environment Variables

Create a .env file:

For Weaviate Cloud (recommended):

WEAVIATE_URL=https://your-cluster.c0.us-east1.gcp.weaviate.cloud
WEAVIATE_API_KEY=your-weaviate-api-key-here
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_DEVICE=cpu  # or cuda if GPU available
PORT=8002

For Self-hosted Weaviate:

WEAVIATE_URL=http://localhost:8080
WEAVIATE_API_KEY=  # Optional, if Weaviate requires authentication
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_DEVICE=cpu  # or cuda if GPU available
PORT=8002

Running the Service

uvicorn main:app --reload --port 8002

The service will be available at http://localhost:8002

API documentation: http://localhost:8002/docs
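To verify the service is up, hit the health endpoint (assuming the default port 8002):

curl http://localhost:8002/health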

API Endpoints

POST /store

Store RAG documents in Weaviate using batch API (optimized for performance).

⚠️ Important: Always send all documents in one request!

The service uses Weaviate's batch API, which is 10-100x faster than storing documents one by one.

Request:

{
  "documents": [
    {
      "content": "Item Number: 1\nPart Number: ACH02226\n...",
      "metadata": {
        "item_number": "1",
        "part_number": "ACH02226",
        "description": "REEL BEARING FLOATING ASSY",
        "quantity": "5",
        "page": 0,
        "source_pdf": "F5-540-F5-540C.pdf",
        "bbox": [10.44, 31.23, 11.68, 33.09]
      }
    },
    {
      "content": "Item Number: 2\nPart Number: ACN00046\n...",
      "metadata": {...}
    }
  ],
  "source_pdf": "F5-540-F5-540C.pdf"
}

Response:

{
  "success": true,
  "stored_count": 2,
  "failed_count": 0,
  "errors": []
}

Performance:

  • Batch processing: All documents processed together
  • Batch size: 100 documents per batch
  • Parallel workers: 2 concurrent workers
  • Speed: 10-100x faster than individual storage
  • Automatic fallback: Falls back to individual storage if batch fails

Best Practice:

  • Send all documents from one PDF in a single request
  • Typical PDF with 40-100 parts: send all at once
  • Don't split them into multiple requests; it's much slower (see the example below)
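A minimal sketch of a batch /store call with Python's requests library; the payload shape follows the example above, and manual.pdf is a placeholder file name:

import requests

# Collect every document for the PDF first, then send them in a single request
documents = [
    {
        "content": "Item Number: 1\nPart Number: ACH02226\n...",
        "metadata": {"part_number": "ACH02226", "page": 0, "source_pdf": "manual.pdf"},
    },
    # ... all remaining documents from the same PDF
]

response = requests.post(
    "http://localhost:8002/store",
    json={"documents": documents, "source_pdf": "manual.pdf"},
)
result = response.json()
print(f"stored={result['stored_count']} failed={result['failed_count']}")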

POST /search

Semantic search in Weaviate (full search with all options).

Request:

{
  "query": "bearing for reel",
  "limit": 10,
  "where": {
    "path": ["source_pdf"],
    "operator": "Equal",
    "valueString": "F5-540-F5-540C.pdf"
  },
  "score_threshold": 0.7
}

Response:

{
  "query": "bearing for reel",
  "results": [
    {
      "content": "Item Number: 1\nPart Number: ACH02226\n...",
      "metadata": {...},
      "score": 0.85
    }
  ],
  "count": 1
}
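For illustration, the same search issued from Python (a sketch; the where filter mirrors the request above):

import requests

# Semantic search restricted to a single source PDF
response = requests.post(
    "http://localhost:8002/search",
    json={
        "query": "bearing for reel",
        "limit": 10,
        "where": {
            "path": ["source_pdf"],
            "operator": "Equal",
            "valueString": "F5-540-F5-540C.pdf",
        },
        "score_threshold": 0.7,
    },
)
for hit in response.json()["results"]:
    print(hit["score"], hit["metadata"]["part_number"])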

POST /retrieve (Recommended for LLM Agents)

RAG retrieval endpoint optimized for LLM agents. Returns documents in a format suitable for context injection (see the example below).

Request:

{
  "query": "bearing for reel",
  "limit": 5,
  "source_pdf": "F5-540-F5-540C.pdf",
  "part_number": null,
  "page": null,
  "score_threshold": 0.7,
  "include_metadata": true
}

Response:

{
  "query": "bearing for reel",
  "documents": [
    {
      "content": "Item Number: 1\nPart Number: ACH02226\nDescription: REEL BEARING FLOATING ASSY\n...",
      "metadata": {
        "item_number": "1",
        "part_number": "ACH02226",
        "description": "REEL BEARING FLOATING ASSY",
        "quantity": "5",
        "page": 0,
        "source_pdf": "F5-540-F5-540C.pdf",
        "bbox": [10.44, 31.23, 11.68, 33.09]
      },
      "score": 0.85
    }
  ],
  "count": 1,
  "context": "[1] Part: ACH02226 | REEL BEARING FLOATING ASSY | Item #1 | Item Number: 1\nPart Number: ACH02226\n...",
  "filters": {
    "source_pdf": "F5-540-F5-540C.pdf",
    "part_number": null,
    "page": null
  }
}

Key Features:

  • ✅ Optimized default limit (5 documents) for RAG
  • ✅ Easy filtering by source_pdf, part_number, page
  • ✅ Pre-formatted context string ready for LLM injection
  • ✅ Structured response format
  • ✅ Optional metadata inclusion
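A sketch of how an agent might call /retrieve and inject the pre-formatted context into a prompt (the prompt template itself is illustrative):

import requests

response = requests.post(
    "http://localhost:8002/retrieve",
    json={"query": "bearing for reel", "limit": 5, "source_pdf": "F5-540-F5-540C.pdf"},
)
rag = response.json()

# The "context" field is already formatted for LLM injection
prompt = f"Answer using only these parts:\n{rag['context']}\n\nQuestion: bearing for reel"
print(prompt)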

GET /health

Health check endpoint.

GET /schema

Get Weaviate schema.

DELETE /schema

Delete Weaviate schema (use with caution!).
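Quick examples for the schema endpoints (again assuming localhost:8002):

curl http://localhost:8002/schema            # inspect the current schema
curl -X DELETE http://localhost:8002/schema  # destructive: drops the schema and all stored documents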

Integration with PDF Parser Service and LLM Agents

Workflow

  1. Parse PDF: pdf-parser-service /parse and /parse-table
  2. Prepare RAG: pdf-parser-service /prepare-rag
  3. Store in Weaviate: weaviate-service /store
  4. Retrieve for LLM: weaviate-service /retrieve (used by LLM agents)

Example Integration

import requests

# Step 1: Parse PDF
pdf_response = requests.post(
    "http://localhost:8000/parse",
    files={"file": open("manual.pdf", "rb")}
)
schema_data = pdf_response.json()

table_response = requests.post(
    "http://localhost:8000/parse-table",
    files={"file": open("manual.pdf", "rb")}
)
table_data = table_response.json()

# Step 2: Prepare RAG
rag_response = requests.post(
    "http://localhost:8000/prepare-rag",
    json={
        "tables": table_data,
        "schemas": schema_data,
        "source_pdf": "manual.pdf"
    }
)
rag_documents = rag_response.json()

# Step 3: Store in Weaviate
store_response = requests.post(
    "http://localhost:8002/store",
    json={
        "documents": rag_documents["documents"],
        "source_pdf": "manual.pdf"
    }
)
print(store_response.json())

# Step 4: Retrieve for LLM Agent
retrieve_response = requests.post(
    "http://localhost:8002/retrieve",
    json={
        "query": "bearing for reel",
        "limit": 5,
        "source_pdf": "manual.pdf"
    }
)
rag_context = retrieve_response.json()["context"]
# Use rag_context in LLM prompt

Deployment to GCP

Prerequisites

  1. GCP project with billing enabled
  2. gcloud CLI installed and authenticated
  3. Weaviate instance deployed (see deployment section below)
  4. Environment variables in .env.deploy (project root):
    • PROJECT_ID
    • REGION
    • WEAVIATE_URL
    • WEAVIATE_API_KEY (if required)

Deploy to Cloud Run

cd weaviate_service
chmod +x deploy.sh
./deploy.sh

Environment Variables for Cloud Run

Add to .env.deploy (project root):

# Weaviate Service
_WEAVIATE_URL=https://your-weaviate-instance.run.app
_WEAVIATE_API_KEY=your-api-key  # Optional
_EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
_EMBEDDING_DEVICE=cpu

Update cloudbuild.yaml with these substitutions.
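If deploy.sh wraps Cloud Build, the substitutions can also be passed on the command line; a sketch, assuming the build config lives in cloudbuild.yaml:

gcloud builds submit \
  --config cloudbuild.yaml \
  --substitutions=_WEAVIATE_URL=https://your-weaviate-instance.run.app,_EMBEDDING_MODEL=BAAI/bge-large-en-v1.5,_EMBEDDING_DEVICE=cpu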

Weaviate Schema

The service automatically creates a schema with the following properties (an equivalent class definition is sketched after the list):

  • content (text): Document content for semantic search
  • item_number (string): Item number in diagram
  • part_number (string): Part number for exact search
  • description (text): Part description
  • quantity (string): Quantity
  • notes (text): Additional notes
  • page (int): Page number
  • source_pdf (string): Source PDF file name
  • type (string): Document type
  • bbox (number[]): Bounding box coordinates
  • table_bbox (number[]): Table bounding box
  • schema_location_count (int): Number of schema locations
  • schema_locations (object): Schema locations (stored as JSON string)
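For reference, an equivalent (abridged) class definition with the v3 weaviate-client might look like the sketch below; the class name RagDocument is an assumption, since the service creates the schema itself:

import weaviate

client = weaviate.Client("http://localhost:8080")

# Vectors are computed by the service with BGE-Large, so no server-side vectorizer
client.schema.create_class({
    "class": "RagDocument",  # assumed class name
    "vectorizer": "none",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "part_number", "dataType": ["string"]},
        {"name": "page", "dataType": ["int"]},
        {"name": "source_pdf", "dataType": ["string"]},
        {"name": "bbox", "dataType": ["number[]"]},
    ],
})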

Performance Considerations

Batch vs Individual Storage

✅ Always use batch storage (send all documents in one request):

  • 10-100x faster than individual storage
  • Processes 100 documents per batch
  • Uses 2 parallel workers (see the sketch below)
  • Automatic fallback if batch fails

❌ Don't send documents one by one:

  • Much slower (each document = separate HTTP request)
  • Higher latency
  • More network overhead
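Internally, the batching described above maps onto the v3 weaviate-client roughly as follows (a sketch; the class name RagDocument and the placeholder data are assumptions):

import weaviate

client = weaviate.Client("http://localhost:8080")

# Placeholder data; in the service these come from /store payloads and BGE-Large
documents = [{"content": "Part Number: ACH02226", "metadata": {"part_number": "ACH02226"}}]
embeddings = [[0.0] * 1024]  # BGE-Large vectors are 1024-dimensional

# 100 objects per batch, 2 parallel workers, matching the numbers above
client.batch.configure(batch_size=100, num_workers=2)

with client.batch as batch:
    for doc, vector in zip(documents, embeddings):
        batch.add_data_object(
            data_object={"content": doc["content"], **doc["metadata"]},
            class_name="RagDocument",  # assumed class name
            vector=vector,             # precomputed embedding
        )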

Other Performance Tips

  • Embedding Generation: The first request may be slow while the model loads (~30s)
  • Batch Embeddings: Documents are embedded in batches of 32 (sketched after this list)
  • Memory: BGE-Large model requires ~1.5GB RAM
  • GPU: Set EMBEDDING_DEVICE=cuda for faster embeddings (if GPU available)
  • Typical Performance:
    • 100 documents: ~5-10 seconds (with embeddings)
    • 1000 documents: ~30-60 seconds (with embeddings)
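The embedding step likely corresponds to a sentence-transformers call along these lines (a sketch; batch size 32 and L2 normalization follow common BGE usage):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cpu")

texts = ["Item Number: 1\nPart Number: ACH02226"]
# 32 texts per forward pass; normalized vectors suit cosine similarity search
embeddings = model.encode(texts, batch_size=32, normalize_embeddings=True)
print(embeddings.shape)  # (1, 1024)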

Troubleshooting

Connection Errors

  • Verify WEAVIATE_URL is correct
  • Check that the Weaviate instance is running (a quick test follows below)
  • Verify the API key if authentication is required
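A quick connectivity test with the v3 weaviate-client (a sketch; substitute your URL and key):

import weaviate

client = weaviate.Client(
    url="http://localhost:8080",
    auth_client_secret=weaviate.AuthApiKey(api_key="your-api-key"),  # omit if no auth
)
print(client.is_ready())  # True if the instance is reachable and live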

Embedding Model Errors

  • Ensure sufficient disk space for model download (~1.5GB)
  • Check internet connection for model download
  • Verify EMBEDDING_DEVICE matches available hardware

Storage Errors

  • Verify Weaviate schema is created
  • Check Weaviate instance has sufficient storage
  • Review Weaviate logs for detailed errors
