GCS PDF Links Generation

Overview

PDF files are stored in the GCS bucket crop-documents. The gcs_pdf_utils.py utility generates correct links to them, including the page anchor.

Configuration

Environment Variables

# GCS Bucket
GCS_BUCKET_NAME=crop-documents  # Default: crop-documents
GCS_PREFIX=manuals/  # Optional prefix for PDF paths

# Default PDF filename (fallback if not in metadata)
DEFAULT_PDF_FILENAME=F5-540-F5-540C.pdf
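A minimal sketch of how these variables might be read, assuming they are looked up in the process environment with the defaults listed above. The helper name load_gcs_config is hypothetical and not part of gcs_pdf_utils.py, and the empty default for GCS_PREFIX is an assumption.

```python
import os


def load_gcs_config(env=None):
    """Read the GCS settings from an environment mapping (hypothetical helper)."""
    # Passing a plain dict keeps the function easy to test; os.environ is the default.
    env = os.environ if env is None else env
    return {
        "bucket": env.get("GCS_BUCKET_NAME", "crop-documents"),
        # Assumption: no prefix is applied when GCS_PREFIX is unset.
        "prefix": env.get("GCS_PREFIX", ""),
        "default_pdf": env.get("DEFAULT_PDF_FILENAME", "F5-540-F5-540C.pdf"),
    }
```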

Usage

1. Generate PDF URL from Metadata

from gcs_pdf_utils import get_pdf_url_from_metadata

metadata = {
    "pdf_filename": "F5-540-F5-540C.pdf",
    "page": 6,
    "gcs_path": "manuals/F5-540-F5-540C.pdf"
}

pdf_url = get_pdf_url_from_metadata(metadata)
# Returns: https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6

2. Generate PDF URL Directly

from gcs_pdf_utils import get_gcs_pdf_url

pdf_url = get_gcs_pdf_url(
    pdf_filename="F5-540-F5-540C.pdf",
    page=6,
    gcs_bucket="crop-documents",
    gcs_prefix="manuals/"
)
# Returns: https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6

3. In Agents

Agents automatically use get_pdf_url_from_metadata() to generate PDF links:

# In agents.py
from gcs_pdf_utils import get_pdf_url_from_metadata

# In extract_parts_from_rag()
pdf_link = get_pdf_url_from_metadata(metadata)

URL Format

Public GCS URL (if bucket is public)

https://storage.googleapis.com/{bucket}/{path}#page={page}

Example:

https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6
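
The format above can be assembled with a few lines of standard-library Python. This is only an illustrative sketch: build_public_pdf_url is a hypothetical name, and percent-encoding the object path is an assumption about how names with spaces should be handled, not a description of what gcs_pdf_utils.py does.

```python
from urllib.parse import quote


def build_public_pdf_url(bucket, path, page=None):
    """Build https://storage.googleapis.com/{bucket}/{path}#page={page} (hypothetical helper)."""
    # Percent-encode the object path but keep "/" so folder separators survive.
    encoded_path = quote(path.lstrip("/"), safe="/")
    url = f"https://storage.googleapis.com/{bucket}/{encoded_path}"
    # The page anchor is optional; without it the PDF opens at its first page.
    if page is not None:
        url += f"#page={page}"
    return url


print(build_public_pdf_url("crop-documents", "manuals/F5-540-F5-540C.pdf", 6))
# https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6
```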

GCS URL (for internal use)

gs://{bucket}/{path}#page={page}

Example:

gs://crop-documents/manuals/F5-540-F5-540C.pdf#page=6
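
When only the internal gs:// form is at hand, it can be mapped to the public form by splitting off the bucket and keeping the page anchor. The sketch below assumes the two formats shown in this section; gs_to_public_url is a hypothetical helper, not a function exported by gcs_pdf_utils.py.

```python
def gs_to_public_url(gs_url):
    """Convert gs://{bucket}/{path}#page={page} to the public HTTPS form (hypothetical helper)."""
    scheme = "gs://"
    if not gs_url.startswith(scheme):
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    rest = gs_url[len(scheme):]
    # Separate the optional "#page=N" anchor from the object location.
    location, hash_sep, fragment = rest.partition("#")
    # The first path segment is the bucket; everything after it is the object path.
    bucket, _, path = location.partition("/")
    return f"https://storage.googleapis.com/{bucket}/{path}{hash_sep}{fragment}"
```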

Metadata Structure

Document metadata should contain:

metadata = {
    "pdf_filename": "F5-540-F5-540C.pdf",  # PDF filename
    "page": 6,  # Page number (0-indexed or 1-indexed)
    "gcs_path": "manuals/F5-540-F5-540C.pdf",  # Path in GCS (optional)
    "source_file": "F5-540-F5-540C.pdf",  # Alternative field name
    "pdf_file": "F5-540-F5-540C.pdf",  # Alternative field name
    "file_name": "F5-540-F5-540C.pdf"  # Alternative field name
}
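
Because the filename can arrive under several field names, a lookup along the following lines is one way to resolve it. The order in which the fields are tried is an assumption, and find_pdf_filename is a hypothetical helper rather than the utility's actual code.

```python
# Field names from the metadata structure above; the priority order is an assumption.
FILENAME_KEYS = ("pdf_filename", "source_file", "pdf_file", "file_name")


def find_pdf_filename(metadata):
    """Return the first non-empty filename field, or None if none is present."""
    for key in FILENAME_KEYS:
        value = metadata.get(key)
        if value:
            return value
    return None
```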

Fallback Strategy

If the PDF filename is not found in the metadata:

  1. Use DEFAULT_PDF_FILENAME from the environment variables
  2. If that is not set, use F5-540-F5-540C.pdf
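
The two fallback steps can be sketched as a single function. This is an illustration under the assumptions stated here: fallback_pdf_filename is a hypothetical name, and an empty DEFAULT_PDF_FILENAME is treated the same as an unset one.

```python
import os

# Last-resort filename named in the fallback strategy above.
HARDCODED_DEFAULT = "F5-540-F5-540C.pdf"


def fallback_pdf_filename(filename, env=None):
    """Return the metadata filename if present, else the configured or hardcoded default."""
    env = os.environ if env is None else env
    if filename:
        return filename
    # Step 1: environment variable; step 2: hardcoded default.
    return env.get("DEFAULT_PDF_FILENAME") or HARDCODED_DEFAULT
```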

Examples

Example 1: From Weaviate Metadata

# Document from Weaviate
doc = vectorstore.similarity_search("part 10")[0]
metadata = doc.metadata

# Generate PDF link
pdf_link = get_pdf_url_from_metadata(metadata)
# https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6

Example 2: From Elasticsearch

# Result from Elasticsearch
es_result = {
    "part_number": "10",
    "page": 6,
    "pdf_filename": "F5-540-F5-540C.pdf"
}

# Generate PDF link
pdf_link = get_pdf_url_from_metadata(es_result)
# https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6

Example 3: With Different Prefixes

# PDF in bucket root
pdf_link = get_gcs_pdf_url(
    pdf_filename="manual.pdf",
    page=1,
    gcs_bucket="crop-documents",
    gcs_prefix=""  # Empty prefix
)
# https://storage.googleapis.com/crop-documents/manual.pdf#page=1

# PDF in subfolder
pdf_link = get_gcs_pdf_url(
    pdf_filename="manual.pdf",
    page=1,
    gcs_bucket="crop-documents",
    gcs_prefix="manuals/mchale/"  # With prefix
)
# https://storage.googleapis.com/crop-documents/manuals/mchale/manual.pdf#page=1
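
Both cases come down to joining an optional prefix with the filename without producing a leading or doubled slash. A minimal sketch of that join, assuming prefixes may be passed with or without a trailing slash (join_gcs_path is a hypothetical helper):

```python
def join_gcs_path(prefix, filename):
    """Join an optional GCS prefix and a filename into an object path."""
    # Normalize so "manuals", "manuals/" and "/manuals/" all behave the same.
    prefix = prefix.strip("/")
    filename = filename.lstrip("/")
    # An empty prefix means the object sits in the bucket root.
    return f"{prefix}/{filename}" if prefix else filename
```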

Public Access Configuration

Public URLs work only if the bucket is publicly readable. If it is not, use signed URLs instead.

Make Bucket Public

gsutil iam ch allUsers:objectViewer gs://crop-documents

Or Use Signed URLs

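from datetime import timedelta

from google.cloud import storage


def generate_signed_url(bucket_name, blob_name, expiration_minutes=60):
    """Return a time-limited GET URL for a private GCS object."""
    # The client must run with credentials that are able to sign URLs.
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    # Use V4 signing; the URL stops working after the expiration time.
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=expiration_minutes),
        method="GET",
    )
    return url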
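
A signed URL can still carry the page anchor, because the part after # is handled by the browser and is never sent to the server, so it does not affect the signature. The helper below is a hypothetical sketch, and the URL in the usage line is a shortened placeholder rather than a real signed URL.

```python
def with_page_fragment(url, page):
    """Append (or replace) a #page=N anchor on any URL, signed or not."""
    # Drop an existing fragment so the result never contains two anchors.
    base, _, _ = url.partition("#")
    return f"{base}#page={page}"


# Placeholder standing in for the value returned by generate_signed_url().
signed = "https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf?X-Goog-Signature=abc"
print(with_page_fragment(signed, 6))
```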

Troubleshooting

Issue: PDF link does not open

Solution:

  1. Check that bucket exists: gsutil ls gs://crop-documents
  2. Check that file exists: gsutil ls gs://crop-documents/manuals/F5-540-F5-540C.pdf
  3. Check bucket access permissions
  4. Verify GCS_BUCKET_NAME is set correctly

Issue: Wrong page number

Solution:

  • The function automatically converts the page number to a 1-indexed value for the URL
  • Ensure the page number stored in the metadata is correct
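
To check a suspicious page number by hand, the conversion can be reproduced in isolation. This sketch is an assumption about the conversion, not the utility's code: to_url_page and its zero_indexed flag are hypothetical, and whether your metadata is 0-indexed or 1-indexed has to be confirmed against the indexing pipeline.

```python
def to_url_page(page, zero_indexed=True):
    """Convert a metadata page number to the 1-indexed value used in #page=N."""
    page = int(page)
    # PDF viewers count pages from 1, so 0-indexed values are shifted by one.
    url_page = page + 1 if zero_indexed else page
    if url_page < 1:
        raise ValueError(f"invalid page number: {page}")
    return url_page
```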

Issue: PDF filename not found

Solution:

  1. Check document metadata
  2. Set DEFAULT_PDF_FILENAME in env variables
  3. Ensure pdf_filename field is present in metadata
