GCS PDF Links Generation
Overview
PDF files are stored in the GCS bucket crop-documents. The gcs_pdf_utils.py utility generates correct links to them.
Configuration
Environment Variables
# GCS Bucket
GCS_BUCKET_NAME=crop-documents # Default: crop-documents
GCS_PREFIX=manuals/ # Optional prefix for PDF paths
# Default PDF filename (fallback if not in metadata)
DEFAULT_PDF_FILENAME=F5-540-F5-540C.pdf
Usage
1. Generate PDF URL from Metadata
from gcs_pdf_utils import get_pdf_url_from_metadata
metadata = {
    "pdf_filename": "F5-540-F5-540C.pdf",
    "page": 6,
    "gcs_path": "manuals/F5-540-F5-540C.pdf"
}
pdf_url = get_pdf_url_from_metadata(metadata)
# Returns: https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6
2. Generate PDF URL Directly
from gcs_pdf_utils import get_gcs_pdf_url
pdf_url = get_gcs_pdf_url(
    pdf_filename="F5-540-F5-540C.pdf",
    page=6,
    gcs_bucket="crop-documents",
    gcs_prefix="manuals/"
)
# Returns: https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6
3. In Agents
Agents automatically use get_pdf_url_from_metadata() to generate PDF links:
# In agents.py
from gcs_pdf_utils import get_pdf_url_from_metadata
# In extract_parts_from_rag()
pdf_link = get_pdf_url_from_metadata(metadata)
URL Format
Public GCS URL (if bucket is public)
https://storage.googleapis.com/{bucket}/{path}#page={page}
Example:
https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6
GCS URL (for internal use)
gs://{bucket}/{path}#page={page}
Example:
gs://crop-documents/manuals/F5-540-F5-540C.pdf#page=6
Metadata Structure
Document metadata should contain:
metadata = {
    "pdf_filename": "F5-540-F5-540C.pdf",  # PDF filename
    "page": 6,  # Page number (0-indexed or 1-indexed)
    "gcs_path": "manuals/F5-540-F5-540C.pdf",  # Path in GCS (optional)
    "source_file": "F5-540-F5-540C.pdf",  # Alternative field name
    "pdf_file": "F5-540-F5-540C.pdf",  # Alternative field name
    "file_name": "F5-540-F5-540C.pdf"  # Alternative field name
}
Fallback Strategy
If the PDF filename is not found in metadata:
- Use DEFAULT_PDF_FILENAME from env variables
- If not set, use F5-540-F5-540C.pdf
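The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the actual code in gcs_pdf_utils.py: the helper name resolve_pdf_filename is hypothetical, and the order in which the alternative field names are tried is an assumption.

```python
import os

# Field names come from the metadata structure; the order they are tried in is an assumption.
FILENAME_FIELDS = ("pdf_filename", "source_file", "pdf_file", "file_name")
HARDCODED_DEFAULT = "F5-540-F5-540C.pdf"


def resolve_pdf_filename(metadata):
    """Hypothetical helper: pick a PDF filename from metadata, with fallbacks."""
    for field in FILENAME_FIELDS:
        value = metadata.get(field)
        if value:
            return value
    # Nothing usable in metadata: try the env variable, then the hard-coded default.
    return os.environ.get("DEFAULT_PDF_FILENAME") or HARDCODED_DEFAULT
```

Treating an empty string the same as a missing field keeps a blank metadata value from producing a link with no filename.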
Examples
Example 1: From Weaviate Metadata
# Document from Weaviate
doc = vectorstore.similarity_search("part 10")[0]
metadata = doc.metadata
# Generate PDF link
pdf_link = get_pdf_url_from_metadata(metadata)
# https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6
Example 2: From Elasticsearch
# Result from Elasticsearch
es_result = {
    "part_number": "10",
    "page": 6,
    "pdf_filename": "F5-540-F5-540C.pdf"
}
# Generate PDF link
pdf_link = get_pdf_url_from_metadata(es_result)
# https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6
Example 3: With Different Prefixes
# PDF in bucket root
pdf_link = get_gcs_pdf_url(
    pdf_filename="manual.pdf",
    page=1,
    gcs_bucket="crop-documents",
    gcs_prefix=""  # Empty prefix
)
# https://storage.googleapis.com/crop-documents/manual.pdf#page=1
# PDF in subfolder
pdf_link = get_gcs_pdf_url(
    pdf_filename="manual.pdf",
    page=1,
    gcs_bucket="crop-documents",
    gcs_prefix="manuals/mchale/"  # With prefix
)
# https://storage.googleapis.com/crop-documents/manuals/mchale/manual.pdf#page=1
Public Access Configuration
For public URLs to work, the bucket must be publicly readable. Otherwise, use signed URLs.
Make Bucket Public
gsutil iam ch allUsers:objectViewer gs://crop-documents
Or Use Signed URLs
from google.cloud import storage
from datetime import timedelta
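def generate_signed_url(bucket_name, blob_name, expiration_minutes=60):
    """Return a time-limited GET URL for an object in a private bucket."""
    # Signing requires credentials that carry a private key (e.g. a service account).
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    url = blob.generate_signed_url(
        expiration=timedelta(minutes=expiration_minutes),
        method="GET"
    )
    return url
Troubleshooting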
Issue: PDF link doesn't work
Solution:
- Check that bucket exists: gsutil ls gs://crop-documents
- Check that file exists: gsutil ls gs://crop-documents/manuals/F5-540-F5-540C.pdf
- Check bucket access permissions
- Verify GCS_BUCKET_NAME is set correctly
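When starting from a broken link rather than from the bucket and path, it can help to split the link back into its parts and feed them to the gsutil checks. The sketch below assumes the public URL format https://storage.googleapis.com/{bucket}/{path}#page={page}. The helper name split_public_gcs_url is hypothetical and is not part of gcs_pdf_utils.py.

```python
from urllib.parse import urlsplit


def split_public_gcs_url(url):
    """Hypothetical helper: split a public GCS PDF link into (bucket, object_path, page)."""
    parts = urlsplit(url)
    if parts.netloc != "storage.googleapis.com":
        raise ValueError(f"not a storage.googleapis.com URL: {url}")
    # The first path segment is the bucket; the rest is the object path.
    bucket, _, object_path = parts.path.lstrip("/").partition("/")
    page = None
    if parts.fragment.startswith("page="):
        page = int(parts.fragment[len("page="):])
    return bucket, object_path, page


bucket, object_path, page = split_public_gcs_url(
    "https://storage.googleapis.com/crop-documents/manuals/F5-540-F5-540C.pdf#page=6"
)
print(f"gsutil ls gs://{bucket}/{object_path}")
# prints: gsutil ls gs://crop-documents/manuals/F5-540-F5-540C.pdf
```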
Issue: Wrong page number
Solution:
- Function automatically converts page number to 1-indexed for URL
- Ensure page number is correct in metadata
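To reason about an off-by-one page, it may help to see the conversion written out. The source does not say how the utility decides whether a stored page is 0-indexed, so the sketch below makes that an explicit zero_indexed flag. Both the flag and the helper name build_pdf_url are assumptions for illustration, not the real signature.

```python
def build_pdf_url(bucket, path, page, zero_indexed=False):
    """Hypothetical helper: public GCS URL with a 1-indexed #page fragment."""
    # PDF viewers count pages from 1, so shift 0-indexed values up by one.
    url_page = page + 1 if zero_indexed else page
    if url_page < 1:
        raise ValueError(f"page must resolve to 1 or greater, got {url_page}")
    return f"https://storage.googleapis.com/{bucket}/{path.lstrip('/')}#page={url_page}"


# Both calls point at the same page of the document.
print(build_pdf_url("crop-documents", "manuals/F5-540-F5-540C.pdf", 6))
print(build_pdf_url("crop-documents", "manuals/F5-540-F5-540C.pdf", 5, zero_indexed=True))
```

If links consistently open one page early or late, the indexing convention of the metadata source is the first thing to check.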
Issue: PDF filename not found
Solution:
- Check document metadata
- Set DEFAULT_PDF_FILENAME in env variables
- Ensure pdf_filename field is present in metadata