PDF Data Migration Guide
Overview
This guide covers the integration of PDF documents (manuals, datasheets, certifications) into the Media Coverage API.
Current Status: Phase 1 (Schema Ready, Zero Data)
Future Status: Phase 2 (Data Integration Complete)
Breaking Changes: NONE - Frontend code continues to work, counts simply populate from zero to actual values.
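The response shape itself does not change between phases. Below is a minimal TypeScript sketch of the documents portion of the coverage response, derived from the JSON examples in this guide (the actual exported type may be named differently):
interface DocumentCoverage {
  coverage: {
    count: number;      // 0 in Phase 1, actual value in Phase 2
    percentage: number; // 0 in Phase 1, e.g. 49.39 in Phase 2
  };
  byType: {
    manuals: number;
    datasheets: number;
    certifications: number;
  };
}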
Current State (Phase 1)
API Behavior Today
The Media Coverage API returns PDF fields with count = 0:
curl http://localhost:3005/api/health/media/coverage | jq '.data.documents'
Response:
{
"coverage": {
"count": 0,
"percentage": 0
},
"byType": {
"manuals": 0,
"datasheets": 0,
"certifications": 0
}
}
MongoDB Schema (Ready)
The nh_unified collection schema includes PDF document fields:
{
_id: ObjectId("..."),
partNumber: "87840296",
sku: "117-2295-001",
title: "Hydraulic Filter Element",
// ... other fields ...
documents: {
manuals: [
{
type: "installation" | "service" | "parts" | "operator",
title: "Installation Guide - Hydraulic Filter",
url: "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
language: "en",
pageCount: 12,
fileSize: 2457600, // bytes
uploadedAt: ISODate("2025-11-17T10:00:00Z"),
metadata: {
version: "1.0",
author: "New Holland Engineering",
tags: ["hydraulic", "filter", "installation"]
}
}
],
datasheets: [
{
type: "specifications" | "dimensions" | "compatibility",
title: "Technical Specifications",
url: "https://storage.googleapis.com/crop_docs/newholland/datasheets/87840296_specs.pdf",
language: "en",
pageCount: 4,
fileSize: 524288
}
],
certifications: [
{
type: "safety" | "quality" | "environmental",
title: "CE Certification",
url: "https://storage.googleapis.com/crop_docs/newholland/certs/87840296_ce.pdf",
issuer: "TUV Rheinland",
issueDate: ISODate("2024-06-15T00:00:00Z"),
validUntil: ISODate("2026-12-31T23:59:59Z"),
certificationNumber: "CE-2024-NHL-87840296"
}
]
}
}
Key Points:
- Schema exists, but the documents field is currently undefined/null for all parts
- API gracefully handles the missing field (returns 0 counts)
- No code changes needed when data arrives - automatic detection
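As an illustration of that graceful handling, here is a minimal sketch of a count query in which a missing, null, or empty documents field simply contributes nothing (not the actual API implementation, just the idea):
import { MongoClient } from 'mongodb';

// Parts whose documents.manuals array is missing, null, or empty are not
// matched, so this count stays at 0 until Phase 2 data arrives.
async function countPartsWithManuals(uri: string): Promise<number> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    return await client
      .db('crop')
      .collection('nh_unified')
      .countDocuments({ 'documents.manuals.0': { $exists: true } });
  } finally {
    await client.close();
  }
}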
Migration Timeline
Phase 1: Schema Ready (Current - Completed ✅)
Status: Complete
Deliverables:
- MongoDB schema includes documents object
- API endpoints return PDF fields with zero counts
- Frontend components handle zero state gracefully
- OpenAPI spec documents PDF structures
- TypeScript types define document interfaces
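For reference, a hedged sketch of what those document interfaces could look like, mirroring the nh_unified schema above (type names here are illustrative, not the actual exported types):
interface ManualDocument {
  type: 'installation' | 'service' | 'parts' | 'operator';
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number; // bytes
  uploadedAt: Date;
  metadata?: { version?: string; author?: string; tags?: string[] };
}

interface DatasheetDocument {
  type: 'specifications' | 'dimensions' | 'compatibility';
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number;
}

interface CertificationDocument {
  type: 'safety' | 'quality' | 'environmental';
  title: string;
  url: string;
  issuer: string;
  issueDate: Date;
  validUntil: Date;
  certificationNumber: string;
}

interface PartDocuments {
  manuals?: ManualDocument[];
  datasheets?: DatasheetDocument[];
  certifications?: CertificationDocument[];
}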
Verification:
# API returns PDF fields
curl http://localhost:3005/api/health/media/coverage | \
jq '.data.documents' && echo "✅ PDF fields present"
# MongoDB schema allows documents
mongo "$MONGODB_URI" --eval '
db.nh_unified.findOne({}, {documents: 1})
' && echo "✅ Schema ready"
Phase 2: Data Integration (Future)
Trigger: When PDF files become available (scanned, uploaded to GCS)
Duration: ~2-4 weeks (data collection + processing)
Steps:
Step 1: Source PDF Files
Option A: Scan GCS Bucket
# Scan existing GCS bucket for PDF files
bun scripts/scan-gcs-documents.ts \
--bucket=crop_docs \
--prefix=newholland/manuals \
--output=gcs-document-manifest.json
# Output: gcs-document-manifest.json
{
"totalFiles": 1847,
"byType": {
"manuals": 987,
"datasheets": 623,
"certifications": 237
},
"files": [
{
"name": "newholland/manuals/87840296_install.pdf",
"url": "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
"size": 2457600,
"contentType": "application/pdf",
"updated": "2025-11-15T08:30:00Z",
"partNumber": "87840296",
"type": "installation"
}
]
}
Option B: Scrape from Source Website
# Scrape PDFs from manufacturer website
bun scripts/scrape-nh-documents.ts \
--source=https://partstore.agriculture.newholland.com \
--download=true \
--upload-to-gcs=true
# Downloads PDFs to local cache, then uploads to GCS
Option C: Manual Upload
# Upload local PDF directory to GCS
gsutil -m cp -r ./local-pdfs/* gs://crop_docs/newholland/
# Generate manifest from local files
bun scripts/generate-manifest-from-local.ts \
--directory=./local-pdfs \
--output=local-document-manifest.json
Step 2: Extract PDF Metadata
# Parse PDF files to extract metadata
bun scripts/parse-pdf-metadata.ts \
--manifest=gcs-document-manifest.json \
--output=enriched-document-manifest.json
# Uses pdf-parse library to extract:
# - Page count
# - Author
# - Title
# - Keywords
# - Creation date
Enriched Manifest Example:
{
"name": "newholland/manuals/87840296_install.pdf",
"url": "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
"size": 2457600,
"partNumber": "87840296",
"type": "installation",
"metadata": {
"pageCount": 12,
"author": "New Holland Engineering",
"title": "Installation Guide - Hydraulic Filter",
"keywords": ["hydraulic", "filter", "installation"],
"createdAt": "2025-11-15T08:30:00Z"
}
}
Step 3: Enrich nh_unified Collection
# Bulk update MongoDB with document data
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified \
--dry-run
# Dry-run output:
# Would update 1,847 parts
# - 987 with manuals
# - 623 with datasheets
# - 237 with certifications
# 0 errors, 1,893 parts without matching PDFs
# Execute actual update
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified
# Progress: 1847/1847 parts updated ✅
Update Logic:
// Pseudocode for enrichment script
for (const file of manifest.files) {
const partNumber = file.partNumber;
const documentType = getDocumentType(file.type); // manuals/datasheets/certifications
await db.collection('nh_unified').updateOne(
{ partNumber },
{
$push: {
[`documents.${documentType}`]: {
type: file.type,
title: file.metadata.title,
url: file.url,
language: 'en',
pageCount: file.metadata.pageCount,
fileSize: file.size,
uploadedAt: new Date(file.metadata.createdAt)
}
}
},
{ upsert: false } // Don't create new parts
);
}
Step 4: Verify API Integration
No Code Changes Required! The API automatically detects populated document fields.
# Check coverage endpoint (counts should be non-zero)
curl http://localhost:3005/api/health/media/coverage | jq '.data.documents'
# Expected output:
{
"coverage": {
"count": 1847, # Was 0, now 1847
"percentage": 49.39 # Was 0, now 49.39%
},
"byType": {
"manuals": 987, # Was 0
"datasheets": 623, # Was 0
"certifications": 237 # Was 0
}
}
Verification Checklist:
- /coverage endpoint returns non-zero PDF counts (a small automation sketch follows this list)
- /distribution endpoint includes PDF breakdowns
- /gaps endpoint flags parts missing PDFs
- MongoDB documents have documents.* fields populated
- GCS bucket contains all uploaded PDFs
- PDF URLs are publicly accessible (or signed URLs work)
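To automate the first check, a small hypothetical script (endpoint and response shape taken from the examples above):
// Hypothetical post-migration check: fails if document coverage is still zero.
const BASE_URL = process.env.HEALTH_API_URL ?? 'http://localhost:3005';

async function verifyDocumentCoverage(): Promise<void> {
  const res = await fetch(`${BASE_URL}/api/health/media/coverage`);
  if (!res.ok) throw new Error(`coverage endpoint returned ${res.status}`);
  const body = await res.json();
  const docs = body.data.documents;
  if (docs.coverage.count === 0) {
    throw new Error('PDF coverage is still zero after enrichment');
  }
  console.log(`✅ ${docs.coverage.count} parts have documents (${docs.coverage.percentage}%)`);
}

verifyDocumentCoverage().catch((err) => {
  console.error(err);
  process.exit(1);
});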
Step 5: Frontend Verification
No Code Changes Needed! Frontend components already handle non-zero states.
Example React Component (already works):
function DocumentCoverage({ documents }: { documents: DocumentCoverage }) {
return (
<div>
<h3>PDF Documents</h3>
<p>Coverage: {documents.coverage.percentage.toFixed(1)}%</p>
{documents.byType.manuals > 0 && (
<Badge>
{documents.byType.manuals} Manuals
</Badge>
)}
{documents.byType.datasheets > 0 && (
<Badge>
{documents.byType.datasheets} Datasheets
</Badge>
)}
{documents.byType.certifications > 0 && (
<Badge>
{documents.byType.certifications} Certifications
</Badge>
)}
</div>
);
}
Before Data (Phase 1): Badges don't render (counts are 0)
After Data (Phase 2): Badges automatically appear with actual counts
Scripts Reference
scripts/scan-gcs-documents.ts
Scan GCS bucket for PDF files and generate manifest.
Usage:
bun scripts/scan-gcs-documents.ts \
--bucket=crop_docs \
--prefix=newholland/ \
--output=gcs-document-manifest.json \
--filter=*.pdf
Options:
- --bucket: GCS bucket name (default: crop_docs)
- --prefix: Folder prefix to scan (default: newholland/)
- --output: Output JSON file (default: gcs-document-manifest.json)
- --filter: File pattern filter (default: *.pdf)
Implementation:
import { Storage } from '@google-cloud/storage';
export async function scanGCSDocuments(
bucket: string,
prefix: string
): Promise<DocumentManifest> {
const storage = new Storage();
const [files] = await storage.bucket(bucket).getFiles({ prefix });
const manifest: DocumentManifest = {
totalFiles: 0,
byType: { manuals: 0, datasheets: 0, certifications: 0 },
files: []
};
for (const file of files) {
if (!file.name.endsWith('.pdf')) continue;
const parsed = parseDocumentName(file.name);
if (!parsed) continue;
manifest.files.push({
name: file.name,
url: `https://storage.googleapis.com/${bucket}/${file.name}`,
size: parseInt(file.metadata.size || '0'),
contentType: file.metadata.contentType || 'application/pdf',
updated: file.metadata.updated,
partNumber: parsed.partNumber,
type: parsed.type
});
manifest.byType[parsed.category]++;
manifest.totalFiles++;
}
return manifest;
}
function parseDocumentName(name: string): ParsedDocument | null {
// Parse: newholland/manuals/87840296_install.pdf
const match = name.match(/\/([a-z]+)\/([^_]+)_([^.]+)\.pdf$/);
if (!match) return null;
const [, category, partNumber, type] = match;
return {
category: category as 'manuals' | 'datasheets' | 'certifications',
partNumber,
type
};
}
scripts/parse-pdf-metadata.ts
Extract metadata from PDF files (page count, author, etc.).
Usage:
bun scripts/parse-pdf-metadata.ts \
--manifest=gcs-document-manifest.json \
--output=enriched-document-manifest.json \
--concurrency=10
Options:
- --manifest: Input manifest file
- --output: Output enriched manifest file
- --concurrency: Parallel PDF parsing (default: 10)
Implementation:
import PDFParser from 'pdf-parse';
import fetch from 'node-fetch';
export async function parsePDFMetadata(
file: ManifestFile
): Promise<EnrichedFile> {
// Download PDF from GCS
const response = await fetch(file.url);
const buffer = await response.arrayBuffer();
// Parse PDF
const pdf = await PDFParser(Buffer.from(buffer));
return {
...file,
metadata: {
pageCount: pdf.numpages,
author: pdf.info?.Author || 'Unknown',
title: pdf.info?.Title || file.name,
keywords: parseKeywords(pdf.info?.Keywords),
// Note: PDF CreationDate is often in "D:YYYYMMDDHHmmSS..." form and may need normalization before use with new Date()
createdAt: pdf.info?.CreationDate || new Date().toISOString()
}
};
}
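// The --concurrency flag could be honored with a simple chunked runner.
// Illustrative sketch only; the actual script body is not shown in this guide.
export async function parseAllWithConcurrency(
  files: ManifestFile[],
  concurrency = 10
): Promise<EnrichedFile[]> {
  const results: EnrichedFile[] = [];
  for (let i = 0; i < files.length; i += concurrency) {
    // Parse one chunk of PDFs in parallel before moving to the next chunk.
    const chunk = files.slice(i, i + concurrency);
    results.push(...(await Promise.all(chunk.map(parsePDFMetadata))));
  }
  return results;
}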
function parseKeywords(keywordsStr?: string): string[] {
if (!keywordsStr) return [];
return keywordsStr.split(/[,;]/).map(k => k.trim()).filter(Boolean);
}
scripts/enrich-pdf-data.ts
Bulk update MongoDB with document data.
Usage:
# Dry-run first
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified \
--dry-run
# Execute
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified
Options:
- --manifest: Enriched manifest JSON file
- --collection: MongoDB collection name (default: nh_unified)
- --dry-run: Preview changes without executing
- --batch-size: MongoDB bulk write batch size (default: 100)
Implementation:
import { MongoClient } from 'mongodb';
import type { EnrichedManifest } from './types';
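// getDocumentCategory is used below but not shown in this guide; a plausible
// sketch (an assumption, not the actual helper) maps a concrete document type
// to its schema category.
function getDocumentCategory(
  type: string
): 'manuals' | 'datasheets' | 'certifications' {
  if (['installation', 'service', 'parts', 'operator'].includes(type)) return 'manuals';
  if (['specifications', 'dimensions', 'compatibility'].includes(type)) return 'datasheets';
  return 'certifications'; // safety, quality, environmental
}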
export async function enrichPDFData(
manifest: EnrichedManifest,
collection: string,
dryRun: boolean
): Promise<void> {
const client = new MongoClient(process.env.MONGODB_URI!);
await client.connect();
const db = client.db('crop');
const coll = db.collection(collection);
let updated = 0;
let errors = 0;
for (const file of manifest.files) {
const documentType = getDocumentCategory(file.type);
const updateDoc = {
$push: {
[`documents.${documentType}`]: {
type: file.type,
title: file.metadata.title,
url: file.url,
language: 'en',
pageCount: file.metadata.pageCount,
fileSize: file.size,
uploadedAt: new Date(file.metadata.createdAt),
metadata: {
author: file.metadata.author,
tags: file.metadata.keywords
}
}
}
};
if (dryRun) {
console.log(`[DRY RUN] Would update part ${file.partNumber}:`, updateDoc);
updated++;
continue;
}
try {
const result = await coll.updateOne(
{ partNumber: file.partNumber },
updateDoc
);
if (result.matchedCount > 0) {
updated++;
} else {
console.warn(`Part not found: ${file.partNumber}`);
errors++;
}
} catch (error) {
console.error(`Error updating ${file.partNumber}:`, error);
errors++;
}
}
await client.close();
console.log(`
Migration ${dryRun ? 'Preview' : 'Complete'}:
- Updated: ${updated}
- Errors: ${errors}
- Total: ${manifest.files.length}
`);
}
Quality Assurance
Pre-Migration Checklist
Before running data migration in production:
- Backup MongoDB: mongodump --uri="$MONGODB_URI" --out=backup-$(date +%Y%m%d)
- Verify GCS Bucket: All PDFs accessible, correct permissions (a spot-check sketch follows this list)
- Test Manifest: Run scripts in dev environment first
- Dry-Run Migration: Review changes without executing
- Monitor Disk Space: Ensure sufficient GCS quota
- Alert Stakeholders: Notify frontend/content teams of migration
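For the GCS verification item, a hedged spot-check sketch that issues HEAD requests against a sample of manifest URLs (manifest file name and shape follow the scan output shown earlier):
import { readFileSync } from 'node:fs';

// Spot-check the first 20 manifest URLs with HEAD requests.
const manifest = JSON.parse(readFileSync('gcs-document-manifest.json', 'utf8'));

for (const file of manifest.files.slice(0, 20)) {
  const res = await fetch(file.url, { method: 'HEAD' });
  console.log(`${res.ok ? '✅' : '❌'} ${res.status} ${file.url}`);
}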
Post-Migration Validation
After migration completes:
1. Data Integrity
# Count parts with documents
mongo "$MONGODB_URI" --eval '
db.nh_unified.countDocuments({
"documents.manuals": { $exists: true, $ne: [] }
})
'
# Expected: ~987 parts
# Verify URL accessibility
curl -I "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf"
# Expected: HTTP 200 OK
2. API Validation
# Coverage endpoint
curl http://localhost:3005/api/health/media/coverage | \
jq '.data.documents.byType' | \
grep -q '"manuals": 987' && echo "✅ Manuals count correct"
# Gaps endpoint (parts missing PDFs)
curl 'http://localhost:3005/api/health/media/gaps?mediaType=documents&limit=10' | \
jq '.data.totalGaps' | \
grep -qE '^[0-9]+$' && echo "✅ Gaps API functional"
3. Sample Verification
# Fetch random sample of 10 parts with documents
mongo "$MONGODB_URI" --eval '
db.nh_unified.aggregate([
{ $match: { "documents.manuals": { $exists: true } } },
{ $sample: { size: 10 } },
{ $project: { partNumber: 1, "documents.manuals.url": 1 } }
]).forEach(part => {
print(`Part ${part.partNumber}: ${part.documents.manuals.length} manuals`);
})
'
# Manually verify URLs are accessible
4. Performance Testing
# Coverage endpoint response time (should remain <500ms)
time curl -s http://localhost:3005/api/health/media/coverage > /dev/null
# Expected: real 0m0.250s
Rollback Plan
If the migration causes issues, follow this rollback procedure:
Step 1: Restore MongoDB Backup
# Stop writes to collection
# (set API to read-only mode or shut down)
# Restore from backup
mongorestore \
--uri="$MONGODB_URI" \
--nsInclude="crop.nh_unified" \
--drop \
backup-20251117/crop/nh_unified.bson
# Verify restoration
mongo "$MONGODB_URI" --eval '
db.nh_unified.countDocuments({ "documents.manuals": { $exists: true } })
'
# Expected: 0 (pre-migration state)
Step 2: Verify API Returns Zero Counts
curl http://localhost:3005/api/health/media/coverage | \
jq '.data.documents.byType' | \
grep -q '"manuals": 0' && echo "✅ Rollback successful"
Step 3: Re-enable Writes
# Restart API service (removes read-only mode)
# Or re-enable write endpoints
Rollback Time: ~15 minutes (depends on collection size)
No Breaking Changes Guarantee
Frontend Compatibility
The API is designed to be backwards compatible with zero-data and non-zero-data states.
Before Migration (Phase 1):
{
"documents": {
"coverage": { "count": 0, "percentage": 0 },
"byType": { "manuals": 0, "datasheets": 0, "certifications": 0 }
}
}
After Migration (Phase 2):
{
"documents": {
"coverage": { "count": 1847, "percentage": 49.39 },
"byType": { "manuals": 987, "datasheets": 623, "certifications": 237 }
}
}
Frontend Code (Works in Both States):
// ✅ Handles zero state gracefully
function renderDocuments(documents: DocumentCoverage) {
if (documents.coverage.count === 0) {
return <EmptyState message="No documents available yet" />;
}
return (
<div>
<h3>{documents.coverage.count} Documents</h3>
{documents.byType.manuals > 0 && <Badge>{documents.byType.manuals} Manuals</Badge>}
{documents.byType.datasheets > 0 && <Badge>{documents.byType.datasheets} Datasheets</Badge>}
</div>
);
}
Key Points:
- Zero checks (count === 0) prevent rendering empty UI
- Conditional rendering (count > 0) shows badges only when data exists
- No API version bump needed
- No frontend deployments required
Monitoring & Alerts
Metrics to Track
After migration, monitor these metrics:
1. API Performance
- /coverage response time (target: <300ms)
- /gaps with mediaType=documents response time (target: <600ms)
- Error rate (target: <0.1%)
2. Data Quality
- Parts with documents: 1,847 (49.39%)
- Average documents per part: 1.2
- Orphaned documents (no matching part): 0
3. GCS Usage
- Total PDF storage: ~4.2 GB
- Bandwidth usage: Monitor for unexpected spikes
- 404 errors: Should be near zero (all URLs valid)
Alerting Rules
Set up alerts for:
# Prometheus alert rules
groups:
- name: media_coverage_pdf
rules:
- alert: PDFCoverageDropped
expr: |
media_coverage_documents_count < 1800
for: 5m
annotations:
summary: 'PDF document count dropped below threshold'
- alert: PDFEndpointSlow
expr: |
histogram_quantile(0.95, media_coverage_response_time_seconds_bucket{endpoint="/coverage"}) > 0.5
for: 5m
annotations:
summary: '/coverage endpoint P95 latency >500ms'
- alert: GCS404Rate
expr: |
rate(gcs_pdf_404_errors_total[5m]) > 0.01
for: 5m
annotations:
summary: 'GCS PDF 404 error rate >1%'
FAQ
Q: Will the migration cause downtime?
A: No. Migration runs as background MongoDB updates. API remains available throughout.
Q: Can we migrate incrementally (e.g., manuals first, then datasheets)?
A: Yes! Run enrich-pdf-data.ts multiple times with different manifest files:
# Step 1: Migrate manuals only
bun scripts/enrich-pdf-data.ts --manifest=manuals-manifest.json
# Step 2: Later, migrate datasheets
bun scripts/enrich-pdf-data.ts --manifest=datasheets-manifest.json
Q: What if a part has multiple manuals (e.g., installation + service)?
A: MongoDB's $push appends to the array, so a part can hold any number of documents:
{
"documents": {
"manuals": [
{ "type": "installation", "title": "Installation Guide", ... },
{ "type": "service", "title": "Service Manual", ... }
]
}
}
Q: How do we update a PDF (e.g., new version released)?
A: Two options:
Option A: Replace Array Element
db.nh_unified.updateOne(
{
partNumber: '87840296',
'documents.manuals.type': 'installation'
},
{
$set: {
'documents.manuals.$.url': 'https://...new-version.pdf',
'documents.manuals.$.metadata.version': '2.0'
}
}
);
Option B: Add as New Document (keep history)
db.nh_unified.updateOne(
{ partNumber: '87840296' },
{
$push: {
'documents.manuals': {
type: 'installation',
title: 'Installation Guide v2.0',
url: 'https://...new-version.pdf',
metadata: { version: '2.0', supersedes: 'v1.0' }
}
}
}
);
Q: Can we prioritize high-quality parts for PDF enrichment?
A: Yes! Filter manifest by quality scores:
# Generate manifest for high-quality parts only
bun scripts/scan-gcs-documents.ts --min-quality=80 > high-quality-manifest.json
# Migrate high-quality parts first
bun scripts/enrich-pdf-data.ts --manifest=high-quality-manifest.json
Timeline Estimate
| Phase | Duration | Owner | Status |
|---|---|---|---|
| 1. Schema Design | 2 days | Backend | ✅ Complete |
| 2. API Implementation | 3 days | Backend | ✅ Complete |
| 3. Frontend Zero-State | 2 days | Frontend | ✅ Complete |
| 4. PDF Sourcing | 1-2 weeks | Content Team | ⏳ Pending |
| 5. GCS Upload | 2-3 days | DevOps | ⏳ Pending |
| 6. Metadata Parsing | 3-4 days | Backend | ⏳ Pending |
| 7. MongoDB Enrichment | 1 day | Backend | ⏳ Pending |
| 8. QA & Validation | 2-3 days | QA | ⏳ Pending |
Total Estimated Time: 2-4 weeks (primarily waiting on PDF sourcing)
Contact
Questions or issues during migration?
- Slack: #backend-health-analytics
- Email: backend-team@crop.example.com
- Oncall: PagerDuty escalation
Last Updated: 2025-11-17
Document Version: 1.0