
PDF Data Migration Guide


Overview

This guide covers the integration of PDF documents (manuals, datasheets, certifications) into the Media Coverage API.

Current Status: Phase 1 (Schema Ready, Zero Data)

Future Status: Phase 2 (Data Integration Complete)

Breaking Changes: NONE. Frontend code continues to work unchanged; counts simply move from zero to actual values.


Current State (Phase 1)

API Behavior Today

The Media Coverage API returns PDF fields with count = 0:

curl http://localhost:3005/api/health/media/coverage | jq '.data.documents'

Response:

{
  "coverage": {
    "count": 0,
    "percentage": 0
  },
  "byType": {
    "manuals": 0,
    "datasheets": 0,
    "certifications": 0
  }
}

MongoDB Schema (Ready)

The nh_unified collection schema includes PDF document fields:

{
  _id: ObjectId("..."),
  partNumber: "87840296",
  sku: "117-2295-001",
  title: "Hydraulic Filter Element",
  // ... other fields ...
  documents: {
    manuals: [
      {
        type: "installation" | "service" | "parts" | "operator",
        title: "Installation Guide - Hydraulic Filter",
        url: "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
        language: "en",
        pageCount: 12,
        fileSize: 2457600, // bytes
        uploadedAt: ISODate("2025-11-17T10:00:00Z"),
        metadata: {
          version: "1.0",
          author: "New Holland Engineering",
          tags: ["hydraulic", "filter", "installation"]
        }
      }
    ],
    datasheets: [
      {
        type: "specifications" | "dimensions" | "compatibility",
        title: "Technical Specifications",
        url: "https://storage.googleapis.com/crop_docs/newholland/datasheets/87840296_specs.pdf",
        language: "en",
        pageCount: 4,
        fileSize: 524288
      }
    ],
    certifications: [
      {
        type: "safety" | "quality" | "environmental",
        title: "CE Certification",
        url: "https://storage.googleapis.com/crop_docs/newholland/certs/87840296_ce.pdf",
        issuer: "TUV Rheinland",
        issueDate: ISODate("2024-06-15T00:00:00Z"),
        validUntil: ISODate("2026-12-31T23:59:59Z"),
        certificationNumber: "CE-2024-NHL-87840296"
      }
    ]
  }
}

Key Points:

  • Schema exists, but documents field is currently undefined/null for all parts
  • API gracefully handles missing field (returns 0 counts)
  • No code changes needed when data arrives; detection is automatic (a sketch of the counting logic follows this list)
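
Why no code changes are needed: the coverage counts can treat a missing documents field as empty. A minimal sketch of that counting logic, assuming direct MongoDB aggregation (the real endpoint's implementation may differ):

import { MongoClient } from 'mongodb';

// Illustrative sketch only; the real endpoint's implementation may differ.
// $ifNull treats a missing documents field as an empty array, so the same
// pipeline yields zeros in Phase 1 and real counts in Phase 2.
const hasDocs = (field: string) => ({
  $sum: {
    $cond: [{ $gt: [{ $size: { $ifNull: [field, []] } }, 0] }, 1, 0]
  }
});

export async function countDocumentCoverage(uri: string) {
  const client = new MongoClient(uri);
  await client.connect();

  const [counts] = await client
    .db('crop')
    .collection('nh_unified')
    .aggregate([
      {
        $group: {
          _id: null,
          total: { $sum: 1 },
          manuals: hasDocs('$documents.manuals'),
          datasheets: hasDocs('$documents.datasheets'),
          certifications: hasDocs('$documents.certifications')
        }
      }
    ])
    .toArray();

  await client.close();
  return counts; // all document counts are 0 in Phase 1
}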

Migration Timeline

Phase 1: Schema Ready (Current - Completed ✅)

Status: Complete

Deliverables:

  • MongoDB schema includes documents object
  • API endpoints return PDF fields with zero counts
  • Frontend components handle zero state gracefully
  • OpenAPI spec documents PDF structures
  • TypeScript types define document interfaces (a sketch follows this list)
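
A sketch of those interfaces, derived directly from the schema above (interface names are illustrative; the actual type definitions may differ):

// Illustrative shapes mirroring the nh_unified schema above.
interface PDFManual {
  type: 'installation' | 'service' | 'parts' | 'operator';
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number; // bytes
  uploadedAt: Date;
  metadata?: {
    version: string;
    author: string;
    tags: string[];
  };
}

interface PDFDatasheet {
  type: 'specifications' | 'dimensions' | 'compatibility';
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number;
}

interface PDFCertification {
  type: 'safety' | 'quality' | 'environmental';
  title: string;
  url: string;
  issuer: string;
  issueDate: Date;
  validUntil: Date;
  certificationNumber: string;
}

interface PartDocuments {
  manuals?: PDFManual[];
  datasheets?: PDFDatasheet[];
  certifications?: PDFCertification[];
}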

Verification:

# API returns PDF fields
curl http://localhost:3005/api/health/media/coverage | \
  jq '.data.documents' && echo "✅ PDF fields present"

# MongoDB schema allows documents
mongo "$MONGODB_URI" --eval '
  db.nh_unified.findOne({}, {documents: 1})
' && echo "✅ Schema ready"

Phase 2: Data Integration (Future)

Trigger: When PDF files become available (scanned, uploaded to GCS)

Duration: ~2-4 weeks (data collection + processing)

Steps:

Step 1: Source PDF Files

Option A: Scan GCS Bucket

# Scan existing GCS bucket for PDF files
bun scripts/scan-gcs-documents.ts \
  --bucket=crop_docs \
  --prefix=newholland/manuals \
  --output=gcs-document-manifest.json

# Output: gcs-document-manifest.json
{
  "totalFiles": 1847,
  "byType": {
    "manuals": 987,
    "datasheets": 623,
    "certifications": 237
  },
  "files": [
    {
      "name": "newholland/manuals/87840296_install.pdf",
      "url": "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
      "size": 2457600,
      "contentType": "application/pdf",
      "updated": "2025-11-15T08:30:00Z",
      "partNumber": "87840296",
      "type": "installation"
    }
  ]
}

Option B: Scrape from Source Website

# Scrape PDFs from manufacturer website
bun scripts/scrape-nh-documents.ts \
  --source=https://partstore.agriculture.newholland.com \
  --download=true \
  --upload-to-gcs=true

# Downloads PDFs to local cache, then uploads to GCS

Option C: Manual Upload

# Upload local PDF directory to GCS
gsutil -m cp -r ./local-pdfs/* gs://crop_docs/newholland/

# Generate manifest from local files
bun scripts/generate-manifest-from-local.ts \
  --directory=./local-pdfs \
  --output=local-document-manifest.json

Step 2: Extract PDF Metadata

# Parse PDF files to extract metadata
bun scripts/parse-pdf-metadata.ts \
  --manifest=gcs-document-manifest.json \
  --output=enriched-document-manifest.json

# Uses pdf-parse library to extract:
# - Page count
# - Author
# - Title
# - Keywords
# - Creation date

Enriched Manifest Example:

{
  "name": "newholland/manuals/87840296_install.pdf",
  "url": "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
  "size": 2457600,
  "partNumber": "87840296",
  "type": "installation",
  "metadata": {
    "pageCount": 12,
    "author": "New Holland Engineering",
    "title": "Installation Guide - Hydraulic Filter",
    "keywords": ["hydraulic", "filter", "installation"],
    "createdAt": "2025-11-15T08:30:00Z"
  }
}

Step 3: Enrich nh_unified Collection

# Bulk update MongoDB with document data
bun scripts/enrich-pdf-data.ts \
  --manifest=enriched-document-manifest.json \
  --collection=nh_unified \
  --dry-run

# Dry-run output:
# Would update 1,847 parts
# - 987 with manuals
# - 623 with datasheets
# - 237 with certifications
# 0 errors, 1,893 parts without matching PDFs

# Execute actual update
bun scripts/enrich-pdf-data.ts \
  --manifest=enriched-document-manifest.json \
  --collection=nh_unified

# Progress: 1847/1847 parts updated ✅

Update Logic:

// Pseudocode for enrichment script
for (const file of manifest.files) {
  const partNumber = file.partNumber;
  const documentType = getDocumentCategory(file.type); // manuals/datasheets/certifications

  await db.collection('nh_unified').updateOne(
    { partNumber },
    {
      $push: {
        [`documents.${documentType}`]: {
          type: file.type,
          title: file.metadata.title,
          url: file.url,
          language: 'en',
          pageCount: file.metadata.pageCount,
          fileSize: file.size,
          uploadedAt: new Date(file.metadata.createdAt)
        }
      }
    },
    { upsert: false } // Don't create new parts
  );
}

Step 4: Verify API Integration

No Code Changes Required! The API automatically detects populated document fields.

# Check coverage endpoint (counts should be non-zero)
curl http://localhost:3005/api/health/media/coverage | jq '.data.documents'

# Expected output:
{
  "coverage": {
    "count": 1847,        # Was 0, now 1847
    "percentage": 49.39   # Was 0, now 49.39%
  },
  "byType": {
    "manuals": 987,       # Was 0
    "datasheets": 623,    # Was 0
    "certifications": 237 # Was 0
  }
}

Verification Checklist:

  • /coverage endpoint returns non-zero PDF counts
  • /distribution endpoint includes PDF breakdowns
  • /gaps endpoint flags parts missing PDFs
  • MongoDB documents have documents.* fields populated
  • GCS bucket contains all uploaded PDFs
  • PDF URLs are publicly accessible (or signed URLs work); see the sketch below
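
For the URL-accessibility item, a minimal sketch that HEAD-checks a sample of manifest URLs (manifest path and shape assumed from Step 2):

// Hypothetical spot-check: HEAD-request a sample of document URLs.
// Assumes the enriched-document-manifest.json shape shown in Step 2.
import { readFileSync } from 'node:fs';

interface ManifestEntry { url: string; partNumber: string }

const manifest = JSON.parse(
  readFileSync('enriched-document-manifest.json', 'utf8')
) as { files: ManifestEntry[] };

const sample = manifest.files.slice(0, 25); // spot-check the first 25

for (const file of sample) {
  const res = await fetch(file.url, { method: 'HEAD' });
  if (!res.ok) {
    console.error(`❌ ${file.partNumber}: ${res.status} ${file.url}`);
  } else {
    console.log(`✅ ${file.partNumber}: ${res.status}`);
  }
}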

Step 5: Frontend Verification

No Code Changes Needed! Frontend components already handle non-zero states.

Example React Component (already works):

function DocumentCoverage({ documents }: { documents: DocumentCoverage }) {
  return (
    <div>
      <h3>PDF Documents</h3>
      <p>Coverage: {documents.coverage.percentage.toFixed(1)}%</p>

      {documents.byType.manuals > 0 && (
        <Badge>
          {documents.byType.manuals} Manuals
        </Badge>
      )}

      {documents.byType.datasheets > 0 && (
        <Badge>
          {documents.byType.datasheets} Datasheets
        </Badge>
      )}

      {documents.byType.certifications > 0 && (
        <Badge>
          {documents.byType.certifications} Certifications
        </Badge>
      )}
    </div>
  );
}

Before Data (Phase 1): Badges don't render (counts are 0)

After Data (Phase 2): Badges automatically appear with actual counts


Scripts Reference

scripts/scan-gcs-documents.ts

Scan GCS bucket for PDF files and generate manifest.

Usage:

bun scripts/scan-gcs-documents.ts \
  --bucket=crop_docs \
  --prefix=newholland/ \
  --output=gcs-document-manifest.json \
  --filter=*.pdf

Options:

  • --bucket: GCS bucket name (default: crop_docs)
  • --prefix: Folder prefix to scan (default: newholland/)
  • --output: Output JSON file (default: gcs-document-manifest.json)
  • --filter: File pattern filter (default: *.pdf)

Implementation:

import { Storage } from '@google-cloud/storage';
import type { DocumentManifest, ParsedDocument } from './types';

export async function scanGCSDocuments(
  bucket: string,
  prefix: string
): Promise<DocumentManifest> {
  const storage = new Storage();
  const [files] = await storage.bucket(bucket).getFiles({ prefix });

  const manifest: DocumentManifest = {
    totalFiles: 0,
    byType: { manuals: 0, datasheets: 0, certifications: 0 },
    files: []
  };

  for (const file of files) {
    if (!file.name.endsWith('.pdf')) continue;

    const parsed = parseDocumentName(file.name);
    if (!parsed) continue;

    manifest.files.push({
      name: file.name,
      url: `https://storage.googleapis.com/${bucket}/${file.name}`,
      size: Number(file.metadata.size ?? 0),
      contentType: file.metadata.contentType || 'application/pdf',
      updated: file.metadata.updated,
      partNumber: parsed.partNumber,
      type: parsed.type
    });

    manifest.byType[parsed.category]++;
    manifest.totalFiles++;
  }

  return manifest;
}

function parseDocumentName(name: string): ParsedDocument | null {
  // Parse: newholland/manuals/87840296_install.pdf
  // Note: the filename suffix ("install") may need normalizing to the
  // canonical type name ("installation") used in the manifest examples
  const match = name.match(/\/([a-z]+)\/([^_]+)_([^.]+)\.pdf$/);
  if (!match) return null;

  const [, category, partNumber, type] = match;
  return {
    category: category as 'manuals' | 'datasheets' | 'certifications',
    partNumber,
    type
  };
}

scripts/parse-pdf-metadata.ts

Extract metadata from PDF files (page count, author, etc.).

Usage:

bun scripts/parse-pdf-metadata.ts \
  --manifest=gcs-document-manifest.json \
  --output=enriched-document-manifest.json \
  --concurrency=10

Options:

  • --manifest: Input manifest file
  • --output: Output enriched manifest file
  • --concurrency: Parallel PDF parsing (default: 10)

Implementation:

import PDFParser from 'pdf-parse';
import fetch from 'node-fetch';
import type { ManifestFile, EnrichedFile } from './types';

export async function parsePDFMetadata(
  file: ManifestFile
): Promise<EnrichedFile> {
  // Download PDF from GCS
  const response = await fetch(file.url);
  const buffer = await response.arrayBuffer();

  // Parse PDF
  const pdf = await PDFParser(Buffer.from(buffer));

  return {
    ...file,
    metadata: {
      pageCount: pdf.numpages,
      author: pdf.info?.Author || 'Unknown',
      title: pdf.info?.Title || file.name,
      keywords: parseKeywords(pdf.info?.Keywords),
      // Note: pdf-parse returns CreationDate in the raw PDF date format
      // (e.g. "D:20251115083000Z"); convert to ISO 8601 before storing if needed
      createdAt: pdf.info?.CreationDate || new Date().toISOString()
    }
  };
}

function parseKeywords(keywordsStr?: string): string[] {
  if (!keywordsStr) return [];
  return keywordsStr.split(/[,;]/).map(k => k.trim()).filter(Boolean);
}

scripts/enrich-pdf-data.ts

Bulk update MongoDB with document data.

Usage:

# Dry-run first
bun scripts/enrich-pdf-data.ts \
  --manifest=enriched-document-manifest.json \
  --collection=nh_unified \
  --dry-run

# Execute
bun scripts/enrich-pdf-data.ts \
  --manifest=enriched-document-manifest.json \
  --collection=nh_unified

Options:

  • --manifest: Enriched manifest JSON file
  • --collection: MongoDB collection name (default: nh_unified)
  • --dry-run: Preview changes without executing
  • --batch-size: MongoDB bulk write batch size (default: 100)

Implementation:

import { MongoClient } from 'mongodb';
import type { EnrichedManifest } from './types';

export async function enrichPDFData(
  manifest: EnrichedManifest,
  collection: string,
  dryRun: boolean
): Promise<void> {
  const client = new MongoClient(process.env.MONGODB_URI!);
  await client.connect();

  const db = client.db('crop');
  const coll = db.collection(collection);

  let updated = 0;
  let errors = 0;

  for (const file of manifest.files) {
    const documentType = getDocumentCategory(file.type);

    const updateDoc = {
      $push: {
        [`documents.${documentType}`]: {
          type: file.type,
          title: file.metadata.title,
          url: file.url,
          language: 'en',
          pageCount: file.metadata.pageCount,
          fileSize: file.size,
          uploadedAt: new Date(file.metadata.createdAt),
          metadata: {
            author: file.metadata.author,
            tags: file.metadata.keywords
          }
        }
      }
    };

    if (dryRun) {
      console.log(`[DRY RUN] Would update part ${file.partNumber}:`, updateDoc);
      updated++;
      continue;
    }

    try {
      const result = await coll.updateOne(
        { partNumber: file.partNumber },
        updateDoc
      );

      if (result.matchedCount > 0) {
        updated++;
      } else {
        console.warn(`Part not found: ${file.partNumber}`);
        errors++;
      }
    } catch (error) {
      console.error(`Error updating ${file.partNumber}:`, error);
      errors++;
    }
  }

  await client.close();

  console.log(`
    Migration ${dryRun ? 'Preview' : 'Complete'}:
    - Updated: ${updated}
    - Errors: ${errors}
    - Total: ${manifest.files.length}
  `);
}
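
The getDocumentCategory helper referenced above isn't shown; a minimal sketch, assuming the type unions from the schema section:

// Assumed mapping from per-file type to the documents.* category,
// based on the type unions in the schema section above.
type DocumentCategory = 'manuals' | 'datasheets' | 'certifications';

function getDocumentCategory(type: string): DocumentCategory {
  if (['installation', 'service', 'parts', 'operator'].includes(type)) {
    return 'manuals';
  }
  if (['specifications', 'dimensions', 'compatibility'].includes(type)) {
    return 'datasheets';
  }
  return 'certifications'; // safety | quality | environmental
}

Two caveats worth noting: the loop above issues one updateOne per file, so the documented --batch-size option implies the full script groups these into bulkWrite batches; and because $push appends unconditionally, re-running the same manifest would create duplicate entries, so deduplicate manifests (or switch to $addToSet) before re-runs.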

Quality Assurance

Pre-Migration Checklist

Before running data migration in production:

  • Backup MongoDB: mongodump --uri="$MONGODB_URI" --out=backup-$(date +%Y%m%d)
  • Verify GCS Bucket: All PDFs accessible, correct permissions
  • Test Manifest: Run scripts in dev environment first
  • Dry-Run Migration: Review changes without executing
  • Monitor Disk Space: Ensure sufficient GCS quota
  • Alert Stakeholders: Notify frontend/content teams of migration

Post-Migration Validation

After migration completes:

1. Data Integrity

# Count parts with documents
mongo "$MONGODB_URI" --eval '
  db.nh_unified.countDocuments({
    "documents.manuals": { $exists: true, $ne: [] }
  })
'
# Expected: ~987 parts

# Verify URL accessibility
curl -I "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf"
# Expected: HTTP 200 OK

2. API Validation

# Coverage endpoint
curl http://localhost:3005/api/health/media/coverage | \
  jq '.data.documents.byType' | \
  grep -q '"manuals": 987' && echo "✅ Manuals count correct"

# Gaps endpoint (parts missing PDFs)
curl 'http://localhost:3005/api/health/media/gaps?mediaType=documents&limit=10' | \
  jq '.data.totalGaps' | \
  grep -qE '^[0-9]+$' && echo "✅ Gaps API functional"

3. Sample Verification

# Fetch random sample of 10 parts with documents
mongo "$MONGODB_URI" --eval '
  db.nh_unified.aggregate([
    { $match: { "documents.manuals": { $exists: true } } },
    { $sample: { size: 10 } },
    { $project: { partNumber: 1, "documents.manuals.url": 1 } }
  ]).forEach(part => {
    print(`Part ${part.partNumber}: ${part.documents.manuals.length} manuals`);
  })
'

# Manually verify URLs are accessible

4. Performance Testing

# Coverage endpoint response time (should remain <500ms)
time curl -s http://localhost:3005/api/health/media/coverage > /dev/null

# Expected: real 0m0.250s

Rollback Plan

If migration causes issues, rollback procedure:

Step 1: Restore MongoDB Backup

# Stop writes to collection
# (set API to read-only mode or shut down)

# Restore from backup
# Point at the dump directory; --nsInclude selects the collection
mongorestore \
  --uri="$MONGODB_URI" \
  --nsInclude="crop.nh_unified" \
  --drop \
  backup-20251117/

# Verify restoration
mongo "$MONGODB_URI" --eval '
  db.nh_unified.countDocuments({ "documents.manuals": { $exists: true } })
'
# Expected: 0 (pre-migration state)

Step 2: Verify API Returns Zero Counts

curl http://localhost:3005/api/health/media/coverage | \
  jq '.data.documents.byType' | \
  grep -q '"manuals": 0' && echo "✅ Rollback successful"

Step 3: Re-enable Writes

# Restart API service (removes read-only mode)
# Or re-enable write endpoints

Rollback Time: ~15 minutes (depends on collection size)
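
Alternative (an assumption, not part of the tested plan): if no other writes touched the collection since migration, unsetting the documents field is faster than a full restore:

// Faster rollback sketch: remove only the migrated fields.
// Only safe if nothing else wrote to these documents since migration.
db.nh_unified.updateMany(
  { documents: { $exists: true } },
  { $unset: { documents: "" } }
);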


No Breaking Changes Guarantee

Frontend Compatibility

The API is designed to be backwards compatible with zero-data and non-zero-data states.

Before Migration (Phase 1):

{
  "documents": {
    "coverage": { "count": 0, "percentage": 0 },
    "byType": { "manuals": 0, "datasheets": 0, "certifications": 0 }
  }
}

After Migration (Phase 2):

{
  "documents": {
    "coverage": { "count": 1847, "percentage": 49.39 },
    "byType": { "manuals": 987, "datasheets": 623, "certifications": 237 }
  }
}

Frontend Code (Works in Both States):

// ✅ Handles zero state gracefully
function renderDocuments(documents: DocumentCoverage) {
  if (documents.coverage.count === 0) {
    return <EmptyState message="No documents available yet" />;
  }

  return (
    <div>
      <h3>{documents.coverage.count} Documents</h3>
      {documents.byType.manuals > 0 && <Badge>{documents.byType.manuals} Manuals</Badge>}
      {documents.byType.datasheets > 0 && <Badge>{documents.byType.datasheets} Datasheets</Badge>}
    </div>
  );
}

Key Points:

  • Zero checks (count === 0) prevent rendering empty UI
  • Conditional rendering (count > 0) shows badges only when data exists
  • No API version bump needed
  • No frontend deployments required

Monitoring & Alerts

Metrics to Track

After migration, monitor these metrics:

1. API Performance

  • /coverage response time (target: <300ms)
  • /gaps with mediaType=documents response time (target: <600ms)
  • Error rate (target: <0.1%)

2. Data Quality

  • Parts with documents: 1,847 (49.39%)
  • Average documents per part: 1.2
  • Orphaned documents (no matching part): 0

3. GCS Usage

  • Total PDF storage: ~4.2 GB
  • Bandwidth usage: Monitor for unexpected spikes
  • 404 errors: Should be near zero (all URLs valid)
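
The alert rules below reference a media_coverage_documents_count metric. A minimal sketch of exporting it from the API with prom-client (assumed dependency; the name matches the rules below, but the wiring is illustrative):

import { Gauge, register } from 'prom-client';

// Gauge backing the PDFCoverageDropped alert below.
const documentsCount = new Gauge({
  name: 'media_coverage_documents_count',
  help: 'Number of parts with at least one PDF document'
});

// Call after each coverage computation (e.g. on a refresh interval).
export function recordDocumentCoverage(count: number): void {
  documentsCount.set(count);
}

// Serve this from a /metrics route for Prometheus to scrape.
export async function metricsHandler(): Promise<string> {
  return register.metrics();
}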

Alerting Rules

Set up alerts for:

# Prometheus alert rules
groups:
  - name: media_coverage_pdf
    rules:
      - alert: PDFCoverageDropped
        expr: |
          media_coverage_documents_count < 1800
        for: 5m
        annotations:
          summary: 'PDF document count dropped below threshold'

      - alert: PDFEndpointSlow
        expr: |
          histogram_quantile(0.95, media_coverage_response_time_seconds_bucket{endpoint="/coverage"}) > 0.5
        for: 5m
        annotations:
          summary: '/coverage endpoint P95 latency >500ms'

      - alert: GCS404Rate
        expr: |
          rate(gcs_pdf_404_errors_total[5m]) > 0.01
        for: 5m
        annotations:
          summary: 'GCS PDF 404 error rate >1%'

FAQ

Q: Will the migration cause downtime?

A: No. Migration runs as background MongoDB updates. API remains available throughout.

Q: Can we migrate incrementally (e.g., manuals first, then datasheets)?

A: Yes! Run enrich-pdf-data.ts multiple times with different manifest files:

# Step 1: Migrate manuals only
bun scripts/enrich-pdf-data.ts --manifest=manuals-manifest.json

# Step 2: Later, migrate datasheets
bun scripts/enrich-pdf-data.ts --manifest=datasheets-manifest.json

Q: What if a part has multiple manuals (e.g., installation + service)?

A: MongoDB $push appends to array. A part can have unlimited documents:

{
  "documents": {
    "manuals": [
      { "type": "installation", "title": "Installation Guide", ... },
      { "type": "service", "title": "Service Manual", ... }
    ]
  }
}

Q: How do we update a PDF (e.g., new version released)?

A: Two options:

Option A: Replace Array Element

db.nh_unified.updateOne(
  {
    partNumber: '87840296',
    'documents.manuals.type': 'installation'
  },
  {
    $set: {
      'documents.manuals.$.url': 'https://...new-version.pdf',
      'documents.manuals.$.metadata.version': '2.0'
    }
  }
);

Option B: Add as New Document (keep history)

db.nh_unified.updateOne(
  { partNumber: '87840296' },
  {
    $push: {
      'documents.manuals': {
        type: 'installation',
        title: 'Installation Guide v2.0',
        url: 'https://...new-version.pdf',
        metadata: { version: '2.0', supersedes: 'v1.0' }
      }
    }
  }
);

Q: Can we prioritize high-quality parts for PDF enrichment?

A: Yes! Filter manifest by quality scores:

# Generate manifest for high-quality parts only
bun scripts/scan-gcs-documents.ts --min-quality=80 --output=high-quality-manifest.json

# Migrate high-quality parts first
bun scripts/enrich-pdf-data.ts --manifest=high-quality-manifest.json

Timeline Estimate

Phase                     Duration     Owner          Status
1. Schema Design          2 days       Backend        ✅ Complete
2. API Implementation     3 days       Backend        ✅ Complete
3. Frontend Zero-State    2 days       Frontend       ✅ Complete
4. PDF Sourcing           1-2 weeks    Content Team   ⏳ Pending
5. GCS Upload             2-3 days     DevOps         ⏳ Pending
6. Metadata Parsing       3-4 days     Backend        ⏳ Pending
7. MongoDB Enrichment     1 day        Backend        ⏳ Pending
8. QA & Validation        2-3 days     QA             ⏳ Pending

Total Estimated Time: 2-4 weeks (primarily waiting on PDF sourcing)


Contact

Questions or issues during migration?


Last Updated: 2025-11-17

Document Version: 1.0
