PDF Data Migration Guide
Overview
This guide covers the integration of PDF documents (manuals, datasheets, certifications) into the Media Coverage API.
Current Status: Phase 1 (Schema Ready, Zero Data)
Future Status: Phase 2 (Data Integration Complete)
Breaking Changes: NONE - Frontend code continues to work, counts simply populate from zero to actual values.
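The response shape itself does not change between phases. Below is a minimal TypeScript sketch of the documents portion of the coverage response, derived from the JSON examples in this guide (the actual exported type may be named differently):
interface DocumentCoverage {
  coverage: {
    count: number;      // 0 in Phase 1, actual value in Phase 2
    percentage: number; // 0 in Phase 1, e.g. 49.39 in Phase 2
  };
  byType: {
    manuals: number;
    datasheets: number;
    certifications: number;
  };
}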
Current State (Phase 1)
API Behavior Today
The Media Coverage API returns PDF fields with count = 0:
curl http://localhost:3005/api/health/media/coverage | jq '.data.documents'
Response:
{
"coverage": {
"count": 0,
"percentage": 0
},
"byType": {
"manuals": 0,
"datasheets": 0,
"certifications": 0
}
}
MongoDB Schema (Ready)
The nh_unified collection schema includes PDF document fields:
{
_id: ObjectId("..."),
partNumber: "87840296",
sku: "117-2295-001",
title: "Hydraulic Filter Element",
// ... other fields ...
documents: {
manuals: [
{
type: "installation" | "service" | "parts" | "operator",
title: "Installation Guide - Hydraulic Filter",
url: "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
language: "en",
pageCount: 12,
fileSize: 2457600, // bytes
uploadedAt: ISODate("2025-11-17T10:00:00Z"),
metadata: {
version: "1.0",
author: "New Holland Engineering",
tags: ["hydraulic", "filter", "installation"]
}
}
],
datasheets: [
{
type: "specifications" | "dimensions" | "compatibility",
title: "Technical Specifications",
url: "https://storage.googleapis.com/crop_docs/newholland/datasheets/87840296_specs.pdf",
language: "en",
pageCount: 4,
fileSize: 524288
}
],
certifications: [
{
type: "safety" | "quality" | "environmental",
title: "CE Certification",
url: "https://storage.googleapis.com/crop_docs/newholland/certs/87840296_ce.pdf",
issuer: "TUV Rheinland",
issueDate: ISODate("2024-06-15T00:00:00Z"),
validUntil: ISODate("2026-12-31T23:59:59Z"),
certificationNumber: "CE-2024-NHL-87840296"
}
]
}
}
Key Points:
- Schema exists, but the documents field is currently undefined/null for all parts
- API gracefully handles the missing field (returns 0 counts)
- No code changes needed when data arrives - automatic detection
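As an illustration of that graceful handling, here is a minimal sketch of a count query in which a missing, null, or empty documents field simply contributes nothing (not the actual API implementation, just the idea):
import { MongoClient } from 'mongodb';

// Parts whose documents.manuals array is missing, null, or empty are not
// matched, so this count stays at 0 until Phase 2 data arrives.
async function countPartsWithManuals(uri: string): Promise<number> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    return await client
      .db('crop')
      .collection('nh_unified')
      .countDocuments({ 'documents.manuals.0': { $exists: true } });
  } finally {
    await client.close();
  }
}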
Migration Timeline
Phase 1: Schema Ready (Current - Completed ✅)
Status: Complete
Deliverables:
- MongoDB schema includes documents object
- API endpoints return PDF fields with zero counts
- Frontend components handle zero state gracefully
- OpenAPI spec documents PDF structures
- TypeScript types define document interfaces
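For reference, a hedged sketch of what those document interfaces could look like, mirroring the nh_unified schema above (type names here are illustrative, not the actual exported types):
interface ManualDocument {
  type: 'installation' | 'service' | 'parts' | 'operator';
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number; // bytes
  uploadedAt: Date;
  metadata?: { version?: string; author?: string; tags?: string[] };
}

interface DatasheetDocument {
  type: 'specifications' | 'dimensions' | 'compatibility';
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number;
}

interface CertificationDocument {
  type: 'safety' | 'quality' | 'environmental';
  title: string;
  url: string;
  issuer: string;
  issueDate: Date;
  validUntil: Date;
  certificationNumber: string;
}

interface PartDocuments {
  manuals?: ManualDocument[];
  datasheets?: DatasheetDocument[];
  certifications?: CertificationDocument[];
}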
Verification:
# API returns PDF fields
curl http://localhost:3005/api/health/media/coverage | \
jq '.data.documents' && echo "✅ PDF fields present"
# MongoDB schema allows documents
mongo "$MONGODB_URI" --eval '
db.nh_unified.findOne({}, {documents: 1})
' && echo "✅ Schema ready"
Phase 2: Data Integration (Future)
Trigger: When PDF files become available (scanned, uploaded to GCS)
Duration: ~2-4 weeks (data collection + processing)
Steps:
Step 1: Source PDF Files
Option A: Scan GCS Bucket
# Scan existing GCS bucket for PDF files
bun scripts/scan-gcs-documents.ts \
--bucket=crop_docs \
--prefix=newholland/manuals \
--output=gcs-document-manifest.json
# Output: gcs-document-manifest.json
{
"totalFiles": 1847,
"byType": {
"manuals": 987,
"datasheets": 623,
"certifications": 237
},
"files": [
{
"name": "newholland/manuals/87840296_install.pdf",
"url": "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
"size": 2457600,
"contentType": "application/pdf",
"updated": "2025-11-15T08:30:00Z",
"partNumber": "87840296",
"type": "installation"
}
]
}
Option B: Scrape from Source Website
# Scrape PDFs from manufacturer website
bun scripts/scrape-nh-documents.ts \
--source=https://partstore.agriculture.newholland.com \
--download=true \
--upload-to-gcs=true
# Downloads PDFs to local cache, then uploads to GCS
Option C: Manual Upload
# Upload local PDF directory to GCS
gsutil -m cp -r ./local-pdfs/* gs://crop_docs/newholland/
# Generate manifest from local files
bun scripts/generate-manifest-from-local.ts \
--directory=./local-pdfs \
--output=local-document-manifest.json
Step 2: Extract PDF Metadata
# Parse PDF files to extract metadata
bun scripts/parse-pdf-metadata.ts \
--manifest=gcs-document-manifest.json \
--output=enriched-document-manifest.json
# Uses pdf-parse library to extract:
# - Page count
# - Author
# - Title
# - Keywords
# - Creation date
Enriched Manifest Example:
{
"name": "newholland/manuals/87840296_install.pdf",
"url": "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf",
"size": 2457600,
"partNumber": "87840296",
"type": "installation",
"metadata": {
"pageCount": 12,
"author": "New Holland Engineering",
"title": "Installation Guide - Hydraulic Filter",
"keywords": ["hydraulic", "filter", "installation"],
"createdAt": "2025-11-15T08:30:00Z"
}
}
Step 3: Enrich nh_unified Collection
# Bulk update MongoDB with document data
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified \
--dry-run
# Dry-run output:
# Would update 1,847 parts
# - 987 with manuals
# - 623 with datasheets
# - 237 with certifications
# 0 errors, 1,893 parts without matching PDFs
# Execute actual update
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified
# Progress: 1847/1847 parts updated ✅
Update Logic:
// Pseudocode for enrichment script
for (const file of manifest.files) {
const partNumber = file.partNumber;
const documentType = getDocumentType(file.type); // manuals/datasheets/certifications
await db.collection('nh_unified').updateOne(
{ partNumber },
{
$push: {
[`documents.${documentType}`]: {
type: file.type,
title: file.metadata.title,
url: file.url,
language: 'en',
pageCount: file.metadata.pageCount,
fileSize: file.size,
uploadedAt: new Date(file.metadata.createdAt)
}
}
},
{ upsert: false } // Don't create new parts
);
}
Step 4: Verify API Integration
No Code Changes Required! The API automatically detects populated document fields.
# Check coverage endpoint (counts should be non-zero)
curl http://localhost:3005/api/health/media/coverage | jq '.data.documents'
# Expected output:
{
"coverage": {
"count": 1847, # Was 0, now 1847
"percentage": 49.39 # Was 0, now 49.39%
},
"byType": {
"manuals": 987, # Was 0
"datasheets": 623, # Was 0
"certifications": 237 # Was 0
}
}
Verification Checklist:
- /coverage endpoint returns non-zero PDF counts (a small automation sketch follows this list)
- /distribution endpoint includes PDF breakdowns
- /gaps endpoint flags parts missing PDFs
- MongoDB documents have documents.* fields populated
- GCS bucket contains all uploaded PDFs
- PDF URLs are publicly accessible (or signed URLs work)
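To automate the first check, a small hypothetical script (endpoint and response shape taken from the examples above):
// Hypothetical post-migration check: fails if document coverage is still zero.
const BASE_URL = process.env.HEALTH_API_URL ?? 'http://localhost:3005';

async function verifyDocumentCoverage(): Promise<void> {
  const res = await fetch(`${BASE_URL}/api/health/media/coverage`);
  if (!res.ok) throw new Error(`coverage endpoint returned ${res.status}`);
  const body = await res.json();
  const docs = body.data.documents;
  if (docs.coverage.count === 0) {
    throw new Error('PDF coverage is still zero after enrichment');
  }
  console.log(`✅ ${docs.coverage.count} parts have documents (${docs.coverage.percentage}%)`);
}

verifyDocumentCoverage().catch((err) => {
  console.error(err);
  process.exit(1);
});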
Step 5: Frontend Verification
No Code Changes Needed! Frontend components already handle non-zero states.
Example React Component (already works):
function DocumentCoverage({ documents }: { documents: DocumentCoverage }) {
return (
<div>
<h3>PDF Documents</h3>
<p>Coverage: {documents.coverage.percentage.toFixed(1)}%</p>
{documents.byType.manuals > 0 && (
<Badge>
{documents.byType.manuals} Manuals
</Badge>
)}
{documents.byType.datasheets > 0 && (
<Badge>
{documents.byType.datasheets} Datasheets
</Badge>
)}
{documents.byType.certifications > 0 && (
<Badge>
{documents.byType.certifications} Certifications
</Badge>
)}
</div>
);
}
Before Data (Phase 1): Badges don't render (counts are 0)
After Data (Phase 2): Badges automatically appear with actual counts
Scripts Reference
scripts/scan-gcs-documents.ts
Scan GCS bucket for PDF files and generate manifest.
Usage:
bun scripts/scan-gcs-documents.ts \
--bucket=crop_docs \
--prefix=newholland/ \
--output=gcs-document-manifest.json \
--filter=*.pdf
Options:
- --bucket: GCS bucket name (default: crop_docs)
- --prefix: Folder prefix to scan (default: newholland/)
- --output: Output JSON file (default: gcs-document-manifest.json)
- --filter: File pattern filter (default: *.pdf)
Implementation:
import { Storage } from '@google-cloud/storage';
export async function scanGCSDocuments(
bucket: string,
prefix: string
): Promise<DocumentManifest> {
const storage = new Storage();
const [files] = await storage.bucket(bucket).getFiles({ prefix });
const manifest: DocumentManifest = {
totalFiles: 0,
byType: { manuals: 0, datasheets: 0, certifications: 0 },
files: []
};
for (const file of files) {
if (!file.name.endsWith('.pdf')) continue;
const parsed = parseDocumentName(file.name);
if (!parsed) continue;
manifest.files.push({
name: file.name,
url: `https://storage.googleapis.com/${bucket}/${file.name}`,
size: parseInt(file.metadata.size || '0'),
contentType: file.metadata.contentType || 'application/pdf',
updated: file.metadata.updated,
partNumber: parsed.partNumber,
type: parsed.type
});
manifest.byType[parsed.category]++;
manifest.totalFiles++;
}
return manifest;
}
function parseDocumentName(name: string): ParsedDocument | null {
// Parse: newholland/manuals/87840296_install.pdf
const match = name.match(/\/([a-z]+)\/([^_]+)_([^.]+)\.pdf$/);
if (!match) return null;
const [, category, partNumber, type] = match;
return {
category: category as 'manuals' | 'datasheets' | 'certifications',
partNumber,
type
};
}
scripts/parse-pdf-metadata.ts
Extract metadata from PDF files (page count, author, etc.).
Usage:
bun scripts/parse-pdf-metadata.ts \
--manifest=gcs-document-manifest.json \
--output=enriched-document-manifest.json \
--concurrency=10
Options:
- --manifest: Input manifest file
- --output: Output enriched manifest file
- --concurrency: Parallel PDF parsing (default: 10)
Implementation:
import PDFParser from 'pdf-parse';
import fetch from 'node-fetch';
export async function parsePDFMetadata(
file: ManifestFile
): Promise<EnrichedFile> {
// Download PDF from GCS
const response = await fetch(file.url);
const buffer = await response.arrayBuffer();
// Parse PDF
const pdf = await PDFParser(Buffer.from(buffer));
return {
...file,
metadata: {
pageCount: pdf.numpages,
author: pdf.info?.Author || 'Unknown',
title: pdf.info?.Title || file.name,
keywords: parseKeywords(pdf.info?.Keywords),
// Note: PDF CreationDate is often in "D:YYYYMMDDHHmmSS..." form and may need normalization before use with new Date()
createdAt: pdf.info?.CreationDate || new Date().toISOString()
}
};
}
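// The --concurrency flag could be honored with a simple chunked runner.
// Illustrative sketch only; the actual script body is not shown in this guide.
export async function parseAllWithConcurrency(
  files: ManifestFile[],
  concurrency = 10
): Promise<EnrichedFile[]> {
  const results: EnrichedFile[] = [];
  for (let i = 0; i < files.length; i += concurrency) {
    // Parse one chunk of PDFs in parallel before moving to the next chunk.
    const chunk = files.slice(i, i + concurrency);
    results.push(...(await Promise.all(chunk.map(parsePDFMetadata))));
  }
  return results;
}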
function parseKeywords(keywordsStr?: string): string[] {
if (!keywordsStr) return [];
return keywordsStr.split(/[,;]/).map(k => k.trim()).filter(Boolean);
}
scripts/enrich-pdf-data.ts
Bulk update MongoDB with document data.
Usage:
# Dry-run first
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified \
--dry-run
# Execute
bun scripts/enrich-pdf-data.ts \
--manifest=enriched-document-manifest.json \
--collection=nh_unified
Options:
- --manifest: Enriched manifest JSON file
- --collection: MongoDB collection name (default: nh_unified)
- --dry-run: Preview changes without executing
- --batch-size: MongoDB bulk write batch size (default: 100)
Implementation:
import { MongoClient } from 'mongodb';
import type { EnrichedManifest } from './types';
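// getDocumentCategory is used below but not shown in this guide; a plausible
// sketch (an assumption, not the actual helper) maps a concrete document type
// to its schema category.
function getDocumentCategory(
  type: string
): 'manuals' | 'datasheets' | 'certifications' {
  if (['installation', 'service', 'parts', 'operator'].includes(type)) return 'manuals';
  if (['specifications', 'dimensions', 'compatibility'].includes(type)) return 'datasheets';
  return 'certifications'; // safety, quality, environmental
}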
export async function enrichPDFData(
manifest: EnrichedManifest,
collection: string,
dryRun: boolean
): Promise<void> {
const client = new MongoClient(process.env.MONGODB_URI!);
await client.connect();
const db = client.db('crop');
const coll = db.collection(collection);
let updated = 0;
let errors = 0;
for (const file of manifest.files) {
const documentType = getDocumentCategory(file.type);
const updateDoc = {
$push: {
[`documents.${documentType}`]: {
type: file.type,
title: file.metadata.title,
url: file.url,
language: 'en',
pageCount: file.metadata.pageCount,
fileSize: file.size,
uploadedAt: new Date(file.metadata.createdAt),
metadata: {
author: file.metadata.author,
tags: file.metadata.keywords
}
}
}
};
if (dryRun) {
console.log(`[DRY RUN] Would update part ${file.partNumber}:`, updateDoc);
updated++;
continue;
}
try {
const result = await coll.updateOne(
{ partNumber: file.partNumber },
updateDoc
);
if (result.matchedCount > 0) {
updated++;
} else {
console.warn(`Part not found: ${file.partNumber}`);
errors++;
}
} catch (error) {
console.error(`Error updating ${file.partNumber}:`, error);
errors++;
}
}
await client.close();
console.log(`
Migration ${dryRun ? 'Preview' : 'Complete'}:
- Updated: ${updated}
- Errors: ${errors}
- Total: ${manifest.files.length}
`);
}
Quality Assurance
Pre-Migration Checklist
Before running data migration in production:
- Backup MongoDB: mongodump --uri="$MONGODB_URI" --out=backup-$(date +%Y%m%d)
- Verify GCS Bucket: All PDFs accessible, correct permissions (a spot-check sketch follows this list)
- Test Manifest: Run scripts in dev environment first
- Dry-Run Migration: Review changes without executing
- Monitor Disk Space: Ensure sufficient GCS quota
- Alert Stakeholders: Notify frontend/content teams of migration
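For the GCS verification item, a hedged spot-check sketch that issues HEAD requests against a sample of manifest URLs (manifest file name and shape follow the scan output shown earlier):
import { readFileSync } from 'node:fs';

// Spot-check the first 20 manifest URLs with HEAD requests.
const manifest = JSON.parse(readFileSync('gcs-document-manifest.json', 'utf8'));

for (const file of manifest.files.slice(0, 20)) {
  const res = await fetch(file.url, { method: 'HEAD' });
  console.log(`${res.ok ? '✅' : '❌'} ${res.status} ${file.url}`);
}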
Post-Migration Validation
After migration completes:
1. Data Integrity
# Count parts with documents
mongo "$MONGODB_URI" --eval '
db.nh_unified.countDocuments({
"documents.manuals": { $exists: true, $ne: [] }
})
'
# Expected: ~987 parts
# Verify URL accessibility
curl -I "https://storage.googleapis.com/crop_docs/newholland/manuals/87840296_install.pdf"
# Expected: HTTP 200 OK
2. API Validation
# Coverage endpoint
curl http://localhost:3005/api/health/media/coverage | \
jq '.data.documents.byType' | \
grep -q '"manuals": 987' && echo "✅ Manuals count correct"
# Gaps endpoint (parts missing PDFs)
curl 'http://localhost:3005/api/health/media/gaps?mediaType=documents&limit=10' | \
jq '.data.totalGaps' | \
grep -qE '^[0-9]+$' && echo "✅ Gaps API functional"
3. Sample Verification
# Fetch random sample of 10 parts with documents
mongo "$MONGODB_URI" --eval '
db.nh_unified.aggregate([
{ $match: { "documents.manuals": { $exists: true } } },
{ $sample: { size: 10 } },
{ $project: { partNumber: 1, "documents.manuals.url": 1 } }
]).forEach(part => {
print(`Part ${part.partNumber}: ${part.documents.manuals.length} manuals`);
})
'
# Manually verify URLs are accessible
4. Performance Testing
# Coverage endpoint response time (should remain <500ms)
time curl -s http://localhost:3005/api/health/media/coverage > /dev/null
# Expected: real 0m0.250s
Rollback Plan
If the migration causes issues, follow this rollback procedure:
Step 1: Restore MongoDB Backup
# Stop writes to collection
# (set API to read-only mode or shut down)
# Restore from backup
mongorestore \
--uri="$MONGODB_URI" \
--nsInclude="crop.nh_unified" \
--drop \
backup-20251117/crop/nh_unified.bson
# Verify restoration
mongo "$MONGODB_URI" --eval '
db.nh_unified.countDocuments({ "documents.manuals": { $exists: true } })
'
# Expected: 0 (pre-migration state)
Step 2: Verify API Returns Zero Counts
curl http://localhost:3005/api/health/media/coverage | \
jq '.data.documents.byType' | \
grep -q '"manuals": 0' && echo "✅ Rollback successful"
Step 3: Re-enable Writes
# Restart API service (removes read-only mode)
# Or re-enable write endpoints
Rollback Time: ~15 minutes (depends on collection size)
No Breaking Changes Guarantee
Frontend Compatibility
The API is designed to be backwards compatible with zero-data and non-zero-data states.
Before Migration (Phase 1):
{
"documents": {
"coverage": { "count": 0, "percentage": 0 },
"byType": { "manuals": 0, "datasheets": 0, "certifications": 0 }
}
}
After Migration (Phase 2):
{
"documents": {
"coverage": { "count": 1847, "percentage": 49.39 },
"byType": { "manuals": 987, "datasheets": 623, "certifications": 237 }
}
}
Frontend Code (Works in Both States):
// ✅ Handles zero state gracefully
function renderDocuments(documents: DocumentCoverage) {
if (documents.coverage.count === 0) {
return <EmptyState message="No documents available yet" />;
}
return (
<div>
<h3>{documents.coverage.count} Documents</h3>
{documents.byType.manuals > 0 && <Badge>{documents.byType.manuals} Manuals</Badge>}
{documents.byType.datasheets > 0 && <Badge>{documents.byType.datasheets} Datasheets</Badge>}
</div>
);
}
Key Points:
- Zero checks (count === 0) prevent rendering empty UI
- Conditional rendering (count > 0) shows badges only when data exists
- No API version bump needed
- No frontend deployments required
Monitoring & Alerts
Metrics to Track
After migration, monitor these metrics:
1. API Performance
- /coverage response time (target: <300ms)
- /gaps with mediaType=documents response time (target: <600ms)
- Error rate (target: <0.1%)
2. Data Quality
- Parts with documents: 1,847 (49.39%)
- Average documents per part: 1.2
- Orphaned documents (no matching part): 0
3. GCS Usage
- Total PDF storage: ~4.2 GB
- Bandwidth usage: Monitor for unexpected spikes
- 404 errors: Should be near zero (all URLs valid)
Alerting Rules
Set up alerts for:
# Prometheus alert rules
groups:
- name: media_coverage_pdf
rules:
- alert: PDFCoverageDropped
expr: |
media_coverage_documents_count < 1800
for: 5m
annotations:
summary: 'PDF document count dropped below threshold'
- alert: PDFEndpointSlow
expr: |
histogram_quantile(0.95, media_coverage_response_time_seconds_bucket{endpoint="/coverage"}) > 0.5
for: 5m
annotations:
summary: '/coverage endpoint P95 latency >500ms'
- alert: GCS404Rate
expr: |
rate(gcs_pdf_404_errors_total[5m]) > 0.01
for: 5m
annotations:
summary: 'GCS PDF 404 error rate >1%'
FAQ
Q: Will the migration cause downtime?
A: No. Migration runs as background MongoDB updates. API remains available throughout.
Q: Can we migrate incrementally (e.g., manuals first, then datasheets)?
A: Yes! Run enrich-pdf-data.ts multiple times with different manifest files:
# Step 1: Migrate manuals only
bun scripts/enrich-pdf-data.ts --manifest=manuals-manifest.json
# Step 2: Later, migrate datasheets
bun scripts/enrich-pdf-data.ts --manifest=datasheets-manifest.json
Q: What if a part has multiple manuals (e.g., installation + service)?
A: MongoDB's $push appends to the array, so a part can hold any number of documents:
{
"documents": {
"manuals": [
{ "type": "installation", "title": "Installation Guide", ... },
{ "type": "service", "title": "Service Manual", ... }
]
}
}
Q: How do we update a PDF (e.g., new version released)?
A: Two options:
Option A: Replace Array Element
db.nh_unified.updateOne(
{
partNumber: '87840296',
'documents.manuals.type': 'installation'
},
{
$set: {
'documents.manuals.$.url': 'https://...new-version.pdf',
'documents.manuals.$.metadata.version': '2.0'
}
}
);
Option B: Add as New Document (keep history)
db.nh_unified.updateOne(
{ partNumber: '87840296' },
{
$push: {
'documents.manuals': {
type: 'installation',
title: 'Installation Guide v2.0',
url: 'https://...new-version.pdf',
metadata: { version: '2.0', supersedes: 'v1.0' }
}
}
}
);
Q: Can we prioritize high-quality parts for PDF enrichment?
A: Yes! Filter manifest by quality scores:
# Generate manifest for high-quality parts only
bun scripts/scan-gcs-documents.ts --min-quality=80 > high-quality-manifest.json
# Migrate high-quality parts first
bun scripts/enrich-pdf-data.ts --manifest=high-quality-manifest.json
Timeline Estimate
| Phase | Duration | Owner | Status |
|---|---|---|---|
| 1. Schema Design | 2 days | Backend | ✅ Complete |
| 2. API Implementation | 3 days | Backend | ✅ Complete |
| 3. Frontend Zero-State | 2 days | Frontend | ✅ Complete |
| 4. PDF Sourcing | 1-2 weeks | Content Team | ⏳ Pending |
| 5. GCS Upload | 2-3 days | DevOps | ⏳ Pending |
| 6. Metadata Parsing | 3-4 days | Backend | ⏳ Pending |
| 7. MongoDB Enrichment | 1 day | Backend | ⏳ Pending |
| 8. QA & Validation | 2-3 days | QA | ⏳ Pending |
Total Estimated Time: 2-4 weeks (primarily waiting on PDF sourcing)
Contact
Questions or issues during migration?
- Slack: #backend-health-analytics
- Email: backend-team@crop.example.com
- Oncall: PagerDuty escalation
Last Updated: 2025-11-17
Document Version: 1.0