Media Service
Image processing, storage, and SEO metadata
The Media Service manages product images, 360-degree views, and PDF documents, embeds SEO metadata into images, and enriches parts with third-party data. It provides analytics APIs for media coverage, a pipeline that embeds SEO metadata into images, and an Amazon integration for product data enrichment.
Data Pipeline
Scrapers / Vendors / Amazon
↓
MongoDB Atlas (nh_unified / crop_stage.parts / crop_prod.parts)
↓
Transformers (normalize, generate SKU/slug, process media)
↓
Elasticsearch (parts_current alias)
↓
Search API → Frontend (SEO output)
SEO Field Mapping
Data flows from MongoDB through Elasticsearch to the frontend, where it is rendered as SEO HTML elements and structured data.
| IndexedPart Field | SEO Output | HTML Element |
|---|---|---|
| title | metaTitle | <title> |
| description | metaDescription | <meta description> |
| partNumber | schemaData.mpn | JSON-LD |
| sku | schemaData.sku | JSON-LD |
| manufacturer.name | schemaData.brand | JSON-LD |
| media.primaryImage | ogImage | <meta og:image> |
| media.images | schemaData.image | JSON-LD |
| price.list.value | schemaData.offers.price | JSON-LD |
| inventory.inStock | offers.availability | JSON-LD |
| slug | canonicalUrl | <link canonical> |
| categoryPath | BreadcrumbList | JSON-LD |
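The mapping in the table can be sketched as a single function. This is a minimal illustration, not the service's actual code: the `IndexedPart` field types, the `toSeoOutput` name, the `baseUrl` parameter, and the `/parts/` path segment are all assumptions introduced for the example.

```typescript
// Field types are assumptions inferred from the mapping table.
interface IndexedPart {
  title: string;
  description: string;
  partNumber: string;
  sku: string;
  manufacturer: { name: string };
  media: { primaryImage: string; images: string[] };
  price: { list: { value: number } };
  inventory: { inStock: boolean };
  slug: string;
  categoryPath: string[];
}

interface SeoOutput {
  metaTitle: string;
  metaDescription: string;
  ogImage: string;
  canonicalUrl: string;
  schemaData: Record<string, unknown>;
  breadcrumbList: Record<string, unknown>;
}

// baseUrl and the "/parts/" segment are hypothetical.
function toSeoOutput(part: IndexedPart, baseUrl: string): SeoOutput {
  return {
    metaTitle: part.title,
    metaDescription: part.description,
    ogImage: part.media.primaryImage,
    canonicalUrl: `${baseUrl}/parts/${part.slug}`,
    schemaData: {
      "@context": "https://schema.org",
      "@type": "Product",
      name: part.title,
      mpn: part.partNumber,
      sku: part.sku,
      brand: { "@type": "Brand", name: part.manufacturer.name },
      image: part.media.images,
      offers: {
        "@type": "Offer",
        price: part.price.list.value,
        availability: part.inventory.inStock
          ? "https://schema.org/InStock"
          : "https://schema.org/OutOfStock",
      },
    },
    breadcrumbList: {
      "@context": "https://schema.org",
      "@type": "BreadcrumbList",
      // One ListItem per category level, positions starting at 1.
      itemListElement: part.categoryPath.map((name, i) => ({
        "@type": "ListItem",
        position: i + 1,
        name,
      })),
    },
  };
}
```

The frontend would serialize `schemaData` and `breadcrumbList` with `JSON.stringify` into JSON-LD script tags; how it actually does so is not covered by this page.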
Field Origin Traces
title: Scraper → MongoDB.title → ES.title → API.title → SEO.metaTitle
partNumber: Scraper → MongoDB.partNumber → Transformer(normalize) → ES.partNumber + ES.pnNorm → API → SEO.schemaData.mpn
sku: Transformer(generate) → MongoDB.sku → ES.sku → API.sku → SEO.schemaData.sku
SKU format: CT-{manufacturer.code}-{partNumber}
primaryImage: GCS Upload → MongoDB.media.images[0].gcpUrl → ES.media.primaryImage → API → SEO.ogImage
price: Scraper/ERP → MongoDB.price.list → ES.price.list → API → SEO.schemaData.offers
categoryPath: Scraper → Transformer(normalize) → MongoDB.categoryPath → ES.categoryPath → API → SEO.BreadcrumbList
All field names use camelCase across every layer: MongoDB, Elasticsearch, API, and SEO output.
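The SKU and part-number steps in the traces above could look roughly like this. The SKU format comes from this page; the normalization rule behind `pnNorm` is not documented here, so the one below (upper-case, strip non-alphanumerics) is purely an assumption, as are the function names.

```typescript
// Assumed normalization for pnNorm: upper-case, keep only A-Z and 0-9.
function normalizePartNumber(partNumber: string): string {
  return partNumber.toUpperCase().replace(/[^A-Z0-9]/g, "");
}

// SKU format from this page: CT-{manufacturer.code}-{partNumber}.
function buildSku(manufacturerCode: string, partNumber: string): string {
  return `CT-${manufacturerCode}-${partNumber}`;
}
```

For example, `buildSku("NHL", "00907566")` yields the `CT-NHL-00907566` value used elsewhere in this document.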
Media Coverage API
The Media Coverage API provides analytics on media richness across the parts catalog. It tracks gallery images, 360-degree views, and PDF documents.
Base URL (dev): http://localhost:3005/api/health/media
Production URL: https://health-analytics-service-[hash].run.app/api/health/media
GET /api/health/media/coverage
Returns a comprehensive media coverage summary.
Query Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| environment | string | prod | prod, dev, or stage |
Response Schema:
{
success: boolean;
data: {
summary: {
totalParts: number;
withAnyMedia: number;
withoutMedia: number;
coveragePercentage: number;
};
images: {
coverage: { count: number; percentage: number };
gallery: { count: number; percentage: number };
view360: {
count: number;
percentage: number;
withFrames: number;
avgFrameCount: number;
};
};
documents: {
coverage: { count: number; percentage: number };
byType: {
manuals: number;
datasheets: number;
certifications: number;
};
};
qualityCorrelation: {
withMediaAvgQuality: number;
withoutMediaAvgQuality: number;
delta: number;
};
};
meta: {
timestamp: string;
environment: string;
database: string;
collection: string;
};
}
Response time: ~150-300ms | Cache TTL: 10 minutes
GET /api/health/media/distribution
Returns detailed distribution of image types and 360-degree view characteristics.
Query Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| environment | string | prod | prod, dev, or stage |
| groupBy | string | type | type, quality, or frames |
Response includes:
- imageTypes -- counts and percentages per type (marketing, front, back, left, right, angle1, angle2)
- view360Distribution -- breakdowns by frame count (24/36/48), quality (high/standard), and grid layout (4x6, 6x6, 8x6)
Response time: ~200-400ms | Cache TTL: 15 minutes
GET /api/health/media/gaps
Identifies high-quality parts missing media (enrichment opportunities).
Query Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| minQuality | number | 70 | Minimum quality score (0-100) |
| mediaType | string | all | all, gallery, view360, or documents |
| limit | number | 50 | Max results (1-500) |
| offset | number | 0 | Pagination offset |
| sortBy | string | quality | quality, partNumber, or sku |
| sortOrder | string | desc | asc or desc |
| environment | string | prod | prod, dev, or stage |
Each gap item includes partNumber, sku, title, qualityScore, missingMedia flags, existingMedia summary, enrichmentPriority (high/medium/low), and estimatedImpact.
Response time: ~300-600ms | Cache TTL: 5 minutes
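A client-side helper that applies the defaults and ranges from the parameter table might look like the sketch below. The `buildGapsUrl` name and the decision to always send every parameter are assumptions; the server may well accept omitted parameters.

```typescript
type Environment = "prod" | "dev" | "stage";
type MediaType = "all" | "gallery" | "view360" | "documents";
type SortBy = "quality" | "partNumber" | "sku";
type SortOrder = "asc" | "desc";

interface GapsQuery {
  minQuality?: number;
  mediaType?: MediaType;
  limit?: number;
  offset?: number;
  sortBy?: SortBy;
  sortOrder?: SortOrder;
  environment?: Environment;
}

// baseUrl is the media base URL, e.g. http://localhost:3005/api/health/media
function buildGapsUrl(baseUrl: string, query: GapsQuery = {}): string {
  const minQuality = query.minQuality ?? 70;
  const limit = query.limit ?? 50;
  const offset = query.offset ?? 0;

  // Ranges taken from the parameter table.
  if (minQuality < 0 || minQuality > 100) throw new RangeError("minQuality must be 0-100");
  if (!Number.isInteger(limit) || limit < 1 || limit > 500) throw new RangeError("limit must be 1-500");
  if (!Number.isInteger(offset) || offset < 0) throw new RangeError("offset must be >= 0");

  const pairs: [string, string | number][] = [
    ["minQuality", minQuality],
    ["mediaType", query.mediaType ?? "all"],
    ["limit", limit],
    ["offset", offset],
    ["sortBy", query.sortBy ?? "quality"],
    ["sortOrder", query.sortOrder ?? "desc"],
    ["environment", query.environment ?? "prod"],
  ];
  const qs = pairs.map(([k, v]) => `${k}=${encodeURIComponent(String(v))}`).join("&");
  return `${baseUrl}/gaps?${qs}`;
}
```

Rejecting out-of-range values before the request avoids a round trip that would otherwise end in a 400 response.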
Error Codes
| Code | Description |
|---|---|
| 400 | Invalid parameters (environment, minQuality, limit, groupBy) |
| 404 | Collection not found |
| 500 | MongoDB query error |
| 503 | Service unavailable |
Rate Limiting (Production)
- Coverage and distribution endpoints: 100 req/min per IP
- Gaps endpoint: 60 req/min per IP
Media Data Model
Gallery Images
Static product photos in various angles, stored in GCS.
- Bucket: gs://crop_parts/newholland/images/
- Path: {partNumber}/{partNumber}_{TYPE}.jpg
- Clear background variant: {partNumber}/{partNumber}_CB_{TYPE}.jpg
Image Types and Priority Order:
| Type | Description | Priority |
|---|---|---|
| FRONT | Front-facing product photo | 1 (primary product identification) |
| BACK | Rear view | 2 (installation reference) |
| MARKETING | Lifestyle/promotional | 3 (visual appeal) |
| LEFT / RIGHT | Side views | 4 (detailed inspection) |
| ANGLE1 / ANGLE2 | Perspective views (R01_C24, R02_C24) | 5 (additional context) |
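The path convention and the priority order can be combined in a small helper. This is a sketch under assumptions: the function names are invented, and using the priority column to pick a primary image is an interpretation of the table rather than documented behavior.

```typescript
type ImageType = "FRONT" | "BACK" | "MARKETING" | "LEFT" | "RIGHT" | "ANGLE1" | "ANGLE2";

const BUCKET_PREFIX = "gs://crop_parts/newholland/images/";

// Lower number = higher priority, following the table above.
const PRIORITY: Record<ImageType, number> = {
  FRONT: 1,
  BACK: 2,
  MARKETING: 3,
  LEFT: 4,
  RIGHT: 4,
  ANGLE1: 5,
  ANGLE2: 5,
};

// Builds {partNumber}/{partNumber}_{TYPE}.jpg, or the _CB_ variant.
function imagePath(partNumber: string, type: ImageType, clearBackground = false): string {
  const variant = clearBackground ? "CB_" : "";
  return `${BUCKET_PREFIX}${partNumber}/${partNumber}_${variant}${type}.jpg`;
}

// Returns the highest-priority type present, or undefined for an empty list.
function pickPrimary(available: ImageType[]): ImageType | undefined {
  // Sort a copy so the caller's array is left untouched.
  return [...available].sort((a, b) => PRIORITY[a] - PRIORITY[b])[0];
}
```

With `["ANGLE1", "BACK", "MARKETING"]` available, `pickPrimary` selects `BACK`, since no FRONT image exists.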
360-Degree Views
Interactive spin views with multiple frames.
| Frames | Quality | Grid | Description |
|---|---|---|---|
| 24 | Standard | 4x6 | Basic spin view |
| 36 | High | 6x6 | Smooth rotation |
| 48+ | Premium | 8x6 | Maximum detail |
Status levels: gcp (GCS-hosted, best), external (third-party), url_only (needs migration), not_available, none
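A consistency check over a view360 record can be sketched as follows. The field names (`status`, `frameCount`, `rows`, `columns`, `frames`) follow the view360 document structure in this section; the `checkView360` function and the choice of checks are assumptions, and the zero-based `row`/`col` indexing is inferred from the single sample frame.

```typescript
interface Frame {
  url: string;
  row: number;
  col: number;
}

interface View360 {
  status: "gcp" | "external" | "url_only" | "not_available" | "none";
  frameCount: number;
  rows: number;
  columns: number;
  frames: Frame[];
}

// Returns a list of human-readable problems; an empty list means consistent.
function checkView360(v: View360): string[] {
  const problems: string[] = [];
  if (v.frameCount !== v.frames.length) {
    problems.push(`frameCount ${v.frameCount} != frames.length ${v.frames.length}`);
  }
  for (const f of v.frames) {
    // Assumes zero-based row/col positions inside the rows x columns grid.
    if (f.row < 0 || f.row >= v.rows || f.col < 0 || f.col >= v.columns) {
      problems.push(`frame outside grid: ${f.url}`);
    }
  }
  return problems;
}
```

A mismatch between `frameCount` and the actual `frames` length is the same condition the troubleshooting notes suggest checking when frame counts look wrong.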
Document structure:
{
"view360": {
"status": "gcp",
"frameCount": 24,
"rows": 4,
"columns": 6,
"frames": [
{ "url": "https://storage.googleapis.com/.../frame_001.jpg", "row": 0, "col": 0 }
]
}
}
PDF Documents
Types: manuals (installation, service, parts, operator), datasheets (specifications, dimensions, compatibility), certifications (safety, quality, environmental).
Current state: Schema is ready. The documents field in nh_unified is currently empty; the API returns zero counts and will automatically reflect non-zero values once data is populated -- no code changes required.
MongoDB schema:
{
"documents": {
"manuals": [{
"type": "installation",
"title": "Installation Guide",
"url": "https://storage.googleapis.com/crop_docs/newholland/manuals/{pn}_install.pdf",
"language": "en",
"pageCount": 12,
"fileSize": 2457600,
"uploadedAt": "2025-11-17T10:00:00Z",
"metadata": { "version": "1.0", "author": "...", "tags": [] }
}],
"datasheets": [{ "type": "specifications", "..." : "..." }],
"certifications": [{
"type": "safety",
"issuer": "TUV Rheinland",
"validUntil": "2026-12-31T23:59:59Z",
"certificationNumber": "CE-2024-NHL-87840296",
"..." : "..."
}]
}
}
PDF Migration Plan
Migration is designed with zero downtime and no breaking changes. Frontend code handles both zero and non-zero states.
Steps
- Source PDFs -- scan GCS bucket (bun scripts/scan-gcs-documents.ts), scrape from vendor, or manual upload via gsutil
- Extract metadata -- parse with pdf-parse to get page count, author, title, keywords (bun scripts/parse-pdf-metadata.ts)
- Enrich MongoDB -- bulk-update nh_unified with $push to documents.* arrays (bun scripts/enrich-pdf-data.ts --dry-run then without --dry-run)
- Verify -- API automatically returns non-zero counts; check /coverage, /distribution, /gaps
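The enrichment step can be illustrated without a database driver by building the update documents first and applying them only outside dry-run mode. This is a sketch, not the contents of the enrichment script: the function names, the `partNumber` filter, and the injected `apply` callback are assumptions. The `$push` with `$each` form is standard MongoDB update syntax.

```typescript
type DocumentKind = "manuals" | "datasheets" | "certifications";

interface PdfDocument {
  type: string;
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number;
  uploadedAt: string;
}

interface UpdateOp {
  filter: { partNumber: string };
  update: { $push: Record<string, { $each: PdfDocument[] }> };
}

// Appends docs to documents.manuals / documents.datasheets / documents.certifications.
function buildPushUpdate(partNumber: string, kind: DocumentKind, docs: PdfDocument[]): UpdateOp {
  return {
    filter: { partNumber }, // matching on partNumber is an assumption
    update: { $push: { [`documents.${kind}`]: { $each: docs } } },
  };
}

// apply is injected so the sketch stays independent of any MongoDB driver.
// Returns how many operations were actually applied.
function enrich(ops: UpdateOp[], dryRun: boolean, apply: (op: UpdateOp) => void): number {
  let applied = 0;
  for (const op of ops) {
    if (dryRun) continue; // dry run: build and inspect, never write
    apply(op);
    applied += 1;
  }
  return applied;
}
```

Running once with `dryRun` set to `true` and reviewing the generated operations mirrors the two-pass use of `--dry-run` described in the steps.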
Rollback
# Restore from backup
mongorestore --uri="$MONGODB_URI" --nsInclude="crop.nh_unified" --drop backup-YYYYMMDD/crop/nh_unified.bson
Rollback time: ~15 minutes.
Image Metadata Embedding (XMP)
Product metadata is embedded directly into images using the XMP standard so that it stays with the file when images are copied, shared, or indexed by search engines.
Storage Architecture
| Layer | Location | Purpose |
|---|---|---|
| JSON sidecar | GCS, next to image | Source of truth |
| XMP tags | Inside the image file | Inseparable from image |
| MongoDB | media.images[].embeddedMetadata | Fast queries |
| Elasticsearch | parts_current index | Full-text search |
GCS Folder Structure
gs://crop_parts/
├── ct/gallery/nhl/{partNumber}/
│ ├── {pn}-1.jpg # Image with XMP
│ └── {pn}-1.meta.json # JSON sidecar (source of truth)
├── ct/360/nhl/{partNumber}/
│ └── frame-001.jpg ...
├── vendor_scraped/
├── vendor_direct/
└── manual/
Metadata Schema
Each image has a JSON sidecar with these sections:
- company -- name, type ("Authorized Reseller"), website, contact
- legal -- copyright, license, termsUrl
- product -- sku (CT-{VENDOR}-{PARTNUMBER}), partNumber, pnNorm, title, manufacturer, categoryName/categoryPath, equipmentFitment, status
- image -- type, sortOrder, source, originalUrl, alt, contentHash
- catalog -- product page URL, slug
- embedding -- embedded flag, timestamp, version, processor ID
XMP Tag Mapping
| JSON Field | XMP Tag |
|---|---|
| company.name | dc:creator |
| legal.copyright | dc:rights |
| product.title | dc:title |
| product.sku | crop:SKU |
| product.partNumber | crop:PartNumber |
| product.manufacturer.name | crop:Manufacturer |
| product.manufacturer.code | crop:ManufacturerCode |
| product.categoryPath[0] | crop:Category |
| legal.license | xmpRights:UsageTerms |
| catalog.url | crop:CatalogURL |
| embedding.version | crop:Version |
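The table translates directly into a function from a sidecar object to a flat tag record. The sketch below covers only the fields the table mentions; the `Sidecar` shape is a trimmed assumption, and `toXmpTags` is a hypothetical name, not part of the pipeline's documented API.

```typescript
// Only the sidecar fields referenced by the mapping table are modeled here.
interface Sidecar {
  company: { name: string };
  legal: { copyright: string; license: string };
  product: {
    title: string;
    sku: string;
    partNumber: string;
    manufacturer: { name: string; code: string };
    categoryPath?: string[];
  };
  catalog: { url: string };
  embedding: { version: string };
}

function toXmpTags(meta: Sidecar): Record<string, string> {
  const tags: Record<string, string> = {
    "dc:creator": meta.company.name,
    "dc:rights": meta.legal.copyright,
    "dc:title": meta.product.title,
    "crop:SKU": meta.product.sku,
    "crop:PartNumber": meta.product.partNumber,
    "crop:Manufacturer": meta.product.manufacturer.name,
    "crop:ManufacturerCode": meta.product.manufacturer.code,
    "xmpRights:UsageTerms": meta.legal.license,
    "crop:CatalogURL": meta.catalog.url,
    "crop:Version": meta.embedding.version,
  };
  // categoryPath is optional; only the first level maps to crop:Category.
  const category = meta.product.categoryPath?.[0];
  if (category !== undefined) tags["crop:Category"] = category;
  return tags;
}
```

Because `crop:` is a custom namespace, a writer such as ExifTool generally needs it defined in its configuration before these tags can be written; that setup is outside this sketch.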
TypeScript Types
export interface ImageMetadata {
schemaVersion: string;
createdAt: string;
updatedAt: string;
company: CompanyInfo;
legal: LegalInfo;
product: ProductInfo;
image: ImageInfo;
catalog: CatalogInfo;
embedding: EmbeddingInfo;
}
export interface ProductInfo {
sku: string; // "CT-NHL-00907566"
partNumber: string;
pnNorm?: string;
title: string;
description?: string;
manufacturer: { name: string; code: string };
categoryName?: string[];
categoryPath?: string[];
equipmentFitment?: string[];
status?: 'active' | 'discontinued' | 'superseded';
}
A Zod validation schema (ImageMetadataSchema) enforces structure at runtime, including regex validation on the SKU format (/^CT-[A-Z]{2,3}-\w+$/).
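To show what the SKU pattern accepts and rejects, here is a dependency-free sketch that reuses the same regular expression. It is not the Zod schema itself; `isValidSku` and `isStatus` are hypothetical helpers written only for illustration.

```typescript
// Same pattern as quoted above for ImageMetadataSchema.
const SKU_PATTERN = /^CT-[A-Z]{2,3}-\w+$/;

const STATUSES = ["active", "discontinued", "superseded"] as const;
type Status = (typeof STATUSES)[number];

function isValidSku(sku: string): boolean {
  return SKU_PATTERN.test(sku);
}

// Type guard for the status union in ProductInfo.
function isStatus(value: string): value is Status {
  return (STATUSES as readonly string[]).indexOf(value) !== -1;
}
```

Note that `\w` matches letters, digits, and underscore only, so a SKU whose part-number segment contains a hyphen (for example `CT-NHL-87-840`) fails this pattern, as does a lower-case or four-letter manufacturer code.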
Hybrid Processing System
Metadata embedding uses a hybrid local/cloud architecture:
Orchestrator (Cloud Run)
├── Task Queue → Router → Local Worker (primary, ~free)
│ └→ Cloud Worker (fallback, $0.12/GB)
└── Monitor & Alerts
- Local worker pulls tasks, downloads images in bulk, embeds XMP with exiftool-vendored, uploads results
- Cloud worker (Cloud Run) auto-scales 0-1000, used as fallback when local is unhealthy or for urgent tasks
- Routing logic checks local worker health (heartbeat timeout 2min, CPU >95%, memory >90%, disk <10GB) and falls back to cloud automatically
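The routing rule described above reduces to a small pure function. The thresholds come from this section; the type and function names, and the shape of the health report, are assumptions made for the sketch.

```typescript
interface LocalWorkerHealth {
  lastHeartbeatMs: number; // epoch milliseconds of the last heartbeat
  cpuPercent: number;
  memoryPercent: number;
  freeDiskGb: number;
}

const HEARTBEAT_TIMEOUT_MS = 2 * 60 * 1000; // 2 minutes

// Unhealthy when: heartbeat older than 2 min, CPU > 95%, memory > 90%, disk < 10 GB.
function isLocalHealthy(h: LocalWorkerHealth, nowMs: number): boolean {
  return (
    nowMs - h.lastHeartbeatMs <= HEARTBEAT_TIMEOUT_MS &&
    h.cpuPercent <= 95 &&
    h.memoryPercent <= 90 &&
    h.freeDiskGb >= 10
  );
}

// Urgent tasks go to the cloud worker; everything else prefers local.
function routeTask(
  task: { urgent: boolean },
  health: LocalWorkerHealth,
  nowMs: number
): "local" | "cloud" {
  if (task.urgent) return "cloud";
  return isLocalHealthy(health, nowMs) ? "local" : "cloud";
}
```

Passing `nowMs` in rather than calling `Date.now()` inside keeps the decision deterministic and easy to test.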
Cost at scale (5M images): ~$50-100 hybrid vs ~$590 cloud-only.
Security Rules
Metadata includes: company name, SKU, part number, manufacturer, copyright, catalog URL.
Metadata excludes: prices, cost data, inventory levels, internal IDs, customer data, API keys.
Amazon Data Enrichment
Amazon Product Advertising API is used to enrich parts with descriptions, dimensions, images, and specifications.
Architecture
MongoDB (source) → Enrichment Service → Amazon PA API
→ Oxylabs (optional)
→ Rainforest API (optional)
↓
ES Index (search)
Matching Strategy
| Stage | Method | Confidence |
|---|---|---|
| 1 | UPC match | 99% |
| 2 | Part number exact search | 85% |
| 3 | Manufacturer + part number | 75% |
| 4 | Title fuzzy search | 70% |
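The staged strategy can be read as a cascade that stops at the first stage producing a hit. The sketch below assumes the stages run in table order and that each stage has an injected lookup function; real lookups would call an external API and be asynchronous, and all names here are hypothetical.

```typescript
interface Part {
  upc?: string;
  partNumber: string;
  manufacturer: string;
  title: string;
}

interface Match {
  asin: string;
  stage: number;
  method: string;
  confidence: number;
}

// A lookup returns a product identifier (for Amazon, an ASIN) or undefined.
type Lookup = (part: Part) => string | undefined;

interface Stage {
  stage: number;
  method: string;
  confidence: number; // from the table, as a fraction
  lookup: Lookup;
}

// Tries each stage in order and returns the first hit with its confidence.
function matchPart(part: Part, stages: Stage[]): Match | undefined {
  for (const s of stages) {
    const asin = s.lookup(part);
    if (asin !== undefined) {
      return { asin, stage: s.stage, method: s.method, confidence: s.confidence };
    }
  }
  return undefined;
}
```

Ordering the stages by descending confidence means the returned match always carries the highest confidence available for that part.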
Fields Enriched
Priority 1 (Critical):
| Field | API Path | Usage |
|---|---|---|
| Features / Bullets | ItemInfo.Features.DisplayValues | Enhanced descriptions, SEO keywords, selling points |
| Description | ItemInfo.ProductInfo.ItemDescription | Extended product page text |
| Dimensions | ItemInfo.ProductInfo.ItemDimensions | Shipping calculation, size filters |
| Weight | ItemInfo.ProductInfo.ItemDimensions.Weight | Shipping cost, product grouping |
| UPC / GTIN / EAN | ItemInfo.ProductInfo.UPCList | Cross-referencing, POS integration, deduplication |
| Images (high-res) | Images.Primary, Images.Variants | Up to 1500px product photos |
Priority 2 (High-value):
| Field | API Path | Usage |
|---|---|---|
| Categories | BrowseNodeInfo.BrowseNodes | Auto-categorization, SEO breadcrumbs |
| Technical specs | ItemInfo.TechnicalInfo | Spec sheets, comparison, filters |
| Brand info | ItemInfo.ByLineInfo.Brand | Manufacturer verification, brand pages |
Priority 3 (Nice to have): Customer reviews/ratings, Q&A count, warranty info, package dimensions, related products.
All fields come in a single API call -- no additional cost per field.
Cost Estimate
| Category | Cost |
|---|---|
| One-time enrichment (~56k parts) | ~$96 |
| Monthly maintenance | ~$25 |
| Re-enrichment (quarterly) | ~$38 |
| Total Year 1 | ~$237 |
Coverage Expectations (56k parts)
| Data Field | Estimated Coverage |
|---|---|
| Features/Bullets | 65% |
| Dimensions | 70% |
| Weight | 75% |
| UPC | 80% |
| High-res images | 65% |
| Categories | 70% |
| Specifications | 60% |
| Reviews | 50% |
Overall enrichment rate: ~70%.
Environment Variables
# Amazon Product Advertising API
AMAZON_PA_ACCESS_KEY=your_access_key
AMAZON_PA_SECRET_KEY=your_secret_key
AMAZON_PA_PARTNER_TAG=your_partner_tag
# Optional alternatives
OXYLABS_USERNAME=your_username
OXYLABS_PASSWORD=your_password
RAINFOREST_API_KEY=your_api_key
Implementation Phases
- Weeks 1-2: Foundation and API setup
- Week 3: Pilot enrichment (100 parts)
- Weeks 4-6: Bulk enrichment (56k parts)
- Week 7: Data merge and ES sync
- Week 8: Production deployment
MongoDB Indexes
Required indexes for optimal API performance:
db.nh_unified.createIndex({ 'media.images': 1 });
db.nh_unified.createIndex({ 'media.view360.status': 1 });
db.nh_unified.createIndex({ 'media.view360.frameCount': 1 });
db.nh_unified.createIndex({ 'qualityScore.total': -1 });
db.nh_unified.createIndex({ 'qualityScore.total': -1, 'media.imagesCount': 1 });
The coverage endpoint uses a $facet aggregation (single-pass over all documents). The gaps endpoint relies on the compound quality + imagesCount index for filtering.
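To illustrate the shape of a `$facet` aggregation, here is a hypothetical pipeline expressed as a plain object, plus a helper for reading its result. The facet names and match conditions are assumptions (in particular, treating `status: "gcp"` as "has a 360-degree view"); the endpoint's real pipeline is not shown on this page.

```typescript
// Each facet is an independent sub-pipeline evaluated over the same input.
const coveragePipeline = [
  {
    $facet: {
      total: [{ $count: "count" }],
      withImages: [{ $match: { "media.imagesCount": { $gt: 0 } } }, { $count: "count" }],
      with360: [{ $match: { "media.view360.status": "gcp" } }, { $count: "count" }],
    },
  },
];

// $facet yields one document whose fields are arrays. $count emits nothing
// when no documents reach it, so an empty or missing array means zero.
type FacetResult = Record<string, { count: number }[]>;

function readCount(result: FacetResult, facet: string): number {
  return result[facet]?.[0]?.count ?? 0;
}
```

Treating an empty facet array as zero matters here: it is what lets a field with no data yet, such as documents, report zero counts instead of failing.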
Caching
| Endpoint | TTL | Cache Key Pattern |
|---|---|---|
| /coverage | 10 min | media:coverage:{env}:{ts_10min} |
| /distribution | 15 min | media:distribution:{env}:{groupBy}:{ts_15min} |
| /gaps | 5 min | media:gaps:{env}:{minQuality}:{mediaType}:{offset}:{ts_5min} |
Invalidation: automatic via TTL, manual after bulk sync (bun scripts/sync-mongodb-to-es.ts), or via webhook on GCS manifest updates. A stale-while-revalidate pattern serves cached data while refreshing in the background.
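One plausible reading of the `{ts_10min}` style placeholders is a time bucket: the current time divided by the TTL and rounded down, so all requests inside the same window share a key. The sketch below implements that interpretation; it is an assumption, as are the function names.

```typescript
// Index of the TTL-sized window that nowMs falls into.
function timeBucket(nowMs: number, ttlMinutes: number): number {
  return Math.floor(nowMs / (ttlMinutes * 60_000));
}

function coverageKey(env: string, nowMs: number): string {
  return `media:coverage:${env}:${timeBucket(nowMs, 10)}`;
}

function distributionKey(env: string, groupBy: string, nowMs: number): string {
  return `media:distribution:${env}:${groupBy}:${timeBucket(nowMs, 15)}`;
}

function gapsKey(
  env: string,
  minQuality: number,
  mediaType: string,
  offset: number,
  nowMs: number
): string {
  return `media:gaps:${env}:${minQuality}:${mediaType}:${offset}:${timeBucket(nowMs, 5)}`;
}
```

With bucketed keys, entries expire implicitly when the bucket index changes. The documented gaps pattern does not include limit, sortBy, or sortOrder, so under this reading requests that differ only in those parameters would map to the same key.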
Troubleshooting
Coverage returns 0 parts -- verify MONGODB_COLLECTION is nh_unified and check connectivity with curl http://localhost:3005/health.
Gaps returns empty array -- lower minQuality threshold; check quality score distribution with a $bucket aggregation.
Image type counts do not match gallery count -- images.gallery.count counts parts with at least one gallery image, whereas imageTypes.front.count counts parts with a FRONT-type image specifically. Per-type counts need not sum to the gallery count, because one part can have several image types as well as multiple images of the same type (e.g., standard + clear-background variants).
360-degree frame counts seem wrong -- compare media.view360.frameCount field against actual media.view360.frames array length; re-scan manifests if mismatched.
API response >1s -- check for missing MongoDB indexes (look for COLLSCAN in .explain("executionStats")).
Debug mode:
export LOG_LEVEL=debug
bun run dev