Media Service

The Media Service manages product images, 360-degree views, PDF documents, image metadata embedding, and third-party data enrichment. It provides analytics APIs for media coverage, a pipeline for SEO metadata embedding into images, and integrations with Amazon for product data enrichment.

Data Pipeline

Scrapers / Vendors / Amazon
        ↓
   MongoDB Atlas (nh_unified / crop_stage.parts / crop_prod.parts)
        ↓
   Transformers (normalize, generate SKU/slug, process media)
        ↓
   Elasticsearch (parts_current alias)
        ↓
   Search API → Frontend (SEO output)

SEO Field Mapping

Data flows from MongoDB through Elasticsearch to the frontend, where it is rendered as SEO HTML elements and structured data.

IndexedPart Field	SEO Output	HTML Element
`title`	metaTitle	`<title>`
`description`	metaDescription	`<meta name="description">`
`partNumber`	schemaData.mpn	JSON-LD
`sku`	schemaData.sku	JSON-LD
`manufacturer.name`	schemaData.brand	JSON-LD
`media.primaryImage`	ogImage	`<meta property="og:image">`
`media.images`	schemaData.image	JSON-LD
`price.list.value`	schemaData.offers.price	JSON-LD
`inventory.inStock`	schemaData.offers.availability	JSON-LD
`slug`	canonicalUrl	`<link rel="canonical">`
`categoryPath`	BreadcrumbList	JSON-LD
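
The mapping above can be sketched as a single function. This is an illustration only: the `IndexedPart` shape models just the fields named in the table, and `toSeoOutput`, the `baseUrl` parameter, the `/parts/` route, and the schema.org availability URLs are assumptions rather than the service's actual code.

```typescript
// Hypothetical shape: only the fields named in the mapping table are modeled.
interface IndexedPart {
  title: string;
  description: string;
  partNumber: string;
  sku: string;
  manufacturer: { name: string };
  media: { primaryImage: string; images: string[] };
  price: { list: { value: number } };
  inventory: { inStock: boolean };
  slug: string;
  categoryPath: string[];
}

// baseUrl and the /parts/ route are assumptions for illustration.
function toSeoOutput(part: IndexedPart, baseUrl: string) {
  return {
    metaTitle: part.title,
    metaDescription: part.description,
    ogImage: part.media.primaryImage,
    canonicalUrl: `${baseUrl}/parts/${part.slug}`,
    schemaData: {
      "@context": "https://schema.org",
      "@type": "Product",
      mpn: part.partNumber,
      sku: part.sku,
      brand: { "@type": "Brand", name: part.manufacturer.name },
      image: part.media.images,
      offers: {
        "@type": "Offer",
        price: part.price.list.value,
        availability: part.inventory.inStock
          ? "https://schema.org/InStock"
          : "https://schema.org/OutOfStock",
      },
    },
    // One ListItem per category level, positions starting at 1.
    breadcrumbList: {
      "@context": "https://schema.org",
      "@type": "BreadcrumbList",
      itemListElement: part.categoryPath.map((name, i) => ({
        "@type": "ListItem",
        position: i + 1,
        name,
      })),
    },
  };
}
```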

Field Origin Traces

title:        Scraper → MongoDB.title → ES.title → API.title → SEO.metaTitle
partNumber:   Scraper → MongoDB.partNumber → Transformer(normalize) → ES.partNumber + ES.pnNorm → API → SEO.schemaData.mpn
sku:          Transformer(generate) → MongoDB.sku → ES.sku → API.sku → SEO.schemaData.sku
              Format: CT-{manufacturer.code}-{partNumber}
primaryImage: GCS Upload → MongoDB.media.images[0].gcpUrl → ES.media.primaryImage → API → SEO.ogImage
price:        Scraper/ERP → MongoDB.price.list → ES.price.list → API → SEO.schemaData.offers
categoryPath: Scraper → Transformer(normalize) → MongoDB.categoryPath → ES.categoryPath → API → SEO.BreadcrumbList

All field names use camelCase across every layer: MongoDB, Elasticsearch, API, and SEO output.

Media Coverage API

The Media Coverage API provides analytics on media richness across the parts catalog. It tracks gallery images, 360-degree views, and PDF documents.

Base URL (dev): http://localhost:3005/api/health/media
Production URL: https://health-analytics-service-[hash].run.app/api/health/media

GET /api/health/media/coverage

Returns a comprehensive media coverage summary.

Query Parameters:

Parameter	Type	Default	Description
`environment`	string	`prod`	`prod`, `dev`, or `stage`

Response Schema:

{
  success: boolean;
  data: {
    summary: {
      totalParts: number;
      withAnyMedia: number;
      withoutMedia: number;
      coveragePercentage: number;
    };
    images: {
      coverage: { count: number; percentage: number };
      gallery: { count: number; percentage: number };
      view360: {
        count: number;
        percentage: number;
        withFrames: number;
        avgFrameCount: number;
      };
    };
    documents: {
      coverage: { count: number; percentage: number };
      byType: {
        manuals: number;
        datasheets: number;
        certifications: number;
      };
    };
    qualityCorrelation: {
      withMediaAvgQuality: number;
      withoutMediaAvgQuality: number;
      delta: number;
    };
  };
  meta: {
    timestamp: string;
    environment: string;
    database: string;
    collection: string;
  };
}

Response time: ~150-300ms | Cache TTL: 10 minutes

GET /api/health/media/distribution

Returns detailed distribution of image types and 360-degree view characteristics.

Query Parameters:

Parameter	Type	Default	Description
`environment`	string	`prod`	`prod`, `dev`, or `stage`
`groupBy`	string	`type`	`type`, `quality`, or `frames`

Response includes:

imageTypes -- counts and percentages per type (marketing, front, back, left, right, angle1, angle2)
view360Distribution -- breakdowns by frame count (24/36/48), quality (high/standard), and grid layout (4x6, 6x6, 8x6)

Response time: ~200-400ms | Cache TTL: 15 minutes

GET /api/health/media/gaps

Identifies high-quality parts missing media (enrichment opportunities).

Query Parameters:

Parameter	Type	Default	Description
`minQuality`	number	`70`	Minimum quality score (0-100)
`mediaType`	string	`all`	`all`, `gallery`, `view360`, or `documents`
`limit`	number	`50`	Max results (1-500)
`offset`	number	`0`	Pagination offset
`sortBy`	string	`quality`	`quality`, `partNumber`, or `sku`
`sortOrder`	string	`desc`	`asc` or `desc`
`environment`	string	`prod`	`prod`, `dev`, or `stage`
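
A minimal sketch of how the defaults and ranges in this table could be enforced before querying, returning the 400 case from the error table on bad input. `parseGapsQuery` and its result shape are hypothetical; treating `limit` and `offset` as integers is an assumption the table does not state.

```typescript
interface GapsQuery {
  minQuality: number;
  mediaType: string;
  limit: number;
  offset: number;
  sortBy: string;
  sortOrder: string;
  environment: string;
}

type ParseResult =
  | { ok: true; query: GapsQuery }
  | { ok: false; status: 400; errors: string[] };

// Allowed values copied from the parameter table.
const ALLOWED: Record<string, readonly string[]> = {
  mediaType: ["all", "gallery", "view360", "documents"],
  sortBy: ["quality", "partNumber", "sku"],
  sortOrder: ["asc", "desc"],
  environment: ["prod", "dev", "stage"],
};

function parseGapsQuery(raw: Record<string, string | undefined>): ParseResult {
  const errors: string[] = [];

  // Numeric parameter with a default and an inclusive range.
  const num = (key: string, fallback: number, min: number, max: number, integer: boolean): number => {
    const text = raw[key];
    const value = text === undefined ? fallback : Number(text);
    const valid = integer ? Number.isInteger(value) : Number.isFinite(value);
    if (!valid || value < min || value > max) {
      errors.push(`${key} must be between ${min} and ${max}`);
    }
    return value;
  };

  // String parameter restricted to the documented values.
  const pick = (key: string, fallback: string): string => {
    const value = raw[key] ?? fallback;
    if (!(ALLOWED[key] ?? []).includes(value)) {
      errors.push(`${key} must be one of: ${(ALLOWED[key] ?? []).join(", ")}`);
    }
    return value;
  };

  const query: GapsQuery = {
    minQuality: num("minQuality", 70, 0, 100, false),
    mediaType: pick("mediaType", "all"),
    limit: num("limit", 50, 1, 500, true),
    offset: num("offset", 0, 0, Number.MAX_SAFE_INTEGER, true),
    sortBy: pick("sortBy", "quality"),
    sortOrder: pick("sortOrder", "desc"),
    environment: pick("environment", "prod"),
  };

  return errors.length > 0 ? { ok: false, status: 400, errors } : { ok: true, query };
}
```

Collecting all errors before returning lets a caller fix several bad parameters in one round trip.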

Each gap item includes partNumber, sku, title, qualityScore, missingMedia flags, existingMedia summary, enrichmentPriority (high/medium/low), and estimatedImpact.

Response time: ~300-600ms | Cache TTL: 5 minutes

Error Codes

Code	Description
400	Invalid parameters (environment, minQuality, limit, groupBy)
404	Collection not found
500	MongoDB query error
503	Service unavailable

Rate Limiting (Production)

Coverage and distribution endpoints: 100 req/min per IP
Gaps endpoint: 60 req/min per IP
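
The document does not say which limiting algorithm production uses. As one possible reading, a fixed-window counter per IP could look like the sketch below; `createLimiter` is a hypothetical helper, and the in-memory map would need eviction (or a shared store) in a real multi-instance deployment.

```typescript
// Fixed-window limiter: at most `limit` requests per IP per window.
function createLimiter(limit: number, windowMs: number = 60_000) {
  // Per-IP window state; never evicted in this sketch.
  const windows = new Map<string, { start: number; count: number }>();

  return function allow(ip: string, nowMs: number = Date.now()): boolean {
    const current = windows.get(ip);
    // Start a fresh window on first sight or once the old window has elapsed.
    if (current === undefined || nowMs - current.start >= windowMs) {
      windows.set(ip, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= limit) return false;
    current.count += 1;
    return true;
  };
}

// Limits taken from the figures above.
const allowCoverageOrDistribution = createLimiter(100);
const allowGaps = createLimiter(60);
```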

Media Data Model

Gallery Images

Static product photos in various angles, stored in GCS.

Bucket: gs://crop_parts/newholland/images/
Path: {partNumber}/{partNumber}_{TYPE}.jpg
Clear background variant: {partNumber}/{partNumber}_CB_{TYPE}.jpg

Image Types and Priority Order:

Type	Description	Priority
FRONT	Front-facing product photo	1 (primary product identification)
BACK	Rear view	2 (installation reference)
MARKETING	Lifestyle/promotional	3 (visual appeal)
LEFT / RIGHT	Side views	4 (detailed inspection)
ANGLE1 / ANGLE2	Perspective views (R01_C24, R02_C24)	5 (additional context)
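
The path convention and priority order can be expressed as two small helpers. This is a sketch: `galleryImagePath` and `sortByPriority` are hypothetical names, and giving LEFT/RIGHT and ANGLE1/ANGLE2 the same rank within each pair follows the table's grouping.

```typescript
type ImageType = "FRONT" | "BACK" | "MARKETING" | "LEFT" | "RIGHT" | "ANGLE1" | "ANGLE2";

// Priorities from the table; lower number means shown first.
const PRIORITY: Record<ImageType, number> = {
  FRONT: 1,
  BACK: 2,
  MARKETING: 3,
  LEFT: 4,
  RIGHT: 4,
  ANGLE1: 5,
  ANGLE2: 5,
};

const BUCKET_PREFIX = "gs://crop_parts/newholland/images/";

// Builds {partNumber}/{partNumber}_{TYPE}.jpg, or the _CB_ clear-background variant.
function galleryImagePath(partNumber: string, type: ImageType, clearBackground: boolean = false): string {
  const name = clearBackground ? `${partNumber}_CB_${type}` : `${partNumber}_${type}`;
  return `${BUCKET_PREFIX}${partNumber}/${name}.jpg`;
}

// Returns a new array ordered by priority; equal priorities keep their input order.
function sortByPriority(types: ImageType[]): ImageType[] {
  return [...types].sort((a, b) => PRIORITY[a] - PRIORITY[b]);
}
```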

360-Degree Views

Interactive spin views with multiple frames.

Frames	Quality	Grid	Description
24	Standard	4x6	Basic spin view
36	High	6x6	Smooth rotation
48+	Premium	8x6	Maximum detail

Status levels: gcp (GCS-hosted, best), external (third-party), url_only (needs migration), not_available, none

Document structure:

{
  "view360": {
    "status": "gcp",
    "frameCount": 24,
    "rows": 4,
    "columns": 6,
    "frames": [
      { "url": "https://storage.googleapis.com/.../frame_001.jpg", "row": 0, "col": 0 }
    ]
  }
}
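
A consistency check for this structure might look like the following sketch. It assumes that `rows × columns` should equal `frameCount` (true for the 24/36/48 rows in the table) and that `row`/`col` are zero-based, as the example frame suggests; `checkView360` is a hypothetical helper.

```typescript
interface View360 {
  status: string;
  frameCount: number;
  rows: number;
  columns: number;
  frames: { url: string; row: number; col: number }[];
}

// Returns a list of human-readable problems; an empty list means consistent.
function checkView360(view: View360): string[] {
  const problems: string[] = [];
  if (view.frameCount !== view.frames.length) {
    problems.push(`frameCount ${view.frameCount} != frames.length ${view.frames.length}`);
  }
  if (view.rows * view.columns !== view.frameCount) {
    problems.push(`rows*columns ${view.rows * view.columns} != frameCount ${view.frameCount}`);
  }
  for (const frame of view.frames) {
    const inGrid =
      frame.row >= 0 && frame.row < view.rows && frame.col >= 0 && frame.col < view.columns;
    if (!inGrid) problems.push(`frame outside grid: row ${frame.row}, col ${frame.col}`);
  }
  return problems;
}
```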

PDF Documents

Types: manuals (installation, service, parts, operator), datasheets (specifications, dimensions, compatibility), certifications (safety, quality, environmental).

Current state: Schema is ready. The documents field in nh_unified is currently empty; the API returns zero counts and will automatically reflect non-zero values once data is populated -- no code changes required.

MongoDB schema:

{
  "documents": {
    "manuals": [{
      "type": "installation",
      "title": "Installation Guide",
      "url": "https://storage.googleapis.com/crop_docs/newholland/manuals/{pn}_install.pdf",
      "language": "en",
      "pageCount": 12,
      "fileSize": 2457600,
      "uploadedAt": "2025-11-17T10:00:00Z",
      "metadata": { "version": "1.0", "author": "...", "tags": [] }
    }],
    "datasheets": [{ "type": "specifications", "..." : "..." }],
    "certifications": [{
      "type": "safety",
      "issuer": "TUV Rheinland",
      "validUntil": "2026-12-31T23:59:59Z",
      "certificationNumber": "CE-2024-NHL-87840296",
      "..."  : "..."
    }]
  }
}
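
To illustrate why empty data yields zero counts without special-casing, here is a sketch of the documents portion of the coverage summary computed in memory. `countDocuments` is hypothetical, and it assumes `byType` counts individual documents while `coverage.count` counts parts with at least one document; the real aggregation may define these differently.

```typescript
interface PartDocuments {
  manuals?: unknown[];
  datasheets?: unknown[];
  certifications?: unknown[];
}

function countDocuments(parts: { documents?: PartDocuments }[]) {
  const byType = { manuals: 0, datasheets: 0, certifications: 0 };
  let partsWithAny = 0;

  for (const part of parts) {
    // A missing field and an empty array both count as zero.
    const manuals = part.documents?.manuals?.length ?? 0;
    const datasheets = part.documents?.datasheets?.length ?? 0;
    const certifications = part.documents?.certifications?.length ?? 0;

    byType.manuals += manuals;
    byType.datasheets += datasheets;
    byType.certifications += certifications;
    if (manuals + datasheets + certifications > 0) partsWithAny += 1;
  }

  const percentage = parts.length === 0 ? 0 : (partsWithAny / parts.length) * 100;
  return { coverage: { count: partsWithAny, percentage }, byType };
}
```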

PDF Migration Plan

Migration is designed with zero downtime and no breaking changes. Frontend code handles both zero and non-zero states.

Steps

1. Source PDFs -- scan the GCS bucket (bun scripts/scan-gcs-documents.ts), scrape from the vendor, or upload manually via gsutil.
2. Extract metadata -- parse each file with pdf-parse to get page count, author, title, and keywords (bun scripts/parse-pdf-metadata.ts).
3. Enrich MongoDB -- bulk-update nh_unified with $push into the documents.* arrays (run bun scripts/enrich-pdf-data.ts --dry-run first, then again without --dry-run).
4. Verify -- the API returns non-zero counts automatically; check /coverage, /distribution, and /gaps.
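
The enrichment step could build its bulk operations along these lines. This is a sketch, not the contents of enrich-pdf-data.ts: `buildBulkOps` and `PdfRecord` are hypothetical, and matching on `partNumber` is an assumed filter. The returned array has the shape the MongoDB driver's `bulkWrite` accepts; a dry run would log it instead of writing.

```typescript
type DocKind = "manuals" | "datasheets" | "certifications";

interface PdfRecord {
  partNumber: string;
  kind: DocKind;
  doc: { type: string; title: string; url: string };
}

// One updateOne per PDF, appending to the matching documents.* array.
function buildBulkOps(records: PdfRecord[]) {
  return records.map((record) => ({
    updateOne: {
      filter: { partNumber: record.partNumber }, // assumed match key
      update: { $push: { [`documents.${record.kind}`]: record.doc } },
    },
  }));
}

// Dry-run summary: how many pushes each target array would receive.
function summarizeOps(records: PdfRecord[]): Record<DocKind, number> {
  const summary: Record<DocKind, number> = { manuals: 0, datasheets: 0, certifications: 0 };
  for (const record of records) summary[record.kind] += 1;
  return summary;
}
```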

Rollback

# Restore from backup
mongorestore --uri="$MONGODB_URI" --nsInclude="crop.nh_unified" --drop backup-YYYYMMDD/crop/nh_unified.bson

Rollback time: ~15 minutes.

Image Metadata Embedding (XMP)

Product metadata is embedded directly into images using the XMP standard, so it travels with the file when images are shared or indexed by search engines (unless a downstream tool strips metadata).

Storage Architecture

Layer	Location	Purpose
JSON sidecar	GCS, next to image	Source of truth
XMP tags	Inside the image file	Inseparable from image
MongoDB	`media.images[].embeddedMetadata`	Fast queries
Elasticsearch	`parts_current` index	Full-text search

GCS Folder Structure

gs://crop_parts/
├── ct/gallery/nhl/{partNumber}/
│   ├── {pn}-1.jpg              # Image with XMP
│   └── {pn}-1.meta.json        # JSON sidecar (source of truth)
├── ct/360/nhl/{partNumber}/
│   └── frame-001.jpg ...
├── vendor_scraped/
├── vendor_direct/
└── manual/

Metadata Schema

Each image has a JSON sidecar with these sections:

company -- name, type ("Authorized Reseller"), website, contact
legal -- copyright, license, termsUrl
product -- sku (CT-{VENDOR}-{PARTNUMBER}), partNumber, pnNorm, title, manufacturer, categoryName/categoryPath, equipmentFitment, status
image -- type, sortOrder, source, originalUrl, alt, contentHash
catalog -- product page URL, slug
embedding -- embedded flag, timestamp, version, processor ID

XMP Tag Mapping

JSON Field	XMP Tag
`company.name`	`dc:creator`
`legal.copyright`	`dc:rights`
`product.title`	`dc:title`
`product.sku`	`crop:SKU`
`product.partNumber`	`crop:PartNumber`
`product.manufacturer.name`	`crop:Manufacturer`
`product.manufacturer.code`	`crop:ManufacturerCode`
`product.categoryPath[0]`	`crop:Category`
`legal.license`	`xmpRights:UsageTerms`
`catalog.url`	`crop:CatalogURL`
`embedding.version`	`crop:Version`
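
A sketch of the mapping as a pure function from a sidecar to a flat tag record. `toXmpTags` and the trimmed `Sidecar` type are hypothetical; writing the resulting tags with exiftool would likely also require registering the custom `crop` namespace, which is not shown here.

```typescript
// Only the sidecar fields that appear in the mapping table are modeled.
interface Sidecar {
  company: { name: string };
  legal: { copyright: string; license: string };
  product: {
    title: string;
    sku: string;
    partNumber: string;
    manufacturer: { name: string; code: string };
    categoryPath?: string[];
  };
  catalog: { url: string };
  embedding: { version: string };
}

function toXmpTags(meta: Sidecar): Record<string, string> {
  const tags: Record<string, string> = {
    "dc:creator": meta.company.name,
    "dc:rights": meta.legal.copyright,
    "dc:title": meta.product.title,
    "crop:SKU": meta.product.sku,
    "crop:PartNumber": meta.product.partNumber,
    "crop:Manufacturer": meta.product.manufacturer.name,
    "crop:ManufacturerCode": meta.product.manufacturer.code,
    "xmpRights:UsageTerms": meta.legal.license,
    "crop:CatalogURL": meta.catalog.url,
    "crop:Version": meta.embedding.version,
  };
  // categoryPath is optional; only the top-level category is embedded.
  const topCategory = meta.product.categoryPath?.[0];
  if (topCategory !== undefined) tags["crop:Category"] = topCategory;
  return tags;
}
```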

TypeScript Types

export interface ImageMetadata {
  schemaVersion: string;
  createdAt: string;
  updatedAt: string;
  company: CompanyInfo;
  legal: LegalInfo;
  product: ProductInfo;
  image: ImageInfo;
  catalog: CatalogInfo;
  embedding: EmbeddingInfo;
}

export interface ProductInfo {
  sku: string;                     // "CT-NHL-00907566"
  partNumber: string;
  pnNorm?: string;
  title: string;
  description?: string;
  manufacturer: { name: string; code: string };
  categoryName?: string[];
  categoryPath?: string[];
  equipmentFitment?: string[];
  status?: 'active' | 'discontinued' | 'superseded';
}

A Zod validation schema (ImageMetadataSchema) enforces structure at runtime, including regex validation on the SKU format (/^CT-[A-Z]{2,3}-\w+$/).
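
The SKU rule can be exercised on its own with plain TypeScript (no Zod), which makes its edge cases easy to see. `buildSku` and `isValidSku` are hypothetical helpers; only the regex and the CT-{code}-{partNumber} format come from this document.

```typescript
// Same pattern the Zod schema is described as using.
const SKU_PATTERN = /^CT-[A-Z]{2,3}-\w+$/;

function buildSku(manufacturerCode: string, partNumber: string): string {
  return `CT-${manufacturerCode}-${partNumber}`;
}

function isValidSku(sku: string): boolean {
  return SKU_PATTERN.test(sku);
}

const sample = buildSku("NHL", "00907566");
const sampleIsValid = isValidSku(sample);
// Lowercase manufacturer codes are rejected by [A-Z]{2,3}.
const lowercaseIsValid = isValidSku("CT-nhl-00907566");
// \w does not match "-", so part numbers containing hyphens are rejected.
const hyphenatedIsValid = isValidSku(buildSku("NHL", "84-123"));
```

Note the last case: if any part numbers contain hyphens, they would need normalizing before SKU generation or the pattern would need widening.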

Hybrid Processing System

Metadata embedding uses a hybrid local/cloud architecture:

Orchestrator (Cloud Run)
  ├── Task Queue → Router → Local Worker (primary, ~free)
  │                      └→ Cloud Worker (fallback, $0.12/GB)
  └── Monitor & Alerts

Local worker pulls tasks, downloads images in bulk, embeds XMP with exiftool-vendored, uploads results
Cloud worker (Cloud Run) auto-scales 0-1000, used as fallback when local is unhealthy or for urgent tasks
Routing logic checks local worker health (heartbeat timeout 2min, CPU >95%, memory >90%, disk <10GB) and falls back to cloud automatically
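
The routing rule reads naturally as a pure function over a health snapshot. The sketch below uses the thresholds listed above; `LocalHealth`, `isLocalHealthy`, and `routeTask` are hypothetical names, and how the real router collects these metrics is not covered here.

```typescript
interface LocalHealth {
  lastHeartbeatMs: number; // epoch ms of the last heartbeat
  cpuPercent: number;
  memoryPercent: number;
  freeDiskGb: number;
}

const HEARTBEAT_TIMEOUT_MS = 2 * 60_000;

// Unhealthy if the heartbeat is stale, CPU >95%, memory >90%, or disk <10GB.
function isLocalHealthy(health: LocalHealth, nowMs: number): boolean {
  return (
    nowMs - health.lastHeartbeatMs <= HEARTBEAT_TIMEOUT_MS &&
    health.cpuPercent <= 95 &&
    health.memoryPercent <= 90 &&
    health.freeDiskGb >= 10
  );
}

// Urgent tasks go straight to cloud; everything else prefers the local worker.
function routeTask(task: { urgent: boolean }, health: LocalHealth, nowMs: number): "local" | "cloud" {
  if (task.urgent) return "cloud";
  return isLocalHealthy(health, nowMs) ? "local" : "cloud";
}
```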

Cost at scale (5M images): ~$50-100 hybrid vs ~$590 cloud-only.

Security Rules

Metadata includes: company name, SKU, part number, manufacturer, copyright, catalog URL.
Metadata excludes: prices, cost data, inventory levels, internal IDs, customer data, API keys.

Amazon Data Enrichment

The Amazon Product Advertising API (PA API) is used to enrich parts with descriptions, dimensions, images, and specifications.

Architecture

MongoDB (source) → Enrichment Service → Amazon PA API
                                      → Oxylabs (optional)
                                      → Rainforest API (optional)
                         ↓
                   ES Index (search)

Matching Strategy

Stage	Method	Confidence
1	UPC match	99%
2	Part number exact search	85%
3	Manufacturer + part number	75%
4	Title fuzzy search	70%
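
The staged strategy amounts to trying each method in order and stopping at the first hit. In the sketch below the lookups are injected as plain synchronous functions so the control flow is visible; in practice they would be asynchronous API calls. `Match`, `Search`, `defaultStages`, and `matchPart` are all hypothetical names.

```typescript
interface Part { partNumber: string; manufacturer: string; title: string; upc?: string }
interface Match { asin: string }
type Search = (query: string) => Match | undefined;

interface Stage {
  method: string;
  confidence: number;
  lookup: (part: Part) => Match | undefined;
}

// Stages and confidence values follow the table, in order.
function defaultStages(search: { byUpc: Search; byKeyword: Search; byTitle: Search }): Stage[] {
  return [
    { method: "upc", confidence: 0.99, lookup: (p) => (p.upc ? search.byUpc(p.upc) : undefined) },
    { method: "partNumber", confidence: 0.85, lookup: (p) => search.byKeyword(p.partNumber) },
    {
      method: "manufacturer+partNumber",
      confidence: 0.75,
      lookup: (p) => search.byKeyword(`${p.manufacturer} ${p.partNumber}`),
    },
    { method: "titleFuzzy", confidence: 0.7, lookup: (p) => search.byTitle(p.title) },
  ];
}

// Returns the first match along with the stage that produced it.
function matchPart(part: Part, stages: Stage[]) {
  for (const stage of stages) {
    const match = stage.lookup(part);
    if (match !== undefined) {
      return { asin: match.asin, method: stage.method, confidence: stage.confidence };
    }
  }
  return undefined;
}
```

Recording the method and confidence alongside each match makes it possible to review or re-run low-confidence matches later.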

Fields Enriched

Priority 1 (Critical):

Field	API Path	Usage
Features / Bullets	`ItemInfo.Features.DisplayValues`	Enhanced descriptions, SEO keywords, selling points
Description	`ItemInfo.ProductInfo.ItemDescription`	Extended product page text
Dimensions	`ItemInfo.ProductInfo.ItemDimensions`	Shipping calculation, size filters
Weight	`ItemInfo.ProductInfo.ItemDimensions.Weight`	Shipping cost, product grouping
UPC / GTIN / EAN	`ItemInfo.ExternalIds.UPCs`, `ItemInfo.ExternalIds.EANs`	Cross-referencing, POS integration, deduplication
Images (high-res)	`Images.Primary`, `Images.Variants`	Up to 1500px product photos

Priority 2 (High-value):

Field	API Path	Usage
Categories	`BrowseNodeInfo.BrowseNodes`	Auto-categorization, SEO breadcrumbs
Technical specs	`ItemInfo.TechnicalInfo`	Spec sheets, comparison, filters
Brand info	`ItemInfo.ByLineInfo.Brand`	Manufacturer verification, brand pages

Priority 3 (Nice to have): Customer reviews/ratings, Q&A count, warranty info, package dimensions, related products.

All fields come in a single API call -- no additional cost per field.

Cost Estimate

Category	Cost
One-time enrichment (~56k parts)	~$96
Monthly maintenance	~$25
Re-enrichment (quarterly)	~$38
Total Year 1	~$237

Coverage Expectations (56k parts)

Data Field	Estimated Coverage
Features/Bullets	65%
Dimensions	70%
Weight	75%
UPC	80%
High-res images	65%
Categories	70%
Specifications	60%
Reviews	50%

Overall enrichment rate: ~70%.

Environment Variables

# Amazon Product Advertising API
AMAZON_PA_ACCESS_KEY=your_access_key
AMAZON_PA_SECRET_KEY=your_secret_key
AMAZON_PA_PARTNER_TAG=your_partner_tag

# Optional alternatives
OXYLABS_USERNAME=your_username
OXYLABS_PASSWORD=your_password
RAINFOREST_API_KEY=your_api_key

Implementation Phases

Weeks 1-2: Foundation and API setup
Week 3: Pilot enrichment (100 parts)
Weeks 4-6: Bulk enrichment (56k parts)
Week 7: Data merge and ES sync
Week 8: Production deployment

MongoDB Indexes

Required indexes for optimal API performance:

db.nh_unified.createIndex({ 'media.images': 1 });
db.nh_unified.createIndex({ 'media.view360.status': 1 });
db.nh_unified.createIndex({ 'media.view360.frameCount': 1 });
db.nh_unified.createIndex({ 'qualityScore.total': -1 });
db.nh_unified.createIndex({ 'qualityScore.total': -1, 'media.imagesCount': 1 });

The coverage endpoint uses a $facet aggregation (single-pass over all documents). The gaps endpoint relies on the compound quality + imagesCount index for filtering.
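
A trimmed-down sketch of what a single-pass `$facet` pipeline for coverage might look like. The facet names and match conditions are assumptions (the real pipeline is not shown in this document); the field names come from the indexes above. `$count` emits no document when nothing matches, so the reader helper defaults to zero.

```typescript
// Each facet runs over the same input documents in one pass.
const coveragePipeline = [
  {
    $facet: {
      total: [{ $count: "n" }],
      withGallery: [{ $match: { "media.imagesCount": { $gt: 0 } } }, { $count: "n" }],
      with360: [{ $match: { "media.view360.status": "gcp" } }, { $count: "n" }],
    },
  },
];

// A facet with zero matches comes back as an empty array, not { n: 0 }.
function readFacetCount(facet: { n: number }[] | undefined): number {
  return facet?.[0]?.n ?? 0;
}

// Turns one $facet result document into counts and a percentage.
function summarize(result: Record<string, { n: number }[] | undefined>) {
  const total = readFacetCount(result["total"]);
  const withGallery = readFacetCount(result["withGallery"]);
  return {
    total,
    withGallery,
    galleryPercentage: total === 0 ? 0 : (withGallery / total) * 100,
  };
}
```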

Caching

Endpoint	TTL	Cache Key Pattern
`/coverage`	10 min	`media:coverage:{env}:{ts_10min}`
`/distribution`	15 min	`media:distribution:{env}:{groupBy}:{ts_15min}`
`/gaps`	5 min	`media:gaps:{env}:{minQuality}:{mediaType}:{offset}:{ts_5min}`

Invalidation: automatic via TTL, manual after bulk sync (bun scripts/sync-mongodb-to-es.ts), or via webhook on GCS manifest updates. A stale-while-revalidate pattern serves cached data while refreshing in the background.
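
One way to read the `{ts_Nmin}` placeholder is as the current time floored to the TTL window, so every request in the same window shares a key. The sketch below builds keys on that assumption; the function names are hypothetical.

```typescript
type Env = "prod" | "dev" | "stage";

// Index of the TTL-sized window that nowMs falls into.
function timeBucket(nowMs: number, ttlMinutes: number): number {
  return Math.floor(nowMs / (ttlMinutes * 60_000));
}

function coverageKey(env: Env, nowMs: number): string {
  return `media:coverage:${env}:${timeBucket(nowMs, 10)}`;
}

function distributionKey(env: Env, groupBy: string, nowMs: number): string {
  return `media:distribution:${env}:${groupBy}:${timeBucket(nowMs, 15)}`;
}

function gapsKey(env: Env, minQuality: number, mediaType: string, offset: number, nowMs: number): string {
  return `media:gaps:${env}:${minQuality}:${mediaType}:${offset}:${timeBucket(nowMs, 5)}`;
}
```

As documented, the gaps key pattern does not include `limit`, `sortBy`, or `sortOrder`, so requests that differ only in those parameters would share a cache entry under this scheme.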

Troubleshooting

Coverage returns 0 parts -- verify MONGODB_COLLECTION is nh_unified and check connectivity with curl http://localhost:3005/health.

Gaps returns empty array -- lower minQuality threshold; check quality score distribution with a $bucket aggregation.
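
A `$bucket` pipeline for that check could be generated as below. The `qualityScore.total` field comes from the indexes section; `qualityBucketPipeline` is a hypothetical helper. Bucket upper bounds are exclusive, so 101 is appended to keep a score of exactly 100 inside the last bucket.

```typescript
function qualityBucketPipeline(step: number = 10) {
  // Boundaries 0, step, 2*step, ... up to 100, then 101 to include score 100.
  const boundaries: number[] = [];
  for (let b = 0; b <= 100; b += step) boundaries.push(b);
  boundaries.push(101);

  return [
    {
      $bucket: {
        groupBy: "$qualityScore.total",
        boundaries,
        default: "missing", // parts with no score or an out-of-range score
        output: { count: { $sum: 1 } },
      },
    },
  ];
}
```

Passing the result to `db.nh_unified.aggregate(...)` shows how many parts fall at or above the `minQuality` being requested.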

Image type counts do not match gallery count -- the two fields measure different things: images.gallery.count is the number of parts with at least one gallery image, while imageTypes.front.count is the number of parts with a FRONT-type image specifically. A part can also have multiple images of the same type (e.g., standard + clear-background variants).

360-degree frame counts seem wrong -- compare media.view360.frameCount field against actual media.view360.frames array length; re-scan manifests if mismatched.

API response >1s -- check for missing MongoDB indexes (look for COLLSCAN in .explain("executionStats")).
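
Checking an explain plan for a collection scan can be automated with a small recursive walk. This sketch assumes the classic plan shape, where each node has a `stage` and optional `inputStage`/`inputStages` children; newer server versions may nest the plan differently, so treat `hasCollScan` as a starting point.

```typescript
interface PlanStage {
  stage?: string;
  inputStage?: PlanStage;
  inputStages?: PlanStage[];
}

// True if any node in the plan tree is a full collection scan.
function hasCollScan(plan: PlanStage | undefined): boolean {
  if (plan === undefined) return false;
  if (plan.stage === "COLLSCAN") return true;
  if (hasCollScan(plan.inputStage)) return true;
  return (plan.inputStages ?? []).some((child) => hasCollScan(child));
}
```

Typical usage would pass in `queryPlanner.winningPlan` from the explain output.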

Debug mode:

export LOG_LEVEL=debug
bun run dev
