CROP

Media Service

Image processing, storage, and SEO metadata

The Media Service manages product images, 360-degree views, PDF documents, image metadata embedding, and third-party data enrichment. It provides analytics APIs for media coverage, a pipeline for SEO metadata embedding into images, and integrations with Amazon for product data enrichment.

Data Pipeline

Scrapers / Vendors / Amazon
        ↓
MongoDB Atlas (nh_unified / crop_stage.parts / crop_prod.parts)
        ↓
Transformers (normalize, generate SKU/slug, process media)
        ↓
Elasticsearch (parts_current alias)
        ↓
Search API → Frontend (SEO output)

SEO Field Mapping

Data flows from MongoDB through Elasticsearch to the frontend, where it is rendered as SEO HTML elements and structured data.

| IndexedPart Field  | SEO Output              | HTML Element       |
|--------------------|-------------------------|--------------------|
| title              | metaTitle               | <title>            |
| description        | metaDescription         | <meta description> |
| partNumber         | schemaData.mpn          | JSON-LD            |
| sku                | schemaData.sku          | JSON-LD            |
| manufacturer.name  | schemaData.brand        | JSON-LD            |
| media.primaryImage | ogImage                 | <meta og:image>    |
| media.images       | schemaData.image        | JSON-LD            |
| price.list.value   | schemaData.offers.price | JSON-LD            |
| inventory.inStock  | offers.availability     | JSON-LD            |
| slug               | canonicalUrl            | <link canonical>   |
| categoryPath       | BreadcrumbList          | JSON-LD            |
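
The mapping above can be sketched as a single pure function. This is an illustration only: the `IndexedPart` and `SeoOutput` shapes are inferred from the table, the canonical URL pattern (`/parts/{slug}`) is a hypothetical placeholder, availability is nested under `schemaData.offers` as an assumption, and the availability strings are the standard schema.org values rather than anything confirmed by the service.

```typescript
// Only the fields named in the mapping table are modelled here.
interface IndexedPart {
  title: string;
  description: string;
  partNumber: string;
  sku: string;
  manufacturer: { name: string };
  media: { primaryImage: string; images: string[] };
  price: { list: { value: number } };
  inventory: { inStock: boolean };
  slug: string;
  categoryPath: string[];
}

interface SeoOutput {
  metaTitle: string;        // <title>
  metaDescription: string;  // <meta description>
  ogImage: string;          // <meta og:image>
  canonicalUrl: string;     // <link canonical>
  schemaData: {             // JSON-LD
    mpn: string;
    sku: string;
    brand: string;
    image: string[];
    offers: { price: number; availability: string };
  };
  breadcrumbList: { position: number; name: string }[]; // JSON-LD BreadcrumbList
}

// Assumption: canonical URLs look like `${baseUrl}/parts/${slug}`.
function toSeoOutput(part: IndexedPart, baseUrl: string): SeoOutput {
  return {
    metaTitle: part.title,
    metaDescription: part.description,
    ogImage: part.media.primaryImage,
    canonicalUrl: `${baseUrl}/parts/${part.slug}`,
    schemaData: {
      mpn: part.partNumber,
      sku: part.sku,
      brand: part.manufacturer.name,
      image: part.media.images,
      offers: {
        price: part.price.list.value,
        // schema.org availability values
        availability: part.inventory.inStock
          ? "https://schema.org/InStock"
          : "https://schema.org/OutOfStock",
      },
    },
    // Breadcrumb positions are 1-based, following the order of categoryPath.
    breadcrumbList: part.categoryPath.map((name, index) => ({ position: index + 1, name })),
  };
}
```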

Field Origin Traces

title:        Scraper → MongoDB.title → ES.title → API.title → SEO.metaTitle
partNumber:   Scraper → MongoDB.partNumber → Transformer(normalize) → ES.partNumber + ES.pnNorm → API → SEO.schemaData.mpn
sku:          Transformer(generate) → MongoDB.sku → ES.sku → API.sku → SEO.schemaData.sku
              Format: CT-{manufacturer.code}-{partNumber}
primaryImage: GCS Upload → MongoDB.media.images[0].gcpUrl → ES.media.primaryImage → API → SEO.ogImage
price:        Scraper/ERP → MongoDB.price.list → ES.price.list → API → SEO.schemaData.offers
categoryPath: Scraper → Transformer(normalize) → MongoDB.categoryPath → ES.categoryPath → API → SEO.BreadcrumbList

All field names use camelCase across every layer: MongoDB, Elasticsearch, API, and SEO output.
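
The SKU format in the trace above (`CT-{manufacturer.code}-{partNumber}`) is simple enough to sketch directly. The function name `generateSku` is hypothetical, and trimming the inputs and upper-casing the manufacturer code are assumptions; the actual transformer may normalize differently.

```typescript
// Builds a SKU in the documented format: CT-{manufacturer.code}-{partNumber}.
// Assumption: inputs are trimmed and the manufacturer code is upper-cased.
function generateSku(manufacturerCode: string, partNumber: string): string {
  const code = manufacturerCode.trim().toUpperCase();
  const pn = partNumber.trim();
  if (code === "" || pn === "") {
    throw new Error("manufacturer code and part number are both required");
  }
  return `CT-${code}-${pn}`;
}
```

For example, `generateSku("NHL", "00907566")` yields `CT-NHL-00907566`.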


Media Coverage API

The Media Coverage API provides analytics on media richness across the parts catalog. It tracks gallery images, 360-degree views, and PDF documents.

Base URL (dev): http://localhost:3005/api/health/media

Production URL: https://health-analytics-service-[hash].run.app/api/health/media

GET /api/health/media/coverage

Returns a comprehensive media coverage summary.

Query Parameters:

| Parameter   | Type   | Default | Description         |
|-------------|--------|---------|---------------------|
| environment | string | prod    | prod, dev, or stage |

Response Schema:

{
  success: boolean;
  data: {
    summary: {
      totalParts: number;
      withAnyMedia: number;
      withoutMedia: number;
      coveragePercentage: number;
    };
    images: {
      coverage: { count: number; percentage: number };
      gallery: { count: number; percentage: number };
      view360: {
        count: number;
        percentage: number;
        withFrames: number;
        avgFrameCount: number;
      };
    };
    documents: {
      coverage: { count: number; percentage: number };
      byType: {
        manuals: number;
        datasheets: number;
        certifications: number;
      };
    };
    qualityCorrelation: {
      withMediaAvgQuality: number;
      withoutMediaAvgQuality: number;
      delta: number;
    };
  };
  meta: {
    timestamp: string;
    environment: string;
    database: string;
    collection: string;
  };
}

Response time: ~150-300ms | Cache TTL: 10 minutes
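
As a rough illustration of how the `summary` fields relate to each other, the sketch below derives them from two counts. The rounding to one decimal place and the helper name `summarizeCoverage` are assumptions; the service may compute or round these values differently.

```typescript
interface CoverageSummary {
  totalParts: number;
  withAnyMedia: number;
  withoutMedia: number;
  coveragePercentage: number;
}

// Derives the summary block from the total part count and the number of
// parts that have at least one media item.
function summarizeCoverage(totalParts: number, withAnyMedia: number): CoverageSummary {
  if (totalParts < 0 || withAnyMedia < 0 || withAnyMedia > totalParts) {
    throw new RangeError("counts must satisfy 0 <= withAnyMedia <= totalParts");
  }
  // Avoid dividing by zero for an empty collection.
  const ratio = totalParts === 0 ? 0 : withAnyMedia / totalParts;
  return {
    totalParts,
    withAnyMedia,
    withoutMedia: totalParts - withAnyMedia,
    coveragePercentage: Math.round(ratio * 1000) / 10, // one decimal place
  };
}
```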

GET /api/health/media/distribution

Returns detailed distribution of image types and 360-degree view characteristics.

Query Parameters:

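| Parameter   | Type   | Default | Description              |
|-------------|--------|---------|--------------------------|
| environment | string | prod    | prod, dev, or stage      |
| groupBy     | string | type    | type, quality, or frames |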

Response includes:

  • imageTypes -- counts and percentages per type (marketing, front, back, left, right, angle1, angle2)
  • view360Distribution -- breakdowns by frame count (24/36/48), quality (high/standard), and grid layout (4x6, 6x6, 8x6)

Response time: ~200-400ms | Cache TTL: 15 minutes

GET /api/health/media/gaps

Identifies high-quality parts missing media (enrichment opportunities).

Query Parameters:

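| Parameter   | Type   | Default | Description                           |
|-------------|--------|---------|---------------------------------------|
| minQuality  | number | 70      | Minimum quality score (0-100)         |
| mediaType   | string | all     | all, gallery, view360, or documents   |
| limit       | number | 50      | Max results (1-500)                   |
| offset      | number | 0       | Pagination offset                     |
| sortBy      | string | quality | quality, partNumber, or sku           |
| sortOrder   | string | desc    | asc or desc                           |
| environment | string | prod    | prod, dev, or stage                   |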

Each gap item includes partNumber, sku, title, qualityScore, missingMedia flags, existingMedia summary, enrichmentPriority (high/medium/low), and estimatedImpact.

Response time: ~300-600ms | Cache TTL: 5 minutes
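
A client can apply the documented defaults and ranges before calling the endpoint, which avoids a round trip that would end in a 400. The sketch below only builds the request URL; `buildGapsUrl` and the `GapsQuery` type are hypothetical names, and the server remains the authority on validation.

```typescript
interface GapsQuery {
  minQuality?: number;
  mediaType?: "all" | "gallery" | "view360" | "documents";
  limit?: number;
  offset?: number;
  sortBy?: "quality" | "partNumber" | "sku";
  sortOrder?: "asc" | "desc";
  environment?: "prod" | "dev" | "stage";
}

// Builds a /gaps URL, filling in the documented defaults and rejecting
// values outside the documented ranges.
function buildGapsUrl(baseUrl: string, query: GapsQuery = {}): string {
  const minQuality = query.minQuality ?? 70;
  const limit = query.limit ?? 50;
  const offset = query.offset ?? 0;

  if (minQuality < 0 || minQuality > 100) throw new RangeError("minQuality must be 0-100");
  if (!Number.isInteger(limit) || limit < 1 || limit > 500) throw new RangeError("limit must be 1-500");
  if (!Number.isInteger(offset) || offset < 0) throw new RangeError("offset must be >= 0");

  const params: [string, string][] = [
    ["minQuality", String(minQuality)],
    ["mediaType", query.mediaType ?? "all"],
    ["limit", String(limit)],
    ["offset", String(offset)],
    ["sortBy", query.sortBy ?? "quality"],
    ["sortOrder", query.sortOrder ?? "desc"],
    ["environment", query.environment ?? "prod"],
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${baseUrl}/gaps?${queryString}`;
}
```

With the dev base URL, `buildGapsUrl("http://localhost:3005/api/health/media", { minQuality: 80, limit: 100 })` produces a URL that can be passed to `fetch` or `curl`.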

Error Codes

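| Code | Description                                                  |
|------|--------------------------------------------------------------|
| 400  | Invalid parameters (environment, minQuality, limit, groupBy) |
| 404  | Collection not found                                         |
| 500  | MongoDB query error                                          |
| 503  | Service unavailable                                          |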

Rate Limiting (Production)

  • Coverage and distribution endpoints: 100 req/min per IP
  • Gaps endpoint: 60 req/min per IP
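
The document does not say which rate-limiting algorithm is used. As one possible reading, the sketch below implements a fixed-window counter per IP; the class name and the in-memory `Map` are assumptions, and a production deployment would more likely rely on a shared store or a gateway feature.

```typescript
// Fixed-window rate limiter: at most `limit` requests per IP per window.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the IP is over its limit.
  allow(ip: string, nowMs: number = Date.now()): boolean {
    const entry = this.windows.get(ip);
    if (entry === undefined || nowMs - entry.windowStart >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.windows.set(ip, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Limits taken from the list above.
const coverageLimiter = new FixedWindowRateLimiter(100);
const gapsLimiter = new FixedWindowRateLimiter(60);
```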

Media Data Model

Gallery Images

Static product photos taken from several angles, stored in GCS.

  • Bucket: gs://crop_parts/newholland/images/
  • Path: {partNumber}/{partNumber}_{TYPE}.jpg
  • Clear background variant: {partNumber}/{partNumber}_CB_{TYPE}.jpg

Image Types and Priority Order:

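| Type            | Description                          | Priority                           |
|-----------------|--------------------------------------|------------------------------------|
| FRONT           | Front-facing product photo           | 1 (primary product identification) |
| BACK            | Rear view                            | 2 (installation reference)         |
| MARKETING       | Lifestyle/promotional                | 3 (visual appeal)                  |
| LEFT / RIGHT    | Side views                           | 4 (detailed inspection)            |
| ANGLE1 / ANGLE2 | Perspective views (R01_C24, R02_C24) | 5 (additional context)             |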
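
The path pattern and priority order can be combined into two small helpers. This is a sketch: the function names are hypothetical, the paths are relative to the bucket prefix listed above, and how ties between equal priorities (LEFT vs RIGHT) are resolved is an assumption (input order wins).

```typescript
type ImageType = "FRONT" | "BACK" | "MARKETING" | "LEFT" | "RIGHT" | "ANGLE1" | "ANGLE2";

// Priority values from the table above (lower is preferred).
const IMAGE_PRIORITY: Record<ImageType, number> = {
  FRONT: 1,
  BACK: 2,
  MARKETING: 3,
  LEFT: 4,
  RIGHT: 4,
  ANGLE1: 5,
  ANGLE2: 5,
};

// Path relative to the bucket prefix: {partNumber}/{partNumber}_{TYPE}.jpg,
// or {partNumber}/{partNumber}_CB_{TYPE}.jpg for the clear-background variant.
function imagePath(partNumber: string, type: ImageType, clearBackground: boolean = false): string {
  const variant = clearBackground ? "CB_" : "";
  return `${partNumber}/${partNumber}_${variant}${type}.jpg`;
}

// Picks the highest-priority type among those available for a part.
// Array.prototype.sort is stable, so equal priorities keep their input order.
function pickPrimaryType(available: ImageType[]): ImageType | undefined {
  return [...available].sort((a, b) => IMAGE_PRIORITY[a] - IMAGE_PRIORITY[b])[0];
}
```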

360-Degree Views

Interactive spin views with multiple frames.

| Frames | Quality  | Grid | Description     |
|--------|----------|------|-----------------|
| 24     | Standard | 4x6  | Basic spin view |
| 36     | High     | 6x6  | Smooth rotation |
| 48+    | Premium  | 8x6  | Maximum detail  |

Status levels: gcp (GCS-hosted, best), external (third-party), url_only (needs migration), not_available, none

Document structure:

{
  "view360": {
    "status": "gcp",
    "frameCount": 24,
    "rows": 4,
    "columns": 6,
    "frames": [
      { "url": "https://storage.googleapis.com/.../frame_001.jpg", "row": 0, "col": 0 }
    ]
  }
}
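
A consistency check over this structure is useful when frame counts look wrong. The sketch below compares `frameCount` against the grid and the `frames` array, and classifies quality from the frame count. The threshold interpretation (24 or more is Standard, 36 or more is High, 48 or more is Premium) is an assumption based on the table; the function names are hypothetical.

```typescript
interface View360 {
  status: "gcp" | "external" | "url_only" | "not_available" | "none";
  frameCount: number;
  rows: number;
  columns: number;
  frames: { url: string; row: number; col: number }[];
}

// Returns a list of human-readable problems; an empty list means consistent.
function checkView360(view: View360): string[] {
  const problems: string[] = [];
  if (view.frameCount !== view.rows * view.columns) {
    problems.push(`frameCount ${view.frameCount} != rows x columns (${view.rows * view.columns})`);
  }
  if (view.frames.length !== view.frameCount) {
    problems.push(`frames array has ${view.frames.length} entries, frameCount says ${view.frameCount}`);
  }
  for (const frame of view.frames) {
    // Row and column indexes are zero-based in the example document.
    if (frame.row < 0 || frame.row >= view.rows || frame.col < 0 || frame.col >= view.columns) {
      problems.push(`frame ${frame.url} is outside the ${view.rows}x${view.columns} grid`);
    }
  }
  return problems;
}

// Assumed thresholds derived from the frames/quality table.
function qualityTier(frameCount: number): "Standard" | "High" | "Premium" | undefined {
  if (frameCount >= 48) return "Premium";
  if (frameCount >= 36) return "High";
  if (frameCount >= 24) return "Standard";
  return undefined;
}
```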

PDF Documents

Types: manuals (installation, service, parts, operator), datasheets (specifications, dimensions, compatibility), certifications (safety, quality, environmental).

Current state: Schema is ready. The documents field in nh_unified is currently empty; the API returns zero counts and will automatically reflect non-zero values once data is populated -- no code changes required.

MongoDB schema:

{
  "documents": {
    "manuals": [{
      "type": "installation",
      "title": "Installation Guide",
      "url": "https://storage.googleapis.com/crop_docs/newholland/manuals/{pn}_install.pdf",
      "language": "en",
      "pageCount": 12,
      "fileSize": 2457600,
      "uploadedAt": "2025-11-17T10:00:00Z",
      "metadata": { "version": "1.0", "author": "...", "tags": [] }
    }],
    "datasheets": [{ "type": "specifications", "..." : "..." }],
    "certifications": [{
      "type": "safety",
      "issuer": "TUV Rheinland",
      "validUntil": "2026-12-31T23:59:59Z",
      "certificationNumber": "CE-2024-NHL-87840296",
      "..."  : "..."
    }]
  }
}

PDF Migration Plan

Migration is designed with zero downtime and no breaking changes. Frontend code handles both zero and non-zero states.

Steps

  1. Source PDFs -- scan GCS bucket (bun scripts/scan-gcs-documents.ts), scrape from vendor, or manual upload via gsutil
  2. Extract metadata -- parse with pdf-parse to get page count, author, title, keywords (bun scripts/parse-pdf-metadata.ts)
  3. Enrich MongoDB -- bulk-update nh_unified with $push to documents.* arrays (run bun scripts/enrich-pdf-data.ts --dry-run first, then run it again without --dry-run to apply)
  4. Verify -- API automatically returns non-zero counts; check /coverage, /distribution, /gaps
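
Step 3 can be sketched as building one `updateOne` operation per part, in the shape accepted by the MongoDB driver's `bulkWrite`. The contents of `scripts/enrich-pdf-data.ts` are not shown in this document, so everything here is an assumption: the helper names, matching on `partNumber`, and restricting the example to `documents.manuals`. The block builds plain objects only and does not connect to a database.

```typescript
interface ManualEntry {
  type: string;
  title: string;
  url: string;
  language: string;
  pageCount: number;
  fileSize: number;
  uploadedAt: string;
}

interface UpdateOneOp {
  updateOne: {
    filter: { partNumber: string };
    update: { $push: { "documents.manuals": { $each: ManualEntry[] } } };
  };
}

// Builds one $push/$each operation per part that has manuals to add.
function buildManualOps(manualsByPart: Map<string, ManualEntry[]>): UpdateOneOp[] {
  const ops: UpdateOneOp[] = [];
  manualsByPart.forEach((manuals, partNumber) => {
    if (manuals.length === 0) return; // nothing to push for this part
    ops.push({
      updateOne: {
        filter: { partNumber },
        update: { $push: { "documents.manuals": { $each: manuals } } },
      },
    });
  });
  return ops;
}

// Dry-run style summary: describes what would change without writing anything.
function describeOps(ops: UpdateOneOp[]): string[] {
  return ops.map((op) => {
    const count = op.updateOne.update.$push["documents.manuals"].$each.length;
    return `${op.updateOne.filter.partNumber}: +${count} manual(s)`;
  });
}
```

In a dry run, the output of `describeOps` would be logged; otherwise the operations would be handed to `collection.bulkWrite(ops)`.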

Rollback

# Restore from backup
mongorestore --uri="$MONGODB_URI" --nsInclude="crop.nh_unified" --drop backup-YYYYMMDD/crop/nh_unified.bson

Rollback time: ~15 minutes.


Image Metadata Embedding (XMP)

Product metadata is embedded directly into images using the XMP standard, so it travels with the file when images are shared or indexed by search engines (unless a downstream tool strips metadata).

Storage Architecture

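| Layer         | Location                         | Purpose                 |
|---------------|----------------------------------|-------------------------|
| JSON sidecar  | GCS, next to image               | Source of truth         |
| XMP tags      | Inside the image file            | Inseparable from image  |
| MongoDB       | media.images[].embeddedMetadata  | Fast queries            |
| Elasticsearch | parts_current index              | Full-text search        |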

GCS Folder Structure

gs://crop_parts/
├── ct/gallery/nhl/{partNumber}/
│   ├── {pn}-1.jpg              # Image with XMP
│   └── {pn}-1.meta.json        # JSON sidecar (source of truth)
├── ct/360/nhl/{partNumber}/
│   └── frame-001.jpg ...
├── vendor_scraped/
├── vendor_direct/
└── manual/

Metadata Schema

Each image has a JSON sidecar with these sections:

  • company -- name, type ("Authorized Reseller"), website, contact
  • legal -- copyright, license, termsUrl
  • product -- sku (CT-{VENDOR}-{PARTNUMBER}), partNumber, pnNorm, title, manufacturer, categoryName/categoryPath, equipmentFitment, status
  • image -- type, sortOrder, source, originalUrl, alt, contentHash
  • catalog -- product page URL, slug
  • embedding -- embedded flag, timestamp, version, processor ID

XMP Tag Mapping

| JSON Field                | XMP Tag               |
|---------------------------|-----------------------|
| company.name              | dc:creator            |
| legal.copyright           | dc:rights             |
| product.title             | dc:title              |
| product.sku               | crop:SKU              |
| product.partNumber        | crop:PartNumber       |
| product.manufacturer.name | crop:Manufacturer     |
| product.manufacturer.code | crop:ManufacturerCode |
| product.categoryPath[0]   | crop:Category         |
| legal.license             | xmpRights:UsageTerms  |
| catalog.url               | crop:CatalogURL       |
| embedding.version         | crop:Version          |
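
The tag mapping can be expressed as a function from the sidecar JSON to a flat tag record. This sketch only builds the record; it does not write to an image. The `SidecarSubset` type models just the fields used by the mapping, and `toXmpTags` is a hypothetical name. Writing the `crop:` tags with ExifTool would presumably also require that custom namespace to be defined in its configuration.

```typescript
// Subset of the sidecar JSON that the XMP mapping reads.
interface SidecarSubset {
  company: { name: string };
  legal: { copyright: string; license: string };
  product: {
    title: string;
    sku: string;
    partNumber: string;
    manufacturer: { name: string; code: string };
    categoryPath?: string[];
  };
  catalog: { url: string };
  embedding: { version: string };
}

// Flattens sidecar metadata into XMP tag name/value pairs per the table above.
function toXmpTags(meta: SidecarSubset): Record<string, string> {
  const tags: Record<string, string> = {
    "dc:creator": meta.company.name,
    "dc:rights": meta.legal.copyright,
    "dc:title": meta.product.title,
    "crop:SKU": meta.product.sku,
    "crop:PartNumber": meta.product.partNumber,
    "crop:Manufacturer": meta.product.manufacturer.name,
    "crop:ManufacturerCode": meta.product.manufacturer.code,
    "xmpRights:UsageTerms": meta.legal.license,
    "crop:CatalogURL": meta.catalog.url,
    "crop:Version": meta.embedding.version,
  };
  // categoryPath is optional; only the first (top-level) entry is mapped.
  const topCategory = meta.product.categoryPath ? meta.product.categoryPath[0] : undefined;
  if (topCategory !== undefined) {
    tags["crop:Category"] = topCategory;
  }
  return tags;
}
```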

TypeScript Types

export interface ImageMetadata {
  schemaVersion: string;
  createdAt: string;
  updatedAt: string;
  company: CompanyInfo;
  legal: LegalInfo;
  product: ProductInfo;
  image: ImageInfo;
  catalog: CatalogInfo;
  embedding: EmbeddingInfo;
}

export interface ProductInfo {
  sku: string;                     // "CT-NHL-00907566"
  partNumber: string;
  pnNorm?: string;
  title: string;
  description?: string;
  manufacturer: { name: string; code: string };
  categoryName?: string[];
  categoryPath?: string[];
  equipmentFitment?: string[];
  status?: 'active' | 'discontinued' | 'superseded';
}

A Zod validation schema (ImageMetadataSchema) enforces structure at runtime, including regex validation on the SKU format (/^CT-[A-Z]{2,3}-\w+$/).
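
The actual `ImageMetadataSchema` is not reproduced in this document. To show what the SKU rule implies without pulling in Zod, here is a dependency-free sketch that hand-rolls comparable checks for `ProductInfo`; `validateProductInfo` is a hypothetical helper, not part of the codebase.

```typescript
// SKU pattern quoted in the text above.
const SKU_PATTERN = /^CT-[A-Z]{2,3}-\w+$/;
const VALID_STATUSES = new Set<string>(["active", "discontinued", "superseded"]);

// Returns a list of validation errors; an empty list means the input passed.
function validateProductInfo(input: unknown): string[] {
  if (typeof input !== "object" || input === null) return ["product must be an object"];
  const product = input as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof product.sku !== "string" || !SKU_PATTERN.test(product.sku)) {
    errors.push("sku must match CT-{2-3 uppercase letters}-{word characters}");
  }
  if (typeof product.partNumber !== "string" || product.partNumber === "") {
    errors.push("partNumber is required");
  }
  if (typeof product.title !== "string" || product.title === "") {
    errors.push("title is required");
  }
  const manufacturer = product.manufacturer as Record<string, unknown> | null | undefined;
  if (
    typeof manufacturer !== "object" || manufacturer === null ||
    typeof manufacturer.name !== "string" || typeof manufacturer.code !== "string"
  ) {
    errors.push("manufacturer.name and manufacturer.code are required");
  }
  if (product.status !== undefined &&
      (typeof product.status !== "string" || !VALID_STATUSES.has(product.status))) {
    errors.push("status must be active, discontinued, or superseded");
  }
  return errors;
}
```

Note that `\w+` matches letters, digits, and underscores only, so a part number containing a hyphen (for example `CT-NHL-84-123`) would be rejected by this pattern.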

Hybrid Processing System

Metadata embedding uses a hybrid local/cloud architecture:

Orchestrator (Cloud Run)
  ├── Task Queue → Router → Local Worker (primary, ~free)
  │                      └→ Cloud Worker (fallback, $0.12/GB)
  └── Monitor & Alerts

  • Local worker pulls tasks, downloads images in bulk, embeds XMP with exiftool-vendored, uploads results
  • Cloud worker (Cloud Run) auto-scales 0-1000, used as fallback when local is unhealthy or for urgent tasks
  • Routing logic checks local worker health (heartbeat timeout 2min, CPU >95%, memory >90%, disk <10GB) and falls back to cloud automatically
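
The routing rule in the last bullet can be sketched as a pure function over a health snapshot. The thresholds are the ones listed above; the type and function names, and the exact shape of the health report, are assumptions.

```typescript
interface LocalWorkerHealth {
  lastHeartbeatMs: number; // timestamp of the last heartbeat (epoch ms)
  cpuPercent: number;
  memoryPercent: number;
  freeDiskGb: number;
}

interface EmbedTask {
  id: string;
  urgent: boolean;
}

const HEARTBEAT_TIMEOUT_MS = 2 * 60 * 1000; // 2 minutes

// Local worker is unhealthy if the heartbeat is stale, CPU > 95%,
// memory > 90%, or free disk < 10 GB.
function isLocalHealthy(health: LocalWorkerHealth, nowMs: number): boolean {
  return (
    nowMs - health.lastHeartbeatMs <= HEARTBEAT_TIMEOUT_MS &&
    health.cpuPercent <= 95 &&
    health.memoryPercent <= 90 &&
    health.freeDiskGb >= 10
  );
}

// Urgent tasks and tasks arriving while local is unhealthy go to the cloud worker.
function routeTask(task: EmbedTask, health: LocalWorkerHealth, nowMs: number = Date.now()): "local" | "cloud" {
  if (task.urgent) return "cloud";
  return isLocalHealthy(health, nowMs) ? "local" : "cloud";
}
```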

Cost at scale (5M images): ~$50-100 hybrid vs ~$590 cloud-only.

Security Rules

Metadata includes: company name, SKU, part number, manufacturer, copyright, catalog URL.

Metadata excludes: prices, cost data, inventory levels, internal IDs, customer data, API keys.


Amazon Data Enrichment

Amazon Product Advertising API is used to enrich parts with descriptions, dimensions, images, and specifications.

Architecture

MongoDB (source) → Enrichment Service → Amazon PA API
                                      → Oxylabs (optional)
                                      → Rainforest API (optional)

                   ES Index (search)

Matching Strategy

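| Stage | Method                     | Confidence |
|-------|----------------------------|------------|
| 1     | UPC match                  | 99%        |
| 2     | Part number exact search   | 85%        |
| 3     | Manufacturer + part number | 75%        |
| 4     | Title fuzzy search         | 70%        |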
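
One way to read the table is as a cascade: try each stage in order and stop at the first hit, recording that stage's confidence. The sketch below follows that reading. The lookup callbacks are stand-ins for real API calls (which would be asynchronous), and the assumption that each lookup returns an ASIN string is mine, not the document's.

```typescript
interface CatalogPart {
  partNumber: string;
  manufacturer: string;
  title: string;
  upc?: string;
}

interface MatchResult {
  asin: string;
  stage: number;
  method: string;
  confidence: number;
}

// A lookup returns an ASIN when it finds a match, otherwise undefined.
type Lookup = (part: CatalogPart) => string | undefined;

interface MatchStage {
  stage: number;
  method: string;
  confidence: number;
  lookup: Lookup;
}

// Wires the four documented stages to caller-supplied lookups.
function buildStages(lookups: {
  byUpc: Lookup;
  byPartNumber: Lookup;
  byManufacturerAndPartNumber: Lookup;
  byTitleFuzzy: Lookup;
}): MatchStage[] {
  return [
    { stage: 1, method: "UPC match", confidence: 0.99, lookup: lookups.byUpc },
    { stage: 2, method: "Part number exact search", confidence: 0.85, lookup: lookups.byPartNumber },
    { stage: 3, method: "Manufacturer + part number", confidence: 0.75, lookup: lookups.byManufacturerAndPartNumber },
    { stage: 4, method: "Title fuzzy search", confidence: 0.7, lookup: lookups.byTitleFuzzy },
  ];
}

// Tries stages in order; the first stage that returns an ASIN wins.
function matchPart(part: CatalogPart, stages: MatchStage[]): MatchResult | undefined {
  for (const current of stages) {
    const asin = current.lookup(part);
    if (asin !== undefined) {
      return { asin, stage: current.stage, method: current.method, confidence: current.confidence };
    }
  }
  return undefined;
}
```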

Fields Enriched

Priority 1 (Critical):

| Field             | API Path                                 | Usage                                               |
|-------------------|------------------------------------------|-----------------------------------------------------|
| Features / Bullets | ItemInfo.Features.DisplayValues         | Enhanced descriptions, SEO keywords, selling points |
| Description       | ItemInfo.ProductInfo.ItemDescription     | Extended product page text                          |
| Dimensions        | ItemInfo.ProductInfo.ItemDimensions      | Shipping calculation, size filters                  |
| Weight            | ItemInfo.ProductInfo.ItemDimensions.Weight | Shipping cost, product grouping                   |
| UPC / GTIN / EAN  | ItemInfo.ProductInfo.UPCList             | Cross-referencing, POS integration, deduplication   |
| Images (high-res) | Images.Primary, Images.Variants          | Up to 1500px product photos                         |

Priority 2 (High-value):

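| Field           | API Path                  | Usage                                   |
|-----------------|---------------------------|-----------------------------------------|
| Categories      | BrowseNodeInfo.BrowseNodes | Auto-categorization, SEO breadcrumbs   |
| Technical specs | ItemInfo.TechnicalInfo    | Spec sheets, comparison, filters        |
| Brand info      | ItemInfo.ByLineInfo.Brand | Manufacturer verification, brand pages  |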

Priority 3 (Nice to have): Customer reviews/ratings, Q&A count, warranty info, package dimensions, related products.

All fields come in a single API call -- no additional cost per field.

Cost Estimate

| Category                         | Cost  |
|----------------------------------|-------|
| One-time enrichment (~56k parts) | ~$96  |
| Monthly maintenance              | ~$25  |
| Re-enrichment (quarterly)        | ~$38  |
| Total Year 1                     | ~$237 |

Coverage Expectations (56k parts)

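| Data Field       | Estimated Coverage |
|------------------|--------------------|
| Features/Bullets | 65%                |
| Dimensions       | 70%                |
| Weight           | 75%                |
| UPC              | 80%                |
| High-res images  | 65%                |
| Categories       | 70%                |
| Specifications   | 60%                |
| Reviews          | 50%                |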

Overall enrichment rate: ~70%.

Environment Variables

# Amazon Product Advertising API
AMAZON_PA_ACCESS_KEY=your_access_key
AMAZON_PA_SECRET_KEY=your_secret_key
AMAZON_PA_PARTNER_TAG=your_partner_tag

# Optional alternatives
OXYLABS_USERNAME=your_username
OXYLABS_PASSWORD=your_password
RAINFOREST_API_KEY=your_api_key

Implementation Phases

  1. Weeks 1-2: Foundation and API setup
  2. Week 3: Pilot enrichment (100 parts)
  3. Weeks 4-6: Bulk enrichment (56k parts)
  4. Week 7: Data merge and ES sync
  5. Week 8: Production deployment

MongoDB Indexes

Required indexes for optimal API performance:

db.nh_unified.createIndex({ 'media.images': 1 });
db.nh_unified.createIndex({ 'media.view360.status': 1 });
db.nh_unified.createIndex({ 'media.view360.frameCount': 1 });
db.nh_unified.createIndex({ 'qualityScore.total': -1 });
db.nh_unified.createIndex({ 'qualityScore.total': -1, 'media.imagesCount': 1 });

The coverage endpoint uses a $facet aggregation (single-pass over all documents). The gaps endpoint relies on the compound quality + imagesCount index for filtering.


Caching

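| Endpoint      | TTL    | Cache Key Pattern                                                |
|---------------|--------|------------------------------------------------------------------|
| /coverage     | 10 min | media:coverage:{env}:{ts_10min}                                  |
| /distribution | 15 min | media:distribution:{env}:{groupBy}:{ts_15min}                    |
| /gaps         | 5 min  | media:gaps:{env}:{minQuality}:{mediaType}:{offset}:{ts_5min}     |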

Invalidation: automatic via TTL, manual after bulk sync (bun scripts/sync-mongodb-to-es.ts), or via webhook on GCS manifest updates. A stale-while-revalidate pattern serves cached data while refreshing in the background.
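
The `{ts_Nmin}` placeholder in the key patterns is not defined in this document. A plausible reading is a time bucket: the current time divided by the TTL and rounded down, so all requests inside one TTL window share a key. The sketch below follows that assumption; the function names are hypothetical.

```typescript
// Index of the TTL-sized window that contains nowMs.
// Assumption: this is what {ts_10min}, {ts_15min}, and {ts_5min} stand for.
function timeBucket(nowMs: number, ttlMinutes: number): number {
  return Math.floor(nowMs / (ttlMinutes * 60 * 1000));
}

function coverageKey(env: string, nowMs: number): string {
  return `media:coverage:${env}:${timeBucket(nowMs, 10)}`;
}

function distributionKey(env: string, groupBy: string, nowMs: number): string {
  return `media:distribution:${env}:${groupBy}:${timeBucket(nowMs, 15)}`;
}

function gapsKey(env: string, minQuality: number, mediaType: string, offset: number, nowMs: number): string {
  return `media:gaps:${env}:${minQuality}:${mediaType}:${offset}:${timeBucket(nowMs, 5)}`;
}
```

If the `/gaps` pattern is applied literally, requests that differ only in `limit`, `sortBy`, or `sortOrder` would map to the same key, which is worth checking against the actual implementation.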


Troubleshooting

Coverage returns 0 parts -- verify MONGODB_COLLECTION is nh_unified and check connectivity with curl http://localhost:3005/health.

Gaps returns empty array -- lower minQuality threshold; check quality score distribution with a $bucket aggregation.

Image type counts do not match gallery count -- images.gallery.count is parts with any gallery image; imageTypes.front.count is parts with FRONT-type specifically. A part can have multiple images of the same type (e.g., standard + clear-background variants).

360-degree frame counts seem wrong -- compare media.view360.frameCount field against actual media.view360.frames array length; re-scan manifests if mismatched.

API response >1s -- check for missing MongoDB indexes (look for COLLSCAN in .explain("executionStats")).

Debug mode:

export LOG_LEVEL=debug
bun run dev
