Task: Scrape Marcrest (MAR) Parts Catalogs from Ricambio

[!NOTE] Discussion: Open an issue to comment on this plan. Attach supporting documents to the issue or link them here.

Objective

Download all parts catalog PDFs from the Marcrest Ricambio site and upload to GCS for processing through the existing CROP PDF pipeline.

Source

URL: https://marcrest.ricambio.net/site/pagece5.wplus?ID_COUNT=ce_5_home&LN=2&CEPV=Marcrest001&CELN=2&CEME=2&NDS=CE_3&PRF=3&PRNDS=CE_591&PRC=|R|CE_2|CE_591|CE_3&KPRD=CE_3#CE_3

Platform: Ricambio (ricambio.net) -- a standard platform for agricultural equipment parts catalogs. The site is JavaScript-heavy and loads content dynamically.

Equipment Models (full coverage needed)

Bale Baron (main product line)

Model	Notes
4200-4210
4220
4230
4240
4245
4250
5250
6240

Other products

Bale Handling
Hydraulic Power Units
Power Linx
Swingmax

Site Structure

Hierarchical navigation:

Marcrest Parts Books (root)
+-- Bale Baron
|   +-- 4200-4210
|   |   +-- Baler Hardware
|   |   +-- Brakes
|   |   +-- Electrical
|   |   +-- Endgates
|   |   +-- Hitch
|   |   +-- Hydraulics
|   |   +-- Knotter
|   |   +-- Loading and Compression
|   |   +-- Pickup
|   |   +-- Power Unit
|   |   +-- PTO Pump Kit
|   |   +-- Roller Chute and Expellers
|   |   +-- Twine Boxes, Shields & Railings
|   |   +-- Wheels, Hubs, and Axles
|   +-- 4220
|   |   +-- ... (similar categories)
|   +-- ...
+-- Bale Handling
+-- Hydraulic Power Units
+-- Power Linx
+-- Swingmax
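The hierarchy above can be enumerated with a simple recursive walk. The sketch below assumes the tree has already been captured as a nested dict (a hypothetical in-memory form, not something the site or the existing prototype is known to provide) and yields one path per leaf category. `SAMPLE_TREE` is only an excerpt of the outline above.

```python
from typing import Iterator

# Hypothetical in-memory form of the navigation tree (excerpt only).
# An empty dict marks a leaf category or a product whose children are not yet known.
SAMPLE_TREE: dict = {
    "Bale Baron": {
        "4200-4210": {
            "Baler Hardware": {},
            "Brakes": {},
            "Electrical": {},
        },
    },
    "Swingmax": {},
}


def iter_leaf_paths(tree: dict, prefix: tuple[str, ...] = ()) -> Iterator[tuple[str, ...]]:
    """Yield the full path (root to leaf) of every leaf node in the tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            # Descend into sub-levels (product line -> model -> category).
            yield from iter_leaf_paths(children, path)
        else:
            yield path


if __name__ == "__main__":
    for path in iter_leaf_paths(SAMPLE_TREE):
        print(" / ".join(path))
```

Each yielded path maps to one document to scrape, which makes it easy to compare the number of discovered sections against the estimated volume.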

Each category contains:

Diagram -- SVG/image with position reference numbers
Parts table -- parts list (ref number, part number, description, qty)

URL Pattern

pagece5.wplus?ID_COUNT=[page_id]&LN=2&CEPV=Marcrest001&CELN=2&CEME=2&NDS=[node_code]&PRF=[level]&PRC=[breadcrumb]&KPRD=[product_code]

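ID_COUNT -- page ID (e.g., ce_5_home)
NDS -- category/model node code (e.g., CE_3 for 5250)
PRF -- hierarchy level
PRC -- breadcrumb path, pipe-separated (e.g., |R|CE_2|CE_591|CE_3)
KPRD -- product code
PRNDS -- present in the source URL above (PRNDS=CE_591) but not in this pattern; its role is unconfirmed and should be checked during the site audit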
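As a minimal sketch of working with this pattern, the helpers below parse a catalog URL into its parameters and rebuild one from its parts, using only the standard library. The function names are hypothetical. Whether the server accepts a URL without `PRNDS` (which appears in the source URL but not in the pattern) has not been verified.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://marcrest.ricambio.net/site/pagece5.wplus"

# Parameters that are constant in the pattern above.
FIXED_PARAMS = {"LN": "2", "CEPV": "Marcrest001", "CELN": "2", "CEME": "2"}


def parse_catalog_url(url: str) -> dict[str, str]:
    """Return the query parameters of a catalog URL as a flat dict."""
    query = urlsplit(url).query  # the '#CE_3' fragment is not part of the query
    return {key: values[0] for key, values in parse_qs(query).items()}


def split_breadcrumb(prc: str) -> list[str]:
    """Split a PRC value such as '|R|CE_2|CE_591|CE_3' into node codes."""
    return [part for part in prc.split("|") if part]


def build_catalog_url(page_id: str, node_code: str, level: int,
                      breadcrumb: list[str], product_code: str) -> str:
    """Assemble a URL following the pattern; PRNDS is omitted (assumption)."""
    params = {
        "ID_COUNT": page_id,
        **FIXED_PARAMS,
        "NDS": node_code,
        "PRF": str(level),
        "PRC": "|" + "|".join(breadcrumb),
        "KPRD": product_code,
    }
    # safe="|" keeps the pipe separators literal, as in the source URL.
    return f"{BASE_URL}?{urlencode(params, safe='|')}"
```

Parsing the source URL and rebuilding it from the extracted values is a quick round-trip check before generating URLs for other nodes.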

Expected Output

Option A: PDF files (preferred)

If the site provides printable PDFs per section or model, download them. Ricambio sites typically have a "Print" button or generate PDFs.

Option B: Structured data (if no PDFs available)

For each section of each model:

Diagram image (SVG or PNG) -- schematic with position numbers
Parts table data as JSON:

{
  "model": "5250",
  "section": "Hydraulics",
  "category": "HYDRAULICS",
  "parts": [
    {
      "ref": "1",
      "partNumber": "ABC-12345",
      "description": "Hydraulic pump",
      "qty": 1
    }
  ]
}
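A minimal sketch of producing this JSON from scraped table rows follows. It assumes each row arrives as a `(ref, part number, description, qty)` tuple of strings, and that `category` is simply the upper-cased section name, as the example suggests. Both are assumptions to confirm against real pages.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Part:
    ref: str          # position number on the diagram, kept as a string
    partNumber: str
    description: str
    qty: int


def build_section_record(model: str, section: str,
                         rows: list[tuple[str, str, str, str]]) -> dict:
    """Convert raw table rows into the Option B record shape."""
    parts = [
        Part(
            ref=ref.strip(),
            partNumber=part_number.strip(),
            description=description.strip(),
            qty=int(qty),  # raises ValueError on non-numeric quantities
        )
        for ref, part_number, description, qty in rows
    ]
    return {
        "model": model,
        "section": section,
        "category": section.upper(),  # assumption: category mirrors the section
        "parts": [asdict(part) for part in parts],
    }


if __name__ == "__main__":
    rows = [("1", "ABC-12345", "Hydraulic pump", "1")]
    print(json.dumps(build_section_record("5250", "Hydraulics", rows), indent=2))
```

Letting `int(qty)` fail loudly makes malformed rows visible during the scrape instead of in the downstream parser.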

GCS Destination

gs://crop-processed-data/providers/marcrest/{document_id}/
+-- metadata.json
+-- pages/
    +-- page_{N}/
        +-- page_text.txt
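The layout above can be expressed as a pair of path helpers, sketched below. The function names are hypothetical, and numbering pages from 1 is an assumption, since the layout only shows `page_{N}`.

```python
BUCKET_PREFIX = "gs://crop-processed-data/providers/marcrest"


def metadata_uri(document_id: str) -> str:
    """URI of the metadata.json object for one document."""
    return f"{BUCKET_PREFIX}/{document_id}/metadata.json"


def page_text_uri(document_id: str, page_number: int) -> str:
    """URI of the extracted text for one page of a document."""
    return f"{BUCKET_PREFIX}/{document_id}/pages/page_{page_number}/page_text.txt"


def document_uris(document_id: str, page_count: int) -> list[str]:
    """All object URIs expected for a document (pages numbered from 1, assumed)."""
    uris = [metadata_uri(document_id)]
    uris.extend(page_text_uri(document_id, n) for n in range(1, page_count + 1))
    return uris
```

Generating the expected URIs up front gives a checklist to verify after upload.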

metadata.json format (example):

{
  "document_id": "marcrest_5250_hydraulics",
  "provider": "marcrest",
  "filename": "5250-Hydraulics.pdf",
  "model_number": "5250",
  "page_count": 3,
  "processing_date": "2026-02-24"
}
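One way to generate this metadata consistently is sketched below. The `document_id` and `filename` conventions are inferred from the single example above (`marcrest_5250_hydraulics`, `5250-Hydraulics.pdf`). The handling of section names with spaces or punctuation is an assumption and should be agreed before the backfill.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lower-case and collapse every non-alphanumeric run into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_metadata(model: str, section: str, page_count: int,
                   processing_date: date) -> dict:
    """Build a metadata.json payload in the example's shape."""
    return {
        "document_id": f"marcrest_{slugify(model)}_{slugify(section)}",
        "provider": "marcrest",
        # Assumption: filename is '<model>-<section>.pdf', as in the example.
        "filename": f"{model}-{section}.pdf",
        "model_number": model,
        "page_count": page_count,
        "processing_date": processing_date.isoformat(),
    }
```

Calling `build_metadata("5250", "Hydraulics", 3, date(2026, 2, 24))` reproduces the example above. For a section such as "Twine Boxes, Shields & Railings", the slug becomes `twine_boxes_shields_railings`.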

If going with Option B (structured data without PDFs), place the files in GCS using this same layout; we will write a parser to read them.

Existing Code Reference

There is an existing scraper prototype:

CROP-John-PythonProject-equiptment/PythonProject-marcrest/

Key files:

main.py -- orchestrator (site audit + download)
probe_site.py -- site structure analysis
download_all_schematics.py -- downloads SVG schematics via Playwright
download_marcrest_diagram_pdfs.py -- downloads PDFs from Ricambio site
build_offline_database.py -- consolidates data into JSON/SQLite/CSV

Tech stack: Python 3.10+, Playwright (Brave Browser), CDP port 9222, resumable downloads via progress.json.
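The resumable-download idea can be sketched as follows. The actual schema of the prototype's progress.json is not documented here, so the `{"completed": [...]}` structure and the function names below are assumptions for illustration, not the prototype's implementation.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_progress(path: Path) -> set[str]:
    """Read the set of already-completed item keys (empty if no file yet)."""
    if not path.exists():
        return set()
    data = json.loads(path.read_text(encoding="utf-8"))
    return set(data.get("completed", []))


def save_progress(path: Path, completed: set[str]) -> None:
    """Write progress via a temp file so an interrupted run keeps the old file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"completed": sorted(completed)}, indent=2),
                   encoding="utf-8")
    tmp.replace(path)


def run_resumable(items: Iterable[str], download: Callable[[str], None],
                  progress_path: Path) -> set[str]:
    """Download each item once, skipping those recorded in earlier runs."""
    completed = load_progress(progress_path)
    for item in items:
        if item in completed:
            continue
        download(item)  # an exception here leaves earlier progress saved
        completed.add(item)
        save_progress(progress_path, completed)
    return completed
```

Saving after every item costs a small write per download, but a crash or browser disconnect then loses at most the item in flight.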

Current MAR Data in CROP

Data	Count	Status
Parts in MongoDB	74	Present (from DIS)
Equipment fitment	0	No data
Catalogs in MongoDB	0	No PDFs
Equipment models	0	No data
DIS API coverage	~80%	Working
CT photos	66	Present

Priority and Estimated Volume

Models: ~12 (8 Bale Baron + 4 other products)
Sections per model: ~15 (based on 5250)
Total documents: ~150-180 sections
Priority: Medium -- 74 published parts without fitment data

Post-Scraping Steps (handled by us)

Upload data to GCS (gs://crop-processed-data/providers/marcrest/)
Backfill into MongoDB (catalogs + catalog_pages)
Write fitment parser for MAR in equipment-fitment
Enrich parts via enrich-parts-from-fitment.ts
ES re-sync
