CROP

Marcrest Scraping Task

Task: Scrape Marcrest (MAR) Parts Catalogs from Ricambio

[!NOTE] Discussion: Open an issue to comment on this plan. Attach supporting documents to the issue or link them here.

Objective

Download all parts catalog PDFs from the Marcrest Ricambio site and upload to GCS for processing through the existing CROP PDF pipeline.

Source

URL: https://marcrest.ricambio.net/site/pagece5.wplus?ID_COUNT=ce_5_home&LN=2&CEPV=Marcrest001&CELN=2&CEME=2&NDS=CE_3&PRF=3&PRNDS=CE_591&PRC=|R|CE_2|CE_591|CE_3&KPRD=CE_3#CE_3

Platform: Ricambio (ricambio.net) -- a common platform for agricultural equipment parts catalogs. The site is JavaScript-heavy and loads content dynamically.

Equipment Models (full coverage needed)

Bale Baron (main product line)

| Model | Notes |
|---|---|
| 4200-4210 | |
| 4220 | |
| 4230 | |
| 4240 | |
| 4245 | |
| 4250 | |
| 5250 | |
| 6240 | |

Other products

  • Bale Handling
  • Hydraulic Power Units
  • Power Linx
  • Swingmax

Site Structure

Hierarchical navigation:

Marcrest Parts Books (root)
+-- Bale Baron
|   +-- 4200-4210
|   |   +-- Baler Hardware
|   |   +-- Brakes
|   |   +-- Electrical
|   |   +-- Endgates
|   |   +-- Hitch
|   |   +-- Hydraulics
|   |   +-- Knotter
|   |   +-- Loading and Compression
|   |   +-- Pickup
|   |   +-- Power Unit
|   |   +-- PTO Pump Kit
|   |   +-- Roller Chute and Expellers
|   |   +-- Twine Boxes, Shields & Railings
|   |   +-- Wheels, Hubs, and Axles
|   +-- 4220
|   |   +-- ... (similar categories)
|   +-- ...
+-- Bale Handling
+-- Hydraulic Power Units
+-- Power Linx
+-- Swingmax

Each category contains:

  • Diagram -- SVG/image with position reference numbers
  • Parts table -- parts list (ref number, part number, description, qty)
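As a rough illustration, the hierarchy above can be held as a nested mapping and flattened into a work list of (product, model, category) sections. This is a minimal sketch: `SITE_TREE` and `iter_sections` are hypothetical names, only the categories listed for 4200-4210 are filled in, and representing the other product lines as empty mappings is an assumption until the scraper enumerates them.

```python
from typing import Iterator

# Only the branch spelled out in the tree above is filled in. Everything else
# is left empty until it has been discovered on the live site.
SITE_TREE: dict[str, dict[str, list[str]]] = {
    "Bale Baron": {
        "4200-4210": [
            "Baler Hardware", "Brakes", "Electrical", "Endgates", "Hitch",
            "Hydraulics", "Knotter", "Loading and Compression", "Pickup",
            "Power Unit", "PTO Pump Kit", "Roller Chute and Expellers",
            "Twine Boxes, Shields & Railings", "Wheels, Hubs, and Axles",
        ],
        "4220": [],  # categories not yet enumerated
    },
    "Bale Handling": {},          # structure below this level is unknown
    "Hydraulic Power Units": {},
    "Power Linx": {},
    "Swingmax": {},
}


def iter_sections(tree: dict[str, dict[str, list[str]]]) -> Iterator[tuple[str, str, str]]:
    """Yield one (product, model, category) tuple per leaf section."""
    for product, models in tree.items():
        for model, categories in models.items():
            for category in categories:
                yield product, model, category


sections = list(iter_sections(SITE_TREE))
```

Each tuple then corresponds to one diagram plus one parts table to fetch.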

URL Pattern

pagece5.wplus?ID_COUNT=[page_id]&LN=2&CEPV=Marcrest001&CELN=2&CEME=2&NDS=[node_code]&PRF=[level]&PRC=[breadcrumb]&KPRD=[product_code]
  • NDS -- category/model node code (e.g., CE_3 for 5250)
  • PRF -- hierarchy level
  • PRC -- breadcrumb path
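
The pattern can be exercised with the standard library alone. The sketch below parses a page URL into its navigation fields and rebuilds a URL from them; `parse_page_url` and `build_page_url` are hypothetical helper names, and the fixed values (`LN=2`, `CEPV=Marcrest001`, and so on) are copied from the pattern above. Note that the source URL also carries a `PRNDS` parameter that the pattern does not list, so this sketch leaves it out; whether the site requires it is untested.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://marcrest.ricambio.net/site/pagece5.wplus"


def parse_page_url(url: str) -> dict:
    """Extract the navigation fields from a Ricambio page URL."""
    query = parse_qs(urlsplit(url).query)
    flat = {key: values[0] for key, values in query.items()}
    return {
        "page_id": flat["ID_COUNT"],
        "node_code": flat["NDS"],
        "level": int(flat["PRF"]),
        # PRC looks like "|R|CE_2|CE_591|CE_3"; drop the empty leading piece.
        "breadcrumb": [piece for piece in flat["PRC"].split("|") if piece],
        "product_code": flat["KPRD"],
    }


def build_page_url(page_id: str, node_code: str, level: int,
                   breadcrumb: list[str], product_code: str) -> str:
    """Assemble a page URL following the documented pattern."""
    params = {
        "ID_COUNT": page_id,
        "LN": "2",
        "CEPV": "Marcrest001",
        "CELN": "2",
        "CEME": "2",
        "NDS": node_code,
        "PRF": str(level),
        "PRC": "|" + "|".join(breadcrumb),
        "KPRD": product_code,
    }
    # Keep "|" unescaped so the breadcrumb matches the form seen on the site.
    return f"{BASE_URL}?{urlencode(params, safe='|')}"
```

Round-tripping the source URL through `parse_page_url` and `build_page_url` is a cheap sanity check before crawling.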

Expected Output

Option A: PDF files (preferred)

If the site provides printable PDFs per section or per model, download them. Ricambio sites typically offer a "Print" button or generate PDFs on request.

Option B: Structured data (if no PDFs available)

For each section of each model:

  1. Diagram image (SVG or PNG) -- schematic with position numbers
  2. Parts table data as JSON:
{
  "model": "5250",
  "section": "Hydraulics",
  "category": "HYDRAULICS",
  "parts": [
    {
      "ref": "1",
      "partNumber": "ABC-12345",
      "description": "Hydraulic pump",
      "qty": 1
    }
  ]
}
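
One way to produce records in this shape is a pair of small dataclasses serialized with `json`. This is only a sketch: `Part`, `SectionRecord`, and `section_from_rows` are hypothetical names, and deriving `category` by upper-casing the section name is an assumption based on the single example above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Part:
    ref: str
    partNumber: str  # camelCase kept to match the JSON key above
    description: str
    qty: int


@dataclass
class SectionRecord:
    model: str
    section: str
    category: str
    parts: list[Part] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to the JSON layout shown in the example above."""
        return json.dumps(asdict(self), indent=2)


def section_from_rows(model: str, section: str, rows: list[tuple]) -> SectionRecord:
    """Build a record from scraped (ref, part number, description, qty) rows."""
    parts = [
        Part(
            ref=str(ref).strip(),
            partNumber=str(part_number).strip(),
            description=str(description).strip(),
            qty=int(qty),
        )
        for ref, part_number, description, qty in rows
    ]
    # Assumption: category is the upper-cased section name, as in the example.
    return SectionRecord(model=model, section=section,
                         category=section.upper(), parts=parts)
```

Converting `qty` with `int()` makes a malformed table row fail loudly at scrape time rather than later in the parser.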

GCS Destination

gs://crop-processed-data/providers/marcrest/{document_id}/
+-- metadata.json
+-- pages/
    +-- page_{N}/
        +-- page_text.txt

metadata.json format (example):

{
  "document_id": "marcrest_5250_hydraulics",
  "provider": "marcrest",
  "filename": "5250-Hydraulics.pdf",
  "model_number": "5250",
  "page_count": 3,
  "processing_date": "2026-02-24"
}

If going with Option B (structured data without PDFs), place the files in GCS in this same layout; we will write a parser to read them.
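
The layout and metadata.json format above can be generated with a few helpers. The sketch below is illustrative only: the function names are hypothetical, the slug rule (lowercase, non-alphanumerics collapsed to underscores) is inferred from the single `marcrest_5250_hydraulics` example, and 1-based page numbering is an assumption. It only computes object paths and does not call any GCS client.

```python
import re
from datetime import date

GCS_PREFIX = "gs://crop-processed-data/providers/marcrest"


def slugify(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def make_document_id(model: str, section: str) -> str:
    return f"marcrest_{slugify(model)}_{slugify(section)}"


def build_metadata(model: str, section: str, page_count: int, processed: date) -> dict:
    """Return a dict matching the metadata.json example above."""
    return {
        "document_id": make_document_id(model, section),
        "provider": "marcrest",
        "filename": f"{model}-{section}.pdf",
        "model_number": model,
        "page_count": page_count,
        "processing_date": processed.isoformat(),
    }


def gcs_object_paths(document_id: str, page_count: int) -> list[str]:
    """List the objects expected under one document's prefix."""
    root = f"{GCS_PREFIX}/{document_id}"
    paths = [f"{root}/metadata.json"]
    # Assumption: pages are numbered starting from 1.
    paths += [f"{root}/pages/page_{n}/page_text.txt" for n in range(1, page_count + 1)]
    return paths
```

For example, `make_document_id("4200-4210", "Twine Boxes, Shields & Railings")` would give `marcrest_4200_4210_twine_boxes_shields_railings` under this slug rule.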

Existing Code Reference

There is an existing scraper prototype:

CROP-John-PythonProject-equiptment/PythonProject-marcrest/

Key files:

  • main.py -- orchestrator (site audit + download)
  • probe_site.py -- site structure analysis
  • download_all_schematics.py -- downloads SVG schematics via Playwright
  • download_marcrest_diagram_pdfs.py -- downloads PDFs from Ricambio site
  • build_offline_database.py -- consolidates data into JSON/SQLite/CSV

Tech stack: Python 3.10+, Playwright (Brave Browser), CDP port 9222, resumable downloads via progress.json.
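
The resumable-download idea can be sketched as follows. The actual schema of the prototype's progress.json is not documented here, so the `{"completed": [...]}` layout, the `Progress` class, and `run_downloads` are all assumptions for illustration rather than a description of the existing code.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable


class Progress:
    """Tracks finished downloads in a JSON file (hypothetical schema)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self.done: set[str] = set()
        if self.path.exists():
            self.done = set(json.loads(self.path.read_text())["completed"])

    def is_done(self, key: str) -> bool:
        return key in self.done

    def mark_done(self, key: str) -> None:
        self.done.add(key)
        # Write to a temp file, then swap it in, so an interrupted run
        # never leaves a half-written progress file behind.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps({"completed": sorted(self.done)}, indent=2))
        os.replace(tmp, self.path)


def run_downloads(keys: Iterable[str], download: Callable[[str], None],
                  progress: Progress) -> list[str]:
    """Download every key not yet recorded; return the keys fetched this run."""
    fetched = []
    for key in keys:
        if progress.is_done(key):
            continue
        download(key)            # may raise; progress is saved only on success
        progress.mark_done(key)
        fetched.append(key)
    return fetched
```

Because progress is saved after each successful item, rerunning after a crash skips everything already fetched and resumes at the first missing section.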

Current MAR Data in CROP

| Data | Count | Status |
|---|---|---|
| Parts in MongoDB | 74 | Present (from DIS) |
| Equipment fitment | 0 | No data |
| Catalogs in MongoDB | 0 | No PDFs |
| Equipment models | 0 | No data |
| DIS API coverage | ~80% | Working |
| CT photos | 66 | Present |

Priority and Estimated Volume

  • Models: ~12 (8 Bale Baron + 4 other products)
  • Sections per model: ~15 (based on 5250)
  • Total documents: ~150-180 sections
  • Priority: Medium -- 74 published parts without fitment data

Post-Scraping Steps (handled by us)

  1. Upload data to GCS (gs://crop-processed-data/providers/marcrest/)
  2. Backfill into MongoDB (catalogs + catalog_pages)
  3. Write fitment parser for MAR in equipment-fitment
  4. Enrich parts via enrich-parts-from-fitment.ts
  5. ES re-sync
