Marcrest Scraping Task
Task: Scrape Marcrest (MAR) Parts Catalogs from Ricambio
[!NOTE] Discussion: Open an issue to comment on this plan. Attach supporting documents to the issue or link them here.
Objective
Download all parts catalog PDFs from the Marcrest Ricambio site and upload to GCS for processing through the existing CROP PDF pipeline.
Source
Platform: Ricambio (ricambio.net) -- standard platform for agricultural equipment parts catalogs. JavaScript-heavy, dynamic content loading.
Equipment Models (full coverage needed)
Bale Baron (main product line)
| Model | Notes |
|---|---|
| 4200-4210 | |
| 4220 | |
| 4230 | |
| 4240 | |
| 4245 | |
| 4250 | |
| 5250 | |
| 6240 | |
Other products
- Bale Handling
- Hydraulic Power Units
- Power Linx
- Swingmax
Site Structure
Hierarchical navigation:
Marcrest Parts Books (root)
+-- Bale Baron
| +-- 4200-4210
| | +-- Baler Hardware
| | +-- Brakes
| | +-- Electrical
| | +-- Endgates
| | +-- Hitch
| | +-- Hydraulics
| | +-- Knotter
| | +-- Loading and Compression
| | +-- Pickup
| | +-- Power Unit
| | +-- PTO Pump Kit
| | +-- Roller Chute and Expellers
| | +-- Twine Boxes, Shields & Railings
| | +-- Wheels, Hubs, and Axles
| +-- 4220
| | +-- ... (similar categories)
| +-- ...
+-- Bale Handling
+-- Hydraulic Power Units
+-- Power Linx
+-- Swingmax
Each category contains:
- Diagram -- SVG/image with position reference numbers
- Parts table -- parts list (ref number, part number, description, qty)
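The hierarchy above can be modelled as a nested mapping and walked to list every leaf the scraper has to visit. This is a minimal sketch, assuming the tree has already been captured (for example by the site-structure probe); `SITE_TREE` is only an excerpt of the tree shown above, and `iter_leaves` is a hypothetical helper name, not part of the existing prototype.

```python
from typing import Iterator

# Excerpt of the navigation tree from this document. The real tree would be
# discovered from the site rather than hard-coded.
SITE_TREE: dict = {
    "Bale Baron": {
        "4200-4210": {"Brakes": {}, "Hydraulics": {}, "Knotter": {}},
    },
    "Swingmax": {},
}


def iter_leaves(tree: dict, path: tuple[str, ...] = ()) -> Iterator[tuple[str, ...]]:
    """Yield the full path to every node that has no children."""
    for name, children in tree.items():
        current = path + (name,)
        if children:
            # Descend into models / categories below this node.
            yield from iter_leaves(children, current)
        else:
            yield current


for leaf in iter_leaves(SITE_TREE):
    print(" / ".join(leaf))
```

Each yielded path is one unit of work: a page with a diagram and a parts table to fetch.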
URL Pattern
pagece5.wplus?ID_COUNT=[page_id]&LN=2&CEPV=Marcrest001&CELN=2&CEME=2&NDS=[node_code]&PRF=[level]&PRC=[breadcrumb]&KPRD=[product_code]
- NDS -- category/model node code (e.g., CE_3 for 5250)
- PRF -- hierarchy level
- PRC -- breadcrumb path
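The URL pattern above can be assembled with the standard library. This is a sketch under stated assumptions: the base URL is not given in this document, so `base_url` is a required argument and the value used below is a placeholder; the format of the breadcrumb value is also unknown, so it is passed through as an opaque string. `build_page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urljoin

# Query parameters that are constant in the pattern above.
FIXED_PARAMS = {"LN": "2", "CEPV": "Marcrest001", "CELN": "2", "CEME": "2"}


def build_page_url(base_url: str, page_id: str, node_code: str,
                   level: str, breadcrumb: str, product_code: str) -> str:
    """Build a catalog page URL following the documented parameter order."""
    params = {
        "ID_COUNT": page_id,
        **FIXED_PARAMS,
        "NDS": node_code,      # category/model node code, e.g. CE_3
        "PRF": level,          # hierarchy level
        "PRC": breadcrumb,     # breadcrumb path (format assumed opaque)
        "KPRD": product_code,
    }
    # urlencode percent-escapes any characters that are unsafe in a query.
    return urljoin(base_url, "pagece5.wplus") + "?" + urlencode(params)


# Placeholder host and illustrative parameter values only.
print(build_page_url("https://example.invalid/", "123", "CE_3", "1", "CE_1", "P1"))
```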
Expected Output
Option A: PDF files (preferred)
If the site provides printable PDFs per section/model -- download them. Ricambio typically has a "Print" button or generates PDFs.
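If a print control does trigger a file download, Option A could be sketched with Playwright's download handling as below. Assumptions: `page` is a Playwright sync-API `Page`, and `print_selector` is a hypothetical selector that has to be found by inspecting the site. If the button opens the browser print dialog instead of starting a download, this approach does not apply.

```python
from pathlib import Path
from typing import Any


def download_pdf(page: Any, print_selector: str, dest_dir: Path) -> Path:
    """Click a print/PDF control and save the file it downloads.

    `page` is expected to behave like a Playwright sync-API Page.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    # expect_download() waits for the download started inside the block.
    with page.expect_download() as download_info:
        page.click(print_selector)
    download = download_info.value
    # Keep the filename suggested by the server.
    target = dest_dir / download.suggested_filename
    download.save_as(target)
    return target
```

With the existing setup (a browser exposing CDP on port 9222), the `page` object would typically come from a browser attached via `chromium.connect_over_cdp("http://localhost:9222")`.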
Option B: Structured data (if no PDFs available)
For each section of each model:
- Diagram image (SVG or PNG) -- schematic with position numbers
- Parts table data as JSON:
{
  "model": "5250",
  "section": "Hydraulics",
  "category": "HYDRAULICS",
  "parts": [
    {
      "ref": "1",
      "partNumber": "ABC-12345",
      "description": "Hydraulic pump",
      "qty": 1
    }
  ]
}
GCS Destination
gs://crop-processed-data/providers/marcrest/{document_id}/
+-- metadata.json
+-- pages/
    +-- page_{N}/
        +-- page_text.txt
metadata.json format (example):
{
  "document_id": "marcrest_5250_hydraulics",
  "provider": "marcrest",
  "filename": "5250-Hydraulics.pdf",
  "model_number": "5250",
  "page_count": 3,
  "processing_date": "2026-02-24"
}
If going with Option B (structured data without PDF) -- just place files in GCS in this format, we will write a parser to read them.
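The layout and metadata format above could be produced with a small helper like the one below. This is a sketch, not the pipeline's actual code. The `document_id` scheme (`marcrest_{model}_{section}`, lowercased with non-alphanumerics collapsed to underscores) is inferred from the single example above, and page numbering starting at 1 is an assumption. `slugify`, `build_metadata`, and `gcs_object_uris` are hypothetical names.

```python
import json
import re
from datetime import date

GCS_PREFIX = "gs://crop-processed-data/providers/marcrest"


def slugify(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_metadata(model: str, section: str, filename: str,
                   page_count: int, processing_date: date) -> dict:
    """Build a metadata.json payload in the format shown above."""
    return {
        # Assumed naming scheme, inferred from the one documented example.
        "document_id": f"marcrest_{slugify(model)}_{slugify(section)}",
        "provider": "marcrest",
        "filename": filename,
        "model_number": model,
        "page_count": page_count,
        "processing_date": processing_date.isoformat(),
    }


def gcs_object_uris(metadata: dict) -> list[str]:
    """List the object URIs expected for one document."""
    root = f"{GCS_PREFIX}/{metadata['document_id']}"
    uris = [f"{root}/metadata.json"]
    # Assumes pages are numbered from 1 to page_count.
    uris += [f"{root}/pages/page_{n}/page_text.txt"
             for n in range(1, metadata["page_count"] + 1)]
    return uris


meta = build_metadata("5250", "Hydraulics", "5250-Hydraulics.pdf", 3, date(2026, 2, 24))
print(json.dumps(meta, indent=2))
for uri in gcs_object_uris(meta):
    print(uri)
```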
Existing Code Reference
There is an existing scraper prototype:
CROP-John-PythonProject-equiptment/PythonProject-marcrest/
Key files:
- main.py -- orchestrator (site audit + download)
- probe_site.py -- site structure analysis
- download_all_schematics.py -- downloads SVG schematics via Playwright
- download_marcrest_diagram_pdfs.py -- downloads PDFs from Ricambio site
- build_offline_database.py -- consolidates data into JSON/SQLite/CSV
Tech stack: Python 3.10+, Playwright (Brave Browser), CDP port 9222, resumable downloads via progress.json.
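Resumable downloads via a progress file can be sketched as follows. The actual schema of progress.json in the prototype is not described here, so the `{"done": [...]}` layout and the function names are assumptions for illustration only.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable


def load_progress(path: Path) -> set[str]:
    """Return the set of item keys already completed (empty if no file)."""
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")).get("done", []))
    return set()


def save_progress(path: Path, done: set[str]) -> None:
    """Write progress via a temp file so a crash cannot truncate it."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"done": sorted(done)}, indent=2), encoding="utf-8")
    os.replace(tmp, path)


def run_resumable(items: Iterable[str], fetch: Callable[[str], None],
                  progress_path: Path) -> list[str]:
    """Fetch every item not yet recorded; record each one after success."""
    done = load_progress(progress_path)
    fetched = []
    for item in items:
        if item in done:
            continue  # completed in an earlier run
        fetch(item)   # an exception here leaves earlier progress on disk
        done.add(item)
        save_progress(progress_path, done)
        fetched.append(item)
    return fetched
```

Saving after every item costs one small write per download but means an interrupted run repeats at most the item that was in flight.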
Current MAR Data in CROP
| Data | Count | Status |
|---|---|---|
| Parts in MongoDB | 74 | Present (from DIS) |
| Equipment fitment | 0 | No data |
| Catalogs in MongoDB | 0 | No PDFs |
| Equipment models | 0 | No data |
| DIS API coverage | ~80% | Working |
| CT photos | 66 | Present |
Priority and Estimated Volume
- Models: ~12 (8 Bale Baron + 4 other products)
- Sections per model: ~15 (based on 5250)
- Total documents: ~150-180 sections
- Priority: Medium -- 74 published parts without fitment data
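The upper end of the volume estimate follows directly from the figures above, as the short calculation below shows. How the lower figure of ~150 was derived is not stated; presumably some models have fewer sections than 5250.

```python
bale_baron_models = ["4200-4210", "4220", "4230", "4240", "4245", "4250", "5250", "6240"]
other_products = ["Bale Handling", "Hydraulic Power Units", "Power Linx", "Swingmax"]
sections_per_model = 15  # approximate, based on 5250

total_models = len(bale_baron_models) + len(other_products)
upper_bound_sections = total_models * sections_per_model

print(total_models, upper_bound_sections)  # 12 180
```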
Post-Scraping Steps (handled by us)
- Upload data to GCS (gs://crop-processed-data/providers/marcrest/)
- Backfill into MongoDB (catalogs + catalog_pages)
- Write fitment parser for MAR in equipment-fitment
- Enrich parts via enrich-parts-from-fitment.ts
- ES re-sync