CLIP Embeddings and Metadata Search - Complete Guide

🎯 How Search Works

1. Vector Search (Semantic Search)

CLIP converts images and text into vectors (embeddings) in a shared space. This enables:

  • Image search: Find similar images
  • Text search: Find images by text description
  • Cross-modal search: Search between images and text

Part photo → CLIP → [0.123, -0.456, 0.789, ...] (768 numbers)
    ↓
Store in Weaviate with metadata

Query "oil filter" → CLIP → [0.234, -0.345, 0.567, ...]
    ↓
Vector comparison (cosine similarity)
    ↓
Top-10 most similar parts
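The comparison step in the flow above can be sketched in plain Python. This is only an illustration of cosine ranking: the three-dimensional vectors and part numbers are made-up stand-ins for real CLIP embeddings, and Weaviate does this work with an approximate index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, catalog, k=10):
    """Return the k (part_number, score) pairs most similar to the query."""
    scored = [(pn, cosine_similarity(query, vec)) for pn, vec in catalog.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Toy vectors standing in for stored part-photo embeddings (hypothetical data)
catalog = {
    "OIL-FILTER-1": [0.9, 0.1, 0.0],
    "BRAKE-PAD-1": [0.0, 0.2, 0.9],
    "OIL-FILTER-2": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of "oil filter"

ranked = top_k(query, catalog, k=2)
for part_number, score in ranked:
    print(f"{part_number}: {score:.3f}")
```

Both oil filters rank ahead of the brake pad because their vectors point in nearly the same direction as the query.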

2. Metadata Filtering

Metadata makes it possible to combine vector search with structured filters. There are three common ways to do this.

A. Hybrid search

A single query combines semantic search (vectors) with metadata filters:

# Example: Find "oil filter" for Toyota Camry
# query_embedding is the CLIP vector for the text "oil filter"
results = weaviate_client.query.get(
    "Part",
    ["part_number", "category", "manufacturer", "model", "price", "photo_url"]
).with_near_vector({
    "vector": query_embedding,
    "certainty": 0.7  # Minimum similarity
}).with_where({
    # Combine both conditions in one filter: calling with_where twice
    # keeps only the last filter
    "operator": "And",
    "operands": [
        {"path": ["manufacturer"], "operator": "Equal", "valueString": "Toyota"},
        {"path": ["model"], "operator": "Equal", "valueString": "Camry 2020"}
    ]
}).with_limit(10).do()

B. Post-filtering

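First find similar parts, then filter them by metadata. Filtering can discard most candidates, so fetch more results than you finally need:

# 1. Vector search (find the 100 parts most similar to "oil filter")
# vector_search stands for a Weaviate near-vector query like the one above
similar_parts = vector_search("oil filter", limit=100)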

# 2. Filter by metadata
filtered = [
    part for part in similar_parts 
    if part["manufacturer"] == "Toyota" 
    and part["category"] == "engine"
    and part["price"] < 50.0
]

C. Pre-filtering

First filter by metadata, then run the vector search only over the matching parts:

# First find all Toyota Camry parts
# (filter_by_metadata and vector_search are placeholder helpers)
toyota_parts = filter_by_metadata({
    "manufacturer": "Toyota",
    "model": "Camry 2020"
})

# Then find nearest to "oil filter" among them
results = vector_search("oil filter", candidates=toyota_parts)
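The practical difference between post-filtering and pre-filtering shows up when the nearest neighbours do not match the filter. The sketch below runs both strategies over a tiny in-memory catalog; all part numbers, vectors, and function names are hypothetical, and a real deployment would run these steps inside Weaviate rather than in Python.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank(query, parts, limit):
    """Return the `limit` parts whose vectors are closest to the query."""
    return sorted(parts, key=lambda p: cosine(query, p["vector"]), reverse=True)[:limit]

def matches(part, filters):
    """True if every filter field equals the part's value for that field."""
    return all(part.get(field) == value for field, value in filters.items())

def post_filter_search(query, parts, filters, limit):
    """Rank everything, keep the top `limit`, then drop non-matching parts."""
    return [p for p in rank(query, parts, limit) if matches(p, filters)]

def pre_filter_search(query, parts, filters, limit):
    """Drop non-matching parts first, then rank what is left."""
    return rank(query, [p for p in parts if matches(p, filters)], limit)

PARTS = [
    {"part_number": "H-1", "manufacturer": "Honda", "vector": [1.0, 0.0]},
    {"part_number": "H-2", "manufacturer": "Honda", "vector": [0.9, 0.1]},
    {"part_number": "T-1", "manufacturer": "Toyota", "vector": [0.7, 0.3]},
    {"part_number": "T-2", "manufacturer": "Toyota", "vector": [0.1, 0.9]},
]
query = [1.0, 0.0]
filters = {"manufacturer": "Toyota"}

post = post_filter_search(query, PARTS, filters, limit=2)
pre = pre_filter_search(query, PARTS, filters, limit=2)
print("post-filter:", [p["part_number"] for p in post])
print("pre-filter: ", [p["part_number"] for p in pre])
```

Here the two closest parts are both Honda, so post-filtering with a limit of 2 returns nothing, while pre-filtering still returns the two Toyota parts. This is why the post-filtering example fetches 100 candidates before filtering.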

📋 Search Types for Parts

1. Text Search

Scenario: User searches for "engine oil filter for Toyota"

import requests

# 1. Convert text to embedding via CLIP
response = requests.post(
    "http://34.139.6.131:8002/embed-text",
    json={"text": "engine oil filter for Toyota"}
)
query_vector = response.json()["embedding"]

# 2. Search in Weaviate with filter
import weaviate

client = weaviate.Client("http://weaviate-service:8002")

results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "model", "price", "photo_url"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.75  # High similarity
    })
    .with_where({
        "path": ["manufacturer"],
        "operator": "Equal",
        "valueString": "Toyota"
    })
    .with_limit(10)
    .do()
)

# Results are already sorted by similarity
for part in results["data"]["Get"]["Part"]:
    print(f"{part['part_number']}: {part['category']} - ${part['price']}")

Result: Parts whose photos are semantically similar to "engine oil filter" and whose manufacturer field is Toyota.

2. Image Search

Scenario: User photographed a part and wants to find similar or identical parts

# 1. Convert photo to embedding
with open('unknown_part.jpg', 'rb') as f:
    files = {'file': ('unknown_part.jpg', f, 'image/jpeg')}
    response = requests.post(
        "http://34.139.6.131:8002/embed",
        files=files
    )
query_vector = response.json()["embedding"]

# 2. Search for similar images in Weaviate
results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "model", "price", "photo_url"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.85  # Very high similarity for visual search
    })
    .with_additional(["certainty"])  # Required to read _additional.certainty below
    .with_limit(5)
    .do()
)

# Show most similar parts
for part in results["data"]["Get"]["Part"]:
    similarity_score = part["_additional"]["certainty"]
    print(f"Similarity: {similarity_score:.2%}")
    print(f"Part: {part['part_number']} - {part['category']}")
    print(f"Photo: {part['photo_url']}\n")

Result: Parts that look similar to the photo.
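The loops above index straight into results["data"]["Get"]["Part"], which fails with an unhelpful error when the query is rejected or returns nothing. A minimal defensive sketch follows, assuming the client hands back the raw GraphQL response as a dict with an optional "errors" list; the helper name extract_parts is hypothetical.

```python
def extract_parts(response, class_name="Part"):
    """Return the list of result objects from a Get query response.

    Raises RuntimeError if the response carries GraphQL errors.
    Returns an empty list when there are no matches.
    """
    errors = response.get("errors")
    if errors:
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise RuntimeError(f"Weaviate query failed: {messages}")
    data = response.get("data") or {}
    return (data.get("Get") or {}).get(class_name) or []

# Sample responses shaped like the ones used in this guide (made-up values)
ok = {"data": {"Get": {"Part": [
    {"part_number": "ABC-12345", "_additional": {"certainty": 0.91}},
]}}}
empty = {"data": {"Get": {"Part": None}}}

for part in extract_parts(ok):
    print(part["part_number"], part["_additional"]["certainty"])
print(len(extract_parts(empty)), "parts in the empty response")
```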

3. Search with Combined Metadata Filters

Scenario: Find "oil filter" for Toyota Camry 2020, price under $50, in stock

# 1. Text query
response = requests.post(
    "http://34.139.6.131:8002/embed-text",
    json={"text": "oil filter"}
)
query_vector = response.json()["embedding"]

# 2. Hybrid search: vector + multiple filters
results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "model", "price", "stock", "photo_url"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.7
    })
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["manufacturer"],
                "operator": "Equal",
                "valueString": "Toyota"
            },
            {
                "path": ["model"],
                "operator": "Equal",
                "valueString": "Camry 2020"
            },
            {
                "path": ["price"],
                "operator": "LessThan",
                "valueNumber": 50.0
            },
            {
                "path": ["stock"],
                "operator": "GreaterThan",
                "valueInt": 0
            }
        ]
    })
    .with_limit(10)
    .do()
)

Result: Only relevant parts that meet all criteria.
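Writing nested filter dictionaries by hand gets error-prone once several conditions are optional. The helper below is a sketch of one way to assemble the same "And" structure from simple arguments; build_where and its parameters are hypothetical names, and the field names and value keys are taken from the example above.

```python
def build_where(equals=None, max_price=None, in_stock=False):
    """Build a where-filter dict in the shape used by the queries above.

    equals    - mapping of text field name to required value
    max_price - if given, keep parts with price below this value
    in_stock  - if True, keep parts with stock greater than zero
    Returns None when no condition was requested.
    """
    operands = []
    for field, value in (equals or {}).items():
        operands.append({"path": [field], "operator": "Equal", "valueString": value})
    if max_price is not None:
        operands.append({"path": ["price"], "operator": "LessThan", "valueNumber": float(max_price)})
    if in_stock:
        operands.append({"path": ["stock"], "operator": "GreaterThan", "valueInt": 0})

    if not operands:
        return None
    if len(operands) == 1:
        return operands[0]  # a single condition needs no And wrapper
    return {"operator": "And", "operands": operands}

where = build_where(
    equals={"manufacturer": "Toyota", "model": "Camry 2020"},
    max_price=50,
    in_stock=True,
)
print(where["operator"], len(where["operands"]))
```

The resulting dict can be passed to with_where in place of the literal filter; when the helper returns None, the with_where call should be skipped.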

4. Search with Metadata Sorting

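Scenario: Find "brake pad", sort by price (lowest first)

# 1. Text query
response = requests.post(
    "http://34.139.6.131:8002/embed-text",
    json={"text": "brake pad"}
)
query_vector = response.json()["embedding"]

# 2. Sort by a metadata field. An explicit sort replaces the default
# similarity ordering, so the certainty threshold alone decides which
# parts are similar enough to be listed. Check that your Weaviate
# version accepts a sort together with a near-vector search.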
results = (
    client.query
    .get("Part", ["part_number", "category", "price", "stock"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.7
    })
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "brake"
    })
    .with_sort([
        {
            "path": ["price"],
            "order": "asc"  # From cheapest to most expensive
        }
    ])
    .with_limit(20)
    .do()
)
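One alternative, assuming you want the similarity search to decide which parts are returned and the price only to decide display order, is to sort the returned list on the client. The sketch below works on plain dictionaries shaped like the query results; sort_by_price is a hypothetical helper and the sample data is made up.

```python
def sort_by_price(parts, descending=False):
    """Order parts by price; parts without a price are placed last."""
    priced = [p for p in parts if p.get("price") is not None]
    unpriced = [p for p in parts if p.get("price") is None]
    return sorted(priced, key=lambda p: p["price"], reverse=descending) + unpriced

# Shaped like results["data"]["Get"]["Part"] from a similarity query
found = [
    {"part_number": "BP-300", "price": 45.00},
    {"part_number": "BP-100", "price": 19.99},
    {"part_number": "BP-900", "price": None},
    {"part_number": "BP-200", "price": 32.50},
]

for part in sort_by_price(found):
    print(part["part_number"], part["price"])
```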

🔍 Metadata Role in Different Scenarios

Scenario 1: Exact Search by Part Number

Task: User knows part number "ABC-12345"

# CLIP not needed here - just exact search by metadata
results = (
    client.query
    .get("Part", ["part_number", "category", "price", "stock"])
    .with_where({
        "path": ["part_number"],
        "operator": "Equal",
        "valueString": "ABC-12345"
    })
    .do()
)

Metadata role: Primary - exact lookup by number.

Scenario 2: Fuzzy Search by Description

Task: "I need an engine filter similar to this one"

# 1. Convert photo to vector
# embed_image is a helper that wraps the /embed request shown earlier
photo_vector = embed_image("part_photo.jpg")

# 2. Search similar with category filter
results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "price"])
    .with_near_vector({
        "vector": photo_vector,
        "certainty": 0.8
    })
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "engine"
    })
    .do()
)

Metadata role: Auxiliary - limits search to "engine" category.

Scenario 3: Recommendations Based on Metadata

Task: "Show me all parts that are frequently bought together with this oil filter"

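# 1. Find current part
# find_part_by_vector is a helper that returns the best match,
# including its metadata and its stored vector
current_part = find_part_by_vector(photo_vector)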

# 2. Use metadata for recommendations
recommendations = (
    client.query
    .get("Part", ["part_number", "category", "price"])
    .with_near_vector({
        "vector": current_part["vector"],
        "certainty": 0.6  # Lower similarity for broader search
    })
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["manufacturer"],
                "operator": "Equal",
                "valueString": current_part["manufacturer"]
            },
            {
                "path": ["category"],
                "operator": "NotEqual",  # Other categories
                "valueString": current_part["category"]
            }
        ]
    })
    .with_limit(5)
    .do()
)

Metadata role: Key - restricts results to parts from the same manufacturer in other categories, as a stand-in for real purchase history.


💡 Practical Recommendations

1. When to Use Vector Search Only?

  • Fuzzy queries: "similar part", "something like this"
  • Visual search: find by photo
  • Semantic search: find by text description

2. When to Add Metadata Filters?

  • Manufacturer/model restrictions
  • Price/stock filters
  • Category restrictions
  • Sorting by business attributes (price, rating, etc.)

3. Important Metadata Fields for Parts

Required for filtering:

  • manufacturer - manufacturer name (e.g. "Toyota")
  • model - vehicle model (e.g. "Camry 2020")
  • category - part category (e.g. "engine", "brake")
  • part_number - part number (for exact search)

Useful for sorting/filtering:

  • price - unit price
  • stock - quantity in stock
  • rating - rating score
  • supplier_id - supplier identifier

For recommendations:

  • compatible_with - other parts this part is compatible with
  • frequently_bought_with - parts often bought together with this one
  • replacement_for - part that this one replaces
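The core fields listed above could be declared in a Weaviate class definition along the following lines. This is a sketch only: the data types are assumptions inferred from the example values in this guide, the recommendation fields are left out because their structure is not specified, and "vectorizer": "none" reflects that embeddings come from the external CLIP service.

```python
# Hypothetical schema for the "Part" class used throughout this guide
part_class = {
    "class": "Part",
    "vectorizer": "none",  # vectors are supplied by the CLIP service, not by Weaviate
    "properties": [
        {"name": "part_number", "dataType": ["text"]},
        {"name": "manufacturer", "dataType": ["text"]},
        {"name": "model", "dataType": ["text"]},
        {"name": "category", "dataType": ["text"]},
        {"name": "description", "dataType": ["text"]},
        {"name": "price", "dataType": ["number"]},  # filtered with valueNumber
        {"name": "stock", "dataType": ["int"]},     # filtered with valueInt
        {"name": "photo_url", "dataType": ["text"]},
    ],
}

property_names = [prop["name"] for prop in part_class["properties"]]
print(property_names)

# With a connected v3 client, the class would be registered once with:
# client.schema.create_class(part_class)
```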

4. Optimizing Certainty Threshold

# High threshold (0.85+) - only very similar
# Use for: exact visual search
certainty = 0.85

# Medium threshold (0.7-0.85) - similar + relevant
# Use for: general search with filters
certainty = 0.75

# Low threshold (0.5-0.7) - broad search
# Use for: recommendations, exploration
certainty = 0.6
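It helps to know what these thresholds mean in terms of raw cosine similarity. With the cosine distance metric, Weaviate defines certainty as (1 + cosine similarity) / 2, so the two scales can be converted as sketched below. The right threshold still depends on the model and the data, so treat the values above as starting points to tune.

```python
def cosine_to_certainty(cosine):
    """Map cosine similarity in [-1, 1] to certainty in [0, 1]."""
    return (1.0 + cosine) / 2.0

def certainty_to_cosine(certainty):
    """Inverse mapping: certainty in [0, 1] back to cosine similarity."""
    return 2.0 * certainty - 1.0

for threshold in (0.85, 0.75, 0.6):
    print(f"certainty {threshold:.2f} -> cosine {certainty_to_cosine(threshold):.2f}")
# certainty 0.85 -> cosine 0.70
# certainty 0.75 -> cosine 0.50
# certainty 0.60 -> cosine 0.20
```

A certainty of 0.5 corresponds to a cosine similarity of 0, that is, unrelated vectors, which is why thresholds much below 0.6 filter out very little.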

📊 Complete Example: Adding a New Part

import requests
import weaviate
import json

# 1. Upload photo and get embedding
with open('new_part.jpg', 'rb') as f:
    files = {'file': ('new_part.jpg', f, 'image/jpeg')}
    metadata = {
        "part_number": "ABC-12345",
        "category": "engine",
        "manufacturer": "Toyota",
        "model": "Camry 2020",
        "description": "Engine oil filter",
        "price": 29.99,
        "stock": 10
    }
    data = {'metadata': json.dumps(metadata)}
    
    response = requests.post(
        'http://34.139.6.131:8002/embed',
        files=files,
        data=data
    )

response.raise_for_status()  # Stop here if the embedding service returned an error
result = response.json()
embedding_vector = result["embedding"]
part_metadata = result["metadata"]

# 2. Save to Weaviate
client = weaviate.Client("http://weaviate-service:8002")

client.data_object.create(
    data_object={
        **part_metadata,
        "photo_url": "https://storage.googleapis.com/parts/new_part.jpg"
    },
    class_name="Part",
    vector=embedding_vector  # CLIP embedding for search
)

print("✅ Part added and ready for search!")
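Running the workflow above twice for the same part would store two objects, because each create call gets a fresh random id. One possible safeguard, sketched here under the assumption that part_number is unique, is to derive a deterministic UUID from the part number and pass it as the uuid argument of data_object.create. The namespace string and the part_uuid helper are hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
PART_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "parts.example.com")

def part_uuid(part_number):
    """Derive a stable UUID from a part number (case and padding ignored)."""
    return str(uuid.uuid5(PART_NAMESPACE, part_number.strip().upper()))

first = part_uuid("ABC-12345")
second = part_uuid(" abc-12345 ")
print(first == second)

# Usage with the workflow above (not executed here):
# client.data_object.create(
#     data_object={...},
#     class_name="Part",
#     uuid=part_uuid(part_metadata["part_number"]),
#     vector=embedding_vector,
# )
```

Because the id is derived from the part number, a repeated upload targets the same object id instead of silently adding a second copy.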

🎯 Conclusion

Metadata helps:

  1. Exact filtering - limit search by specific criteria
  2. Hybrid search - combine semantic and structured search
  3. Sorting - order results by price, rating, etc.
  4. Business logic - check availability, price, compatibility
  5. Recommendations - find related parts

CLIP embeddings help:

  1. Semantic search - find by text description
  2. Visual search - find by photo
  3. Fuzzy search - find similar parts

Together they form a search system that finds parts by photo or by description while respecting business constraints such as price, stock, and compatibility.
