CLIP Embeddings and Metadata Search - Complete Guide
🎯 How Search Works
1. Vector Search (Semantic Search)
CLIP converts images and text into vectors (embeddings) in a shared space. This enables:
- Image search: Find similar images
- Text search: Find images by text description
- Cross-modal search: Search between images and text
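Because text and images land in the same space, all three search modes reduce to one operation: rank stored vectors by their cosine similarity to the query vector. The following is a minimal pure-Python sketch of that ranking step. The toy 3-dimensional vectors and the part IDs are made up for illustration; a real deployment would delegate this to the vector database rather than scan a dictionary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query, catalog, k=10):
    """Return the k (part_id, score) pairs most similar to the query."""
    scored = [(part_id, cosine_similarity(query, vec)) for part_id, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real CLIP embeddings (hypothetical data).
catalog = {
    "oil-filter": [0.9, 0.1, 0.0],
    "air-filter": [0.7, 0.3, 0.1],
    "brake-pad": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
ranking = top_k(query, catalog, k=2)
```

The same function serves text queries and photo queries, since only the source of the query vector differs.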
Part photo → CLIP → [0.123, -0.456, 0.789, ...] (768 numbers)
↓
Store in Weaviate with metadata
↓
Query "oil filter" → CLIP → [0.234, -0.345, 0.567, ...]
↓
Vector comparison (cosine similarity)
↓
Top-10 most similar parts

2. How Metadata Helps in Search
Metadata allows combining vector search with structured filters:
A. Hybrid Search
Combines semantic search (vectors) with metadata filters:
# Example: Find "oil filter" for Toyota Camry
# Note: a second .with_where() call replaces the first one, so both
# conditions are combined in a single "And" filter.
results = (
    weaviate_client.query
    .get("Part", ["part_number", "category", "manufacturer", "model", "price", "photo_url"])
    .with_near_vector({
        "vector": query_embedding,  # Vector from "oil filter"
        "certainty": 0.7            # Minimum similarity
    })
    .with_where({
        "operator": "And",
        "operands": [
            {"path": ["manufacturer"], "operator": "Equal", "valueString": "Toyota"},
            {"path": ["model"], "operator": "Equal", "valueString": "Camry 2020"}
        ]
    })
    .with_limit(10)
    .do()
)

B. Post-filtering
First find similar, then filter:
# 1. Vector search (find all similar to "oil filter")
similar_parts = vector_search("oil filter", limit=100)
# 2. Filter by metadata
filtered = [
    part for part in similar_parts
    if part["manufacturer"] == "Toyota"
    and part["category"] == "engine"
    and part["price"] < 50.0
]

C. Pre-filtering
First filter by metadata, then search among them:
# First find all Toyota Camry parts
toyota_parts = filter_by_metadata({
    "manufacturer": "Toyota",
    "model": "Camry 2020"
})
# Then find nearest to "oil filter" among them
results = vector_search("oil filter", candidates=toyota_parts)

📋 Search Types for Parts
1. Text Query Search
Scenario: User searches for "engine oil filter for Toyota"
import requests
import weaviate

# 1. Convert text to embedding via CLIP
response = requests.post(
    "http://34.139.6.131:8002/embed-text",
    json={"text": "engine oil filter for Toyota"}
)
query_vector = response.json()["embedding"]

# 2. Search in Weaviate with filter
client = weaviate.Client("http://weaviate-service:8002")
results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "model", "price", "photo_url"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.75  # High similarity
    })
    .with_where({
        "path": ["manufacturer"],
        "operator": "Equal",
        "valueString": "Toyota"
    })
    .with_limit(10)
    .do()
)

# Results are already sorted by similarity
for part in results["data"]["Get"]["Part"]:
    print(f"{part['part_number']}: {part['category']} - ${part['price']}")

Result: Parts that are visually/semantically similar to "engine oil filter" and suitable for Toyota.
2. Image Search
Scenario: User photographed a part and wants to find similar or identical parts
# 1. Convert photo to embedding (the request must run while the file is open)
with open('unknown_part.jpg', 'rb') as f:
    files = {'file': ('unknown_part.jpg', f, 'image/jpeg')}
    response = requests.post(
        "http://34.139.6.131:8002/embed",
        files=files
    )
query_vector = response.json()["embedding"]

# 2. Search for similar images in Weaviate
results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "model", "price", "photo_url"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.85  # Very high similarity for visual search
    })
    .with_additional(["certainty"])  # Needed to read _additional.certainty below
    .with_limit(5)
    .do()
)

# Show most similar parts
for part in results["data"]["Get"]["Part"]:
    similarity_score = part["_additional"]["certainty"]
    print(f"Similarity: {similarity_score:.2%}")
    print(f"Part: {part['part_number']} - {part['category']}")
    print(f"Photo: {part['photo_url']}\n")

Result: Parts that look similar to the photo.
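When reading `_additional.certainty`, it helps to know what the number means. With the cosine distance metric, Weaviate defines certainty as (1 + cosine similarity) / 2, so a certainty of 0.85 corresponds to a cosine similarity of 0.7. The sketch below inverts that mapping and pulls matches out of a response defensively. The helper names and the sample response (including the part number "XYZ-00001") are illustrative assumptions, not part of the service described here.

```python
def certainty_to_cosine(certainty):
    """Invert certainty = (1 + cosine_similarity) / 2 (valid for cosine distance only)."""
    return 2.0 * certainty - 1.0


def extract_matches(results, class_name="Part"):
    """Flatten a Get response into a list of dicts, tolerating empty results."""
    if "errors" in results:
        raise RuntimeError(f"Weaviate query failed: {results['errors']}")
    parts = results.get("data", {}).get("Get", {}).get(class_name) or []
    matches = []
    for part in parts:
        certainty = (part.get("_additional") or {}).get("certainty")
        matches.append({
            "part_number": part.get("part_number"),
            "certainty": certainty,
            "cosine": None if certainty is None else certainty_to_cosine(certainty),
        })
    return matches


# Hypothetical response shaped like the output of .do()
sample = {"data": {"Get": {"Part": [
    {"part_number": "ABC-12345", "_additional": {"certainty": 0.9}},
    {"part_number": "XYZ-00001", "_additional": {"certainty": 0.85}},
]}}}
matches = extract_matches(sample)
```

Checking for an `errors` key first avoids a confusing `KeyError` on `results["data"]` when the query itself was rejected.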
3. Search with Combined Metadata Filters
Scenario: Find "oil filter" for Toyota Camry 2020, price under $50, in stock
# 1. Text query
response = requests.post(
    "http://34.139.6.131:8002/embed-text",
    json={"text": "oil filter"}
)
query_vector = response.json()["embedding"]

# 2. Hybrid search: vector + multiple filters
results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "model", "price", "stock", "photo_url"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.7
    })
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["manufacturer"],
                "operator": "Equal",
                "valueString": "Toyota"
            },
            {
                "path": ["model"],
                "operator": "Equal",
                "valueString": "Camry 2020"
            },
            {
                "path": ["price"],
                "operator": "LessThan",
                "valueNumber": 50.0
            },
            {
                "path": ["stock"],
                "operator": "GreaterThan",
                "valueInt": 0
            }
        ]
    })
    .with_limit(10)
    .do()
)

Result: Only relevant parts that meet all criteria.
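Writing nested filter dictionaries by hand gets error-prone as criteria accumulate, especially the choice between `valueString`, `valueNumber`, and `valueInt`. One possible approach is a pair of small helpers that pick the value key from the Python type and wrap several conditions in an "And". The helper names `condition` and `all_of` are hypothetical; only the resulting dictionary shape follows the filters used in this guide.

```python
def _value_key(value):
    """Choose the Weaviate filter value key that matches the Python type."""
    if isinstance(value, bool):  # bool must be tested before int
        return "valueBoolean"
    if isinstance(value, int):
        return "valueInt"
    if isinstance(value, float):
        return "valueNumber"
    if isinstance(value, str):
        return "valueString"
    raise TypeError(f"Unsupported filter value type: {type(value).__name__}")


def condition(field, operator, value):
    """Build a single where-filter clause for one metadata field."""
    return {"path": [field], "operator": operator, _value_key(value): value}


def all_of(*conditions):
    """Combine clauses with And; a single clause is returned unwrapped."""
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"operator": "And", "operands": list(conditions)}


where = all_of(
    condition("manufacturer", "Equal", "Toyota"),
    condition("model", "Equal", "Camry 2020"),
    condition("price", "LessThan", 50.0),
    condition("stock", "GreaterThan", 0),
)
```

The resulting `where` dictionary can be passed to `.with_where(where)` unchanged.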
4. Search with Metadata Sorting
Scenario: Find "brake pad", sort by price (lowest first)
# Weaviate supports sorting by metadata fields.
# Note: an explicit sort replaces the similarity ordering of the vector search.
# query_vector is the embedding of "brake pad", obtained as in the examples above.
results = (
    client.query
    .get("Part", ["part_number", "category", "price", "stock"])
    .with_near_vector({
        "vector": query_vector,
        "certainty": 0.7
    })
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "brake"
    })
    .with_sort([
        {
            "path": ["price"],
            "order": "asc"  # From cheapest to most expensive
        }
    ])
    .with_limit(20)
    .do()
)

🔍 Metadata Role in Different Scenarios
Scenario 1: Exact Part Search
Task: User knows part number "ABC-12345"
# CLIP not needed here - just exact search by metadata
results = (
    client.query
    .get("Part", ["part_number", "category", "price", "stock"])
    .with_where({
        "path": ["part_number"],
        "operator": "Equal",
        "valueString": "ABC-12345"
    })
    .do()
)

Metadata role: Primary - exact lookup by number.
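In practice a single search box receives both part numbers and free-text descriptions, so something has to decide which path a query takes. A minimal sketch of such a router is shown below. The regular expression is an assumption modeled on the "ABC-12345" example and would need to match the catalog's real numbering scheme; `choose_search_mode` is a hypothetical name.

```python
import re

# Assumed part-number shape: 2-4 letters, a hyphen, 4-6 digits (e.g. "ABC-12345").
PART_NUMBER_RE = re.compile(r"^[A-Z]{2,4}-\d{4,6}$")


def choose_search_mode(query):
    """Return ("exact", part_number) or ("vector", text) for a raw user query."""
    text = query.strip()
    normalized = text.upper()
    if PART_NUMBER_RE.match(normalized):
        return ("exact", normalized)  # Skip CLIP, filter on part_number
    return ("vector", text)           # Embed the text and run a vector search


mode_a = choose_search_mode(" abc-12345 ")
mode_b = choose_search_mode("engine oil filter")
```

Routing exact lookups away from CLIP saves an embedding call and avoids returning merely similar parts when the user asked for one specific item.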
Scenario 2: Fuzzy Search by Description
Task: "I need an engine filter similar to this one"
# 1. Convert photo to vector
photo_vector = embed_image("part_photo.jpg")

# 2. Search similar with category filter
results = (
    client.query
    .get("Part", ["part_number", "category", "manufacturer", "price"])
    .with_near_vector({
        "vector": photo_vector,
        "certainty": 0.8
    })
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "engine"
    })
    .do()
)

Metadata role: Auxiliary - limits search to "engine" category.
Scenario 3: Recommendations Based on Metadata
Task: "Show me all parts that are frequently bought together with this oil filter"
# 1. Find current part
current_part = find_part_by_vector(photo_vector)

# 2. Use metadata for recommendations
recommendations = (
    client.query
    .get("Part", ["part_number", "category", "price"])
    .with_near_vector({
        "vector": current_part["vector"],
        "certainty": 0.6  # Lower similarity for broader search
    })
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["manufacturer"],
                "operator": "Equal",
                "valueString": current_part["manufacturer"]
            },
            {
                "path": ["category"],
                "operator": "NotEqual",  # Other categories
                "valueString": current_part["category"]
            }
        ]
    })
    .with_limit(5)
    .do()
)

Metadata role: Key - finds parts from the same manufacturer but different categories.
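The query above approximates "bought together" with same-manufacturer parts from other categories. If actual co-purchase data is stored on each part, the recommendation can instead be driven by metadata alone. The sketch below assumes a `frequently_bought_with` field holding a list of part numbers (the field name appears among the recommended fields in this guide, but its type is an assumption) and builds an "Or" filter over those numbers. `bought_together_filter` is a hypothetical helper.

```python
def bought_together_filter(current_part):
    """Build a where filter matching every part listed as bought with this one."""
    related = current_part.get("frequently_bought_with") or []
    operands = [
        {"path": ["part_number"], "operator": "Equal", "valueString": number}
        for number in related
    ]
    if not operands:
        return None  # No co-purchase data: fall back to the vector query
    if len(operands) == 1:
        return operands[0]
    return {"operator": "Or", "operands": operands}


# Hypothetical part record; "XYZ-00001" and "XYZ-00002" are made-up numbers.
current = {
    "part_number": "ABC-12345",
    "frequently_bought_with": ["XYZ-00001", "XYZ-00002"],
}
together = bought_together_filter(current)
```

When the helper returns `None`, the similarity-based query remains a reasonable fallback.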
💡 Practical Recommendations
1. When to Use Vector Search Only?
- Fuzzy queries: "similar part", "something like this"
- Visual search: find by photo
- Semantic search: find by text description
2. When to Add Metadata Filters?
- Manufacturer/model restrictions
- Price/stock filters
- Category restrictions
- Sorting by a business attribute (price, rating, etc.) instead of similarity
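Where the filter is applied matters as much as whether it is applied. The toy comparison below illustrates the post-filtering and pre-filtering strategies described earlier on an in-memory list. Each part carries a precomputed `score` standing in for its similarity to the query; all data and function names are made up for illustration.

```python
# Hypothetical catalog: "score" stands in for similarity to the query vector.
parts = [
    {"part_number": "P1", "manufacturer": "Honda", "score": 0.95},
    {"part_number": "P2", "manufacturer": "Honda", "score": 0.90},
    {"part_number": "P3", "manufacturer": "Toyota", "score": 0.85},
    {"part_number": "P4", "manufacturer": "Toyota", "score": 0.80},
]


def by_score(items):
    """Order parts from most to least similar."""
    return sorted(items, key=lambda p: p["score"], reverse=True)


def post_filter(items, predicate, candidates, limit):
    """Take the nearest `candidates` first, then drop those failing the filter."""
    nearest = by_score(items)[:candidates]
    return [p for p in nearest if predicate(p)][:limit]


def pre_filter(items, predicate, limit):
    """Apply the filter first, then rank only the allowed parts."""
    allowed = [p for p in items if predicate(p)]
    return by_score(allowed)[:limit]


def is_toyota(part):
    return part["manufacturer"] == "Toyota"


post = post_filter(parts, is_toyota, candidates=2, limit=2)
pre = pre_filter(parts, is_toyota, limit=2)
```

With a candidate pool of two, post-filtering returns nothing because both nearest parts are Honda, while pre-filtering still finds both Toyota parts. Post-filtering therefore needs a generous candidate limit when the filter is selective.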
3. Important Metadata Fields for Parts
Required for filtering:
- manufacturer: manufacturer
- model: vehicle model
- category: part category
- part_number: part number (for exact search)
Useful for sorting/filtering:
- price: price
- stock: quantity in stock
- rating: rating
- supplier_id: supplier
For recommendations:
- compatible_with: compatibility with other parts
- frequently_bought_with: frequently bought together
- replacement_for: replacement for
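These fields only become filterable once the "Part" class declares them with suitable data types. The following sketch shows one way the class definition might look as a Python dictionary. The data types chosen for each field, the list types for the recommendation fields, and the "field" tokenization for exact matching (which assumes a reasonably recent Weaviate version) are all assumptions, since the guide does not show its schema.

```python
def prop(name, data_type, exact=False):
    """Describe one property; exact=True keeps the whole value as a single token."""
    spec = {"name": name, "dataType": [data_type]}
    if exact:
        spec["tokenization"] = "field"  # Assumption: supported by the server version
    return spec


part_class = {
    "class": "Part",
    "vectorizer": "none",                         # Vectors come from the CLIP service
    "vectorIndexConfig": {"distance": "cosine"},  # Certainty requires cosine distance
    "properties": [
        prop("part_number", "text", exact=True),
        prop("manufacturer", "text", exact=True),
        prop("model", "text", exact=True),
        prop("category", "text", exact=True),
        prop("price", "number"),
        prop("stock", "int"),
        prop("rating", "number"),
        prop("supplier_id", "text", exact=True),
        prop("photo_url", "text"),
        prop("compatible_with", "text[]"),
        prop("frequently_bought_with", "text[]"),
        prop("replacement_for", "text[]"),
    ],
}

# With the v3 Python client this would be registered roughly as:
# client.schema.create_class(part_class)
```

Declaring `price` as a number and `stock` as an int matches the `valueNumber` and `valueInt` keys used in the filter examples.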
4. Optimizing Certainty Threshold
# High threshold (0.85+) - only very similar
# Use for: exact visual search
certainty = 0.85
# Medium threshold (0.7-0.85) - similar + relevant
# Use for: general search with filters
certainty = 0.75
# Low threshold (0.5-0.7) - broad search
# Use for: recommendations, exploration
certainty = 0.6

📊 Complete Example: Adding New Part Workflow
import json
import requests
import weaviate

# 1. Upload photo and get embedding (the request must run while the file is open)
with open('new_part.jpg', 'rb') as f:
    files = {'file': ('new_part.jpg', f, 'image/jpeg')}
    metadata = {
        "part_number": "ABC-12345",
        "category": "engine",
        "manufacturer": "Toyota",
        "model": "Camry 2020",
        "description": "Engine oil filter",
        "price": 29.99,
        "stock": 10
    }
    data = {'metadata': json.dumps(metadata)}
    response = requests.post(
        'http://34.139.6.131:8002/embed',
        files=files,
        data=data
    )
result = response.json()
embedding_vector = result["embedding"]
part_metadata = result["metadata"]

# 2. Save to Weaviate
client = weaviate.Client("http://weaviate-service:8002")
client.data_object.create(
    data_object={
        **part_metadata,
        "photo_url": "https://storage.googleapis.com/parts/new_part.jpg"
    },
    class_name="Part",
    vector=embedding_vector  # CLIP embedding for search
)
print("✅ Part added and ready for search!")

🎯 Conclusion
Metadata helps:
- ✅ Exact filtering - limit search by specific criteria
- ✅ Hybrid search - combine semantic and structured search
- ✅ Sorting - order results by price, rating, etc.
- ✅ Business logic - check availability, price, compatibility
- ✅ Recommendations - find related parts
CLIP embeddings help:
- ✅ Semantic search - find by text description
- ✅ Visual search - find by photo
- ✅ Fuzzy search - find similar parts
Together they form a search system that can find parts by photo or by description while still honoring business constraints such as price, stock, and compatibility.