ADR 001: Search Service Retry Strategy
Status: Accepted
Date: 2024-12-04
Authors: Search Service Team
Related PRs: #TBD
Context and Problem Statement
The search service API is a critical dependency for the frontend application. Transient network failures and temporary backend issues (5xx errors, timeouts) can cause user-facing errors even when the backend is generally healthy. We need a strategy to improve resilience against these temporary failures without impacting user experience.
Key Requirements:
- Minimize user-visible errors from transient failures
- Maintain acceptable response times (p95 < 500ms for critical paths)
- Avoid thundering herd effect during backend recovery
- Differentiate between retryable (5xx, timeouts) and non-retryable (4xx) errors
Decision Drivers
- User Experience: Reduce error rates for critical operations (part detail pages, search)
- Backend Load: Prevent overwhelming recovering services with retry storms
- Latency Sensitivity: Different endpoints have different latency requirements
- Error Types: 4xx errors (client mistakes) should never be retried
Considered Options
Option 1: No Retry (Status Quo)
Pros:
- Simplest implementation
- Predictable latency
- No risk of retry storms
Cons:
- ❌ High error rate from transient failures
- ❌ Poor user experience during temporary backend issues
- ❌ No differentiation between recoverable and fatal errors
Option 2: Simple Retry (Fixed Delay)
Pros:
- Easy to implement
- Bounded retry count
Cons:
- ❌ Thundering herd problem (all clients retry simultaneously)
- ❌ Fixed delay doesn't adapt to backend recovery time
- ❌ Still causes spikes during recovery
Option 3: Exponential Backoff with Jitter (Selected)
Pros:
- ✅ Prevents thundering herd via randomized delays
- ✅ Adapts to backend recovery (longer delays for sustained failures)
- ✅ Industry standard pattern (AWS SDK, Stripe, etc.)
- ✅ Bounded retry count and maximum delay
Cons:
- More complex implementation
- Higher maximum latency for sustained failures
Option 4: Circuit Breaker Pattern
Pros:
- Protects backend during extended outages
- Fast failure after threshold
Cons:
- ❌ Overly complex for current needs
- ❌ Requires distributed state or coordination
- ❌ May prematurely stop retries during intermittent issues
Decision
Implement exponential backoff with jitter for critical read endpoints.
Configuration
```typescript
{
  maxAttempts: 3,       // Total 3 attempts (1 initial + 2 retries)
  initialDelayMs: 100,  // Start with 100ms delay
  backoffMultiplier: 3, // Aggressive backoff (100ms → 300ms across the two retries)
  jitter: true          // ±25% randomization to prevent thundering herd
}
```

Retry Conditions
Retry Only:
- HTTP 5xx server errors (500-599)
- Timeout errors (AbortError from fetch timeout)
Never Retry:
- HTTP 4xx client errors (400-499) - these indicate client mistakes
- Successful responses (2xx, 3xx)
- Validation errors (our own schema validation failures)
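To make the classification concrete, a minimal check might look like the sketch below. It assumes failed responses are thrown as errors carrying a numeric `status` field and that fetch timeouts surface as `AbortError`; the real `isRetryableError()` in `retry.ts` may differ in detail.

```typescript
// Sketch only: assumes HTTP failures are thrown as errors with a numeric
// `status` property and that timeouts surface as a DOMException "AbortError".
function isRetryableError(error: unknown): boolean {
  // Fetch timeouts (AbortError) are transient and worth retrying.
  if (error instanceof DOMException && error.name === "AbortError") {
    return true;
  }
  // 5xx server errors are retryable; 4xx client errors, validation errors,
  // and anything else are not.
  const status = (error as { status?: number })?.status;
  return typeof status === "number" && status >= 500 && status <= 599;
}
```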
Endpoint Coverage
✅ Retry Enabled
- `getPartById` - Critical for part detail pages
- `getPartBySlug` - Alternative part lookup
- `searchParts` - Main catalog search
- `fetchFilters` - Filter data for search UI
❌ No Retry
- `fetchAutocomplete` - Latency-critical, fast-fail preferred
- `health` - Diagnostic endpoint, should reflect real-time status
Rationale for Selective Retry:
- Part detail pages are critical user journeys where retry improves UX significantly
- Search results are important but can tolerate fast-fail if backend is down
- Filters are loaded async and benefit from retry without blocking UI
- Autocomplete requires <200ms response time - retry would degrade UX
- Health checks should reflect actual backend state, not retry-masked availability
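One way to encode this selective coverage is a simple per-endpoint default map, as in the hypothetical sketch below (the endpoint names mirror the lists above; the map itself and how it is wired into the client are assumptions, not the current implementation):

```typescript
// Hypothetical per-endpoint retry defaults; the real client may wire this
// differently (e.g. passing `retry: true` at each call site instead).
const RETRY_ENABLED_BY_ENDPOINT: Record<string, boolean> = {
  getPartById: true,        // critical part detail pages
  getPartBySlug: true,      // alternative part lookup
  searchParts: true,        // main catalog search
  fetchFilters: true,       // async filter data, non-blocking
  fetchAutocomplete: false, // <200ms budget: fail fast instead
  health: false,            // must reflect real backend state
};
```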
Consequences
Positive
- ✅ Reduced error rate: Transient failures masked from users
- ✅ Better UX: Users don't see errors for temporary backend blips
- ✅ No thundering herd: Jitter prevents simultaneous retry storms
- ✅ Graceful degradation: Exponential backoff adapts to backend recovery time
Negative
- ⚠️ Increased latency on failures: The full retry path adds ~400ms of backoff delay (100ms + 300ms, ±25% jitter) on top of the time spent on the failed attempts themselves
- ⚠️ Hidden backend issues: Retry may mask systematic problems in metrics
- ⚠️ Increased backend load: Each retry adds 1-2 additional requests for failed operations
Mitigation Strategies
- Monitor retry metrics (see the sketch after this list):
  - Track retry attempt rate (should be <1% of requests)
  - Alert if retry rate >5% (indicates backend issues)
  - Log final failures after exhausted retries
- Latency monitoring:
  - Track p95/p99 latency separately for retried vs non-retried requests
  - Alert if p95 >500ms for critical endpoints
- Backend correlation:
  - Include `X-Retry-Count` header in retry attempts
  - Backend can differentiate retry traffic for debugging
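As a sketch of the monitoring hook mentioned in the first item: this assumes the retry wrapper exposes (or is extended with) an `onRetry` callback and that `metrics` and `logger` clients with these method names are available, none of which is confirmed by the current implementation.

```typescript
// Hypothetical observability hook around the retry wrapper; `onRetry`,
// `metrics`, and `logger` are assumed names, not existing APIs.
async function searchPartsWithMetrics(query: string) {
  try {
    return await retry(() => searchParts(query), {
      onRetry: ({ attempt, error }: { attempt: number; error: unknown }) => {
        metrics.increment("search_service.retry_attempt", { endpoint: "searchParts" });
        logger.warn("retrying searchParts", { attempt, error: String(error) });
      },
    });
  } catch (error) {
    // Final failure after exhausting retries: record it so backend issues stay visible.
    metrics.increment("search_service.retry_exhausted", { endpoint: "searchParts" });
    throw error;
  }
}
```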
Implementation Details
Core Retry Logic
Located in lib/search-service/retry.ts:
- `isRetryableError()` - Determines if error should be retried
- `retry()` - Main retry wrapper with exponential backoff
- `sleep()` - AbortSignal-aware delay with memory leak prevention
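A minimal sketch of how these helpers might fit together under the configuration above, reusing the `isRetryableError()` classifier sketched earlier; it is illustrative, not a copy of the code in `retry.ts`:

```typescript
interface RetryOptions {
  maxAttempts?: number;      // total attempts, including the initial one
  initialDelayMs?: number;
  backoffMultiplier?: number;
  signal?: AbortSignal;      // caller's timeout/abort signal
}

// AbortSignal-aware delay; removes its listener on completion to avoid leaks.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(new DOMException("Aborted", "AbortError"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new DOMException("Aborted", "AbortError"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function retry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const { maxAttempts = 3, initialDelayMs = 100, backoffMultiplier = 3, signal } = opts;
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Give up on non-retryable errors or once attempts are exhausted.
      if (!isRetryableError(error) || attempt === maxAttempts - 1) {
        throw error;
      }
      // Exponential backoff with ±25% jitter (nominally 100ms, then 300ms).
      const base = initialDelayMs * backoffMultiplier ** attempt;
      await sleep(base * (0.75 + Math.random() * 0.5), signal);
    }
  }
  throw lastError; // unreachable with maxAttempts >= 1, kept for type safety
}
```

With the defaults this yields at most two backoff delays (nominally 100ms and 300ms), consistent with the `maxAttempts: 3` configuration in the Decision section.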
HTTP Client Integration
Located in lib/search-service/http-client.ts:
- Retry wrapper applied at request level via `options.retry`
- Timeout signal properly chained through retry attempts
- Cleanup logic prevents event listener memory leaks
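A hypothetical call site, assuming the client exposes a `searchServiceFetch`-style helper whose options include `retry` and `signal` (the helper name and option shape are assumptions, not the actual `http-client.ts` API):

```typescript
// Hypothetical usage; `searchServiceFetch` and its option names are assumed.
async function loadPart(partId: string) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 2_000); // overall budget
  try {
    // `retry: true` opts into the backoff policy; the same signal is chained
    // through every attempt so the overall timeout still applies.
    return await searchServiceFetch(`/parts/${partId}`, {
      retry: true,
      signal: controller.signal,
    });
  } finally {
    clearTimeout(timeout);
  }
}
```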
Test Coverage
Located in lib/search-service/retry.test.ts:
- 13 unit tests covering all retry scenarios
- Includes timing tests for backoff verification
- Tests abort signal handling and jitter variance
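For illustration, a timing-oriented test in the style described above might look like the sketch below (assuming Vitest and the `retry()` sketch from Implementation Details; the actual tests in `retry.test.ts` may be structured differently):

```typescript
import { describe, expect, it, vi } from "vitest"; // assumed test runner

describe("retry", () => {
  it("retries a 5xx failure after a backoff delay", async () => {
    vi.useFakeTimers();
    const fn = vi
      .fn()
      .mockRejectedValueOnce({ status: 503 }) // first attempt fails
      .mockResolvedValueOnce("ok");           // second attempt succeeds

    const promise = retry(fn, { maxAttempts: 3, initialDelayMs: 100, backoffMultiplier: 3 });
    await vi.advanceTimersByTimeAsync(200); // covers the 100ms delay ±25% jitter

    await expect(promise).resolves.toBe("ok");
    expect(fn).toHaveBeenCalledTimes(2);
    vi.useRealTimers();
  });
});
```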
References
- AWS SDK Retry Strategy
- Stripe API Retry Documentation
- Google Cloud Retry Best Practices
- Martin Fowler: Circuit Breaker Pattern
Future Considerations
- Adaptive backoff: Adjust multiplier based on backend recovery patterns
- Per-endpoint configuration: Different retry configs for different endpoint types
- Client-side caching: Reduce retry need by caching successful responses
- Circuit breaker: Add circuit breaker pattern if retry rate consistently high
- Request deduplication: Prevent duplicate retries from concurrent requests