ADR 001: Search Service Retry Strategy
Status: Accepted
Date: 2024-12-04
Authors: Search Service Team
Related PRs: #TBD
Context and Problem Statement
The search service API is a critical dependency for the frontend application. Transient network failures and temporary backend issues (5xx errors, timeouts) can cause user-facing errors even when the backend is generally healthy. We need a strategy to improve resilience against these temporary failures without impacting user experience.
Key Requirements:
- Minimize user-visible errors from transient failures
- Maintain acceptable response times (p95 < 500ms for critical paths)
- Avoid thundering herd effect during backend recovery
- Differentiate between retryable (5xx, timeouts) and non-retryable (4xx) errors
Decision Drivers
- User Experience: Reduce error rates for critical operations (part detail pages, search)
- Backend Load: Prevent overwhelming recovering services with retry storms
- Latency Sensitivity: Different endpoints have different latency requirements
- Error Types: 4xx errors (client mistakes) should never be retried
Considered Options
Option 1: No Retry (Status Quo)
Pros:
- Simplest implementation
- Predictable latency
- No risk of retry storms
Cons:
- ❌ High error rate from transient failures
- ❌ Poor user experience during temporary backend issues
- ❌ No differentiation between recoverable and fatal errors
Option 2: Simple Retry (Fixed Delay)
Pros:
- Easy to implement
- Bounded retry count
Cons:
- ❌ Thundering herd problem (all clients retry simultaneously)
- ❌ Fixed delay doesn't adapt to backend recovery time
- ❌ Still causes spikes during recovery
Option 3: Exponential Backoff with Jitter (Selected)
Pros:
- ✅ Prevents thundering herd via randomized delays
- ✅ Adapts to backend recovery (longer delays for sustained failures)
- ✅ Industry standard pattern (AWS SDK, Stripe, etc.)
- ✅ Bounded retry count and maximum delay
Cons:
- More complex implementation
- Higher maximum latency for sustained failures
Option 4: Circuit Breaker Pattern
Pros:
- Protects backend during extended outages
- Fast failure after threshold
Cons:
- ❌ Overly complex for current needs
- ❌ Requires distributed state or coordination
- ❌ May prematurely stop retries during intermittent issues
Decision
Implement exponential backoff with jitter for critical read endpoints.
Configuration
```typescript
{
  maxAttempts: 3,       // Total 3 attempts (1 initial + 2 retries)
  initialDelayMs: 100,  // Start with 100ms delay
  backoffMultiplier: 3, // Aggressive backoff (100ms → 300ms across the two retries)
  jitter: true          // ±25% randomization to prevent thundering herd
}
```

Retry Conditions
Retry Only:
- HTTP 5xx server errors (500-599)
- Timeout errors (AbortError from fetch timeout)
Never Retry:
- HTTP 4xx client errors (400-499) - these indicate client mistakes
- Successful responses (2xx, 3xx)
- Validation errors (our own schema validation failures)
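To make the classification concrete, a minimal check might look like the sketch below. It assumes failed responses are thrown as errors carrying a numeric `status` field and that fetch timeouts surface as `AbortError`; the real `isRetryableError()` in `retry.ts` may differ in detail.

```typescript
// Sketch only: assumes HTTP failures are thrown as errors with a numeric
// `status` property and that timeouts surface as a DOMException "AbortError".
function isRetryableError(error: unknown): boolean {
  // Fetch timeouts (AbortError) are transient and worth retrying.
  if (error instanceof DOMException && error.name === "AbortError") {
    return true;
  }
  // 5xx server errors are retryable; 4xx client errors, validation errors,
  // and anything else are not.
  const status = (error as { status?: number })?.status;
  return typeof status === "number" && status >= 500 && status <= 599;
}
```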
Endpoint Coverage
✅ Retry Enabled
- `getPartById` - Critical for part detail pages
- `getPartBySlug` - Alternative part lookup
- `searchParts` - Main catalog search
- `fetchFilters` - Filter data for search UI
❌ No Retry
- `fetchAutocomplete` - Latency-critical, fast-fail preferred
- `health` - Diagnostic endpoint, should reflect real-time status
Rationale for Selective Retry:
- Part detail pages are critical user journeys where retry improves UX significantly
- Search results are important but can tolerate fast-fail if backend is down
- Filters are loaded async and benefit from retry without blocking UI
- Autocomplete requires <200ms response time - retry would degrade UX
- Health checks should reflect actual backend state, not retry-masked availability
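One way to encode this selective coverage is a simple per-endpoint default map, as in the hypothetical sketch below (the endpoint names mirror the lists above; the map itself and how it is wired into the client are assumptions, not the current implementation):

```typescript
// Hypothetical per-endpoint retry defaults; the real client may wire this
// differently (e.g. passing `retry: true` at each call site instead).
const RETRY_ENABLED_BY_ENDPOINT: Record<string, boolean> = {
  getPartById: true,        // critical part detail pages
  getPartBySlug: true,      // alternative part lookup
  searchParts: true,        // main catalog search
  fetchFilters: true,       // async filter data, non-blocking
  fetchAutocomplete: false, // <200ms budget: fail fast instead
  health: false,            // must reflect real backend state
};
```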
Consequences
Positive
- ✅ Reduced error rate: Transient failures masked from users
- ✅ Better UX: Users don't see errors for temporary backend blips
- ✅ No thundering herd: Jitter prevents simultaneous retry storms
- ✅ Graceful degradation: Exponential backoff adapts to backend recovery time
Negative
- ⚠️ Increased latency on failures: The full retry path adds ~400ms of backoff delay (100ms + 300ms, ±25% jitter) on top of the time spent on the failed attempts themselves
- ⚠️ Hidden backend issues: Retry may mask systematic problems in metrics
- ⚠️ Increased backend load: Each retry adds 1-2 additional requests for failed operations
Mitigation Strategies
- Monitor retry metrics (see the sketch after this list):
  - Track retry attempt rate (should be <1% of requests)
  - Alert if retry rate >5% (indicates backend issues)
  - Log final failures after exhausted retries
- Latency monitoring:
  - Track p95/p99 latency separately for retried vs non-retried requests
  - Alert if p95 >500ms for critical endpoints
- Backend correlation:
  - Include `X-Retry-Count` header in retry attempts
  - Backend can differentiate retry traffic for debugging
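As a sketch of the monitoring hook mentioned in the first item: this assumes the retry wrapper exposes (or is extended with) an `onRetry` callback and that `metrics` and `logger` clients with these method names are available, none of which is confirmed by the current implementation.

```typescript
// Hypothetical observability hook around the retry wrapper; `onRetry`,
// `metrics`, and `logger` are assumed names, not existing APIs.
async function searchPartsWithMetrics(query: string) {
  try {
    return await retry(() => searchParts(query), {
      onRetry: ({ attempt, error }: { attempt: number; error: unknown }) => {
        metrics.increment("search_service.retry_attempt", { endpoint: "searchParts" });
        logger.warn("retrying searchParts", { attempt, error: String(error) });
      },
    });
  } catch (error) {
    // Final failure after exhausting retries: record it so backend issues stay visible.
    metrics.increment("search_service.retry_exhausted", { endpoint: "searchParts" });
    throw error;
  }
}
```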
Implementation Details
Core Retry Logic
Located in lib/search-service/retry.ts:
- `isRetryableError()` - Determines if error should be retried
- `retry()` - Main retry wrapper with exponential backoff
- `sleep()` - AbortSignal-aware delay with memory leak prevention
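A minimal sketch of how these helpers might fit together under the configuration above, reusing the `isRetryableError()` classifier sketched earlier; it is illustrative, not a copy of the code in `retry.ts`:

```typescript
interface RetryOptions {
  maxAttempts?: number;      // total attempts, including the initial one
  initialDelayMs?: number;
  backoffMultiplier?: number;
  signal?: AbortSignal;      // caller's timeout/abort signal
}

// AbortSignal-aware delay; removes its listener on completion to avoid leaks.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(new DOMException("Aborted", "AbortError"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new DOMException("Aborted", "AbortError"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function retry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const { maxAttempts = 3, initialDelayMs = 100, backoffMultiplier = 3, signal } = opts;
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Give up on non-retryable errors or once attempts are exhausted.
      if (!isRetryableError(error) || attempt === maxAttempts - 1) {
        throw error;
      }
      // Exponential backoff with ±25% jitter (nominally 100ms, then 300ms).
      const base = initialDelayMs * backoffMultiplier ** attempt;
      await sleep(base * (0.75 + Math.random() * 0.5), signal);
    }
  }
  throw lastError; // unreachable with maxAttempts >= 1, kept for type safety
}
```

With the defaults this yields at most two backoff delays (nominally 100ms and 300ms), consistent with the `maxAttempts: 3` configuration in the Decision section.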
HTTP Client Integration
Located in lib/search-service/http-client.ts:
- Retry wrapper applied at request level via `options.retry`
- Timeout signal properly chained through retry attempts
- Cleanup logic prevents event listener memory leaks
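A hypothetical call site, assuming the client exposes a `searchServiceFetch`-style helper whose options include `retry` and `signal` (the helper name and option shape are assumptions, not the actual `http-client.ts` API):

```typescript
// Hypothetical usage; `searchServiceFetch` and its option names are assumed.
async function loadPart(partId: string) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 2_000); // overall budget
  try {
    // `retry: true` opts into the backoff policy; the same signal is chained
    // through every attempt so the overall timeout still applies.
    return await searchServiceFetch(`/parts/${partId}`, {
      retry: true,
      signal: controller.signal,
    });
  } finally {
    clearTimeout(timeout);
  }
}
```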
Test Coverage
Located in lib/search-service/retry.test.ts:
- 13 unit tests covering all retry scenarios
- Includes timing tests for backoff verification
- Tests abort signal handling and jitter variance
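For illustration, a timing-oriented test in the style described above might look like the sketch below (assuming Vitest and the `retry()` sketch from Implementation Details; the actual tests in `retry.test.ts` may be structured differently):

```typescript
import { describe, expect, it, vi } from "vitest"; // assumed test runner

describe("retry", () => {
  it("retries a 5xx failure after a backoff delay", async () => {
    vi.useFakeTimers();
    const fn = vi
      .fn()
      .mockRejectedValueOnce({ status: 503 }) // first attempt fails
      .mockResolvedValueOnce("ok");           // second attempt succeeds

    const promise = retry(fn, { maxAttempts: 3, initialDelayMs: 100, backoffMultiplier: 3 });
    await vi.advanceTimersByTimeAsync(200); // covers the 100ms delay ±25% jitter

    await expect(promise).resolves.toBe("ok");
    expect(fn).toHaveBeenCalledTimes(2);
    vi.useRealTimers();
  });
});
```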
References
- AWS SDK Retry Strategy
- Stripe API Retry Documentation
- Google Cloud Retry Best Practices
- Martin Fowler: Circuit Breaker Pattern
Future Considerations
- Adaptive backoff: Adjust multiplier based on backend recovery patterns
- Per-endpoint configuration: Different retry configs for different endpoint types
- Client-side caching: Reduce retry need by caching successful responses
- Circuit breaker: Add circuit breaker pattern if retry rate consistently high
- Request deduplication: Prevent duplicate retries from concurrent requests