ADR 001: Search Service Retry Strategy

Status: Accepted
Date: 2024-12-04
Authors: Search Service Team
Related PRs: #TBD

Context and Problem Statement

The search service API is a critical dependency for the frontend application. Transient network failures and temporary backend issues (5xx errors, timeouts) can cause user-facing errors even when the backend is generally healthy. We need a strategy to improve resilience against these temporary failures without impacting user experience.

Key Requirements:

  • Minimize user-visible errors from transient failures
  • Maintain acceptable response times (p95 < 500ms for critical paths)
  • Avoid thundering herd effect during backend recovery
  • Differentiate between retryable (5xx, timeouts) and non-retryable (4xx) errors

Decision Drivers

  • User Experience: Reduce error rates for critical operations (part detail pages, search)
  • Backend Load: Prevent overwhelming recovering services with retry storms
  • Latency Sensitivity: Different endpoints have different latency requirements
  • Error Types: 4xx errors (client mistakes) should never be retried

Considered Options

Option 1: No Retry (Status Quo)

Pros:

  • Simplest implementation
  • Predictable latency
  • No risk of retry storms

Cons:

  • ❌ High error rate from transient failures
  • ❌ Poor user experience during temporary backend issues
  • ❌ No differentiation between recoverable and fatal errors

Option 2: Simple Retry (Fixed Delay)

Pros:

  • Easy to implement
  • Bounded retry count

Cons:

  • ❌ Thundering herd problem (all clients retry simultaneously)
  • ❌ Fixed delay doesn't adapt to backend recovery time
  • ❌ Still causes spikes during recovery

Option 3: Exponential Backoff with Jitter (Selected)

Pros:

  • ✅ Prevents thundering herd via randomized delays
  • ✅ Adapts to backend recovery (longer delays for sustained failures)
  • ✅ Industry standard pattern (AWS SDK, Stripe, etc.)
  • ✅ Bounded retry count and maximum delay

Cons:

  • More complex implementation
  • Higher maximum latency for sustained failures

Option 4: Circuit Breaker Pattern

Pros:

  • Protects backend during extended outages
  • Fast failure after threshold

Cons:

  • ❌ Overly complex for current needs
  • ❌ Requires distributed state or coordination
  • ❌ May trip open and block requests prematurely during intermittent issues

Decision

Implement exponential backoff with jitter for critical read endpoints.

Configuration

{
  maxAttempts: 3,           // Total 3 attempts (1 initial + 2 retries)
  initialDelayMs: 100,       // Start with 100ms delay
  backoffMultiplier: 3,      // Aggressive backoff (100ms → 300ms → 900ms)
  jitter: true               // ±25% randomization to prevent thundering herd
}
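
As a sketch of how these values translate into concrete delays (the type and helper names here are illustrative, not the actual exports of retry.ts):

interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  backoffMultiplier: number;
  jitter: boolean;
}

const defaultRetryConfig: RetryConfig = {
  maxAttempts: 3,
  initialDelayMs: 100,
  backoffMultiplier: 3,
  jitter: true,
};

// Nominal delay before retry number `retryNumber` (1-based): 100ms, 300ms, 900ms, ...
// With jitter enabled, each delay is randomized by ±25% to spread clients out.
function computeDelayMs(config: RetryConfig, retryNumber: number): number {
  const base = config.initialDelayMs * config.backoffMultiplier ** (retryNumber - 1);
  if (!config.jitter) return base;
  const factor = 0.75 + Math.random() * 0.5; // uniform in [0.75, 1.25]
  return Math.round(base * factor);
}

With maxAttempts: 3, only the first two delays ever apply (100ms and 300ms nominal), because the third attempt is the last one.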

Retry Conditions

Retry Only:

  • HTTP 5xx server errors (500-599)
  • Timeout errors (AbortError from fetch timeout)

Never Retry:

  • HTTP 4xx client errors (400-499) - these indicate client mistakes
  • Successful responses (2xx, 3xx)
  • Validation errors (our own schema validation failures)
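
A minimal sketch of this classification, assuming errors carry their HTTP status on a small wrapper class (the real isRetryableError() in retry.ts may inspect errors differently):

// Hypothetical error type carrying the HTTP status; the real client may use another shape.
class HttpError extends Error {
  constructor(public readonly status: number) {
    super(`HTTP ${status}`);
    this.name = "HttpError";
  }
}

function isRetryableError(error: unknown): boolean {
  // Fetch timeouts surface as a DOMException named "AbortError" and are retryable.
  if (error instanceof DOMException && error.name === "AbortError") return true;
  // 5xx server errors are retryable; 4xx client errors are not.
  if (error instanceof HttpError) return error.status >= 500 && error.status <= 599;
  // Validation failures and anything unrecognized are treated as non-retryable.
  return false;
}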

Endpoint Coverage

✅ Retry Enabled

  1. getPartById - Critical for part detail pages
  2. getPartBySlug - Alternative part lookup
  3. searchParts - Main catalog search
  4. fetchFilters - Filter data for search UI

❌ No Retry

  1. fetchAutocomplete - Latency-critical, fast-fail preferred
  2. health - Diagnostic endpoint, should reflect real-time status

Rationale for Selective Retry:

  • Part detail pages are critical user journeys where retry improves UX significantly
  • Search results are important but can tolerate fast-fail if backend is down
  • Filters are loaded async and benefit from retry without blocking UI
  • Autocomplete requires <200ms response time - retry would degrade UX
  • Health checks should reflect actual backend state, not retry-masked availability
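
In code, this selectivity is expressed per call; a sketch assuming the client exposes an options.retry flag as described under Implementation Details below (client shape and paths are illustrative):

// Hypothetical client shape; the real http-client.ts API may differ.
interface RequestOptions {
  retry?: boolean;
  signal?: AbortSignal;
}

interface SearchServiceClient {
  get<T>(path: string, options?: RequestOptions): Promise<T>;
}

async function loadPartDetail(client: SearchServiceClient, partId: string) {
  // Critical read path: opt in to retries.
  const part = await client.get(`/parts/${partId}`, { retry: true });
  const filters = await client.get("/filters", { retry: true });
  return { part, filters };
}

async function loadSuggestions(client: SearchServiceClient, query: string) {
  // Latency-critical path: fail fast, no retry.
  return client.get(`/autocomplete?q=${encodeURIComponent(query)}`, { retry: false });
}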

Consequences

Positive

  • Reduced error rate: Transient failures masked from users
  • Better UX: Users don't see errors for temporary backend blips
  • No thundering herd: Jitter prevents simultaneous retry storms
  • Graceful degradation: Exponential backoff adapts to backend recovery time

Negative

  • ⚠️ Increased latency on failures: Exhausting the retries adds ~400ms of backoff delay (100ms + 300ms, before jitter) on top of the time spent on the failed attempts themselves
  • ⚠️ Hidden backend issues: Retry may mask systematic problems in metrics
  • ⚠️ Increased backend load: A failing operation can generate up to 2 additional requests before the client gives up

Mitigation Strategies

  1. Monitor retry metrics:

    • Track retry attempt rate (should be <1% of requests)
    • Alert if retry rate >5% (indicates backend issues)
    • Log final failures after exhausted retries
  2. Latency monitoring:

    • Track p95/p99 latency separately for retried vs non-retried requests
    • Alert if p95 >500ms for critical endpoints
  3. Backend correlation:

    • Include X-Retry-Count header in retry attempts (see the sketch after this list)
    • Backend can differentiate retry traffic for debugging
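
A sketch of that correlation header, assuming it is attached only on retry attempts (the header name comes from this ADR; the function name is illustrative):

// Build per-attempt headers; `attempt` is 0 for the initial request.
function withRetryCountHeader(base: HeadersInit, attempt: number): Headers {
  const headers = new Headers(base);
  if (attempt > 0) {
    // Lets the backend separate retry traffic from first attempts.
    headers.set("X-Retry-Count", String(attempt));
  }
  return headers;
}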

Implementation Details

Core Retry Logic

Located in lib/search-service/retry.ts:

  • isRetryableError() - Determines if error should be retried
  • retry() - Main retry wrapper with exponential backoff
  • sleep() - AbortSignal-aware delay with memory leak prevention
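
The rough shape of these helpers, reusing the RetryConfig, computeDelayMs(), and isRetryableError() sketches above (signatures are illustrative; retry.ts is the source of truth):

// Abort-aware delay: resolves after `ms`, rejects as soon as the signal fires,
// and removes its listener in both cases so repeated retries do not leak listeners.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(signal.reason);
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(signal?.reason);
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Runs `fn` up to config.maxAttempts times, backing off between attempts.
// Non-retryable errors and the final failure are rethrown to the caller.
async function retry<T>(
  fn: () => Promise<T>,
  config: RetryConfig,
  signal?: AbortSignal
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === config.maxAttempts || !isRetryableError(error)) throw error;
      await sleep(computeDelayMs(config, attempt), signal);
    }
  }
  throw lastError;
}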

HTTP Client Integration

Located in lib/search-service/http-client.ts:

  • Retry wrapper applied at request level via options.retry
  • Timeout signal properly chained through retry attempts
  • Cleanup logic prevents event listener memory leaks
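
A sketch of how this composes at the request level, reusing defaultRetryConfig, HttpError, and retry() from the sketches above (the real wiring in http-client.ts may differ):

interface RequestConfig {
  retry?: boolean;        // opt-in per endpoint, as in the coverage list above
  timeoutMs?: number;
  signal?: AbortSignal;   // caller's signal, e.g. from component unmount
}

async function request<T>(url: string, options: RequestConfig = {}): Promise<T> {
  const attemptOnce = async (): Promise<T> => {
    // Fresh timeout per attempt; aborting produces the AbortError that
    // isRetryableError() treats as retryable.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), options.timeoutMs ?? 5_000);
    const onCallerAbort = () => controller.abort(options.signal?.reason);
    options.signal?.addEventListener("abort", onCallerAbort, { once: true });
    try {
      const response = await fetch(url, { signal: controller.signal });
      if (!response.ok) throw new HttpError(response.status);
      return (await response.json()) as T;
    } finally {
      // Cleanup on every path so retries do not accumulate timers or listeners.
      clearTimeout(timer);
      options.signal?.removeEventListener("abort", onCallerAbort);
    }
  };
  return options.retry
    ? retry(attemptOnce, defaultRetryConfig, options.signal)
    : attemptOnce();
}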

Test Coverage

Located in lib/search-service/retry.test.ts:

  • 13 unit tests covering all retry scenarios
  • Includes timing tests for backoff verification
  • Tests abort signal handling and jitter variance
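
For instance, the classification cases might look like the following, assuming a Vitest-style runner and the isRetryableError()/HttpError sketches above (the actual suite is more thorough):

import { describe, expect, it } from "vitest";

describe("isRetryableError", () => {
  it("retries 5xx server errors", () => {
    expect(isRetryableError(new HttpError(503))).toBe(true);
  });

  it("never retries 4xx client errors", () => {
    expect(isRetryableError(new HttpError(404))).toBe(false);
  });

  it("retries fetch timeouts surfaced as AbortError", () => {
    const timeout = new DOMException("The operation was aborted.", "AbortError");
    expect(isRetryableError(timeout)).toBe(true);
  });
});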

Future Considerations

  1. Adaptive backoff: Adjust multiplier based on backend recovery patterns
  2. Per-endpoint configuration: Different retry configs for different endpoint types
  3. Client-side caching: Reduce retry need by caching successful responses
  4. Circuit breaker: Add circuit breaker pattern if retry rate consistently high
  5. Request deduplication: Prevent duplicate retries from concurrent requests
