Payment Service Monitoring & Alerting Setup

Version: 1.0 | Last Updated: 2025-11-13 | Service: payment-service | Maintainer: DevOps Team

Table of Contents

  1. Overview
  2. Metrics & Dashboards
  3. Alerting Rules
  4. Dashboards
  5. Log Management
  6. SLA & SLOs
  7. Monitoring Best Practices
  8. Runbook Links
  9. Appendix: Setup Commands

Overview

Monitoring Stack

  • Metrics: Google Cloud Monitoring
  • Logs: Google Cloud Logging
  • APM: Cloud Trace (future)
  • Alerting: Cloud Monitoring Alerts → PagerDuty
  • Dashboards: Cloud Monitoring Dashboards

Key Service Indicators

Metric                 Target   Alert Threshold   Impact
Uptime                 99.9%    <99.5%            Critical
Error Rate             <0.1%    >1%               High
P95 Latency            <500ms   >1000ms           Medium
MongoDB Connection     Stable   Disconnected      Critical
Webhook Success Rate   >99%     <95%              High

Metrics & Dashboards

Cloud Run Metrics

Navigate: Cloud Run Metrics

Request Metrics

  • run.googleapis.com/request_count - Total requests
  • run.googleapis.com/request_latencies - Request duration (P50, P95, P99)
  • run.googleapis.com/request_count (by response_code) - Success/error breakdown

Resource Metrics

  • run.googleapis.com/container/memory/utilizations - Memory usage
  • run.googleapis.com/container/cpu/utilizations - CPU usage
  • run.googleapis.com/container/instance_count - Active instances
  • run.googleapis.com/container/billable_instance_time - Cost tracking
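
These metrics can also be read programmatically. Below is a minimal sketch using the @google-cloud/monitoring Node.js client to pull the last hour of request counts; the project ID and time window are placeholders.

// Sketch: query Cloud Run request counts via the Cloud Monitoring API.
// Assumes the @google-cloud/monitoring client library; adjust the filter
// and window as needed.
import { MetricServiceClient } from '@google-cloud/monitoring';

const client = new MetricServiceClient();

async function requestCountsLastHour(projectId: string) {
  const nowSeconds = Math.floor(Date.now() / 1000);
  const [series] = await client.listTimeSeries({
    name: client.projectPath(projectId),
    filter:
      'resource.type="cloud_run_revision" ' +
      'resource.labels.service_name="payment-service" ' +
      'metric.type="run.googleapis.com/request_count"',
    interval: {
      startTime: { seconds: nowSeconds - 3600 },
      endTime: { seconds: nowSeconds },
    },
  });
  return series;
}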

Custom Application Metrics

Future Implementation:

// Example: Instrument with OpenTelemetry
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('payment-service');

// Webhook processing counter
const webhookCounter = meter.createCounter('webhook_processed', {
  description: 'Number of webhooks processed',
});

webhookCounter.add(1, { type: 'clerk', event: 'user.created' });

MongoDB Metrics (Atlas)

Navigate: MongoDB Atlas Monitoring

  • Connections: Current / Available
  • Query Performance: Slow queries (>100ms)
  • Storage: Database size, index usage
  • Alerts: Connection spikes, slow queries

Alerting Rules

Critical Alerts (Page On-Call)

1. Service Down

Condition: Health check failing for 2 minutes

Metric: run.googleapis.com/request_count
Filter:
  service_name: payment-service
  response_code: 5xx
Threshold: > 10 requests in 2 minutes
Notification: PagerDuty - Critical

Gcloud Command:

gcloud alpha monitoring policies create \
  --display-name="Payment Service - High Error Rate" \
  --condition-display-name="5xx errors > 10" \
  --condition-threshold-value=10 \
  --condition-threshold-duration=120s \
  --condition-threshold-aggregation-per-series-aligner=ALIGN_RATE \
  --condition-threshold-comparison=COMPARISON_GT \
  --notification-channels=PAGERDUTY_CHANNEL_ID

2. MongoDB Connection Lost

Condition: Health check returns 503 with "mongodb: disconnected"

Metric: logging.googleapis.com/user/mongodb_connection_error
Threshold: > 0 in 1 minute
Notification: PagerDuty - Critical

Log-based Metric:

gcloud logging metrics create mongodb_connection_error \
  --description="MongoDB connection failures" \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    jsonPayload.message=~"MongoDB.*disconnect"'
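
For reference, a minimal sketch of the kind of health check this alert depends on. The actual handler is not shown in this document; this version assumes Express and Mongoose.

// Hypothetical /health handler (assumes Express and Mongoose; the real
// payment-service implementation may differ).
import express from 'express';
import mongoose from 'mongoose';

const app = express();

app.get('/health', (_req, res) => {
  // mongoose.connection.readyState === 1 means "connected"
  const mongoConnected = mongoose.connection.readyState === 1;
  res.status(mongoConnected ? 200 : 503).json({
    status: mongoConnected ? 'ok' : 'degraded',
    mongodb: mongoConnected ? 'connected' : 'disconnected',
  });
});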

3. Webhook Failure Spike

Condition: >5 webhook failures in 5 minutes

Metric: logging.googleapis.com/user/webhook_failures
Threshold: > 5 in 5 minutes
Notification: PagerDuty - High

High Priority Alerts (Slack)

4. Elevated Error Rate

Condition: Error rate >1% for 5 minutes

Metric: run.googleapis.com/request_count
Filter:
  response_code_class: 5xx
Threshold: > 1% of total requests
Window: 5 minutes
Notification: Slack #alerts

5. High Latency

Condition: P95 latency >1000ms for 10 minutes

Metric: run.googleapis.com/request_latencies
Percentile: 95th
Threshold: > 1000ms
Window: 10 minutes
Notification: Slack #alerts

6. Memory Pressure

Condition: Memory utilization >80% for 10 minutes

Metric: run.googleapis.com/container/memory/utilizations
Threshold: > 0.8
Window: 10 minutes
Notification: Slack #alerts
Action: Consider increasing memory allocation

Medium Priority Alerts (Email)

7. Unusual Traffic Pattern

Condition: Request count 3x above baseline

Metric: run.googleapis.com/request_count
Threshold: 3x moving average (7 days)
Window: 30 minutes
Notification: Email - DevOps

8. Slow Queries (MongoDB)

Condition: Query duration >100ms

MongoDB Atlas Alert:

  • Navigate: Atlas → Alerts → Create Alert
  • Metric: Query Targeting: Scanned Objects / Returned
  • Threshold: >100
  • Notification: Email - Backend Team

Dashboards

1. Service Health Dashboard

Create Dashboard:

# Create dashboard JSON config
cat > payment-service-dashboard.json <<EOF
{
  "displayName": "Payment Service Health",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Request Rate",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" resource.labels.service_name=\"payment-service\" metric.type=\"run.googleapis.com/request_count\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              },
              "plotType": "LINE"
            }]
          }
        }
      },
      {
        "width": 6,
        "height": 4,
        "xPos": 6,
        "widget": {
          "title": "Error Rate",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" resource.labels.service_name=\"payment-service\" metric.type=\"run.googleapis.com/request_count\" metric.label.response_code_class=\"5xx\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              },
              "plotType": "LINE"
            }]
          }
        }
      }
    ]
  }
}
EOF

# Import dashboard
gcloud monitoring dashboards create --config-from-file=payment-service-dashboard.json

2. Business Metrics Dashboard

Future Implementation:

Track business-critical events (see the sketch after this list):

  • Payment intents created
  • Successful payments
  • Failed payments (by error type)
  • Refunds processed
  • Webhook events by type
  • User signups (Clerk)
  • Average transaction value
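
A minimal sketch of how these events might be counted, following the OpenTelemetry pattern from the Custom Application Metrics example above. Metric and attribute names here are suggestions, not an agreed schema.

// Sketch: business-event counters via OpenTelemetry (names are illustrative).
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('payment-service');

const paymentsCounter = meter.createCounter('payments_total', {
  description: 'Payment attempts by outcome',
});
const refundsCounter = meter.createCounter('refunds_processed', {
  description: 'Refunds processed',
});

// Example usage inside the payment flow
paymentsCounter.add(1, { outcome: 'succeeded' });
paymentsCounter.add(1, { outcome: 'failed', error_type: 'card_declined' });
refundsCounter.add(1);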

3. Performance Dashboard

Charts:

  • P50/P95/P99 Latency trends
  • Request duration by endpoint
  • MongoDB query performance
  • Instance count over time
  • Cold start frequency

Log Management

Structured Logging

Current Format:

{
  "severity": "INFO",
  "message": "Clerk webhook verified",
  "eventType": "user.created",
  "clerkId": "user_xxx",
  "timestamp": "2025-11-13T10:30:00Z"
}
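
A minimal sketch of how entries in this format might be emitted. On Cloud Run, single-line JSON written to stdout is parsed by Cloud Logging into jsonPayload, and the severity field maps onto the log entry's severity; the helper name below is illustrative.

// Illustrative helper: emit one JSON line per log entry to stdout.
// Cloud Run forwards stdout to Cloud Logging, which parses JSON lines
// into jsonPayload and honours the "severity" field.
function logStructured(
  severity: 'DEBUG' | 'INFO' | 'WARNING' | 'ERROR',
  message: string,
  fields: Record<string, unknown> = {},
): void {
  console.log(
    JSON.stringify({ severity, message, ...fields, timestamp: new Date().toISOString() }),
  );
}

// Produces the entry shown above
logStructured('INFO', 'Clerk webhook verified', {
  eventType: 'user.created',
  clerkId: 'user_xxx',
});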

Log Queries

View Clerk Webhook Events:

resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
jsonPayload.message=~"Clerk webhook"

View Errors:

resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
severity="ERROR"

View Slow Requests:

resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
httpRequest.latency > "1s"

Log Retention

  • Default: 30 days (Google Cloud Logging)
  • Long-term: Export to Cloud Storage or BigQuery

Setup Log Sink:

gcloud logging sinks create payment-service-logs \
  bigquery.googleapis.com/projects/crop-platform/datasets/payment_logs \
  --log-filter='resource.type="cloud_run_revision" resource.labels.service_name="payment-service"'

Log Alerts

Unexpected Errors:

gcloud alpha monitoring policies create \
  --display-name="Payment Service - Unexpected Errors" \
  --condition-log-match='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    severity="ERROR"
    jsonPayload.message!~"Expected error pattern"'

SLA & SLOs

Service Level Agreement (SLA)

  • Uptime: 99.9% monthly (43.8 minutes downtime/month)
  • Response Time: P95 <500ms
  • Data Durability: 99.999% (MongoDB Atlas)

Service Level Objectives (SLOs)

Availability SLO

Target: 99.9% successful requests

Calculation:

SLO = (total_requests - 5xx_errors) / total_requests

Error Budget:

  • Monthly requests: ~10M
  • Allowed errors: 10,000 (0.1%)
  • Alert when 50% budget consumed

Implementation:

SLO Window: 30 days rolling
Measurement: Error rate <0.1%
Alert: When error budget <50% remaining
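
The budget arithmetic above as a small sketch; the observed error count is a made-up example.

// Error-budget sketch using the SLO numbers above (observed5xx is hypothetical).
const sloTarget = 0.999;           // 99.9% availability target
const totalRequests = 10_000_000;  // ~10M requests per 30-day window
const observed5xx = 6_000;         // example measurement, not real data

const errorBudget = totalRequests * (1 - sloTarget);    // 10,000 allowed errors
const budgetRemaining = 1 - observed5xx / errorBudget;  // fraction of budget left

if (budgetRemaining < 0.5) {
  // Mirrors the "alert when 50% budget consumed" rule above
  console.warn(`Error budget low: ${(budgetRemaining * 100).toFixed(1)}% remaining`);
}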

Latency SLO

Target: P95 latency <500ms

Measurement Window: 7 days rolling

Alert: P95 >750ms for 1 hour

Webhook Processing SLO

Target: 99% of webhooks processed within 5 seconds

Measurement:

  • Track receivedAt to processedAt duration (see the sketch below)
  • Alert if processing time >10s or failure rate >1%
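
A minimal sketch of recording that duration with an OpenTelemetry histogram, reusing the meter from the Custom Application Metrics example; the metric name is a suggestion.

// Sketch: record webhook processing time as an OpenTelemetry histogram.
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('payment-service');

// End-to-end webhook processing time in milliseconds
const webhookDuration = meter.createHistogram('webhook_processing_duration_ms', {
  description: 'Time from webhook receipt to completion of processing',
  unit: 'ms',
});

// Hypothetical usage inside the webhook handler
const receivedAt = Date.now();
// ... verify signature, persist event, run business logic ...
const processedAt = Date.now();
webhookDuration.record(processedAt - receivedAt, { type: 'clerk' });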

Monitoring Best Practices

1. Golden Signals

Latency:

  • Track request duration P50, P95, P99
  • Alert on P95 >1000ms sustained

Traffic:

  • Monitor request rate trends
  • Alert on sudden 3x spike or drop

Errors:

  • Track error rate by type (4xx, 5xx)
  • Alert on rate >1%

Saturation:

  • Monitor CPU, memory, MongoDB connections
  • Alert on sustained >80% utilization

2. Alert Fatigue Prevention

  • Use appropriate thresholds (don't alert on noise)
  • Group related alerts (batch similar conditions)
  • Auto-resolve alerts when condition clears
  • Silence during maintenance windows

3. On-Call Rotation

  • Primary: Responds within 15 minutes
  • Secondary: Backup if primary unavailable
  • Escalation: Team lead after 30 minutes

PagerDuty Schedule:

  • Week 1: Engineer A (Primary), Engineer B (Secondary)
  • Week 2: Engineer B (Primary), Engineer C (Secondary)
  • Rotation: Every Monday 9am UTC

Runbook Links

When alerts fire, refer to:


Appendix: Setup Commands

Create Notification Channels

# PagerDuty channel
gcloud alpha monitoring channels create \
  --display-name="PagerDuty - Critical" \
  --type=pagerduty \
  --channel-labels=service_key=YOUR_PAGERDUTY_KEY

# Slack channel
gcloud alpha monitoring channels create \
  --display-name="Slack - Alerts" \
  --type=slack \
  --channel-labels=url=YOUR_SLACK_WEBHOOK_URL

# Email channel
gcloud alpha monitoring channels create \
  --display-name="DevOps Team Email" \
  --type=email \
  --channel-labels=email_address=devops@crop.com

Create Log-Based Metrics

# Webhook failures
gcloud logging metrics create webhook_failures \
  --description="Clerk/Stripe webhook failures" \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    jsonPayload.message=~"webhook.*fail"
    severity="ERROR"'

# MongoDB errors
gcloud logging metrics create mongodb_errors \
  --description="MongoDB connection/query errors" \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    jsonPayload.message=~"MongoDB.*error"'

Export Logs to BigQuery

# Create dataset
bq mk --dataset crop-platform:payment_logs

# Create log sink
gcloud logging sinks create payment-logs-bigquery \
  bigquery.googleapis.com/projects/crop-platform/datasets/payment_logs \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"'

Document Version: 1.0 | Last Review: 2025-11-13 | Next Review: 2025-12-13
