Payment Service Monitoring & Alerting Setup

Version: 1.0 | Last Updated: 2025-11-13 | Service: payment-service | Maintainer: DevOps Team

Table of Contents

  1. Overview
  2. Metrics & Dashboards
  3. Alerting Rules
  4. Dashboards
  5. Log Management
  6. SLA & SLOs
  7. Monitoring Best Practices
  8. Runbook Links
  9. Appendix: Setup Commands

Overview

Monitoring Stack

  • Metrics: Google Cloud Monitoring
  • Logs: Google Cloud Logging
  • APM: Cloud Trace (future)
  • Alerting: Cloud Monitoring Alerts → PagerDuty
  • Dashboards: Cloud Monitoring Dashboards

Key Service Indicators

Metric                 Target   Alert Threshold   Impact
Uptime                 99.9%    <99.5%            Critical
Error Rate             <0.1%    >1%               High
P95 Latency            <500ms   >1000ms           Medium
MongoDB Connection     Stable   Disconnected      Critical
Webhook Success Rate   >99%     <95%              High

Metrics & Dashboards

Cloud Run Metrics

Navigate: Cloud Run Metrics

Request Metrics

  • run.googleapis.com/request_count - Total requests
  • run.googleapis.com/request_latencies - Request duration (P50, P95, P99)
  • run.googleapis.com/request_count (by response_code) - Success/error breakdown

Resource Metrics

  • run.googleapis.com/container/memory/utilizations - Memory usage
  • run.googleapis.com/container/cpu/utilizations - CPU usage
  • run.googleapis.com/container/instance_count - Active instances
  • run.googleapis.com/container/billable_instance_time - Cost tracking
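
These metrics can also be read programmatically. Below is a minimal sketch using the @google-cloud/monitoring Node.js client to pull the last hour of request counts; the project ID and time window are placeholders.

// Sketch: query Cloud Run request counts via the Cloud Monitoring API.
// Assumes the @google-cloud/monitoring client library; adjust the filter
// and window as needed.
import { MetricServiceClient } from '@google-cloud/monitoring';

const client = new MetricServiceClient();

async function requestCountsLastHour(projectId: string) {
  const nowSeconds = Math.floor(Date.now() / 1000);
  const [series] = await client.listTimeSeries({
    name: client.projectPath(projectId),
    filter:
      'resource.type="cloud_run_revision" ' +
      'resource.labels.service_name="payment-service" ' +
      'metric.type="run.googleapis.com/request_count"',
    interval: {
      startTime: { seconds: nowSeconds - 3600 },
      endTime: { seconds: nowSeconds },
    },
  });
  return series;
}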

Custom Application Metrics

Future Implementation:

// Example: Instrument with OpenTelemetry
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('payment-service');

// Webhook processing counter
const webhookCounter = meter.createCounter('webhook_processed', {
  description: 'Number of webhooks processed',
});

webhookCounter.add(1, { type: 'clerk', event: 'user.created' });

MongoDB Metrics (Atlas)

Navigate: MongoDB Atlas Monitoring

  • Connections: Current / Available
  • Query Performance: Slow queries (>100ms)
  • Storage: Database size, index usage
  • Alerts: Connection spikes, slow queries

Alerting Rules

Critical Alerts (Page On-Call)

1. Service Down

Condition: Health check failing for 2 minutes

Metric: run.googleapis.com/request_count
Filter:
  service_name: payment-service
  response_code: 5xx
Threshold: > 10 requests in 2 minutes
Notification: PagerDuty - Critical

Gcloud Command:

gcloud alpha monitoring policies create \
  --display-name="Payment Service - High Error Rate" \
  --condition-display-name="5xx errors > 10" \
  --condition-threshold-value=10 \
  --condition-threshold-duration=120s \
  --condition-threshold-aggregation-per-series-aligner=ALIGN_RATE \
  --condition-threshold-comparison=COMPARISON_GT \
  --notification-channels=PAGERDUTY_CHANNEL_ID

2. MongoDB Connection Lost

Condition: Health check returns 503 with "mongodb: disconnected"

Metric: logging.googleapis.com/user/mongodb_connection_error
Threshold: > 0 in 1 minute
Notification: PagerDuty - Critical

Log-based Metric:

gcloud logging metrics create mongodb_connection_error \
  --description="MongoDB connection failures" \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    jsonPayload.message=~"MongoDB.*disconnect"'
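
For reference, a minimal sketch of the kind of health check this alert depends on. The actual handler is not shown in this document; this version assumes Express and Mongoose.

// Hypothetical /health handler (assumes Express and Mongoose; the real
// payment-service implementation may differ).
import express from 'express';
import mongoose from 'mongoose';

const app = express();

app.get('/health', (_req, res) => {
  // mongoose.connection.readyState === 1 means "connected"
  const mongoConnected = mongoose.connection.readyState === 1;
  res.status(mongoConnected ? 200 : 503).json({
    status: mongoConnected ? 'ok' : 'degraded',
    mongodb: mongoConnected ? 'connected' : 'disconnected',
  });
});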

3. Webhook Failure Spike

Condition: >5 webhook failures in 5 minutes

Metric: logging.googleapis.com/user/webhook_failures
Threshold: > 5 in 5 minutes
Notification: PagerDuty - High

High Priority Alerts (Slack)

4. Elevated Error Rate

Condition: Error rate >1% for 5 minutes

Metric: run.googleapis.com/request_count
Filter:
  response_code_class: 5xx
Threshold: > 1% of total requests
Window: 5 minutes
Notification: Slack #alerts

5. High Latency

Condition: P95 latency >1000ms for 10 minutes

Metric: run.googleapis.com/request_latencies
Percentile: 95th
Threshold: > 1000ms
Window: 10 minutes
Notification: Slack #alerts

6. Memory Pressure

Condition: Memory utilization >80% for 10 minutes

Metric: run.googleapis.com/container/memory/utilizations
Threshold: > 0.8
Window: 10 minutes
Notification: Slack #alerts
Action: Consider increasing memory allocation

Medium Priority Alerts (Email)

7. Unusual Traffic Pattern

Condition: Request count 3x above baseline

Metric: run.googleapis.com/request_count
Threshold: 3x moving average (7 days)
Window: 30 minutes
Notification: Email - DevOps

8. Slow Queries (MongoDB)

Condition: Query duration >100ms

MongoDB Atlas Alert:

  • Navigate: Atlas → Alerts → Create Alert
  • Metric: Query Targeting: Scanned Objects / Returned
  • Threshold: >100
  • Notification: Email - Backend Team

Dashboards

1. Service Health Dashboard

Create Dashboard:

# Create dashboard JSON config
cat > payment-service-dashboard.json <<EOF
{
  "displayName": "Payment Service Health",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Request Rate",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" resource.labels.service_name=\"payment-service\" metric.type=\"run.googleapis.com/request_count\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              },
              "plotType": "LINE"
            }]
          }
        }
      },
      {
        "width": 6,
        "height": 4,
        "xPos": 6,
        "widget": {
          "title": "Error Rate",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" resource.labels.service_name=\"payment-service\" metric.type=\"run.googleapis.com/request_count\" metric.label.response_code_class=\"5xx\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              },
              "plotType": "LINE"
            }]
          }
        }
      }
    ]
  }
}
EOF

# Import dashboard
gcloud monitoring dashboards create --config-from-file=payment-service-dashboard.json

2. Business Metrics Dashboard

Future Implementation:

Track business-critical events (see the sketch after this list):

  • Payment intents created
  • Successful payments
  • Failed payments (by error type)
  • Refunds processed
  • Webhook events by type
  • User signups (Clerk)
  • Average transaction value
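
A minimal sketch of how these events might be counted, following the OpenTelemetry pattern from the Custom Application Metrics example above. Metric and attribute names here are suggestions, not an agreed schema.

// Sketch: business-event counters via OpenTelemetry (names are illustrative).
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('payment-service');

const paymentsCounter = meter.createCounter('payments_total', {
  description: 'Payment attempts by outcome',
});
const refundsCounter = meter.createCounter('refunds_processed', {
  description: 'Refunds processed',
});

// Example usage inside the payment flow
paymentsCounter.add(1, { outcome: 'succeeded' });
paymentsCounter.add(1, { outcome: 'failed', error_type: 'card_declined' });
refundsCounter.add(1);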

3. Performance Dashboard

Charts:

  • P50/P95/P99 Latency trends
  • Request duration by endpoint
  • MongoDB query performance
  • Instance count over time
  • Cold start frequency

Log Management

Structured Logging

Current Format:

{
  "severity": "INFO",
  "message": "Clerk webhook verified",
  "eventType": "user.created",
  "clerkId": "user_xxx",
  "timestamp": "2025-11-13T10:30:00Z"
}
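
A minimal sketch of how entries in this format might be emitted. On Cloud Run, single-line JSON written to stdout is parsed by Cloud Logging into jsonPayload, and the severity field maps onto the log entry's severity; the helper name below is illustrative.

// Illustrative helper: emit one JSON line per log entry to stdout.
// Cloud Run forwards stdout to Cloud Logging, which parses JSON lines
// into jsonPayload and honours the "severity" field.
function logStructured(
  severity: 'DEBUG' | 'INFO' | 'WARNING' | 'ERROR',
  message: string,
  fields: Record<string, unknown> = {},
): void {
  console.log(
    JSON.stringify({ severity, message, ...fields, timestamp: new Date().toISOString() }),
  );
}

// Produces the entry shown above
logStructured('INFO', 'Clerk webhook verified', {
  eventType: 'user.created',
  clerkId: 'user_xxx',
});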

Log Queries

View Clerk Webhook Events:

resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
jsonPayload.message=~"Clerk webhook"

View Errors:

resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
severity="ERROR"

View Slow Requests:

resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
httpRequest.latency > "1s"

Log Retention

  • Default: 30 days (Google Cloud Logging)
  • Long-term: Export to Cloud Storage or BigQuery

Setup Log Sink:

gcloud logging sinks create payment-service-logs \
  bigquery.googleapis.com/projects/crop-platform/datasets/payment_logs \
  --log-filter='resource.type="cloud_run_revision" resource.labels.service_name="payment-service"'

Log Alerts

Unexpected Errors:

gcloud alpha monitoring policies create \
  --display-name="Payment Service - Unexpected Errors" \
  --condition-log-match='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    severity="ERROR"
    jsonPayload.message!~"Expected error pattern"'

SLA & SLOs

Service Level Agreement (SLA)

  • Uptime: 99.9% monthly (43.8 minutes downtime/month)
  • Response Time: P95 <500ms
  • Data Durability: 99.999% (MongoDB Atlas)

Service Level Objectives (SLOs)

Availability SLO

Target: 99.9% successful requests

Calculation:

SLO = (total_requests - 5xx_errors) / total_requests

Error Budget:

  • Monthly requests: ~10M
  • Allowed errors: 10,000 (0.1%)
  • Alert when 50% budget consumed

Implementation:

SLO Window: 30 days rolling
Measurement: Error rate <0.1%
Alert: When error budget <50% remaining
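
The budget arithmetic above as a small sketch; the observed error count is a made-up example.

// Error-budget sketch using the SLO numbers above (observed5xx is hypothetical).
const sloTarget = 0.999;           // 99.9% availability target
const totalRequests = 10_000_000;  // ~10M requests per 30-day window
const observed5xx = 6_000;         // example measurement, not real data

const errorBudget = totalRequests * (1 - sloTarget);    // 10,000 allowed errors
const budgetRemaining = 1 - observed5xx / errorBudget;  // fraction of budget left

if (budgetRemaining < 0.5) {
  // Mirrors the "alert when 50% budget consumed" rule above
  console.warn(`Error budget low: ${(budgetRemaining * 100).toFixed(1)}% remaining`);
}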

Latency SLO

Target: P95 latency <500ms

Measurement Window: 7 days rolling

Alert: P95 >750ms for 1 hour

Webhook Processing SLO

Target: 99% of webhooks processed within 5 seconds

Measurement:

  • Track receivedAt to processedAt duration (see the sketch below)
  • Alert if processing time >10s or failure rate >1%
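
A minimal sketch of recording that duration with an OpenTelemetry histogram, reusing the meter from the Custom Application Metrics example; the metric name is a suggestion.

// Sketch: record webhook processing time as an OpenTelemetry histogram.
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('payment-service');

// End-to-end webhook processing time in milliseconds
const webhookDuration = meter.createHistogram('webhook_processing_duration_ms', {
  description: 'Time from webhook receipt to completion of processing',
  unit: 'ms',
});

// Hypothetical usage inside the webhook handler
const receivedAt = Date.now();
// ... verify signature, persist event, run business logic ...
const processedAt = Date.now();
webhookDuration.record(processedAt - receivedAt, { type: 'clerk' });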

Monitoring Best Practices

1. Golden Signals

Latency:

  • Track request duration P50, P95, P99
  • Alert on P95 >1000ms sustained

Traffic:

  • Monitor request rate trends
  • Alert on sudden 3x spike or drop

Errors:

  • Track error rate by type (4xx, 5xx)
  • Alert on rate >1%

Saturation:

  • Monitor CPU, memory, MongoDB connections
  • Alert on sustained >80% utilization

2. Alert Fatigue Prevention

  • Use appropriate thresholds (don't alert on noise)
  • Group related alerts (batch similar conditions)
  • Auto-resolve alerts when condition clears
  • Silence during maintenance windows

3. On-Call Rotation

  • Primary: Responds within 15 minutes
  • Secondary: Backup if primary unavailable
  • Escalation: Team lead after 30 minutes

PagerDuty Schedule:

  • Week 1: Engineer A (Primary), Engineer B (Secondary)
  • Week 2: Engineer B (Primary), Engineer C (Secondary)
  • Rotation: Every Monday 9am UTC

Runbook Links

When alerts fire, refer to:


Appendix: Setup Commands

Create Notification Channels

# PagerDuty channel
gcloud alpha monitoring channels create \
  --display-name="PagerDuty - Critical" \
  --type=pagerduty \
  --channel-labels=service_key=YOUR_PAGERDUTY_KEY

# Slack channel
gcloud alpha monitoring channels create \
  --display-name="Slack - Alerts" \
  --type=slack \
  --channel-labels=url=YOUR_SLACK_WEBHOOK_URL

# Email channel
gcloud alpha monitoring channels create \
  --display-name="DevOps Team Email" \
  --type=email \
  --channel-labels=email_address=devops@crop.com

Create Log-Based Metrics

# Webhook failures
gcloud logging metrics create webhook_failures \
  --description="Clerk/Stripe webhook failures" \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    jsonPayload.message=~"webhook.*fail"
    severity="ERROR"'

# MongoDB errors
gcloud logging metrics create mongodb_errors \
  --description="MongoDB connection/query errors" \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"
    jsonPayload.message=~"MongoDB.*error"'

Export Logs to BigQuery

# Create dataset
bq mk --dataset crop-platform:payment_logs

# Create log sink
gcloud logging sinks create payment-logs-bigquery \
  bigquery.googleapis.com/projects/crop-platform/datasets/payment_logs \
  --log-filter='resource.type="cloud_run_revision"
    resource.labels.service_name="payment-service"'

Document Version: 1.0 | Last Review: 2025-11-13 | Next Review: 2025-12-13
