Payment Service Monitoring & Alerting Setup
Version: 1.0 Last Updated: 2025-11-13 Service: payment-service Maintainer: DevOps Team
Overview
Monitoring Stack
- Metrics: Google Cloud Monitoring
- Logs: Google Cloud Logging
- APM: Cloud Trace (future)
- Alerting: Cloud Monitoring Alerts → PagerDuty
- Dashboards: Cloud Monitoring Dashboards
Key Service Indicators
| Metric | Target | Alert Threshold | Impact |
|---|---|---|---|
| Uptime | 99.9% | <99.5% | Critical |
| Error Rate | <0.1% | >1% | High |
| P95 Latency | <500ms | >1000ms | Medium |
| MongoDB Connection | Stable | Disconnected | Critical |
| Webhook Success Rate | >99% | <95% | High |
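The numeric rows of the table can be expressed as data so that a single check covers all of them. The following is a minimal sketch, not part of the service; the indicator names (`uptimePercent`, `errorRatePercent`, and so on) and the `breachedIndicators` helper are hypothetical. The MongoDB connection row is omitted because it is a state rather than a number.

```typescript
// Hypothetical encoding of the indicator table; names are illustrative only.
type Impact = "Critical" | "High" | "Medium";

interface Indicator {
  name: string;
  impact: Impact;
  // Returns true when the observed value crosses the alert threshold.
  breached: (value: number) => boolean;
}

const indicators: Indicator[] = [
  { name: "uptimePercent", impact: "Critical", breached: (v) => v < 99.5 },
  { name: "errorRatePercent", impact: "High", breached: (v) => v > 1 },
  { name: "p95LatencyMs", impact: "Medium", breached: (v) => v > 1000 },
  { name: "webhookSuccessPercent", impact: "High", breached: (v) => v < 95 },
];

// Returns the indicators whose alert threshold is crossed in the snapshot.
// Indicators missing from the snapshot are skipped rather than treated as breached.
function breachedIndicators(snapshot: Record<string, number>): Indicator[] {
  return indicators.filter((i) => {
    const value = snapshot[i.name];
    return value !== undefined && i.breached(value);
  });
}
```

For example, a snapshot with a 2% error rate and every other value inside its threshold returns only the `errorRatePercent` entry.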
Metrics & Dashboards
Cloud Run Metrics
Navigate: Cloud Run Metrics
Request Metrics
- run.googleapis.com/request_count: Total requests
- run.googleapis.com/request_latencies: Request duration (P50, P95, P99)
- run.googleapis.com/request_count (by response_code): Success/error breakdown
Resource Metrics
- run.googleapis.com/container/memory/utilizations: Memory usage
- run.googleapis.com/container/cpu/utilizations: CPU usage
- run.googleapis.com/container/instance_count: Active instances
- run.googleapis.com/container/billable_instance_time: Cost tracking
Custom Application Metrics
Future Implementation:
// Example: Instrument with OpenTelemetry
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('payment-service');
// Webhook processing counter
const webhookCounter = meter.createCounter('webhook_processed', {
  description: 'Number of webhooks processed',
});
webhookCounter.add(1, { type: 'clerk', event: 'user.created' });
MongoDB Metrics (Atlas)
Navigate: MongoDB Atlas Monitoring
- Connections: Current / Available
- Query Performance: Slow queries (>100ms)
- Storage: Database size, index usage
- Alerts: Connection spikes, slow queries
Alerting Rules
Critical Alerts (Page On-Call)
1. Service Down
Condition: Health check failing for 2 minutes
Metric: run.googleapis.com/request_count
Filter:
service_name: payment-service
response_code_class: 5xx
Threshold: > 10 requests in 2 minutes
Notification: PagerDuty - Critical
Gcloud Command:
gcloud alpha monitoring policies create \
  --display-name="Payment Service - High Error Rate" \
  --condition-display-name="5xx errors > 10 in 2 minutes" \
  --condition-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="payment-service" AND metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
  --aggregation='{"alignmentPeriod": "120s", "perSeriesAligner": "ALIGN_SUM", "crossSeriesReducer": "REDUCE_SUM"}' \
  --if='> 10' \
  --duration=0s \
  --notification-channels=PAGERDUTY_CHANNEL_ID
2. MongoDB Connection Lost
Condition: Health check returns 503 with "mongodb: disconnected"
Metric: logging.googleapis.com/user/mongodb_connection_error
Threshold: > 0 in 1 minute
Notification: PagerDuty - Critical
Log-based Metric:
gcloud logging metrics create mongodb_connection_error \
--description="MongoDB connection failures" \
--log-filter='resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
jsonPayload.message=~"MongoDB.*disconnect"'
3. Webhook Failure Spike
Condition: >5 webhook failures in 5 minutes
Metric: logging.googleapis.com/user/webhook_failures
Threshold: > 5 in 5 minutes
Notification: PagerDuty - High
High Priority Alerts (Slack)
4. Elevated Error Rate
Condition: Error rate >1% for 5 minutes
Metric: run.googleapis.com/request_count
Filter:
response_code_class: 5xx
Threshold: > 1% of total requests
Window: 5 minutes
Notification: Slack #alerts
5. High Latency
Condition: P95 latency >1000ms for 10 minutes
Metric: run.googleapis.com/request_latencies
Percentile: 95th
Threshold: > 1000ms
Window: 10 minutes
Notification: Slack #alerts
6. Memory Pressure
Condition: Memory utilization >80% for 10 minutes
Metric: run.googleapis.com/container/memory/utilizations
Threshold: > 0.8
Window: 10 minutes
Notification: Slack #alerts
Action: Consider increasing memory allocation
Medium Priority Alerts (Email)
7. Unusual Traffic Pattern
Condition: Request count 3x above baseline
Metric: run.googleapis.com/request_count
Threshold: 3x moving average (7 days)
Window: 30 minutes
Notification: Email - DevOps
8. Slow Queries (MongoDB)
Condition: Query duration >100ms
MongoDB Atlas Alert:
- Navigate: Atlas → Alerts → Create Alert
- Metric: Query Targeting: Scanned Objects / Returned
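- Threshold: >100 (ratio of documents scanned to documents returned, not milliseconds; a high ratio usually points to unindexed, slow queries)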
- Notification: Email - Backend Team
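Two of the rules above are ratio-based rather than simple counts: alert 4 (5xx responses as a percentage of all requests) and alert 7 (traffic at 3x a moving-average baseline). The sketch below shows the arithmetic only, assuming the request counts have already been fetched; `errorRatePercent` and `isTrafficSpike` are hypothetical helper names, and Cloud Monitoring evaluates these conditions itself once the policies exist.

```typescript
// Alert 4: 5xx share of total requests over the window, as a percentage.
function errorRatePercent(totalRequests: number, errors5xx: number): number {
  // With no traffic there is nothing to divide by; report 0% instead of NaN.
  return totalRequests === 0 ? 0 : (errors5xx / totalRequests) * 100;
}

// Alert 7: compare the current window against a multiple of the baseline mean.
// `baseline` holds request counts for comparable past windows (e.g. the last 7 days).
function isTrafficSpike(current: number, baseline: number[], factor = 3): boolean {
  if (baseline.length === 0) return false; // no history, so no baseline to compare with
  const mean = baseline.reduce((sum, v) => sum + v, 0) / baseline.length;
  return current > factor * mean;
}
```

With 30 errors out of 2,000 requests the rate is above the 1% threshold, and a window of 3,100 requests against a baseline mean of 1,000 counts as a spike.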
Dashboards
1. Service Health Dashboard
Create Dashboard:
# Create dashboard JSON config
cat > payment-service-dashboard.json <<EOF
{
"displayName": "Payment Service Health",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "Request Rate",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" resource.labels.service_name=\"payment-service\" metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
},
"plotType": "LINE"
}]
}
}
},
{
"width": 6,
"height": 4,
"xPos": 6,
"widget": {
"title": "Error Rate",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" resource.labels.service_name=\"payment-service\" metric.type=\"run.googleapis.com/request_count\" metric.label.response_code_class=\"5xx\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
},
"plotType": "LINE"
}]
}
}
}
]
}
}
EOF
# Import dashboard
gcloud monitoring dashboards create --config-from-file=payment-service-dashboard.json
2. Business Metrics Dashboard
Future Implementation:
Track business-critical events:
- Payment intents created
- Successful payments
- Failed payments (by error type)
- Refunds processed
- Webhook events by type
- User signups (Clerk)
- Average transaction value
3. Performance Dashboard
Charts:
- P50/P95/P99 Latency trends
- Request duration by endpoint
- MongoDB query performance
- Instance count over time
- Cold start frequency
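To make the P50/P95/P99 charts concrete, here is a minimal sketch of a nearest-rank percentile over raw latency samples. It is an illustration only: Cloud Monitoring derives percentiles from distribution buckets rather than raw samples, and the `percentile` helper is hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, input left untouched
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Ten request latencies in milliseconds (made-up values).
const latenciesMs = [120, 80, 450, 95, 1100, 200, 300, 150, 90, 600];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
```

With only ten samples, P95 and P99 both fall on the slowest request, which is why tail percentiles need a reasonably large window to be meaningful.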
Log Management
Structured Logging
Current Format:
{
"severity": "INFO",
"message": "Clerk webhook verified",
"eventType": "user.created",
"clerkId": "user_xxx",
"timestamp": "2025-11-13T10:30:00Z"
}
Log Queries
View Clerk Webhook Events:
resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
jsonPayload.message=~"Clerk webhook"
View Errors:
resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
severity="ERROR"
View Slow Requests:
resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
httpRequest.latency > "1s"
Log Retention
- Default: 30 days (Google Cloud Logging)
- Long-term: Export to Cloud Storage or BigQuery
Setup Log Sink:
gcloud logging sinks create payment-service-logs \
bigquery.googleapis.com/projects/crop-platform/datasets/payment_logs \
  --log-filter='resource.type="cloud_run_revision" resource.labels.service_name="payment-service"'
Log Alerts
Unexpected Errors:
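# Log-match conditions are defined in a policy file. The 300s rate limit is an assumed value.
cat > unexpected-errors-policy.json <<'EOF'
{
  "displayName": "Payment Service - Unexpected Errors",
  "combiner": "OR",
  "conditions": [{
    "displayName": "Unexpected ERROR log entry",
    "conditionMatchedLog": {"filter": "resource.type=\"cloud_run_revision\" resource.labels.service_name=\"payment-service\" severity=\"ERROR\" jsonPayload.message!~\"Expected error pattern\""}
  }],
  "alertStrategy": {"notificationRateLimit": {"period": "300s"}}
}
EOF
gcloud alpha monitoring policies create --policy-from-file=unexpected-errors-policy.json
SLA & SLOs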
Service Level Agreement (SLA)
- Uptime: 99.9% monthly (43.8 minutes downtime/month)
- Response Time: P95 <500ms
- Data Durability: 99.999% (MongoDB Atlas)
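The downtime figure follows directly from the uptime percentage. A minimal sketch of the arithmetic is below; `allowedDowntimeMinutes` is a hypothetical helper. The 43.8-minute figure corresponds to an average month of 365.25 / 12 days, while a 30-day month gives 43.2 minutes.

```typescript
// Downtime allowance in minutes for a given availability target and period length.
function allowedDowntimeMinutes(availability: number, daysInPeriod: number): number {
  const minutesInPeriod = daysInPeriod * 24 * 60;
  return (1 - availability) * minutesInPeriod;
}

const thirtyDayMonth = allowedDowntimeMinutes(0.999, 30);          // about 43.2 minutes
const averageMonth = allowedDowntimeMinutes(0.999, 365.25 / 12);   // about 43.8 minutes
```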
Service Level Objectives (SLOs)
Availability SLO
Target: 99.9% successful requests
Calculation:
SLO = (total_requests - 5xx_errors) / total_requests
Error Budget:
- Monthly requests: ~10M
- Allowed errors: 10,000 (0.1%)
- Alert when 50% budget consumed
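The error budget numbers above can be reproduced with a few lines of arithmetic. This is a sketch under the stated figures (about 10M requests per month, 99.9% target, alert at 50% consumed); the `errorBudget` function and its field names are hypothetical.

```typescript
interface ErrorBudget {
  allowedErrors: number;     // total 5xx responses the SLO tolerates in the window
  consumedFraction: number;  // share of the budget already used (0.6 = 60%)
  shouldAlert: boolean;      // true once consumption reaches the alert threshold
}

function errorBudget(
  totalRequests: number,
  errors5xx: number,
  sloTarget = 0.999,
  alertAtConsumed = 0.5,
): ErrorBudget {
  // Round to avoid floating-point noise such as 10000.000000000009.
  const allowedErrors = Math.round(totalRequests * (1 - sloTarget));
  const consumedFraction =
    allowedErrors === 0 ? (errors5xx > 0 ? 1 : 0) : errors5xx / allowedErrors;
  return {
    allowedErrors,
    consumedFraction,
    shouldAlert: consumedFraction >= alertAtConsumed,
  };
}

// 6,000 errors against a 10,000-error budget: 60% consumed, so the alert fires.
const budget = errorBudget(10000000, 6000);
```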
Implementation:
SLO Window: 30 days rolling
Measurement: Error rate <0.1%
Alert: When error budget <50% remaining
Latency SLO
Target: P95 latency <500ms
Measurement Window: 7 days rolling
Alert: P95 >750ms for 1 hour
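The "for 1 hour" part of the alert means the threshold must be exceeded continuously, not just once. A minimal sketch of that check follows, assuming one P95 value per minute is available, oldest first; the input shape and the `sustainedBreach` helper are hypothetical.

```typescript
// Returns true only if every P95 value in the most recent window exceeds the threshold.
// p95Series: one P95 latency value (ms) per minute, oldest first.
function sustainedBreach(
  p95Series: number[],
  thresholdMs = 750,
  windowMinutes = 60,
): boolean {
  // Not enough data to cover the full window: do not alert.
  if (p95Series.length < windowMinutes) return false;
  return p95Series.slice(-windowMinutes).every((v) => v > thresholdMs);
}
```

A single minute back under 750ms inside the window resets the condition, which keeps short latency spikes from triggering the alert.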
Webhook Processing SLO
Target: 99% of webhooks processed within 5 seconds
Measurement:
- Track receivedAt to processedAt duration
- Alert if processing time >10s or failure rate >1%
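The measurement above can be sketched as follows, assuming each webhook is recorded with ISO 8601 `receivedAt` and `processedAt` timestamps. The `WebhookRecord` shape, its `failed` flag, and the `webhookSlo` helper are hypothetical and not taken from the service code.

```typescript
interface WebhookRecord {
  receivedAt: string;   // ISO 8601 timestamp when the webhook arrived
  processedAt: string;  // ISO 8601 timestamp when processing finished
  failed: boolean;      // hypothetical flag: processing ended in an error
}

function webhookSlo(records: WebhookRecord[]) {
  const durationsMs = records.map(
    (r) => Date.parse(r.processedAt) - Date.parse(r.receivedAt),
  );
  const n = records.length;
  const within5s = durationsMs.filter((d) => d <= 5000).length;
  const failures = records.filter((r) => r.failed).length;
  const withinTargetFraction = n === 0 ? 1 : within5s / n;
  const failureRate = n === 0 ? 0 : failures / n;
  const maxDurationMs = n === 0 ? 0 : Math.max(...durationsMs);
  return {
    withinTargetFraction,                  // SLO target: >= 0.99
    failureRate,
    maxDurationMs,
    // Alert rule from the list above: processing time >10s or failure rate >1%.
    shouldAlert: maxDurationMs > 10000 || failureRate > 0.01,
  };
}
```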
Monitoring Best Practices
1. Golden Signals
Latency:
- Track request duration P50, P95, P99
- Alert on P95 >1000ms sustained
Traffic:
- Monitor request rate trends
- Alert on sudden 3x spike or drop
Errors:
- Track error rate by type (4xx, 5xx)
- Alert on rate >1%
Saturation:
- Monitor CPU, memory, MongoDB connections
- Alert on sustained >80% utilization
2. Alert Fatigue Prevention
- Use appropriate thresholds (don't alert on noise)
- Group related alerts (batch similar conditions)
- Auto-resolve alerts when condition clears
- Silence during maintenance windows
3. On-Call Rotation
- Primary: Responds within 15 minutes
- Secondary: Backup if primary unavailable
- Escalation: Team lead after 30 minutes
PagerDuty Schedule:
- Week 1: Engineer A (Primary), Engineer B (Secondary)
- Week 2: Engineer B (Primary), Engineer C (Secondary)
- Rotation: Every Monday 9am UTC
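The schedule can be read as a weekly cycle in which the secondary becomes the next primary. The sketch below assumes the pattern continues past week 2 (Engineer C primary, Engineer A secondary) and that "Week 1" starts on Monday 2025-11-17 at 09:00 UTC; both are assumptions, not facts from the schedule, and PagerDuty remains the source of truth.

```typescript
const engineers = ["Engineer A", "Engineer B", "Engineer C"];

// Assumed start of "Week 1": Monday 2025-11-17 09:00 UTC (hypothetical anchor date).
const rotationStartMs = Date.UTC(2025, 10, 17, 9, 0, 0); // month is 0-based: 10 = November
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function onCall(at: Date): { primary: string; secondary: string } {
  // Whole weeks elapsed since the anchor; handovers happen every Monday 09:00 UTC.
  const weeks = Math.floor((at.getTime() - rotationStartMs) / WEEK_MS);
  const n = engineers.length;
  const idx = ((weeks % n) + n) % n; // stays non-negative for dates before the anchor
  return { primary: engineers[idx], secondary: engineers[(idx + 1) % n] };
}
```

Working in UTC milliseconds avoids daylight-saving shifts, so the handover always lands on the same UTC hour.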
Runbook Links
When alerts fire, refer to:
- High Error Rate: Troubleshooting Guide
- MongoDB Issues: Database Troubleshooting
- Webhook Failures: Webhook Troubleshooting
- Rollback Procedures: Rollback Guide
Appendix: Setup Commands
Create Notification Channels
# PagerDuty channel
gcloud alpha monitoring channels create \
--display-name="PagerDuty - Critical" \
--type=pagerduty \
--channel-labels=service_key=YOUR_PAGERDUTY_KEY
# Slack channel
gcloud alpha monitoring channels create \
--display-name="Slack - Alerts" \
--type=slack \
--channel-labels=url=YOUR_SLACK_WEBHOOK_URL
# Email channel
gcloud alpha monitoring channels create \
--display-name="DevOps Team Email" \
--type=email \
  --channel-labels=email_address=devops@crop.com
Create Log-Based Metrics
# Webhook failures
gcloud logging metrics create webhook_failures \
--description="Clerk/Stripe webhook failures" \
--log-filter='resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
jsonPayload.message=~"webhook.*fail"
severity="ERROR"'
# MongoDB errors
gcloud logging metrics create mongodb_errors \
--description="MongoDB connection/query errors" \
--log-filter='resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"
jsonPayload.message=~"MongoDB.*error"'
Export Logs to BigQuery
# Create dataset
bq mk --dataset crop-platform:payment_logs
# Create log sink
gcloud logging sinks create payment-logs-bigquery \
bigquery.googleapis.com/projects/crop-platform/datasets/payment_logs \
--log-filter='resource.type="cloud_run_revision"
resource.labels.service_name="payment-service"'
Document Version: 1.0 Last Review: 2025-11-13 Next Review: 2025-12-13