Deployment Runbook

Last Updated: 2025-11-12
Owner: DevOps Team
Incident Contact: @vova

Overview

This runbook covers deployment, rollback, and emergency procedures for the CROP microservices platform.

Table of Contents

Pre-Deployment Checklist
Normal Deployment
Emergency Rollback
Health Checks
Troubleshooting
Post-Deployment

Pre-Deployment Checklist

Required Checks

All CI/CD checks passing (tests, lint, security scans)
PR approved and merged to main
Database migrations tested (if applicable)
Secrets rotated if needed (check Secret Manager)
Monitoring dashboards open
Team notified in Slack

Security Gates

Automated security checks (must pass):

✅ Gitleaks - No secrets in code
✅ Trivy - No CRITICAL/HIGH vulnerabilities in container
✅ OSV-Scanner - No known CVEs in dependencies
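
The same three gates can be run locally before opening a PR. The following is a minimal sketch, not the CI configuration itself: it assumes gitleaks v8, trivy, and osv-scanner v1 are installed, and the subcommands and flags may differ in other versions. The function name `run_security_gates` is hypothetical.

```shell
# Sketch: run the three security gates locally and fail if any of them fails.
# Assumes gitleaks v8, trivy, and osv-scanner v1 are on PATH.
run_security_gates() {
  local image="$1"   # container image to scan (the tag built for this deploy)
  local failed=0

  # Gitleaks: scan the repository for committed secrets.
  gitleaks detect --source . || failed=1

  # Trivy: exit non-zero only on CRITICAL/HIGH findings in the image.
  trivy image --severity CRITICAL,HIGH --exit-code 1 "$image" || failed=1

  # OSV-Scanner: check dependency manifests and lockfiles recursively.
  osv-scanner -r . || failed=1

  if [ "$failed" -ne 0 ]; then
    echo "Security gates FAILED" >&2
    return 1
  fi
  echo "Security gates passed"
}
```

Example: `run_security_gates "gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD)"`. All three scanners run even if an earlier one fails, so a single pass reports every problem.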

Stakeholder Notification

📦 Deployment: search-service v1.2.3
⏰ ETA: 2025-11-12 14:00 UTC
🔗 PR: https://github.com/MyWatson/CROP-parts-services/pull/123
👤 Deployer: @vova
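
To keep notices consistent, the message above can be generated rather than typed by hand. This is a small sketch; the function name `format_deploy_notice` is hypothetical, and how the text reaches Slack is left to the deployer.

```shell
# Sketch: print a deployment notice in the format shown above.
format_deploy_notice() {
  local service="$1" version="$2" eta="$3" pr_url="$4" deployer="$5"
  printf '📦 Deployment: %s %s\n' "$service" "$version"
  printf '⏰ ETA: %s\n' "$eta"
  printf '🔗 PR: %s\n' "$pr_url"
  printf '👤 Deployer: %s\n' "$deployer"
}
```

Example: `format_deploy_notice search-service v1.2.3 "2025-11-12 14:00 UTC" "https://github.com/MyWatson/CROP-parts-services/pull/123" "@vova"`.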

Normal Deployment

Automatic Deployment (Recommended)

Deployments are triggered automatically on merge to main:

Merge PR to main branch
GitHub Actions then:
- Builds Docker image
- Scans with Trivy
- Pushes to GCR
- Deploys to Cloud Run (no-traffic)
- Runs smoke tests
- Gradually shifts traffic (1% → 10% → 100%)
Monitor dashboards during rollout
Verify health checks pass

Timeline: ~15-20 minutes from merge to 100% traffic

Manual Deployment (Emergency)

If CI/CD is down:

# 1. Build and push image
cd services/search
docker build -t gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD) .
docker push gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD)

# 2. Deploy with no traffic
gcloud run deploy search-service \
  --region=us-east1 \
  --image=gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD) \
  --no-traffic \
  --tag=manual-deploy

# 3. Test preview URL
PREVIEW_URL="https://manual-deploy---search-service-222426967009.us-east1.run.app"
curl -f "$PREVIEW_URL/health"

# 4. Shift traffic gradually (replace search-service-REVISION with the new revision name)
gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions=search-service-REVISION=1

sleep 30 && gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions=search-service-REVISION=10

sleep 30 && gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions=LATEST=100
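
Step 4 can also be written as a loop that checks health between steps and stops on the first failure. This is a sketch under two assumptions: the revision was deployed with `--tag` as in step 2, so traffic can be addressed with `--to-tags` instead of a revision name, and `/health` on the live URL is a good enough gate. The function name `shift_traffic_gradually` is hypothetical.

```shell
# Sketch: shift traffic to a tagged revision in 1% -> 10% -> 100% steps,
# verifying the health endpoint after each step.
shift_traffic_gradually() {
  local service="$1" region="$2" tag="$3" health_url="$4"
  local wait_seconds="${5:-30}"   # pause between steps, as in the manual commands
  local percent

  for percent in 1 10 100; do
    echo "Shifting ${percent}% of traffic to tag ${tag}"
    gcloud run services update-traffic "$service" \
      --region="$region" \
      --to-tags="${tag}=${percent}" || return 1

    sleep "$wait_seconds"

    # -f makes curl fail on HTTP 4xx/5xx, which aborts the rollout here.
    if ! curl -fsS "$health_url" > /dev/null; then
      echo "Health check failed at ${percent}%; stop and roll back" >&2
      return 1
    fi
  done
}
```

Example: `shift_traffic_gradually search-service us-east1 manual-deploy "https://search-service-222426967009.us-east1.run.app/health"`. A non-zero return leaves traffic at the last applied split, so follow it with the Emergency Rollback steps.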

Emergency Rollback

Quick Rollback (< 5 minutes)

When to use:

5xx errors > 1%
Latency p95 > 2s
Failed smoke tests
Critical bug discovered

Steps:

# 1. Identify the current and previous revisions
CURRENT=$(gcloud run services describe search-service \
  --region=us-east1 \
  --format='value(status.traffic[0].revisionName)')

# List newest first so the second entry is the previous revision
PREVIOUS=$(gcloud run revisions list \
  --service=search-service \
  --region=us-east1 \
  --sort-by='~metadata.creationTimestamp' \
  --format='value(name)' \
  --limit=2 | tail -1)

echo "Current: $CURRENT"
echo "Rolling back to: $PREVIOUS"

# 2. Instant rollback (100% traffic to previous)
gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions="${PREVIOUS}=100"

# 3. Verify
curl -f https://search-service-222426967009.us-east1.run.app/health

# 4. Notify team
echo "🚨 ROLLBACK: search-service → $PREVIOUS"

Recovery Time Objective (RTO): < 5 minutes
Recovery Point Objective (RPO): Last stable revision
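
Taking the second entry of the revision list only works when the faulty revision is the newest one. A slightly more defensive sketch is shown below: it reads a newest-first list of revision names and prints the one created immediately before the revision being rolled back. The function name `pick_previous_revision` is hypothetical.

```shell
# Sketch: given the faulty revision, print the next-older revision name.
# Expects one revision name per line on stdin, newest first.
pick_previous_revision() {
  local bad="$1"
  local name seen=0

  while IFS= read -r name; do
    # The first non-empty name after the faulty revision is the rollback target.
    if [ "$seen" -eq 1 ] && [ -n "$name" ]; then
      printf '%s\n' "$name"
      return 0
    fi
    if [ "$name" = "$bad" ]; then
      seen=1
    fi
  done

  return 1   # faulty revision not found, or nothing older exists
}
```

Example: pipe the output of `gcloud run revisions list --service=search-service --region=us-east1 --sort-by='~metadata.creationTimestamp' --format='value(name)'` into `pick_previous_revision "$CURRENT"`. A non-zero exit means there is no older revision to roll back to, and the target must be chosen manually.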

Rollback via GitHub Actions

If an automated rollback was triggered by failed smoke tests:

Check GitHub Actions logs
Verify rollback completed
Review error logs: gcloud logging read
Create incident ticket

Health Checks

Service Health Endpoints

Endpoint	Purpose	Expected Response
`/health`	Shallow health check	`{"ok":true}`
`/ready`	Deep health check (DB/ES)	`{"ok":true,"mongodb":"ok","elasticsearch":"ok"}`
`/live`	Liveness probe	`200 OK`
`/metrics`	Prometheus metrics	Metrics in text format
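
A freshly deployed revision may need a few seconds before these endpoints respond, so a single `curl` can give a false alarm. The sketch below polls an endpoint with a bounded number of retries; the function name `wait_healthy` and the default retry values are assumptions, not part of the service.

```shell
# Sketch: poll a health endpoint until it succeeds or attempts run out.
wait_healthy() {
  local url="$1"
  local attempts="${2:-5}"
  local delay_seconds="${3:-3}"
  local i

  for i in $(seq 1 "$attempts"); do
    # -f fails on HTTP 4xx/5xx; -sS hides progress output but keeps errors.
    if curl -fsS --max-time 5 "$url" > /dev/null; then
      echo "healthy after ${i} attempt(s): ${url}"
      return 0
    fi
    # Pause before the next attempt, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay_seconds"
    fi
  done

  echo "still unhealthy after ${attempts} attempts: ${url}" >&2
  return 1
}
```

Example: `wait_healthy "https://search-service-222426967009.us-east1.run.app/ready" 10 5`.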

Manual Health Verification

SERVICE_URL="https://search-service-222426967009.us-east1.run.app"

# 1. Shallow health
curl -f "$SERVICE_URL/health"

# 2. Deep health (requires services)
curl -f "$SERVICE_URL/ready?deep=true"

# 3. Smoke test
bash services/search/scripts/smoke-test.sh "$SERVICE_URL"

# 4. Check error rate
gcloud logging read "resource.type=cloud_run_revision \
  AND resource.labels.service_name=search-service \
  AND severity>=ERROR" \
  --limit=10 \
  --format=json | jq -r '.[].textPayload'

Monitoring Dashboards

Cloud Run Metrics: https://console.cloud.google.com/run/detail/us-east1/search-service
Logs: https://console.cloud.google.com/logs/query

Key Metrics:

Request count
Request latency (p50, p95, p99)
Error rate (5xx)
Instance count
CPU/Memory utilization

Troubleshooting

Issue: Container Fails to Start

Symptoms:

Health checks failing
Container exits immediately
Error logs show initialization failures

Diagnosis:

# Check latest logs
gcloud run services logs read search-service \
  --region=us-east1 \
  --limit=50

# Check secret access (output discarded so the value is not printed)
gcloud secrets versions access latest --secret=MONGODB_URI > /dev/null \
  && echo "MONGODB_URI is accessible"

# Test locally
docker run --rm -e MONGODB_URI="..." \
  gcr.io/noted-bliss-466410-q6/search-service:latest

Resolution:

Fix configuration/secrets
Redeploy
If urgent: roll back to the previous revision

Issue: High Latency

Symptoms:

p95 latency > 1s
User complaints
Timeout errors

Diagnosis:

# Check Elasticsearch health
curl -X GET "http://10.0.0.52:9200/_cluster/health"

# Check MongoDB connection
mongosh "$MONGODB_URI" --eval "db.adminCommand('ping')"

# Check Cloud Run metrics
gcloud run services describe search-service \
  --region=us-east1 \
  --format='value(status.conditions)'
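
The cluster health response is JSON with a `status` field of `green`, `yellow`, or `red`. When scanning it during an incident, extracting just that field can help. The sketch below avoids a `jq` dependency; the function name `es_cluster_status` is hypothetical.

```shell
# Sketch: print the "status" value from an Elasticsearch _cluster/health response.
# Reads the JSON body on stdin.
es_cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}
```

Example: `curl -s "http://10.0.0.52:9200/_cluster/health" | es_cluster_status`. A `yellow` status means some replica shards are unassigned; `red` means at least one primary shard is unassigned, and search results may be incomplete.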

Resolution:

Scale up instances: Update Terraform config
Increase CPU/memory: Update Cloud Run config
Optimize queries: Review ES_LOG_QUERIES=true logs
Check VPC connector throughput

Issue: Database Connection Failures

Symptoms:

/ready returns {"mongodb":"error"}
503 Service Unavailable errors

Diagnosis:

# Check MongoDB connection from Cloud Run
gcloud run services proxy search-service --region=us-east1

# In another terminal
curl -s "localhost:8080/ready?deep=true" | jq .

# Check VPC connector
gcloud compute networks vpc-access connectors describe crop-connector \
  --region=us-east1

Resolution:

Verify VPC connector status
Check the MongoDB Atlas IP access list
Verify Secret Manager secrets
Test connection from Cloud Shell
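
The first resolution step can be reduced to a pass/fail check on the connector's `state` field. This is a sketch; the function name `connector_ready` is hypothetical, and it assumes the caller has permission to describe the connector.

```shell
# Sketch: succeed only when the Serverless VPC Access connector reports READY.
connector_ready() {
  local connector="$1" region="$2"
  local state

  state=$(gcloud compute networks vpc-access connectors describe "$connector" \
    --region="$region" \
    --format='value(state)') || return 1

  echo "connector ${connector}: ${state}"
  [ "$state" = "READY" ]
}
```

Example: `connector_ready crop-connector us-east1 || echo "Investigate the VPC connector first"`.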

Post-Deployment

Verification Checklist

Monitoring Period

First 30 minutes:

Watch dashboards continuously
Monitor error logs
Check user reports

First 24 hours:

Review metrics hourly
Check for memory leaks (gradual memory increase)
Monitor instance scaling patterns
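
For the memory-leak check, a rough growth figure is often easier to judge than a chart. The sketch below computes the percentage change between the first and last of a series of memory samples. The function name `memory_growth_pct` is hypothetical, and collecting the hourly samples (for example from the Cloud Run metrics dashboard) is left to the reader.

```shell
# Sketch: percentage growth between the first and last memory sample.
# Reads one numeric sample per line on stdin (any consistent unit).
memory_growth_pct() {
  awk 'NF { if (!n++) first = $1; last = $1 }
       END {
         # At least two samples and a non-zero baseline are required.
         if (n < 2 || first == 0) exit 1
         printf "%.1f\n", (last - first) / first * 100
       }'
}
```

Example: `printf '200\n210\n230\n250\n' | memory_growth_pct` prints `25.0`. Steady growth across many hours under flat traffic is the pattern to look for; a single jump after a traffic spike is usually not a leak.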

Incident Response

If issues detected post-deployment:

Assess severity (P0 = roll back immediately, P1 = investigate)
Gather evidence (logs, metrics, screenshots)
Create incident in Linear with label Incident
Notify team in Slack #incidents
Execute rollback if needed (see Emergency Rollback)
Post-mortem within 48 hours

Success Criteria

✅ Deployment successful if:

Zero 5xx errors for 1 hour
Latency within SLA
All health checks green
No user complaints
Metrics stable

Contacts

On-Call Rotation:

Primary: @vova
Secondary: @team-lead

Escalation Path:

DevOps team (#devops)
Engineering lead
CTO

External Dependencies:

MongoDB Atlas: https://cloud.mongodb.com/
GCP Support: https://console.cloud.google.com/support
Elasticsearch: Internal VPC (10.0.0.52:9200)

Appendix

Useful Commands

# Get current revision
gcloud run services describe search-service \
  --region=us-east1 \
  --format='value(status.latestReadyRevisionName)'

# List all revisions
gcloud run revisions list \
  --service=search-service \
  --region=us-east1

# Get revision traffic split
gcloud run services describe search-service \
  --region=us-east1 \
  --format='table(status.traffic[].revisionName,status.traffic[].percent)'

# Stream logs
gcloud run services logs tail search-service --region=us-east1

# Proxy the service to localhost:8080 for authenticated local requests
gcloud run services proxy search-service --region=us-east1

Document Version: 1.0
Last Reviewed: 2025-11-12
Next Review: 2025-12-12
