Deployment Runbook

Last Updated: 2025-11-12
Owner: DevOps Team
Incident Contact: @vova


Overview

This runbook covers deployment procedures, rollback strategies, and emergency procedures for the CROP microservices platform.

Pre-Deployment Checklist

Required Checks

  • All CI/CD checks passing (tests, lint, security scans)
  • PR approved and merged to main
  • Database migrations tested (if applicable)
  • Secrets rotated if needed (check Secret Manager)
  • Monitoring dashboards open
  • Team notified in Slack

Security Gates

Automated security checks (must pass):

  • Gitleaks - No secrets in code
  • Trivy - No CRITICAL/HIGH vulnerabilities in container
  • OSV-Scanner - No known CVEs in dependencies
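
The Trivy gate can also be run locally before opening a PR. The following is a minimal sketch, not the pipeline's actual configuration: the function name `trivy_gate` and the `DRY_RUN` switch are hypothetical, and it assumes the Trivy CLI is installed.

```shell
# Sketch: fail (non-zero exit) when Trivy finds CRITICAL or HIGH vulnerabilities.
# Hypothetical helper; DRY_RUN=1 prints the command instead of running it.
# Usage: trivy_gate IMAGE
trivy_gate() {
  local image="$1"
  # Build the command once so the dry-run output matches what would be executed.
  set -- trivy image --exit-code 1 --severity CRITICAL,HIGH "$image"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `trivy_gate "gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD)"` would scan the image built from the current commit.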

Stakeholder Notification

📦 Deployment: search-service v1.2.3
⏰ ETA: 2025-11-12 14:00 UTC
🔗 PR: https://github.com/MyWatson/CROP-parts-services/pull/123
👤 Deployer: @vova
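
To keep notices consistent, the message above could be generated rather than typed by hand. This is a small sketch; `format_deploy_notice` is a hypothetical helper, and how the text is posted to Slack is left open.

```shell
# Hypothetical helper: print a deployment notice in the format shown above.
# Usage: format_deploy_notice SERVICE VERSION ETA PR_URL DEPLOYER
format_deploy_notice() {
  local service="$1" version="$2" eta="$3" pr_url="$4" deployer="$5"
  printf '📦 Deployment: %s %s\n' "$service" "$version"
  printf '⏰ ETA: %s\n' "$eta"
  printf '🔗 PR: %s\n' "$pr_url"
  printf '👤 Deployer: %s\n' "$deployer"
}
```

For example, `format_deploy_notice search-service v1.2.3 "2025-11-12 14:00 UTC" "$PR_URL" @vova` prints the four lines ready to paste into the channel.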

Normal Deployment

Deployments are triggered automatically on merge to main:

  1. Merge PR to main branch
  2. GitHub Actions runs:
    • Builds Docker image
    • Scans with Trivy
    • Pushes to GCR
    • Deploys to Cloud Run (no-traffic)
    • Runs smoke tests
    • Gradually shifts traffic (1% → 10% → 100%)
  3. Monitor dashboards during rollout
  4. Verify health checks pass

Timeline: ~15-20 minutes from merge to 100% traffic

Manual Deployment (Emergency)

If CI/CD is down:

# 1. Build and push image
cd services/search
docker build -t gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD) .
docker push gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD)

# 2. Deploy with no traffic
gcloud run deploy search-service \
  --region=us-east1 \
  --image=gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD) \
  --no-traffic \
  --tag=manual-deploy

# 3. Test preview URL
PREVIEW_URL="https://manual-deploy---search-service-222426967009.us-east1.run.app"
curl -f "$PREVIEW_URL/health"

# 4. Shift traffic gradually
# Replace search-service-REVISION with the name of the revision created in step 2
gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions=search-service-REVISION=1

sleep 30 && gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions=search-service-REVISION=10

sleep 30 && gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions=LATEST=100
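
The staged shift above can be wrapped in a loop that stops as soon as a health check fails. This is a sketch under assumptions: `run` and `shift_traffic` are hypothetical helpers, `DRY_RUN=1` only prints the commands, and traffic is addressed by the tag set in step 2 (`--to-tags`) instead of by revision name.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Sketch: move traffic to a tagged revision in stages.
# Usage: shift_traffic SERVICE REGION TAG HEALTH_URL PAUSE_SECONDS PERCENT...
shift_traffic() {
  local service="$1" region="$2" tag="$3" health_url="$4" pause="$5"
  shift 5
  local pct
  for pct in "$@"; do
    run gcloud run services update-traffic "$service" \
      --region="$region" \
      --to-tags="${tag}=${pct}" || return 1
    run sleep "$pause"
    # Stop before the next stage if the service is unhealthy.
    run curl -fsS -o /dev/null "$health_url" || return 1
  done
}
```

For example, `shift_traffic search-service us-east1 manual-deploy "$SERVICE_URL/health" 30 1 10 100` mirrors the 1% → 10% → 100% sequence with a 30-second pause between stages.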

Emergency Rollback

Quick Rollback (< 5 minutes)

When to use:

  • 5xx errors > 1%
  • Latency p95 > 2s
  • Failed smoke tests
  • Critical bug discovered

Steps:

# 1. Get current and previous revisions
# (assumes the revision list is newest first; confirm with the echo output below)
CURRENT=$(gcloud run services describe search-service \
  --region=us-east1 \
  --format='value(status.traffic[0].revisionName)')

PREVIOUS=$(gcloud run revisions list \
  --service=search-service \
  --region=us-east1 \
  --format='value(name)' \
  --limit=2 | tail -1)

echo "Current: $CURRENT"
echo "Rolling back to: $PREVIOUS"

# 2. Instant rollback (100% traffic to previous)
gcloud run services update-traffic search-service \
  --region=us-east1 \
  --to-revisions="$PREVIOUS=100"

# 3. Verify
curl -f https://search-service-222426967009.us-east1.run.app/health

# 4. Notify team
echo "🚨 ROLLBACK: search-service → $PREVIOUS"

Recovery Time Objective (RTO): < 5 minutes
Recovery Point Objective (RPO): Last stable revision
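
Taking the second line of the revision list only works when the newest revision is the one currently serving. A slightly more defensive sketch is to pick the first listed revision that differs from `$CURRENT`. The helper name `previous_revision` is hypothetical, and the input is assumed to be sorted newest first.

```shell
# Sketch: print the newest revision that is not the current one.
# Reads revision names on stdin, one per line, newest first.
# Usage: previous_revision CURRENT_REVISION
previous_revision() {
  local current="$1" name
  while IFS= read -r name; do
    if [ -n "$name" ] && [ "$name" != "$current" ]; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  # No other revision exists, so there is nothing to roll back to.
  return 1
}
```

The output of `gcloud run revisions list --service=search-service --region=us-east1 --format='value(name)'` can be piped into `previous_revision "$CURRENT"`. A non-zero exit means no rollback target was found.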

Rollback via GitHub Actions

If an automated rollback was triggered by failed smoke tests:

  1. Check GitHub Actions logs
  2. Verify rollback completed
  3. Review error logs: gcloud logging read
  4. Create incident ticket

Health Checks

Service Health Endpoints

| Endpoint | Purpose | Expected Response |
|----------|---------|-------------------|
| /health | Shallow health check | {"ok":true} |
| /ready | Deep health check (DB/ES) | {"ok":true,"mongodb":"ok","elasticsearch":"ok"} |
| /live | Liveness probe | 200 OK |
| /metrics | Prometheus metrics | Metrics in text format |

Manual Health Verification

SERVICE_URL="https://search-service-222426967009.us-east1.run.app"

# 1. Shallow health
curl -f "$SERVICE_URL/health"

# 2. Deep health (requires services)
curl -f "$SERVICE_URL/ready?deep=true"

# 3. Smoke test
bash services/search/scripts/smoke-test.sh "$SERVICE_URL"

# 4. Check error rate
gcloud logging read "resource.type=cloud_run_revision \
  AND resource.labels.service_name=search-service \
  AND severity>=ERROR" \
  --limit=10 \
  --format=json | jq -r '.[].textPayload'
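
A single failed `curl` right after a deploy or rollback may only mean the new instance is still starting. The following sketch retries a check a few times before reporting failure. The helper `retry` is hypothetical and not part of the repository's scripts.

```shell
# Sketch: run a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt is still to come.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry 10 6 curl -fsS "$SERVICE_URL/health"` polls the shallow health endpoint for up to about a minute.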

Monitoring Dashboards

Key Metrics:

  • Request count
  • Request latency (p50, p95, p99)
  • Error rate (5xx)
  • Instance count
  • CPU/Memory utilization

Troubleshooting

Issue: Container Fails to Start

Symptoms:

  • Health checks failing
  • Container exits immediately
  • Error logs show initialization failures

Diagnosis:

# Check latest logs
gcloud run services logs read search-service \
  --region=us-east1 \
  --limit=50

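# Check secret access (output is discarded so the URI is not printed to the terminal)
gcloud secrets versions access latest --secret=MONGODB_URI > /dev/null && echo "secret readable"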

# Test locally
docker run --rm -e MONGODB_URI="..." \
  gcr.io/noted-bliss-466410-q6/search-service:latest

Resolution:

  1. Fix configuration/secrets
  2. Redeploy
  3. If urgent: rollback to previous revision

Issue: High Latency

Symptoms:

  • p95 latency > 1s
  • User complaints
  • Timeout errors

Diagnosis:

# Check Elasticsearch health
curl -X GET "http://10.0.0.52:9200/_cluster/health"

# Check MongoDB connection
mongosh "$MONGODB_URI" --eval "db.adminCommand('ping')"

# Check Cloud Run service conditions (latency metrics are on the monitoring dashboards)
gcloud run services describe search-service \
  --region=us-east1 \
  --format='value(status.conditions)'

Resolution:

  1. Scale up instances: Update Terraform config
  2. Increase CPU/memory: Update Cloud Run config
  3. Optimize queries: Review ES_LOG_QUERIES=true logs
  4. Check VPC connector throughput

Issue: Database Connection Failures

Symptoms:

  • /ready returns {"mongodb":"error"}
  • 503 Service Unavailable errors

Diagnosis:

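# Open an authenticated local proxy to the service (listens on localhost:8080 by default)
gcloud run services proxy search-service --region=us-east1

# In another terminal, check MongoDB status through the deep readiness endpoint
curl -s "localhost:8080/ready?deep=true" | jq .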

# Check VPC connector
gcloud compute networks vpc-access connectors describe crop-connector \
  --region=us-east1

Resolution:

  1. Verify VPC connector status
  2. Check MongoDB Atlas IP access list (whitelist)
  3. Verify Secret Manager secrets
  4. Test connection from Cloud Shell

Post-Deployment

Verification Checklist

  • Health checks passing for 10+ minutes
  • Error rate < 0.1%
  • Latency p95 < 500ms
  • No critical errors in logs
  • Smoke tests pass
  • Key user flows tested (search, autocomplete)

Monitoring Period

First 30 minutes:

  • Watch dashboards continuously
  • Monitor error logs
  • Check user reports

First 24 hours:

  • Review metrics hourly
  • Check for memory leaks (gradual memory increase)
  • Monitor instance scaling patterns
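
For the memory-leak check, a crude first signal is whether every hourly sample is higher than the one before it. This is a sketch under assumptions: `memory_trending_up` is a hypothetical helper, the samples are memory utilization values copied from the dashboard (oldest first), and a strictly rising series is only a hint, not proof of a leak.

```shell
# Sketch: exit 0 when at least MIN_SAMPLES values were read and each one
# is strictly greater than the previous value.
# Usage: memory_trending_up [MIN_SAMPLES] < samples.txt
memory_trending_up() {
  awk -v min="${1:-3}" '
    NF == 0 { next }                        # skip blank lines
    n > 0 && $1 + 0 <= prev { flat = 1 }    # a dip or plateau breaks the trend
    { prev = $1 + 0; n++ }
    END { exit !(n >= min && !flat) }
  '
}
```

For example, `printf '40\n45\n52\n60\n' | memory_trending_up 3 && echo "possible leak"` flags the series, while a series that dips at any point does not.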

Incident Response

If issues are detected post-deployment:

  1. Assess severity (P0 = rollback immediately, P1 = investigate)
  2. Gather evidence (logs, metrics, screenshots)
  3. Create incident in Linear with label Incident
  4. Notify team in Slack #incidents
  5. Execute rollback if needed (see Emergency Rollback)
  6. Post-mortem within 48 hours

Success Criteria

✅ Deployment successful if:

  • Zero 5xx errors for 1 hour
  • Latency within SLA
  • All health checks green
  • No user complaints
  • Metrics stable

Contacts

On-Call Rotation:

  • Primary: @vova
  • Secondary: @team-lead

Escalation Path:

  1. DevOps team (#devops)
  2. Engineering lead
  3. CTO

External Dependencies:


Appendix

Useful Commands

# Get current revision
gcloud run services describe search-service \
  --region=us-east1 \
  --format='value(status.latestReadyRevisionName)'

# List all revisions
gcloud run revisions list \
  --service=search-service \
  --region=us-east1

# Get revision traffic split
gcloud run services describe search-service \
  --region=us-east1 \
  --format='table(status.traffic[].revisionName,status.traffic[].percent)'

# Stream logs
gcloud run services logs tail search-service --region=us-east1

# Open an authenticated local proxy to the service (Cloud Run has no exec-into-container command)
gcloud run services proxy search-service --region=us-east1

Document Version: 1.0
Last Reviewed: 2025-11-12
Next Review: 2025-12-12
