Deployment Runbook
Last Updated: 2025-11-12
Owner: DevOps Team
Incident Contact: @vova
Overview
This runbook covers deployment procedures, rollback strategies, and emergency procedures for the CROP microservices platform.
Table of Contents
- Pre-Deployment Checklist
- Normal Deployment
- Emergency Rollback
- Health Checks
- Troubleshooting
- Post-Deployment
Pre-Deployment Checklist
Required Checks
- All CI/CD checks passing (tests, lint, security scans)
- PR approved and merged to main
- Database migrations tested (if applicable)
- Secrets rotated if needed (check Secret Manager)
- Monitoring dashboards open
- Team notified in Slack
Security Gates
Automated security checks (must pass):
- ✅ Gitleaks - No secrets in code
- ✅ Trivy - No CRITICAL/HIGH vulnerabilities in container
- ✅ OSV-Scanner - No known CVEs in dependencies
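When CI is unavailable, the same gates can be approximated locally. The sketch below assumes the gitleaks, trivy, and osv-scanner CLIs are installed. The function names and the IMAGE variable are illustrative and not part of the pipeline, and the flags shown may differ between tool versions, so check each tool's --help before relying on them.

```shell
# Run one named gate quietly; print PASS/FAIL and return non-zero on failure.
run_gate() {
  gate_name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $gate_name"
  else
    echo "FAIL $gate_name"
    return 1
  fi
}

# Run all three gates; every gate runs even if an earlier one fails.
# IMAGE is a hypothetical variable holding the container image to scan.
# The tool flags below are assumptions about commonly used CLI versions.
run_security_gates() {
  failed=0
  run_gate gitleaks gitleaks detect --source . || failed=1
  run_gate trivy trivy image --severity CRITICAL,HIGH --exit-code 1 "$IMAGE" || failed=1
  run_gate osv-scanner osv-scanner -r . || failed=1
  return "$failed"
}
```

Calling run_security_gates from the repository root prints one PASS or FAIL line per gate and returns non-zero if any gate failed.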
Stakeholder Notification
📦 Deployment: search-service v1.2.3
⏰ ETA: 2025-11-12 14:00 UTC
🔗 PR: https://github.com/MyWatson/CROP-parts-services/pull/123
👤 Deployer: @vova
Normal Deployment
Automatic Deployment (Recommended)
Deployments are triggered automatically on merge to main:
- Merge PR to main branch
- GitHub Actions runs:
  - Builds Docker image
  - Scans with Trivy
  - Pushes to GCR
  - Deploys to Cloud Run (no-traffic)
  - Runs smoke tests
  - Gradually shifts traffic (1% → 10% → 100%)
- Monitor dashboards during rollout
- Verify health checks pass
Timeline: ~15-20 minutes from merge to 100% traffic
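The gradual shift (1% → 10% → 100%) can be sketched as a loop that checks health after each step. This is an illustration, not the actual GitHub Actions workflow. It assumes the new revision was deployed with a traffic tag, that the health URL returns a 2xx status when healthy, and that a pause of PAUSE seconds (default 30) between steps is acceptable. The function name and its parameters are hypothetical.

```shell
# Shift traffic to a tagged revision in steps, checking health after each step.
#   $1 service, $2 region, $3 traffic tag, $4 health URL (all caller-supplied)
# PAUSE (seconds between steps) is an assumed knob; it defaults to 30.
shift_traffic_gradually() {
  service=$1 region=$2 tag=$3 health_url=$4
  for percent in 1 10 100; do
    echo "Shifting ${percent}% of traffic to tag ${tag}"
    gcloud run services update-traffic "$service" \
      --region="$region" \
      --to-tags="${tag}=${percent}" || return 1
    sleep "${PAUSE:-30}"
    # Stop at the first failed health check so the caller can roll back.
    if ! curl -fsS "$health_url" >/dev/null; then
      echo "Health check failed at ${percent}%; stop and roll back" >&2
      return 1
    fi
  done
}
```

For example, shift_traffic_gradually search-service us-east1 manual-deploy "$SERVICE_URL/health" would walk through the three steps and return non-zero at the first failed check, leaving the rollback decision to the operator.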
Manual Deployment (Emergency)
If CI/CD is down:
# 1. Build and push image
cd services/search
docker build -t gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD) .
docker push gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD)
# 2. Deploy with no traffic
gcloud run deploy search-service \
--region=us-east1 \
--image=gcr.io/noted-bliss-466410-q6/search-service:$(git rev-parse HEAD) \
--no-traffic \
--tag=manual-deploy
# 3. Test preview URL
PREVIEW_URL="https://manual-deploy---search-service-222426967009.us-east1.run.app"
curl -f "$PREVIEW_URL/health"
# 4. Shift traffic gradually
gcloud run services update-traffic search-service \
--region=us-east1 \
--to-revisions=search-service-REVISION=1
sleep 30 && gcloud run services update-traffic search-service \
--region=us-east1 \
--to-revisions=search-service-REVISION=10
sleep 30 && gcloud run services update-traffic search-service \
--region=us-east1 \
--to-revisions=LATEST=100
Emergency Rollback
Quick Rollback (< 5 minutes)
When to use:
- 5xx errors > 1%
- Latency p95 > 2s
- Failed smoke tests
- Critical bug discovered
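The two numeric triggers above can be encoded as a small check. This is a sketch: how the error rate and p95 latency are obtained (dashboard export, monitoring API) is outside its scope, and the function name is hypothetical.

```shell
# Return 0 (roll back) if either numeric trigger is exceeded.
#   $1: 5xx error rate in percent   (trigger: > 1)
#   $2: p95 latency in seconds      (trigger: > 2)
should_rollback() {
  awk -v err="$1" -v p95="$2" 'BEGIN { exit !(err + 0 > 1 || p95 + 0 > 2) }'
}
```

A call such as should_rollback 1.5 0.4 succeeds because the error rate exceeds 1%, while should_rollback 0.2 0.4 fails. Failed smoke tests and critical bugs remain manual judgment calls.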
Steps:
# 1. Get previous revision
CURRENT=$(gcloud run services describe search-service \
--region=us-east1 \
--format='value(status.traffic[0].revisionName)')
PREVIOUS=$(gcloud run revisions list \
--service=search-service \
--region=us-east1 \
--format='value(name)' \
--limit=2 | tail -1)
echo "Current: $CURRENT"
echo "Rolling back to: $PREVIOUS"
# 2. Instant rollback (100% traffic to previous)
gcloud run services update-traffic search-service \
--region=us-east1 \
--to-revisions=$PREVIOUS=100
# 3. Verify
curl -f https://search-service-222426967009.us-east1.run.app/health
# 4. Notify team
echo "🚨 ROLLBACK: search-service → $PREVIOUS"
Recovery Time Objective (RTO): < 5 minutes
Recovery Point Objective (RPO): Last stable revision
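Step 1 above takes the second entry of the revision list, which is only correct when the newest revision is the one serving traffic. A slightly more defensive sketch picks the first listed revision that differs from the current one. It assumes the list arrives newest first, one name per line, and the function name is hypothetical. The selected revision may still be one that never served traffic, so confirm it before shifting.

```shell
# Print the first revision name on stdin that differs from the current one.
#   $1: name of the revision currently serving traffic
# Assumes stdin lists revision names newest first, one per line.
previous_revision() {
  current=$1
  while IFS= read -r name; do
    if [ -n "$name" ] && [ "$name" != "$current" ]; then
      echo "$name"
      return 0
    fi
  done
  return 1  # no other revision found
}

# Possible usage (the explicit sort key is an assumption about the resource fields):
#   PREVIOUS=$(gcloud run revisions list --service=search-service \
#     --region=us-east1 --sort-by='~metadata.creationTimestamp' \
#     --format='value(name)' | previous_revision "$CURRENT")
```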
Rollback via GitHub Actions
If an automated rollback was triggered by smoke tests:
- Check GitHub Actions logs
- Verify rollback completed
- Review error logs: gcloud logging read
- Create incident ticket
Health Checks
Service Health Endpoints
| Endpoint | Purpose | Expected Response |
|---|---|---|
| /health | Shallow health check | {"ok":true} |
| /ready | Deep health check (DB/ES) | {"ok":true,"mongodb":"ok","elasticsearch":"ok"} |
| /live | Liveness probe | 200 OK |
| /metrics | Prometheus metrics | Metrics in text format |
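Because curl -f only checks the HTTP status, a script may also want to confirm that the /ready body matches the expected response in the table. The sketch below uses plain string matching after stripping whitespace rather than a real JSON parser, which is an intentional simplification. The function name is hypothetical.

```shell
# Succeed only if the /ready body reports ok for the service and both backends.
#   $1: response body as a string
# Simplification: substring matching on whitespace-stripped text, not JSON parsing.
ready_body_ok() {
  body=$(printf '%s' "$1" | tr -d '[:space:]')
  for field in '"ok":true' '"mongodb":"ok"' '"elasticsearch":"ok"'; do
    case $body in
      *"$field"*) ;;
      *) echo "missing or failing: $field" >&2; return 1 ;;
    esac
  done
}
```

One way to use it is ready_body_ok "$(curl -fsS "$SERVICE_URL/ready")", which fails and names the first field that does not report ok.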
Manual Health Verification
SERVICE_URL="https://search-service-222426967009.us-east1.run.app"
# 1. Shallow health
curl -f "$SERVICE_URL/health"
# 2. Deep health (requires services)
curl -f "$SERVICE_URL/ready?deep=true"
# 3. Smoke test
bash services/search/scripts/smoke-test.sh "$SERVICE_URL"
# 4. Check error rate
gcloud logging read "resource.type=cloud_run_revision \
AND resource.labels.service_name=search-service \
AND severity>=ERROR" \
--limit=10 \
--format=json | jq -r '.[].textPayload'
Monitoring Dashboards
- Cloud Run Metrics: https://console.cloud.google.com/run/detail/us-east1/search-service
- Logs: https://console.cloud.google.com/logs/query
Key Metrics:
- Request count
- Request latency (p50, p95, p99)
- Error rate (5xx)
- Instance count
- CPU/Memory utilization
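For a quick offline look at the latency percentiles listed above, for example from latencies exported out of request logs, a nearest-rank percentile can be computed with sort and awk. This is a sketch that assumes one numeric latency per line on stdin and an integer percentile. The dashboards may use a different estimation method, so small differences are expected.

```shell
# Print the nearest-rank percentile of the numbers on stdin (one per line).
#   $1: integer percentile, e.g. 50, 95, 99
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1            # no samples
      r = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (r < 1) r = 1
      print v[r]
    }'
}
```

For ten samples, percentile 95 returns the largest value and percentile 50 returns the fifth smallest.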
Troubleshooting
Issue: Container Fails to Start
Symptoms:
- Health checks failing
- Container exits immediately
- Error logs show initialization failures
Diagnosis:
# Check latest logs
gcloud run services logs read search-service \
--region=us-east1 \
--limit=50
# Check secret access
gcloud secrets versions access latest --secret=MONGODB_URI
# Test locally
docker run --rm -e MONGODB_URI="..." \
gcr.io/noted-bliss-466410-q6/search-service:latest
Resolution:
- Fix configuration/secrets
- Redeploy
- If urgent: rollback to previous revision
Issue: High Latency
Symptoms:
- p95 latency > 1s
- User complaints
- Timeout errors
Diagnosis:
# Check Elasticsearch health
curl -X GET "http://10.0.0.52:9200/_cluster/health"
# Check MongoDB connection
mongosh "$MONGODB_URI" --eval "db.adminCommand('ping')"
# Check Cloud Run metrics
gcloud run services describe search-service \
--region=us-east1 \
--format='value(status.conditions)'
Resolution:
- Scale up instances: Update Terraform config
- Increase CPU/memory: Update Cloud Run config
- Optimize queries: Review ES_LOG_QUERIES=true logs
- Check VPC connector throughput
Issue: Database Connection Failures
Symptoms:
- /ready returns {"mongodb":"error"}
- 503 Service Unavailable errors
Diagnosis:
# Check MongoDB connection from Cloud Run
gcloud run services proxy search-service --region=us-east1
# In another terminal
curl "localhost:8080/ready?deep=true" | jq .
# Check VPC connector
gcloud compute networks vpc-access connectors describe crop-connector \
--region=us-east1
Resolution:
- Verify VPC connector status
- Check MongoDB Atlas IP whitelist
- Verify Secret Manager secrets
- Test connection from Cloud Shell
Post-Deployment
Verification Checklist
- Health checks passing for 10+ minutes
- Error rate < 0.1%
- Latency p95 < 500ms
- No critical errors in logs
- Smoke tests pass
- Key user flows tested (search, autocomplete)
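The two numeric items in this checklist can be checked with a small helper that reports each result. This is a sketch with hypothetical function names. The metric values are supplied by the caller, and the remaining checklist items still need manual confirmation.

```shell
# Report whether a value is strictly below its limit.
#   $1: label, $2: observed value, $3: limit
check_threshold() {
  if awk -v v="$2" -v max="$3" 'BEGIN { exit !(v + 0 < max + 0) }'; then
    echo "PASS $1: $2 < $3"
  else
    echo "FAIL $1: $2 is not below $3"
    return 1
  fi
}

# Check error rate (< 0.1%) and p95 latency (< 500 ms); both checks always run.
#   $1: error rate in percent, $2: p95 latency in milliseconds
post_deploy_metrics_ok() {
  status=0
  check_threshold "error rate (%)" "$1" 0.1 || status=1
  check_threshold "latency p95 (ms)" "$2" 500 || status=1
  return "$status"
}
```

For example, post_deploy_metrics_ok 0.05 320 prints two PASS lines and succeeds, while a 0.3% error rate produces a FAIL line and a non-zero status.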
Monitoring Period
First 30 minutes:
- Watch dashboards continuously
- Monitor error logs
- Check user reports
First 24 hours:
- Review metrics hourly
- Check for memory leaks (gradual memory increase)
- Monitor instance scaling patterns
Incident Response
If issues are detected post-deployment:
- Assess severity (P0 = rollback immediately, P1 = investigate)
- Gather evidence (logs, metrics, screenshots)
- Create incident in Linear with label Incident
- Notify team in Slack #incidents
- Execute rollback if needed (see Emergency Rollback)
- Post-mortem within 48 hours
Success Criteria
✅ Deployment successful if:
- Zero 5xx errors for 1 hour
- Latency within SLA
- All health checks green
- No user complaints
- Metrics stable
Contacts
On-Call Rotation:
- Primary: @vova
- Secondary: @team-lead
Escalation Path:
- DevOps team (#devops)
- Engineering lead
- CTO
External Dependencies:
- MongoDB Atlas: https://cloud.mongodb.com/
- GCP Support: https://console.cloud.google.com/support
- Elasticsearch: Internal VPC (10.0.0.52:9200)
Appendix
Useful Commands
# Get current revision
gcloud run services describe search-service \
--region=us-east1 \
--format='value(status.latestReadyRevisionName)'
# List all revisions
gcloud run revisions list \
--service=search-service \
--region=us-east1
# Get revision traffic split
gcloud run services describe search-service \
--region=us-east1 \
--format='table(status.traffic[].revisionName,status.traffic[].percent)'
# Stream logs
gcloud run services logs tail search-service --region=us-east1
# Proxy the service to localhost for authenticated local requests
gcloud run services proxy search-service --region=us-east1
Related Documentation
Document Version: 1.0
Last Reviewed: 2025-11-12
Next Review: 2025-12-12