High Availability & Disaster Recovery
Multi-region deployment, failover automation, RTO/RPO targets, and DR testing procedures for enterprise-grade CDE resilience.
High Availability Architecture
Multi-region deployment patterns for maximum resilience
Single Region HA
Multi-AZ deployment within one region
- Pro: 99.9% availability target
- Pro: Lower cost and complexity
- Con: Single-region failure risk
Active-Passive
Primary region with a standby DR site
- Pro: 99.95% availability target
- Pro: Full region failover capability
- Note: RTO of 15-60 minutes
Active-Active (Recommended)
Traffic served from multiple regions
- Pro: 99.99% availability target
- Pro: Near-zero RTO
- Pro: Best developer latency
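The availability targets above translate directly into downtime budgets. A quick sketch of the arithmetic, using a 30-day month (43,200 minutes):

```shell
#!/bin/bash
# Convert each availability target into its allowed downtime per 30-day month.
for target in 99.9 99.95 99.99; do
  awk -v t="$target" 'BEGIN {
    minutes = (100 - t) / 100 * 30 * 24 * 60   # unavailable fraction x minutes per month
    printf "%s%% availability -> %.1f minutes of downtime per month\n", t, minutes
  }'
done
# 99.9% availability -> 43.2 minutes of downtime per month
# 99.95% availability -> 21.6 minutes of downtime per month
# 99.99% availability -> 4.3 minutes of downtime per month
```

The jump from 99.9% to 99.99% shrinks the monthly budget from roughly 43 minutes to about 4, which is why the higher tiers require automated failover rather than manual intervention.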
Active-Active Multi-Region Architecture
```
           Global Load Balancer (Route53 / Cloud DNS)
                              |
           +------------------+------------------+
           |                                     |
    Region: US-East                       Region: EU-West
           |                                     |
   +-------+-------+                     +-------+-------+
   |               |                     |               |
 CDE Control    Kubernetes             CDE Control    Kubernetes
   Plane         Cluster                 Plane         Cluster
(3 replicas)  (Worker Nodes)          (3 replicas)  (Worker Nodes)
   |               |                     |               |
   +-------+-------+                     +-------+-------+
           |                                     |
     PostgreSQL HA                         PostgreSQL HA
    (RDS Multi-AZ)                        (RDS Multi-AZ)
           |                                     |
           +------ Cross-Region Replication -----+
                     (Async, RPO: ~1 min)

Developer Workspaces:                 Developer Workspaces:
- us-east-1a (AZ-1)                   - eu-west-1a (AZ-1)
- us-east-1b (AZ-2)                   - eu-west-1b (AZ-2)
- us-east-1c (AZ-3)                   - eu-west-1c (AZ-3)
```
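The availability gain from active-active comes from requiring both regions to fail at once. A rough sketch of the math, assuming the two regions fail independently (a simplification: shared dependencies such as global DNS cap the real-world number):

```shell
#!/bin/bash
# Combined availability of two independent regions, each at 99.9% on its own.
awk 'BEGIN {
  a = 0.999                            # single-region availability (99.9%)
  both_down = (1 - a) * (1 - a)        # probability both regions are down at once
  printf "combined availability: %.4f%%\n", (1 - both_down) * 100
}'
# combined availability: 99.9999%
```

In practice the 99.99% target quoted above is the more honest figure, since the global load balancer, replication lag, and correlated failures all eat into the theoretical ceiling.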
RTO & RPO Targets
Recovery objectives by CDE component
RTO
Recovery Time Objective
Maximum acceptable time to restore service after an outage. How long can developers be without their workspaces?
RPO
Recovery Point Objective
Maximum acceptable data loss measured in time. How much work can be lost?
| Component | Tier 1 (Critical) | Tier 2 (Standard) | Tier 3 (Best Effort) |
|---|---|---|---|
| Control Plane | RTO: 5 min, RPO: 0 | RTO: 30 min, RPO: 5 min | RTO: 4 hr, RPO: 1 hr |
| Database (User/Config) | RTO: 5 min, RPO: 1 min | RTO: 30 min, RPO: 15 min | RTO: 4 hr, RPO: 24 hr |
| Workspace Storage | RTO: 15 min, RPO: 5 min | RTO: 1 hr, RPO: 1 hr | RTO: 8 hr, RPO: 24 hr |
| Templates/Automation | Stored in Git: RPO 0, RTO minutes (re-deploy from repo) | | |
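An RPO target is only meaningful if something alerts when replication falls behind it. A minimal sketch against the Tier 1 database target of 1 minute; the lag value is hard-coded for illustration, where in practice it would come from `pg_stat_replication` or a CloudWatch replica-lag metric:

```shell
#!/bin/bash
# Alert when measured replication lag exceeds the RPO target.
RPO_SECONDS=60   # Tier 1 database target: RPO 1 min
LAG_SECONDS=42   # illustrative; read from pg_stat_replication / CloudWatch in practice

if [ "$LAG_SECONDS" -le "$RPO_SECONDS" ]; then
  echo "OK: replication lag ${LAG_SECONDS}s within RPO ${RPO_SECONDS}s"
else
  echo "ALERT: replication lag ${LAG_SECONDS}s exceeds RPO ${RPO_SECONDS}s"
fi
```

Wiring a check like this into the weekly health validation below closes the loop between the targets in the table and what the replication link actually delivers.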
Failover Automation
Automated and manual failover procedures
Automated Failover Script
```bash
#!/bin/bash
# CDE Regional Failover Script
set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
HEALTH_ENDPOINT="https://cde.company.com/api/v2/health"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"
HOSTED_ZONE_ID="${HOSTED_ZONE_ID:?Set to the Route53 hosted zone for cde.company.com}"

# Check primary health: treat anything other than HTTP 200 as unhealthy
check_health() {
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$HEALTH_ENDPOINT")
    [[ "$status" == "200" ]]
}

# Notify team
notify() {
    curl -X POST "$SLACK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \"$1\"}"
}

# Failover database: promote the cross-region read replica
failover_db() {
    echo "Promoting DR database to primary..."
    aws rds promote-read-replica \
        --db-instance-identifier cde-db-dr \
        --region "$DR_REGION"
}

# Update DNS to point at the DR region
update_dns() {
    echo "Updating DNS to DR region..."
    aws route53 change-resource-record-sets \
        --hosted-zone-id "$HOSTED_ZONE_ID" \
        --change-batch file://failover-dns.json
}

# Main failover logic
main() {
    if ! check_health; then
        notify ":rotating_light: Primary CDE region ($PRIMARY_REGION) unhealthy - initiating failover"
        failover_db
        update_dns
        notify ":white_check_mark: Failover complete - CDE now running in $DR_REGION"
    else
        echo "Primary region healthy - no action needed"
    fi
}

main "$@"
```
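The `update_dns` step reads its change batch from `failover-dns.json`. A sketch of what that file might contain, in the Route53 change-batch format; the record name and the DR endpoint value are assumptions for illustration:

```json
{
  "Comment": "Fail CDE traffic over to the DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde.company.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "cde-dr.eu-west-1.elb.amazonaws.com" }
        ]
      }
    }
  ]
}
```

A low TTL (60 seconds here) keeps resolver caches from pinning clients to the failed region for long after the UPSERT lands.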
DR Testing Schedule
Regular testing ensures DR procedures actually work
Weekly
- Health check validation
- Backup verification
- Alert test (silent)
Monthly
- Restore test (non-prod)
- Runbook walkthrough
- On-call rotation test
Quarterly
- Partial failover drill
- Chaos engineering test
- Documentation review
Annually
- Full failover drill
- Multi-day DR simulation
- Third-party audit
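The weekly and monthly items above lend themselves to scheduling; a hypothetical crontab sketch (the script paths are illustrative, and the drills, walkthroughs, and audits remain manual exercises):

```
# Recurring, automatable DR checks (illustrative paths)
0 6 * * 1    /opt/cde/dr/check-health.sh          # weekly: health check validation
0 7 * * 1    /opt/cde/dr/verify-backups.sh        # weekly: backup verification
0 6 1 * *    /opt/cde/dr/restore-test.sh nonprod  # monthly: restore test (non-prod)
```

Scheduling the mechanical checks keeps the human-run drills focused on what automation cannot verify: whether the runbooks are current and the on-call team can actually execute them.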