
High Availability & Disaster Recovery

Multi-region deployment, failover automation, RTO/RPO targets, and DR testing procedures for enterprise-grade CDE resilience.

High Availability Architecture

Multi-region deployment patterns for maximum resilience

Single Region HA

Multi-AZ deployment within one region

  • + 99.9% availability target
  • + Lower cost and complexity
  • - Single-region failure risk

Active-Passive

Primary region with standby DR site

  • + 99.95% availability target
  • + Full-region failover capability
  • ~ RTO: 15-60 minutes

Active-Active

Recommended

Traffic served from multiple regions

  • + 99.99% availability target
  • + RTO: near-zero
  • + Best developer latency
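
The availability targets above translate directly into annual downtime budgets. A quick sketch of the arithmetic (the function name is illustrative):

```shell
#!/bin/bash
# Convert an availability target (percent) into an annual downtime budget in minutes.
availability_to_downtime_min() {
    awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 365.25 * 24 * 60 }'
}

availability_to_downtime_min 99.9    # single-region HA: ~526 min/year (~8.8 hours)
availability_to_downtime_min 99.95   # active-passive:   ~263 min/year (~4.4 hours)
availability_to_downtime_min 99.99   # active-active:    ~52.6 min/year
```

The jump from 99.9% to 99.99% cuts the budget by a factor of ten, which is why the active-active pattern is the only one that tolerates a full region outage without eating most of a year's allowance.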

Active-Active Multi-Region Architecture

                           Global Load Balancer (Route53 / Cloud DNS)
                                         |
                     +-------------------+-------------------+
                     |                                       |
              Region: US-East                         Region: EU-West
                     |                                       |
           +--------+--------+                     +--------+--------+
           |                 |                     |                 |
      CDE Control       Kubernetes            CDE Control       Kubernetes
        Plane             Cluster               Plane             Cluster
      (3 replicas)      (Worker Nodes)        (3 replicas)      (Worker Nodes)
           |                 |                     |                 |
           +--------+--------+                     +--------+--------+
                    |                                       |
             PostgreSQL HA                           PostgreSQL HA
            (RDS Multi-AZ)                          (RDS Multi-AZ)
                    |                                       |
                    +------- Cross-Region Replication ------+
                                (Async, RPO: ~1 min)

   Developer Workspaces:          Developer Workspaces:
   - us-east-1a (AZ-1)           - eu-west-1a (AZ-1)
   - us-east-1b (AZ-2)           - eu-west-1b (AZ-2)
   - us-east-1c (AZ-3)           - eu-west-1c (AZ-3)
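
The global load balancer tier in the diagram is typically implemented as latency-based DNS routing with health checks. A sketch of what the Route 53 change batch for the two regions might look like (the DNS names and hosted-zone IDs are placeholders, not real values):

```json
{
  "Comment": "Latency-based routing for cde.company.com (illustrative values)",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde.company.com",
        "Type": "A",
        "SetIdentifier": "us-east-1",
        "Region": "us-east-1",
        "AliasTarget": {
          "HostedZoneId": "ZUSEASTELBEXAMPLE",
          "DNSName": "cde-useast.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde.company.com",
        "Type": "A",
        "SetIdentifier": "eu-west-1",
        "Region": "eu-west-1",
        "AliasTarget": {
          "HostedZoneId": "ZEUWESTELBEXAMPLE",
          "DNSName": "cde-euwest.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}
```

With `EvaluateTargetHealth` enabled, Route 53 stops returning a region whose load balancer fails its health checks, so healthy-region routing happens at DNS resolution time without any manual intervention.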

RTO & RPO Targets

Recovery objectives by CDE component

RTO

Recovery Time Objective

Maximum acceptable time to restore service after an outage. How long can developers be without their workspaces?

RPO

Recovery Point Objective

Maximum acceptable data loss measured in time. How much work can be lost?

Component               Tier 1 (Critical)          Tier 2 (Standard)           Tier 3 (Best Effort)
Control Plane           RTO: 5 min / RPO: 0        RTO: 30 min / RPO: 5 min    RTO: 4 hr / RPO: 1 hr
Database (User/Config)  RTO: 5 min / RPO: 1 min    RTO: 30 min / RPO: 15 min   RTO: 4 hr / RPO: 24 hr
Workspace Storage       RTO: 15 min / RPO: 5 min   RTO: 1 hr / RPO: 1 hr       RTO: 8 hr / RPO: 24 hr
Templates/Automation    Stored in Git - RPO: 0, RTO: minutes (re-deploy from repo)
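
Meeting an RPO target is only provable if backup recency is checked continuously. A minimal sketch of such a check (the function name is illustrative; timestamps are Unix epoch seconds):

```shell
#!/bin/bash
# Return success when the newest backup is recent enough to satisfy the RPO.
rpo_met() {
    local last_backup_epoch=$1 rpo_seconds=$2 now_epoch=$3
    (( now_epoch - last_backup_epoch <= rpo_seconds ))
}

# Tier 1 database RPO is 1 minute (60 s); last backup was 45 s ago:
if rpo_met 1700000000 60 1700000045; then
    echo "RPO met"
else
    echo "RPO violated - page on-call"
fi
```

In practice the `last_backup_epoch` input would come from the backup system's API or from object metadata on the latest snapshot, and a violation would feed the alerting pipeline rather than stdout.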

Failover Automation

Automated and manual failover procedures

Automated Failover Script

#!/bin/bash
# CDE Regional Failover Script

set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
HEALTH_ENDPOINT="https://cde.company.com/api/v2/health"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"
HOSTED_ZONE_ID="Zxxxxxxxxxxxxx"   # Route 53 hosted zone for cde.company.com

# Check primary health (an unreachable endpoint yields "000", not 200)
check_health() {
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_ENDPOINT" --max-time 10 || true)
    [[ "$status" == "200" ]]
}

# Notify team
notify() {
    curl -s -X POST "$SLACK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \"$1\"}"
}

# Failover database: promote the cross-region read replica to standalone primary
failover_db() {
    echo "Promoting DR database to primary..."
    aws rds promote-read-replica \
        --db-instance-identifier cde-db-dr \
        --region "$DR_REGION"
    aws rds wait db-instance-available \
        --db-instance-identifier cde-db-dr \
        --region "$DR_REGION"
}

# Update DNS
update_dns() {
    echo "Updating DNS to DR region..."
    aws route53 change-resource-record-sets \
        --hosted-zone-id "$HOSTED_ZONE_ID" \
        --change-batch file://failover-dns.json
}

# Main failover logic
main() {
    if ! check_health; then
        notify ":rotating_light: Primary CDE region unhealthy - initiating failover"

        failover_db
        update_dns

        notify ":white_check_mark: Failover complete - CDE now running in $DR_REGION"
    else
        echo "Primary region healthy - no action needed"
    fi
}

main "$@"
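
The `update_dns` step reads its change batch from `failover-dns.json`. A sketch of what that file might contain, repointing the production hostname at the DR region's load balancer (the record value is a placeholder):

```json
{
  "Comment": "Failover: point cde.company.com at eu-west-1",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde.company.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "cde-euwest.elb.amazonaws.com" }
        ]
      }
    }
  ]
}
```

A low TTL (60 s here) matters: it bounds how long cached DNS answers keep sending developers to the failed region, and so directly contributes to the effective RTO.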

DR Testing Schedule

Regular testing ensures DR procedures actually work

Weekly

  • Health check validation
  • Backup verification
  • Alert test (silent)

Monthly

  • Restore test (non-prod)
  • Runbook walkthrough
  • On-call rotation test

Quarterly

  • Partial failover drill
  • Chaos engineering test
  • Documentation review

Annually

  • Full failover drill
  • Multi-day DR simulation
  • Third-party audit
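
Drills only prove something if the measured recovery time is compared against the stated target. A minimal sketch of that bookkeeping (function name and thresholds are illustrative):

```shell
#!/bin/bash
# Compare a drill's measured recovery time against the RTO target.
drill_rto_check() {
    local start_epoch=$1 end_epoch=$2 target_seconds=$3
    local actual=$(( end_epoch - start_epoch ))
    if (( actual <= target_seconds )); then
        echo "PASS: recovered in ${actual}s (target ${target_seconds}s)"
    else
        echo "FAIL: recovery took ${actual}s, exceeding ${target_seconds}s target"
    fi
}

# Quarterly partial-failover drill against the Tier 1 workspace-storage
# RTO of 15 minutes (900 s); drill took 10 minutes:
drill_rto_check 1700000000 1700000600 900
```

Recording the start and end epochs in the drill runbook, then checking them against the tier's RTO, turns the quarterly and annual drills into pass/fail evidence rather than anecdotes.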