High Availability & Disaster Recovery

Multi-region deployment, failover automation, RTO/RPO targets, and DR testing procedures for enterprise-grade CDE resilience - including HA strategies for AI agent workloads at scale.

High Availability Architecture

Multi-region deployment patterns for maximum resilience across developer workspaces and AI agent workloads

Single Region HA

Multi-AZ deployment within one region

  • +99.9% availability target
  • +Lower cost and complexity
  • +Simpler data residency compliance
  • -Single region failure risk

Active-Passive

Primary region with standby DR site

  • +99.95% availability target
  • +Full region failover capability
  • +AI agent queues persist across failover
  • ~RTO: 15-60 minutes

Active-Active

Recommended

Traffic served from multiple regions simultaneously

  • +99.99% availability target
  • +RTO: Near-zero
  • +Best developer and agent latency
  • +Distributes GPU and compute load
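The availability targets above translate directly into annual downtime budgets, which is the clearest way to compare the three patterns. A quick sketch of the arithmetic (assuming a 365-day year):

```shell
# Convert an availability percentage into an annual downtime budget in minutes.
downtime_minutes() {
    # (100 - availability%) as a fraction of 365 * 24 * 60 minutes
    awk -v a="$1" 'BEGIN { printf "%.1f", (100 - a) / 100 * 365 * 24 * 60 }'
}

for target in 99.9 99.95 99.99; do
    echo "${target}% availability -> $(downtime_minutes "$target") minutes of downtime per year"
done
```

At 99.99%, the budget is under an hour per year (~52.6 minutes), which is why active-active is the only pattern that realistically meets it; single-region HA at 99.9% allows roughly 8.8 hours.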

CDE Platform HA Capabilities (2026)

Coder

  • Multi-region Kubernetes
  • External PostgreSQL HA
  • Workspace auto-recovery
  • Satellite architecture

Ona (formerly Gitpod)

  • Multi-cluster support
  • Workspace snapshots
  • Prebuilt image caching
  • Regional pod scheduling

GitHub Codespaces

  • Azure region selection
  • Managed infrastructure HA
  • Automatic backup/restore
  • GitHub Actions integration

DevPod / Daytona

  • Provider-agnostic failover
  • Multi-cloud workspace routing
  • Bring-your-own infrastructure
  • GitOps-driven recovery

Active-Active Multi-Region Architecture

                           Global Load Balancer (Route53 / Cloud DNS)
                                         |
                     +-------------------+-------------------+
                     |                                       |
              Region: US-East                         Region: EU-West
                     |                                       |
           +--------+--------+                     +--------+--------+
           |                 |                     |                 |
      CDE Control       Kubernetes            CDE Control       Kubernetes
        Plane             Cluster               Plane             Cluster
      (3 replicas)      (Worker Nodes)        (3 replicas)      (Worker Nodes)
           |                 |                     |                 |
      AI Agent          GPU Node Pool         AI Agent          GPU Node Pool
      Scheduler          (Optional)           Scheduler          (Optional)
           |                 |                     |                 |
           +--------+--------+                     +--------+--------+
                    |                                       |
             PostgreSQL HA                           PostgreSQL HA
            (RDS Multi-AZ)                          (RDS Multi-AZ)
                    |                                       |
                    +------- Cross-Region Replication ------+
                                (Async, RPO: ~1 min)

   Developer Workspaces:          Developer Workspaces:
   - us-east-1a (AZ-1)           - eu-west-1a (AZ-1)
   - us-east-1b (AZ-2)           - eu-west-1b (AZ-2)
   - us-east-1c (AZ-3)           - eu-west-1c (AZ-3)

   AI Agent Sandboxes:            AI Agent Sandboxes:
   - Ephemeral microVMs           - Ephemeral microVMs
   - Isolated namespaces          - Isolated namespaces
   - Task queue failover          - Task queue failover

RTO & RPO Targets

Recovery objectives by CDE component - including AI agent infrastructure

RTO

Recovery Time Objective

Maximum acceptable time to restore service after an outage. How long can developers and AI agents be without their workspaces?

RPO

Recovery Point Objective

Maximum acceptable data loss measured in time. How much work - human or agent-generated - can be lost?

Component               Tier 1 (Critical)         Tier 2 (Standard)          Tier 3 (Best Effort)
Control Plane           RTO: 5 min / RPO: 0       RTO: 30 min / RPO: 5 min   RTO: 4 hr / RPO: 1 hr
Database (User/Config)  RTO: 5 min / RPO: 1 min   RTO: 30 min / RPO: 15 min  RTO: 4 hr / RPO: 24 hr
Workspace Storage       RTO: 15 min / RPO: 5 min  RTO: 1 hr / RPO: 1 hr      RTO: 8 hr / RPO: 24 hr
AI Agent Task Queues    RTO: 2 min / RPO: 0       RTO: 15 min / RPO: 5 min   RTO: 2 hr / RPO: 30 min
Templates/Automation    Stored in Git - RPO: 0, RTO: minutes (re-deploy from repo)
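During a DR drill, achieved RPO can be measured directly as the gap between the last durable checkpoint or replicated write and the simulated failure time, then compared against the tier target. A minimal sketch (the epoch timestamps and the 3600-second Tier 2 target are illustrative):

```shell
# Achieved RPO = time of failure minus time of the last durable checkpoint.
achieved_rpo_seconds() {
    local last_checkpoint_epoch="$1" failure_epoch="$2"
    echo $(( failure_epoch - last_checkpoint_epoch ))
}

# Compare against a tier target, e.g. Tier 2 workspace storage: RPO 1 hr = 3600 s.
meets_rpo_target() {
    if [ "$1" -le "$2" ]; then echo "PASS"; else echo "FAIL"; fi
}

rpo=$(achieved_rpo_seconds 1760000000 1760000240)   # 240 s of work would be lost
echo "Achieved RPO: ${rpo}s -> $(meets_rpo_target "$rpo" 3600)"
```

Recording achieved RPO/RTO for every drill, alongside the tier targets above, turns the table from an aspiration into a tracked SLO.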

Multi-Region Considerations

Data sovereignty, latency, and compliance factors that shape your multi-region CDE strategy

Data Sovereignty Matters More Than Ever

With GDPR enforcement increasing in 2026 and new regulations in APAC and Latin America, multi-region CDE deployments must consider where developer data, source code, and AI model inference results are stored and processed. AI agent workloads that send code to external LLM APIs add another layer of data residency complexity. Always verify that your failover regions comply with the same regulatory framework as your primary region.

Region Selection Criteria

  • Developer team geographic distribution
  • Regulatory requirements (GDPR, CCPA, PIPL)
  • GPU availability for AI agent workloads
  • Cloud provider service availability per region
  • Network latency to Git hosting and CI/CD
  • LLM API endpoint proximity for agent inference

Cross-Region Challenges

  • Database replication lag (async vs sync trade-offs)
  • Workspace state synchronization across regions
  • Container image distribution and caching
  • AI agent task state and checkpoint replication
  • Increased egress costs between regions
  • Split-brain scenarios during network partitions

Typical Cross-Region Latency Impact on CDE Operations

Operation               Same Region  US East <-> US West  US <-> EU  US <-> APAC
IDE Keystroke Response  <5ms         ~60ms                ~90ms      ~180ms
DB Replication          <1ms         ~30ms                ~80ms      ~150ms
Agent Task Dispatch     <10ms        ~70ms                ~100ms     ~200ms
Image Pull (Cached)     <2s          ~5s                  ~10s       ~20s
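Latencies like those above are easy to measure from inside a workspace with curl's built-in timing variables. A sketch (the regional endpoint URLs are placeholders; a file:// URL is used in the runnable line so the snippet works anywhere):

```shell
# Probe total round-trip time to an endpoint using curl's %{time_total} timer.
probe_latency() {
    curl -s -o /dev/null -w '%{time_total}' "$1"
}

# In practice you would probe each region's health endpoint, e.g.:
#   probe_latency "https://cde-us-east.company.com/api/v2/health"
#   probe_latency "https://cde-eu-west.company.com/api/v2/health"
t=$(probe_latency "file:///dev/null")
echo "Round trip: ${t}s"
```

Running this from each developer location against each candidate region gives the real numbers to feed into region selection.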

HA for AI Agent Workloads

Keeping autonomous agents running reliably across regions and failure scenarios

Why AI Agent HA Differs from Developer Workspace HA

When a developer's workspace goes down, the developer notices immediately and can wait for recovery. When an AI agent's workspace fails mid-task, the agent cannot self-report the failure or resume intelligently without orchestration. Agent workloads are often long-running (multi-hour code generation, test suites, refactoring), stateful (partial code changes, tool call history), and high-throughput (hundreds of concurrent agents across an organization). This makes HA for agent infrastructure fundamentally different from interactive developer workspace HA.

Task Queue Resilience

Use durable message queues (SQS, Redis Streams, NATS JetStream) to persist agent task assignments. If an agent workspace fails, the task is automatically re-queued and picked up by a healthy agent in the same or a different region.

Critical
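The re-queue-on-failure behavior can be sketched without any cloud dependency: a task moves from a pending pool to an in-flight pool when claimed, and is returned to pending if the agent dies before acknowledging. This mirrors SQS visibility-timeout semantics; the directory layout and task names below are illustrative:

```shell
# Minimal durable queue: pending/ holds unclaimed tasks, inflight/ holds claimed ones.
QDIR=$(mktemp -d)
mkdir -p "$QDIR/pending" "$QDIR/inflight"

enqueue() { echo "$2" > "$QDIR/pending/$1"; }
claim()   { mv "$QDIR/pending/$1" "$QDIR/inflight/$1"; }   # agent picks up the task
ack()     { rm "$QDIR/inflight/$1"; }                       # task completed, remove it
requeue() { mv "$QDIR/inflight/$1" "$QDIR/pending/$1"; }    # agent died: task survives

enqueue task-42 "refactor auth module"
claim task-42                   # an agent workspace starts the task...
requeue task-42                 # ...and crashes: orchestrator returns it to pending
cat "$QDIR/pending/task-42"     # payload is intact for the next healthy agent
```

With SQS or NATS JetStream the requeue step is automatic: an unacknowledged message simply becomes visible again after the timeout, in the same or a replicated region.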

Checkpoint and Resume

Periodically snapshot agent state (file changes, conversation history, tool call log) to shared storage. On failure, a new agent workspace can resume from the last checkpoint instead of starting from scratch.

Recommended
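Checkpoint-and-resume reduces to periodic snapshots of the agent's working state to a durable store, with resume restoring the newest snapshot into a fresh workspace. A sketch assuming GNU coreutils; the "store" here is a local directory and the state a single file, where production would use S3/GCS and include conversation history and tool-call logs:

```shell
STATE_DIR=$(mktemp -d)    # stand-in for the agent's workspace
STORE=$(mktemp -d)        # stand-in for the S3/GCS checkpoint store

checkpoint() {
    # Snapshot workspace state under a sortable, timestamped name.
    tar -czf "$STORE/ckpt-$(date +%s%N).tgz" -C "$STATE_DIR" .
}

resume() {
    # Restore the most recent checkpoint into a replacement workspace.
    local latest
    latest=$(ls "$STORE"/ckpt-*.tgz | sort | tail -n 1)
    tar -xzf "$latest" -C "$1"
}

echo "partial refactor: 3 of 7 files done" > "$STATE_DIR/progress.txt"
checkpoint
rm -rf "$STATE_DIR"           # agent workspace fails mid-task

NEW_DIR=$(mktemp -d)          # replacement workspace in the same or another region
resume "$NEW_DIR"
cat "$NEW_DIR/progress.txt"   # work continues from the last checkpoint
```

The checkpoint interval bounds the agent's effective RPO: snapshotting every 5 minutes means at most 5 minutes of agent work is redone after a failure.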

Ephemeral Sandboxes

Run each agent task in an ephemeral microVM or container that is disposable by design. Coder, Ona, and Codespaces all support ephemeral workspace patterns. If the sandbox fails, spin up a new one with no state to recover.

Best Practice
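The disposable-by-design pattern can be expressed as: every task gets a throwaway environment, and teardown happens unconditionally, so a failed sandbox leaves nothing to recover. A local sketch where a temp directory stands in for the microVM or container:

```shell
# Run a task in a fresh, disposable sandbox; destroy it regardless of outcome.
run_in_sandbox() {
    local sandbox status
    sandbox=$(mktemp -d)              # stand-in for a fresh microVM/container
    ( cd "$sandbox" && eval "$1" )    # task runs entirely inside the sandbox
    status=$?
    rm -rf "$sandbox"                 # always destroyed, success or failure
    return $status
}

run_in_sandbox 'echo "generated code" > out.txt && wc -c < out.txt'
echo "exit status: $?"
```

Durable outputs (a Git push, a queue acknowledgment) must leave the sandbox before teardown; everything else is deliberately ephemeral.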

Agent Health Monitoring

Implement heartbeat checks for running agents. If an agent stops responding (crashed sandbox, OOM kill, network partition), the orchestrator marks the task as failed and triggers re-assignment within the configured timeout.

Required
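Heartbeat detection reduces to comparing each agent's last check-in time against a timeout. A local sketch assuming GNU stat/touch; heartbeats are files whose modification time is the last check-in, and the 30-second timeout is illustrative:

```shell
HB_DIR=$(mktemp -d)   # each running agent periodically touches its own heartbeat file

heartbeat() { touch "$HB_DIR/$1"; }

# List agents whose last heartbeat is older than the timeout (candidates for re-queue).
stale_agents() {
    local timeout="$1" now mtime
    now=$(date +%s)
    for f in "$HB_DIR"/*; do
        [ -e "$f" ] || continue
        mtime=$(stat -c %Y "$f")
        [ $(( now - mtime )) -gt "$timeout" ] && basename "$f"
    done
    return 0
}

heartbeat agent-1
touch -d '5 minutes ago' "$HB_DIR/agent-2"   # simulate a crashed/partitioned agent
stale_agents 30                               # flags agent-2 for task re-assignment
```

The orchestrator's loop is then: for each stale agent, mark its in-flight task failed and return it to the queue, exactly as in the task-queue sketch.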

LLM Provider Failover

AI agents depend on LLM APIs for inference. Configure fallback providers (e.g., OpenAI to Anthropic, or self-hosted to cloud). Use API gateway patterns to route around provider outages without agent code changes.

Important
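The gateway-level fallback pattern is: try providers in priority order, return the first success. A sketch using curl (the provider URLs would be your gateway's configured endpoints; file:// URLs stand in for live APIs here so the example is self-contained):

```shell
# Try each provider endpoint in order; return the body of the first success.
call_with_failover() {
    local url response
    for url in "$@"; do
        if response=$(curl -sf "$url" 2>/dev/null); then
            echo "$response"
            return 0
        fi
    done
    echo "all providers failed" >&2
    return 1
}

# Simulate a primary outage: the first endpoint is unreachable, the second responds.
GOOD=$(mktemp); echo '{"completion":"ok"}' > "$GOOD"
call_with_failover "file:///nonexistent-primary" "file://$GOOD"
```

Real gateways (e.g. LiteLLM-style proxies) add retries, per-provider rate limits, and prompt/response normalization, but the routing core is this loop.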

Blast Radius Containment

Isolate agent pools by team, project, or risk level. A runaway agent consuming all GPU resources in one pool should not affect developer workspaces or agents in other pools. Use Kubernetes namespaces and resource quotas.

Security
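Per-pool quotas can be stamped out from a template so each team's agent namespace gets a hard resource ceiling. A sketch that generates the Kubernetes ResourceQuota manifest (the names and limits are illustrative; apply the output with kubectl):

```shell
# Emit a ResourceQuota manifest for one agent-pool namespace.
make_agent_quota() {
    local ns="$1" cpu="$2" mem="$3" gpu="$4"
    cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-pool-quota
  namespace: ${ns}
spec:
  hard:
    requests.cpu: "${cpu}"
    requests.memory: ${mem}
    requests.nvidia.com/gpu: "${gpu}"
    pods: "200"
EOF
}

# One quota per pool: a runaway pool hits its own ceiling, not the cluster's.
make_agent_quota agents-team-a 64 256Gi 4
```

Pairing quotas with separate namespaces per team or risk level gives the isolation boundary; the quota is what makes the boundary enforce resource containment.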

AI Agent HA Architecture Pattern

                    Agent Orchestrator (HA, multi-region)
                                    |
                    +---------------+---------------+
                    |                               |
             Task Queue (Primary)           Task Queue (DR)
           (SQS / Redis Streams)         (Cross-region replica)
                    |                               |
         +----------+----------+         +----------+----------+
         |          |          |         |          |          |
     Agent Pool  Agent Pool  Agent    Agent Pool  Agent Pool  Agent
     (Team A)    (Team B)    Pool     (Team A)    (Team B)    Pool
                             (GPU)                            (GPU)
         |          |          |         |          |          |
     Ephemeral  Ephemeral  Ephemeral  Ephemeral  Ephemeral  Ephemeral
     Sandboxes  Sandboxes  Sandboxes  Sandboxes  Sandboxes  Sandboxes
     (microVMs) (microVMs) (microVMs) (microVMs) (microVMs) (microVMs)
         |          |          |         |          |          |
         +----------+----------+         +----------+----------+
                    |                               |
            Checkpoint Store                Checkpoint Store
            (S3 / GCS / Blob)              (Cross-region sync)
                    |                               |
                    +---------- Git Push -----------+
                           (Durable Output)

Failover Automation

Automated and manual failover procedures for developer workspaces and agent infrastructure

Automated Failover Script (Workspaces + Agent Pools)

#!/bin/bash
# CDE Regional Failover Script (2026 - includes agent workloads)

set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
HEALTH_ENDPOINT="https://cde.company.com/api/v2/health"
AGENT_HEALTH_ENDPOINT="https://cde.company.com/api/v2/agents/health"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"
PRIMARY_QUEUE_URL="https://sqs.${PRIMARY_REGION}.amazonaws.com/<account-id>/agent-tasks"  # your agent task queue
DR_CONTEXT="cde-${DR_REGION}"       # kubectl context for the DR cluster
HOSTED_ZONE_ID="<route53-zone-id>"  # hosted zone serving cde.company.com

# Check primary health (workspaces + agents)
check_health() {
    local ws_status agent_status
    ws_status=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_ENDPOINT" --max-time 10)
    agent_status=$(curl -s -o /dev/null -w "%{http_code}" "$AGENT_HEALTH_ENDPOINT" --max-time 10)
    [[ "$ws_status" == "200" && "$agent_status" == "200" ]]
}

# Notify team
notify() {
    curl -X POST "$SLACK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \"$1\"}"
}

# Failover database
failover_db() {
    echo "Promoting DR database to primary..."
    aws rds promote-read-replica \
        --db-instance-identifier cde-db-dr \
        --region "$DR_REGION"
}

# Drain and redirect agent task queues
failover_agents() {
    echo "Pausing agent task dispatch in $PRIMARY_REGION..."
    aws sqs set-queue-attributes \
        --queue-url "$PRIMARY_QUEUE_URL" \
        --attributes '{"ReceiveMessageWaitTimeSeconds":"0"}'

    echo "Checkpointing in-flight agent tasks..."
    curl -X POST "https://cde.company.com/api/v2/agents/checkpoint-all" \
        --max-time 30

    echo "Activating DR agent queue consumers in $DR_REGION..."
    kubectl --context "$DR_CONTEXT" scale deployment agent-consumer \
        --replicas=10
}

# Update DNS
update_dns() {
    echo "Updating DNS to DR region..."
    aws route53 change-resource-record-sets \
        --hosted-zone-id "$HOSTED_ZONE_ID" \
        --change-batch file://failover-dns.json
}

# Main failover logic
main() {
    if ! check_health; then
        notify ":rotating_light: Primary CDE region unhealthy - initiating failover"

        failover_agents
        failover_db
        update_dns

        notify ":white_check_mark: Failover complete - CDE now running in $DR_REGION"
        notify ":robot_face: Agent tasks checkpointed and resuming in $DR_REGION"
    else
        echo "Primary region healthy - no action needed"
    fi
}

main "$@"

Developer Workspace Failover

  • DNS failover to DR region
  • Database promotion (RDS/CloudSQL)
  • Workspace storage reattachment
  • IDE reconnection (VS Code, JetBrains Gateway)
  • Verify Git push/pull from DR region

AI Agent Failover

  • Pause task dispatch in failed region
  • Checkpoint all in-flight agent tasks
  • Scale up DR agent consumer fleet
  • Verify LLM API connectivity from DR
  • Resume checkpointed tasks in DR
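The DNS step in the checklists above is typically a Route53 change batch that repoints the CDE hostname at the DR region's load balancer. A sketch that generates such a file (the hostname, target, and 60-second TTL are placeholders for your own values):

```shell
# Generate a Route53 change batch pointing the CDE hostname at the DR endpoint.
make_failover_dns() {
    local hostname="$1" dr_target="$2"
    cat <<EOF
{
  "Comment": "Failover: point CDE at DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${hostname}",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "${dr_target}" }]
      }
    }
  ]
}
EOF
}

make_failover_dns "cde.company.com" "cde-eu-west.company.com" > failover-dns.json
# Then: aws route53 change-resource-record-sets --hosted-zone-id <zone-id> \
#         --change-batch file://failover-dns.json
```

Keeping the TTL low (60 seconds here) bounds how long clients keep resolving to the failed region after the UPSERT lands.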

DR Testing Schedule

Regular testing ensures DR procedures actually work - for both human and AI workloads

Weekly

  • Health check validation
  • Backup verification
  • Alert test (silent)
  • Agent heartbeat audit

Monthly

  • Restore test (non-prod)
  • Runbook walkthrough
  • On-call rotation test
  • Agent checkpoint/resume test

Quarterly

  • Partial failover drill
  • Chaos engineering test
  • Documentation review
  • LLM provider failover test

Annually

  • Full failover drill
  • Multi-day DR simulation
  • Third-party audit
  • Full agent fleet failover test

Chaos Engineering for CDEs in 2026

With AI agents now running hundreds of concurrent workspaces, chaos testing is more important than ever. Modern chaos engineering for CDEs should include: killing random agent sandboxes mid-task, simulating LLM API outages, injecting network partitions between regions, throttling GPU availability, and testing checkpoint/resume under load. Tools like Litmus, Chaos Mesh, and Gremlin all support Kubernetes-native chaos experiments that map well to CDE infrastructure.
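A minimal local version of the "kill random agent sandboxes mid-task" experiment: start a fleet of stand-in sandbox processes, kill one at random, and verify the monitor's view of the fleet. Background sleeps stand in for microVM sandboxes; dedicated chaos tools do the same against real pods:

```shell
# Start a small fleet of stand-in "sandboxes" (background sleeps) and track PIDs.
pids=()
for i in 1 2 3 4 5; do
    sleep 60 &
    pids+=($!)
done

# Chaos step: kill one sandbox at random, mid-"task".
victim=${pids[RANDOM % ${#pids[@]}]}
kill "$victim"
wait "$victim" 2>/dev/null || true     # reap it so liveness checks are accurate

# The monitor's job: count survivors; the lost task should be flagged for re-queue.
alive=0
for pid in "${pids[@]}"; do
    kill -0 "$pid" 2>/dev/null && alive=$((alive + 1))
done
echo "sandboxes alive: $alive of ${#pids[@]}"
kill "${pids[@]}" 2>/dev/null || true  # clean up the remaining fleet
```

The experiment passes when the orchestrator detects the loss within its heartbeat timeout and the victim's task resumes from its last checkpoint rather than restarting from scratch.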

Key metric: Measure not just recovery time, but how many agent tasks were lost or had to restart from scratch. The goal is zero task loss with checkpoint-based recovery.