High Availability & Disaster Recovery
Multi-region deployment, failover automation, RTO/RPO targets, and DR testing procedures for enterprise-grade CDE resilience - including HA strategies for AI agent workloads at scale.
High Availability Architecture
Multi-region deployment patterns for maximum resilience across developer workspaces and AI agent workloads
Single Region HA
Multi-AZ deployment within one region
- Pro: 99.9% availability target
- Pro: Lower cost and complexity
- Pro: Simpler data residency compliance
- Con: Single region failure risk
Active-Passive
Primary region with standby DR site
- Pro: 99.95% availability target
- Pro: Full region failover capability
- Pro: AI agent queues persist across failover
- Tradeoff: RTO of 15-60 minutes
Active-Active (Recommended)
Traffic served from multiple regions simultaneously
- Pro: 99.99% availability target
- Pro: Near-zero RTO
- Pro: Best developer and agent latency
- Pro: Distributes GPU and compute load
CDE Platform HA Capabilities (2026)
Coder
- Multi-region Kubernetes
- External PostgreSQL HA
- Workspace auto-recovery
- Satellite architecture
Ona (formerly Gitpod)
- Multi-cluster support
- Workspace snapshots
- Prebuilt image caching
- Regional pod scheduling
GitHub Codespaces
- Azure region selection
- Managed infrastructure HA
- Automatic backup/restore
- GitHub Actions integration
DevPod / Daytona
- Provider-agnostic failover
- Multi-cloud workspace routing
- Bring-your-own infrastructure
- GitOps-driven recovery
Active-Active Multi-Region Architecture
Global Load Balancer (Route53 / Cloud DNS)
|
+-------------------+-------------------+
| |
Region: US-East Region: EU-West
| |
+--------+--------+ +--------+--------+
| | | |
CDE Control Kubernetes CDE Control Kubernetes
Plane Cluster Plane Cluster
(3 replicas) (Worker Nodes) (3 replicas) (Worker Nodes)
| | | |
AI Agent GPU Node Pool AI Agent GPU Node Pool
Scheduler (Optional) Scheduler (Optional)
| | | |
+--------+--------+ +--------+--------+
| |
PostgreSQL HA PostgreSQL HA
(RDS Multi-AZ) (RDS Multi-AZ)
| |
+------- Cross-Region Replication ------+
(Async, RPO: ~1 min)
Developer Workspaces: Developer Workspaces:
- us-east-1a (AZ-1) - eu-west-1a (AZ-1)
- us-east-1b (AZ-2) - eu-west-1b (AZ-2)
- us-east-1c (AZ-3) - eu-west-1c (AZ-3)
AI Agent Sandboxes: AI Agent Sandboxes:
- Ephemeral microVMs - Ephemeral microVMs
- Isolated namespaces - Isolated namespaces
- Task queue failover                      - Task queue failover
RTO & RPO Targets
Recovery objectives by CDE component - including AI agent infrastructure
RTO
Recovery Time Objective
Maximum acceptable time to restore service after an outage. How long can developers and AI agents be without their workspaces?
RPO
Recovery Point Objective
Maximum acceptable data loss measured in time. How much work - human or agent-generated - can be lost?
| Component | Tier 1 (Critical) | Tier 2 (Standard) | Tier 3 (Best Effort) |
|---|---|---|---|
| Control Plane | RTO: 5 min RPO: 0 | RTO: 30 min RPO: 5 min | RTO: 4 hr RPO: 1 hr |
| Database (User/Config) | RTO: 5 min RPO: 1 min | RTO: 30 min RPO: 15 min | RTO: 4 hr RPO: 24 hr |
| Workspace Storage | RTO: 15 min RPO: 5 min | RTO: 1 hr RPO: 1 hr | RTO: 8 hr RPO: 24 hr |
| AI Agent Task Queues | RTO: 2 min RPO: 0 | RTO: 15 min RPO: 5 min | RTO: 2 hr RPO: 30 min |
| Templates/Automation | RTO: Minutes RPO: 0 (stored in Git, re-deploy from repo) | Same as Tier 1 | Same as Tier 1 |
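During DR drills it helps to compare measured recovery numbers against the tier targets in the table above. A minimal sketch (the hardcoded targets mirror the Tier 1 Control Plane row and are illustrative; all values in minutes):

```shell
#!/bin/sh
# Compare a measured incident against tier targets (all values in minutes).
# Targets below mirror the Tier 1 "Control Plane" row; adjust per component.
RTO_TARGET=5
RPO_TARGET=0

check_objectives() {
  # $1 = measured recovery time, $2 = measured data loss
  if [ "$1" -le "$RTO_TARGET" ] && [ "$2" -le "$RPO_TARGET" ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}
```

For example, `check_objectives 4 0` prints PASS, while `check_objectives 12 0` prints FAIL.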
Multi-Region Considerations
Data sovereignty, latency, and compliance factors that shape your multi-region CDE strategy
Data Sovereignty Matters More Than Ever
With GDPR enforcement increasing in 2026 and new regulations in APAC and Latin America, multi-region CDE deployments must consider where developer data, source code, and AI model inference results are stored and processed. AI agent workloads that send code to external LLM APIs add another layer of data residency complexity. Always verify that your failover regions comply with the same regulatory framework as your primary region.
Region Selection Criteria
- Developer team geographic distribution
- Regulatory requirements (GDPR, CCPA, PIPL)
- GPU availability for AI agent workloads
- Cloud provider service availability per region
- Network latency to Git hosting and CI/CD
- LLM API endpoint proximity for agent inference
Cross-Region Challenges
- Database replication lag (async vs sync trade-offs)
- Workspace state synchronization across regions
- Container image distribution and caching
- AI agent task state and checkpoint replication
- Increased egress costs between regions
- Split-brain scenarios during network partitions
Typical Cross-Region Latency Impact on CDE Operations
| Operation | Same Region | US East <-> US West | US <-> EU | US <-> APAC |
|---|---|---|---|---|
| IDE Keystroke Response | <5ms | ~60ms | ~90ms | ~180ms |
| DB Replication | <1ms | ~30ms | ~80ms | ~150ms |
| Agent Task Dispatch | <10ms | ~70ms | ~100ms | ~200ms |
| Image Pull (Cached) | <2s | ~5s | ~10s | ~20s |
HA for AI Agent Workloads
Keeping autonomous agents running reliably across regions and failure scenarios
Why AI Agent HA Differs from Developer Workspace HA
When a developer's workspace goes down, the developer notices immediately and can wait for recovery. When an AI agent's workspace fails mid-task, the agent cannot self-report the failure or resume intelligently without orchestration. Agent workloads are often long-running (multi-hour code generation, test suites, refactoring), stateful (partial code changes, tool call history), and high-throughput (hundreds of concurrent agents across an organization). This makes HA for agent infrastructure fundamentally different from interactive developer workspace HA.
Task Queue Resilience (Critical)
Use durable message queues (SQS, Redis Streams, NATS JetStream) to persist agent task assignments. If an agent workspace fails, the task is automatically re-queued and picked up by a healthy agent in the same or a different region.
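With SQS, automatic re-queue semantics fall out of a visibility timeout sized to the agent's expected task heartbeat, plus a dead-letter queue for poison tasks. A sketch of the queue attributes as passed to `aws sqs set-queue-attributes` (the ARN and values are placeholders):

```json
{
  "VisibilityTimeout": "900",
  "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:111111111111:agent-tasks-dlq\",\"maxReceiveCount\":\"3\"}"
}
```

A task whose consumer dies simply becomes visible again after 15 minutes; after three failed receives it lands in the dead-letter queue for inspection instead of looping forever.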
Checkpoint and Resume (Recommended)
Periodically snapshot agent state (file changes, conversation history, tool call log) to shared storage. On failure, a new agent workspace can resume from the last checkpoint instead of starting from scratch.
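A minimal local sketch of the checkpoint/restore cycle; in production the archive would be written to S3/GCS/Blob rather than a local path, and would also capture conversation history and the tool call log:

```shell
#!/bin/sh
# Checkpoint a workspace directory to a tarball and restore it elsewhere.
# A real implementation would upload the archive to object storage
# (e.g., aws s3 cp) so a replacement workspace in any region can resume.
checkpoint() {
  # $1 = workspace dir, $2 = checkpoint archive path
  tar -czf "$2" -C "$1" .
}

restore() {
  # $1 = checkpoint archive path, $2 = target workspace dir
  mkdir -p "$2"
  tar -xzf "$1" -C "$2"
}
```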
Ephemeral Sandboxes (Best Practice)
Run each agent task in an ephemeral microVM or container that is disposable by design. Coder, Ona, and Codespaces all support ephemeral workspace patterns. If the sandbox fails, spin up a new one with no state to recover.
Agent Health Monitoring (Required)
Implement heartbeat checks for running agents. If an agent stops responding (crashed sandbox, OOM kill, network partition), the orchestrator marks the task as failed and triggers re-assignment within the configured timeout.
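The staleness decision itself is simple. A sketch with timestamps passed in explicitly so the logic is clock-independent (the 120-second timeout is illustrative):

```shell
#!/bin/sh
# Decide whether an agent is stale based on its last recorded heartbeat.
HEARTBEAT_TIMEOUT=120   # seconds; illustrative value

agent_status() {
  # $1 = current time (epoch seconds), $2 = last heartbeat (epoch seconds)
  if [ $(( $1 - $2 )) -gt "$HEARTBEAT_TIMEOUT" ]; then
    echo "stale"    # orchestrator would re-queue the task here
  else
    echo "healthy"
  fi
}
```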
LLM Provider Failover (Important)
AI agents depend on LLM APIs for inference. Configure fallback providers (e.g., OpenAI to Anthropic, or self-hosted to cloud). Use API gateway patterns to route around provider outages without agent code changes.
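The routing decision reduces to picking the first healthy provider in a preference-ordered chain. A sketch (health status is passed in as a string for testability; a real gateway would probe each endpoint):

```shell
#!/bin/sh
# Pick the first provider marked "up" from an ordered fallback chain.
select_provider() {
  # $1 = space-separated "name:status" pairs, in preference order
  for entry in $1; do
    if [ "${entry##*:}" = "up" ]; then
      echo "${entry%%:*}"
      return 0
    fi
  done
  echo "none"
  return 1
}
```

For example, `select_provider "openai:down anthropic:up selfhosted:up"` prints anthropic.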
Blast Radius Containment (Security)
Isolate agent pools by team, project, or risk level. A runaway agent consuming all GPU resources in one pool should not affect developer workspaces or agents in other pools. Use Kubernetes namespaces and resource quotas.
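A per-pool quota in Kubernetes might look like the following (namespace and limits are illustrative); pods in the pool's namespace can then exhaust only their own allocation:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-pool-quota
  namespace: agents-team-a      # one namespace per agent pool
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8"   # caps GPU consumption for this pool
    pods: "100"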
AI Agent HA Architecture Pattern
Agent Orchestrator (HA, multi-region)
|
+---------------+---------------+
| |
Task Queue (Primary) Task Queue (DR)
(SQS / Redis Streams) (Cross-region replica)
| |
+----------+----------+ +----------+----------+
| | | | | |
Agent Pool Agent Pool Agent Agent Pool Agent Pool Agent
(Team A) (Team B) Pool (Team A) (Team B) Pool
(GPU) (GPU)
| | | | | |
Ephemeral Ephemeral Ephemeral Ephemeral Ephemeral Ephemeral
Sandboxes Sandboxes Sandboxes Sandboxes Sandboxes Sandboxes
(microVMs) (microVMs) (microVMs) (microVMs) (microVMs) (microVMs)
| | | | | |
+----------+----------+ +----------+----------+
| |
Checkpoint Store Checkpoint Store
(S3 / GCS / Blob) (Cross-region sync)
| |
+---------- Git Push -----------+
(Durable Output)
Failover Automation
Automated and manual failover procedures for developer workspaces and agent infrastructure
Automated Failover Script (Workspaces + Agent Pools)
#!/bin/bash
# CDE Regional Failover Script (2026 - includes agent workloads)
set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
HEALTH_ENDPOINT="https://cde.company.com/api/v2/health"
AGENT_HEALTH_ENDPOINT="https://cde.company.com/api/v2/agents/health"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"
# Placeholders - set these for your environment (required under set -u)
PRIMARY_QUEUE_URL="https://sqs.us-east-1.amazonaws.com/111111111111/agent-tasks"
DR_CONTEXT="cde-dr-cluster"
HOSTED_ZONE_ID="ZXXXXXXXXXXXXX"

# Check primary health (workspaces + agents)
check_health() {
  local ws_status agent_status
  # "|| true" so a curl timeout doesn't abort the script under set -e
  ws_status=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_ENDPOINT" --max-time 10 || true)
  agent_status=$(curl -s -o /dev/null -w "%{http_code}" "$AGENT_HEALTH_ENDPOINT" --max-time 10 || true)
  [[ "$ws_status" == "200" ]] && [[ "$agent_status" == "200" ]]
}

# Notify team
notify() {
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"$1\"}"
}

# Failover database
failover_db() {
  echo "Promoting DR database to primary..."
  aws rds promote-read-replica \
    --db-instance-identifier cde-db-dr \
    --region "$DR_REGION"
}

# Drain and redirect agent task queues
failover_agents() {
  echo "Pausing agent task dispatch in $PRIMARY_REGION..."
  aws sqs set-queue-attributes \
    --queue-url "$PRIMARY_QUEUE_URL" \
    --attributes '{"ReceiveMessageWaitTimeSeconds":"0"}'

  echo "Activating DR agent queue consumers in $DR_REGION..."
  kubectl --context "$DR_CONTEXT" scale deployment agent-consumer \
    --replicas=10

  echo "Checkpointing in-flight agent tasks..."
  # Best effort - the primary control plane may already be unreachable
  curl -s -X POST "https://cde.company.com/api/v2/agents/checkpoint-all" \
    --max-time 30 || true
}

# Update DNS
update_dns() {
  echo "Updating DNS to DR region..."
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$HOSTED_ZONE_ID" \
    --change-batch file://failover-dns.json
}

# Main failover logic
main() {
  if ! check_health; then
    notify ":rotating_light: Primary CDE region unhealthy - initiating failover"
    failover_agents
    failover_db
    update_dns
    notify ":white_check_mark: Failover complete - CDE now running in $DR_REGION"
    notify ":robot_face: Agent tasks checkpointed and resuming in $DR_REGION"
  else
    echo "Primary region healthy - no action needed"
  fi
}

main "$@"

Developer Workspace Failover
- DNS failover to DR region
- Database promotion (RDS/CloudSQL)
- Workspace storage reattachment
- IDE reconnection (VS Code, JetBrains Gateway)
- Verify Git push/pull from DR region
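The failover-dns.json change batch referenced by the script above might look like this (record names, TTL, and the DR target are illustrative):

```json
{
  "Comment": "Fail over CDE endpoint to DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde.company.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "cde-dr.eu-west-1.company.com" }
        ]
      }
    }
  ]
}
```

A low TTL keeps failover propagation fast at the cost of more DNS traffic; many teams pair this with Route53 health checks and failover routing policies so DNS flips automatically.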
AI Agent Failover
- Pause task dispatch in failed region
- Checkpoint all in-flight agent tasks
- Scale up DR agent consumer fleet
- Verify LLM API connectivity from DR
- Resume checkpointed tasks in DR
DR Testing Schedule
Regular testing ensures DR procedures actually work - for both human and AI workloads
Weekly
- Health check validation
- Backup verification
- Alert test (silent)
- Agent heartbeat audit
Monthly
- Restore test (non-prod)
- Runbook walkthrough
- On-call rotation test
- Agent checkpoint/resume test
Quarterly
- Partial failover drill
- Chaos engineering test
- Documentation review
- LLM provider failover test
Annually
- Full failover drill
- Multi-day DR simulation
- Third-party audit
- Full agent fleet failover test
Chaos Engineering for CDEs in 2026
With AI agents now running hundreds of concurrent workspaces, chaos testing is more important than ever. Modern chaos engineering for CDEs should include: killing random agent sandboxes mid-task, simulating LLM API outages, injecting network partitions between regions, throttling GPU availability, and testing checkpoint/resume under load. Tools like Litmus, Chaos Mesh, and Gremlin all support Kubernetes-native chaos experiments that map well to CDE infrastructure.
Key metric: Measure not just recovery time, but how many agent tasks were lost or had to restart from scratch. The goal is zero task loss with checkpoint-based recovery.
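As a concrete starting point, a Chaos Mesh experiment that kills a slice of agent sandboxes mid-task might look like this (namespace and labels are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-agent-sandboxes
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"                 # kill 10% of matching pods
  selector:
    namespaces:
      - agents-team-a
    labelSelectors:
      role: agent-sandbox
```

After each run, verify that every killed task resumed from its last checkpoint rather than restarting from scratch.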
Related Resources
Continue building resilient infrastructure
