High Availability & Disaster Recovery
Multi-region deployment, failover automation, RTO/RPO targets, and DR testing procedures for enterprise-grade CDE resilience - including HA strategies for AI agent workloads at scale.
High Availability Architecture
Multi-region deployment patterns for maximum resilience across developer workspaces and AI agent workloads
Single Region HA
Multi-AZ deployment within one region
- Pro: 99.9% availability target
- Pro: Lower cost and complexity
- Pro: Simpler data residency compliance
- Con: Single region failure risk
Active-Passive
Primary region with standby DR site
- Pro: 99.95% availability target
- Pro: Full region failover capability
- Pro: AI agent queues persist across failover
- Tradeoff: RTO of 15-60 minutes
Active-Active (Recommended)
Traffic served from multiple regions simultaneously
- Pro: 99.99% availability target
- Pro: Near-zero RTO
- Pro: Best developer and agent latency
- Pro: Distributes GPU and compute load
CDE Platform HA Capabilities (2026)
Coder
- Multi-region Kubernetes
- External PostgreSQL HA
- Workspace auto-recovery
- Satellite architecture
Ona (formerly Gitpod)
- Multi-cluster support
- Workspace snapshots
- Prebuilt image caching
- Regional pod scheduling
GitHub Codespaces
- Azure region selection
- Managed infrastructure HA
- Automatic backup/restore
- GitHub Actions integration
DevPod / Daytona
- Provider-agnostic failover
- Multi-cloud workspace routing
- Bring-your-own infrastructure
- GitOps-driven recovery
Active-Active Multi-Region Architecture
Global Load Balancer (Route53 / Cloud DNS)
|
+-------------------+-------------------+
| |
Region: US-East Region: EU-West
| |
+--------+--------+ +--------+--------+
| | | |
CDE Control Kubernetes CDE Control Kubernetes
Plane Cluster Plane Cluster
(3 replicas) (Worker Nodes) (3 replicas) (Worker Nodes)
| | | |
AI Agent GPU Node Pool AI Agent GPU Node Pool
Scheduler (Optional) Scheduler (Optional)
| | | |
+--------+--------+ +--------+--------+
| |
PostgreSQL HA PostgreSQL HA
(RDS Multi-AZ) (RDS Multi-AZ)
| |
+------- Cross-Region Replication ------+
(Async, RPO: ~1 min)
Developer Workspaces: Developer Workspaces:
- us-east-1a (AZ-1) - eu-west-1a (AZ-1)
- us-east-1b (AZ-2) - eu-west-1b (AZ-2)
- us-east-1c (AZ-3) - eu-west-1c (AZ-3)
AI Agent Sandboxes: AI Agent Sandboxes:
- Ephemeral microVMs - Ephemeral microVMs
- Isolated namespaces - Isolated namespaces
- Task queue failover                      - Task queue failover
RTO & RPO Targets
Recovery objectives by CDE component - including AI agent infrastructure
RTO
Recovery Time Objective
Maximum acceptable time to restore service after an outage. How long can developers and AI agents be without their workspaces?
RPO
Recovery Point Objective
Maximum acceptable data loss measured in time. How much work - human or agent-generated - can be lost?
| Component | Tier 1 (Critical) | Tier 2 (Standard) | Tier 3 (Best Effort) |
|---|---|---|---|
| Control Plane | RTO: 5 min RPO: 0 | RTO: 30 min RPO: 5 min | RTO: 4 hr RPO: 1 hr |
| Database (User/Config) | RTO: 5 min RPO: 1 min | RTO: 30 min RPO: 15 min | RTO: 4 hr RPO: 24 hr |
| Workspace Storage | RTO: 15 min RPO: 5 min | RTO: 1 hr RPO: 1 hr | RTO: 8 hr RPO: 24 hr |
| AI Agent Task Queues | RTO: 2 min RPO: 0 | RTO: 15 min RPO: 5 min | RTO: 2 hr RPO: 30 min |
| Templates/Automation | RTO: Minutes RPO: 0 (stored in Git, re-deploy from repo) | Same as Tier 1 | Same as Tier 1 |
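During DR drills it helps to compare measured recovery numbers against the tier targets in the table above. A minimal sketch (the hardcoded targets mirror the Tier 1 Control Plane row and are illustrative; all values in minutes):

```shell
#!/bin/sh
# Compare a measured incident against tier targets (all values in minutes).
# Targets below mirror the Tier 1 "Control Plane" row; adjust per component.
RTO_TARGET=5
RPO_TARGET=0

check_objectives() {
  # $1 = measured recovery time, $2 = measured data loss
  if [ "$1" -le "$RTO_TARGET" ] && [ "$2" -le "$RPO_TARGET" ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}
```

For example, `check_objectives 4 0` prints PASS, while `check_objectives 12 0` prints FAIL.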
Multi-Region Considerations
Data sovereignty, latency, and compliance factors that shape your multi-region CDE strategy
Data Sovereignty Matters More Than Ever
With GDPR enforcement increasing in 2026 and new regulations in APAC and Latin America, multi-region CDE deployments must consider where developer data, source code, and AI model inference results are stored and processed. AI agent workloads that send code to external LLM APIs add another layer of data residency complexity. Always verify that your failover regions comply with the same regulatory framework as your primary region.
Region Selection Criteria
- Developer team geographic distribution
- Regulatory requirements (GDPR, CCPA, PIPL)
- GPU availability for AI agent workloads
- Cloud provider service availability per region
- Network latency to Git hosting and CI/CD
- LLM API endpoint proximity for agent inference
Cross-Region Challenges
- Database replication lag (async vs sync trade-offs)
- Workspace state synchronization across regions
- Container image distribution and caching
- AI agent task state and checkpoint replication
- Increased egress costs between regions
- Split-brain scenarios during network partitions
Typical Cross-Region Latency Impact on CDE Operations
| Operation | Same Region | US East <-> US West | US <-> EU | US <-> APAC |
|---|---|---|---|---|
| IDE Keystroke Response | <5ms | ~60ms | ~90ms | ~180ms |
| DB Replication | <1ms | ~30ms | ~80ms | ~150ms |
| Agent Task Dispatch | <10ms | ~70ms | ~100ms | ~200ms |
| Image Pull (Cached) | <2s | ~5s | ~10s | ~20s |
HA for AI Agent Workloads
Keeping autonomous agents running reliably across regions and failure scenarios
Why AI Agent HA Differs from Developer Workspace HA
When a developer's workspace goes down, the developer notices immediately and can wait for recovery. When an AI agent's workspace fails mid-task, the agent cannot self-report the failure or resume intelligently without orchestration. Agent workloads are often long-running (multi-hour code generation, test suites, refactoring), stateful (partial code changes, tool call history), and high-throughput (hundreds of concurrent agents across an organization). This makes HA for agent infrastructure fundamentally different from interactive developer workspace HA.
Task Queue Resilience (Critical)
Use durable message queues (SQS, Redis Streams, NATS JetStream) to persist agent task assignments. If an agent workspace fails, the task is automatically re-queued and picked up by a healthy agent in the same or a different region.
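With SQS, automatic re-queue semantics fall out of a visibility timeout sized to the agent's expected task heartbeat, plus a dead-letter queue for poison tasks. A sketch of the queue attributes as passed to `aws sqs set-queue-attributes` (the ARN and values are placeholders):

```json
{
  "VisibilityTimeout": "900",
  "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:111111111111:agent-tasks-dlq\",\"maxReceiveCount\":\"3\"}"
}
```

A task whose consumer dies simply becomes visible again after 15 minutes; after three failed receives it lands in the dead-letter queue for inspection instead of looping forever.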
Checkpoint and Resume (Recommended)
Periodically snapshot agent state (file changes, conversation history, tool call log) to shared storage. On failure, a new agent workspace can resume from the last checkpoint instead of starting from scratch.
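A minimal local sketch of the checkpoint/restore cycle; in production the archive would be written to S3/GCS/Blob rather than a local path, and would also capture conversation history and the tool call log:

```shell
#!/bin/sh
# Checkpoint a workspace directory to a tarball and restore it elsewhere.
# A real implementation would upload the archive to object storage
# (e.g., aws s3 cp) so a replacement workspace in any region can resume.
checkpoint() {
  # $1 = workspace dir, $2 = checkpoint archive path
  tar -czf "$2" -C "$1" .
}

restore() {
  # $1 = checkpoint archive path, $2 = target workspace dir
  mkdir -p "$2"
  tar -xzf "$1" -C "$2"
}
```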
Ephemeral Sandboxes (Best Practice)
Run each agent task in an ephemeral microVM or container that is disposable by design. Coder, Ona, and Codespaces all support ephemeral workspace patterns. If the sandbox fails, spin up a new one with no state to recover.
Agent Health Monitoring (Required)
Implement heartbeat checks for running agents. If an agent stops responding (crashed sandbox, OOM kill, network partition), the orchestrator marks the task as failed and triggers re-assignment within the configured timeout.
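The staleness decision itself is simple. A sketch with timestamps passed in explicitly so the logic is clock-independent (the 120-second timeout is illustrative):

```shell
#!/bin/sh
# Decide whether an agent is stale based on its last recorded heartbeat.
HEARTBEAT_TIMEOUT=120   # seconds; illustrative value

agent_status() {
  # $1 = current time (epoch seconds), $2 = last heartbeat (epoch seconds)
  if [ $(( $1 - $2 )) -gt "$HEARTBEAT_TIMEOUT" ]; then
    echo "stale"    # orchestrator would re-queue the task here
  else
    echo "healthy"
  fi
}
```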
LLM Provider Failover (Important)
AI agents depend on LLM APIs for inference. Configure fallback providers (e.g., OpenAI to Anthropic, or self-hosted to cloud). Use API gateway patterns to route around provider outages without agent code changes.
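The routing decision reduces to picking the first healthy provider in a preference-ordered chain. A sketch (health status is passed in as a string for testability; a real gateway would probe each endpoint):

```shell
#!/bin/sh
# Pick the first provider marked "up" from an ordered fallback chain.
select_provider() {
  # $1 = space-separated "name:status" pairs, in preference order
  for entry in $1; do
    if [ "${entry##*:}" = "up" ]; then
      echo "${entry%%:*}"
      return 0
    fi
  done
  echo "none"
  return 1
}
```

For example, `select_provider "openai:down anthropic:up selfhosted:up"` prints anthropic.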
Blast Radius Containment (Security)
Isolate agent pools by team, project, or risk level. A runaway agent consuming all GPU resources in one pool should not affect developer workspaces or agents in other pools. Use Kubernetes namespaces and resource quotas.
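A per-pool quota in Kubernetes might look like the following (namespace and limits are illustrative); pods in the pool's namespace can then exhaust only their own allocation:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-pool-quota
  namespace: agents-team-a      # one namespace per agent pool
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8"   # caps GPU consumption for this pool
    pods: "100"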
AI Agent HA Architecture Pattern
Agent Orchestrator (HA, multi-region)
|
+---------------+---------------+
| |
Task Queue (Primary) Task Queue (DR)
(SQS / Redis Streams) (Cross-region replica)
| |
+----------+----------+ +----------+----------+
| | | | | |
Agent Pool Agent Pool Agent Agent Pool Agent Pool Agent
(Team A) (Team B) Pool (Team A) (Team B) Pool
(GPU) (GPU)
| | | | | |
Ephemeral Ephemeral Ephemeral Ephemeral Ephemeral Ephemeral
Sandboxes Sandboxes Sandboxes Sandboxes Sandboxes Sandboxes
(microVMs) (microVMs) (microVMs) (microVMs) (microVMs) (microVMs)
| | | | | |
+----------+----------+ +----------+----------+
| |
Checkpoint Store Checkpoint Store
(S3 / GCS / Blob) (Cross-region sync)
| |
+---------- Git Push -----------+
(Durable Output)
Failover Automation
Automated and manual failover procedures for developer workspaces and agent infrastructure
Automated Failover Script (Workspaces + Agent Pools)
#!/bin/bash
# CDE Regional Failover Script (2026 - includes agent workloads)
set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
HEALTH_ENDPOINT="https://cde.company.com/api/v2/health"
AGENT_HEALTH_ENDPOINT="https://cde.company.com/api/v2/agents/health"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"
# Placeholders - set these for your environment (required under set -u)
PRIMARY_QUEUE_URL="https://sqs.us-east-1.amazonaws.com/111111111111/agent-tasks"
DR_CONTEXT="cde-dr-cluster"
HOSTED_ZONE_ID="ZXXXXXXXXXXXXX"

# Check primary health (workspaces + agents)
check_health() {
  local ws_status agent_status
  # "|| true" so a curl timeout doesn't abort the script under set -e
  ws_status=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_ENDPOINT" --max-time 10 || true)
  agent_status=$(curl -s -o /dev/null -w "%{http_code}" "$AGENT_HEALTH_ENDPOINT" --max-time 10 || true)
  [[ "$ws_status" == "200" ]] && [[ "$agent_status" == "200" ]]
}

# Notify team
notify() {
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"$1\"}"
}

# Failover database
failover_db() {
  echo "Promoting DR database to primary..."
  aws rds promote-read-replica \
    --db-instance-identifier cde-db-dr \
    --region "$DR_REGION"
}

# Drain and redirect agent task queues
failover_agents() {
  echo "Pausing agent task dispatch in $PRIMARY_REGION..."
  aws sqs set-queue-attributes \
    --queue-url "$PRIMARY_QUEUE_URL" \
    --attributes '{"ReceiveMessageWaitTimeSeconds":"0"}'

  echo "Activating DR agent queue consumers in $DR_REGION..."
  kubectl --context "$DR_CONTEXT" scale deployment agent-consumer \
    --replicas=10

  echo "Checkpointing in-flight agent tasks..."
  # Best effort - the primary control plane may already be unreachable
  curl -s -X POST "https://cde.company.com/api/v2/agents/checkpoint-all" \
    --max-time 30 || true
}

# Update DNS
update_dns() {
  echo "Updating DNS to DR region..."
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$HOSTED_ZONE_ID" \
    --change-batch file://failover-dns.json
}

# Main failover logic
main() {
  if ! check_health; then
    notify ":rotating_light: Primary CDE region unhealthy - initiating failover"
    failover_agents
    failover_db
    update_dns
    notify ":white_check_mark: Failover complete - CDE now running in $DR_REGION"
    notify ":robot_face: Agent tasks checkpointed and resuming in $DR_REGION"
  else
    echo "Primary region healthy - no action needed"
  fi
}

main "$@"

Developer Workspace Failover
- DNS failover to DR region
- Database promotion (RDS/CloudSQL)
- Workspace storage reattachment
- IDE reconnection (VS Code, JetBrains Gateway)
- Verify Git push/pull from DR region
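The failover-dns.json change batch referenced by the script above might look like this (record names, TTL, and the DR target are illustrative):

```json
{
  "Comment": "Fail over CDE endpoint to DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde.company.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "cde-dr.eu-west-1.company.com" }
        ]
      }
    }
  ]
}
```

A low TTL keeps failover propagation fast at the cost of more DNS traffic; many teams pair this with Route53 health checks and failover routing policies so DNS flips automatically.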
AI Agent Failover
- Pause task dispatch in failed region
- Checkpoint all in-flight agent tasks
- Scale up DR agent consumer fleet
- Verify LLM API connectivity from DR
- Resume checkpointed tasks in DR
DR Testing Schedule
Regular testing ensures DR procedures actually work - for both human and AI workloads
Weekly
- Health check validation
- Backup verification
- Alert test (silent)
- Agent heartbeat audit
Monthly
- Restore test (non-prod)
- Runbook walkthrough
- On-call rotation test
- Agent checkpoint/resume test
Quarterly
- Partial failover drill
- Chaos engineering test
- Documentation review
- LLM provider failover test
Annually
- Full failover drill
- Multi-day DR simulation
- Third-party audit
- Full agent fleet failover test
Chaos Engineering for CDEs in 2026
With AI agents now running hundreds of concurrent workspaces, chaos testing is more important than ever. Modern chaos engineering for CDEs should include: killing random agent sandboxes mid-task, simulating LLM API outages, injecting network partitions between regions, throttling GPU availability, and testing checkpoint/resume under load. Tools like Litmus, Chaos Mesh, and Gremlin all support Kubernetes-native chaos experiments that map well to CDE infrastructure.
Key metric: Measure not just recovery time, but how many agent tasks were lost or had to restart from scratch. The goal is zero task loss with checkpoint-based recovery.
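As a concrete starting point, a Chaos Mesh experiment that kills a slice of agent sandboxes mid-task might look like this (namespace and labels are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-agent-sandboxes
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"                 # kill 10% of matching pods
  selector:
    namespaces:
      - agents-team-a
    labelSelectors:
      role: agent-sandbox
```

After each run, verify that every killed task resumed from its last checkpoint rather than restarting from scratch.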
Related Resources
Continue building resilient infrastructure
