Disaster Recovery for Cloud Development Environments

Comprehensive DR planning for CDEs: RTO/RPO targets, backup strategies, failover procedures, AI agent workspace recovery, and testing to ensure business continuity when disaster strikes.

Why DR Matters for Cloud Development Environments

When your CDE platform goes down, all developers and AI coding agents stop working simultaneously. Understanding the unique DR requirements for CDEs is critical.

Massive Blast Radius

Unlike local development where individual laptop failures affect one person, a CDE outage impacts every single developer in your organization simultaneously.

Impact Example

A 4-hour CDE outage at a company with 200 developers and 50 AI agent workspaces means 800+ lost developer hours plus halted autonomous coding tasks. At a loaded cost of $100/hour, that is $80,000 in direct productivity loss - plus reputation damage, missed deadlines, and customer impact.

Local Dev vs CDE DR

Local Development

  • Failures are isolated to individuals
  • Lost laptop affects 1 person
  • Recovery: Buy new laptop, restore from backup
  • Code in Git provides resilience

Cloud Development Environments

  • Platform failure affects everyone
  • Control plane down = 0 productivity for humans and AI agents
  • Recovery requires infrastructure rebuild
  • Workspace data needs backup strategy

CDE-Specific DR Considerations

Code in Git

Committed code is already resilient (stored in Git). Your primary concern is uncommitted work in workspace filesystems and the ability to recreate workspaces.
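
To shrink that exposure, some teams run a small job inside each workspace that pushes uncommitted changes to a per-user backup branch on a schedule. A minimal sketch, assuming the workspace clone lives at ~/project, the origin remote is writable, and a wip-backup/<user> branch naming convention is acceptable in your repos:

#!/bin/bash
# Push uncommitted work to a per-user backup branch.
# Run from cron inside the workspace, e.g. */30 * * * *
set -euo pipefail
cd "$HOME/project"                 # illustrative project path

BRANCH="wip-backup/$(whoami)"
if [[ -n "$(git status --porcelain)" ]]; then
  git add -A
  git commit --no-verify -m "WIP backup $(date -u +%FT%TZ)"
  git push --force origin "HEAD:refs/heads/${BRANCH}"
  git reset HEAD~1                 # undo the local commit, keep the files (note: unstages changes)
fi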

Control Plane is Critical

The control plane (user data, templates, configurations) is your single point of failure. Database backups and multi-region deployment are essential.

Developer Patience is Limited

Developers expect high availability. Outages longer than 30 minutes trigger fallback to local development and erode trust in the platform.

AI Agents Need Uptime Too

AI coding agents (Claude Code, GitHub Copilot, Cursor) increasingly run in dedicated CDE workspaces. When your platform is down, autonomous coding tasks, background refactors, and AI-assisted pipelines all stop.

RTO and RPO Targets for CDE Components

Define recovery objectives by component. Not everything needs the same level of protection.

RTO

Recovery Time Objective

Maximum acceptable downtime after a disaster. How long can developers be without their workspaces before business impact becomes severe?

Typical CDE RTO Targets:

  • Mission Critical: 15-30 minutes
  • Standard: 2-4 hours
  • Best Effort: 24 hours

RPO

Recovery Point Objective

Maximum acceptable data loss measured in time. How much uncommitted work can developers afford to lose?

Typical CDE RPO Targets:

  • Control Plane: 0-5 minutes
  • Workspace Data: 1-4 hours
  • Templates/Config: 0 (stored in Git)

Component | What It Contains | Recommended RTO | Recommended RPO | Backup Strategy
Control Plane Database | User accounts, workspace metadata, templates, RBAC policies | 15 min | 5 min | Continuous replication, automated snapshots every 5 min
Workspace Storage (Persistent) | Uncommitted code, build artifacts, local configs, databases | 1 hour | 4 hours | Volume snapshots every 4 hours, cross-region replication
Templates & Infrastructure Code | Terraform templates, DevContainer configs, base images | 10 min | 0 | Stored in Git (GitHub, GitLab), multi-region container registry
Container Images | Base images, prebuilds, workspace containers | 30 min | 0 | Multi-region registry replication (ECR, ACR, Artifact Registry)
Kubernetes Cluster State | Running workspaces, pods, configurations | 30 min | N/A | Ephemeral - recreate from templates. Use Velero to back up cluster resources if needed
AI Agent Workspaces | Running AI coding agents, task queues, agent context, in-progress refactors | 30 min | 1 hour | Task queue checkpointing, agent state snapshots, idempotent task design for safe restart
Secrets & Credentials | API keys, database passwords, SSH keys | 15 min | 0 | Multi-region secrets manager (Vault, AWS Secrets Manager, etc.)

DR Architecture Patterns

Choose the DR pattern that balances cost, complexity, and recovery objectives for your organization.

Active-Passive

Primary region with cold standby DR site

+ Lower cost (no active compute)
+ Simpler to manage
- RTO: 30-60 minutes
- Manual failover required
Best for: Cost-conscious teams, non-critical workloads

Active-Active (Recommended)

Traffic served from multiple regions simultaneously

+ RTO: Near-zero (automatic)
+ Best developer latency
~ Higher cost (2x compute)
~ Complex data synchronization
Best for: Mission-critical environments, global teams

Pilot Light

Minimal DR infrastructure always running

+ Database always replicated
+ Faster than cold standby
- RTO: 20-45 minutes
- Compute must scale up on failure
Best for: Balance of cost and recovery speed

Warm Standby

Scaled-down version running in DR region

+ Control plane already running
+ RTO: 10-20 minutes
~ Moderate cost (partial compute)
- Must scale to full capacity
Best for: Fast recovery with cost constraints

Active-Active Multi-Region CDE Architecture

                     Global Traffic Manager (Route53 Geo-Routing)
                                    /              \
                          (Latency-Based)      (Latency-Based)
                              /                      \
                    Region: US-East-1              Region: EU-West-1
                    -----------------              -----------------

    Load Balancer                          Load Balancer
         |                                      |
    +-----------+-----------+              +-----------+-----------+
    |           |           |              |           |           |
Coder      Coder      Coder            Coder      Coder      Coder
Pod 1      Pod 2      Pod 3            Pod 1      Pod 2      Pod 3
    |           |           |              |           |           |
    +--------+--+-----------+              +--------+--+-----------+
             |                                      |
    PostgreSQL (Primary)                  PostgreSQL (Replica)
    RDS Multi-AZ                          Read Replica + Async Replication
         |                                      |
         +------ Async Replication (Primary -> Replica) ------+
                    (5-10 second lag)

Developer Workspaces:                  Developer Workspaces:
- us-east-1a (AZ-1)                   - eu-west-1a (AZ-1)
- us-east-1b (AZ-2)                   - eu-west-1b (AZ-2)
- us-east-1c (AZ-3)                   - eu-west-1c (AZ-3)

Persistent Volumes:                    Persistent Volumes:
EBS snapshots -> S3                    EBS snapshots -> S3
  |                                      |
  +--- Cross-Region Replication (S3) ---+
       (RPO: 15 minutes)

Backup Strategies for CDE Components

Comprehensive backup strategy covering control plane, workspace data, and configuration.

Control Plane Database Backups

The control plane database contains user accounts, workspace metadata, templates, and RBAC policies. This is your most critical backup target.

Continuous Replication

# AWS RDS - Enable Multi-AZ + Read Replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier coder-db-replica-eu \
  --source-db-instance-identifier coder-db-us \
  --source-region us-east-1 \
  --region eu-west-1

-- PostgreSQL logical replication (run in psql)
-- On the primary:
CREATE PUBLICATION coder_pub FOR ALL TABLES;

-- On the replica:
CREATE SUBSCRIPTION coder_sub
  CONNECTION 'host=primary-db port=5432 dbname=coder'
  PUBLICATION coder_pub;

Automated Snapshots

# Enable automated backups
aws rds modify-db-instance \
  --db-instance-identifier coder-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Manual snapshot for testing
aws rds create-db-snapshot \
  --db-instance-identifier coder-db \
  --db-snapshot-identifier pre-upgrade-$(date +%Y%m%d)

# Copy snapshot to DR region
aws rds copy-db-snapshot \
  --source-region us-east-1 \
  --source-db-snapshot-identifier SNAPSHOT_ID \
  --target-db-snapshot-identifier SNAPSHOT_ID \
  --region eu-west-1

Best Practice

Test database restoration monthly. Restore to a test environment and verify workspace metadata integrity. Automated snapshots mean nothing if you have never tested restoration.
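
The restore drill itself can be scripted end to end. A sketch, assuming RDS automated snapshots and psql access to the restored copy; the instance names, hostname, and workspaces table are illustrative, and credentials come from PG* environment variables:

#!/bin/bash
# Monthly restore drill: restore the latest snapshot to a throwaway
# instance and run a basic integrity query against it.
set -euo pipefail

SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier coder-db \
  --query 'reverse(sort_by(DBSnapshots,&SnapshotCreateTime))[0].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-restore-test \
  --db-snapshot-identifier "$SNAPSHOT_ID"

aws rds wait db-instance-available \
  --db-instance-identifier coder-db-restore-test

# Sanity check the restored copy
psql "host=coder-db-restore-test.example.internal dbname=coder" \
  -c "SELECT count(*) FROM workspaces;"

# Tear down the test instance when done
aws rds delete-db-instance \
  --db-instance-identifier coder-db-restore-test \
  --skip-final-snapshot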

Workspace Persistent Volume Backups

Workspace persistent volumes contain uncommitted code, build artifacts, and local databases. Balance RPO targets against storage costs.

Volume Snapshots

# Kubernetes VolumeSnapshot (CSI)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: workspace-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: workspace-pvc

# Automated snapshot every 4 hours via CronJob
# (the job's ServiceAccount needs RBAC to create VolumeSnapshots)
kubectl create cronjob workspace-snapshots \
  --image=bitnami/kubectl \
  --schedule="0 */4 * * *" \
  -- /bin/sh -c 'kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: workspace-$(date +%Y%m%d-%H%M)
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: workspace-pvc
EOF'

Selective Backup with Velero

# Install Velero with S3 backend
velero install \
  --provider aws \
  --bucket cde-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Create backup schedule
velero schedule create daily-workspaces \
  --schedule="0 2 * * *" \
  --include-namespaces workspaces \
  --ttl 720h

# Backup critical workspace only
velero backup create important-workspace \
  --include-resources pvc \
  --selector workspace=prod-db-workspace

Cost Consideration

Backing up every workspace every 4 hours can be expensive. Consider tiered backup strategies: critical workspaces (4h RPO), standard (24h RPO), ephemeral (no backups - recreate from template).
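
Tiers map naturally onto separate Velero schedules keyed off a workspace label. A sketch, assuming workspaces carry a backup-tier label; the label key and retention values are illustrative:

# Critical workspaces: every 4 hours, retained 30 days
velero schedule create tier-critical \
  --schedule="0 */4 * * *" \
  --include-namespaces workspaces \
  --selector backup-tier=critical \
  --ttl 720h

# Standard workspaces: daily, retained 14 days
velero schedule create tier-standard \
  --schedule="0 2 * * *" \
  --include-namespaces workspaces \
  --selector backup-tier=standard \
  --ttl 336h

# Ephemeral workspaces: no schedule - recreate from templates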

Configuration and Template Backups

Templates, Terraform configurations, and DevContainer definitions should be version-controlled in Git. This provides zero RPO and easy restoration.

Git-Based Configuration

# Directory structure
cde-templates/
├── .git/
├── python-dev/
│   ├── main.tf
│   ├── devcontainer.json
│   └── README.md
├── java-spring/
│   ├── main.tf
│   └── docker-compose.yml
└── data-science/
    ├── main.tf
    └── requirements.txt

# Push templates to multi-region repos
git remote add origin-us \
  git@github.com:company/cde-templates.git
git remote add origin-eu \
  git@gitlab.com:platform/cde-templates.git

git push origin-us main
git push origin-eu main

Container Image Replication

# AWS ECR - Enable cross-region replication
aws ecr put-replication-configuration \
  --replication-configuration '{
    "rules": [{
      "destinations": [{
        "region": "eu-west-1",
        "registryId": "123456789012"
      }]
    }]
  }'

# Azure ACR - Geo-replication
az acr replication create \
  --registry cderegistry \
  --location westeurope

# Manual image copy
docker pull us.gcr.io/company/workspace:latest
docker tag us.gcr.io/company/workspace:latest \
  eu.gcr.io/company/workspace:latest
docker push eu.gcr.io/company/workspace:latest

Failover Procedures

Automated and manual failover strategies to minimize downtime during regional outages.

Automated Failover

Health checks detect failures and automatically redirect traffic to healthy regions without manual intervention.

DNS-Based Health Checks

# AWS Route53 Health Check
aws route53 create-health-check \
  --caller-reference "cde-health-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "203.0.113.1",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/api/v2/health",
    "FullyQualifiedDomainName": "cde.company.com",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'

# Associate with DNS record
# Route53 automatically fails over to
# secondary region when health check fails

Load Balancer Health Checks

# Kubernetes probes on control plane pods
livenessProbe:
  httpGet:
    path: /api/v2/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/v2/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

# Failed liveness probes restart the container;
# failed readiness probes remove the pod from the
# Service endpoints so it stops receiving traffic

Manual Failover

For active-passive architectures or when automated failover is not configured, manual procedures are required.

Failover Script

#!/bin/bash
# DR Failover Script
set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"

echo "Step 1: Promote DR database to primary"
aws rds promote-read-replica \
  --db-instance-identifier coder-db-dr \
  --region $DR_REGION

echo "Step 2: Update DNS to DR region"
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch file://dr-dns.json

echo "Step 3: Scale DR control plane"
kubectl scale deployment coder \
  --replicas=3 -n coder-system \
  --context dr-cluster

echo "Failover complete"
echo "Verify: curl https://cde.company.com/health"

Manual Failover Checklist

  • Confirm primary region is unrecoverable
  • Notify all stakeholders
  • Promote database replica
  • Update DNS records
  • Scale DR infrastructure
  • Verify developer access
  • Document incident timeline

DNS Failover Configuration

DNS-based failover uses health checks to automatically route traffic to healthy regions. This is the foundation of most DR strategies.

Route53 Failover Policy

{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "cde.company.com",
      "Type": "A",
      "SetIdentifier": "Primary-US-East",
      "Failover": "PRIMARY",
      "AliasTarget": {
        "HostedZoneId": "Z1234567890ABC",
        "DNSName": "primary-lb-123.us-east-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  },
  {
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "cde.company.com",
      "Type": "A",
      "SetIdentifier": "Secondary-EU-West",
      "Failover": "SECONDARY",
      "AliasTarget": {
        "HostedZoneId": "Z9876543210XYZ",
        "DNSName": "dr-lb-456.eu-west-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}

Health Check Configuration

{
  "Type": "HTTPS",
  "ResourcePath": "/api/v2/health",
  "FullyQualifiedDomainName": "cde.company.com",
  "Port": 443,
  "RequestInterval": 30,
  "FailureThreshold": 3,
  "MeasureLatency": true,
  "EnableSNI": true,
  "HealthThreshold": 3
}

# Health check evaluates:
# - HTTP 200 response
# - Response time < 2 seconds
# - 3 consecutive failures trigger failover
# - 3 consecutive successes restore primary

TTL Consideration: Set DNS TTL to 60 seconds or less. Longer TTLs delay failover as clients cache old records.
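
It is worth checking what TTL resolvers actually see for the CDE hostname before an incident, for example:

# Effective TTL is the second column of the answer
dig +noall +answer cde.company.com A

# Alias records inherit the target's TTL (60 seconds for ELB targets);
# plain A or CNAME records need the TTL set explicitly on the record set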

Testing Disaster Recovery Plans

Untested DR plans fail when you need them most. Regular testing validates procedures and builds muscle memory for your team.

Weekly

  • Verify backups completed
  • Test health check endpoints
  • Review monitoring dashboards
  • Validate replication lag
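
These weekly checks can be scripted and run from a scheduled CI job; a sketch, where the instance identifiers, health URLs, and thresholds are illustrative:

#!/bin/bash
# Weekly DR sanity checks - fail loudly so the scheduled job alerts
set -euo pipefail

# 1. Most recent automated snapshot timestamp
aws rds describe-db-snapshots \
  --db-instance-identifier coder-db --snapshot-type automated \
  --query 'max_by(DBSnapshots,&SnapshotCreateTime).SnapshotCreateTime' \
  --output text

# 2. Health endpoints in both regions return 200
for url in https://cde.company.com/api/v2/health \
           https://cde-dr.company.com/api/v2/health; do
  curl -fsS --max-time 5 "$url" > /dev/null && echo "OK  $url"
done

# 3. Replica lag over the last 10 minutes (CloudWatch ReplicaLag)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=coder-db-replica-eu \
  --statistics Maximum --period 600 \
  --start-time "$(date -u -d '10 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)"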

Monthly

  • Restore database to test environment
  • Walkthrough runbooks with team
  • Test workspace volume restore
  • Verify on-call escalation

Quarterly

  • Partial failover drill (non-prod)
  • Chaos engineering tests
  • Test AI agent task recovery
  • Update DR documentation
  • Review RTO/RPO targets

Annually

  • Full production failover drill
  • Multi-day DR simulation
  • Third-party DR audit
  • Executive tabletop exercise

Tabletop Exercises

Walk through DR scenarios without actually executing failover. Identify gaps in procedures and unclear responsibilities.

Sample Scenarios

Scenario 1: Regional Outage

"AWS us-east-1 has a complete outage affecting all services. Your CDE control plane is unreachable. Walk through the failover process."

Scenario 2: Database Corruption

"Your control plane database has been corrupted. Last known good backup is 6 hours old. How do you recover?"

Scenario 3: Ransomware

"Ransomware has encrypted all workspaces and control plane. Attackers demand payment. What is your response plan?"

Scenario 4: AI Agent Workspace Outage

"Your primary region is down. 30 AI coding agents were mid-task - refactoring, generating tests, and running migrations. How do you recover agent progress and restart tasks in the DR region?"

Exercise Format

1. Present Scenario (5 min)

Facilitator describes the disaster situation

2. Initial Response (15 min)

Team discusses who does what, referencing runbooks

3. Walk Through Steps (30 min)

Step-by-step discussion of DR procedures

4. Identify Gaps (10 min)

Document unclear procedures, missing tools, training needs

5. Action Items (10 min)

Assign improvements to team members with deadlines

Chaos Engineering for Disaster Recovery

Intentionally inject failures in non-production environments to validate DR automation and team response.

Control Plane Failure

# Kill control plane pods
kubectl delete pod -l app=coder \
  -n coder-system

# Verify:
# - Kubernetes restarts pods
# - Health checks detect failure
# - Alerting triggers
# - Workspaces unaffected

Database Failure Simulation

# Simulate DB connection loss: block egress from the
# control plane pods with a NetworkPolicy
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: {name: chaos-block-db, namespace: coder-system}
spec:
  podSelector: {matchLabels: {app: coder}}
  policyTypes: [Egress]
  egress: []
EOF

# Remove the policy to end the experiment
kubectl delete networkpolicy chaos-block-db -n coder-system

Regional Failure

# Update Route53 to fail primary
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde-test.company.com",
        "Type": "A",
        "SetIdentifier": "Primary",
        "Failover": "PRIMARY",
        "HealthCheckId": "FORCE_FAIL",
        ...
      }
    }]
  }'

# Verify automatic failover to DR

Safety First

Always run chaos experiments in non-production environments first. Schedule during low-traffic periods. Have rollback procedures ready. Notify stakeholders before testing.

Annual Game Day: Full DR Drill

Once per year, execute a complete failover to your DR region during a scheduled maintenance window. This is the ultimate test of your DR preparedness.

Planning Phase (T-4 weeks)

  • Schedule 4-hour maintenance window
  • Notify all developers 2 weeks in advance
  • Create detailed runbook with timings
  • Assign roles (commander, scribe, ops, comms)
  • Set up Zoom war room for coordination
  • Verify on-call contacts

Execution Phase (Game Day)

  • T-0:00 - Declare maintenance start, post status page update
  • T+0:10 - Failover to DR region (manual or automated)
  • T+0:30 - Verify control plane health, test workspace creation
  • T+1:00 - All developers validate workspace access
  • T+1:30 - Verify AI agent workspaces resume queued tasks
  • T+2:00 - Developers work in DR region for 1 hour
  • T+3:00 - Failback to primary region
  • T+3:30 - Verify primary, all-clear announcement
  • T+4:00 - Post-mortem meeting

Success Criteria

Failover RTO

Control plane available in DR region within 30 minutes

Developer Access

95%+ developers can create workspaces in DR region

Data Integrity

Zero data loss, all user accounts and templates intact
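
If Coder is the platform in use, the criteria above can be checked with a short smoke test during the drill. A sketch; the template name is a placeholder and the CLI flags should be confirmed against your installed version:

#!/bin/bash
# Game-day smoke test against the DR region
set -euo pipefail

WS="dr-drill-$(date +%Y%m%d%H%M)"

coder login https://cde.company.com            # DNS should now point at DR
coder create "$WS" --template python-dev --yes
curl -fsS https://cde.company.com/api/v2/health > /dev/null
echo "Smoke test passed: workspace $WS created in DR region"
coder delete "$WS" --yes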

Recovery Runbooks

Step-by-step procedures for common disaster scenarios. Keep these updated and accessible offline.

Complete Regional Outage

SEV-1 | RTO: 30 min
  • Confirm primary region unavailable (AWS status, health checks)
  • Page incident commander and DR team
  • Execute failover script or manual DNS update
  • Promote DR database to primary
  • Scale DR control plane to full capacity
  • Post status page update
  • Verify developer access, monitor error rates

Database Corruption

SEV-1 | RTO: 45 min
  • Stop all writes to database immediately
  • Assess corruption scope (full DB or specific tables)
  • Identify most recent clean snapshot
  • Restore database from snapshot to new instance
  • Verify data integrity with test queries
  • Update control plane connection string
  • Communicate data loss window to users
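
The restore and reconnect steps can be scripted ahead of time. A sketch, reusing the SNAPSHOT_ID placeholder convention from earlier; the restored hostname and Coder deployment details are illustrative:

# Restore the last known-good snapshot to a fresh instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-restored \
  --db-snapshot-identifier SNAPSHOT_ID

aws rds wait db-instance-available \
  --db-instance-identifier coder-db-restored

# Point the control plane at the restored instance
kubectl set env deployment/coder -n coder-system \
  CODER_PG_CONNECTION_URL="postgres://coder:PASSWORD@coder-db-restored.example.internal:5432/coder"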

Workspace Volume Loss

SEV-3 | RTO: 2 hours
  • Identify affected workspace and user
  • Locate most recent volume snapshot
  • Create new PV from snapshot
  • Recreate workspace pointing to restored volume
  • Notify user of data loss window
  • Document incident for postmortem
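
The "create new PV from snapshot" step uses the CSI snapshot dataSource. A sketch, assuming the workspace-snapshot VolumeSnapshot from earlier exists; adjust the storage class and size to match the original volume:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-pvc-restored
  namespace: workspaces
spec:
  storageClassName: gp3
  dataSource:
    name: workspace-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
EOF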

AI Agent Task Interruption

SEV-3 | RTO: 1 hour
  • Identify affected agent workspaces and in-flight tasks
  • Check task queue for pending and in-progress items
  • Review Git branches for checkpointed progress
  • Spin up agent workspaces in DR region
  • Verify LLM API keys and tokens are valid in DR
  • Re-queue interrupted tasks, mark as retry
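
If the task queue is PostgreSQL-backed, re-queuing interrupted work can be a single update. A sketch against a hypothetical agent_tasks table; the connection URL and column names are illustrative:

# Mark tasks that were in flight when the region failed for retry
psql "$TASK_QUEUE_DB_URL" <<'SQL'
UPDATE agent_tasks
SET    status      = 'pending',
       retry_count = retry_count + 1,
       claimed_by  = NULL
WHERE  status = 'in_progress'
  AND  updated_at < now() - interval '10 minutes';
SQL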

Ransomware Attack

SEV-1 | RTO: 4 hours
  • Isolate affected systems immediately (network isolation)
  • Engage security team and legal counsel
  • Do NOT pay ransom - restore from backups
  • Verify backup integrity (pre-infection)
  • Rebuild infrastructure from clean images
  • Restore data from known-clean backups
  • Conduct forensic analysis of attack vector

Offline Runbook Access

During a disaster, your documentation platform may be unavailable. Keep offline copies of runbooks:

PDF Backups

Print to PDF quarterly, store in S3 + local drives
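
Generating those PDFs can be a small scheduled job; a sketch, assuming runbooks live as Markdown in a Git repo and pandoc with a PDF engine is available on the build host:

# Render runbooks to PDF and ship them to a DR bucket
mkdir -p offline
for f in runbooks/*.md; do
  pandoc "$f" -o "offline/$(basename "${f%.md}").pdf"
done
aws s3 cp offline/ "s3://cde-dr-runbooks/$(date +%Y-%m)/" --recursive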

Mobile Access

On-call engineers save runbooks to phone notes app

Physical Copies

Laminated runbooks in incident response binder

Cloud Provider DR Considerations

Each cloud provider offers different DR capabilities. Choose the right services for your architecture.

AWS

Multi-AZ vs Multi-Region

Multi-AZ: RDS, EKS across availability zones. Protects against datacenter failures. RTO: minutes.

Multi-Region: Route53 failover, cross-region RDS replicas, S3 replication. Protects against regional failures. RTO: 15-60 min.

Key Services

  • Route53 health checks + failover
  • RDS automated backups + read replicas
  • S3 cross-region replication
  • EBS snapshot copy to DR region
  • AWS Backup for centralized management

Azure

Availability Zones vs Geo-Replication

Availability Zones: AKS zone redundancy, zone-redundant storage. Protects against datacenter failures.

Geo-Replication: Traffic Manager, Azure SQL geo-replication, storage account replication. Regional protection.

Key Services

  • Traffic Manager for DNS failover
  • Azure SQL active geo-replication
  • Storage account GRS/GZRS replication
  • Azure Site Recovery for VM replication
  • Backup vaults with cross-region restore

GCP

Zonal vs Regional vs Multi-Regional

Regional: GKE regional clusters, Cloud SQL HA across zones. Zone failure protection.

Multi-Regional: Cloud SQL cross-region replicas, multi-region storage. Regional failure protection.

Key Services

  • Cloud DNS with health checks
  • Cloud SQL cross-region read replicas
  • GCS multi-region or dual-region buckets
  • Persistent disk snapshots to any region
  • Regional GKE clusters (3+ zones)

Cross-Cloud DR Strategy

For ultimate resilience, some organizations deploy across multiple cloud providers. This eliminates vendor lock-in and protects against cloud-wide outages.

Benefits

  • Zero vendor lock-in
  • Protection from cloud-wide outages
  • Negotiating leverage on pricing

Challenges

  • Significantly higher operational complexity
  • Data synchronization across clouds
  • Requires cloud-agnostic tooling (Terraform, Kubernetes, DevPod)

Disaster Recovery for AI Agent Workspaces

AI coding agents like Claude Code, GitHub Copilot, and Cursor now run in dedicated CDE workspaces. Their DR requirements differ from interactive developer workspaces in important ways.

Why AI Agent DR is Different

AI agents run long-lived, autonomous tasks - multi-file refactors, test generation, code reviews, and background migrations. Unlike human developers who can context-switch during an outage, an interrupted agent loses its in-flight work and reasoning context.

Long-running tasks: An agent refactoring 50 files across a monorepo may run for 30+ minutes. A mid-task failure wastes all progress unless checkpointed.

Stateful context: Agents build up conversation and codebase context. Losing this mid-task means restarting from scratch, not resuming.

No human fallback: A developer can switch to a local machine. An AI agent workspace has no manual fallback - the platform is the only option.

DR Checklist for AI Agents

Design tasks to be idempotent

Agents should be able to restart a task safely without duplicating work or corrupting state

Checkpoint agent progress to Git

Configure agents to commit work-in-progress to feature branches at regular intervals

Use a task queue with persistence

Queue pending agent tasks in a durable store (Redis with AOF, PostgreSQL) so nothing is lost on restart
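
A PostgreSQL-backed queue in the replicated control-plane database is often enough. A minimal sketch with an illustrative agent_tasks schema and a concurrency-safe worker claim query:

psql "$TASK_QUEUE_DB_URL" <<'SQL'
-- Durable agent task queue (illustrative schema)
CREATE TABLE IF NOT EXISTS agent_tasks (
  id          bigserial PRIMARY KEY,
  task_type   text        NOT NULL,            -- e.g. 'refactor', 'generate-tests'
  payload     jsonb       NOT NULL,
  status      text        NOT NULL DEFAULT 'pending',
  claimed_by  text,
  retry_count int         NOT NULL DEFAULT 0,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Worker claim: safe under concurrency, and pending work survives a
-- failover because the table lives in the replicated database
UPDATE agent_tasks
SET    status = 'in_progress', claimed_by = 'agent-worker-1', updated_at = now()
WHERE  id = (
  SELECT id FROM agent_tasks
  WHERE  status = 'pending'
  ORDER  BY id
  FOR UPDATE SKIP LOCKED
  LIMIT  1
)
RETURNING id, task_type, payload;
SQL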

Separate agent workspaces from developer workspaces

Run AI agents in their own namespace so a noisy agent does not impact human developers during recovery

Monitor API key and token validity

AI agents depend on external API keys (LLM providers, Git). Ensure DR secrets are current and not expired

Include agents in DR drills

Verify AI agent workspaces spin up in the DR region, connect to LLM providers, and resume queued tasks

AI Agent DR Architecture Patterns

Checkpoint-and-Resume

Agents commit partial work to Git branches at logical breakpoints. On recovery, a new agent workspace picks up the branch and continues.

Best for: Large refactoring tasks, multi-file migrations

Idempotent Replay

Design agent tasks so they can be safely re-executed from the beginning. The agent checks what is already done and skips completed steps.

Best for: Test generation, linting, documentation tasks
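
As a concrete illustration of idempotent replay for a test-generation task (generate_tests stands in for whatever the agent actually invokes):

#!/bin/bash
# Idempotent test-generation pass: safe to re-run from the top after a
# failover because completed work is detected and skipped.
set -euo pipefail

for src in $(git ls-files '*.py'); do
  test_file="tests/test_$(basename "$src")"
  [[ -f "$test_file" ]] && continue          # already generated - skip on replay
  generate_tests "$src" > "$test_file"       # hypothetical agent invocation
  git add "$test_file"
done
git commit -m "agent: generate missing tests" || true   # no-op if nothing new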

Queue-Based Failover

Agent tasks sit in a durable queue (backed by a replicated database). DR region workers pick up pending tasks automatically on failover.

Best for: CI/CD agent tasks, scheduled background jobs

CDE Platform Considerations for AI Agent DR

Leading CDE platforms handle AI agent workspaces differently. Understand how your platform of choice supports agent-specific DR requirements.

Coder

Supports headless agent workspaces via Terraform templates. Agent tasks can be managed through the Coder API. Use workspace prebuilds to speed up agent recovery in DR.

Ona (formerly Gitpod)

Ona workspaces can be pre-configured for AI agents using automation APIs. Ensure your DR plan accounts for Ona workspace timeouts and auto-stop policies that may affect long-running agent tasks.

GitHub Codespaces

Codespaces integrates tightly with GitHub Actions for agent-driven workflows. DR depends on GitHub's own availability - consider a self-hosted CDE as a backup plane.

DevPod

DevPod's provider-agnostic model makes cross-cloud DR straightforward. Agent workspaces can failover between AWS, Azure, or GCP providers with the same devcontainer config.

Best Practice for 2026

Treat AI agent workspaces as infrastructure, not developer tools. They should have their own resource quotas, monitoring dashboards, and DR runbooks separate from interactive developer workspaces. When the platform recovers, prioritize spinning up developer workspaces first, then resume agent tasks from their checkpoint or queue.

Frequently Asked Questions

What happens to running workspaces during a regional failover?

Running workspaces in the failed region are lost. Developers must recreate workspaces in the DR region. This is why RPO matters - uncommitted work since the last backup is lost. Encourage developers to commit and push frequently. In active-active architectures, workspaces in healthy regions continue running unaffected.

How much does multi-region DR cost?

Costs vary by pattern. Active-passive adds 10-20% (database replication, snapshots, minimal compute). Warm standby adds 30-50% (partial DR infrastructure). Active-active doubles infrastructure costs but provides best availability. Calculate ROI against cost of downtime. For 200 developers at $100/hour, even 1 hour of downtime costs $20,000.

Should we back up every workspace or just critical ones?

Implement tiered backup strategies. Critical workspaces (production databases, shared services): 4-hour RPO with regular snapshots. Standard development: 24-hour RPO with daily snapshots. Ephemeral workspaces (tutorial environments, testing): no backups - recreate from templates. AI agent workspaces fall into their own tier - back up the task queue and checkpoint state, but the workspace itself can be ephemeral.

What if developers are working when we need to test DR failover?

Schedule annual full DR drills during a maintenance window with 2+ weeks notice. For quarterly drills, use a dedicated test CDE environment or test during off-hours (weekends, late evening). For continuous testing, use chaos engineering on non-production environments. Never surprise developers with production failover testing during business hours.

How do AI coding agents affect our DR strategy?

AI agents like Claude Code and GitHub Copilot run long-lived tasks in CDE workspaces that cannot simply resume after a restart. Design agent tasks to be idempotent or checkpoint progress to Git branches. Use a persistent task queue so pending work survives a failover. Separate agent workspaces into their own Kubernetes namespace with dedicated resource quotas, and prioritize developer workspace recovery over agent task resumption.

Should AI agent workspaces run in the DR region during normal operations?

In an active-active architecture, yes - distributing agent workloads across regions provides natural resilience and reduces blast radius. In active-passive setups, keep agents in the primary region but ensure the task queue is replicated to DR. On failover, the DR region can spin up agent workspaces and drain the queue. This is simpler than live-migrating in-progress agent tasks.