Disaster Recovery for Cloud Development Environments

Comprehensive DR planning for CDEs: RTO/RPO targets, backup strategies, failover procedures, AI agent workspace recovery, and testing to ensure business continuity when disaster strikes.

Why DR Matters for Cloud Development Environments

When your CDE platform goes down, all developers and AI coding agents stop working simultaneously. Understanding the unique DR requirements for CDEs is critical.

Massive Blast Radius

Unlike local development where individual laptop failures affect one person, a CDE outage impacts every single developer in your organization simultaneously.

Impact Example

A 4-hour CDE outage at a company with 200 developers and 50 AI agent workspaces means 800+ lost developer hours plus halted autonomous coding tasks. At a loaded cost of $100/hour, that is $80,000 in direct productivity loss - plus reputation damage, missed deadlines, and customer impact.

Local Dev vs CDE DR

Local Development

  • Failures are isolated to individuals
  • Lost laptop affects 1 person
  • Recovery: Buy new laptop, restore from backup
  • Code in Git provides resilience

Cloud Development Environments

  • Platform failure affects everyone
  • Control plane down = 0 productivity for humans and AI agents
  • Recovery requires infrastructure rebuild
  • Workspace data needs backup strategy

CDE-Specific DR Considerations

Code in Git

Committed code is already resilient (stored in Git). Your primary concern is uncommitted work in workspace filesystems and the ability to recreate workspaces.
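
To shrink that exposure, some teams run a small job inside each workspace that pushes uncommitted changes to a per-user backup branch on a schedule. A minimal sketch, assuming the workspace clone lives at ~/project, the origin remote is writable, and a wip-backup/<user> branch naming convention is acceptable in your repos:

#!/bin/bash
# Push uncommitted work to a per-user backup branch.
# Run from cron inside the workspace, e.g. */30 * * * *
set -euo pipefail
cd "$HOME/project"                 # illustrative project path

BRANCH="wip-backup/$(whoami)"
if [[ -n "$(git status --porcelain)" ]]; then
  git add -A
  git commit --no-verify -m "WIP backup $(date -u +%FT%TZ)"
  git push --force origin "HEAD:refs/heads/${BRANCH}"
  git reset HEAD~1                 # undo the local commit, keep the files (note: unstages changes)
fi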

Control Plane is Critical

The control plane (user data, templates, configurations) is your single point of failure. Database backups and multi-region deployment are essential.

Developer Patience is Limited

Developers expect high availability. Outages longer than 30 minutes trigger fallback to local development and erode trust in the platform.

AI Agents Need Uptime Too

AI coding agents (Claude Code, GitHub Copilot, Cursor) increasingly run in dedicated CDE workspaces. When your platform is down, autonomous coding tasks, background refactors, and AI-assisted pipelines all stop.

RTO and RPO Targets for CDE Components

Define recovery objectives by component. Not everything needs the same level of protection.

RTO

Recovery Time Objective

Maximum acceptable downtime after a disaster. How long can developers be without their workspaces before business impact becomes severe?

Typical CDE RTO Targets:

  • Mission Critical: 15-30 minutes
  • Standard: 2-4 hours
  • Best Effort: 24 hours

RPO

Recovery Point Objective

Maximum acceptable data loss measured in time. How much uncommitted work can developers afford to lose?

Typical CDE RPO Targets:

  • Control Plane: 0-5 minutes
  • Workspace Data: 1-4 hours
  • Templates/Config: 0 (stored in Git)

Component | What It Contains | Recommended RTO | Recommended RPO | Backup Strategy
Control Plane Database | User accounts, workspace metadata, templates, RBAC policies | 15 min | 5 min | Continuous replication, automated snapshots every 5 min
Workspace Storage (Persistent) | Uncommitted code, build artifacts, local configs, databases | 1 hour | 4 hours | Volume snapshots every 4 hours, cross-region replication
Templates & Infrastructure Code | Terraform templates, DevContainer configs, base images | 10 min | 0 | Stored in Git (GitHub, GitLab), multi-region container registry
Container Images | Base images, prebuilds, workspace containers | 30 min | 0 | Multi-region registry replication (ECR, ACR, Artifact Registry)
Kubernetes Cluster State | Running workspaces, pods, configurations | 30 min | N/A | Ephemeral - recreate from templates. Use Velero to back up cluster resources if needed
AI Agent Workspaces | Running AI coding agents, task queues, agent context, in-progress refactors | 30 min | 1 hour | Task queue checkpointing, agent state snapshots, idempotent task design for safe restart
Secrets & Credentials | API keys, database passwords, SSH keys | 15 min | 0 | Multi-region secrets manager (Vault, AWS Secrets Manager, etc.)

DR Architecture Patterns

Choose the DR pattern that balances cost, complexity, and recovery objectives for your organization.

Active-Passive

Primary region with cold standby DR site

+ Lower cost (no active compute)
+ Simpler to manage
- RTO: 30-60 minutes
- Manual failover required
Best for: Cost-conscious teams, non-critical workloads

Active-Active (Recommended)

Traffic served from multiple regions simultaneously

+ RTO: Near-zero (automatic)
+ Best developer latency
~ Higher cost (2x compute)
~ Complex data synchronization
Best for: Mission-critical environments, global teams

Pilot Light

Minimal DR infrastructure always running

+ Database always replicated
+ Faster than cold standby
- RTO: 20-45 minutes
- Compute must scale up on failure
Best for: Balance of cost and recovery speed

Warm Standby

Scaled-down version running in DR region

+ Control plane already running
+ RTO: 10-20 minutes
~ Moderate cost (partial compute)
- Must scale to full capacity
Best for: Fast recovery with cost constraints

Active-Active Multi-Region CDE Architecture

                     Global Traffic Manager (Route53 Geo-Routing)
                                    /              \
                          (Latency-Based)      (Latency-Based)
                              /                      \
                    Region: US-East-1              Region: EU-West-1
                    -----------------              -----------------

    Load Balancer                          Load Balancer
         |                                      |
    +-----------+-----------+              +-----------+-----------+
    |           |           |              |           |           |
Coder      Coder      Coder            Coder      Coder      Coder
Pod 1      Pod 2      Pod 3            Pod 1      Pod 2      Pod 3
    |           |           |              |           |           |
    +--------+--+-----------+              +--------+--+-----------+
             |                                      |
    PostgreSQL (Primary)                  PostgreSQL (Replica)
    RDS Multi-AZ                          Read Replica + Async Replication
         |                                      |
         +------ Async Replication (Primary -> Replica) ------+
                    (5-10 second lag)

Developer Workspaces:                  Developer Workspaces:
- us-east-1a (AZ-1)                   - eu-west-1a (AZ-1)
- us-east-1b (AZ-2)                   - eu-west-1b (AZ-2)
- us-east-1c (AZ-3)                   - eu-west-1c (AZ-3)

Persistent Volumes:                    Persistent Volumes:
EBS snapshots -> S3                    EBS snapshots -> S3
  |                                      |
  +--- Cross-Region Replication (S3) ---+
       (RPO: 15 minutes)

Backup Strategies for CDE Components

Comprehensive backup strategy covering control plane, workspace data, and configuration.

Control Plane Database Backups

The control plane database contains user accounts, workspace metadata, templates, and RBAC policies. This is your most critical backup target.

Continuous Replication

# AWS RDS - Enable Multi-AZ + Read Replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier coder-db-replica-eu \
  --source-db-instance-identifier coder-db-us \
  --source-region us-east-1 \
  --region eu-west-1

-- PostgreSQL logical replication (run in psql)
-- On the primary:
CREATE PUBLICATION coder_pub FOR ALL TABLES;

-- On the replica:
CREATE SUBSCRIPTION coder_sub
  CONNECTION 'host=primary-db port=5432 dbname=coder'
  PUBLICATION coder_pub;

Automated Snapshots

# Enable automated backups
aws rds modify-db-instance \
  --db-instance-identifier coder-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Manual snapshot for testing
aws rds create-db-snapshot \
  --db-instance-identifier coder-db \
  --db-snapshot-identifier pre-upgrade-$(date +%Y%m%d)

# Copy snapshot to DR region
aws rds copy-db-snapshot \
  --source-region us-east-1 \
  --source-db-snapshot-identifier SNAPSHOT_ID \
  --target-db-snapshot-identifier SNAPSHOT_ID \
  --region eu-west-1

Best Practice

Test database restoration monthly. Restore to a test environment and verify workspace metadata integrity. Automated snapshots mean nothing if you have never tested restoration.
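
The restore drill itself can be scripted end to end. A sketch, assuming RDS automated snapshots and psql access to the restored copy; the instance names, hostname, and workspaces table are illustrative, and credentials come from PG* environment variables:

#!/bin/bash
# Monthly restore drill: restore the latest snapshot to a throwaway
# instance and run a basic integrity query against it.
set -euo pipefail

SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier coder-db \
  --query 'reverse(sort_by(DBSnapshots,&SnapshotCreateTime))[0].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-restore-test \
  --db-snapshot-identifier "$SNAPSHOT_ID"

aws rds wait db-instance-available \
  --db-instance-identifier coder-db-restore-test

# Sanity check the restored copy
psql "host=coder-db-restore-test.example.internal dbname=coder" \
  -c "SELECT count(*) FROM workspaces;"

# Tear down the test instance when done
aws rds delete-db-instance \
  --db-instance-identifier coder-db-restore-test \
  --skip-final-snapshot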

Workspace Persistent Volume Backups

Workspace persistent volumes contain uncommitted code, build artifacts, and local databases. Balance RPO targets against storage costs.

Volume Snapshots

# Kubernetes VolumeSnapshot (CSI)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: workspace-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: workspace-pvc

# Automated snapshot every 4 hours via CronJob
# (the job's ServiceAccount needs RBAC to create VolumeSnapshots)
kubectl create cronjob workspace-snapshots \
  --image=bitnami/kubectl \
  --schedule="0 */4 * * *" \
  -- /bin/sh -c 'kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: workspace-$(date +%Y%m%d-%H%M)
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: workspace-pvc
EOF'

Selective Backup with Velero

# Install Velero with S3 backend
velero install \
  --provider aws \
  --bucket cde-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Create backup schedule
velero schedule create daily-workspaces \
  --schedule="0 2 * * *" \
  --include-namespaces workspaces \
  --ttl 720h

# Backup critical workspace only
velero backup create important-workspace \
  --include-resources pvc \
  --selector workspace=prod-db-workspace

Cost Consideration

Backing up every workspace every 4 hours can be expensive. Consider tiered backup strategies: critical workspaces (4h RPO), standard (24h RPO), ephemeral (no backups - recreate from template).
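
Tiers map naturally onto separate Velero schedules keyed off a workspace label. A sketch, assuming workspaces carry a backup-tier label; the label key and retention values are illustrative:

# Critical workspaces: every 4 hours, retained 30 days
velero schedule create tier-critical \
  --schedule="0 */4 * * *" \
  --include-namespaces workspaces \
  --selector backup-tier=critical \
  --ttl 720h

# Standard workspaces: daily, retained 14 days
velero schedule create tier-standard \
  --schedule="0 2 * * *" \
  --include-namespaces workspaces \
  --selector backup-tier=standard \
  --ttl 336h

# Ephemeral workspaces: no schedule - recreate from templates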

Configuration and Template Backups

Templates, Terraform configurations, and DevContainer definitions should be version-controlled in Git. This provides zero RPO and easy restoration.

Git-Based Configuration

# Directory structure
cde-templates/
├── .git/
├── python-dev/
│   ├── main.tf
│   ├── devcontainer.json
│   └── README.md
├── java-spring/
│   ├── main.tf
│   └── docker-compose.yml
└── data-science/
    ├── main.tf
    └── requirements.txt

# Push templates to multi-region repos
git remote add origin-us \
  git@github.com:company/cde-templates.git
git remote add origin-eu \
  git@gitlab.com:platform/cde-templates.git

git push origin-us main
git push origin-eu main

Container Image Replication

# AWS ECR - Enable cross-region replication
aws ecr put-replication-configuration \
  --replication-configuration '{
    "rules": [{
      "destinations": [{
        "region": "eu-west-1",
        "registryId": "123456789012"
      }]
    }]
  }'

# Azure ACR - Geo-replication
az acr replication create \
  --registry cderegistry \
  --location westeurope

# Manual image copy
docker pull us.gcr.io/company/workspace:latest
docker tag us.gcr.io/company/workspace:latest \
  eu.gcr.io/company/workspace:latest
docker push eu.gcr.io/company/workspace:latest

Failover Procedures

Automated and manual failover strategies to minimize downtime during regional outages.

Automated Failover

Health checks detect failures and automatically redirect traffic to healthy regions without manual intervention.

DNS-Based Health Checks

# AWS Route53 Health Check
aws route53 create-health-check \
  --caller-reference "cde-health-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "203.0.113.1",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/api/v2/health",
    "FullyQualifiedDomainName": "cde.company.com",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'

# Associate with DNS record
# Route53 automatically fails over to
# secondary region when health check fails

Load Balancer Health Checks

# Kubernetes probes on control plane pods
livenessProbe:
  httpGet:
    path: /api/v2/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/v2/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

# Failed liveness probes restart the container;
# failed readiness probes remove the pod from the
# Service endpoints so it stops receiving traffic

Manual Failover

For active-passive architectures or when automated failover is not configured, manual procedures are required.

Failover Script

#!/bin/bash
# DR Failover Script
set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"

echo "Step 1: Promote DR database to primary"
aws rds promote-read-replica \
  --db-instance-identifier coder-db-dr \
  --region $DR_REGION

echo "Step 2: Update DNS to DR region"
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch file://dr-dns.json

echo "Step 3: Scale DR control plane"
kubectl scale deployment coder \
  --replicas=3 -n coder-system \
  --context dr-cluster

echo "Failover complete"
echo "Verify: curl https://cde.company.com/health"

Manual Failover Checklist

  • Confirm primary region is unrecoverable
  • Notify all stakeholders
  • Promote database replica
  • Update DNS records
  • Scale DR infrastructure
  • Verify developer access
  • Document incident timeline

DNS Failover Configuration

DNS-based failover uses health checks to automatically route traffic to healthy regions. This is the foundation of most DR strategies.

Route53 Failover Policy

{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "cde.company.com",
      "Type": "A",
      "SetIdentifier": "Primary-US-East",
      "Failover": "PRIMARY",
      "AliasTarget": {
        "HostedZoneId": "Z1234567890ABC",
        "DNSName": "primary-lb-123.us-east-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  },
  {
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "cde.company.com",
      "Type": "A",
      "SetIdentifier": "Secondary-EU-West",
      "Failover": "SECONDARY",
      "AliasTarget": {
        "HostedZoneId": "Z9876543210XYZ",
        "DNSName": "dr-lb-456.eu-west-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}

Health Check Configuration

{
  "Type": "HTTPS",
  "ResourcePath": "/api/v2/health",
  "FullyQualifiedDomainName": "cde.company.com",
  "Port": 443,
  "RequestInterval": 30,
  "FailureThreshold": 3,
  "MeasureLatency": true,
  "EnableSNI": true,
  "HealthThreshold": 3
}

# Health check evaluates:
# - HTTP 200 response
# - Response time < 2 seconds
# - 3 consecutive failures trigger failover
# - 3 consecutive successes restore primary

TTL Consideration: Set DNS TTL to 60 seconds or less. Longer TTLs delay failover as clients cache old records.
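
It is worth checking what TTL resolvers actually see for the CDE hostname before an incident, for example:

# Effective TTL is the second column of the answer
dig +noall +answer cde.company.com A

# Alias records inherit the target's TTL (60 seconds for ELB targets);
# plain A or CNAME records need the TTL set explicitly on the record set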

Testing Disaster Recovery Plans

Untested DR plans fail when you need them most. Regular testing validates procedures and builds muscle memory for your team.

Weekly

  • Verify backups completed
  • Test health check endpoints
  • Review monitoring dashboards
  • Validate replication lag
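
These weekly checks can be scripted and run from a scheduled CI job; a sketch, where the instance identifiers, health URLs, and thresholds are illustrative:

#!/bin/bash
# Weekly DR sanity checks - fail loudly so the scheduled job alerts
set -euo pipefail

# 1. Most recent automated snapshot timestamp
aws rds describe-db-snapshots \
  --db-instance-identifier coder-db --snapshot-type automated \
  --query 'max_by(DBSnapshots,&SnapshotCreateTime).SnapshotCreateTime' \
  --output text

# 2. Health endpoints in both regions return 200
for url in https://cde.company.com/api/v2/health \
           https://cde-dr.company.com/api/v2/health; do
  curl -fsS --max-time 5 "$url" > /dev/null && echo "OK  $url"
done

# 3. Replica lag over the last 10 minutes (CloudWatch ReplicaLag)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=coder-db-replica-eu \
  --statistics Maximum --period 600 \
  --start-time "$(date -u -d '10 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)"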

Monthly

  • Restore database to test environment
  • Walkthrough runbooks with team
  • Test workspace volume restore
  • Verify on-call escalation

Quarterly

  • Partial failover drill (non-prod)
  • Chaos engineering tests
  • Test AI agent task recovery
  • Update DR documentation
  • Review RTO/RPO targets

Annually

  • Full production failover drill
  • Multi-day DR simulation
  • Third-party DR audit
  • Executive tabletop exercise

Tabletop Exercises

Walk through DR scenarios without actually executing failover. Identify gaps in procedures and unclear responsibilities.

Sample Scenarios

Scenario 1: Regional Outage

"AWS us-east-1 has a complete outage affecting all services. Your CDE control plane is unreachable. Walk through the failover process."

Scenario 2: Database Corruption

"Your control plane database has been corrupted. Last known good backup is 6 hours old. How do you recover?"

Scenario 3: Ransomware

"Ransomware has encrypted all workspaces and control plane. Attackers demand payment. What is your response plan?"

Scenario 4: AI Agent Workspace Outage

"Your primary region is down. 30 AI coding agents were mid-task - refactoring, generating tests, and running migrations. How do you recover agent progress and restart tasks in the DR region?"

Exercise Format

1. Present Scenario (5 min)

Facilitator describes the disaster situation

2. Initial Response (15 min)

Team discusses who does what, referencing runbooks

3. Walk Through Steps (30 min)

Step-by-step discussion of DR procedures

4. Identify Gaps (10 min)

Document unclear procedures, missing tools, training needs

5. Action Items (10 min)

Assign improvements to team members with deadlines

Chaos Engineering for Disaster Recovery

Intentionally inject failures in non-production environments to validate DR automation and team response.

Control Plane Failure

# Kill control plane pods
kubectl delete pod -l app=coder \
  -n coder-system

# Verify:
# - Kubernetes restarts pods
# - Health checks detect failure
# - Alerting triggers
# - Workspaces unaffected

Database Failure Simulation

# Simulate DB connection loss: block egress from the
# control plane pods with a NetworkPolicy
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: {name: chaos-block-db, namespace: coder-system}
spec:
  podSelector: {matchLabels: {app: coder}}
  policyTypes: [Egress]
  egress: []
EOF

# Remove the policy to end the experiment
kubectl delete networkpolicy chaos-block-db -n coder-system

Regional Failure

# Update Route53 to fail primary
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cde-test.company.com",
        "Type": "A",
        "SetIdentifier": "Primary",
        "Failover": "PRIMARY",
        "HealthCheckId": "FORCE_FAIL",
        ...
      }
    }]
  }'

# Verify automatic failover to DR

Safety First

Always run chaos experiments in non-production environments first. Schedule during low-traffic periods. Have rollback procedures ready. Notify stakeholders before testing.

Annual Game Day: Full DR Drill

Once per year, execute a complete failover to your DR region during a scheduled maintenance window. This is the ultimate test of your DR preparedness.

Planning Phase (T-4 weeks)

  • Schedule 4-hour maintenance window
  • Notify all developers 2 weeks in advance
  • Create detailed runbook with timings
  • Assign roles (commander, scribe, ops, comms)
  • Set up Zoom war room for coordination
  • Verify on-call contacts

Execution Phase (Game Day)

  • T-0:00 - Declare maintenance start, post status page update
  • T+0:10 - Failover to DR region (manual or automated)
  • T+0:30 - Verify control plane health, test workspace creation
  • T+1:00 - All developers validate workspace access
  • T+1:30 - Verify AI agent workspaces resume queued tasks
  • T+2:00 - Developers work in DR region for 1 hour
  • T+3:00 - Failback to primary region
  • T+3:30 - Verify primary, all-clear announcement
  • T+4:00 - Post-mortem meeting

Success Criteria

Failover RTO

Control plane available in DR region within 30 minutes

Developer Access

95%+ developers can create workspaces in DR region

Data Integrity

Zero data loss, all user accounts and templates intact
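
If Coder is the platform in use, the criteria above can be checked with a short smoke test during the drill. A sketch; the template name is a placeholder and the CLI flags should be confirmed against your installed version:

#!/bin/bash
# Game-day smoke test against the DR region
set -euo pipefail

WS="dr-drill-$(date +%Y%m%d%H%M)"

coder login https://cde.company.com            # DNS should now point at DR
coder create "$WS" --template python-dev --yes
curl -fsS https://cde.company.com/api/v2/health > /dev/null
echo "Smoke test passed: workspace $WS created in DR region"
coder delete "$WS" --yes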

Recovery Runbooks

Step-by-step procedures for common disaster scenarios. Keep these updated and accessible offline.

Complete Regional Outage

SEV-1 | RTO: 30 min
  • Confirm primary region unavailable (AWS status, health checks)
  • Page incident commander and DR team
  • Execute failover script or manual DNS update
  • Promote DR database to primary
  • Scale DR control plane to full capacity
  • Post status page update
  • Verify developer access, monitor error rates

Database Corruption

SEV-1 | RTO: 45 min
  • Stop all writes to database immediately
  • Assess corruption scope (full DB or specific tables)
  • Identify most recent clean snapshot
  • Restore database from snapshot to new instance
  • Verify data integrity with test queries
  • Update control plane connection string
  • Communicate data loss window to users
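
The restore and reconnect steps can be scripted ahead of time. A sketch, reusing the SNAPSHOT_ID placeholder convention from earlier; the restored hostname and Coder deployment details are illustrative:

# Restore the last known-good snapshot to a fresh instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-restored \
  --db-snapshot-identifier SNAPSHOT_ID

aws rds wait db-instance-available \
  --db-instance-identifier coder-db-restored

# Point the control plane at the restored instance
kubectl set env deployment/coder -n coder-system \
  CODER_PG_CONNECTION_URL="postgres://coder:PASSWORD@coder-db-restored.example.internal:5432/coder"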

Workspace Volume Loss

SEV-3 | RTO: 2 hours
  • Identify affected workspace and user
  • Locate most recent volume snapshot
  • Create new PV from snapshot
  • Recreate workspace pointing to restored volume
  • Notify user of data loss window
  • Document incident for postmortem
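
The "create new PV from snapshot" step uses the CSI snapshot dataSource. A sketch, assuming the workspace-snapshot VolumeSnapshot from earlier exists; adjust the storage class and size to match the original volume:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-pvc-restored
  namespace: workspaces
spec:
  storageClassName: gp3
  dataSource:
    name: workspace-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
EOF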

AI Agent Task Interruption

SEV-3 | RTO: 1 hour
  • Identify affected agent workspaces and in-flight tasks
  • Check task queue for pending and in-progress items
  • Review Git branches for checkpointed progress
  • Spin up agent workspaces in DR region
  • Verify LLM API keys and tokens are valid in DR
  • Re-queue interrupted tasks, mark as retry
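
If the task queue is PostgreSQL-backed, re-queuing interrupted work can be a single update. A sketch against a hypothetical agent_tasks table; the connection URL and column names are illustrative:

# Mark tasks that were in flight when the region failed for retry
psql "$TASK_QUEUE_DB_URL" <<'SQL'
UPDATE agent_tasks
SET    status      = 'pending',
       retry_count = retry_count + 1,
       claimed_by  = NULL
WHERE  status = 'in_progress'
  AND  updated_at < now() - interval '10 minutes';
SQL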

Ransomware Attack

SEV-1 | RTO: 4 hours
  • Isolate affected systems immediately (network isolation)
  • Engage security team and legal counsel
  • Do NOT pay ransom - restore from backups
  • Verify backup integrity (pre-infection)
  • Rebuild infrastructure from clean images
  • Restore data from known-clean backups
  • Conduct forensic analysis of attack vector

Offline Runbook Access

During a disaster, your documentation platform may be unavailable. Keep offline copies of runbooks:

PDF Backups

Print to PDF quarterly, store in S3 + local drives
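
Generating those PDFs can be a small scheduled job; a sketch, assuming runbooks live as Markdown in a Git repo and pandoc with a PDF engine is available on the build host:

# Render runbooks to PDF and ship them to a DR bucket
mkdir -p offline
for f in runbooks/*.md; do
  pandoc "$f" -o "offline/$(basename "${f%.md}").pdf"
done
aws s3 cp offline/ "s3://cde-dr-runbooks/$(date +%Y-%m)/" --recursive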

Mobile Access

On-call engineers save runbooks to phone notes app

Physical Copies

Laminated runbooks in incident response binder

Cloud Provider DR Considerations

Each cloud provider offers different DR capabilities. Choose the right services for your architecture.

AWS

Multi-AZ vs Multi-Region

Multi-AZ: RDS, EKS across availability zones. Protects against datacenter failures. RTO: minutes.

Multi-Region: Route53 failover, cross-region RDS replicas, S3 replication. Protects against regional failures. RTO: 15-60 min.

Key Services

  • Route53 health checks + failover
  • RDS automated backups + read replicas
  • S3 cross-region replication
  • EBS snapshot copy to DR region
  • AWS Backup for centralized management

Azure

Availability Zones vs Geo-Replication

Availability Zones: AKS zone redundancy, zone-redundant storage. Protects against datacenter failures.

Geo-Replication: Traffic Manager, Azure SQL geo-replication, storage account replication. Regional protection.

Key Services

  • Traffic Manager for DNS failover
  • Azure SQL active geo-replication
  • Storage account GRS/GZRS replication
  • Azure Site Recovery for VM replication
  • Backup vaults with cross-region restore

GCP

Zonal vs Regional vs Multi-Regional

Regional: GKE regional clusters, Cloud SQL HA across zones. Zone failure protection.

Multi-Regional: Cloud SQL cross-region replicas, multi-region storage. Regional failure protection.

Key Services

  • Cloud DNS with health checks
  • Cloud SQL cross-region read replicas
  • GCS multi-region or dual-region buckets
  • Persistent disk snapshots to any region
  • Regional GKE clusters (3+ zones)

Cross-Cloud DR Strategy

For ultimate resilience, some organizations deploy across multiple cloud providers. This eliminates vendor lock-in and protects against cloud-wide outages.

Benefits

  • Zero vendor lock-in
  • Protection from cloud-wide outages
  • Negotiating leverage on pricing

Challenges

  • Significantly higher operational complexity
  • Data synchronization across clouds
  • Requires cloud-agnostic tooling (Terraform, Kubernetes, DevPod)

Disaster Recovery for AI Agent Workspaces

AI coding agents like Claude Code, GitHub Copilot, and Cursor now run in dedicated CDE workspaces. Their DR requirements differ from interactive developer workspaces in important ways.

Why AI Agent DR is Different

AI agents run long-lived, autonomous tasks - multi-file refactors, test generation, code reviews, and background migrations. Unlike human developers who can context-switch during an outage, an interrupted agent loses its in-flight work and reasoning context.

Long-running tasks: An agent refactoring 50 files across a monorepo may run for 30+ minutes. A mid-task failure wastes all progress unless checkpointed.

Stateful context: Agents build up conversation and codebase context. Losing this mid-task means restarting from scratch, not resuming.

No human fallback: A developer can switch to a local machine. An AI agent workspace has no manual fallback - the platform is the only option.

DR Checklist for AI Agents

Design tasks to be idempotent

Agents should be able to restart a task safely without duplicating work or corrupting state

Checkpoint agent progress to Git

Configure agents to commit work-in-progress to feature branches at regular intervals

Use a task queue with persistence

Queue pending agent tasks in a durable store (Redis with AOF, PostgreSQL) so nothing is lost on restart
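
A PostgreSQL-backed queue in the replicated control-plane database is often enough. A minimal sketch with an illustrative agent_tasks schema and a concurrency-safe worker claim query:

psql "$TASK_QUEUE_DB_URL" <<'SQL'
-- Durable agent task queue (illustrative schema)
CREATE TABLE IF NOT EXISTS agent_tasks (
  id          bigserial PRIMARY KEY,
  task_type   text        NOT NULL,            -- e.g. 'refactor', 'generate-tests'
  payload     jsonb       NOT NULL,
  status      text        NOT NULL DEFAULT 'pending',
  claimed_by  text,
  retry_count int         NOT NULL DEFAULT 0,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Worker claim: safe under concurrency, and pending work survives a
-- failover because the table lives in the replicated database
UPDATE agent_tasks
SET    status = 'in_progress', claimed_by = 'agent-worker-1', updated_at = now()
WHERE  id = (
  SELECT id FROM agent_tasks
  WHERE  status = 'pending'
  ORDER  BY id
  FOR UPDATE SKIP LOCKED
  LIMIT  1
)
RETURNING id, task_type, payload;
SQL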

Separate agent workspaces from developer workspaces

Run AI agents in their own namespace so a noisy agent does not impact human developers during recovery

Monitor API key and token validity

AI agents depend on external API keys (LLM providers, Git). Ensure DR secrets are current and not expired

Include agents in DR drills

Verify AI agent workspaces spin up in the DR region, connect to LLM providers, and resume queued tasks

AI Agent DR Architecture Patterns

Checkpoint-and-Resume

Agents commit partial work to Git branches at logical breakpoints. On recovery, a new agent workspace picks up the branch and continues.

Best for: Large refactoring tasks, multi-file migrations

Idempotent Replay

Design agent tasks so they can be safely re-executed from the beginning. The agent checks what is already done and skips completed steps.

Best for: Test generation, linting, documentation tasks
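
As a concrete illustration of idempotent replay for a test-generation task (generate_tests stands in for whatever the agent actually invokes):

#!/bin/bash
# Idempotent test-generation pass: safe to re-run from the top after a
# failover because completed work is detected and skipped.
set -euo pipefail

for src in $(git ls-files '*.py'); do
  test_file="tests/test_$(basename "$src")"
  [[ -f "$test_file" ]] && continue          # already generated - skip on replay
  generate_tests "$src" > "$test_file"       # hypothetical agent invocation
  git add "$test_file"
done
git commit -m "agent: generate missing tests" || true   # no-op if nothing new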

Queue-Based Failover

Agent tasks sit in a durable queue (backed by a replicated database). DR region workers pick up pending tasks automatically on failover.

Best for: CI/CD agent tasks, scheduled background jobs

CDE Platform Considerations for AI Agent DR

Leading CDE platforms handle AI agent workspaces differently. Understand how your platform of choice supports agent-specific DR requirements.

Coder

Supports headless agent workspaces via Terraform templates. Agent tasks can be managed through the Coder API. Use workspace prebuilds to speed up agent recovery in DR.

Ona (formerly Gitpod)

Ona workspaces can be pre-configured for AI agents using automation APIs. Ensure your DR plan accounts for Ona workspace timeouts and auto-stop policies that may affect long-running agent tasks.

GitHub Codespaces

Codespaces integrates tightly with GitHub Actions for agent-driven workflows. DR depends on GitHub's own availability - consider a self-hosted CDE as a backup plane.

DevPod

DevPod's provider-agnostic model makes cross-cloud DR straightforward. Agent workspaces can failover between AWS, Azure, or GCP providers with the same devcontainer config.

Best Practice for 2026

Treat AI agent workspaces as infrastructure, not developer tools. They should have their own resource quotas, monitoring dashboards, and DR runbooks separate from interactive developer workspaces. When the platform recovers, prioritize spinning up developer workspaces first, then resume agent tasks from their checkpoint or queue.

Frequently Asked Questions

What happens to running workspaces during a regional failover?

Running workspaces in the failed region are lost. Developers must recreate workspaces in the DR region. This is why RPO matters - uncommitted work since the last backup is lost. Encourage developers to commit and push frequently. In active-active architectures, workspaces in healthy regions continue running unaffected.

How much does multi-region DR cost?

Costs vary by pattern. Active-passive adds 10-20% (database replication, snapshots, minimal compute). Warm standby adds 30-50% (partial DR infrastructure). Active-active doubles infrastructure costs but provides best availability. Calculate ROI against cost of downtime. For 200 developers at $100/hour, even 1 hour of downtime costs $20,000.

Should we back up every workspace or just critical ones?

Implement tiered backup strategies. Critical workspaces (production databases, shared services): 4-hour RPO with regular snapshots. Standard development: 24-hour RPO with daily snapshots. Ephemeral workspaces (tutorial environments, testing): no backups - recreate from templates. AI agent workspaces fall into their own tier - back up the task queue and checkpoint state, but the workspace itself can be ephemeral.

What if developers are working when we need to test DR failover?

Schedule annual full DR drills during a maintenance window with 2+ weeks notice. For quarterly drills, use a dedicated test CDE environment or test during off-hours (weekends, late evening). For continuous testing, use chaos engineering on non-production environments. Never surprise developers with production failover testing during business hours.

How do AI coding agents affect our DR strategy?

AI agents like Claude Code and GitHub Copilot run long-lived tasks in CDE workspaces that cannot simply resume after a restart. Design agent tasks to be idempotent or checkpoint progress to Git branches. Use a persistent task queue so pending work survives a failover. Separate agent workspaces into their own Kubernetes namespace with dedicated resource quotas, and prioritize developer workspace recovery over agent task resumption.

Should AI agent workspaces run in the DR region during normal operations?

In an active-active architecture, yes - distributing agent workloads across regions provides natural resilience and reduces blast radius. In active-passive setups, keep agents in the primary region but ensure the task queue is replicated to DR. On failover, the DR region can spin up agent workspaces and drain the queue. This is simpler than live-migrating in-progress agent tasks.