Disaster Recovery for Cloud Development Environments
Comprehensive DR planning for CDEs: RTO/RPO targets, backup strategies, failover procedures, AI agent workspace recovery, and testing to ensure business continuity when disaster strikes.
Why DR Matters for Cloud Development Environments
When your CDE platform goes down, all developers and AI coding agents stop working simultaneously. Understanding the unique DR requirements for CDEs is critical.
Massive Blast Radius
Unlike local development where individual laptop failures affect one person, a CDE outage impacts every single developer in your organization simultaneously.
Impact Example
A 4-hour CDE outage at a company with 200 developers and 50 AI agent workspaces means 800+ lost developer hours plus halted autonomous coding tasks. At a loaded cost of $100/hour, that is $80,000 in direct productivity loss - plus reputation damage, missed deadlines, and customer impact.
Local Dev vs CDE DR
Local Development
- Failures are isolated to individuals
- Lost laptop affects 1 person
- Recovery: Buy new laptop, restore from backup
- Code in Git provides resilience
Cloud Development Environments
- Platform failure affects everyone
- Control plane down = 0 productivity for humans and AI agents
- Recovery requires infrastructure rebuild
- Workspace data needs backup strategy
CDE-Specific DR Considerations
Code in Git
Committed code is already resilient (stored in Git). Your primary concern is uncommitted work in workspace filesystems and the ability to recreate workspaces.
Control Plane is Critical
The control plane (user data, templates, configurations) is your single point of failure. Database backups and multi-region deployment are essential.
Developer Patience is Limited
Developers expect high availability. Outages longer than 30 minutes trigger fallback to local development and erode trust in the platform.
AI Agents Need Uptime Too
AI coding agents (Claude Code, GitHub Copilot, Cursor) increasingly run in dedicated CDE workspaces. When your platform is down, autonomous coding tasks, background refactors, and AI-assisted pipelines all stop.
RTO and RPO Targets for CDE Components
Define recovery objectives by component. Not everything needs the same level of protection.
RTO
Recovery Time Objective
Maximum acceptable downtime after a disaster. How long can developers be without their workspaces before business impact becomes severe?
Typical CDE RTO Targets:
- Mission Critical: 15-30 minutes
- Standard: 2-4 hours
- Best Effort: 24 hours
RPO
Recovery Point Objective
Maximum acceptable data loss measured in time. How much uncommitted work can developers afford to lose?
Typical CDE RPO Targets:
- Control Plane: 0-5 minutes
- Workspace Data: 1-4 hours
- Templates/Config: 0 (stored in Git)
| Component | What It Contains | Recommended RTO | Recommended RPO | Backup Strategy |
|---|---|---|---|---|
| Control Plane Database | User accounts, workspace metadata, templates, RBAC policies | 15 min | 5 min | Continuous replication, automated snapshots every 5 min |
| Workspace Storage (Persistent) | Uncommitted code, build artifacts, local configs, databases | 1 hour | 4 hours | Volume snapshots every 4 hours, cross-region replication |
| Templates & Infrastructure Code | Terraform templates, DevContainer configs, base images | 10 min | 0 | Stored in Git (GitHub, GitLab), multi-region container registry |
| Container Images | Base images, prebuilds, workspace containers | 30 min | 0 | Multi-region registry replication (ECR, ACR, Artifact Registry) |
| Kubernetes Cluster State | Running workspaces, pods, configurations | 30 min | N/A | Ephemeral - recreate from templates. Use Velero for etcd backups if needed |
| AI Agent Workspaces | Running AI coding agents, task queues, agent context, in-progress refactors | 30 min | 1 hour | Task queue checkpointing, agent state snapshots, idempotent task design for safe restart |
| Secrets & Credentials | API keys, database passwords, SSH keys | 15 min | 0 | Multi-region secrets manager (Vault, AWS Secrets Manager, etc.) |
DR Architecture Patterns
Choose the DR pattern that balances cost, complexity, and recovery objectives for your organization.
Active-Passive
Primary region with cold standby DR site
Active-Active (Recommended)
Traffic served from multiple regions simultaneously
Pilot Light
Minimal DR infrastructure always running
Warm Standby
Scaled-down version running in DR region
Active-Active Multi-Region CDE Architecture
            Global Traffic Manager (Route53 Latency-Based Routing)
                     /                                \
      Region: us-east-1                        Region: eu-west-1
      -----------------                        -----------------
        Load Balancer                            Load Balancer
        /     |     \                            /     |     \
    Coder   Coder   Coder                    Coder   Coder   Coder
    Pod 1   Pod 2   Pod 3                    Pod 1   Pod 2   Pod 3
        \     |     /                            \     |     /
    PostgreSQL (Primary)                    PostgreSQL (Replica)
       RDS Multi-AZ                  Read Replica + Async Replication
            |                                        |
            +--- Cross-Region Replication (5-10 second lag) ---+

   Developer Workspaces:                   Developer Workspaces:
   - us-east-1a (AZ-1)                     - eu-west-1a (AZ-1)
   - us-east-1b (AZ-2)                     - eu-west-1b (AZ-2)
   - us-east-1c (AZ-3)                     - eu-west-1c (AZ-3)

   Persistent Volumes:                     Persistent Volumes:
   EBS snapshots -> S3                     EBS snapshots -> S3
            |                                        |
            +------ S3 Cross-Region Replication -----+
                       (RPO: 15 minutes)
Backup Strategies for CDE Components
Comprehensive backup strategy covering control plane, workspace data, and configuration.
Control Plane Database Backups
The control plane database contains user accounts, workspace metadata, templates, and RBAC policies. This is your most critical backup target.
Continuous Replication
# AWS RDS - Enable Multi-AZ + cross-region read replica
# (a cross-region source must be referenced by its ARN)
aws rds create-db-instance-read-replica \
  --db-instance-identifier coder-db-replica-eu \
  --source-db-instance-identifier \
    arn:aws:rds:us-east-1:123456789012:db:coder-db-us \
  --source-region us-east-1 \
  --region eu-west-1

# PostgreSQL Logical Replication
-- On the primary
CREATE PUBLICATION coder_pub FOR ALL TABLES;
-- On the replica (the connection string needs a role
-- with REPLICATION privileges and its credentials)
CREATE SUBSCRIPTION coder_sub
  CONNECTION 'host=primary-db port=5432 dbname=coder'
  PUBLICATION coder_pub;
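Replication only meets the control-plane RPO if its lag stays within that window, so it is worth checking continuously. A minimal monitoring sketch, assuming the replica is named coder-db-replica-eu as in the command above and that the monitoring host has the AWS CLI (and optionally psql) available:

#!/bin/bash
# Alert if DR replica lag exceeds the control-plane RPO (5 min)
set -euo pipefail

MAX_LAG_SECONDS=300

# Option 1: RDS ReplicaLag metric via CloudWatch (GNU date assumed)
lag=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=coder-db-replica-eu \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Maximum \
  --query 'Datapoints[0].Maximum' --output text \
  --region eu-west-1)

# Option 2: ask the replica directly (physical standby)
# psql -h replica-host -d coder -Atc \
#   "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());"

if (( $(echo "$lag > $MAX_LAG_SECONDS" | bc -l) )); then
  echo "ALERT: replica lag ${lag}s exceeds RPO target" >&2
  exit 1
fi
echo "Replica lag OK: ${lag}s"

Run this on a schedule (cron, Lambda, or your monitoring agent) and wire the non-zero exit into your alerting.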
Automated Snapshots
# Enable automated backups
aws rds modify-db-instance \
--db-instance-identifier coder-db \
--backup-retention-period 30 \
--preferred-backup-window "03:00-04:00"
# Manual snapshot for testing
aws rds create-db-snapshot \
--db-instance-identifier coder-db \
--db-snapshot-identifier pre-upgrade-$(date +%Y%m%d)
# Copy snapshot to DR region (cross-region copies
# reference the source snapshot by its ARN)
aws rds copy-db-snapshot \
  --source-region us-east-1 \
  --source-db-snapshot-identifier \
    arn:aws:rds:us-east-1:123456789012:snapshot:SNAPSHOT_ID \
  --target-db-snapshot-identifier SNAPSHOT_ID \
  --region eu-west-1
Best Practice
Test database restoration monthly. Restore to a test environment and verify workspace metadata integrity. Automated snapshots mean nothing if you have never tested restoration.
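The monthly restore test can be scripted end to end. A sketch, assuming an RDS instance named coder-db as above, that credentials are supplied via PGPASSWORD or IAM auth, and that a workspaces table exists in the Coder schema (the table name is an assumption; adjust to your schema):

#!/bin/bash
# Monthly restore test: restore the latest automated snapshot
# into a throwaway instance and sanity-check the contents
set -euo pipefail

SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier coder-db \
  --snapshot-type automated \
  --query 'sort_by(DBSnapshots,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-restore-test \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --db-instance-class db.t3.medium \
  --no-multi-az

aws rds wait db-instance-available \
  --db-instance-identifier coder-db-restore-test

# Verify workspace metadata survived the restore
ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier coder-db-restore-test \
  --query 'DBInstances[0].Endpoint.Address' --output text)
psql "host=$ENDPOINT dbname=coder user=coder" \
  -c "SELECT count(*) AS workspaces FROM workspaces;"

# Tear down the test instance afterwards
aws rds delete-db-instance \
  --db-instance-identifier coder-db-restore-test \
  --skip-final-snapshot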
Workspace Persistent Volume Backups
Workspace persistent volumes contain uncommitted code, build artifacts, and local databases. Balance RPO targets against storage costs.
Volume Snapshots
# Kubernetes VolumeSnapshot (CSI)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: workspace-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: workspace-pvc
# Automated snapshot via CronJob (the job's service account needs RBAC to create VolumeSnapshots)
kubectl create cronjob workspace-snapshots \
  --image=bitnami/kubectl \
  --schedule="0 */4 * * *" \
  -- /bin/sh -c 'kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: workspace-$(date +%Y%m%d-%H%M)
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: workspace-pvc
EOF'
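Restoring is the other half: a new PVC can be provisioned directly from a snapshot via its dataSource, then mounted by the recreated workspace. A minimal sketch reusing the snapshot name above (the storage class and size are assumptions):

# Provision a new PVC from an existing VolumeSnapshot
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-pvc-restored
spec:
  storageClassName: gp3            # assumption: your CSI storage class
  dataSource:
    name: workspace-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi                # must be >= the snapshot's source PVC
EOF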
Selective Backup with Velero
# Install Velero with S3 backend
velero install \
--provider aws \
--bucket cde-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1
# Create backup schedule
velero schedule create daily-workspaces \
--schedule="0 2 * * *" \
--include-namespaces workspaces \
--ttl 720h
# Backup critical workspace only
velero backup create important-workspace \
--include-resources pvc \
--selector workspace=prod-db-workspace
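Backups are only half of the Velero workflow; restores are what you exercise in a drill. A sketch using the schedule and namespace defined above (the backup name with timestamp is illustrative - scheduled backups are named schedule-name plus a timestamp):

# List available backups and restore one into the
# workspaces namespace
velero backup get

velero restore create workspaces-restore-$(date +%Y%m%d) \
  --from-backup daily-workspaces-20240101020000 \
  --include-namespaces workspaces

# Watch progress and inspect any errors
velero restore describe workspaces-restore-$(date +%Y%m%d) --details
velero restore logs workspaces-restore-$(date +%Y%m%d)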
Cost Consideration
Backing up every workspace every 4 hours can be expensive. Consider tiered backup strategies: critical workspaces (4h RPO), standard (24h RPO), ephemeral (no backups - recreate from template).
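One way to implement the tiers is to label workspace PVCs by tier and drive separate Velero schedules off those labels. A sketch under the assumption of a backup-tier label convention (the label key, tier values, and retention periods are assumptions):

# Tag workspaces by tier (set by the template or an admin)
kubectl label pvc workspace-pvc backup-tier=critical -n workspaces

# Critical tier: every 4 hours, 30-day retention
velero schedule create critical-workspaces \
  --schedule="0 */4 * * *" \
  --include-namespaces workspaces \
  --selector backup-tier=critical \
  --ttl 720h

# Standard tier: daily, 14-day retention
velero schedule create standard-workspaces \
  --schedule="0 2 * * *" \
  --include-namespaces workspaces \
  --selector backup-tier=standard \
  --ttl 336h

# Ephemeral tier: no schedule - recreate from templates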
Configuration and Template Backups
Templates, Terraform configurations, and DevContainer definitions should be version-controlled in Git. This provides zero RPO and easy restoration.
Git-Based Configuration
# Directory structure
cde-templates/
├── .git/
├── python-dev/
│ ├── main.tf
│ ├── devcontainer.json
│ └── README.md
├── java-spring/
│ ├── main.tf
│ └── docker-compose.yml
└── data-science/
├── main.tf
└── requirements.txt
# Push templates to multi-region repos
git remote add origin-us \
[email protected]:company/cde-templates.git
git remote add origin-eu \
[email protected]:platform/cde-templates.git
git push origin-us main
git push origin-eu main
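To keep both mirrors in sync without remembering two pushes, Git can attach multiple push URLs to a single remote. A small sketch based on the remotes above:

# One remote, two push destinations
git remote add origin [email protected]:company/cde-templates.git
git remote set-url --add --push origin \
  [email protected]:company/cde-templates.git
git remote set-url --add --push origin \
  [email protected]:platform/cde-templates.git

# A single push now updates both mirrors
git push origin main

Running this from CI on every merge keeps the DR copy of your templates at zero RPO without relying on developers to remember the second remote.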
Container Image Replication
# AWS ECR - Enable cross-region replication
aws ecr put-replication-configuration \
--replication-configuration '{
"rules": [{
"destinations": [{
"region": "eu-west-1",
"registryId": "123456789012"
}]
}]
}'
# Azure ACR - Geo-replication
az acr replication create \
--registry cderegistry \
--location westeurope
# Manual image copy
docker pull us.gcr.io/company/workspace:latest
docker tag us.gcr.io/company/workspace:latest \
eu.gcr.io/company/workspace:latest
docker push eu.gcr.io/company/workspace:latest
Failover Procedures
Automated and manual failover strategies to minimize downtime during regional outages.
Automated Failover
Health checks detect failures and automatically redirect traffic to healthy regions without manual intervention.
DNS-Based Health Checks
# AWS Route53 Health Check
# (--caller-reference is required and must be unique)
aws route53 create-health-check \
--caller-reference cde-health-$(date +%s) \
--health-check-config \
IPAddress=203.0.113.1,\
Port=443,\
Type=HTTPS,\
ResourcePath=/api/v2/health,\
FullyQualifiedDomainName=cde.company.com,\
RequestInterval=30,\
FailureThreshold=3
# Associate the health check with the primary DNS record;
# Route53 then fails over to the secondary region
# automatically when the check fails
Load Balancer Health Checks
# Kubernetes readiness probe - pods that fail it are
# removed from Service endpoints automatically
readinessProbe:
  httpGet:
    path: /api/v2/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
# Pair with a livenessProbe so containers that stop
# responding entirely are restarted
Manual Failover
For active-passive architectures or when automated failover is not configured, manual procedures are required.
Failover Script
#!/bin/bash
# DR Failover Script
set -euo pipefail
PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
echo "Step 1: Promote DR database to primary"
aws rds promote-read-replica \
--db-instance-identifier coder-db-dr \
--region $DR_REGION
echo "Step 2: Update DNS to DR region"
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch file://dr-dns.json
echo "Step 3: Scale DR control plane"
kubectl scale deployment coder \
--replicas=3 -n coder-system \
--context dr-cluster
echo "Failover complete"
echo "Verify: curl https://cde.company.com/health"
Manual Failover Checklist
- Confirm primary region is unrecoverable
- Notify all stakeholders
- Promote database replica
- Update DNS records
- Scale DR infrastructure
- Verify developer access
- Document incident timeline
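The "verify developer access" step in the checklist above can be partly automated. A smoke-test sketch, assuming the /api/v2/health endpoint shown earlier; the final workspace-creation check uses the Coder CLI with illustrative flags (adjust the command and template name to your platform and version):

#!/bin/bash
# Post-failover smoke test: DNS points at the DR region,
# the control plane answers, and a test workspace builds
set -euo pipefail

CDE_URL="https://cde.company.com"

# 1. DNS should now resolve to the DR load balancer
dig +short cde.company.com

# 2. Control plane health endpoint responds
for i in $(seq 1 30); do
  if curl -fsS "$CDE_URL/api/v2/health" > /dev/null; then
    echo "Control plane healthy"
    break
  fi
  echo "Waiting for control plane ($i/30)..."
  sleep 10
done

# 3. End-to-end check: build and delete a throwaway workspace
# (CLI flags are illustrative - adjust to your platform)
coder create dr-smoke-test --template python-dev --yes
coder delete dr-smoke-test --yes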
DNS Failover Configuration
DNS-based failover uses health checks to automatically route traffic to healthy regions. This is the foundation of most DR strategies.
Route53 Failover Policy
{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "cde.company.com",
"Type": "A",
"SetIdentifier": "Primary-US-East",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z1234567890ABC",
"DNSName": "primary-lb-123.us-east-1.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
},
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "cde.company.com",
"Type": "A",
"SetIdentifier": "Secondary-EU-West",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "Z9876543210XYZ",
"DNSName": "dr-lb-456.eu-west-1.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
}]
}
Health Check Configuration
{
  "Type": "HTTPS",
  "ResourcePath": "/api/v2/health",
  "FullyQualifiedDomainName": "cde.company.com",
  "Port": 443,
  "RequestInterval": 30,
  "FailureThreshold": 3,
  "MeasureLatency": true,
  "EnableSNI": true
}
# The health check requires:
# - A 2xx/3xx response within 2 seconds of connecting
# - 3 consecutive failures mark the endpoint unhealthy
#   and trigger failover
# - 3 consecutive successes restore the primary
TTL Consideration: Set DNS TTL to 60 seconds or less. Longer TTLs delay failover as clients cache old records.
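The health check config above can be created from a file, and the low-TTL guidance is easy to verify from any client. A short sketch, assuming the JSON is saved as health-check.json (the file name is arbitrary):

# Create the health check from the JSON above
aws route53 create-health-check \
  --caller-reference cde-failover-$(date +%s) \
  --health-check-config file://health-check.json

# Confirm clients see a short TTL on the failover record
# (the second field of each answer line is the remaining TTL)
dig +noall +answer cde.company.com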
Testing Disaster Recovery Plans
Untested DR plans fail when you need them most. Regular testing validates procedures and builds muscle memory for your team.
Weekly
- Verify backups completed
- Test health check endpoints
- Review monitoring dashboards
- Validate replication lag
Monthly
- Restore database to test environment
- Walkthrough runbooks with team
- Test workspace volume restore
- Verify on-call escalation
Quarterly
- Partial failover drill (non-prod)
- Chaos engineering tests
- Test AI agent task recovery
- Update DR documentation
- Review RTO/RPO targets
Annually
- Full production failover drill
- Multi-day DR simulation
- Third-party DR audit
- Executive tabletop exercise
Tabletop Exercises
Walk through DR scenarios without actually executing failover. Identify gaps in procedures and unclear responsibilities.
Sample Scenarios
Scenario 1: Regional Outage
"AWS us-east-1 has a complete outage affecting all services. Your CDE control plane is unreachable. Walk through the failover process."
Scenario 2: Database Corruption
"Your control plane database has been corrupted. Last known good backup is 6 hours old. How do you recover?"
Scenario 3: Ransomware
"Ransomware has encrypted all workspaces and control plane. Attackers demand payment. What is your response plan?"
Scenario 4: AI Agent Workspace Outage
"Your primary region is down. 30 AI coding agents were mid-task - refactoring, generating tests, and running migrations. How do you recover agent progress and restart tasks in the DR region?"
Exercise Format
Present Scenario (5 min)
Facilitator describes the disaster situation
Initial Response (15 min)
Team discusses who does what, referencing runbooks
Walk Through Steps (30 min)
Step-by-step discussion of DR procedures
Identify Gaps (10 min)
Document unclear procedures, missing tools, training needs
Action Items (10 min)
Assign improvements to team members with deadlines
Chaos Engineering for Disaster Recovery
Intentionally inject failures in non-production environments to validate DR automation and team response.
Control Plane Failure
# Kill control plane pods
kubectl delete pod -l app=coder \
-n coder-system
# Verify:
# - Kubernetes restarts pods
# - Health checks detect failure
# - Alerting triggers
# - Workspaces unaffected
Database Failure Simulation
# Simulate loss of the database connection by blocking
# the control plane's egress with a NetworkPolicy
# (see the sketch below this block)
# Alternatively, take the control plane offline entirely
kubectl scale deployment coder \
--replicas=0 -n coder-system
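A minimal deny-egress policy for that experiment, assuming the control plane pods carry the app=coder label in the coder-system namespace (as in the earlier examples) and that your CNI enforces NetworkPolicy:

# Block all egress from control plane pods so every database
# connection attempt fails; delete the policy to end the test
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-block-db-egress
  namespace: coder-system
spec:
  podSelector:
    matchLabels:
      app: coder
  policyTypes:
    - Egress
  egress: []
EOF

# Roll back
kubectl delete networkpolicy chaos-block-db-egress -n coder-system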
Regional Failure
# Update Route53 to fail primary
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "cde-test.company.com",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"HealthCheckId": "FORCE_FAIL",
...
}
}]
}'
# Verify automatic failover to DR
Safety First
Always run chaos experiments in non-production environments first. Schedule during low-traffic periods. Have rollback procedures ready. Notify stakeholders before testing.
Annual Game Day: Full DR Drill
Once per year, execute a complete failover to your DR region during a scheduled maintenance window. This is the ultimate test of your DR preparedness.
Planning Phase (T-4 weeks)
- Schedule 4-hour maintenance window
- Notify all developers 2 weeks in advance
- Create detailed runbook with timings
- Assign roles (commander, scribe, ops, comms)
- Set up Zoom war room for coordination
- Verify on-call contacts
Execution Phase (Game Day)
- T+0:00 - Declare maintenance start, post status page update
- T+0:10 - Failover to DR region (manual or automated)
- T+0:30 - Verify control plane health, test workspace creation
- T+1:00 - All developers validate workspace access
- T+1:30 - Verify AI agent workspaces resume queued tasks
- T+2:00 - Developers work in DR region for 1 hour
- T+3:00 - Failback to primary region
- T+3:30 - Verify primary, all-clear announcement
- T+4:00 - Post-mortem meeting
Success Criteria
Failover RTO
Control plane available in DR region within 30 minutes
Developer Access
95%+ developers can create workspaces in DR region
Data Integrity
Zero data loss, all user accounts and templates intact
Recovery Runbooks
Step-by-step procedures for common disaster scenarios. Keep these updated and accessible offline.
Complete Regional Outage
SEV-1 | RTO: 30 minutes
- Confirm primary region unavailable (AWS status, health checks)
- Page incident commander and DR team
- Execute failover script or manual DNS update
- Promote DR database to primary
- Scale DR control plane to full capacity
- Post status page update
- Verify developer access, monitor error rates
Database Corruption
SEV-1 | RTO: 45 minutes
- Stop all writes to database immediately
- Assess corruption scope (full DB or specific tables)
- Identify most recent clean snapshot
- Restore database from snapshot to new instance
- Verify data integrity with test queries
- Update control plane connection string
- Communicate data loss window to users
Workspace Volume Loss
SEV-3 | RTO: 2 hours
- Identify affected workspace and user
- Locate most recent volume snapshot
- Create new PV from snapshot
- Recreate workspace pointing to restored volume
- Notify user of data loss window
- Document incident for postmortem
AI Agent Task Interruption
SEV-3 | RTO: 1 hour
- Identify affected agent workspaces and in-flight tasks
- Check task queue for pending and in-progress items
- Review Git branches for checkpointed progress
- Spin up agent workspaces in DR region
- Verify LLM API keys and tokens are valid in DR
- Re-queue interrupted tasks, mark as retry
Ransomware Attack
SEV-1 | RTO: 4 hours
- Isolate affected systems immediately (network isolation)
- Engage security team and legal counsel
- Do NOT pay ransom - restore from backups
- Verify backup integrity (pre-infection)
- Rebuild infrastructure from clean images
- Restore data from known-clean backups
- Conduct forensic analysis of attack vector
Offline Runbook Access
During a disaster, your documentation platform may be unavailable. Keep offline copies of runbooks:
- Print runbooks to PDF quarterly; store copies in S3 and on local drives
- On-call engineers save runbooks to their phone's notes app
- Keep laminated runbooks in the incident response binder
Cloud Provider DR Considerations
Each cloud provider offers different DR capabilities. Choose the right services for your architecture.
AWS
Multi-AZ vs Multi-Region
Multi-AZ: RDS, EKS across availability zones. Protects against datacenter failures. RTO: minutes.
Multi-Region: Route53 failover, cross-region RDS replicas, S3 replication. Protects against regional failures. RTO: 15-60 min.
Key Services
- Route53 health checks + failover
- RDS automated backups + read replicas
- S3 cross-region replication
- EBS snapshot copy to DR region
- AWS Backup for centralized management
Azure
Availability Zones vs Geo-Replication
Availability Zones: AKS zone redundancy, zone-redundant storage. Protects against datacenter failures.
Geo-Replication: Traffic Manager, Azure SQL geo-replication, storage account replication. Regional protection.
Key Services
- Traffic Manager for DNS failover
- Azure SQL active geo-replication
- Storage account GRS/GZRS replication
- Azure Site Recovery for VM replication
- Backup vaults with cross-region restore
GCP
Zonal vs Regional vs Multi-Regional
Regional: GKE regional clusters, Cloud SQL HA across zones. Zone failure protection.
Multi-Regional: Cloud SQL cross-region replicas, multi-region storage. Regional failure protection.
Key Services
- Cloud DNS with health checks
- Cloud SQL cross-region read replicas
- GCS multi-region or dual-region buckets
- Persistent disk snapshots to any region
- Regional GKE clusters (3+ zones)
Cross-Cloud DR Strategy
For ultimate resilience, some organizations deploy across multiple cloud providers. This eliminates vendor lock-in and protects against cloud-wide outages.
Benefits
- Zero vendor lock-in
- Protection from cloud-wide outages
- Negotiating leverage on pricing
Challenges
- Significantly higher operational complexity
- Data synchronization across clouds
- Requires cloud-agnostic tooling (Terraform, Kubernetes, DevPod)
Disaster Recovery for AI Agent Workspaces
AI coding agents like Claude Code, GitHub Copilot, and Cursor now run in dedicated CDE workspaces. Their DR requirements differ from interactive developer workspaces in important ways.
Why AI Agent DR is Different
AI agents run long-lived, autonomous tasks - multi-file refactors, test generation, code reviews, and background migrations. Unlike human developers who can context-switch during an outage, an interrupted agent loses its in-flight work and reasoning context.
Long-running tasks: An agent refactoring 50 files across a monorepo may run for 30+ minutes. A mid-task failure wastes all progress unless checkpointed.
Stateful context: Agents build up conversation and codebase context. Losing this mid-task means restarting from scratch, not resuming.
No human fallback: A developer can switch to a local machine. An AI agent workspace has no manual fallback - the platform is the only option.
DR Checklist for AI Agents
Design tasks to be idempotent
Agents should be able to restart a task safely without duplicating work or corrupting state
Checkpoint agent progress to Git
Configure agents to commit work-in-progress to feature branches at regular intervals (a minimal checkpoint sketch follows this checklist)
Use a task queue with persistence
Queue pending agent tasks in a durable store (Redis with AOF, PostgreSQL) so nothing is lost on restart
Separate agent workspaces from developer workspaces
Run AI agents in their own namespace so a noisy agent does not impact human developers during recovery
Monitor API key and token validity
AI agents depend on external API keys (LLM providers, Git). Ensure DR secrets are current and not expired
Include agents in DR drills
Verify AI agent workspaces spin up in the DR region, connect to LLM providers, and resume queued tasks
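The checkpoint item above can be as simple as a sidecar loop running next to the agent. A sketch, assuming the agent works in a Git checkout and that pushing work-in-progress to a dedicated checkpoint branch is acceptable (the branch naming and interval are assumptions):

#!/bin/bash
# Every 10 minutes, commit and push whatever the agent has
# produced so far, so a DR workspace can resume from the branch
set -euo pipefail

BRANCH="agent/checkpoint-$(date +%Y%m%d)"
git checkout -B "$BRANCH"

while true; do
  if [ -n "$(git status --porcelain)" ]; then
    git add -A
    git commit -m "wip: agent checkpoint $(date -u +%H:%M)"
    git push -u origin "$BRANCH"
  fi
  sleep 600
done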
AI Agent DR Architecture Patterns
Checkpoint-and-Resume
Agents commit partial work to Git branches at logical breakpoints. On recovery, a new agent workspace picks up the branch and continues.
Idempotent Replay
Design agent tasks so they can be safely re-executed from the beginning. The agent checks what is already done and skips completed steps.
Queue-Based Failover
Agent tasks sit in a durable queue (backed by a replicated database). DR region workers pick up pending tasks automatically on failover.
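For the queue-based pattern, a table in the replicated control-plane database plus SELECT ... FOR UPDATE SKIP LOCKED is often sufficient; workers in either region claim tasks the same way. A sketch (the table and column names, timeout, and DATABASE_URL variable are assumptions):

# Durable agent task queue in the replicated database;
# DR workers run the same claim query after failover
psql "$DATABASE_URL" <<'SQL'
CREATE TABLE IF NOT EXISTS agent_tasks (
  id          bigserial PRIMARY KEY,
  payload     jsonb       NOT NULL,
  status      text        NOT NULL DEFAULT 'pending',
  attempts    int         NOT NULL DEFAULT 0,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

-- Claim the next pending task; SKIP LOCKED lets many
-- workers poll concurrently without blocking each other
UPDATE agent_tasks
SET status = 'in_progress',
    attempts = attempts + 1,
    updated_at = now()
WHERE id = (
  SELECT id FROM agent_tasks
  WHERE status = 'pending'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;

-- After a failover, requeue anything a lost worker had claimed
UPDATE agent_tasks
SET status = 'pending'
WHERE status = 'in_progress'
  AND updated_at < now() - interval '30 minutes';
SQL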
CDE Platform Considerations for AI Agent DR
Leading CDE platforms handle AI agent workspaces differently. Understand how your platform of choice supports agent-specific DR requirements.
Coder
Supports headless agent workspaces via Terraform templates. Agent tasks can be managed through the Coder API. Use workspace prebuilds to speed up agent recovery in DR.
Ona (formerly Gitpod)
Ona workspaces can be pre-configured for AI agents using automation APIs. Ensure your DR plan accounts for Ona workspace timeouts and auto-stop policies that may affect long-running agent tasks.
GitHub Codespaces
Codespaces integrates tightly with GitHub Actions for agent-driven workflows. DR depends on GitHub's own availability - consider a self-hosted CDE as a backup plane.
DevPod
DevPod's provider-agnostic model makes cross-cloud DR straightforward. Agent workspaces can failover between AWS, Azure, or GCP providers with the same devcontainer config.
Best Practice for 2026
Treat AI agent workspaces as infrastructure, not developer tools. They should have their own resource quotas, monitoring dashboards, and DR runbooks separate from interactive developer workspaces. When the platform recovers, prioritize spinning up developer workspaces first, then resume agent tasks from their checkpoint or queue.
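One concrete way to encode "developers first, agents second" during recovery is Kubernetes priority classes, so that when DR capacity is scarce the scheduler favors developer workspaces and lets agent pods wait. A sketch (the class names and values are assumptions; workspace templates must reference them):

# Developer workspaces schedule ahead of AI agent workspaces
# when the DR cluster is capacity-constrained
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: developer-workspace
value: 100000
globalDefault: false
description: "Interactive developer workspaces - recover first"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-agent-workspace
value: 1000
preemptionPolicy: Never
description: "Headless AI agent workspaces - resume after developers"
EOF

# Workspace pod templates then set:
#   spec:
#     priorityClassName: developer-workspace   # or ai-agent-workspace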
Frequently Asked Questions
What happens to running workspaces during a regional failover?
Running workspaces in the failed region are lost. Developers must recreate workspaces in the DR region. This is why RPO matters - uncommitted work since the last backup is lost. Encourage developers to commit and push frequently. In active-active architectures, workspaces in healthy regions continue running unaffected.
How much does multi-region DR cost?
Costs vary by pattern. Active-passive adds 10-20% (database replication, snapshots, minimal compute). Warm standby adds 30-50% (partial DR infrastructure). Active-active doubles infrastructure costs but provides best availability. Calculate ROI against cost of downtime. For 200 developers at $100/hour, even 1 hour of downtime costs $20,000.
Should we back up every workspace or just critical ones?
Implement tiered backup strategies. Critical workspaces (production databases, shared services): 4-hour RPO with regular snapshots. Standard development: 24-hour RPO with daily snapshots. Ephemeral workspaces (tutorial environments, testing): no backups - recreate from templates. AI agent workspaces fall into their own tier - back up the task queue and checkpoint state, but the workspace itself can be ephemeral.
What if developers are working when we need to test DR failover?
Schedule annual full DR drills during a maintenance window with 2+ weeks notice. For quarterly drills, use a dedicated test CDE environment or test during off-hours (weekends, late evening). For continuous testing, use chaos engineering on non-production environments. Never surprise developers with production failover testing during business hours.
How do AI coding agents affect our DR strategy?
AI agents like Claude Code and GitHub Copilot run long-lived tasks in CDE workspaces that cannot simply resume after a restart. Design agent tasks to be idempotent or checkpoint progress to Git branches. Use a persistent task queue so pending work survives a failover. Separate agent workspaces into their own Kubernetes namespace with dedicated resource quotas, and prioritize developer workspace recovery over agent task resumption.
Should AI agent workspaces run in the DR region during normal operations?
In an active-active architecture, yes - distributing agent workloads across regions provides natural resilience and reduces blast radius. In active-passive setups, keep agents in the primary region but ensure the task queue is replicated to DR. On failover, the DR region can spin up agent workspaces and drain the queue. This is simpler than live-migrating in-progress agent tasks.