Risk Management & Rollback Strategies
Comprehensive risk assessment, mitigation strategies, and rollback procedures for CDE migrations. Plan for every scenario from vendor discontinuation to migration failures.
CDE Risk Assessment Matrix
Identify, assess, and prioritize risks before implementation
Technical Risks
- HIGH Control plane single point of failure
- HIGH Network latency affecting developer experience
- MED Storage performance degradation
- MED IDE plugin compatibility issues
- LOW Template drift across environments
Organizational Risks
- HIGH Developer resistance to workflow change
- HIGH Insufficient platform engineering resources
- MED Knowledge concentration in few individuals
- MED Lack of executive sponsorship
- LOW Shadow IT local development
Vendor Risks
- HIGH Vendor acquisition or product discontinuation
- MED Significant pricing changes
- MED Feature deprecation without alternatives
- MED Support quality degradation
- LOW API breaking changes
AI Agent & LLM Risks
- HIGH AI agent sandbox escape or privilege escalation
- HIGH Uncontrolled LLM token costs from autonomous agents
- HIGH Sensitive code or secrets leaked via LLM context windows
- MED AI-generated code introducing vulnerabilities at scale
- MED Prompt injection via malicious repository content
Compliance & Data Risks
- HIGH Source code sent to external LLM APIs without approval
- HIGH Regulatory violations from AI training on customer data
- MED Audit trail gaps for AI agent actions in workspaces
- MED IP ownership disputes over AI-generated code
- LOW License contamination from AI-suggested dependencies
Risk Scoring Framework
| Risk Factor | Probability (1-5) | Impact (1-5) | Score | Mitigation Priority |
|---|---|---|---|---|
| Control plane outage | 3 | 5 | 15 | Critical - Immediate |
| Developer productivity loss | 4 | 4 | 16 | Critical - Immediate |
| Vendor discontinuation | 2 | 5 | 10 | High - Plan within 30 days |
| Cost overrun | 3 | 3 | 9 | High - Plan within 30 days |
| Security breach | 2 | 5 | 10 | High - Plan within 30 days |
| AI agent sandbox escape | 3 | 5 | 15 | Critical - Immediate |
| LLM data exfiltration | 3 | 4 | 12 | Critical - Immediate |
| Uncontrolled AI token spend | 4 | 3 | 12 | Critical - Immediate |
| AI-generated vulnerability at scale | 3 | 4 | 12 | Critical - Immediate |
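The scores in the table are consistent with probability multiplied by impact. A minimal sketch of that calculation follows; the tier thresholds (12 and above for Critical, 9 to 11 for High) are assumptions inferred from the rows above, not a stated rule, and the `Risk` class and `mitigation_priority` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: int  # 1-5, as in the table
    impact: int       # 1-5, as in the table

    @property
    def score(self) -> int:
        # Score column = probability x impact
        return self.probability * self.impact


def mitigation_priority(score: int) -> str:
    # Thresholds are assumptions inferred from the table rows.
    if score >= 12:
        return "Critical - Immediate"
    if score >= 9:
        return "High - Plan within 30 days"
    return "Unclassified"  # the table defines no tier below 9


risks = [
    Risk("Control plane outage", 3, 5),
    Risk("Vendor discontinuation", 2, 5),
    Risk("Uncontrolled AI token spend", 4, 3),
]

# Print risks from highest to lowest score.
for risk in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{risk.name}: {risk.score} ({mitigation_priority(risk.score)})")
```

Keeping the thresholds in one function makes it easy to adjust the tiers once your own risk appetite is agreed.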
Migration Failure Scenarios & Mitigation
Prepare for common migration failures with proven mitigation strategies
Scenario: Control Plane Becomes Unresponsive During Peak Hours
Impact
- All developers unable to access workspaces
- Active work sessions terminated
- Potential data loss in unsaved work
Mitigation
- Deploy HA control plane (3+ replicas)
- Enable workspace persistence during outages
- Configure auto-save intervals (every 30s)
Rollback Trigger
- Outage > 4 hours in production
- 3+ incidents in 7 days
- Developer productivity < 50% of pre-migration baseline
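The three rollback triggers for this scenario can be evaluated mechanically. The sketch below assumes that productivity is expressed as a ratio against a pre-migration baseline and that incidents are recorded by start time; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def should_roll_back(
    incident_starts: List[datetime],
    longest_outage: timedelta,
    productivity_ratio: float,
    now: datetime,
) -> bool:
    """Return True if any control-plane rollback trigger fires."""
    window_start = now - timedelta(days=7)
    recent = [t for t in incident_starts if window_start <= t <= now]
    return (
        longest_outage > timedelta(hours=4)  # outage > 4 hours
        or len(recent) >= 3                  # 3+ incidents in 7 days
        or productivity_ratio < 0.5          # productivity < 50% (assumed vs. baseline)
    )
```

Any single trigger is sufficient here; teams that prefer a stricter policy could require two of three instead.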
Scenario: Network Latency Makes Development Unusable
Impact
- Keystroke delays >200ms
- IDE features timeout or fail
- Developer frustration and workarounds
Mitigation
- Deploy in multiple regions
- Use WireGuard-based networking (e.g., Tailscale) for direct, low-overhead connections to workspaces
- Enable local file sync with Mutagen
Rollback Trigger
- P95 latency > 150ms sustained
- Developer survey score < 3/5
- Requests to return to local development from > 20% of developers
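The latency trigger depends on how "P95" and "sustained" are measured. The sketch below uses a nearest-rank percentile and treats "sustained" as three consecutive measurement windows above the threshold; both choices, and the function names, are assumptions rather than definitions from this guide.

```python
from typing import List


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_trigger(
    windows: List[List[float]],
    threshold_ms: float = 150.0,
    sustained_windows: int = 3,  # assumed meaning of "sustained"
) -> bool:
    """True if the most recent windows all have P95 above the threshold."""
    if len(windows) < sustained_windows:
        return False
    return all(p95(w) > threshold_ms for w in windows[-sustained_windows:])
```

Requiring several consecutive windows keeps a single network blip from triggering a rollback discussion.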
Scenario: Vendor Announces Product Discontinuation
Impact
- 12-18 month migration timeline
- Template/automation rewrite required
- Training and process changes
Mitigation
- Use Terraform for infrastructure portability
- DevContainers for portable configs
- Maintain alternative vendor relationship
Rollback Trigger
- Sunset notice giving < 18 months of lead time
- Acquisition by competitor
- Key feature removal announcement
Scenario: AI Agent Escapes Sandbox or Exfiltrates Data
Impact
- Source code or secrets sent to external LLM APIs
- Autonomous agent modifies production infrastructure
- Compliance violation if PII enters model context
Mitigation
- Run agents in Firecracker microVM sandboxes
- Enforce network egress allowlists per workspace
- Require human-in-the-loop for destructive operations
Rollback Trigger
- Any confirmed data exfiltration event
- Agent action outside approved scope
- Sandbox breakout detected in monitoring
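The human-in-the-loop mitigation can be illustrated as a gate in front of agent command execution. This is a minimal sketch: the `DESTRUCTIVE_PREFIXES` list is hypothetical and deliberately short, and prefix matching is easy to evade (for example with reordered flags), so it would complement sandboxing rather than replace it.

```python
import shlex
from typing import Optional

# Hypothetical examples of commands that should never run unattended.
DESTRUCTIVE_PREFIXES = (
    ("rm", "-rf"),
    ("terraform", "destroy"),
    ("kubectl", "delete"),
    ("git", "push", "--force"),
)


def requires_approval(command: str) -> bool:
    """True if the command starts with one of the destructive prefixes."""
    tokens = tuple(shlex.split(command))
    return any(tokens[: len(prefix)] == prefix for prefix in DESTRUCTIVE_PREFIXES)


def run_agent_command(command: str, approved_by: Optional[str] = None) -> str:
    """Decide whether a command may run; this sketch does not execute anything."""
    if requires_approval(command) and approved_by is None:
        return "blocked: human approval required"
    return f"allowed: {command}"
```

For example, `run_agent_command("kubectl delete namespace prod")` is blocked until a reviewer name is passed as `approved_by`.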
Scenario: LLM Token Costs Spike Beyond Budget
Impact
- Autonomous agents running overnight burn tokens
- Monthly AI spend exceeds entire CDE budget
- No per-team or per-project cost attribution
Mitigation
- Set per-user and per-team token budgets with hard caps
- Auto-terminate agent sessions exceeding time limits
- Deploy LLM gateway proxy for cost tracking and limits
Rollback Trigger
- Token spend > 150% of monthly budget
- Single agent session > $500 with no output
- No ROI improvement after 90-day evaluation
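A hard cap means a request is refused before it is sent, not reported afterwards. The sketch below shows that idea for per-team budgets, together with the 150% rollback check; `LLMGateway`, `BudgetExceeded`, and `spend_rollback_trigger` are illustrative names, and a real gateway would also need persistence and per-request cost estimation.

```python
from typing import Dict


class BudgetExceeded(Exception):
    """Raised when a request would push a team past its hard cap."""


class LLMGateway:
    def __init__(self, team_limits_usd: Dict[str, float]) -> None:
        self.limits = dict(team_limits_usd)
        self.spent = {team: 0.0 for team in team_limits_usd}

    def record(self, team: str, cost_usd: float) -> None:
        """Charge a team, refusing the charge if it would exceed the cap."""
        projected = self.spent[team] + cost_usd
        if projected > self.limits[team]:
            raise BudgetExceeded(
                f"{team} would reach ${projected:.2f} of ${self.limits[team]:.2f}"
            )
        self.spent[team] = projected


def spend_rollback_trigger(actual_usd: float, budget_usd: float) -> bool:
    """Rollback trigger from the list above: spend > 150% of monthly budget."""
    return actual_usd > 1.5 * budget_usd
```

Because `record` rejects the charge before updating the total, spend attributed through the gateway cannot exceed the cap; the 150% trigger is still useful for spend that bypasses it.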
Rollback Procedures
Step-by-step procedures for different rollback scenarios
Rollback Decision Timeline
- Immediate Response: Investigate the issue, engage the platform team, and communicate status to affected developers.
- Escalation: Engage vendor support (if applicable), prepare a partial rollback, and enable the local development fallback.
- Partial Rollback Decision: Enable hybrid mode, where critical teams return to local development and non-critical teams stay on the CDE.
- Full Rollback: Execute the full rollback procedure and transition all developers to local development.
Full Rollback Procedure
Export All Workspace Data
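# Export all user workspace files
# NOTE: subcommand names differ across Coder CLI versions; confirm each one with
# `coder --help` before relying on this script (Coder v2, for example, lists
# workspaces with `coder list`).
mkdir -p ./backups/templates
for workspace in $(coder workspaces list --all -o json | jq -r '.[].name'); do
  # Archive the projects directory inside the workspace, then copy it out
  coder ssh "$workspace" "tar -czf /tmp/workspace-backup.tar.gz ~/projects"
  coder scp "$workspace:/tmp/workspace-backup.tar.gz" "./backups/$workspace.tar.gz"
done
# Export configuration and templates
coder templates export --all -o ./backups/templates/
kubectl get configmap -n coder -o yaml > ./backups/k8s-configs.yaml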
Notify All Stakeholders
# Send notification via Slack/Teams/Email
Subject: [ACTION REQUIRED] CDE Rollback in Progress
Dear Developers,
Due to [REASON], we are initiating a rollback to local development.
Timeline:
- [TIME]: Begin workspace data export
- [TIME+2h]: Disable new workspace creation
- [TIME+4h]: All workspaces terminated
- [TIME+6h]: Local dev environment required
Action Required:
1. Save all current work immediately
2. Pull local copies of your repositories
3. Set up local development environment per: [WIKI_LINK]
Support: #platform-engineering or page Platform On-Call
Restore Local Development
# Re-enable local development permissions
# (Adjust based on your security controls)
# Restore local admin rights (Windows, elevated PowerShell)
Add-LocalGroupMember -Group "Administrators" -Member "DOMAIN\Developers"
# Re-enable the Windows Containers feature (Docker Desktop itself may also need WSL 2 or Hyper-V enabled)
Enable-WindowsOptionalFeature -FeatureName Containers -Online
# Distribute local development scripts (run from a Bash shell such as Git Bash or WSL)
git clone https://github.com/company/local-dev-setup
cd local-dev-setup && ./setup.sh
Post-Rollback Verification
# Verify developer environment status
# Send survey to all affected developers
curl -X POST "https://forms.company.com/api/submit" \
  -d "survey_id=rollback-verification" \
  -d "questions=local_env_working,data_restored,blockers"
# Schedule retrospective
# Document lessons learned
# Update risk assessment based on actual experience
AI Agent Risk Management in CDEs
In 2026, AI coding agents run autonomously inside CDE workspaces. New risks require new controls.
Why AI Agent Risks Are Different
Unlike traditional CDE risks where humans are in the loop, AI coding agents (Claude Code, Copilot Agent, Cursor, Devin, Windsurf) operate autonomously inside workspaces. They can read files, execute commands, make network requests, and modify code without human approval on every action. A misconfigured agent in a CDE has the same blast radius as a compromised developer account - but it acts faster and at scale.
LLM Data Flow Risks
Every AI agent interaction with a hosted model sends workspace context to an LLM provider. Understand what leaves your CDE.
- File contents in context windows: Agents read source files and send them as LLM prompt context. Secrets in .env files, hardcoded credentials, and proprietary algorithms can all be transmitted.
- Terminal output in agent loops: Agents execute commands and feed output back to the LLM. Database connection strings, API responses with customer data, and error messages with internal URLs can all leak.
- Model provider data retention: Not all LLM providers offer zero-retention agreements. Verify whether your provider trains on inputs or retains conversation logs.
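One way to reduce what leaves the CDE is to scrub file contents and terminal output before they are added to a prompt. The sketch below is illustrative only: the three patterns are hypothetical examples covering key/value secrets, credentials embedded in URLs, and the AWS access key ID format, and a production setup would rely on a maintained secret-scanning rule set.

```python
import re

# Hypothetical patterns for illustration; not an exhaustive rule set.
KEY_VALUE = re.compile(r"(?i)(password|secret|token|api_key)(\s*[=:]\s*)(\S+)")
URL_CREDENTIALS = re.compile(r"(\w+://)[^\s/:@]+:[^\s/@]+@")
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact(text: str) -> str:
    """Replace likely secrets with a placeholder before text reaches an LLM."""
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)         # keep the key, hide the value
    text = URL_CREDENTIALS.sub(r"\1[REDACTED]@", text)     # hide user:password in URLs
    text = AWS_ACCESS_KEY_ID.sub("[REDACTED]", text)
    return text
```

Applied to terminal output such as `postgres://app:s3cr3t@db.internal:5432/orders`, the function keeps the host and database name but drops the credentials.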
Agent Autonomy Controls
Define boundaries for what AI agents can and cannot do inside CDE workspaces.
- Filesystem scope limits: Restrict agents to project directories only. Block access to ~/.ssh, ~/.aws, /etc, and other sensitive paths via workspace policy.
- Network egress allowlists: Only allow agent workspaces to reach approved LLM API endpoints, package registries, and internal services. Block all other outbound traffic.
- Session time and token budgets: Set hard limits on agent session duration (e.g., 4 hours max) and per-session token budgets to prevent runaway costs.
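The first two controls reduce to two checks: is a path inside the project directory, and is a destination host on the allowlist. The sketch below shows the logic at application level; the host names in `ALLOWED_HOSTS` are hypothetical, and real enforcement belongs in the sandbox and network layer, where an agent cannot bypass it.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical allowlist: an internal LLM gateway and one package registry.
ALLOWED_HOSTS = {"llm-gateway.internal.example", "pypi.org"}


def path_in_scope(candidate: str, project_root: str) -> bool:
    """True if candidate resolves to project_root or something beneath it."""
    root = Path(project_root).resolve()
    # resolve() collapses ".." segments, so traversal attempts land outside root.
    target = (root / candidate).resolve()
    return target == root or root in target.parents


def egress_allowed(url: str) -> bool:
    """True if the URL's host is on the allowlist."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS
```

Resolving the path before comparing is the important step: a plain string prefix check would accept `../../etc/passwd` relative to the project root.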
AI Agent Governance Framework for CDEs
| Control Area | Minimum Requirement | Recommended (2026) | Priority |
|---|---|---|---|
| Workspace isolation | Container per agent session | Firecracker microVM per agent session | P0 |
| Network egress | Allowlist for LLM API endpoints | Zero-trust network with per-request auth | P0 |
| Secret management | No secrets in workspace filesystem | Vault-injected, auto-rotating, agent-scoped tokens | P0 |
| Cost controls | Per-team monthly token budgets | Per-session limits with LLM gateway proxy | P1 |
| Audit logging | Log all agent commands and file changes | Full prompt/response logging with retention policy | P1 |
| Code review gates | Human review before merge | Automated SAST/DAST scan on all AI-generated code | P1 |
Vendor Exit Strategy
Ensure portability and reduce lock-in from day one
Portability Checklist
- Use Terraform for all infrastructure: Avoid vendor-proprietary template formats.
- DevContainer specification for configs: Works across VS Code, Codespaces, Ona (formerly Gitpod), and Coder.
- Standard container images: No vendor-specific base images or extensions.
- Document all vendor-specific features used: Maintain migration notes for each feature.
- Regular data export testing: Quarterly validation of export/restore procedures.
- Maintain alternative vendor evaluation: Annual review of market alternatives.
- AI agent configurations as code: Store agent rules, allowlists, and token budgets in version control rather than in vendor dashboards.
- LLM provider portability plan: Use LLM gateway proxies to abstract provider APIs and enable rapid LLM switching.
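The provider portability item amounts to routing all agent traffic through one interface so that the provider behind it can change without touching callers. A minimal sketch under that assumption follows; `LLMProvider`, `EchoProvider`, and `Gateway` are illustrative names, and `EchoProvider` is a stand-in where a real adapter would call a vendor SDK.

```python
from typing import Dict, Protocol


class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider for illustration; it only echoes the prompt."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class Gateway:
    """Single entry point for agents; the active provider can be swapped."""

    def __init__(self, providers: Dict[str, LLMProvider], active: str) -> None:
        self._providers = providers
        self.switch(active)

    def switch(self, name: str) -> None:
        if name not in self._providers:
            raise KeyError(f"unknown provider: {name}")
        self._active = name

    def complete(self, prompt: str) -> str:
        return self._providers[self._active].complete(prompt)
```

Callers only ever see `Gateway.complete`, so switching providers is a configuration change that can itself live in version control.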
Migration Paths
- Coder to Ona (Moderate Effort): Terraform templates must be rewritten in Ona's configuration format, but DevContainers work as-is.
- Any CDE to GitHub Codespaces (Low Effort): DevContainers are fully compatible, but repositories must be hosted on GitHub under a plan that includes Codespaces.
- CDE to Local Development (High Effort): DevContainers run locally, but security controls may need adjustment.
- Self-Hosted to Managed (Moderate Effort): Offloads operations, but some customization may be lost.
- Switch AI Agent / LLM Provider (Low Effort with proxy): An LLM gateway proxy makes switching providers straightforward, but agent rule files need adaptation.
Continue Your Planning
Related resources for comprehensive CDE implementation
- Pilot Program Design: Structure your pilot for success with selection criteria and metrics.
- High Availability & DR: Multi-region deployment and disaster recovery strategies.
- Vendor Evaluation: Comprehensive scoring matrices and lock-in assessment.
- Agentic Engineering: AI coding agents, LLM integration, and autonomous development workflows in CDEs.