CDE Operational Runbooks

Step-by-step procedures for common operations, incident response, and maintenance tasks in your Cloud Development Environment.

Daily Operations

Routine procedures for day-to-day CDE management

Morning Health Check Runbook

Est. Time: 10-15 minutes

Step 1: Check Control Plane Status

# Coder
coder server health

# Kubernetes-based
kubectl get pods -n coder-system
kubectl get pods -n ona

# Check service endpoints
curl -s https://cde.company.com/api/v2/health | jq

Step 2: Review Overnight Alerts

# Check alertmanager for firing alerts
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'

# Review PagerDuty/Opsgenie incidents
pd incident:list --status=triggered,acknowledged

# Check Slack #cde-alerts channel
# Review email inbox for overnight alerts

Step 3: Verify Workspace Metrics

# Check active workspace count
curl -s https://cde.company.com/api/v2/workspaces | jq '[.[] | select(.latest_build.status=="running")] | length'

# Review resource utilization
kubectl top nodes
kubectl top pods -n workspaces --sort-by=memory

# Check for stuck workspaces
coder workspaces list --status=starting | grep -E "^[0-9]+ hours?"

Step 4: Check Certificate Expiration

# Check TLS certificate expiry
echo | openssl s_client -servername cde.company.com -connect cde.company.com:443 2>/dev/null | openssl x509 -noout -dates

# Check all ingress certificates
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter

Step 5: Document Status in Daily Log

Update the team's daily operations log with the items below (a scripted example follows the list):

  • All systems operational / Any issues found
  • Number of active workspaces
  • Resource utilization summary
  • Any pending maintenance or upgrades
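
A minimal sketch of what a scripted entry could look like, assuming the team keeps a shared Markdown log (the path and fields below are placeholders, and the workspace count reuses the API call from step 3):

# Append today's entry to the shared ops log (path and fields are assumptions)
LOG=/srv/ops/cde-daily-log.md
ACTIVE=$(curl -s https://cde.company.com/api/v2/workspaces | jq '[.[] | select(.latest_build.status=="running")] | length')
{
  echo "## $(date +%F) - $(whoami)"
  echo "- Status: all systems operational"
  echo "- Active workspaces: $ACTIVE"
  echo "- Resource utilization: see kubectl top nodes snapshot"
  echo "- Pending maintenance: none"
} >> "$LOG"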

New User Onboarding Runbook

Est. Time: 5-10 minutes

Step 1: Verify SSO/IdP Provisioning

# Confirm user exists in IdP group
# Azure AD
az ad group member check --group "CDE-Users" --member-id USER_OBJECT_ID

# Okta - Check group membership via admin console
# Or verify SCIM provisioning logs

Step 2: Assign Appropriate Role/Template Access

# Coder - Assign to organization and template
coder users create --email jsmith@company.com --username jsmith
coder organizations members add engineering jsmith --role member

# Grant template access
coder templates edit python-dev --group engineering

Step 3: Send Welcome Resources

Provide the new user with:

  • CDE access URL: https://cde.company.com
  • Getting started documentation
  • Onboarding video/walkthrough
  • Support Slack channel: #cde-help

Step 4: Verify First Workspace Creation

# Monitor first workspace creation
coder workspaces list --owner jsmith

# Check for any provisioning errors
coder workspaces show jsmith/my-workspace --json | jq '.latest_build.job.error'

Incident Response Runbooks

Emergency procedures for common CDE incidents

Control Plane Unavailable

SEV-1

Symptoms

  • CDE dashboard inaccessible
  • New workspaces cannot be created
  • Existing workspaces may still function but cannot be managed

Step 1: Page On-Call and Create Incident

# PagerDuty
pd incident:create --title "CDE Control Plane Down" --service CDE-CRITICAL --urgency high

# Slack notification
/incident new "CDE Control Plane Unavailable" severity:sev1

Step 2: Check Pod/Service Status

# Get pod status
kubectl get pods -n coder-system -o wide
kubectl get pods -n coder-system | grep -v Running

# Check recent events
kubectl get events -n coder-system --sort-by='.lastTimestamp' | tail -20

# Check service endpoints
kubectl get endpoints -n coder-system

Step 3: Check Database Connectivity

# Test PostgreSQL connectivity using the connection string configured in the pod
kubectl exec -it deploy/coder -n coder-system -- sh -c 'pg_isready -d "$CODER_PG_CONNECTION_URL"'

# Check database pod (if self-hosted)
kubectl get pods -n database -l app=postgresql

# Check RDS status (if AWS)
aws rds describe-db-instances --db-instance-identifier coder-db --query 'DBInstances[0].DBInstanceStatus'

Step 4: Attempt Pod Restart

# Rolling restart of control plane
kubectl rollout restart deployment/coder -n coder-system

# Monitor rollout
kubectl rollout status deployment/coder -n coder-system --timeout=300s

# If stuck, force delete problematic pods
kubectl delete pod POD_NAME -n coder-system --grace-period=0 --force

Step 5: Verify Recovery and Update Status

# Health check
curl -s https://cde.company.com/api/v2/health | jq

# Test workspace creation
coder create test-recovery --template minimal --yes
coder delete test-recovery --yes

# Update incident status
/incident update "Control plane recovered. Monitoring for stability."

Workspace Stuck in Starting/Stopping

SEV-3

Step 1: Identify the Stuck Workspace

# List workspaces by status
coder workspaces list --status=starting
coder workspaces list --status=stopping

# Get detailed info
coder workspaces show USERNAME/WORKSPACE_NAME --json | jq '.latest_build'

Step 2: Check Provisioner Logs

# Get build logs
coder workspaces logs USERNAME/WORKSPACE_NAME

# Check provisioner pods
kubectl logs -n coder-system -l app=coder-provisioner --tail=100

# Look for Terraform errors
kubectl logs -n coder-system -l app=coder-provisioner | grep -i error

Step 3: Force Cancel and Retry

# Cancel the stuck build
coder workspaces cancel USERNAME/WORKSPACE_NAME

# For stuck in stopping - force stop the underlying resources
# Kubernetes
kubectl delete pod -n workspaces -l workspace=WORKSPACE_ID --grace-period=0 --force

# Retry the operation
coder workspaces start USERNAME/WORKSPACE_NAME
# or
coder workspaces stop USERNAME/WORKSPACE_NAME

Storage Capacity Critical

SEV-2

Step 1: Identify Storage Usage

# Check PVC requested vs. provisioned capacity (live usage needs df in the pod or kubelet volume metrics)
kubectl get pvc -n workspaces -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,PROVISIONED:.status.capacity.storage

# Check node allocatable ephemeral storage
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage

# Find largest workspace volumes
kubectl exec -it STORAGE_POD -- df -h
kubectl exec -it STORAGE_POD -- du -sh /data/* | sort -rh | head -20

Step 2: Clean Up Orphaned Resources

# Find orphaned PVCs
kubectl get pvc -n workspaces --no-headers | while read pvc rest; do
  if ! coder workspaces list --all | grep -q "$pvc"; then
    echo "Orphaned: $pvc"
  fi
done

# Clean container images on nodes (assumes a Docker runtime and a debug image with the docker CLI;
# containerd-based nodes need crictl rmi --prune instead)
kubectl get nodes -o name | xargs -I {} kubectl debug {} --image=docker:cli -- \
  docker -H unix:///host/run/docker.sock system prune -af

# Clean up long-unused workspaces (assumes the age column is field 5 of the list output; verify first)
coder workspaces list --all | awk '$5 > 30 {print $1}' | xargs -I {} coder delete {} --yes

Step 3: Expand Storage (if needed)

# Expand PVC (if storage class supports it)
kubectl patch pvc WORKSPACE_PVC -n workspaces -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# Add new storage nodes
# AWS EKS
eksctl scale nodegroup --cluster=cde-cluster --name=storage-nodes --nodes=5

# Update storage quotas in templates
coder templates push TEMPLATE --var disk_size=100

AI Agent Incident Runbooks

Emergency procedures for AI coding agent failures, runaway processes, and sandbox breaches

Runaway AI Agent - Resource Exhaustion

SEV-1

Symptoms

  • AI agent consuming 100% CPU or exhausting memory in workspace
  • Uncontrolled subprocess spawning (fork bombs, recursive builds)
  • Disk usage spiking from generated files, logs, or downloaded artifacts
  • LLM API token spend exceeding budget thresholds

Step 1: Identify the Runaway Workspace

# Find workspaces with abnormal resource usage
kubectl top pods -n workspaces --sort-by=cpu | head -10
kubectl top pods -n workspaces --sort-by=memory | head -10

# Check for high process counts inside workspace
kubectl exec -it WORKSPACE_POD -n workspaces -- ps aux --sort=-%cpu | head -20

# Review agent session logs
kubectl logs WORKSPACE_POD -n workspaces --tail=200 | grep -i "agent\|llm\|copilot"

Step 2: Kill Runaway Processes and Contain

# Kill the agent process tree inside the workspace
kubectl exec -it WORKSPACE_POD -n workspaces -- pkill -9 -f "claude|copilot|aider|agent"

# If subprocess spawning is out of control, cap the pid count via cgroups
# (cgroup v1 path shown; on cgroup v2 use /sys/fs/cgroup/pids.max)
kubectl exec -it WORKSPACE_POD -n workspaces -- bash -c 'echo 50 > /sys/fs/cgroup/pids/pids.max'

# If workspace is unresponsive, force-stop it
coder workspaces stop USERNAME/WORKSPACE_NAME
kubectl delete pod WORKSPACE_POD -n workspaces --grace-period=10 --force

Step 3: Check LLM API Token Spend

# Check proxy/gateway logs for token consumption
kubectl logs -n ai-gateway -l app=litellm --tail=100 | grep -i "tokens\|cost\|budget"

# If using a cost proxy, check budget status
curl -s http://litellm-proxy:4000/budget/info | jq

# Revoke API key if spend is out of control
curl -X POST http://litellm-proxy:4000/key/delete \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"keys": ["sk-compromised-key-id"]}'

Step 4: Post-Incident Hardening

After containment, apply preventive controls (a sample resource-limit manifest follows the list):

  • Set CPU/memory resource limits on agent workspace templates
  • Configure per-user LLM API token budgets with hard caps
  • Enable max-process-count (pids.max) cgroup limits in templates
  • Add agent session timeout policies (e.g., 30-minute idle kill)
  • Set up alerting on sustained high CPU/memory in agent workspaces
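
A minimal sketch of the resource-limit control, assuming agent workspaces run in the workspaces namespace; the values are placeholders. A namespace-level LimitRange catches any workspace template that omits explicit limits:

kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: agent-workspace-limits
  namespace: workspaces
spec:
  limits:
    - type: Container
      default:              # applied when a workspace container declares no limits
        cpu: "2"
        memory: 8Gi
      defaultRequest:
        cpu: 500m
        memory: 1Gi
EOF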

AI Agent Sandbox Escape or Unauthorized Access

SEV-1

Symptoms

  • Agent accessing files, repos, or services outside its allowed scope
  • Unexpected network traffic from agent workspace (egress to unknown IPs)
  • Agent modifying system-level config, installing packages, or escalating privileges
  • Secrets or credentials appearing in agent output or logs

Step 1: Isolate and Freeze the Workspace Immediately

# Apply network policy to block all egress immediately
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate-agent
  namespace: workspaces
spec:
  podSelector:
    matchLabels:
      workspace: WORKSPACE_ID
  policyTypes:
    - Egress
  egress: []
EOF

# Freeze the workspace (do NOT delete - preserve for forensics)
coder workspaces stop USERNAME/WORKSPACE_NAME

# Revoke the user's agent API keys
curl -X POST http://litellm-proxy:4000/key/delete \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"keys": ["sk-affected-key"]}'

Step 2: Collect Forensic Evidence

# Capture the PVC definition (use a CSI VolumeSnapshot for the actual volume data before any cleanup)
kubectl get pvc -n workspaces -l workspace=WORKSPACE_ID -o json > /tmp/pvc-snapshot.json

# Export agent session logs
kubectl logs WORKSPACE_POD -n workspaces --all-containers > /tmp/agent-forensics.log

# Check network flow logs for unauthorized connections
kubectl logs -n kube-system -l app=calico-node --tail=500 | grep WORKSPACE_POD_IP

# Review audit logs for privilege escalation
kubectl logs -n workspaces WORKSPACE_POD | grep -iE "sudo|chmod|chown|curl|wget|ssh"

# Check if secrets were accessed
kubectl logs WORKSPACE_POD -n workspaces | grep -iE "password|token|secret|api.key|credentials"

Step 3: Rotate Exposed Credentials

# Rotate any secrets that were mounted in the workspace
kubectl get secret -n workspaces -l workspace=WORKSPACE_ID -o json | jq '.items[].metadata.name'

# Rotate Git credentials
# GitHub: Revoke PAT, generate new one
# GitLab: Revoke token in Admin > Credentials

# Rotate cloud provider credentials
aws iam update-access-key --access-key-id AKIA... --status Inactive --user-name cde-agent
aws iam create-access-key --user-name cde-agent

# Notify security team
/incident new "AI Agent Sandbox Escape" severity:sev1 --notify security-oncall

Step 4: Harden Agent Sandbox Configuration

After investigation, implement stronger isolation (an example egress allow-list policy follows the list):

  • Deploy agent workspaces in Firecracker microVMs instead of containers
  • Apply read-only filesystem with explicit writable paths only
  • Enforce NetworkPolicy: allow-list only required egress (Git host, LLM API, registry)
  • Use short-lived, scoped credentials injected via Vault or IRSA instead of long-lived keys
  • Enable seccomp and AppArmor profiles to restrict syscalls
  • Enable comprehensive audit logging for all agent workspace activity
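
A sketch of the egress allow-list, assuming agent workspaces carry a workspace-type: ai-agent label and that the Git host, LLM gateway, and registry sit behind the CIDR shown (label and CIDR are placeholders to adapt):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-egress-allowlist
  namespace: workspaces
spec:
  podSelector:
    matchLabels:
      workspace-type: ai-agent
  policyTypes:
    - Egress
  egress:
    - to:                        # cluster DNS
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                        # Git host, LLM gateway, registry (CIDR assumed)
        - ipBlock:
            cidr: 10.0.20.0/24
      ports:
        - protocol: TCP
          port: 443
EOF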

LLM Token Budget Exhaustion

SEV-3

Symptoms

  • Developers reporting "quota exceeded" or "rate limited" from AI coding tools
  • Team or organization LLM API spend approaching or exceeding monthly budget
  • Agent proxy returning 429 responses to workspace requests

Step 1: Assess Current Token Usage

# Check LLM proxy spend dashboard
curl -s http://litellm-proxy:4000/global/spend | jq

# Get per-user token consumption
curl -s http://litellm-proxy:4000/global/spend/keys | jq '.[] | {user: .key_alias, spend: .spend}'

# Find top consumers
curl -s http://litellm-proxy:4000/global/spend/keys | jq 'sort_by(-.spend) | .[0:10]'

# Check rate limit status
curl -s http://litellm-proxy:4000/key/info -d '{"key": "sk-user-key"}' | jq '.info.max_budget,.info.spend'

Step 2: Triage and Reallocate Budget

# Increase budget for a specific team (if within org limits)
curl -X POST http://litellm-proxy:4000/key/update \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"key": "sk-team-key", "max_budget": 500}'

# Reset spend counter for a budget period
curl -X POST http://litellm-proxy:4000/key/update \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"key": "sk-team-key", "spend": 0, "budget_duration": "30d"}'

# If needed, throttle low-priority agents to cheaper models
# Update proxy routing to use smaller models for autocomplete

Step 3: Set Up Budget Guardrails

Prevent future exhaustion with these controls (an example budget-capped key request follows the list):

  • Set per-user daily and monthly token budgets with soft and hard caps
  • Configure alerts at 50%, 80%, and 95% of budget thresholds
  • Route autocomplete to smaller, cheaper models (e.g., use a small model for autocomplete, reserve large models for chat)
  • Implement request caching at the proxy layer for repeated queries
  • Add per-request max_tokens limits to prevent single-query budget spikes
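
A sketch of issuing a budget-capped key through the LiteLLM proxy referenced above (exact field names can vary by proxy version; the alias and limits are placeholders):

# Issue a per-user key with a hard monthly budget plus token and request rate limits
curl -X POST http://litellm-proxy:4000/key/generate \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"key_alias": "jsmith", "max_budget": 50, "budget_duration": "30d", "tpm_limit": 100000, "rpm_limit": 60}'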

AI Agent Code/Config Pollution

SEV-2

Symptoms

  • Agent committed broken, hallucinated, or insecure code to shared branches
  • CI/CD pipelines failing after agent-generated commits
  • Agent modified infrastructure configs (Terraform, Helm values, Dockerfiles) unexpectedly
  • Leaked secrets, hardcoded credentials, or debug code in agent commits

Step 1: Identify and Revert Affected Commits

# Find recent agent-authored commits
git log --oneline --author="agent\|bot\|copilot\|claude" --since="24 hours ago"

# Review the diff of suspicious commits
git diff COMMIT_HASH~1 COMMIT_HASH

# Revert the problematic commit(s)
git revert COMMIT_HASH --no-edit
git push origin main

# If secrets were committed, treat as credential leak
# (see Sandbox Escape runbook, step 3)

Step 2: Restrict Agent Git Access

# Enforce branch protection - agents must use feature branches + PRs
# GitHub (note: this endpoint expects the full protection object, so enforce_admins,
# restrictions, and required_status_checks.contexts must also be supplied)
gh api repos/ORG/REPO/branches/main/protection -X PUT \
  -F 'required_pull_request_reviews[required_approving_review_count]=1' \
  -F 'required_status_checks[strict]=true'

# GitLab: protect main branch
# Settings > Repository > Protected Branches > main > "No one" can push

# Restrict agent Git credentials to fork-only or branch-only access

Step 3: Establish Agent Code Quality Gates

Prevent future incidents with automated guardrails (a sample pre-commit secret-scanning hook follows the list):

  • Require human approval on all PRs from agent-authored branches
  • Add pre-commit hooks for secret scanning (gitleaks, trufflehog)
  • Run SAST/DAST scans on agent-generated code before merge
  • Enforce .gitignore patterns that block agent temp files and logs
  • Block direct pushes to main/release branches from agent service accounts
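
A sketch of the secret-scanning gate as a local pre-commit hook, assuming gitleaks is installed on the workspace image (the same scan normally runs again in CI):

#!/usr/bin/env bash
# .git/hooks/pre-commit - reject commits whose staged changes contain likely secrets
set -euo pipefail
gitleaks protect --staged --redact
# Remember: chmod +x .git/hooks/pre-commit so Git runs the hook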

Maintenance Runbooks

Scheduled maintenance and upgrade procedures

Platform Upgrade

Upgrade Coder/Ona (formerly Gitpod) version

1. Review release notes and breaking changes
2. Back up database and persistent volumes
3. Notify users of maintenance window
4. Stop all running workspaces
5. Apply Helm/Terraform upgrade
6. Run database migrations
7. Verify health and test workspace creation
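
A sketch of steps 2 and 5-7 for a Helm-based Coder install (release name, namespace, values file, and version are placeholders to adapt):

# Back up the database, apply the chart upgrade, then verify rollout and health
pg_dump "$CODER_PG_CONNECTION_URL" > coder-backup-$(date +%F).sql
helm repo update
helm upgrade coder coder/coder \
  --namespace coder-system \
  --values values.yaml \
  --version NEW_VERSION
kubectl rollout status deployment/coder -n coder-system --timeout=300s
curl -s https://cde.company.com/api/v2/health | jq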

Template Update

Update workspace templates

1. Update template code in Git repository
2. Test in staging environment
3. Push new template version
4. Notify users of available update
5. Monitor workspace rebuilds
# Push template update
coder templates push python-dev \
  --directory ./templates/python-dev \
  --message "Updated Python to 3.12"

# Check active versions
coder templates versions python-dev

Certificate Renewal

TLS certificate management

1. Check certificate expiration dates
2. Generate CSR or trigger Let's Encrypt renewal
3. Update Kubernetes secrets
4. Restart ingress controller
5. Verify HTTPS connectivity
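
A sketch of the renewal flow assuming cert-manager manages the Let's Encrypt certificate (resource names and namespaces are placeholders; manually issued certs are instead replaced with kubectl create secret tls):

# Check certificate state, force a renewal, and restart the ingress controller
kubectl get certificate -A
cmctl renew cde-tls -n ingress-nginx
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx

# Confirm the new expiry date
echo | openssl s_client -servername cde.company.com -connect cde.company.com:443 2>/dev/null | openssl x509 -noout -dates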

Database Maintenance

PostgreSQL optimization

1. Check database size and growth
2. Run VACUUM ANALYZE
3. Check and rebuild indexes
4. Archive old audit logs
5. Verify backup integrity
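
A sketch of the maintenance queries, assuming direct psql access via the Coder connection string; the database name and the audit-log table/column are assumptions, so verify against your schema before deleting anything:

# Size check, vacuum/analyze, reindex, and archive of old audit rows
psql "$CODER_PG_CONNECTION_URL" -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
psql "$CODER_PG_CONNECTION_URL" -c "VACUUM (VERBOSE, ANALYZE);"
psql "$CODER_PG_CONNECTION_URL" -c "REINDEX DATABASE coder;"
psql "$CODER_PG_CONNECTION_URL" -c "DELETE FROM audit_logs WHERE \"time\" < now() - interval '90 days';"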

Disaster Recovery Runbooks

Procedures for major outages and data recovery

RTO (Recovery Time Objective): 15 minutes
RPO (Recovery Point Objective): 1 hour
Availability Target: 99.9%

Full Platform Recovery Runbook

When to Use

  • Complete control plane failure
  • Database corruption or loss
  • Cluster-wide infrastructure failure
  • Ransomware or security incident requiring full rebuild

Step 1: Assess Damage and Declare DR

# Confirm primary region is unrecoverable
aws ec2 describe-availability-zones --region us-east-1

# Declare disaster recovery
/incident update "DR declared. Initiating failover to us-west-2"

# Notify stakeholders
# - Executive team
# - All engineering teams
# - Security team

Step 2: Restore Database from Backup

# List available backups
aws rds describe-db-snapshots --db-instance-identifier coder-db

# Restore to DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-dr \
  --db-snapshot-identifier rds:coder-db-2026-01-15-00-05 \
  --availability-zone us-west-2a

# Wait for restoration
aws rds wait db-instance-available --db-instance-identifier coder-db-dr

Step 3: Deploy Control Plane to DR Region

# Switch kubectl context to DR cluster
kubectl config use-context dr-cluster-us-west-2

# Deploy Coder with DR database connection
helm upgrade --install coder coder/coder \
  --namespace coder-system \
  --values dr-values.yaml \
  --set postgres.host=coder-db-dr.xxxxx.us-west-2.rds.amazonaws.com

# Verify deployment
kubectl rollout status deployment/coder -n coder-system

Step 4: Update DNS and Load Balancers

# Update Route53 to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch file://dr-dns-change.json

# Verify DNS propagation
dig cde.company.com +short

# Update any hardcoded references in IdP

Step 5: Validate Recovery

# Health check
curl -s https://cde.company.com/api/v2/health | jq

# Test SSO login
# Manually test in browser

# Create test workspace
coder create dr-test --template minimal --yes

# Verify existing user data
coder users list | head -10

# Announce recovery
/incident update "DR complete. CDE operational in us-west-2. Please recreate workspaces."

Scaling Operations

Procedures for scaling CDE infrastructure up or down

Emergency Scale-Up

Use when workspace creation is slow or failing due to resource constraints.

# Check current capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory

# Scale node group (AWS EKS)
eksctl scale nodegroup \
  --cluster=cde-cluster \
  --name=workspace-nodes \
  --nodes=10 \
  --nodes-max=20

# Or trigger cluster autoscaler
kubectl scale deployment cluster-autoscaler -n kube-system --replicas=1

# Verify new nodes
kubectl get nodes -w

Caution

Emergency scaling may incur significant costs. Notify finance team if scaling beyond normal limits.

Scheduled Scale-Down

Use during off-hours or weekends to reduce costs.

# Stop workspaces that have been running longer than 4 hours
# (assumes the runtime column is field 6 of the list output; verify before relying on it)
coder workspaces list --status=running | \
  awk '$6 > 4 {print $1}' | \
  xargs -I {} coder workspaces stop {}

# Cordon nodes to prevent new scheduling
kubectl cordon NODE_NAME

# Drain workloads (with grace period)
kubectl drain NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Scale down node group
eksctl scale nodegroup \
  --cluster=cde-cluster \
  --name=workspace-nodes \
  --nodes=3

Automation

Consider using Kubernetes CronJobs or AWS Scheduled Scaling for automated off-hours scaling.
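
A sketch of the AWS scheduled-scaling option, assuming the workspace nodes belong to an Auto Scaling group named cde-workspace-nodes (group name, sizes, and cron schedules are placeholders):

# Scale the workspace node group down on weekday evenings and back up before work hours (UTC)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name cde-workspace-nodes \
  --scheduled-action-name offhours-scale-down \
  --recurrence "0 20 * * 1-5" \
  --min-size 3 --desired-capacity 3

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name cde-workspace-nodes \
  --scheduled-action-name morning-scale-up \
  --recurrence "0 6 * * 1-5" \
  --min-size 5 --desired-capacity 10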

Runbook Template

Use this template to create new runbooks for your team

# Runbook: [TITLE]

## Overview
- **Purpose**: [What does this runbook accomplish?]
- **Audience**: [Who should use this runbook?]
- **Estimated Time**: [How long does this take?]
- **Severity/Priority**: [SEV-1/2/3/4]

## Prerequisites
- [ ] Access to X system
- [ ] Knowledge of Y
- [ ] Required tools: Z

## Symptoms / When to Use
- Symptom 1
- Symptom 2

## Steps

### Step 1: [Action Name]
**Description**: [What are we doing and why?]

```bash
# Commands to run
command here
```

**Expected Output**: [What should we see?]

**If Failed**: [What to do if this step fails]

### Step 2: [Action Name]
...

## Verification
- [ ] Check 1
- [ ] Check 2

## Rollback Procedure
If something goes wrong:
1. Step 1
2. Step 2

## Related Runbooks
- Link to related runbook 1
- Link to related runbook 2

## Revision History
| Date | Author | Changes |
|------|--------|---------|
| 2026-01-15 | @username | Initial version |