CDE Operational Runbooks
Step-by-step procedures for common operations, incident response, and maintenance tasks in your Cloud Development Environment.
Daily Operations
Routine procedures for day-to-day CDE management
Morning Health Check Runbook
Check Control Plane Status
# Coder
coder server health
# Kubernetes-based
kubectl get pods -n coder-system
kubectl get pods -n ona
# Check service endpoints
curl -s https://cde.company.com/api/v2/health | jq
Review Overnight Alerts
# Check alertmanager for firing alerts
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'
# Review PagerDuty/Opsgenie incidents
pd incident:list --status=triggered,acknowledged
# Check Slack #cde-alerts channel
# Review email inbox for overnight alerts
Verify Workspace Metrics
# Check active workspace count
curl -s https://cde.company.com/api/v2/workspaces | jq '[.[] | select(.latest_build.status=="running")] | length'
# Review resource utilization
kubectl top nodes
kubectl top pods -n workspaces --sort-by=memory
# Check for stuck workspaces
coder workspaces list --status=starting | grep -E "[0-9]+ hours?"
Check Certificate Expiration
# Check TLS certificate expiry
echo | openssl s_client -servername cde.company.com -connect cde.company.com:443 2>/dev/null | openssl x509 -noout -dates
# Check all ingress certificates
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter
Document Status in Daily Log
Update the team's daily operations log with the following (an example entry follows the list):
- All systems operational / Any issues found
- Number of active workspaces
- Resource utilization summary
- Any pending maintenance or upgrades
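A minimal log entry might look like this (values are illustrative):
2026-01-15 Morning Health Check (@oncall)
- Status: all systems operational
- Active workspaces: 142 running / 3 starting
- Utilization: nodes averaging 61% CPU, 72% memory
- Pending: control plane upgrade scheduled 2026-01-18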
New User Onboarding Runbook
Verify SSO/IdP Provisioning
# Confirm user exists in IdP group
# Azure AD
az ad group member check --group "CDE-Users" --member-id USER_OBJECT_ID
# Okta - Check group membership via admin console
# Or verify SCIM provisioning logs
Assign Appropriate Role/Template Access
# Coder - Assign to organization and template
coder users create --email jsmith@company.com --username jsmith
coder organizations members add engineering jsmith --role member
# Grant template access
coder templates edit python-dev --group engineering
Send Welcome Resources
Provide the new user with:
- CDE access URL: https://cde.company.com
- Getting started documentation
- Onboarding video/walkthrough
- Support Slack channel: #cde-help
Verify First Workspace Creation
# Monitor first workspace creation
coder workspaces list --owner jsmith
# Check for any provisioning errors
coder workspaces show jsmith/my-workspace --json | jq '.latest_build.job.error'
Incident Response Runbooks
Emergency procedures for common CDE incidents
Control Plane Unavailable
SEV-1
Symptoms
- CDE dashboard inaccessible
- New workspaces cannot be created
- Existing workspaces may still function but cannot be managed
Page On-Call and Create Incident
# PagerDuty
pd incident:create --title "CDE Control Plane Down" --service CDE-CRITICAL --urgency high
# Slack notification
/incident new "CDE Control Plane Unavailable" severity:sev1Check Pod/Service Status
# Get pod status
kubectl get pods -n coder-system -o wide
kubectl get pods -n coder-system | grep -v Running
# Check recent events
kubectl get events -n coder-system --sort-by='.lastTimestamp' | tail -20
# Check service endpoints
kubectl get endpoints -n coder-system
Check Database Connectivity
# Test PostgreSQL connection
kubectl exec -it deploy/coder -n coder-system -- pg_isready -d "$CODER_PG_CONNECTION_URL"
# Check database pod (if self-hosted)
kubectl get pods -n database -l app=postgresql
# Check RDS status (if AWS)
aws rds describe-db-instances --db-instance-identifier coder-db --query 'DBInstances[0].DBInstanceStatus'
Attempt Pod Restart
# Rolling restart of control plane
kubectl rollout restart deployment/coder -n coder-system
# Monitor rollout
kubectl rollout status deployment/coder -n coder-system --timeout=300s
# If stuck, force delete problematic pods
kubectl delete pod POD_NAME -n coder-system --grace-period=0 --force
Verify Recovery and Update Status
# Health check
curl -s https://cde.company.com/api/v2/health | jq
# Test workspace creation
coder create test-recovery --template minimal --yes
coder delete test-recovery --yes
# Update incident status
/incident update "Control plane recovered. Monitoring for stability."Workspace Stuck in Starting/Stopping
SEV-3
Identify the Stuck Workspace
# List workspaces by status
coder workspaces list --status=starting
coder workspaces list --status=stopping
# Get detailed info
coder workspaces show USERNAME/WORKSPACE_NAME --json | jq '.latest_build'
Check Provisioner Logs
# Get build logs
coder workspaces logs USERNAME/WORKSPACE_NAME
# Check provisioner pods
kubectl logs -n coder-system -l app=coder-provisioner --tail=100
# Look for Terraform errors
kubectl logs -n coder-system -l app=coder-provisioner | grep -i error
Force Cancel and Retry
# Cancel the stuck build
coder workspaces cancel USERNAME/WORKSPACE_NAME
# For stuck in stopping - force stop the underlying resources
# Kubernetes
kubectl delete pod -n workspaces -l workspace=WORKSPACE_ID --grace-period=0 --force
# Retry the operation
coder workspaces start USERNAME/WORKSPACE_NAME
# or
coder workspaces stop USERNAME/WORKSPACE_NAME
Storage Capacity Critical
SEV-2
Identify Storage Usage
# Check PVC requested vs. provisioned capacity
kubectl get pvc -n workspaces -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,CAPACITY:.status.capacity.storage
# Check node disk usage
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage
# Find largest workspace volumes
kubectl exec -it STORAGE_POD -- df -h
kubectl exec -it STORAGE_POD -- du -sh /data/* | sort -rh | head -20
Clean Up Orphaned Resources
# Find orphaned PVCs
kubectl get pvc -n workspaces --no-headers | while read pvc rest; do
  if ! coder workspaces list --all | grep -q "$pvc"; then
    echo "Orphaned: $pvc"
  fi
done
# Clean container images on nodes (kubectl debug requires an --image; this assumes a
# Docker runtime -- on containerd nodes use crictl from the node debug pod instead)
kubectl get nodes -o name | xargs -I {} kubectl debug {} --image=docker:cli -- docker -H unix:///host/run/docker.sock system prune -af
# Delete workspaces unused for 30+ days (verify the column index against your CLI output first)
coder workspaces list --all | awk '$5 > 30 {print $1}' | xargs -I {} coder delete {} --yes
Expand Storage (if needed)
# Expand PVC (if storage class supports it)
kubectl patch pvc WORKSPACE_PVC -n workspaces -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
# Add new storage nodes
# AWS EKS
eksctl scale nodegroup --cluster=cde-cluster --name=storage-nodes --nodes=5
# Update storage quotas in templates
coder templates push TEMPLATE --var disk_size=100
AI Agent Incident Runbooks
Emergency procedures for AI coding agent failures, runaway processes, and sandbox breaches
Runaway AI Agent - Resource Exhaustion
SEV-1
Symptoms
- AI agent consuming 100% CPU or exhausting memory in workspace
- Uncontrolled subprocess spawning (fork bombs, recursive builds)
- Disk usage spiking from generated files, logs, or downloaded artifacts
- LLM API token spend exceeding budget thresholds
Identify the Runaway Workspace
# Find workspaces with abnormal resource usage
kubectl top pods -n workspaces --sort-by=cpu | head -10
kubectl top pods -n workspaces --sort-by=memory | head -10
# Check for high process counts inside workspace
kubectl exec -it WORKSPACE_POD -n workspaces -- ps aux --sort=-%cpu | head -20
# Review agent session logs
kubectl logs WORKSPACE_POD -n workspaces --tail=200 | grep -i "agent\|llm\|copilot"
Kill Runaway Processes and Contain
# Kill the agent process tree inside the workspace (pkill -f takes an extended regex)
kubectl exec -it WORKSPACE_POD -n workspaces -- pkill -9 -f "claude|copilot|aider|agent"
# If subprocess spawning is out of control, cap the pid count via cgroups
# (cgroup v1 path shown; on cgroup v2 write to /sys/fs/cgroup/pids.max)
kubectl exec -it WORKSPACE_POD -n workspaces -- bash -c 'echo 50 > /sys/fs/cgroup/pids/pids.max'
# If workspace is unresponsive, force-stop it
coder workspaces stop USERNAME/WORKSPACE_NAME
kubectl delete pod WORKSPACE_POD -n workspaces --grace-period=10 --force
Check LLM API Token Spend
# Check proxy/gateway logs for token consumption
kubectl logs -n ai-gateway -l app=litellm --tail=100 | grep -i "tokens\|cost\|budget"
# If using a cost proxy, check budget status
curl -s http://litellm-proxy:4000/budget/info | jq
# Revoke API key if spend is out of control
curl -X POST http://litellm-proxy:4000/key/delete \
-H "Authorization: Bearer ADMIN_KEY" \
-d '{"keys": ["sk-compromised-key-id"]}'Post-Incident Hardening
After containment, apply preventive controls (a resource-limit sketch follows the list):
- Set CPU/memory resource limits on agent workspace templates
- Configure per-user LLM API token budgets with hard caps
- Enable max-process-count (pids.max) cgroup limits in templates
- Add agent session timeout policies (e.g., 30-minute idle kill)
- Set up alerting on sustained high CPU/memory in agent workspaces
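As a minimal sketch of the resource-limit control at the namespace level, a LimitRange can impose default CPU/memory caps on any workspace container that does not declare its own; the name and values below are illustrative:
# Sketch: default CPU/memory caps for workspace containers (illustrative values)
kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: agent-workspace-limits
  namespace: workspaces
spec:
  limits:
  - type: Container
    default:
      cpu: "2"
      memory: 8Gi
    defaultRequest:
      cpu: 500m
      memory: 1Gi
EOF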
AI Agent Sandbox Escape or Unauthorized Access
SEV-1
Symptoms
- Agent accessing files, repos, or services outside its allowed scope
- Unexpected network traffic from agent workspace (egress to unknown IPs)
- Agent modifying system-level config, installing packages, or escalating privileges
- Secrets or credentials appearing in agent output or logs
Isolate and Freeze the Workspace Immediately
# Apply network policy to block all egress immediately
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate-agent
  namespace: workspaces
spec:
  podSelector:
    matchLabels:
      workspace: WORKSPACE_ID
  policyTypes:
  - Egress
  egress: []
EOF
# Freeze the workspace (do NOT delete - preserve for forensics)
coder workspaces stop USERNAME/WORKSPACE_NAME
# Revoke the user's agent API keys
curl -X POST http://litellm-proxy:4000/key/delete \
-H "Authorization: Bearer ADMIN_KEY" \
-d '{"keys": ["sk-affected-key"]}'Collect Forensic Evidence
# Capture the PVC definition (take a CSI VolumeSnapshot separately to preserve the data itself)
kubectl get pvc -n workspaces -l workspace=WORKSPACE_ID -o json > /tmp/pvc-snapshot.json
# Export agent session logs
kubectl logs WORKSPACE_POD -n workspaces --all-containers > /tmp/agent-forensics.log
# Check network flow logs for unauthorized connections
kubectl logs -n kube-system -l app=calico-node --tail=500 | grep WORKSPACE_POD_IP
# Review audit logs for privilege escalation
kubectl logs -n workspaces WORKSPACE_POD | grep -iE "sudo|chmod|chown|curl|wget|ssh"
# Check if secrets were accessed
kubectl logs WORKSPACE_POD -n workspaces | grep -iE "password|token|secret|api.key|credentials"
Rotate Exposed Credentials
# Rotate any secrets that were mounted in the workspace
kubectl get secret -n workspaces -l workspace=WORKSPACE_ID -o json | jq '.items[].metadata.name'
# Rotate Git credentials
# GitHub: Revoke PAT, generate new one
# GitLab: Revoke token in Admin > Credentials
# Rotate cloud provider credentials
aws iam update-access-key --access-key-id AKIA... --status Inactive --user-name cde-agent
aws iam create-access-key --user-name cde-agent
# Notify security team
/incident new "AI Agent Sandbox Escape" severity:sev1 --notify security-oncallHarden Agent Sandbox Configuration
After investigation, implement stronger isolation (an egress allow-list sketch follows the list):
- Deploy agent workspaces in Firecracker microVMs instead of containers
- Apply read-only filesystem with explicit writable paths only
- Enforce NetworkPolicy: allow-list only required egress (Git host, LLM API, registry)
- Use short-lived, scoped credentials injected via Vault or IRSA instead of long-lived keys
- Enable seccomp and AppArmor profiles to restrict syscalls
- Enable comprehensive audit logging for all agent workspace activity
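A minimal egress allow-list sketch is below; the pod label, CIDR, and ports are placeholders to replace with your actual Git host, LLM API, and registry addresses:
# Sketch: allow-list egress for agent workspaces (label, CIDR, and ports are placeholders)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-egress-allowlist
  namespace: workspaces
spec:
  podSelector:
    matchLabels:
      agent-enabled: "true"
  policyTypes:
  - Egress
  egress:
  - ports:                      # DNS resolution
    - protocol: UDP
      port: 53
  - to:                         # Git host / LLM API / registry (replace CIDR)
    - ipBlock:
        cidr: 203.0.113.0/24
    ports:
    - protocol: TCP
      port: 443
EOF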
LLM Token Budget Exhaustion
SEV-3
Symptoms
- Developers reporting "quota exceeded" or "rate limited" from AI coding tools
- Team or organization LLM API spend approaching or exceeding monthly budget
- Agent proxy returning 429 responses to workspace requests
Assess Current Token Usage
# Check LLM proxy spend dashboard
curl -s http://litellm-proxy:4000/global/spend | jq
# Get per-user token consumption
curl -s http://litellm-proxy:4000/global/spend/keys | jq '.[] | {user: .key_alias, spend: .spend}'
# Find top consumers
curl -s http://litellm-proxy:4000/global/spend/keys | jq 'sort_by(-.spend) | .[0:10]'
# Check rate limit status
curl -s 'http://litellm-proxy:4000/key/info?key=sk-user-key' | jq '.info.max_budget,.info.spend'
Triage and Reallocate Budget
# Increase budget for a specific team (if within org limits)
curl -X POST http://litellm-proxy:4000/key/update \
-H "Authorization: Bearer ADMIN_KEY" \
-d '{"key": "sk-team-key", "max_budget": 500}'
# Reset spend counter for a budget period
curl -X POST http://litellm-proxy:4000/key/update \
-H "Authorization: Bearer ADMIN_KEY" \
-d '{"key": "sk-team-key", "spend": 0, "budget_duration": "30d"}'
# If needed, throttle low-priority agents to cheaper models
# Update proxy routing to use smaller models for autocomplete
Set Up Budget Guardrails
Prevent future exhaustion with these controls (a key-issuance sketch follows the list):
- Set per-user daily and monthly token budgets with soft and hard caps
- Configure alerts at 50%, 80%, and 95% of budget thresholds
- Route autocomplete requests to smaller, cheaper models; reserve large models for chat and complex edits
- Implement request caching at the proxy layer for repeated queries
- Add per-request max_tokens limits to prevent single-query budget spikes
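As one way to implement hard caps, a LiteLLM-style proxy can issue keys with a budget and rate limit attached; the endpoint and field names vary by proxy version, so treat this as a sketch:
# Sketch: issue a per-user key with a hard monthly budget and token rate limit
# (LiteLLM-style /key/generate; field names may differ across versions)
curl -X POST http://litellm-proxy:4000/key/generate \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"key_alias": "jsmith-agent", "max_budget": 100, "budget_duration": "30d", "tpm_limit": 50000}'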
AI Agent Code/Config Pollution
SEV-2
Symptoms
- Agent committed broken, hallucinated, or insecure code to shared branches
- CI/CD pipelines failing after agent-generated commits
- Agent modified infrastructure configs (Terraform, Helm values, Dockerfiles) unexpectedly
- Leaked secrets, hardcoded credentials, or debug code in agent commits
Identify and Revert Affected Commits
# Find recent agent-authored commits
git log --oneline --author="agent\|bot\|copilot\|claude" --since="24 hours ago"
# Review the diff of suspicious commits
git diff COMMIT_HASH~1 COMMIT_HASH
# Revert the problematic commit(s)
git revert COMMIT_HASH --no-edit
git push origin main
# If secrets were committed, treat as credential leak
# (see Sandbox Escape runbook, step 3)
Restrict Agent Git Access
# Enforce branch protection - agents must use feature branches + PRs
# GitHub (the protection endpoint requires all four top-level fields; "ci" is your status check name)
gh api repos/ORG/REPO/branches/main/protection -X PUT \
  -F 'required_pull_request_reviews[required_approving_review_count]=1' \
  -F 'required_status_checks[strict]=true' \
  -F 'required_status_checks[contexts][]=ci' \
  -F 'enforce_admins=true' \
  -F 'restrictions=null'
# GitLab: protect main branch
# Settings > Repository > Protected Branches > main > "No one" can push
# Restrict agent Git credentials to fork-only or branch-only access
Establish Agent Code Quality Gates
Prevent future incidents with automated guardrails (a secret-scanning sketch follows the list):
- Require human approval on all PRs from agent-authored branches
- Add pre-commit hooks for secret scanning (gitleaks, trufflehog)
- Run SAST/DAST scans on agent-generated code before merge
- Enforce .gitignore patterns that block agent temp files and logs
- Block direct pushes to main/release branches from agent service accounts
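For the secret-scanning hook, a minimal pre-commit setup with gitleaks might look like this (pin rev to a release you have vetted):
# Sketch: block secrets at commit time with gitleaks via pre-commit
cat > .pre-commit-config.yaml <<EOF
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.2
    hooks:
      - id: gitleaks
EOF
pre-commit install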
Maintenance Runbooks
Scheduled maintenance and upgrade procedures
Platform Upgrade
Upgrade Coder/Ona (formerly Gitpod) version
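A typical Helm-based upgrade flow, sketched with the chart alias used elsewhere in this document (back up the database and read the release notes first; the chart version is a placeholder):
# Sketch: upgrade the Coder control plane via Helm
aws rds create-db-snapshot --db-instance-identifier coder-db --db-snapshot-identifier pre-upgrade-$(date +%F)
helm repo update
helm upgrade coder coder/coder \
  --namespace coder-system \
  --values values.yaml \
  --version NEW_CHART_VERSION
kubectl rollout status deployment/coder -n coder-system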
Template Update
Update workspace templates
# Push template update
coder templates push python-dev \
--directory ./templates/python-dev \
--message "Updated Python to 3.12"
# Check active versions
coder templates versions list python-dev
Certificate Renewal
TLS certificate management
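If cert-manager handles your TLS, renewal can be checked and forced roughly as follows (the certificate name cde-tls is a placeholder):
# Sketch: check and force-renew a cert-manager certificate
kubectl get certificate -n coder-system
cmctl renew cde-tls --namespace coder-system
kubectl get certificate cde-tls -n coder-system -o jsonpath='{.status.notAfter}'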
Database Maintenance
PostgreSQL optimization
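Routine maintenance might include the following, run during a low-traffic window (the connection string reuses the variable from the incident runbooks; the index name is a placeholder):
# Sketch: routine PostgreSQL maintenance for the control plane database
psql "$CODER_PG_CONNECTION_URL" -c 'VACUUM (VERBOSE, ANALYZE);'
# Surface the tables with the most dead tuples
psql "$CODER_PG_CONNECTION_URL" -c 'SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;'
# Rebuild a bloated index without blocking writes (PostgreSQL 12+; replace the index name)
psql "$CODER_PG_CONNECTION_URL" -c 'REINDEX INDEX CONCURRENTLY SOME_INDEX_NAME;'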
Disaster Recovery Runbooks
Procedures for major outages and data recovery
Full Platform Recovery Runbook
When to Use
- Complete control plane failure
- Database corruption or loss
- Cluster-wide infrastructure failure
- Ransomware or security incident requiring full rebuild
Assess Damage and Declare DR
# Confirm primary region is unrecoverable
aws ec2 describe-availability-zones --region us-east-1
# Declare disaster recovery
/incident update "DR declared. Initiating failover to us-west-2"
# Notify stakeholders
# - Executive team
# - All engineering teams
# - Security team
Restore Database from Backup
# List available backups
aws rds describe-db-snapshots --db-instance-identifier coder-db
# Restore to DR region
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier coder-db-dr \
--db-snapshot-identifier rds:coder-db-2026-01-15-00-05 \
--availability-zone us-west-2a
# Wait for restoration
aws rds wait db-instance-available --db-instance-identifier coder-db-dr
Deploy Control Plane to DR Region
# Switch kubectl context to DR cluster
kubectl config use-context dr-cluster-us-west-2
# Deploy Coder with DR database connection
helm upgrade --install coder coder/coder \
--namespace coder-system \
--values dr-values.yaml \
--set postgres.host=coder-db-dr.xxxxx.us-west-2.rds.amazonaws.com
# Verify deployment
kubectl rollout status deployment/coder -n coder-system
Update DNS and Load Balancers
# Update Route53 to point to DR region
aws route53 change-resource-record-sets \
--hosted-zone-id ZXXXXXXXXXXXXX \
--change-batch file://dr-dns-change.json
# Verify DNS propagation
dig cde.company.com +short
# Update any hardcoded references in IdP
Validate Recovery
# Health check
curl -s https://cde.company.com/api/v2/health | jq
# Test SSO login
# Manually test in browser
# Create test workspace
coder create dr-test --template minimal --yes
# Verify existing user data
coder users list | head -10
# Announce recovery
/incident update "DR complete. CDE operational in us-west-2. Please recreate workspaces."Scaling Operations
Procedures for scaling CDE infrastructure up or down
Emergency Scale-Up
Use when workspace creation is slow or failing due to resource constraints.
# Check current capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
# Scale node group (AWS EKS)
eksctl scale nodegroup \
--cluster=cde-cluster \
--name=workspace-nodes \
--nodes=10 \
--nodes-max=20
# Or trigger cluster autoscaler
kubectl scale deployment cluster-autoscaler -n kube-system --replicas=1
# Verify new nodes
kubectl get nodes -w
Caution
Emergency scaling may incur significant costs. Notify finance team if scaling beyond normal limits.
Scheduled Scale-Down
Use during off-hours or weekends to reduce costs.
# Stop workspaces running longer than 4 hours (verify the awk column index for your CLI version)
coder workspaces list --status=running | \
awk '$6 > 4 {print $1}' | \
xargs -I {} coder workspaces stop {}
# Cordon nodes to prevent new scheduling
kubectl cordon NODE_NAME
# Drain workloads (with grace period)
kubectl drain NODE_NAME \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=300
# Scale down node group
eksctl scale nodegroup \
--cluster=cde-cluster \
--name=workspace-nodes \
--nodes=3
Automation
Consider using Kubernetes CronJobs or AWS Scheduled Scaling for automated off-hours scaling.
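A sketch of a CronJob-based scale-down is below; the image tag, schedule, secret name, and the awk column are all assumptions to adapt to your deployment:
# Sketch: weekday-evening workspace stop via CronJob (image, schedule, and secret are placeholders)
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-workspace-stop
  namespace: coder-system
spec:
  schedule: "0 20 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: stop-running
            image: ghcr.io/coder/coder:latest
            env:
            - name: CODER_URL
              value: https://cde.company.com
            - name: CODER_SESSION_TOKEN
              valueFrom:
                secretKeyRef:
                  name: coder-automation-token
                  key: token
            command: ["/bin/sh", "-c"]
            args:
            - coder workspaces list --status=running | awk 'NR>1 {print \$1}' | xargs -r -I {} coder workspaces stop {}
EOF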
Runbook Template
Use this template to create new runbooks for your team
# Runbook: [TITLE]
## Overview
- **Purpose**: [What does this runbook accomplish?]
- **Audience**: [Who should use this runbook?]
- **Estimated Time**: [How long does this take?]
- **Severity/Priority**: [SEV-1/2/3/4]
## Prerequisites
- [ ] Access to X system
- [ ] Knowledge of Y
- [ ] Required tools: Z
## Symptoms / When to Use
- Symptom 1
- Symptom 2
## Steps
### Step 1: [Action Name]
**Description**: [What are we doing and why?]
```bash
# Commands to run
command here
```
**Expected Output**: [What should we see?]
**If Failed**: [What to do if this step fails]
### Step 2: [Action Name]
...
## Verification
- [ ] Check 1
- [ ] Check 2
## Rollback Procedure
If something goes wrong:
1. Step 1
2. Step 2
## Related Runbooks
- Link to related runbook 1
- Link to related runbook 2
## Revision History
| Date | Author | Changes |
|------|--------|---------|
| 2026-01-15 | @username | Initial version |
