CDE Operational Runbooks

Step-by-step procedures for common operations, incident response, and maintenance tasks in your Cloud Development Environment.

Daily Operations

Routine procedures for day-to-day CDE management

Morning Health Check Runbook

Est. Time: 10-15 minutes

Step 1: Check Control Plane Status

# Coder
coder server health

# Kubernetes-based
kubectl get pods -n coder-system
kubectl get pods -n ona

# Check service endpoints
curl -s https://cde.company.com/api/v2/health | jq

Step 2: Review Overnight Alerts

# Check alertmanager for firing alerts
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'

# Review PagerDuty/Opsgenie incidents
pd incident:list --status=triggered,acknowledged

# Check Slack #cde-alerts channel
# Review email inbox for overnight alerts

Step 3: Verify Workspace Metrics

# Check active workspace count
curl -s https://cde.company.com/api/v2/workspaces | jq '[.[] | select(.latest_build.status=="running")] | length'

# Review resource utilization
kubectl top nodes
kubectl top pods -n workspaces --sort-by=memory

# Check for stuck workspaces
coder workspaces list --status=starting | grep -E "^[0-9]+ hours?"

Step 4: Check Certificate Expiration

# Check TLS certificate expiry
echo | openssl s_client -servername cde.company.com -connect cde.company.com:443 2>/dev/null | openssl x509 -noout -dates

# Check all ingress certificates
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter

Step 5: Document Status in Daily Log

Update the team's daily operations log with the items below (a scripted example follows the list):

  • All systems operational / Any issues found
  • Number of active workspaces
  • Resource utilization summary
  • Any pending maintenance or upgrades
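
A minimal sketch of what a scripted entry could look like, assuming the team keeps a shared Markdown log (the path and fields below are placeholders, and the workspace count reuses the API call from step 3):

# Append today's entry to the shared ops log (path and fields are assumptions)
LOG=/srv/ops/cde-daily-log.md
ACTIVE=$(curl -s https://cde.company.com/api/v2/workspaces | jq '[.[] | select(.latest_build.status=="running")] | length')
{
  echo "## $(date +%F) - $(whoami)"
  echo "- Status: all systems operational"
  echo "- Active workspaces: $ACTIVE"
  echo "- Resource utilization: see kubectl top nodes snapshot"
  echo "- Pending maintenance: none"
} >> "$LOG"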

New User Onboarding Runbook

Est. Time: 5-10 minutes

Step 1: Verify SSO/IdP Provisioning

# Confirm user exists in IdP group
# Azure AD
az ad group member check --group "CDE-Users" --member-id USER_OBJECT_ID

# Okta - Check group membership via admin console
# Or verify SCIM provisioning logs

Step 2: Assign Appropriate Role/Template Access

# Coder - Assign to organization and template
coder users create --email jsmith@company.com --username jsmith
coder organizations members add engineering jsmith --role member

# Grant template access
coder templates edit python-dev --group engineering

Step 3: Send Welcome Resources

Provide the new user with:

  • CDE access URL: https://cde.company.com
  • Getting started documentation
  • Onboarding video/walkthrough
  • Support Slack channel: #cde-help

Step 4: Verify First Workspace Creation

# Monitor first workspace creation
coder workspaces list --owner jsmith

# Check for any provisioning errors
coder workspaces show jsmith/my-workspace --json | jq '.latest_build.job.error'

Incident Response Runbooks

Emergency procedures for common CDE incidents

Control Plane Unavailable

SEV-1

Symptoms

  • CDE dashboard inaccessible
  • New workspaces cannot be created
  • Existing workspaces may still function but cannot be managed

Step 1: Page On-Call and Create Incident

# PagerDuty
pd incident:create --title "CDE Control Plane Down" --service CDE-CRITICAL --urgency high

# Slack notification
/incident new "CDE Control Plane Unavailable" severity:sev1

Step 2: Check Pod/Service Status

# Get pod status
kubectl get pods -n coder-system -o wide
kubectl get pods -n coder-system | grep -v Running

# Check recent events
kubectl get events -n coder-system --sort-by='.lastTimestamp' | tail -20

# Check service endpoints
kubectl get endpoints -n coder-system

Step 3: Check Database Connectivity

# Test PostgreSQL connectivity using the connection string configured in the pod
kubectl exec -it deploy/coder -n coder-system -- sh -c 'pg_isready -d "$CODER_PG_CONNECTION_URL"'

# Check database pod (if self-hosted)
kubectl get pods -n database -l app=postgresql

# Check RDS status (if AWS)
aws rds describe-db-instances --db-instance-identifier coder-db --query 'DBInstances[0].DBInstanceStatus'

Step 4: Attempt Pod Restart

# Rolling restart of control plane
kubectl rollout restart deployment/coder -n coder-system

# Monitor rollout
kubectl rollout status deployment/coder -n coder-system --timeout=300s

# If stuck, force delete problematic pods
kubectl delete pod POD_NAME -n coder-system --grace-period=0 --force

Step 5: Verify Recovery and Update Status

# Health check
curl -s https://cde.company.com/api/v2/health | jq

# Test workspace creation
coder create test-recovery --template minimal --yes
coder delete test-recovery --yes

# Update incident status
/incident update "Control plane recovered. Monitoring for stability."

Workspace Stuck in Starting/Stopping

SEV-3

Step 1: Identify the Stuck Workspace

# List workspaces by status
coder workspaces list --status=starting
coder workspaces list --status=stopping

# Get detailed info
coder workspaces show USERNAME/WORKSPACE_NAME --json | jq '.latest_build'

Step 2: Check Provisioner Logs

# Get build logs
coder workspaces logs USERNAME/WORKSPACE_NAME

# Check provisioner pods
kubectl logs -n coder-system -l app=coder-provisioner --tail=100

# Look for Terraform errors
kubectl logs -n coder-system -l app=coder-provisioner | grep -i error

Step 3: Force Cancel and Retry

# Cancel the stuck build
coder workspaces cancel USERNAME/WORKSPACE_NAME

# For stuck in stopping - force stop the underlying resources
# Kubernetes
kubectl delete pod -n workspaces -l workspace=WORKSPACE_ID --grace-period=0 --force

# Retry the operation
coder workspaces start USERNAME/WORKSPACE_NAME
# or
coder workspaces stop USERNAME/WORKSPACE_NAME

Storage Capacity Critical

SEV-2

Step 1: Identify Storage Usage

# Check PVC requested vs. provisioned capacity (live usage needs df in the pod or kubelet volume metrics)
kubectl get pvc -n workspaces -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,PROVISIONED:.status.capacity.storage

# Check node allocatable ephemeral storage
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage

# Find largest workspace volumes
kubectl exec -it STORAGE_POD -- df -h
kubectl exec -it STORAGE_POD -- du -sh /data/* | sort -rh | head -20

Step 2: Clean Up Orphaned Resources

# Find orphaned PVCs
kubectl get pvc -n workspaces --no-headers | while read pvc rest; do
  if ! coder workspaces list --all | grep -q "$pvc"; then
    echo "Orphaned: $pvc"
  fi
done

# Clean container images on nodes (assumes a Docker runtime and a debug image with the docker CLI;
# containerd-based nodes need crictl rmi --prune instead)
kubectl get nodes -o name | xargs -I {} kubectl debug {} --image=docker:cli -- \
  docker -H unix:///host/run/docker.sock system prune -af

# Clean up long-unused workspaces (assumes the age column is field 5 of the list output; verify first)
coder workspaces list --all | awk '$5 > 30 {print $1}' | xargs -I {} coder delete {} --yes

Step 3: Expand Storage (if needed)

# Expand PVC (if storage class supports it)
kubectl patch pvc WORKSPACE_PVC -n workspaces -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# Add new storage nodes
# AWS EKS
eksctl scale nodegroup --cluster=cde-cluster --name=storage-nodes --nodes=5

# Update storage quotas in templates
coder templates push TEMPLATE --var disk_size=100

AI Agent Incident Runbooks

Emergency procedures for AI coding agent failures, runaway processes, and sandbox breaches

Runaway AI Agent - Resource Exhaustion

SEV-1

Symptoms

  • AI agent consuming 100% CPU or exhausting memory in workspace
  • Uncontrolled subprocess spawning (fork bombs, recursive builds)
  • Disk usage spiking from generated files, logs, or downloaded artifacts
  • LLM API token spend exceeding budget thresholds

Step 1: Identify the Runaway Workspace

# Find workspaces with abnormal resource usage
kubectl top pods -n workspaces --sort-by=cpu | head -10
kubectl top pods -n workspaces --sort-by=memory | head -10

# Check for high process counts inside workspace
kubectl exec -it WORKSPACE_POD -n workspaces -- ps aux --sort=-%cpu | head -20

# Review agent session logs
kubectl logs WORKSPACE_POD -n workspaces --tail=200 | grep -i "agent\|llm\|copilot"

Step 2: Kill Runaway Processes and Contain

# Kill the agent process tree inside the workspace
kubectl exec -it WORKSPACE_POD -n workspaces -- pkill -9 -f "claude|copilot|aider|agent"

# If subprocess spawning is out of control, cap the pid count via cgroups
# (cgroup v1 path shown; on cgroup v2 use /sys/fs/cgroup/pids.max)
kubectl exec -it WORKSPACE_POD -n workspaces -- bash -c 'echo 50 > /sys/fs/cgroup/pids/pids.max'

# If workspace is unresponsive, force-stop it
coder workspaces stop USERNAME/WORKSPACE_NAME
kubectl delete pod WORKSPACE_POD -n workspaces --grace-period=10 --force

Step 3: Check LLM API Token Spend

# Check proxy/gateway logs for token consumption
kubectl logs -n ai-gateway -l app=litellm --tail=100 | grep -i "tokens\|cost\|budget"

# If using a cost proxy, check budget status
curl -s http://litellm-proxy:4000/budget/info | jq

# Revoke API key if spend is out of control
curl -X POST http://litellm-proxy:4000/key/delete \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"keys": ["sk-compromised-key-id"]}'

Step 4: Post-Incident Hardening

After containment, apply preventive controls (a sample resource-limit manifest follows the list):

  • Set CPU/memory resource limits on agent workspace templates
  • Configure per-user LLM API token budgets with hard caps
  • Enable max-process-count (pids.max) cgroup limits in templates
  • Add agent session timeout policies (e.g., 30-minute idle kill)
  • Set up alerting on sustained high CPU/memory in agent workspaces
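
A minimal sketch of the resource-limit control, assuming agent workspaces run in the workspaces namespace; the values are placeholders. A namespace-level LimitRange catches any workspace template that omits explicit limits:

kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: agent-workspace-limits
  namespace: workspaces
spec:
  limits:
    - type: Container
      default:              # applied when a workspace container declares no limits
        cpu: "2"
        memory: 8Gi
      defaultRequest:
        cpu: 500m
        memory: 1Gi
EOF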

AI Agent Sandbox Escape or Unauthorized Access

SEV-1

Symptoms

  • Agent accessing files, repos, or services outside its allowed scope
  • Unexpected network traffic from agent workspace (egress to unknown IPs)
  • Agent modifying system-level config, installing packages, or escalating privileges
  • Secrets or credentials appearing in agent output or logs

Step 1: Isolate and Freeze the Workspace Immediately

# Apply network policy to block all egress immediately
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate-agent
  namespace: workspaces
spec:
  podSelector:
    matchLabels:
      workspace: WORKSPACE_ID
  policyTypes:
    - Egress
  egress: []
EOF

# Freeze the workspace (do NOT delete - preserve for forensics)
coder workspaces stop USERNAME/WORKSPACE_NAME

# Revoke the user's agent API keys
curl -X POST http://litellm-proxy:4000/key/delete \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"keys": ["sk-affected-key"]}'

Step 2: Collect Forensic Evidence

# Capture the PVC definition (use a CSI VolumeSnapshot for the actual volume data before any cleanup)
kubectl get pvc -n workspaces -l workspace=WORKSPACE_ID -o json > /tmp/pvc-snapshot.json

# Export agent session logs
kubectl logs WORKSPACE_POD -n workspaces --all-containers > /tmp/agent-forensics.log

# Check network flow logs for unauthorized connections
kubectl logs -n kube-system -l app=calico-node --tail=500 | grep WORKSPACE_POD_IP

# Review audit logs for privilege escalation
kubectl logs -n workspaces WORKSPACE_POD | grep -iE "sudo|chmod|chown|curl|wget|ssh"

# Check if secrets were accessed
kubectl logs WORKSPACE_POD -n workspaces | grep -iE "password|token|secret|api.key|credentials"

Step 3: Rotate Exposed Credentials

# Rotate any secrets that were mounted in the workspace
kubectl get secret -n workspaces -l workspace=WORKSPACE_ID -o json | jq '.items[].metadata.name'

# Rotate Git credentials
# GitHub: Revoke PAT, generate new one
# GitLab: Revoke token in Admin > Credentials

# Rotate cloud provider credentials
aws iam update-access-key --access-key-id AKIA... --status Inactive --user-name cde-agent
aws iam create-access-key --user-name cde-agent

# Notify security team
/incident new "AI Agent Sandbox Escape" severity:sev1 --notify security-oncall

Step 4: Harden Agent Sandbox Configuration

After investigation, implement stronger isolation (an example egress allow-list policy follows the list):

  • Deploy agent workspaces in Firecracker microVMs instead of containers
  • Apply read-only filesystem with explicit writable paths only
  • Enforce NetworkPolicy: allow-list only required egress (Git host, LLM API, registry)
  • Use short-lived, scoped credentials injected via Vault or IRSA instead of long-lived keys
  • Enable seccomp and AppArmor profiles to restrict syscalls
  • Enable comprehensive audit logging for all agent workspace activity
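
A sketch of the egress allow-list, assuming agent workspaces carry a workspace-type: ai-agent label and that the Git host, LLM gateway, and registry sit behind the CIDR shown (label and CIDR are placeholders to adapt):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-egress-allowlist
  namespace: workspaces
spec:
  podSelector:
    matchLabels:
      workspace-type: ai-agent
  policyTypes:
    - Egress
  egress:
    - to:                        # cluster DNS
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                        # Git host, LLM gateway, registry (CIDR assumed)
        - ipBlock:
            cidr: 10.0.20.0/24
      ports:
        - protocol: TCP
          port: 443
EOF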

LLM Token Budget Exhaustion

SEV-3

Symptoms

  • Developers reporting "quota exceeded" or "rate limited" from AI coding tools
  • Team or organization LLM API spend approaching or exceeding monthly budget
  • Agent proxy returning 429 responses to workspace requests

Step 1: Assess Current Token Usage

# Check LLM proxy spend dashboard
curl -s http://litellm-proxy:4000/global/spend | jq

# Get per-user token consumption
curl -s http://litellm-proxy:4000/global/spend/keys | jq '.[] | {user: .key_alias, spend: .spend}'

# Find top consumers
curl -s http://litellm-proxy:4000/global/spend/keys | jq 'sort_by(-.spend) | .[0:10]'

# Check rate limit status
curl -s http://litellm-proxy:4000/key/info -d '{"key": "sk-user-key"}' | jq '.info.max_budget,.info.spend'

Step 2: Triage and Reallocate Budget

# Increase budget for a specific team (if within org limits)
curl -X POST http://litellm-proxy:4000/key/update \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"key": "sk-team-key", "max_budget": 500}'

# Reset spend counter for a budget period
curl -X POST http://litellm-proxy:4000/key/update \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"key": "sk-team-key", "spend": 0, "budget_duration": "30d"}'

# If needed, throttle low-priority agents to cheaper models
# Update proxy routing to use smaller models for autocomplete

Step 3: Set Up Budget Guardrails

Prevent future exhaustion with these controls (an example budget-capped key request follows the list):

  • Set per-user daily and monthly token budgets with soft and hard caps
  • Configure alerts at 50%, 80%, and 95% of budget thresholds
  • Route autocomplete to smaller, cheaper models (e.g., use a small model for autocomplete, reserve large models for chat)
  • Implement request caching at the proxy layer for repeated queries
  • Add per-request max_tokens limits to prevent single-query budget spikes
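
A sketch of issuing a budget-capped key through the LiteLLM proxy referenced above (exact field names can vary by proxy version; the alias and limits are placeholders):

# Issue a per-user key with a hard monthly budget plus token and request rate limits
curl -X POST http://litellm-proxy:4000/key/generate \
  -H "Authorization: Bearer ADMIN_KEY" \
  -d '{"key_alias": "jsmith", "max_budget": 50, "budget_duration": "30d", "tpm_limit": 100000, "rpm_limit": 60}'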

AI Agent Code/Config Pollution

SEV-2

Symptoms

  • Agent committed broken, hallucinated, or insecure code to shared branches
  • CI/CD pipelines failing after agent-generated commits
  • Agent modified infrastructure configs (Terraform, Helm values, Dockerfiles) unexpectedly
  • Leaked secrets, hardcoded credentials, or debug code in agent commits

Step 1: Identify and Revert Affected Commits

# Find recent agent-authored commits
git log --oneline --author="agent\|bot\|copilot\|claude" --since="24 hours ago"

# Review the diff of suspicious commits
git diff COMMIT_HASH~1 COMMIT_HASH

# Revert the problematic commit(s)
git revert COMMIT_HASH --no-edit
git push origin main

# If secrets were committed, treat as credential leak
# (see Sandbox Escape runbook, step 3)

Step 2: Restrict Agent Git Access

# Enforce branch protection - agents must use feature branches + PRs
# GitHub (note: this endpoint expects the full protection object, so enforce_admins,
# restrictions, and required_status_checks.contexts must also be supplied)
gh api repos/ORG/REPO/branches/main/protection -X PUT \
  -F 'required_pull_request_reviews[required_approving_review_count]=1' \
  -F 'required_status_checks[strict]=true'

# GitLab: protect main branch
# Settings > Repository > Protected Branches > main > "No one" can push

# Restrict agent Git credentials to fork-only or branch-only access

Step 3: Establish Agent Code Quality Gates

Prevent future incidents with automated guardrails (a sample pre-commit secret-scanning hook follows the list):

  • Require human approval on all PRs from agent-authored branches
  • Add pre-commit hooks for secret scanning (gitleaks, trufflehog)
  • Run SAST/DAST scans on agent-generated code before merge
  • Enforce .gitignore patterns that block agent temp files and logs
  • Block direct pushes to main/release branches from agent service accounts
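
A sketch of the secret-scanning gate as a local pre-commit hook, assuming gitleaks is installed on the workspace image (the same scan normally runs again in CI):

#!/usr/bin/env bash
# .git/hooks/pre-commit - reject commits whose staged changes contain likely secrets
set -euo pipefail
gitleaks protect --staged --redact
# Remember: chmod +x .git/hooks/pre-commit so Git runs the hook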

Maintenance Runbooks

Scheduled maintenance and upgrade procedures

Platform Upgrade

Upgrade Coder/Ona (formerly Gitpod) version

1. Review release notes and breaking changes
2. Back up database and persistent volumes
3. Notify users of maintenance window
4. Stop all running workspaces
5. Apply Helm/Terraform upgrade
6. Run database migrations
7. Verify health and test workspace creation
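
A sketch of steps 2 and 5-7 for a Helm-based Coder install (release name, namespace, values file, and version are placeholders to adapt):

# Back up the database, apply the chart upgrade, then verify rollout and health
pg_dump "$CODER_PG_CONNECTION_URL" > coder-backup-$(date +%F).sql
helm repo update
helm upgrade coder coder/coder \
  --namespace coder-system \
  --values values.yaml \
  --version NEW_VERSION
kubectl rollout status deployment/coder -n coder-system --timeout=300s
curl -s https://cde.company.com/api/v2/health | jq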

Template Update

Update workspace templates

1. Update template code in Git repository
2. Test in staging environment
3. Push new template version
4. Notify users of available update
5. Monitor workspace rebuilds
# Push template update
coder templates push python-dev \
  --directory ./templates/python-dev \
  --message "Updated Python to 3.12"

# Check active versions
coder templates versions python-dev

Certificate Renewal

TLS certificate management

1. Check certificate expiration dates
2. Generate CSR or trigger Let's Encrypt renewal
3. Update Kubernetes secrets
4. Restart ingress controller
5. Verify HTTPS connectivity
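
A sketch of the renewal flow assuming cert-manager manages the Let's Encrypt certificate (resource names and namespaces are placeholders; manually issued certs are instead replaced with kubectl create secret tls):

# Check certificate state, force a renewal, and restart the ingress controller
kubectl get certificate -A
cmctl renew cde-tls -n ingress-nginx
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx

# Confirm the new expiry date
echo | openssl s_client -servername cde.company.com -connect cde.company.com:443 2>/dev/null | openssl x509 -noout -dates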

Database Maintenance

PostgreSQL optimization

1. Check database size and growth
2. Run VACUUM ANALYZE
3. Check and rebuild indexes
4. Archive old audit logs
5. Verify backup integrity
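
A sketch of the maintenance queries, assuming direct psql access via the Coder connection string; the database name and the audit-log table/column are assumptions, so verify against your schema before deleting anything:

# Size check, vacuum/analyze, reindex, and archive of old audit rows
psql "$CODER_PG_CONNECTION_URL" -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
psql "$CODER_PG_CONNECTION_URL" -c "VACUUM (VERBOSE, ANALYZE);"
psql "$CODER_PG_CONNECTION_URL" -c "REINDEX DATABASE coder;"
psql "$CODER_PG_CONNECTION_URL" -c "DELETE FROM audit_logs WHERE \"time\" < now() - interval '90 days';"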

Disaster Recovery Runbooks

Procedures for major outages and data recovery

RTO (Recovery Time Objective): 15 minutes
RPO (Recovery Point Objective): 1 hour
Availability Target: 99.9%

Full Platform Recovery Runbook

When to Use

  • Complete control plane failure
  • Database corruption or loss
  • Cluster-wide infrastructure failure
  • Ransomware or security incident requiring full rebuild

Step 1: Assess Damage and Declare DR

# Confirm primary region is unrecoverable
aws ec2 describe-availability-zones --region us-east-1

# Declare disaster recovery
/incident update "DR declared. Initiating failover to us-west-2"

# Notify stakeholders
# - Executive team
# - All engineering teams
# - Security team

Step 2: Restore Database from Backup

# List available backups
aws rds describe-db-snapshots --db-instance-identifier coder-db

# Restore to DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-dr \
  --db-snapshot-identifier rds:coder-db-2026-01-15-00-05 \
  --availability-zone us-west-2a

# Wait for restoration
aws rds wait db-instance-available --db-instance-identifier coder-db-dr

Step 3: Deploy Control Plane to DR Region

# Switch kubectl context to DR cluster
kubectl config use-context dr-cluster-us-west-2

# Deploy Coder with DR database connection
helm upgrade --install coder coder/coder \
  --namespace coder-system \
  --values dr-values.yaml \
  --set postgres.host=coder-db-dr.xxxxx.us-west-2.rds.amazonaws.com

# Verify deployment
kubectl rollout status deployment/coder -n coder-system

Step 4: Update DNS and Load Balancers

# Update Route53 to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch file://dr-dns-change.json

# Verify DNS propagation
dig cde.company.com +short

# Update any hardcoded references in IdP

Step 5: Validate Recovery

# Health check
curl -s https://cde.company.com/api/v2/health | jq

# Test SSO login
# Manually test in browser

# Create test workspace
coder create dr-test --template minimal --yes

# Verify existing user data
coder users list | head -10

# Announce recovery
/incident update "DR complete. CDE operational in us-west-2. Please recreate workspaces."

Scaling Operations

Procedures for scaling CDE infrastructure up or down

Emergency Scale-Up

Use when workspace creation is slow or failing due to resource constraints.

# Check current capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory

# Scale node group (AWS EKS)
eksctl scale nodegroup \
  --cluster=cde-cluster \
  --name=workspace-nodes \
  --nodes=10 \
  --nodes-max=20

# Or trigger cluster autoscaler
kubectl scale deployment cluster-autoscaler -n kube-system --replicas=1

# Verify new nodes
kubectl get nodes -w

Caution

Emergency scaling may incur significant costs. Notify finance team if scaling beyond normal limits.

Scheduled Scale-Down

Use during off-hours or weekends to reduce costs.

# Stop workspaces that have been running longer than 4 hours
# (assumes the runtime column is field 6 of the list output; verify before relying on it)
coder workspaces list --status=running | \
  awk '$6 > 4 {print $1}' | \
  xargs -I {} coder workspaces stop {}

# Cordon nodes to prevent new scheduling
kubectl cordon NODE_NAME

# Drain workloads (with grace period)
kubectl drain NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Scale down node group
eksctl scale nodegroup \
  --cluster=cde-cluster \
  --name=workspace-nodes \
  --nodes=3

Automation

Consider using Kubernetes CronJobs or AWS Scheduled Scaling for automated off-hours scaling.
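
A sketch of the AWS scheduled-scaling option, assuming the workspace nodes belong to an Auto Scaling group named cde-workspace-nodes (group name, sizes, and cron schedules are placeholders):

# Scale the workspace node group down on weekday evenings and back up before work hours (UTC)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name cde-workspace-nodes \
  --scheduled-action-name offhours-scale-down \
  --recurrence "0 20 * * 1-5" \
  --min-size 3 --desired-capacity 3

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name cde-workspace-nodes \
  --scheduled-action-name morning-scale-up \
  --recurrence "0 6 * * 1-5" \
  --min-size 5 --desired-capacity 10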

Runbook Template

Use this template to create new runbooks for your team

# Runbook: [TITLE]

## Overview
- **Purpose**: [What does this runbook accomplish?]
- **Audience**: [Who should use this runbook?]
- **Estimated Time**: [How long does this take?]
- **Severity/Priority**: [SEV-1/2/3/4]

## Prerequisites
- [ ] Access to X system
- [ ] Knowledge of Y
- [ ] Required tools: Z

## Symptoms / When to Use
- Symptom 1
- Symptom 2

## Steps

### Step 1: [Action Name]
**Description**: [What are we doing and why?]

```bash
# Commands to run
command here
```

**Expected Output**: [What should we see?]

**If Failed**: [What to do if this step fails]

### Step 2: [Action Name]
...

## Verification
- [ ] Check 1
- [ ] Check 2

## Rollback Procedure
If something goes wrong:
1. Step 1
2. Step 2

## Related Runbooks
- Link to related runbook 1
- Link to related runbook 2

## Revision History
| Date | Author | Changes |
|------|--------|---------|
| 2026-01-15 | @username | Initial version |