CDE Operational Runbooks
Step-by-step procedures for common operations, incident response, and maintenance tasks in your Cloud Development Environment.
Daily Operations
Routine procedures for day-to-day CDE management
Morning Health Check Runbook
Check Control Plane Status
# Coder
coder server health
# Kubernetes-based (Coder)
kubectl get pods -n coder-system
# Kubernetes-based (Gitpod)
kubectl get pods -n gitpod
# Check service endpoints
curl -s https://cde.company.com/api/v2/health | jq
Review Overnight Alerts
# Check alertmanager for firing alerts
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'
# Review PagerDuty/Opsgenie incidents
pd incident:list --status=triggered,acknowledged
# Check Slack #cde-alerts channel
# Review email inbox for overnight alerts
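The Alertmanager filter above needs jq; when jq is unavailable, a rough grep-based count works on the raw JSON. A sketch against sample data (one alert object per line for illustration; the real API returns a JSON array):

```shell
# Sample alert payload standing in for the Alertmanager API response
cat > /tmp/alerts.json <<'EOF'
{"labels":{"alertname":"HighMemory"},"status":{"state":"active"}}
{"labels":{"alertname":"DiskFull"},"status":{"state":"active"}}
{"labels":{"alertname":"CertExpiry"},"status":{"state":"suppressed"}}
EOF
# Count firing alerts without jq (string match is approximate by design)
active=$(grep -c '"state":"active"' /tmp/alerts.json)
echo "active alerts: $active"
```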
Verify Workspace Metrics
# Check active workspace count
curl -s https://cde.company.com/api/v2/workspaces | jq '[.[] | select(.latest_build.status=="running")] | length'
# Review resource utilization
kubectl top nodes
kubectl top pods -n workspaces --sort-by=memory
# Check for stuck workspaces (builds that have been starting for hours;
# the age column format may vary by CLI version)
coder workspaces list --status=starting | grep -E "[0-9]+ hours?"
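The grep above keys on the age column; the same idea in awk is less fragile about where the match lands in the line. A sketch against sample output (the NAME/STATUS/AGE column layout is an assumption; verify it against your CLI version):

```shell
# Sample `coder workspaces list --status=starting` output (assumed layout)
cat > /tmp/ws-list.txt <<'EOF'
alice/dev    starting  3 hours
bob/api      starting  12 minutes
carol/web    starting  1 hour
EOF
# Flag builds that have been starting for an hour or more
awk '$4 ~ /^hours?$/ {print $1 " stuck for " $3 " " $4}' /tmp/ws-list.txt
```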
Check Certificate Expiration
# Check TLS certificate expiry
echo | openssl s_client -servername cde.company.com -connect cde.company.com:443 2>/dev/null | openssl x509 -noout -dates
# Check all ingress certificates
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter
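To turn the notAfter date into an actionable number, compute days remaining. A sketch with a hardcoded sample date (in practice, substitute the value from the openssl command above; `date -d` is GNU date):

```shell
# Sample notAfter value; replace with the real certificate expiry date
not_after="Jun  1 12:00:00 2030 GMT"
expiry_epoch=$(date -d "$not_after" +%s)
now_epoch=$(date +%s)
days_left=$(( (expiry_epoch - now_epoch) / 86400 ))
echo "$days_left days until expiry"
if [ "$days_left" -lt 30 ]; then
  echo "RENEW SOON"
fi
```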
Document Status in Daily Log
Update the team's daily operations log with:
- All systems operational / Any issues found
- Number of active workspaces
- Resource utilization summary
- Any pending maintenance or upgrades
New User Onboarding Runbook
Verify SSO/IdP Provisioning
# Confirm user exists in IdP group
# Azure AD
az ad group member check --group "CDE-Users" --member-id USER_OBJECT_ID
# Okta - Check group membership via admin console
# Or verify SCIM provisioning logs
Assign Appropriate Role/Template Access
# Coder - Assign to organization and template
coder users create --email [email protected] --username jsmith
coder organizations members add engineering jsmith --role member
# Grant template access
coder templates edit python-dev --group engineering
Send Welcome Resources
Provide the new user with:
- CDE access URL: https://cde.company.com
- Getting started documentation
- Onboarding video/walkthrough
- Support Slack channel: #cde-help
Verify First Workspace Creation
# Monitor first workspace creation
coder workspaces list --owner jsmith
# Check for any provisioning errors
coder workspaces show jsmith/my-workspace --json | jq '.latest_build.job.error'
Incident Response Runbooks
Emergency procedures for common CDE incidents
Control Plane Unavailable
Severity: SEV-1
Symptoms
- CDE dashboard inaccessible
- New workspaces cannot be created
- Existing workspaces may still function but cannot be managed
Page On-Call and Create Incident
# PagerDuty
pd incident:create --title "CDE Control Plane Down" --service CDE-CRITICAL --urgency high
# Slack notification
/incident new "CDE Control Plane Unavailable" severity:sev1
Check Pod/Service Status
# Get pod status
kubectl get pods -n coder-system -o wide
kubectl get pods -n coder-system | grep -v Running
# Check recent events
kubectl get events -n coder-system --sort-by='.lastTimestamp' | tail -20
# Check service endpoints
kubectl get endpoints -n coder-system
Check Database Connectivity
# Test PostgreSQL connectivity (pg_isready accepts a full connection string via -d)
kubectl exec -it deploy/coder -n coder-system -- pg_isready -d "$CODER_PG_CONNECTION_URL"
# Check database pod (if self-hosted)
kubectl get pods -n database -l app=postgresql
# Check RDS status (if AWS)
aws rds describe-db-instances --db-instance-identifier coder-db --query 'DBInstances[0].DBInstanceStatus'
Attempt Pod Restart
# Rolling restart of control plane
kubectl rollout restart deployment/coder -n coder-system
# Monitor rollout
kubectl rollout status deployment/coder -n coder-system --timeout=300s
# If stuck, force delete problematic pods
kubectl delete pod POD_NAME -n coder-system --grace-period=0 --force
Verify Recovery and Update Status
# Health check
curl -s https://cde.company.com/api/v2/health | jq
# Test workspace creation
coder create test-recovery --template minimal --yes
coder delete test-recovery --yes
# Update incident status
/incident update "Control plane recovered. Monitoring for stability."
Workspace Stuck in Starting/Stopping
Severity: SEV-3
Identify the Stuck Workspace
# List workspaces by status
coder workspaces list --status=starting
coder workspaces list --status=stopping
# Get detailed info
coder workspaces show USERNAME/WORKSPACE_NAME --json | jq '.latest_build'
Check Provisioner Logs
# Get build logs
coder workspaces logs USERNAME/WORKSPACE_NAME
# Check provisioner pods
kubectl logs -n coder-system -l app=coder-provisioner --tail=100
# Look for Terraform errors
kubectl logs -n coder-system -l app=coder-provisioner | grep -i error
Force Cancel and Retry
# Cancel the stuck build
coder workspaces cancel USERNAME/WORKSPACE_NAME
# For stuck in stopping - force stop the underlying resources
# Kubernetes
kubectl delete pod -n workspaces -l workspace=WORKSPACE_ID --grace-period=0 --force
# Retry the operation
coder workspaces start USERNAME/WORKSPACE_NAME
# or
coder workspaces stop USERNAME/WORKSPACE_NAME
Storage Capacity Critical
Severity: SEV-2
Identify Storage Usage
# Check PVC provisioned capacity (.status.capacity reports the provisioned
# size, not bytes in use; use df inside the pod for actual usage)
kubectl get pvc -n workspaces -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,CAPACITY:.status.capacity.storage
# Check node disk usage
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage
# Find largest workspace volumes
kubectl exec -it STORAGE_POD -- df -h
kubectl exec -it STORAGE_POD -- du -sh /data/* | sort -rh | head -20
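`sort -rh` orders human-readable sizes (K/M/G) correctly, which plain `sort -rn` does not. The pipeline on sample `du -sh` output:

```shell
# Sample du -sh output (size<TAB>path)
printf '1.2G\t/data/ws-alice\n300M\t/data/ws-bob\n14G\t/data/ws-carol\n8.0K\t/data/lost+found\n' > /tmp/du.txt
# Largest two volumes first
sort -rh /tmp/du.txt | head -2
```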
Clean Up Orphaned Resources
# Find orphaned PVCs
kubectl get pvc -n workspaces --no-headers | while read pvc rest; do
if ! coder workspaces list --all | grep -q "$pvc"; then
echo "Orphaned: $pvc"
fi
done
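The orphan check's core logic, run against sample data so the loop can be sanity-checked before touching live resources (the pvc-&lt;workspace&gt; naming convention is an assumption; match it to your template's volume names):

```shell
# Stand-ins for the kubectl and coder listings
printf 'pvc-ws1\npvc-ws2\npvc-ws3\n' > /tmp/pvcs.txt
printf 'ws1\nws3\n' > /tmp/active.txt
# A PVC with no matching active workspace is flagged as orphaned
while read -r pvc; do
  grep -qx "${pvc#pvc-}" /tmp/active.txt || echo "Orphaned: $pvc"
done < /tmp/pvcs.txt
```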
# Clean unused container images on each node (assumes containerd nodes with
# crictl; on Docker-based nodes, run `docker system prune -af` on the host)
kubectl get nodes -o name | xargs -I {} kubectl debug {} --image=busybox -- chroot /host crictl rmi --prune
# Delete workspaces unused for more than 30 days (column 5 is assumed to hold
# days since last use; verify the column layout before piping to delete)
coder workspaces list --all | awk '$5 > 30 {print $1}' | xargs -I {} coder delete {} --yes
Expand Storage (if needed)
# Expand PVC (if storage class supports it)
kubectl patch pvc WORKSPACE_PVC -n workspaces -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
# Add new storage nodes
# AWS EKS
eksctl scale nodegroup --cluster=cde-cluster --name=storage-nodes --nodes=5
# Update storage quotas in templates
coder templates push TEMPLATE --var disk_size=100
Maintenance Runbooks
Scheduled maintenance and upgrade procedures
Platform Upgrade
Upgrade Coder/Gitpod version
Template Update
Update workspace templates
# Push template update
coder templates push python-dev \
--directory ./templates/python-dev \
--message "Updated Python to 3.12"
# Check active versions
coder templates versions python-dev
Certificate Renewal
TLS certificate management
Database Maintenance
PostgreSQL optimization
Disaster Recovery Runbooks
Procedures for major outages and data recovery
Full Platform Recovery Runbook
When to Use
- Complete control plane failure
- Database corruption or loss
- Cluster-wide infrastructure failure
- Ransomware or security incident requiring full rebuild
Assess Damage and Declare DR
# Confirm primary region is unrecoverable
aws ec2 describe-availability-zones --region us-east-1
# Declare disaster recovery
/incident update "DR declared. Initiating failover to us-west-2"
# Notify stakeholders
# - Executive team
# - All engineering teams
# - Security team
Restore Database from Backup
# List available backups
aws rds describe-db-snapshots --db-instance-identifier coder-db
# Restore to DR region
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier coder-db-dr \
--db-snapshot-identifier rds:coder-db-2024-01-15-00-05 \
--availability-zone us-west-2a
# Wait for restoration
aws rds wait db-instance-available --db-instance-identifier coder-db-dr
Deploy Control Plane to DR Region
# Switch kubectl context to DR cluster
kubectl config use-context dr-cluster-us-west-2
# Deploy Coder with DR database connection
helm upgrade --install coder coder/coder \
--namespace coder-system \
--values dr-values.yaml \
--set postgres.host=coder-db-dr.xxxxx.us-west-2.rds.amazonaws.com
# Verify deployment
kubectl rollout status deployment/coder -n coder-system
Update DNS and Load Balancers
# Update Route53 to point to DR region
aws route53 change-resource-record-sets \
--hosted-zone-id ZXXXXXXXXXXXXX \
--change-batch file://dr-dns-change.json
# Verify DNS propagation
dig cde.company.com +short
# Update any hardcoded references in IdP
Validate Recovery
# Health check
curl -s https://cde.company.com/api/v2/health | jq
# Test SSO login
# Manually test in browser
# Create test workspace
coder create dr-test --template minimal --yes
# Verify existing user data
coder users list | head -10
# Announce recovery
/incident update "DR complete. CDE operational in us-west-2. Please recreate workspaces."
Scaling Operations
Procedures for scaling CDE infrastructure up or down
Emergency Scale-Up
Use when workspace creation is slow or failing due to resource constraints.
# Check current capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
# Scale node group (AWS EKS)
eksctl scale nodegroup \
--cluster=cde-cluster \
--name=workspace-nodes \
--nodes=10 \
--nodes-max=20
# Or ensure the cluster autoscaler is running so it can add nodes on demand
kubectl scale deployment cluster-autoscaler -n kube-system --replicas=1
# Verify new nodes
kubectl get nodes -w
Caution
Emergency scaling may incur significant costs. Notify finance team if scaling beyond normal limits.
Scheduled Scale-Down
Use during off-hours or weekends to reduce costs.
# Stop workspaces that have been running for more than 4 hours (column 6 is
# assumed to hold hours running; verify the column layout first)
coder workspaces list --status=running | \
awk '$6 > 4 {print $1}' | \
xargs -I {} coder workspaces stop {}
# Cordon nodes to prevent new scheduling
kubectl cordon NODE_NAME
# Drain workloads (with grace period)
kubectl drain NODE_NAME \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=300
# Scale down node group
eksctl scale nodegroup \
--cluster=cde-cluster \
--name=workspace-nodes \
--nodes=3
Automation
Consider using Kubernetes CronJobs or AWS Scheduled Scaling for automated off-hours scaling.
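One way to implement this is a CronJob that runs the coder CLI on a schedule. The manifest below is a sketch only; the image, namespace, token secret, and the stop command itself are assumptions to adapt before applying:

```shell
# Write a sketch CronJob manifest (apply later with: kubectl apply -f /tmp/nightly-stop.yaml)
cat > /tmp/nightly-stop.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-workspace-stop
  namespace: coder-system
spec:
  schedule: "0 19 * * 1-5"          # 19:00 UTC on weekdays
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: stop-idle
            image: ghcr.io/coder/coder:latest   # assumed image containing the coder CLI
            env:
            - name: CODER_URL
              value: https://cde.company.com
            - name: CODER_SESSION_TOKEN
              valueFrom:
                secretKeyRef:
                  name: coder-automation-token  # assumed pre-created secret
                  key: token
            command:
            - /bin/sh
            - -c
            - coder stop --yes some-workspace   # replace with your idle-selection logic
EOF
echo "wrote /tmp/nightly-stop.yaml"
```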
Runbook Template
Use this template to create new runbooks for your team
# Runbook: [TITLE]
## Overview
- **Purpose**: [What does this runbook accomplish?]
- **Audience**: [Who should use this runbook?]
- **Estimated Time**: [How long does this take?]
- **Severity/Priority**: [SEV-1/2/3/4]
## Prerequisites
- [ ] Access to X system
- [ ] Knowledge of Y
- [ ] Required tools: Z
## Symptoms / When to Use
- Symptom 1
- Symptom 2
## Steps
### Step 1: [Action Name]
**Description**: [What are we doing and why?]
```bash
# Commands to run
command here
```
**Expected Output**: [What should we see?]
**If Failed**: [What to do if this step fails]
### Step 2: [Action Name]
...
## Verification
- [ ] Check 1
- [ ] Check 2
## Rollback Procedure
If something goes wrong:
1. Step 1
2. Step 2
## Related Runbooks
- Link to related runbook 1
- Link to related runbook 2
## Revision History
| Date | Author | Changes |
|------|--------|---------|
| 2024-01-15 | @username | Initial version |