CDE Operational Runbooks

Step-by-step procedures for common operations, incident response, and maintenance tasks in your Cloud Development Environment.

Daily Operations

Routine procedures for day-to-day CDE management

Morning Health Check Runbook

Est. Time: 10-15 minutes

Step 1: Check Control Plane Status

# Coder
coder server health

# Kubernetes-based deployments (use the namespace for your platform)
kubectl get pods -n coder-system   # Coder
kubectl get pods -n gitpod         # Gitpod

# Check service endpoints
curl -s https://cde.company.com/api/v2/health | jq

Step 2: Review Overnight Alerts

# Check alertmanager for firing alerts
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'

# Review PagerDuty/Opsgenie incidents
pd incident:list --status=triggered,acknowledged

# Check Slack #cde-alerts channel
# Review email inbox for overnight alerts

Step 3: Verify Workspace Metrics

# Check active workspace count
curl -s https://cde.company.com/api/v2/workspaces | jq '[.[] | select(.latest_build.status=="running")] | length'

# Review resource utilization
kubectl top nodes
kubectl top pods -n workspaces --sort-by=memory

# Check for stuck workspaces
coder workspaces list --status=starting | grep -E "[0-9]+ hours?"

Step 4: Check Certificate Expiration

# Check TLS certificate expiry
echo | openssl s_client -servername cde.company.com -connect cde.company.com:443 2>/dev/null | openssl x509 -noout -dates

# Check all ingress certificates
kubectl get certificates -A -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter

Step 5: Document Status in Daily Log

Update the team's daily operations log with:

  • All systems operational / Any issues found
  • Number of active workspaces
  • Resource utilization summary
  • Any pending maintenance or upgrades
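
A minimal sketch for appending that entry from a script — the log path and field names here are assumptions, not part of any standard tooling:

```shell
#!/usr/bin/env bash
# Append a dated entry to the team's daily operations log.
# LOG_FILE, the field names, and the sample values are illustrative assumptions.
LOG_FILE="${LOG_FILE:-./daily-ops.log}"

log_entry() {
  local status="$1" workspaces="$2" cpu="$3" notes="$4"
  printf '%s | status=%s | active_workspaces=%s | cpu_util=%s | notes=%s\n' \
    "$(date +%F)" "$status" "$workspaces" "$cpu" "$notes"
}

log_entry "operational" "142" "63%" "no pending maintenance" >> "$LOG_FILE"
```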

New User Onboarding Runbook

Est. Time: 5-10 minutes

Step 1: Verify SSO/IdP Provisioning

# Confirm user exists in IdP group
# Azure AD
az ad group member check --group "CDE-Users" --member-id USER_OBJECT_ID

# Okta - Check group membership via admin console
# Or verify SCIM provisioning logs

Step 2: Assign Appropriate Role/Template Access

# Coder - Assign to organization and template
coder users create --email [email protected] --username jsmith
coder organizations members add engineering jsmith --role member

# Grant template access
coder templates edit python-dev --group engineering

Step 3: Send Welcome Resources

Provide the new user with:

  • CDE access URL: https://cde.company.com
  • Getting started documentation
  • Onboarding video/walkthrough
  • Support Slack channel: #cde-help
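
A small sketch that assembles the welcome message from the items above — the URL and Slack channel match the examples on this page; actual delivery (email or Slack) is left out:

```shell
#!/usr/bin/env bash
# Assemble the welcome message for a new user. The URL and Slack channel
# match the examples on this page; delivery is omitted.
welcome_message() {
  local name="$1"
  cat <<EOF
Hi ${name},

Your Cloud Development Environment access is ready:
  - CDE URL: https://cde.company.com
  - Getting started docs and onboarding walkthrough
  - Support: #cde-help on Slack
EOF
}

welcome_message "jsmith"
```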

Step 4: Verify First Workspace Creation

# Monitor first workspace creation
coder workspaces list --owner jsmith

# Check for any provisioning errors
coder workspaces show jsmith/my-workspace --json | jq '.latest_build.job.error'

Incident Response Runbooks

Emergency procedures for common CDE incidents

Control Plane Unavailable

SEV-1

Symptoms

  • CDE dashboard inaccessible
  • New workspaces cannot be created
  • Existing workspaces may still function but cannot be managed

Step 1: Page On-Call and Create Incident

# PagerDuty
pd incident:create --title "CDE Control Plane Down" --service CDE-CRITICAL --urgency high

# Slack notification
/incident new "CDE Control Plane Unavailable" severity:sev1

Step 2: Check Pod/Service Status

# Get pod status
kubectl get pods -n coder-system -o wide
kubectl get pods -n coder-system | grep -v Running

# Check recent events
kubectl get events -n coder-system --sort-by='.lastTimestamp' | tail -20

# Check service endpoints
kubectl get endpoints -n coder-system

Step 3: Check Database Connectivity

# Test PostgreSQL connection
kubectl exec -it deploy/coder -n coder-system -- pg_isready -d "$CODER_PG_CONNECTION_URL"

# Check database pod (if self-hosted)
kubectl get pods -n database -l app=postgresql

# Check RDS status (if AWS)
aws rds describe-db-instances --db-instance-identifier coder-db --query 'DBInstances[0].DBInstanceStatus'

Step 4: Attempt Pod Restart

# Rolling restart of control plane
kubectl rollout restart deployment/coder -n coder-system

# Monitor rollout
kubectl rollout status deployment/coder -n coder-system --timeout=300s

# If stuck, force delete problematic pods
kubectl delete pod POD_NAME -n coder-system --grace-period=0 --force

Step 5: Verify Recovery and Update Status

# Health check
curl -s https://cde.company.com/api/v2/health | jq

# Test workspace creation
coder create test-recovery --template minimal --yes
coder delete test-recovery --yes

# Update incident status
/incident update "Control plane recovered. Monitoring for stability."

Workspace Stuck in Starting/Stopping

SEV-3

Step 1: Identify the Stuck Workspace

# List workspaces by status
coder workspaces list --status=starting
coder workspaces list --status=stopping

# Get detailed info
coder workspaces show USERNAME/WORKSPACE_NAME --json | jq '.latest_build'

Step 2: Check Provisioner Logs

# Get build logs
coder workspaces logs USERNAME/WORKSPACE_NAME

# Check provisioner pods
kubectl logs -n coder-system -l app=coder-provisioner --tail=100

# Look for Terraform errors
kubectl logs -n coder-system -l app=coder-provisioner | grep -i error

Step 3: Force Cancel and Retry

# Cancel the stuck build
coder workspaces cancel USERNAME/WORKSPACE_NAME

# For stuck in stopping - force stop the underlying resources
# Kubernetes
kubectl delete pod -n workspaces -l workspace=WORKSPACE_ID --grace-period=0 --force

# Retry the operation
coder workspaces start USERNAME/WORKSPACE_NAME
# or
coder workspaces stop USERNAME/WORKSPACE_NAME

Storage Capacity Critical

SEV-2

Step 1: Identify Storage Usage

# Check PVC requested vs. bound capacity (actual usage requires kubelet volume metrics)
kubectl get pvc -n workspaces -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,CAPACITY:.status.capacity.storage

# Check allocatable ephemeral storage per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,EPHEMERAL:.status.allocatable.ephemeral-storage

# Find largest workspace volumes
kubectl exec -it STORAGE_POD -- df -h
kubectl exec -it STORAGE_POD -- du -sh /data/* | sort -rh | head -20

Step 2: Clean Up Orphaned Resources

# Find orphaned PVCs
kubectl get pvc -n workspaces --no-headers | while read pvc rest; do
  if ! coder workspaces list --all | grep -q "$pvc"; then
    echo "Orphaned: $pvc"
  fi
done

# Clean container images on nodes (needs a node debug container; use crictl on containerd nodes)
kubectl get nodes -o name | xargs -I {} kubectl debug {} --image=busybox -- chroot /host sh -c "docker system prune -af"

# Clean up workspaces unused for 30+ days
# (assumes field 5 of the list output is days since last use; verify against your CLI)
coder workspaces list --all | awk '$5 > 30 {print $1}' | xargs -I {} coder delete {} --yes

Step 3: Expand Storage (if needed)

# Expand PVC (if storage class supports it)
kubectl patch pvc WORKSPACE_PVC -n workspaces -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# Add new storage nodes
# AWS EKS
eksctl scale nodegroup --cluster=cde-cluster --name=storage-nodes --nodes=5

# Update storage quotas in templates
coder templates push TEMPLATE --var disk_size=100

Maintenance Runbooks

Scheduled maintenance and upgrade procedures

Platform Upgrade

Upgrade Coder/Gitpod version

1. Review release notes and breaking changes
2. Back up the database and persistent volumes
3. Notify users of the maintenance window
4. Stop all running workspaces
5. Apply the Helm/Terraform upgrade
6. Run database migrations
7. Verify health and test workspace creation
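
The sequence above can be scripted; this sketch covers steps 4, 5, and 7. The chart name, namespace, and CLI flags are assumptions for your environment, and the binaries are parameterized so the whole thing can be dry-run first:

```shell
#!/usr/bin/env bash
# Sketch of steps 4, 5, and 7 above. Chart name, namespace, and CLI flags
# are assumptions for your environment; override HELM/KUBECTL/CODER
# (e.g. with `echo`) to dry-run the sequence before running it for real.
HELM="${HELM:-helm}"; KUBECTL="${KUBECTL:-kubectl}"; CODER="${CODER:-coder}"

upgrade_platform() {
  local version="$1"
  "$CODER" workspaces stop --all --yes &&
    "$HELM" upgrade coder coder/coder --namespace coder-system --version "$version" &&
    "$KUBECTL" rollout status deployment/coder -n coder-system
}

# Dry-run: print the commands instead of executing them
HELM=echo KUBECTL=echo CODER=echo upgrade_platform 2.9.0
```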

Template Update

Update workspace templates

1. Update the template code in the Git repository
2. Test in a staging environment
3. Push the new template version
4. Notify users of the available update
5. Monitor workspace rebuilds
# Push template update
coder templates push python-dev \
  --directory ./templates/python-dev \
  --message "Updated Python to 3.12"

# Check active versions
coder templates versions python-dev

Certificate Renewal

TLS certificate management

1. Check certificate expiration dates
2. Generate a CSR or trigger Let's Encrypt renewal
3. Update Kubernetes secrets
4. Restart the ingress controller
5. Verify HTTPS connectivity
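
For step 1, a small sketch that computes the days remaining before a PEM certificate expires — GNU `date` is assumed (Linux), and the 30-day demo cert is generated locally rather than fetched from production:

```shell
#!/usr/bin/env bash
# Step 1 sketch: days until a PEM certificate expires.
# Assumes GNU date (Linux); the demo cert below is generated locally.
days_until_expiry() {
  local cert="$1" end
  end=$(openssl x509 -noout -enddate -in "$cert" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Demo: self-signed cert valid for 30 days
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 30 -subj "/CN=cde.company.com" 2>/dev/null
days_until_expiry /tmp/demo.crt   # days remaining (29 or 30 here)
```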

Database Maintenance

PostgreSQL optimization

1. Check database size and growth
2. Run VACUUM ANALYZE
3. Check and rebuild indexes
4. Archive old audit logs
5. Verify backup integrity
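
Steps 1 and 2 above can be sketched as psql invocations. The connection URL is an assumption, and `PSQL` is parameterized so the script can be dry-run with `echo`:

```shell
#!/usr/bin/env bash
# Steps 1-2 sketch: database size check and VACUUM ANALYZE.
# The connection URL is an assumption; override PSQL (e.g. with `echo`)
# to print the commands instead of executing them.
PSQL="${PSQL:-psql}"
DB_URL="${DB_URL:-postgres://coder@coder-db:5432/coder}"

db_maintenance() {
  "$PSQL" "$DB_URL" -c 'SELECT pg_size_pretty(pg_database_size(current_database()));' &&
    "$PSQL" "$DB_URL" -c 'VACUUM (VERBOSE, ANALYZE);'
}

# Dry-run
PSQL=echo db_maintenance
```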

Disaster Recovery Runbooks

Procedures for major outages and data recovery

RTO (Recovery Time Objective): 15 minutes
RPO (Recovery Point Objective): 1 hour
Availability Target: 99.9%

Full Platform Recovery Runbook

When to Use

  • Complete control plane failure
  • Database corruption or loss
  • Cluster-wide infrastructure failure
  • Ransomware or security incident requiring full rebuild

Step 1: Assess Damage and Declare DR

# Confirm primary region is unrecoverable
aws ec2 describe-availability-zones --region us-east-1

# Declare disaster recovery
/incident update "DR declared. Initiating failover to us-west-2"

# Notify stakeholders
# - Executive team
# - All engineering teams
# - Security team

Step 2: Restore Database from Backup

# List available backups
aws rds describe-db-snapshots --db-instance-identifier coder-db

# Restore to DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier coder-db-dr \
  --db-snapshot-identifier rds:coder-db-2024-01-15-00-05 \
  --availability-zone us-west-2a

# Wait for restoration
aws rds wait db-instance-available --db-instance-identifier coder-db-dr

Step 3: Deploy Control Plane to DR Region

# Switch kubectl context to DR cluster
kubectl config use-context dr-cluster-us-west-2

# Deploy Coder with DR database connection
helm upgrade --install coder coder/coder \
  --namespace coder-system \
  --values dr-values.yaml \
  --set postgres.host=coder-db-dr.xxxxx.us-west-2.rds.amazonaws.com

# Verify deployment
kubectl rollout status deployment/coder -n coder-system

Step 4: Update DNS and Load Balancers

# Update Route53 to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch file://dr-dns-change.json

# Verify DNS propagation
dig cde.company.com +short

# Update any hardcoded references in IdP

Step 5: Validate Recovery

# Health check
curl -s https://cde.company.com/api/v2/health | jq

# Test SSO login
# Manually test in browser

# Create test workspace
coder create dr-test --template minimal --yes

# Verify existing user data
coder users list | head -10

# Announce recovery
/incident update "DR complete. CDE operational in us-west-2. Please recreate workspaces."

Scaling Operations

Procedures for scaling CDE infrastructure up or down

Emergency Scale-Up

Use when workspace creation is slow or failing due to resource constraints.

# Check current capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory

# Scale node group (AWS EKS)
eksctl scale nodegroup \
  --cluster=cde-cluster \
  --name=workspace-nodes \
  --nodes=10 \
  --nodes-max=20

# Or re-enable the cluster autoscaler if it was scaled to zero
kubectl scale deployment cluster-autoscaler -n kube-system --replicas=1

# Verify new nodes
kubectl get nodes -w

Caution

Emergency scaling can incur significant costs. Notify the finance team before scaling beyond normal limits.

Scheduled Scale-Down

Use during off-hours or weekends to reduce costs.

# Stop workspaces idle longer than 4 hours
# (assumes field 6 of the list output is idle hours; verify against your CLI)
coder workspaces list --status=running | \
  awk '$6 > 4 {print $1}' | \
  xargs -I {} coder workspaces stop {}

# Cordon nodes to prevent new scheduling
kubectl cordon NODE_NAME

# Drain workloads (with grace period)
kubectl drain NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Scale down node group
eksctl scale nodegroup \
  --cluster=cde-cluster \
  --name=workspace-nodes \
  --nodes=3

Automation

Consider using Kubernetes CronJobs or AWS Scheduled Scaling for automated off-hours scaling.
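
As a sketch, a Kubernetes CronJob for the off-hours scale-down might look like the following. The schedule, namespace, image, service account, and CLI flags are all assumptions for your environment:

```yaml
# Sketch: weekday-evening scale-down via a Kubernetes CronJob.
# Schedule, namespace, image, service account, and flags are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cde-offhours-scaledown
  namespace: coder-system
spec:
  schedule: "0 20 * * 1-5"   # 20:00 on weekdays, cluster timezone
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cde-scaler   # needs permission to stop workspaces
          restartPolicy: Never
          containers:
            - name: scaledown
              image: ghcr.io/coder/coder:latest   # any image with the coder CLI
              command: ["sh", "-c", "coder workspaces stop --all --yes"]
```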

Runbook Template

Use this template to create new runbooks for your team

# Runbook: [TITLE]

## Overview
- **Purpose**: [What does this runbook accomplish?]
- **Audience**: [Who should use this runbook?]
- **Estimated Time**: [How long does this take?]
- **Severity/Priority**: [SEV-1/2/3/4]

## Prerequisites
- [ ] Access to X system
- [ ] Knowledge of Y
- [ ] Required tools: Z

## Symptoms / When to Use
- Symptom 1
- Symptom 2

## Steps

### Step 1: [Action Name]
**Description**: [What are we doing and why?]

```bash
# Commands to run
command here
```

**Expected Output**: [What should we see?]

**If Failed**: [What to do if this step fails]

### Step 2: [Action Name]
...

## Verification
- [ ] Check 1
- [ ] Check 2

## Rollback Procedure
If something goes wrong:
1. Step 1
2. Step 2

## Related Runbooks
- Link to related runbook 1
- Link to related runbook 2

## Revision History
| Date | Author | Changes |
|------|--------|---------|
| 2024-01-15 | @username | Initial version |