
Architecture & Infrastructure Design

Reference architectures, cloud deployments, network design, high availability, and best practices for production-ready CDE infrastructure

Reference Architectures

Common deployment patterns for Cloud Development Environments across different infrastructure types

Self-Hosted on Kubernetes

The most popular architecture for enterprise deployments. Provides maximum control, scalability, and integration with existing Kubernetes infrastructure.

Architecture Flow:
1. Developer Access Layer: Load Balancer -> Ingress Controller -> CDE Control Plane (authentication, workspace management)
2. Control Plane Components: PostgreSQL (metadata), Provisioner (Terraform executor), API Server, Web UI
3. Workspace Layer: Kubernetes namespace per workspace -> Pods with persistent volumes -> Private registry for custom images
4. External Integrations: Git repositories, container registries, cloud resources (databases, S3), monitoring systems
Advantages
  • Horizontal scaling with pod autoscaling
  • Native multi-tenancy with namespaces
  • Resource quotas and limits enforcement
  • Built-in health checks and self-healing
Requirements
  • Kubernetes 1.24+ cluster
  • Storage class with dynamic provisioning
  • Ingress controller (NGINX, Traefik)
  • PostgreSQL 13+ database
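The "storage class with dynamic provisioning" requirement above can be sketched as a manifest. This example assumes the AWS EBS CSI driver; the class name is illustrative:

```yaml
# Illustrative StorageClass for per-workspace volumes (assumes the AWS EBS CSI driver)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: workspace-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
# Delay binding until the workspace pod is scheduled, so the volume
# is created in the same availability zone as the node
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain   # keep developer data even if the claim is deleted
```

Workspace PersistentVolumeClaims would then reference storageClassName: workspace-storage.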

Self-Hosted on Docker

Simpler deployment for smaller teams or single-host setups. Lower overhead than Kubernetes but with reduced scalability.

Architecture Flow:
1. Reverse Proxy: NGINX or Caddy -> Routes to CDE control plane container
2. CDE Server Container: Single container with API, provisioner, and web UI -> PostgreSQL container for state
3. Workspace Containers: Docker-in-Docker for workspace isolation -> Named volumes for persistence
Best For
  • Small teams (under 50 developers)
  • Quick proof-of-concept deployments
  • Development/testing environments
Limitations
  • Single host - no horizontal scaling
  • Less robust resource isolation
  • Manual failover required
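The single-host layout above can be sketched as a Compose file. Image and variable names for the CDE server are placeholders, not a real product's configuration:

```yaml
# docker-compose.yml: illustrative single-host CDE deployment
services:
  proxy:
    image: caddy:2
    ports:
      - "443:443"
    volumes:
      - caddy_data:/data
  cde-server:
    image: example/cde-server:latest   # placeholder; substitute your vendor's image
    environment:
      CDE_DB_URL: "postgres://cde:changeme@db:5432/cde"   # hypothetical variable name
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # lets the server launch workspace containers
    depends_on:
      - db
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: cde
      POSTGRES_USER: cde
      POSTGRES_PASSWORD: changeme
    volumes:
      - db_data:/var/lib/postgresql/data
volumes:
  caddy_data:
  db_data:
```

Mounting the Docker socket is what gives the server the ability to provision workspace containers on the same host; it is also why resource isolation is weaker than on Kubernetes.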

Managed SaaS Architecture

Fully managed service (e.g., GitHub Codespaces, Gitpod) with zero infrastructure management required.

Architecture Flow:
1. Developer Authentication: OAuth/SSO -> Vendor-managed control plane (GitHub, Gitpod, etc.)
2. Workspace Provisioning: Vendor's global infrastructure -> Auto-scaling compute pools -> CDN-backed file systems
3. Access Methods: Web IDE (browser), VS Code desktop, SSH tunneling
Pros
  • Zero infrastructure ops
  • Global edge network
  • Automatic updates
Cons
  • Data leaves your network
  • Less customization
  • Vendor lock-in
Use Case
  • Open-source projects
  • Rapid onboarding
  • No compliance needs

Hybrid Architecture

Combines self-hosted control plane with cloud-based workspaces, or uses SaaS for non-sensitive projects alongside self-hosted for regulated workloads.

Common Patterns:
Self-Hosted Control + Cloud Workspaces
CDE control plane in your VPC, workspaces auto-scale in multiple clouds based on region/cost
SaaS for Public + Self-Hosted for Secrets
GitHub Codespaces for open-source, Coder self-hosted for HIPAA/SOC2 workloads

Cloud Provider Deployments

Architecture considerations for major cloud platforms

Amazon Web Services (AWS)

Three main deployment options for AWS

Amazon EKS
Managed Kubernetes - Production recommended
  • Managed control plane
  • Fargate or EC2 node groups
  • Native VPC networking
  • EBS for persistent volumes
  • Higher cost than ECS
Amazon ECS
Container orchestration - Cost-effective
  • Simpler than Kubernetes
  • Fargate serverless option
  • Lower operational overhead
  • EFS for shared storage
  • Less ecosystem tooling
Amazon EC2
Direct VMs - Maximum control
  • Full OS customization
  • GPU instance types available
  • Best for Windows workloads
  • Spot instances for cost savings
  • Manual scaling management
AWS Best Practice
Use EKS for teams over 100 developers, ECS for smaller teams or cost-focused deployments, EC2 direct for specialized workloads (AI/ML with GPUs, Windows development).

Microsoft Azure

Three deployment strategies for Azure

Azure AKS
Kubernetes Service - Enterprise standard
  • Azure AD integration
  • Virtual nodes (serverless)
  • Azure Files/Disks storage
  • Policy-based governance
Container Apps
Serverless containers - Simplified ops
  • Scale-to-zero capability
  • Built on Kubernetes (KEDA)
  • Simple HTTP ingress
  • Limited customization
Azure VMs
Virtual machines - Legacy workloads
  • Windows Server support
  • Visual Studio licensing
  • Hybrid cloud connectivity
  • Higher cost per workspace

Google Cloud Platform (GCP)

Three GCP deployment models

Google GKE
Kubernetes Engine - Most mature K8s
  • Autopilot mode (serverless)
  • Fastest Kubernetes updates
  • Native Google services
  • Persistent Disk CSI driver
Cloud Run
Serverless containers - Pay-per-use
  • Instant scale from zero
  • Per-request billing
  • Integrated Cloud Build
  • 15-minute timeout limit
Compute Engine
Virtual machines - Custom infrastructure
  • Custom machine types
  • Preemptible instances
  • Persistent Disk snapshots
  • Manual orchestration

On-Premises Deployment

Self-hosted infrastructure options

Kubernetes (Rancher, OpenShift)
Most flexible, requires expertise
  • Full control over data
  • Air-gapped support
VMware vSphere
Enterprise virtualization
  • Existing VM infrastructure
  • vMotion for HA
Bare Metal
Maximum performance
  • No hypervisor overhead
  • GPU workloads

Network Architecture

Secure, scalable network design for Cloud Development Environments

VPC Design & Isolation

Recommended VPC Layout
1. Public Subnet: Load balancer, NAT gateway
2. Private Subnet (App): CDE control plane, ingress controller
3. Private Subnet (Workspaces): Developer workspace pods/containers
4. Private Subnet (Data): PostgreSQL (e.g., RDS), Redis cache
Isolation Best Practices
  • Separate VPC for CDE (do not mix with production apps)
  • Use VPC peering or Transit Gateway for multi-VPC access
  • Enable VPC Flow Logs for audit trails
  • Implement network policies in Kubernetes
  • Use private DNS zones for internal service discovery

Security Groups & Firewall Rules

Layer         | Inbound                            | Outbound                           | Purpose
Load Balancer | 443 (HTTPS) from 0.0.0.0/0         | To control plane on 443            | Public HTTPS access
Control Plane | 443 from LB, 5432 from workspaces  | To DB, to K8s API                  | API & orchestration
Workspaces    | SSH (22) from control plane only   | Internet via NAT (egress control)  | Developer access
Database      | 5432 from control plane only       | None                               | Metadata storage

Private Endpoints & Service Connect

Access cloud services without public internet exposure using private connectivity.

AWS PrivateLink
  • S3 via VPC endpoint
  • ECR (Docker registry)
  • Secrets Manager
  • Systems Manager
Azure Private Link
  • Blob Storage endpoint
  • Azure Container Registry
  • Key Vault
  • SQL Database
GCP Private Service Connect
  • Cloud Storage private access
  • Artifact Registry
  • Secret Manager
  • Cloud SQL

Egress Controls & Proxy Configuration

Data Exfiltration Prevention
  • Block public package registries (npm, PyPI, Maven Central) - use private mirrors
  • Allowlist git domains - only your GitHub/GitLab enterprise
  • DLP scanning on egress - detect secrets, PII in outbound traffic
  • Disable direct SSH/RDP out - require bastion host
HTTP(S) Proxy Setup
# Workspace environment variables
export HTTP_PROXY=http://proxy.corp:3128
export HTTPS_PROXY=http://proxy.corp:3128
export NO_PROXY=localhost,127.0.0.1,.internal

# For Docker-in-Docker workspaces: place in ~/.docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy.corp:3128",
      "httpsProxy": "http://proxy.corp:3128"
    }
  }
}
Proxy logs all outbound requests for audit compliance

Zero-Trust Network Patterns

Identity-Based Access
  • Mutual TLS (mTLS) between services
  • Service mesh (Istio, Linkerd) for policy enforcement
  • SPIFFE/SPIRE for workload identity
  • No trust based on network location
Micro-Segmentation
  • Kubernetes NetworkPolicies per namespace
  • Calico for advanced policy rules
  • Default deny - explicitly allow required traffic
  • Separate dev/staging/prod workspaces by namespace
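The default-deny pattern above can be sketched as a pair of NetworkPolicies. The namespace name is illustrative; a real deployment would add further allow rules for the control plane, Git, and package mirrors:

```yaml
# Default deny for a workspace namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-backend
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
---
# Explicitly allow DNS so workspaces can still resolve names
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: team-backend
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

With default deny in place, every additional flow (workspace -> control plane, workspace -> proxy) must be allowed explicitly, which is exactly the audit trail zero-trust designs want.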

High Availability & Disaster Recovery

Design patterns for resilient CDE deployments

Multi-Region Deployment

Active-Active Pattern
Workspaces distributed across regions, global load balancer routes to nearest healthy region
  • No downtime during regional failure
  • Reduced latency for global teams
  • Complex: Requires database replication
Active-Passive Pattern
Primary region serves all traffic, secondary region on standby with replicated data
  • Simpler to implement
  • Lower cost (standby can be smaller)
  • RTO: 5-15 minutes for failover

Failover Strategies

Database Failover
  • PostgreSQL streaming replication
  • Read replicas in multiple AZs
  • Automatic failover with Patroni/Stolon
  • RDS Multi-AZ for managed option
Control Plane Failover
  • Kubernetes Deployment with 3+ replicas
  • Anti-affinity rules (spread across AZs)
  • Health checks with auto-restart
  • Stateless design (state in DB only)
Workspace Recovery
  • Persistent volumes with multi-AZ replication
  • Workspace state stored in external Git
  • Automatic re-provisioning on node failure
  • User data loss: none (Git + PV backups)
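The control-plane pattern (3+ replicas, AZ spread, health checks) can be sketched in one Deployment. Image name, port, and probe path are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cde-server
spec:
  replicas: 3                  # survive the loss of any single replica
  selector:
    matchLabels:
      app: cde-server
  template:
    metadata:
      labels:
        app: cde-server
    spec:
      # Spread replicas across availability zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: cde-server
      containers:
        - name: server
          image: example/cde-server:latest   # placeholder image
          ports:
            - containerPort: 8080
          # Health check with auto-restart on failure
          livenessProbe:
            httpGet:
              path: /healthz                 # assumed health endpoint
              port: 8080
            periodSeconds: 10
```

Because the design is stateless (state lives in PostgreSQL), Kubernetes can restart or reschedule any replica without losing workspace metadata.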

Backup & Recovery

What to Back Up
  • PostgreSQL database - all CDE metadata (users, workspaces, templates)
  • Persistent volumes - workspace home directories (/home/coder)
  • Terraform templates - version controlled in Git
  • Configuration files - Kubernetes manifests, Helm values
Backup Tools
Database
pg_dump, Velero, AWS Backup
Volumes
Velero, Restic, cloud snapshots
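As one concrete option, a nightly Velero schedule covering the control-plane namespace might look like this (namespace names are illustrative):

```yaml
# Illustrative Velero schedule: nightly backup with volume snapshots
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cde-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 02:00 daily, cron syntax
  template:
    includedNamespaces:
      - coder                    # assumed CDE namespace
    snapshotVolumes: true        # also snapshot workspace persistent volumes
    ttl: 720h                    # retain backups for 30 days
```

Pair this with periodic pg_dump exports of the metadata database so the two backup paths can be restored independently.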

RPO/RTO Considerations

Tier                               | RPO          | RTO          | Cost
Critical (production workspaces)   | Under 1 hour | Under 15 min | High
Standard (most development)        | 4-8 hours    | 1-2 hours    | Medium
Low-Priority (testing/experiments) | 24 hours     | 4+ hours     | Low
RPO = Recovery Point Objective (max data loss), RTO = Recovery Time Objective (max downtime)

Storage Architecture

Persistent storage patterns for workspaces and shared data

Workspace Persistent Storage

Per-Workspace Volumes
Each workspace gets its own persistent volume for /home/coder directory
AWS
EBS gp3 volumes
Azure
Azure Managed Disks
GCP
Persistent Disk SSD
Size Recommendations
  • Web development (Node.js, Python): 20-50 GB
  • Backend (Java, .NET, Go): 50-100 GB
  • Data science (ML models): 100-500 GB
  • AI/ML training (large datasets): 500 GB - 2 TB
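A per-workspace home volume from the sizing table above reduces to a single PersistentVolumeClaim; the storage class name is an assumption:

```yaml
# Illustrative claim for one workspace's /home/coder directory
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-home
spec:
  accessModes:
    - ReadWriteOnce            # one workspace pod at a time
  storageClassName: workspace-storage   # assumed dynamic-provisioning class
  resources:
    requests:
      storage: 50Gi            # backend-development tier from the table above
```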

Shared File Systems

Team Collaboration Storage
Shared volumes for team datasets, models, build caches
AWS EFS
NFS protocol, scales automatically, multi-AZ
Azure Files
SMB/NFS, Premium tier for IOPS-heavy workloads
GCP Filestore
Managed NFS, High Scale tier for large teams
Use Cases
  • Shared ML training datasets (read-only mount)
  • npm/Maven cache (speeds up builds)
  • Docker layer cache (BuildKit)
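The shared-cache use case differs from workspace homes in one key way: it needs a ReadWriteMany volume. A sketch, assuming an RWX-capable storage class such as an EFS CSI class named efs-sc:

```yaml
# Illustrative shared package cache mounted by many workspaces at once
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: npm-cache
spec:
  accessModes:
    - ReadWriteMany            # requires NFS-style storage (EFS, Azure Files, Filestore)
  storageClassName: efs-sc     # assumed EFS CSI storage class name
  resources:
    requests:
      storage: 100Gi
```

Mount it read-write at the package manager's cache path in each workspace; for shared ML datasets, mount the equivalent claim read-only.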

Database Access Patterns

Development Databases
Developers need ephemeral databases for testing. Two approaches:
Sidecar Container
PostgreSQL/MySQL container in same pod, auto-destroyed with workspace
Managed DB Pool
RDS/CloudSQL instances provisioned on-demand, returned to pool when done
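The sidecar approach can be sketched as a two-container pod. An emptyDir volume backs the database, so its lifecycle matches the workspace's, which is the point:

```yaml
# Illustrative sidecar: ephemeral Postgres living and dying with the workspace
apiVersion: v1
kind: Pod
metadata:
  name: workspace-with-db
spec:
  containers:
    - name: workspace
      image: example/workspace:latest          # placeholder workspace image
      env:
        - name: DATABASE_URL
          value: "postgres://dev:dev@localhost:5432/dev"   # same pod = localhost
    - name: postgres
      image: postgres:15
      env:
        - name: POSTGRES_USER
          value: dev
        - name: POSTGRES_PASSWORD
          value: dev
        - name: POSTGRES_DB
          value: dev
      volumeMounts:
        - name: pgdata
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: pgdata
      emptyDir: {}     # data is deleted when the workspace pod is destroyed
```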
Production DB Access
  • Never allow direct production DB writes from workspaces
  • Read-only replicas for debugging (with auditing)
  • Use cloud IAM roles (no hardcoded passwords)

Cache Layers

Build Cache
  • Docker BuildKit Cache
    Store build layers in shared registry (ECR, ACR, GCR) with --cache-from flag
  • Package Manager Cache
    npm/yarn/pip cache on EFS/Azure Files - shared across workspaces (saves bandwidth)
Application Cache
Redis Cluster
Session cache, rate limiting
Memcached
Object cache, query results

Scaling Patterns

Auto-scaling strategies for workspaces and infrastructure

Horizontal Pod Autoscaling (HPA)

Automatically scale the number of CDE control plane pods based on CPU/memory utilization or custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  • Scale up during peak hours (9am-5pm)
  • Scale down at night/weekends to save cost
  • Use custom metrics (workspace provision rate)

Cluster Node Autoscaling

Automatically add or remove Kubernetes worker nodes when workspaces need more capacity.
Cloud-Specific Solutions
AWS
EKS Cluster Autoscaler or Karpenter
Azure
AKS Virtual Nodes + autoscaler
GCP
GKE Autopilot (serverless nodes)
Node Pool Strategy
  • General pool: Standard workspaces (4 CPU, 8GB RAM)
  • GPU pool: ML workspaces (NVIDIA T4/A100)
  • Spot/preemptible pool: Cost-sensitive workloads
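The GPU-pool strategy above implies the workspace pod must opt in via scheduling constraints. A sketch, assuming GPU nodes are labeled pool=gpu and tainted with nvidia.com/gpu:

```yaml
# Illustrative ML workspace pinned to the GPU node pool
apiVersion: v1
kind: Pod
metadata:
  name: ml-workspace
spec:
  nodeSelector:
    pool: gpu                      # assumes nodes carry this label
  tolerations:
    - key: nvidia.com/gpu          # assumes GPU nodes are tainted with this key
      operator: Exists
      effect: NoSchedule
  containers:
    - name: workspace
      image: example/ml-workspace:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # requires the NVIDIA device plugin on the node
```

The taint/toleration pair keeps ordinary workspaces off expensive GPU nodes, while the autoscaler only adds GPU capacity when a tolerating pod is pending.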

Resource Quotas & Limits

Prevent resource hogging and ensure fair sharing across teams using Kubernetes resource quotas.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    persistentvolumeclaims: "20"
    pods: "50"
This quota caps the team-backend namespace at 50 CPUs and 100 GiB of memory in requests (100 CPUs / 200 GiB in limits) across all workspaces.

Multi-Tenant Considerations

Namespace Isolation
  • One namespace per team or project
  • Network policies prevent cross-namespace traffic
  • RBAC limits what users can see/modify
Cost Allocation
  • Label all resources with team/cost-center tags
  • Use Kubecost or cloud billing for chargeback
  • Alert when team exceeds budget threshold

Monitoring & Observability

Essential metrics, logging, and alerting for production CDE deployments

Key Metrics to Collect

Workspace Metrics
  • Active workspaces count
  • Workspace provision time (p50, p95, p99)
  • Workspace uptime/idle time
  • Failed workspace starts (errors)
  • Workspace CPU/memory utilization
Infrastructure Metrics
  • Kubernetes node CPU/memory
  • Persistent volume usage
  • Network throughput (egress/ingress)
  • Database connection pool size
  • API request latency
User Experience Metrics
  • Login success rate
  • Time to first workspace (TTFWS)
  • VS Code/IDE connection failures
  • User session duration

Logging Architecture

Log Aggregation Stack
ELK Stack
Elasticsearch + Logstash + Kibana (classic, resource-heavy)
Loki + Grafana
Lightweight, Prometheus-like for logs (recommended)
Cloud Native
CloudWatch Logs, Azure Monitor, GCP Logging
What to Log
  • CDE control plane logs (API requests, provisioner actions)
  • Kubernetes events (pod starts, failures, OOMKills)
  • Workspace startup logs (Terraform apply output)
  • Authentication logs (SSO, failed logins)
  • Do NOT log user code execution (privacy risk)

Alerting Strategies

Critical Alerts (Page On-Call)
  • CDE control plane down (all workspaces inaccessible)
  • Database connection failures (metadata loss risk)
  • Workspace provision failure rate over 20%
  • Cluster autoscaler unable to add nodes
Warning Alerts (Slack/Email)
  • High memory usage on nodes (over 85%)
  • Persistent volume usage over 80%
  • Workspace provision time exceeds 5 minutes
  • SSL certificate expiring in under 7 days
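The "provision failure rate over 20%" critical alert can be sketched as a Prometheus rule. The metric names here are assumptions; substitute whatever your CDE control plane actually exports:

```yaml
# Illustrative Prometheus alerting rule (metric names are assumed, not standard)
groups:
  - name: cde-critical
    rules:
      - alert: WorkspaceProvisionFailureRateHigh
        expr: |
          sum(rate(workspace_provision_failures_total[15m]))
            / sum(rate(workspace_provision_attempts_total[15m])) > 0.20
        for: 10m                       # sustained, not a single blip
        labels:
          severity: critical           # routes to the paging receiver
        annotations:
          summary: "Workspace provision failure rate above 20%"
```

Routing on the severity label in Alertmanager keeps the page/Slack split described above declarative rather than hard-coded per rule.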

Dashboard Recommendations

Platform Health Dashboard
  • Control plane uptime SLA
  • Active workspace count
  • API request rate
  • Database query latency
  • Kubernetes cluster health
  • Network egress costs
Developer Experience Dashboard
  • Avg time to provision workspace
  • Workspace start success rate
  • IDE connection failures
  • User login errors
  • Top 10 slowest workspaces
  • Resource quota violations
Use Grafana with Prometheus for metrics, Loki for logs, and Jaeger for distributed tracing