Architecture & Infrastructure Design
Reference architectures, cloud deployments, network design, high availability, and best practices for production-ready CDE infrastructure
Reference Architectures
Common deployment patterns for Cloud Development Environments across different infrastructure types
Self-Hosted on Kubernetes
The most popular architecture for enterprise deployments. Provides maximum control, scalability, and integration with existing Kubernetes infrastructure.
Benefits:
- Horizontal scaling with pod autoscaling
- Native multi-tenancy with namespaces
- Resource quotas and limits enforcement
- Built-in health checks and self-healing

Prerequisites:
- Kubernetes 1.24+ cluster
- Storage class with dynamic provisioning
- Ingress controller (NGINX, Traefik)
- PostgreSQL 13+ database
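As a sketch, a Helm values file wiring these prerequisites together might look like the following. Key names are illustrative and vary by product and chart; the CODER_PG_CONNECTION_URL variable assumes a Coder-style deployment.

```yaml
# Illustrative Helm values for a Kubernetes-hosted CDE control plane.
# Key names vary by product/chart; treat this as a sketch, not a reference.
coder:
  replicaCount: 3                     # HA: multiple replicas across nodes/AZs
  env:
    - name: CODER_PG_CONNECTION_URL   # external PostgreSQL 13+ database
      valueFrom:
        secretKeyRef:
          name: coder-db-url
          key: url
  ingress:
    enable: true
    host: cde.example.com             # exposed via NGINX/Traefik ingress
  persistence:
    storageClassName: gp3             # storage class with dynamic provisioning
```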
Self-Hosted on Docker
Simpler deployment for smaller teams or single-host setups. Lower overhead than Kubernetes but with reduced scalability.
Best suited for:
- Small teams (under 50 developers)
- Quick proof-of-concept deployments
- Development/testing environments

Limitations:
- Single host - no horizontal scaling
- Less robust resource isolation
- Manual failover required
Managed SaaS Architecture
Fully managed service (e.g., GitHub Codespaces, Gitpod) - zero infrastructure management required.

Benefits:
- Zero infrastructure ops
- Global edge network
- Automatic updates

Trade-offs:
- Data leaves your network
- Less customization
- Vendor lock-in

Best suited for:
- Open-source projects
- Rapid onboarding
- Teams without compliance requirements
Hybrid Architecture
Combines self-hosted control plane with cloud-based workspaces, or uses SaaS for non-sensitive projects alongside self-hosted for regulated workloads.
Cloud Provider Deployments
Architecture considerations for major cloud platforms
Amazon Web Services (AWS)
Three main deployment options on AWS:

Amazon EKS (managed Kubernetes):
- Managed control plane
- Fargate or EC2 node groups
- Native VPC networking
- EBS for persistent volumes
- Higher cost than ECS

Amazon ECS:
- Simpler than Kubernetes
- Fargate serverless option
- Lower operational overhead
- EFS for shared storage
- Less ecosystem tooling

EC2 virtual machines:
- Full OS customization
- GPU instance types available
- Best for Windows workloads
- Spot instances for cost savings
- Manual scaling management
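For the EKS option, a cluster that runs the CDE control plane on Fargate and workspaces on an EC2 node group can be sketched with an eksctl config. Names, region, and sizes are illustrative:

```yaml
# Illustrative eksctl config: Fargate for the CDE control plane,
# a managed EC2 node group for workspace pods
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cde-cluster
  region: us-east-1
fargateProfiles:
  - name: control-plane
    selectors:
      - namespace: coder        # assumes the CDE runs in this namespace
managedNodeGroups:
  - name: workspaces
    instanceType: m5.xlarge
    minSize: 2
    maxSize: 10
    volumeSize: 100             # GB of EBS per node for workspace volumes
```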
Microsoft Azure
Three deployment strategies on Azure:

Azure Kubernetes Service (AKS):
- Azure AD integration
- Virtual nodes (serverless)
- Azure Files/Disks storage
- Policy-based governance

Azure Container Apps:
- Scale-to-zero capability
- Built on Kubernetes (KEDA)
- Simple HTTP ingress
- Limited customization

Azure Virtual Machines:
- Windows Server support
- Visual Studio licensing
- Hybrid cloud connectivity
- Higher cost per workspace
Google Cloud Platform (GCP)
Three GCP deployment models:

Google Kubernetes Engine (GKE):
- Autopilot mode (serverless)
- Fastest Kubernetes updates
- Native Google services integration
- Persistent Disk CSI driver

Cloud Run:
- Instant scale from zero
- Per-request billing
- Integrated Cloud Build
- 15-minute timeout limit

Compute Engine VMs:
- Custom machine types
- Preemptible instances
- Persistent Disk snapshots
- Manual orchestration
On-Premises Deployment
Self-hosted infrastructure options
Benefits:
- Full control over data
- Air-gapped support

VMware vSphere:
- Existing VM infrastructure
- vMotion for HA

Bare metal:
- No hypervisor overhead
- GPU workloads
Network Architecture
Secure, scalable network design for Cloud Development Environments
VPC Design & Isolation
- Separate VPC for CDE (do not mix with production apps)
- Use VPC peering or Transit Gateway for multi-VPC access
- Enable VPC Flow Logs for audit trails
- Implement network policies in Kubernetes
- Use private DNS zones for internal service discovery
Security Groups & Firewall Rules
| Layer | Inbound | Outbound | Purpose |
|---|---|---|---|
| Load Balancer | 443 (HTTPS) from 0.0.0.0/0 | To control plane on 443 | Public HTTPS access |
| Control Plane | 443 from LB, 5432 from workspaces | To DB, to K8s API | API & orchestration |
| Workspaces | SSH (22) from control plane only | Internet via NAT (egress control) | Developer access |
| Database | 5432 from control plane only | None | Metadata storage |
Private Endpoints & Service Connect
Access cloud services without public internet exposure using private connectivity.
AWS:
- S3 via VPC endpoint
- ECR (Docker registry)
- Secrets Manager
- Systems Manager

Azure:
- Blob Storage endpoint
- Azure Container Registry
- Key Vault
- SQL Database

GCP:
- Cloud Storage private access
- Artifact Registry
- Secret Manager
- Cloud SQL
Egress Controls & Proxy Configuration
- Block public package registries (npm, PyPI, Maven Central) - use private mirrors
- Allowlist git domains - only your GitHub/GitLab enterprise
- DLP scanning on egress - detect secrets, PII in outbound traffic
- Disable direct SSH/RDP out - require bastion host
```bash
# Workspace environment variables
export HTTP_PROXY=http://proxy.corp:3128
export HTTPS_PROXY=http://proxy.corp:3128
export NO_PROXY=localhost,127.0.0.1,.internal
```

For Docker-in-Docker workspaces (~/.docker/config.json):

```json
{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy.corp:3128",
      "httpsProxy": "http://proxy.corp:3128"
    }
  }
}
```
Zero-Trust Network Patterns
- Mutual TLS (mTLS) between services
- Service mesh (Istio, Linkerd) for policy enforcement
- SPIFFE/SPIRE for workload identity
- No trust based on network location
- Kubernetes NetworkPolicies per namespace
- Calico for advanced policy rules
- Default deny - explicitly allow required traffic
- Separate dev/staging/prod workspaces by namespace
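The default-deny pattern above can be expressed as a pair of Kubernetes NetworkPolicies. The namespace and the proxy address are illustrative:

```yaml
# Deny all ingress and egress in the workspace namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-backend
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# Then explicitly allow what workspaces need: DNS and the egress proxy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-proxy
  namespace: team-backend
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - ports:                      # DNS (no "to" clause = any destination)
        - protocol: UDP
          port: 53
    - to:
        - ipBlock:
            cidr: 10.0.5.10/32    # illustrative proxy address
      ports:
        - protocol: TCP
          port: 3128
```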
High Availability & Disaster Recovery
Design patterns for resilient CDE deployments
Multi-Region Deployment
Active-active (multi-region):
- No downtime during regional failure
- Reduced latency for global teams
- Complex: requires database replication

Active-passive (warm standby):
- Simpler to implement
- Lower cost (standby can be smaller)
- RTO: 5-15 minutes for failover
Failover Strategies
Database:
- PostgreSQL streaming replication
- Read replicas in multiple AZs
- Automatic failover with Patroni/Stolon
- RDS Multi-AZ for managed option

Control plane:
- Kubernetes Deployment with 3+ replicas
- Anti-affinity rules (spread across AZs)
- Health checks with auto-restart
- Stateless design (state in DB only)

Workspaces:
- Persistent volumes with multi-AZ replication
- Workspace state stored in external Git
- Automatic re-provisioning on node failure
- User data loss: none (Git + PV backups)
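The anti-affinity and health-check points for the control plane can be sketched with a topology spread constraint in the Deployment. The image and probe path are illustrative:

```yaml
# Spread 3 control-plane replicas across availability zones
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coder-server
  template:
    metadata:
      labels:
        app: coder-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: coder-server
      containers:
        - name: coder
          image: ghcr.io/coder/coder:latest   # illustrative image
          readinessProbe:                     # health check; failed pods restart
            httpGet:
              path: /healthz                  # illustrative probe path
              port: 8080
```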
Backup & Recovery
- PostgreSQL database - all CDE metadata (users, workspaces, templates)
- Persistent volumes - workspace home directories (/home/coder)
- Terraform templates - version controlled in Git
- Configuration files - Kubernetes manifests, Helm values
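A scheduled metadata backup can be sketched as a Kubernetes CronJob running pg_dump. The secret, database name, and PVC are illustrative:

```yaml
# Nightly pg_dump of the CDE metadata database
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cde-db-backup
spec:
  schedule: "0 2 * * *"                       # 02:00 daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15
              envFrom:
                - secretRef:
                    name: cde-db-credentials  # supplies PGHOST/PGUSER/PGPASSWORD
              # Shipping the dump to object storage (aws s3 cp, az storage
              # blob upload, ...) is provider-specific and omitted here.
              command: ["/bin/sh", "-c", "pg_dump -Fc coder > /backups/coder-latest.dump"]
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: cde-backups        # illustrative backup PVC
```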
RPO/RTO Considerations
| Tier | Scope | RPO | RTO | Cost |
|---|---|---|---|---|
| Critical | Production workspaces | Under 1 hour | Under 15 min | High |
| Standard | Most development | 4-8 hours | 1-2 hours | Medium |
| Low-priority | Testing/experiments | 24 hours | 4+ hours | Low |
Storage Architecture
Persistent storage patterns for workspaces and shared data
Workspace Persistent Storage
- Web development (Node.js, Python): 20-50 GB
- Backend (Java, .NET, Go): 50-100 GB
- Data science (ML models): 100-500 GB
- AI/ML training (large datasets): 500 GB - 2 TB
Shared File Systems
- Shared ML training datasets (read-only mount)
- npm/Maven cache (speeds up builds)
- Docker layer cache (BuildKit)
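A shared dataset mount like the one above can be sketched as a ReadOnlyMany PersistentVolumeClaim backed by a shared file system. The storage class name is illustrative:

```yaml
# Shared ML dataset volume, mounted read-only by many workspaces
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-datasets
spec:
  accessModes: [ReadOnlyMany]   # needs an RWX/ROX-capable backend (EFS, Azure Files, Filestore)
  storageClassName: efs-sc      # illustrative storage class
  resources:
    requests:
      storage: 500Gi
```

Workspace pods would then mount this claim with readOnly: true so no single workspace can corrupt the shared data.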
Database Access Patterns
- Never allow direct production DB writes from workspaces
- Read-only replicas for debugging (with auditing)
- Use cloud IAM roles (no hardcoded passwords)
Cache Layers
- Docker BuildKit cache - store build layers in a shared registry (ECR, ACR, GCR) with the --cache-from flag
- Package manager cache - npm/yarn/pip cache on EFS/Azure Files, shared across workspaces (saves bandwidth)
Scaling Patterns
Auto-scaling strategies for workspaces and infrastructure
Horizontal Pod Autoscaling (HPA)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
- Scale up during peak hours (9am-5pm)
- Scale down at night/weekends to save cost
- Use custom metrics (workspace provision rate)
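Time-based scaling like this can be sketched with KEDA's cron scaler. The schedule, timezone, and replica counts are illustrative:

```yaml
# Scale the control plane up during business hours, down otherwise
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: coder-server-hours
spec:
  scaleTargetRef:
    name: coder-server
  minReplicaCount: 1              # nights and weekends
  maxReplicaCount: 10
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 9 * * 1-5        # 9am weekdays
        end: 0 17 * * 1-5         # 5pm weekdays
        desiredReplicas: "5"
```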
Cluster Node Autoscaling
- General pool: Standard workspaces (4 CPU, 8GB RAM)
- GPU pool: ML workspaces (NVIDIA T4/A100)
- Spot/preemptible pool: Cost-sensitive workloads
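Routing workspaces to the right pool is typically done with node labels and taints. A GPU workspace pod spec might include the following; the label and image are illustrative, and the GPU resource requires the NVIDIA device plugin:

```yaml
# Pin an ML workspace to the GPU pool; the taint keeps general pods off it
apiVersion: v1
kind: Pod
metadata:
  name: ml-workspace
spec:
  nodeSelector:
    pool: gpu                     # illustrative node label on the GPU pool
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: workspace
      image: ghcr.io/example/ml-workspace:latest   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1       # scheduled via the NVIDIA device plugin
```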
Resource Quotas & Limits
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-backend-quota
namespace: team-backend
spec:
hard:
requests.cpu: "50"
requests.memory: 100Gi
limits.cpu: "100"
limits.memory: 200Gi
persistentvolumeclaims: "20"
pods: "50"
Multi-Tenant Considerations
- One namespace per team or project
- Network policies prevent cross-namespace traffic
- RBAC limits what users can see/modify
- Label all resources with team/cost-center tags
- Use Kubecost or cloud billing for chargeback
- Alert when team exceeds budget threshold
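Namespace-scoped RBAC for a team can be sketched with a RoleBinding to the built-in edit role. The group name is illustrative and would come from your SSO/OIDC mapping:

```yaml
# Team members may manage workloads only inside their own namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-backend-devs
  namespace: team-backend
subjects:
  - kind: Group
    name: team-backend            # illustrative; mapped from SSO/OIDC groups
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                      # built-in role: read/write, no RBAC changes
  apiGroup: rbac.authorization.k8s.io
```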
Monitoring & Observability
Essential metrics, logging, and alerting for production CDE deployments
Key Metrics to Collect
Workspace metrics:
- Active workspaces count
- Workspace provision time (p50, p95, p99)
- Workspace uptime/idle time
- Failed workspace starts (errors)
- Workspace CPU/memory utilization

Infrastructure metrics:
- Kubernetes node CPU/memory
- Persistent volume usage
- Network throughput (egress/ingress)
- Database connection pool size
- API request latency

User experience metrics:
- Login success rate
- Time to first workspace (TTFWS)
- VS Code/IDE connection failures
- User session duration
Logging Architecture
- CDE control plane logs (API requests, provisioner actions)
- Kubernetes events (pod starts, failures, OOMKills)
- Workspace startup logs (Terraform apply output)
- Authentication logs (SSO, failed logins)
- Do NOT log user code execution (privacy risk)
Alerting Strategies
Critical (page immediately):
- CDE control plane down (all workspaces inaccessible)
- Database connection failures (metadata loss risk)
- Workspace provision failure rate over 20%
- Cluster autoscaler unable to add nodes

Warning:
- High memory usage on nodes (over 85%)
- Persistent volume usage over 80%
- Workspace provision time exceeds 5 minutes
- SSL certificate expiring in under 7 days
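As a sketch, the provision-failure alert could be expressed as a Prometheus rule. The metric names are illustrative and depend on what your CDE control plane actually exports:

```yaml
# Fire when >20% of workspace builds failed over the last 15 minutes
groups:
  - name: cde-alerts
    rules:
      - alert: WorkspaceProvisionFailureRateHigh
        expr: |
          sum(rate(workspace_builds_failed_total[15m]))
            / sum(rate(workspace_builds_total[15m])) > 0.20
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Workspace provision failure rate above 20%"
```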