
Architecture & Infrastructure Design

Reference architectures, cloud deployments, network design, high availability, AI agent workspaces, and best practices for production-ready CDE infrastructure

Reference Architectures

Common deployment patterns for Cloud Development Environments across different infrastructure types, including AI agent workspaces and microVM isolation

Self-Hosted on Kubernetes

The most popular architecture for enterprise deployments. Provides maximum control, scalability, and integration with existing Kubernetes infrastructure.

Architecture Flow:
1. Developer Access Layer: Load Balancer -> Ingress Controller -> CDE Control Plane (authentication, workspace management)
2. Control Plane Components: PostgreSQL (metadata), Provisioner (Terraform executor), API Server, Web UI
3. Workspace Layer: Kubernetes namespace per workspace -> Pods with persistent volumes (or microVM isolation via Firecracker/Kata) -> Private registry for custom images
4. External Integrations: Git repositories, Container registries, Cloud resources (databases, S3), Monitoring systems
Advantages
  • Horizontal scaling with pod autoscaling
  • Native multi-tenancy with namespaces
  • Resource quotas and limits enforcement
  • Built-in health checks and self-healing
Requirements
  • Kubernetes 1.29+ cluster
  • Storage class with dynamic provisioning
  • Ingress controller (NGINX, Traefik)
  • PostgreSQL 15+ database
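The "storage class with dynamic provisioning" requirement can be sketched as a StorageClass manifest. This example assumes the AWS EBS CSI driver with gp3 volumes; adjust the provisioner and parameters for your cloud:

```yaml
# StorageClass for dynamically provisioned workspace volumes
# (assumes the AWS EBS CSI driver is installed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cde-workspaces
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer  # bind when the pod schedules, so the volume lands in the right AZ
reclaimPolicy: Delete
allowVolumeExpansion: true
```

`WaitForFirstConsumer` matters for multi-AZ clusters: it prevents a volume from being created in an AZ where the workspace pod can never run.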

Self-Hosted on Docker

Simpler deployment for smaller teams or single-host setups. Lower overhead than Kubernetes but with reduced scalability.

Architecture Flow:
1. Reverse Proxy: NGINX or Caddy -> Routes to CDE control plane container
2. CDE Server Container: Single container with API, provisioner, and web UI -> PostgreSQL container for state
3. Workspace Containers: Docker-in-Docker for workspace isolation -> Named volumes for persistence
Best For
  • Small teams (under 50 developers)
  • Quick proof-of-concept deployments
  • Development/testing environments
Limitations
  • Single host - no horizontal scaling
  • Less robust resource isolation
  • Manual failover required
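The three-layer flow above might look like the following docker-compose sketch. The Coder image and `CODER_PG_CONNECTION_URL` variable follow that vendor's documented setup, but service names, ports, and credentials here are placeholders, not production values:

```yaml
# docker-compose.yml sketch for a single-host CDE deployment
services:
  proxy:
    image: caddy:2
    ports: ["443:443"]
    volumes: ["./Caddyfile:/etc/caddy/Caddyfile:ro"]
  cde-server:
    image: ghcr.io/coder/coder:latest
    environment:
      CODER_PG_CONNECTION_URL: postgres://coder:coder@db:5432/coder?sslmode=disable
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # lets the server launch workspace containers
    depends_on: [db]
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: coder
      POSTGRES_PASSWORD: coder   # placeholder - use a secret in practice
      POSTGRES_DB: coder
    volumes: ["pgdata:/var/lib/postgresql/data"]
volumes:
  pgdata:
```

Mounting the host Docker socket is what makes this a single-host pattern: workspaces share the host's daemon, which is exactly the "less robust resource isolation" limitation noted above.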

Managed SaaS Architecture

Fully managed service (e.g., GitHub Codespaces, Ona (formerly Gitpod)) - zero infrastructure management required.

Architecture Flow:
1. Developer Authentication: OAuth/SSO -> Vendor-managed control plane (GitHub, Ona, etc.)
2. Workspace Provisioning: Vendor's global infrastructure -> Auto-scaling compute pools -> CDN-backed file systems
3. Access Methods: Web IDE (browser), VS Code desktop, SSH tunneling
Pros
  • Zero infrastructure ops
  • Global edge network
  • Automatic updates
Cons
  • Data leaves your network
  • Less customization
  • Vendor lock-in
Use Case
  • Open-source projects
  • Rapid onboarding
  • No compliance needs

Hybrid Architecture

Combines self-hosted control plane with cloud-based workspaces, or uses SaaS for non-sensitive projects alongside self-hosted for regulated workloads.

Common Patterns:
Self-Hosted Control + Cloud Workspaces
CDE control plane in your VPC, workspaces auto-scale in multiple clouds based on region/cost
SaaS for Public + Self-Hosted for Secrets
GitHub Codespaces for open-source, Coder self-hosted for HIPAA/SOC2 workloads

AI Agent Workspace Architecture

Purpose-built architecture for AI coding agents (Claude Code, Copilot Workspace, Devin, etc.) that provision, execute, and tear down workspaces autonomously without human interaction.

Architecture Flow:
1. Agent Orchestration Layer: API gateway -> Agent authentication (service tokens, OIDC) -> Rate limiting and concurrency controls
2. Ephemeral Workspace Provisioning: MicroVM pool (Firecracker/Cloud Hypervisor) -> Sub-second boot from snapshots -> Pre-warmed workspace images per repository
3. Sandboxed Execution: MicroVM isolation (no shared kernel) -> Restricted network egress -> Time-boxed sessions with automatic cleanup
4. Output and Audit: Code diffs written to Git branch -> Full session audit log -> Workspace destroyed after task completion
Security Requirements
  • MicroVM isolation - agents must not share a kernel with other workloads
  • Network egress allowlists - restrict to approved domains only
  • Time-limited sessions with hard shutdown after timeout
  • Read-only access to secrets via vault injection (no plaintext env vars)
Performance Requirements
  • Sub-second workspace boot (microVM snapshots or pre-warmed pools)
  • Burst scaling to hundreds of concurrent agent sessions
  • Shared read-only layer for dependencies (container image layers, package caches)
  • Instant teardown - no orphaned resources after agent completes
Agent vs Human Workspace Design
Agent workspaces are ephemeral (minutes, not days), headless (no IDE UI), and high-concurrency (dozens per developer). Design your control plane and node pools accordingly - agent workloads are bursty and benefit from pre-warmed microVM pools rather than cold-start Kubernetes pods. See our MicroVM Isolation guide for implementation details.
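A minimal sketch of an ephemeral agent workspace pod, assuming a Kata/Firecracker RuntimeClass is installed on the cluster (the pod name, image, and limits are illustrative):

```yaml
# Ephemeral AI agent workspace: hard time limit, tmpfs scratch space,
# no service account token. MicroVM isolation comes from the RuntimeClass,
# which is cluster-specific and assumed here.
apiVersion: v1
kind: Pod
metadata:
  name: agent-task-1234           # illustrative: one pod per agent task
  labels:
    workload: ai-agent            # lets NetworkPolicies target agent pods
spec:
  runtimeClassName: kata-fc       # assumes a Kata/Firecracker RuntimeClass exists
  activeDeadlineSeconds: 900      # hard shutdown after 15 minutes
  restartPolicy: Never
  automountServiceAccountToken: false
  containers:
  - name: agent
    image: registry.internal/agent-runner:latest  # illustrative image
    resources:
      limits: {cpu: "2", memory: 4Gi}
    volumeMounts:
    - {name: scratch, mountPath: /workspace}
  volumes:
  - name: scratch
    emptyDir: {medium: Memory, sizeLimit: 8Gi}    # tmpfs, destroyed with the pod
```

`activeDeadlineSeconds` plus `restartPolicy: Never` gives the "time-boxed session with automatic cleanup" property without any external controller.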

Multi-CDE Architecture

Organizations increasingly run multiple CDE platforms simultaneously - one for regulated workloads, another for open-source, and a third for AI agent sandboxing. A multi-CDE architecture provides a unified control layer across platforms.

Common Patterns:
Hub-and-Spoke
Central platform team manages a primary CDE (e.g., Coder) while individual teams use satellite CDEs (Codespaces, DevPod) with unified identity and audit logging
Segmented by Compliance
Self-hosted Coder for HIPAA/SOC2 workloads, GitHub Codespaces for open-source, ephemeral microVM-based CDEs for AI agents
Unified DevContainer Layer
Shared devcontainer.json specs work across Coder, Codespaces, DevPod, and Ona - developers switch platforms without changing their workspace config
For detailed platform selection criteria, governance patterns, and migration strategies, see our dedicated Multi-CDE Strategies guide.
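The unified DevContainer layer rests on a spec file like the one below; the image, feature version, and commands are illustrative values, not defaults of any particular platform:

```json
// .devcontainer/devcontainer.json - a minimal portable spec
// (devcontainer.json permits JSONC comments)
{
  "name": "backend-service",
  "image": "mcr.microsoft.com/devcontainers/go:1.22",
  "features": {
    "ghcr.io/devcontainers/features/docker-in-docker:2": {}
  },
  "forwardPorts": [8080],
  "postCreateCommand": "go mod download"
}
```

Because Coder, Codespaces, DevPod, and Ona all consume this same file, the workspace definition travels with the repository rather than with the platform.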

Cloud Provider Deployments

Architecture considerations for major cloud platforms

Amazon Web Services (AWS)

Three main deployment options for AWS

Amazon EKS
Managed Kubernetes - Production recommended
  • Managed control plane
  • Fargate or EC2 node groups
  • Native VPC networking
  • EBS for persistent volumes
  • Higher cost than ECS
Amazon ECS
Container orchestration - Cost-effective
  • Simpler than Kubernetes
  • Fargate serverless option
  • Lower operational overhead
  • EFS for shared storage
  • Less ecosystem tooling
Amazon EC2
Direct VMs - Maximum control
  • Full OS customization
  • GPU instance types available
  • Best for Windows workloads
  • Spot instances for cost savings
  • Manual scaling management
AWS Best Practice
Use EKS for teams over 100 developers, ECS for smaller teams or cost-focused deployments, EC2 direct for specialized workloads (AI/ML with GPUs, Windows development). For AI agent workloads, consider Firecracker-based microVMs on EC2 bare metal for sub-second boot and strong isolation.

Microsoft Azure

Three deployment strategies for Azure

Azure AKS
Kubernetes Service - Enterprise standard
  • Microsoft Entra ID integration
  • Virtual nodes (serverless)
  • Azure Files/Disks storage
  • Policy-based governance
Container Apps
Serverless containers - Simplified ops
  • Scale-to-zero capability
  • Built on Kubernetes (KEDA)
  • Simple HTTP ingress
  • Limited customization
Azure VMs
Virtual machines - Legacy workloads
  • Windows Server support
  • Visual Studio licensing
  • Hybrid cloud connectivity
  • Higher cost per workspace

Google Cloud Platform (GCP)

Three GCP deployment models

Google GKE
Kubernetes Engine - Most mature K8s
  • Autopilot mode (serverless)
  • Fastest Kubernetes updates
  • Native Google services
  • Persistent Disk CSI driver
Cloud Run
Serverless containers - Pay-per-use
  • Instant scale from zero
  • Per-request billing
  • Integrated Cloud Build
  • 60-minute request timeout limit
Compute Engine
Virtual machines - Custom infrastructure
  • Custom machine types
  • Spot VM instances (formerly Preemptible)
  • Persistent Disk snapshots
  • Manual orchestration

On-Premises Deployment

Self-hosted infrastructure options

Kubernetes (Rancher, OpenShift)
Most flexible, requires expertise
  • Full control over data
  • Air-gapped support
VMware vSphere
Enterprise virtualization
  • Existing VM infrastructure
  • vMotion for HA
Bare Metal
Maximum performance
  • No hypervisor overhead
  • GPU workloads

Network Architecture

Secure, scalable network design for Cloud Development Environments

VPC Design & Isolation

Recommended VPC Layout
1. Public Subnet: Load balancer, NAT gateway
2. Private Subnet (App): CDE control plane, ingress controller
3. Private Subnet (Workspaces): Developer workspace pods/containers
4. Private Subnet (Data): PostgreSQL (RDS), Redis cache
5. Isolated Subnet (AI Agents): MicroVM agent workspaces, restricted egress, no lateral movement
Isolation Best Practices
  • Separate VPC for CDE (do not mix with production apps)
  • Use VPC peering or Transit Gateway for multi-VPC access
  • Enable VPC Flow Logs for audit trails
  • Implement network policies in Kubernetes
  • Use private DNS zones for internal service discovery

Security Groups & Firewall Rules

Layer | Inbound | Outbound | Purpose
Load Balancer | 443 (HTTPS) from 0.0.0.0/0 | To control plane on 443 | Public HTTPS access
Control Plane | 443 from LB, 5432 from workspaces | To DB, to K8s API | API & orchestration
Workspaces | SSH (22) from control plane only | Internet via NAT (egress control) | Developer access
Database | 5432 from control plane only | None | Metadata storage
AI Agent VMs | API from control plane only | Allowlisted domains via proxy (Git, registries) | Sandboxed agent execution

Private Endpoints & Service Connect

Access cloud services without public internet exposure using private connectivity.

AWS PrivateLink
  • S3 via VPC endpoint
  • ECR (Docker registry)
  • Secrets Manager
  • Systems Manager
Azure Private Link
  • Blob Storage endpoint
  • Azure Container Registry
  • Key Vault
  • SQL Database
GCP Private Service Connect
  • Cloud Storage private access
  • Artifact Registry
  • Secret Manager
  • Cloud SQL

Egress Controls & Proxy Configuration

Data Exfiltration Prevention
  • Block public package registries (npm, PyPI, Maven Central) - use private mirrors
  • Allowlist git domains - only your GitHub/GitLab enterprise
  • DLP scanning on egress - detect secrets, PII in outbound traffic
  • Disable direct SSH/RDP out - require bastion host
  • AI agent network isolation - agent workspaces get stricter egress rules than human workspaces (block LLM API callbacks, limit to build toolchains only)
HTTP(S) Proxy Setup
# Workspace environment variables
export HTTP_PROXY=http://proxy.corp:3128
export HTTPS_PROXY=http://proxy.corp:3128
export NO_PROXY=localhost,127.0.0.1,.internal

# For Docker-in-Docker workspaces, place this client proxy config in ~/.docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy.corp:3128",
      "httpsProxy": "http://proxy.corp:3128"
    }
  }
}
Proxy logs all outbound requests for audit compliance

Zero-Trust Network Patterns

Identity-Based Access
  • Mutual TLS (mTLS) between services
  • Service mesh (Istio, Linkerd) for policy enforcement
  • SPIFFE/SPIRE for workload identity
  • No trust based on network location
  • AI agent workload identity - separate service accounts with scoped permissions per agent type
Micro-Segmentation
  • Kubernetes NetworkPolicies per namespace
  • Calico for advanced policy rules
  • Default deny - explicitly allow required traffic
  • Separate dev/staging/prod workspaces by namespace
  • Isolate AI agent workspaces from human workspaces at the network level
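Default-deny micro-segmentation can be sketched with two NetworkPolicies; the namespace and control plane names below are assumptions to adapt to your cluster:

```yaml
# 1) Deny all ingress and egress in a workspace namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-backend          # illustrative workspace namespace
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# 2) Explicitly allow only DNS and control plane traffic back out
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-control-plane
  namespace: team-backend
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  - to:
    - namespaceSelector:
        matchLabels: {kubernetes.io/metadata.name: kube-system}
    ports:
    - {protocol: UDP, port: 53}    # cluster DNS
  - to:
    - namespaceSelector:
        matchLabels: {kubernetes.io/metadata.name: cde-system}  # assumed control plane namespace
    ports:
    - {protocol: TCP, port: 443}
```

Everything not explicitly allowed (including cross-namespace traffic to other teams, or to AI agent namespaces) is dropped, which is the default-deny posture described above.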

High Availability & Disaster Recovery

Design patterns for resilient CDE deployments

Multi-Region Deployment

Active-Active Pattern
Workspaces distributed across regions, global load balancer routes to nearest healthy region
  • No downtime during regional failure
  • Reduced latency for global teams
  • Complex: Requires database replication
Active-Passive Pattern
Primary region serves all traffic, secondary region on standby with replicated data
  • Simpler to implement
  • Lower cost (standby can be smaller)
  • RTO: 5-15 minutes for failover

Failover Strategies

Database Failover
  • PostgreSQL streaming replication
  • Read replicas in multiple AZs
  • Automatic failover with Patroni/Stolon
  • RDS Multi-AZ for managed option
Control Plane Failover
  • Kubernetes Deployment with 3+ replicas
  • Anti-affinity rules (spread across AZs)
  • Health checks with auto-restart
  • Stateless design (state in DB only)
Workspace Recovery
  • Persistent volumes with multi-AZ replication
  • Workspace state stored in external Git
  • Automatic re-provisioning on node failure
  • User data loss: none (Git + PV backups)
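The "spread across AZs" guidance can be expressed with topology spread constraints; this sketch reuses the coder-server Deployment name that appears in the HPA example later on this page (image tag and labels are illustrative):

```yaml
# Spread control plane replicas across availability zones so a single
# AZ outage cannot take down all replicas at once.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder-server
spec:
  replicas: 3
  selector:
    matchLabels: {app: coder-server}
  template:
    metadata:
      labels: {app: coder-server}
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # refuse to schedule rather than stack replicas in one AZ
        labelSelector:
          matchLabels: {app: coder-server}
      containers:
      - name: coder
        image: ghcr.io/coder/coder:latest
```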

Backup & Recovery

What to Back Up
  • PostgreSQL database - all CDE metadata (users, workspaces, templates)
  • Persistent volumes - workspace home directories (/home/coder)
  • Terraform templates - version controlled in Git
  • Configuration files - Kubernetes manifests, Helm values
Backup Tools
Database
pg_dump, Velero, AWS Backup
Volumes
Velero, Restic, cloud snapshots
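With Velero, the backup cadence can be declared as a Schedule resource; the cron expression, TTL, and namespace names below are illustrative:

```yaml
# Velero Schedule: nightly backup of the control plane namespace and
# workspace PVCs, with volume snapshots via the cloud provider plugin.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cde-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"            # 02:00 daily
  template:
    includedNamespaces: [cde-system, workspaces]   # assumed namespace names
    snapshotVolumes: true          # snapshot persistent volumes, not just manifests
    ttl: 168h0m0s                  # retain backups for 7 days
```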

RPO/RTO Considerations

Tier | RPO | RTO | Cost
Critical (production workspaces) | Under 1 hour | Under 15 min | High
Standard (most development) | 4-8 hours | 1-2 hours | Medium
Low-Priority (testing/experiments) | 24 hours | 4+ hours | Low
RPO = Recovery Point Objective (max data loss), RTO = Recovery Time Objective (max downtime)

Storage Architecture

Persistent storage patterns for workspaces and shared data

Workspace Persistent Storage

Per-Workspace Volumes
Each workspace gets its own persistent volume for /home/coder directory
AWS
EBS gp3 volumes
Azure
Azure Managed Disks
GCP
Persistent Disk SSD
Size Recommendations
  • Web development (Node.js, Python): 20-50 GB
  • Backend (Java, .NET, Go): 50-100 GB
  • Data science (ML models): 100-500 GB
  • AI/ML training (large datasets): 500 GB - 2 TB
  • AI agent workspaces (ephemeral): 5-20 GB (tmpfs)
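Each workspace home directory maps to a PersistentVolumeClaim; this sketch assumes a StorageClass named cde-workspaces exists and uses the backend-tier size from the list above:

```yaml
# Per-workspace PVC for the /home/coder volume (one claim per workspace)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ws-alice-home              # illustrative: derived from the workspace name
spec:
  accessModes: [ReadWriteOnce]     # EBS/Managed Disk/Persistent Disk are single-attach
  storageClassName: cde-workspaces # assumed StorageClass with dynamic provisioning
  resources:
    requests:
      storage: 50Gi
```

ReadWriteOnce is sufficient here because a workspace volume only ever mounts into that workspace's pod; shared team data belongs on the RWX file systems described next.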

Shared File Systems

Team Collaboration Storage
Shared volumes for team datasets, models, build caches
AWS EFS
NFS protocol, scales automatically, multi-AZ
Azure Files
SMB/NFS, Premium tier for IOPS-heavy workloads
GCP Filestore
Managed NFS, High Scale tier for large teams
Use Cases
  • Shared ML training datasets (read-only mount)
  • npm/Maven cache (speeds up builds)
  • Docker layer cache (BuildKit)

Database Access Patterns

Development Databases
Developers need ephemeral databases for testing. Two approaches:
Sidecar Container
PostgreSQL/MySQL container in same pod, auto-destroyed with workspace
Managed DB Pool
RDS/CloudSQL instances provisioned on-demand, returned to pool when done
Production DB Access
  • Never allow direct production DB writes from workspaces
  • Read-only replicas for debugging (with auditing)
  • Use cloud IAM roles (no hardcoded passwords)
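The sidecar approach can be sketched as a single pod carrying an emptyDir-backed PostgreSQL container; images and credentials are placeholders:

```yaml
# Workspace pod with an ephemeral PostgreSQL sidecar.
# The database lives and dies with the workspace - nothing persists after teardown.
apiVersion: v1
kind: Pod
metadata:
  name: workspace-with-db
spec:
  containers:
  - name: workspace
    image: registry.internal/dev-base:latest    # illustrative workspace image
    env:
    - {name: DATABASE_URL, value: "postgres://dev:dev@localhost:5432/dev"}  # same pod = localhost
  - name: postgres
    image: postgres:16
    env:
    - {name: POSTGRES_USER, value: dev}
    - {name: POSTGRES_PASSWORD, value: dev}     # placeholder for a throwaway dev DB
    - {name: POSTGRES_DB, value: dev}
    volumeMounts:
    - {name: pgdata, mountPath: /var/lib/postgresql/data}
  volumes:
  - name: pgdata
    emptyDir: {}                                 # ephemeral: destroyed with the pod
```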

Cache Layers

Build Cache
  • Docker BuildKit Cache
    Store build layers in shared registry (ECR, ACR, GCR) with --cache-from flag
  • Package Manager Cache
    npm/yarn/pip cache on EFS/Azure Files - shared across workspaces (saves bandwidth)
Application Cache
Redis Cluster
Session cache, rate limiting
Memcached
Object cache, query results

Scaling Patterns

Auto-scaling strategies for human workspaces, AI agent bursts, and infrastructure

Horizontal Pod Autoscaling (HPA)

Automatically scale the number of CDE control plane pods based on CPU/memory utilization or custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  • Scale up during peak hours (9am-5pm) for human workspaces
  • Scale down at night/weekends to save cost
  • Use custom metrics (workspace provision rate)
  • AI agent burst scaling - handle sudden spikes of 50-200 ephemeral workspaces

Cluster Node Autoscaling

Automatically add or remove Kubernetes worker nodes when workspaces need more capacity.
Cloud-Specific Solutions
AWS
EKS Cluster Autoscaler or Karpenter
Azure
AKS Virtual Nodes + autoscaler
GCP
GKE Autopilot (serverless nodes)
Node Pool Strategy
  • General pool: Standard workspaces (4 CPU, 8GB RAM)
  • GPU pool: ML/AI agent workspaces (NVIDIA L4/A100/H100)
  • Spot/preemptible pool: Cost-sensitive workloads

Resource Quotas & Limits

Prevent resource hogging and ensure fair sharing across teams using Kubernetes resource quotas.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    persistentvolumeclaims: "20"
    pods: "50"
This quota caps the team-backend namespace at 50 CPUs and 100Gi of memory in requests (100 CPUs / 200Gi in limits) across all workspaces.

Multi-Tenant Considerations

Namespace Isolation
  • One namespace per team or project
  • Network policies prevent cross-namespace traffic
  • RBAC limits what users can see/modify
Cost Allocation
  • Label all resources with team/cost-center tags
  • Use Kubecost or cloud billing for chargeback
  • Alert when team exceeds budget threshold
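Namespace-scoped RBAC might look like the Role/RoleBinding pair below; the resource list and IdP group name are assumptions to adapt:

```yaml
# Members of the team-backend group can manage workspace resources
# in their own namespace only - they cannot see other teams' workspaces.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workspace-user
  namespace: team-backend
rules:
- apiGroups: ["", "apps"]
  resources: [pods, pods/log, persistentvolumeclaims, deployments]
  verbs: [get, list, watch, create, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-backend-devs
  namespace: team-backend
subjects:
- kind: Group
  name: oidc:team-backend          # illustrative group mapping from your IdP
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: workspace-user
  apiGroup: rbac.authorization.k8s.io
```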

Monitoring & Observability

Essential metrics, logging, and alerting for production CDE deployments including AI agent workloads

Key Metrics to Collect

Workspace Metrics
  • Active workspaces count
  • Workspace provision time (p50, p95, p99)
  • Workspace uptime/idle time
  • Failed workspace starts (errors)
  • Workspace CPU/memory utilization
Infrastructure Metrics
  • Kubernetes node CPU/memory
  • Persistent volume usage
  • Network throughput (egress/ingress)
  • Database connection pool size
  • API request latency
User Experience Metrics
  • Login success rate
  • Time to first workspace (TTFWS)
  • VS Code/IDE connection failures
  • User session duration
AI Agent Metrics
  • Agent workspace boot time (p50, p95)
  • Concurrent agent sessions
  • Agent task success/failure rate
  • MicroVM pool utilization
  • Agent egress bandwidth consumption

Logging Architecture

Log Aggregation Stack
ELK Stack
Elasticsearch + Logstash + Kibana (classic, resource-heavy)
Loki + Grafana
Lightweight, Prometheus-like for logs (recommended)
Cloud Native
CloudWatch Logs, Azure Monitor, GCP Logging
What to Log
  • CDE control plane logs (API requests, provisioner actions)
  • Kubernetes events (pod starts, failures, OOMKills)
  • Workspace startup logs (Terraform apply output)
  • Authentication logs (SSO, failed logins)
  • AI agent session logs (task start/stop, files modified, commands executed)
  • Do NOT log user code execution (privacy risk)

Alerting Strategies

Critical Alerts (Page On-Call)
  • CDE control plane down (all workspaces inaccessible)
  • Database connection failures (metadata loss risk)
  • Workspace provision failure rate over 20%
  • Cluster autoscaler unable to add nodes
  • AI agent workspace pool exhausted (no available microVMs)
Warning Alerts (Slack/Email)
  • High memory usage on nodes (over 85%)
  • Persistent volume usage over 80%
  • Workspace provision time exceeds 5 minutes
  • SSL certificate expiring in under 7 days
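Critical conditions like the provision-failure rate and microVM pool exhaustion above can be encoded as Prometheus alerting rules. The metric names here are assumptions; substitute whatever series your CDE exporter actually emits:

```yaml
# Prometheus alerting rules sketch for two of the page-worthy conditions
groups:
- name: cde-critical
  rules:
  - alert: WorkspaceProvisionFailureRateHigh
    expr: |
      sum(rate(cde_workspace_provision_failures_total[10m]))
        / sum(rate(cde_workspace_provision_attempts_total[10m])) > 0.20
    for: 10m                        # sustained, not a single blip
    labels: {severity: page}
    annotations:
      summary: "Workspace provision failure rate above 20%"
  - alert: AgentMicroVMPoolExhausted
    expr: cde_agent_microvm_pool_available == 0
    for: 5m
    labels: {severity: page}
    annotations:
      summary: "No available microVMs for AI agent workspaces"
```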

Dashboard Recommendations

Platform Health Dashboard
  • Control plane uptime SLA
  • Active workspace count
  • API request rate
  • Database query latency
  • Kubernetes cluster health
  • Network egress costs
Developer Experience Dashboard
  • Avg time to provision workspace
  • Workspace start success rate
  • IDE connection failures
  • User login errors
  • Top 10 slowest workspaces
  • Resource quota violations
Use Grafana with Prometheus for metrics, Loki for logs, and OpenTelemetry with Jaeger or Tempo for distributed tracing