
Architecture & Infrastructure Design

Reference architectures, cloud deployments, network design, high availability, AI agent workspaces, and best practices for production-ready CDE infrastructure

Reference Architectures

Common deployment patterns for Cloud Development Environments across different infrastructure types, including AI agent workspaces and microVM isolation

Self-Hosted on Kubernetes

The most popular architecture for enterprise deployments. Provides maximum control, scalability, and integration with existing Kubernetes infrastructure.

Architecture Flow:
1. Developer Access Layer: Load Balancer -> Ingress Controller -> CDE Control Plane (authentication, workspace management)
2. Control Plane Components: PostgreSQL (metadata), Provisioner (Terraform executor), API Server, Web UI
3. Workspace Layer: Kubernetes namespace per workspace -> Pods with persistent volumes (or microVM isolation via Firecracker/Kata) -> Private registry for custom images
4. External Integrations: Git repositories, Container registries, Cloud resources (databases, S3), Monitoring systems
Advantages
  • Horizontal scaling with pod autoscaling
  • Native multi-tenancy with namespaces
  • Resource quotas and limits enforcement
  • Built-in health checks and self-healing
Requirements
  • Kubernetes 1.29+ cluster
  • Storage class with dynamic provisioning
  • Ingress controller (NGINX, Traefik)
  • PostgreSQL 15+ database
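The "storage class with dynamic provisioning" requirement can be sketched as a StorageClass manifest. This example assumes the AWS EBS CSI driver with gp3 volumes; adjust the provisioner and parameters for your cloud:

```yaml
# StorageClass for dynamically provisioned workspace volumes
# (assumes the AWS EBS CSI driver is installed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cde-workspaces
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer  # bind when the pod schedules, so the volume lands in the right AZ
reclaimPolicy: Delete
allowVolumeExpansion: true
```

`WaitForFirstConsumer` matters for multi-AZ clusters: it prevents a volume from being created in an AZ where the workspace pod can never run.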

Self-Hosted on Docker

Simpler deployment for smaller teams or single-host setups. Lower overhead than Kubernetes but with reduced scalability.

Architecture Flow:
1. Reverse Proxy: NGINX or Caddy -> Routes to CDE control plane container
2. CDE Server Container: Single container with API, provisioner, and web UI -> PostgreSQL container for state
3. Workspace Containers: Docker-in-Docker for workspace isolation -> Named volumes for persistence
Best For
  • Small teams (under 50 developers)
  • Quick proof-of-concept deployments
  • Development/testing environments
Limitations
  • Single host - no horizontal scaling
  • Less robust resource isolation
  • Manual failover required
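The three-layer flow above might look like the following docker-compose sketch. The Coder image and `CODER_PG_CONNECTION_URL` variable follow that vendor's documented setup, but service names, ports, and credentials here are placeholders, not production values:

```yaml
# docker-compose.yml sketch for a single-host CDE deployment
services:
  proxy:
    image: caddy:2
    ports: ["443:443"]
    volumes: ["./Caddyfile:/etc/caddy/Caddyfile:ro"]
  cde-server:
    image: ghcr.io/coder/coder:latest
    environment:
      CODER_PG_CONNECTION_URL: postgres://coder:coder@db:5432/coder?sslmode=disable
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # lets the server launch workspace containers
    depends_on: [db]
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: coder
      POSTGRES_PASSWORD: coder   # placeholder - use a secret in practice
      POSTGRES_DB: coder
    volumes: ["pgdata:/var/lib/postgresql/data"]
volumes:
  pgdata:
```

Mounting the host Docker socket is what makes this a single-host pattern: workspaces share the host's daemon, which is exactly the "less robust resource isolation" limitation noted above.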

Managed SaaS Architecture

Fully managed service (e.g., GitHub Codespaces, Ona (formerly Gitpod)) - zero infrastructure management required.

Architecture Flow:
1. Developer Authentication: OAuth/SSO -> Vendor-managed control plane (GitHub, Ona, etc.)
2. Workspace Provisioning: Vendor's global infrastructure -> Auto-scaling compute pools -> CDN-backed file systems
3. Access Methods: Web IDE (browser), VS Code desktop, SSH tunneling
Pros
  • Zero infrastructure ops
  • Global edge network
  • Automatic updates
Cons
  • Data leaves your network
  • Less customization
  • Vendor lock-in
Use Case
  • Open-source projects
  • Rapid onboarding
  • No compliance needs

Hybrid Architecture

Combines self-hosted control plane with cloud-based workspaces, or uses SaaS for non-sensitive projects alongside self-hosted for regulated workloads.

Common Patterns:
Self-Hosted Control + Cloud Workspaces
CDE control plane in your VPC, workspaces auto-scale in multiple clouds based on region/cost
SaaS for Public + Self-Hosted for Secrets
GitHub Codespaces for open-source, Coder self-hosted for HIPAA/SOC2 workloads

AI Agent Workspace Architecture

Purpose-built architecture for AI coding agents (Claude Code, Copilot Workspace, Devin, etc.) that provision, execute, and tear down workspaces autonomously without human interaction.

Architecture Flow:
1. Agent Orchestration Layer: API gateway -> Agent authentication (service tokens, OIDC) -> Rate limiting and concurrency controls
2. Ephemeral Workspace Provisioning: MicroVM pool (Firecracker/Cloud Hypervisor) -> Sub-second boot from snapshots -> Pre-warmed workspace images per repository
3. Sandboxed Execution: MicroVM isolation (no shared kernel) -> Restricted network egress -> Time-boxed sessions with automatic cleanup
4. Output and Audit: Code diffs written to Git branch -> Full session audit log -> Workspace destroyed after task completion
Security Requirements
  • MicroVM isolation - agents must not share a kernel with other workloads
  • Network egress allowlists - restrict to approved domains only
  • Time-limited sessions with hard shutdown after timeout
  • Read-only access to secrets via vault injection (no plaintext env vars)
Performance Requirements
  • Sub-second workspace boot (microVM snapshots or pre-warmed pools)
  • Burst scaling to hundreds of concurrent agent sessions
  • Shared read-only layer for dependencies (container image layers, package caches)
  • Instant teardown - no orphaned resources after agent completes
Agent vs Human Workspace Design
Agent workspaces are ephemeral (minutes, not days), headless (no IDE UI), and high-concurrency (dozens per developer). Design your control plane and node pools accordingly - agent workloads are bursty and benefit from pre-warmed microVM pools rather than cold-start Kubernetes pods. See our MicroVM Isolation guide for implementation details.
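A minimal sketch of an ephemeral agent workspace pod, assuming a Kata/Firecracker RuntimeClass is installed on the cluster (the pod name, image, and limits are illustrative):

```yaml
# Ephemeral AI agent workspace: hard time limit, tmpfs scratch space,
# no service account token. MicroVM isolation comes from the RuntimeClass,
# which is cluster-specific and assumed here.
apiVersion: v1
kind: Pod
metadata:
  name: agent-task-1234           # illustrative: one pod per agent task
  labels:
    workload: ai-agent            # lets NetworkPolicies target agent pods
spec:
  runtimeClassName: kata-fc       # assumes a Kata/Firecracker RuntimeClass exists
  activeDeadlineSeconds: 900      # hard shutdown after 15 minutes
  restartPolicy: Never
  automountServiceAccountToken: false
  containers:
  - name: agent
    image: registry.internal/agent-runner:latest  # illustrative image
    resources:
      limits: {cpu: "2", memory: 4Gi}
    volumeMounts:
    - {name: scratch, mountPath: /workspace}
  volumes:
  - name: scratch
    emptyDir: {medium: Memory, sizeLimit: 8Gi}    # tmpfs, destroyed with the pod
```

`activeDeadlineSeconds` plus `restartPolicy: Never` gives the "time-boxed session with automatic cleanup" property without any external controller.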

Multi-CDE Architecture

Organizations increasingly run multiple CDE platforms simultaneously - one for regulated workloads, another for open-source, and a third for AI agent sandboxing. A multi-CDE architecture provides a unified control layer across platforms.

Common Patterns:
Hub-and-Spoke
Central platform team manages a primary CDE (e.g., Coder) while individual teams use satellite CDEs (Codespaces, DevPod) with unified identity and audit logging
Segmented by Compliance
Self-hosted Coder for HIPAA/SOC2 workloads, GitHub Codespaces for open-source, ephemeral microVM-based CDEs for AI agents
Unified DevContainer Layer
Shared devcontainer.json specs work across Coder, Codespaces, DevPod, and Ona - developers switch platforms without changing their workspace config
For detailed platform selection criteria, governance patterns, and migration strategies, see our dedicated Multi-CDE Strategies guide.
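The unified DevContainer layer rests on a spec file like the one below; the image, feature version, and commands are illustrative values, not defaults of any particular platform:

```json
// .devcontainer/devcontainer.json - a minimal portable spec
// (devcontainer.json permits JSONC comments)
{
  "name": "backend-service",
  "image": "mcr.microsoft.com/devcontainers/go:1.22",
  "features": {
    "ghcr.io/devcontainers/features/docker-in-docker:2": {}
  },
  "forwardPorts": [8080],
  "postCreateCommand": "go mod download"
}
```

Because Coder, Codespaces, DevPod, and Ona all consume this same file, the workspace definition travels with the repository rather than with the platform.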

Cloud Provider Deployments

Architecture considerations for major cloud platforms

Amazon Web Services (AWS)

Three main deployment options for AWS

Amazon EKS
Managed Kubernetes - Production recommended
  • Managed control plane
  • Fargate or EC2 node groups
  • Native VPC networking
  • EBS for persistent volumes
  • Higher cost than ECS
Amazon ECS
Container orchestration - Cost-effective
  • Simpler than Kubernetes
  • Fargate serverless option
  • Lower operational overhead
  • EFS for shared storage
  • Less ecosystem tooling
Amazon EC2
Direct VMs - Maximum control
  • Full OS customization
  • GPU instance types available
  • Best for Windows workloads
  • Spot instances for cost savings
  • Manual scaling management
AWS Best Practice
Use EKS for teams over 100 developers, ECS for smaller teams or cost-focused deployments, EC2 direct for specialized workloads (AI/ML with GPUs, Windows development). For AI agent workloads, consider Firecracker-based microVMs on EC2 bare metal for sub-second boot and strong isolation.

Microsoft Azure

Three deployment strategies for Azure

Azure AKS
Kubernetes Service - Enterprise standard
  • Microsoft Entra ID integration
  • Virtual nodes (serverless)
  • Azure Files/Disks storage
  • Policy-based governance
Container Apps
Serverless containers - Simplified ops
  • Scale-to-zero capability
  • Built on Kubernetes (KEDA)
  • Simple HTTP ingress
  • Limited customization
Azure VMs
Virtual machines - Legacy workloads
  • Windows Server support
  • Visual Studio licensing
  • Hybrid cloud connectivity
  • Higher cost per workspace

Google Cloud Platform (GCP)

Three GCP deployment models

Google GKE
Kubernetes Engine - Most mature K8s
  • Autopilot mode (serverless)
  • Fastest Kubernetes updates
  • Native Google services
  • Persistent Disk CSI driver
Cloud Run
Serverless containers - Pay-per-use
  • Instant scale from zero
  • Per-request billing
  • Integrated Cloud Build
  • 60-minute request timeout limit
Compute Engine
Virtual machines - Custom infrastructure
  • Custom machine types
  • Spot VM instances (formerly Preemptible)
  • Persistent Disk snapshots
  • Manual orchestration

On-Premises Deployment

Self-hosted infrastructure options

Kubernetes (Rancher, OpenShift)
Most flexible, requires expertise
  • Full control over data
  • Air-gapped support
VMware vSphere
Enterprise virtualization
  • Existing VM infrastructure
  • vMotion for HA
Bare Metal
Maximum performance
  • No hypervisor overhead
  • GPU workloads

Network Architecture

Secure, scalable network design for Cloud Development Environments

VPC Design & Isolation

Recommended VPC Layout
1. Public Subnet: Load balancer, NAT gateway
2. Private Subnet (App): CDE control plane, ingress controller
3. Private Subnet (Workspaces): Developer workspace pods/containers
4. Private Subnet (Data): PostgreSQL (RDS), Redis cache
5. Isolated Subnet (AI Agents): MicroVM agent workspaces, restricted egress, no lateral movement
Isolation Best Practices
  • Separate VPC for CDE (do not mix with production apps)
  • Use VPC peering or Transit Gateway for multi-VPC access
  • Enable VPC Flow Logs for audit trails
  • Implement network policies in Kubernetes
  • Use private DNS zones for internal service discovery

Security Groups & Firewall Rules

Layer | Inbound | Outbound | Purpose
Load Balancer | 443 (HTTPS) from 0.0.0.0/0 | To control plane on 443 | Public HTTPS access
Control Plane | 443 from LB, 5432 from workspaces | To DB, to K8s API | API & orchestration
Workspaces | SSH (22) from control plane only | Internet via NAT (egress control) | Developer access
Database | 5432 from control plane only | None | Metadata storage
AI Agent VMs | API from control plane only | Allowlisted domains via proxy (Git, registries) | Sandboxed agent execution

Private Endpoints & Service Connect

Access cloud services without public internet exposure using private connectivity.

AWS PrivateLink
  • S3 via VPC endpoint
  • ECR (Docker registry)
  • Secrets Manager
  • Systems Manager
Azure Private Link
  • Blob Storage endpoint
  • Azure Container Registry
  • Key Vault
  • SQL Database
GCP Private Service Connect
  • Cloud Storage private access
  • Artifact Registry
  • Secret Manager
  • Cloud SQL

Egress Controls & Proxy Configuration

Data Exfiltration Prevention
  • Block public package registries (npm, PyPI, Maven Central) - use private mirrors
  • Allowlist git domains - only your GitHub/GitLab enterprise
  • DLP scanning on egress - detect secrets, PII in outbound traffic
  • Disable direct SSH/RDP out - require bastion host
  • AI agent network isolation - agent workspaces get stricter egress rules than human workspaces (block LLM API callbacks, limit to build toolchains only)
HTTP(S) Proxy Setup
# Workspace environment variables
export HTTP_PROXY=http://proxy.corp:3128
export HTTPS_PROXY=http://proxy.corp:3128
export NO_PROXY=localhost,127.0.0.1,.internal

# For Docker-in-Docker workspaces, place this client proxy config in ~/.docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy.corp:3128",
      "httpsProxy": "http://proxy.corp:3128"
    }
  }
}
Proxy logs all outbound requests for audit compliance

Zero-Trust Network Patterns

Identity-Based Access
  • Mutual TLS (mTLS) between services
  • Service mesh (Istio, Linkerd) for policy enforcement
  • SPIFFE/SPIRE for workload identity
  • No trust based on network location
  • AI agent workload identity - separate service accounts with scoped permissions per agent type
Micro-Segmentation
  • Kubernetes NetworkPolicies per namespace
  • Calico for advanced policy rules
  • Default deny - explicitly allow required traffic
  • Separate dev/staging/prod workspaces by namespace
  • Isolate AI agent workspaces from human workspaces at the network level
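Default-deny micro-segmentation can be sketched with two NetworkPolicies; the namespace and control plane names below are assumptions to adapt to your cluster:

```yaml
# 1) Deny all ingress and egress in a workspace namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-backend          # illustrative workspace namespace
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# 2) Explicitly allow only DNS and control plane traffic back out
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-control-plane
  namespace: team-backend
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  - to:
    - namespaceSelector:
        matchLabels: {kubernetes.io/metadata.name: kube-system}
    ports:
    - {protocol: UDP, port: 53}    # cluster DNS
  - to:
    - namespaceSelector:
        matchLabels: {kubernetes.io/metadata.name: cde-system}  # assumed control plane namespace
    ports:
    - {protocol: TCP, port: 443}
```

Everything not explicitly allowed (including cross-namespace traffic to other teams, or to AI agent namespaces) is dropped, which is the default-deny posture described above.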

High Availability & Disaster Recovery

Design patterns for resilient CDE deployments

Multi-Region Deployment

Active-Active Pattern
Workspaces distributed across regions, global load balancer routes to nearest healthy region
  • No downtime during regional failure
  • Reduced latency for global teams
  • Complex: Requires database replication
Active-Passive Pattern
Primary region serves all traffic, secondary region on standby with replicated data
  • Simpler to implement
  • Lower cost (standby can be smaller)
  • RTO: 5-15 minutes for failover

Failover Strategies

Database Failover
  • PostgreSQL streaming replication
  • Read replicas in multiple AZs
  • Automatic failover with Patroni/Stolon
  • RDS Multi-AZ for managed option
Control Plane Failover
  • Kubernetes Deployment with 3+ replicas
  • Anti-affinity rules (spread across AZs)
  • Health checks with auto-restart
  • Stateless design (state in DB only)
Workspace Recovery
  • Persistent volumes with multi-AZ replication
  • Workspace state stored in external Git
  • Automatic re-provisioning on node failure
  • User data loss: none (Git + PV backups)
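The "spread across AZs" guidance can be expressed with topology spread constraints; this sketch reuses the coder-server Deployment name that appears in the HPA example later on this page (image tag and labels are illustrative):

```yaml
# Spread control plane replicas across availability zones so a single
# AZ outage cannot take down all replicas at once.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder-server
spec:
  replicas: 3
  selector:
    matchLabels: {app: coder-server}
  template:
    metadata:
      labels: {app: coder-server}
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # refuse to schedule rather than stack replicas in one AZ
        labelSelector:
          matchLabels: {app: coder-server}
      containers:
      - name: coder
        image: ghcr.io/coder/coder:latest
```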

Backup & Recovery

What to Back Up
  • PostgreSQL database - all CDE metadata (users, workspaces, templates)
  • Persistent volumes - workspace home directories (/home/coder)
  • Terraform templates - version controlled in Git
  • Configuration files - Kubernetes manifests, Helm values
Backup Tools
Database
pg_dump, Velero, AWS Backup
Volumes
Velero, Restic, cloud snapshots
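With Velero, the backup cadence can be declared as a Schedule resource; the cron expression, TTL, and namespace names below are illustrative:

```yaml
# Velero Schedule: nightly backup of the control plane namespace and
# workspace PVCs, with volume snapshots via the cloud provider plugin.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cde-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"            # 02:00 daily
  template:
    includedNamespaces: [cde-system, workspaces]   # assumed namespace names
    snapshotVolumes: true          # snapshot persistent volumes, not just manifests
    ttl: 168h0m0s                  # retain backups for 7 days
```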

RPO/RTO Considerations

Tier | RPO | RTO | Cost
Critical (production workspaces) | Under 1 hour | Under 15 min | High
Standard (most development) | 4-8 hours | 1-2 hours | Medium
Low-Priority (testing/experiments) | 24 hours | 4+ hours | Low
RPO = Recovery Point Objective (max data loss), RTO = Recovery Time Objective (max downtime)

Storage Architecture

Persistent storage patterns for workspaces and shared data

Workspace Persistent Storage

Per-Workspace Volumes
Each workspace gets its own persistent volume for /home/coder directory
AWS
EBS gp3 volumes
Azure
Azure Managed Disks
GCP
Persistent Disk SSD
Size Recommendations
  • Web development (Node.js, Python): 20-50 GB
  • Backend (Java, .NET, Go): 50-100 GB
  • Data science (ML models): 100-500 GB
  • AI/ML training (large datasets): 500 GB - 2 TB
  • AI agent workspaces (ephemeral): 5-20 GB (tmpfs)
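Each workspace home directory maps to a PersistentVolumeClaim; this sketch assumes a StorageClass named cde-workspaces exists and uses the backend-tier size from the list above:

```yaml
# Per-workspace PVC for the /home/coder volume (one claim per workspace)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ws-alice-home              # illustrative: derived from the workspace name
spec:
  accessModes: [ReadWriteOnce]     # EBS/Managed Disk/Persistent Disk are single-attach
  storageClassName: cde-workspaces # assumed StorageClass with dynamic provisioning
  resources:
    requests:
      storage: 50Gi
```

ReadWriteOnce is sufficient here because a workspace volume only ever mounts into that workspace's pod; shared team data belongs on the RWX file systems described next.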

Shared File Systems

Team Collaboration Storage
Shared volumes for team datasets, models, build caches
AWS EFS
NFS protocol, scales automatically, multi-AZ
Azure Files
SMB/NFS, Premium tier for IOPS-heavy workloads
GCP Filestore
Managed NFS, High Scale tier for large teams
Use Cases
  • Shared ML training datasets (read-only mount)
  • npm/Maven cache (speeds up builds)
  • Docker layer cache (BuildKit)

Database Access Patterns

Development Databases
Developers need ephemeral databases for testing. Two approaches:
Sidecar Container
PostgreSQL/MySQL container in same pod, auto-destroyed with workspace
Managed DB Pool
RDS/CloudSQL instances provisioned on-demand, returned to pool when done
Production DB Access
  • Never allow direct production DB writes from workspaces
  • Read-only replicas for debugging (with auditing)
  • Use cloud IAM roles (no hardcoded passwords)
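The sidecar approach can be sketched as a single pod carrying an emptyDir-backed PostgreSQL container; images and credentials are placeholders:

```yaml
# Workspace pod with an ephemeral PostgreSQL sidecar.
# The database lives and dies with the workspace - nothing persists after teardown.
apiVersion: v1
kind: Pod
metadata:
  name: workspace-with-db
spec:
  containers:
  - name: workspace
    image: registry.internal/dev-base:latest    # illustrative workspace image
    env:
    - {name: DATABASE_URL, value: "postgres://dev:dev@localhost:5432/dev"}  # same pod = localhost
  - name: postgres
    image: postgres:16
    env:
    - {name: POSTGRES_USER, value: dev}
    - {name: POSTGRES_PASSWORD, value: dev}     # placeholder for a throwaway dev DB
    - {name: POSTGRES_DB, value: dev}
    volumeMounts:
    - {name: pgdata, mountPath: /var/lib/postgresql/data}
  volumes:
  - name: pgdata
    emptyDir: {}                                 # ephemeral: destroyed with the pod
```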

Cache Layers

Build Cache
  • Docker BuildKit Cache
    Store build layers in shared registry (ECR, ACR, GCR) with --cache-from flag
  • Package Manager Cache
    npm/yarn/pip cache on EFS/Azure Files - shared across workspaces (saves bandwidth)
Application Cache
Redis Cluster
Session cache, rate limiting
Memcached
Object cache, query results

Scaling Patterns

Auto-scaling strategies for human workspaces, AI agent bursts, and infrastructure

Horizontal Pod Autoscaling (HPA)

Automatically scale the number of CDE control plane pods based on CPU/memory utilization or custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  • Scale up during peak hours (9am-5pm) for human workspaces
  • Scale down at night/weekends to save cost
  • Use custom metrics (workspace provision rate)
  • AI agent burst scaling - handle sudden spikes of 50-200 ephemeral workspaces

Cluster Node Autoscaling

Automatically add or remove Kubernetes worker nodes when workspaces need more capacity.
Cloud-Specific Solutions
AWS
EKS Cluster Autoscaler or Karpenter
Azure
AKS Virtual Nodes + autoscaler
GCP
GKE Autopilot (serverless nodes)
Node Pool Strategy
  • General pool: Standard workspaces (4 CPU, 8GB RAM)
  • GPU pool: ML/AI agent workspaces (NVIDIA L4/A100/H100)
  • Spot/preemptible pool: Cost-sensitive workloads

Resource Quotas & Limits

Prevent resource hogging and ensure fair sharing across teams using Kubernetes resource quotas.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    persistentvolumeclaims: "20"
    pods: "50"
This quota caps the team-backend namespace at 50 CPUs and 100Gi of memory in requests (100 CPUs / 200Gi in limits) across all workspaces.

Multi-Tenant Considerations

Namespace Isolation
  • One namespace per team or project
  • Network policies prevent cross-namespace traffic
  • RBAC limits what users can see/modify
Cost Allocation
  • Label all resources with team/cost-center tags
  • Use Kubecost or cloud billing for chargeback
  • Alert when team exceeds budget threshold
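Namespace-scoped RBAC might look like the Role/RoleBinding pair below; the resource list and IdP group name are assumptions to adapt:

```yaml
# Members of the team-backend group can manage workspace resources
# in their own namespace only - they cannot see other teams' workspaces.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workspace-user
  namespace: team-backend
rules:
- apiGroups: ["", "apps"]
  resources: [pods, pods/log, persistentvolumeclaims, deployments]
  verbs: [get, list, watch, create, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-backend-devs
  namespace: team-backend
subjects:
- kind: Group
  name: oidc:team-backend          # illustrative group mapping from your IdP
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: workspace-user
  apiGroup: rbac.authorization.k8s.io
```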

Monitoring & Observability

Essential metrics, logging, and alerting for production CDE deployments including AI agent workloads

Key Metrics to Collect

Workspace Metrics
  • Active workspaces count
  • Workspace provision time (p50, p95, p99)
  • Workspace uptime/idle time
  • Failed workspace starts (errors)
  • Workspace CPU/memory utilization
Infrastructure Metrics
  • Kubernetes node CPU/memory
  • Persistent volume usage
  • Network throughput (egress/ingress)
  • Database connection pool size
  • API request latency
User Experience Metrics
  • Login success rate
  • Time to first workspace (TTFWS)
  • VS Code/IDE connection failures
  • User session duration
AI Agent Metrics
  • Agent workspace boot time (p50, p95)
  • Concurrent agent sessions
  • Agent task success/failure rate
  • MicroVM pool utilization
  • Agent egress bandwidth consumption

Logging Architecture

Log Aggregation Stack
ELK Stack
Elasticsearch + Logstash + Kibana (classic, resource-heavy)
Loki + Grafana
Lightweight, Prometheus-like for logs (recommended)
Cloud Native
CloudWatch Logs, Azure Monitor, GCP Logging
What to Log
  • CDE control plane logs (API requests, provisioner actions)
  • Kubernetes events (pod starts, failures, OOMKills)
  • Workspace startup logs (Terraform apply output)
  • Authentication logs (SSO, failed logins)
  • AI agent session logs (task start/stop, files modified, commands executed)
  • Do NOT log user code execution (privacy risk)

Alerting Strategies

Critical Alerts (Page On-Call)
  • CDE control plane down (all workspaces inaccessible)
  • Database connection failures (metadata loss risk)
  • Workspace provision failure rate over 20%
  • Cluster autoscaler unable to add nodes
  • AI agent workspace pool exhausted (no available microVMs)
Warning Alerts (Slack/Email)
  • High memory usage on nodes (over 85%)
  • Persistent volume usage over 80%
  • Workspace provision time exceeds 5 minutes
  • SSL certificate expiring in under 7 days
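Critical conditions like the provision-failure rate and microVM pool exhaustion above can be encoded as Prometheus alerting rules. The metric names here are assumptions; substitute whatever series your CDE exporter actually emits:

```yaml
# Prometheus alerting rules sketch for two of the page-worthy conditions
groups:
- name: cde-critical
  rules:
  - alert: WorkspaceProvisionFailureRateHigh
    expr: |
      sum(rate(cde_workspace_provision_failures_total[10m]))
        / sum(rate(cde_workspace_provision_attempts_total[10m])) > 0.20
    for: 10m                        # sustained, not a single blip
    labels: {severity: page}
    annotations:
      summary: "Workspace provision failure rate above 20%"
  - alert: AgentMicroVMPoolExhausted
    expr: cde_agent_microvm_pool_available == 0
    for: 5m
    labels: {severity: page}
    annotations:
      summary: "No available microVMs for AI agent workspaces"
```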

Dashboard Recommendations

Platform Health Dashboard
  • Control plane uptime SLA
  • Active workspace count
  • API request rate
  • Database query latency
  • Kubernetes cluster health
  • Network egress costs
Developer Experience Dashboard
  • Avg time to provision workspace
  • Workspace start success rate
  • IDE connection failures
  • User login errors
  • Top 10 slowest workspaces
  • Resource quota violations
Use Grafana with Prometheus for metrics, Loki for logs, and OpenTelemetry with Jaeger or Tempo for distributed tracing