Multi-Cluster Development
Scale your Cloud Development Environment infrastructure across multiple Kubernetes clusters for geographic distribution, improved resilience, AI workload isolation, and compliance requirements.
Why Multi-Cluster Development?
Running development workspaces across multiple Kubernetes clusters provides operational flexibility, improved resilience, dedicated AI and GPU infrastructure, and better alignment with organizational requirements.
AI Workload Isolation
Dedicate clusters to AI-intensive workloads including agentic coding sessions, LLM inference, model fine-tuning, and GPU-accelerated development. AI workloads have fundamentally different resource profiles than standard development - they consume large amounts of GPU memory, generate bursty compute demands, and often run unattended for hours. Isolating these workloads prevents them from starving traditional developer workspaces of CPU and memory.
Dedicated AI clusters can be provisioned with H100, A100, or L4 GPU node pools and configured with scheduling policies optimized for long-running agent tasks. This separation also simplifies cost attribution - GPU infrastructure costs are clearly assigned to AI workloads rather than blended across all development activity. Platforms like Coder, Ona (formerly Gitpod), and DevPod support workspace templates that target specific cluster labels, making it straightforward to route AI workloads to GPU-equipped infrastructure.
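As a concrete illustration at the Kubernetes scheduling level, AI workspace pods can be pinned to a GPU node pool with node labels and taints so they never land on general-purpose nodes. This is a minimal sketch; the label key, image, and GPU count are assumptions, not values any particular platform requires.

```python
# Minimal pod spec fragment pinning an AI workspace to a GPU node pool.
# The node label key and image name are illustrative assumptions.
ai_workspace_pod_spec = {
    "nodeSelector": {
        "cloud.example.com/gpu-type": "nvidia-h100",  # assumed label on the GPU node pool
    },
    "tolerations": [
        {
            # GPU nodes are commonly tainted so only AI workloads schedule onto them.
            "key": "nvidia.com/gpu",
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ],
    "containers": [
        {
            "name": "agent-session",
            "image": "registry.example.com/ai-workspace:latest",  # assumed image
            "resources": {"limits": {"nvidia.com/gpu": 1}},       # request one GPU
        }
    ],
}
```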
Geographic Distribution
Deploy clusters closer to your globally distributed development teams to reduce latency and improve developer experience. A team in Europe can connect to a cluster in Frankfurt while a team in Asia uses Singapore infrastructure.
This reduces network round-trip times for IDE interactions, file operations, and terminal sessions - making remote development feel truly local. Developers see sub-50ms latency to a nearby cluster instead of the 200ms or more they would experience connecting to a distant one.
Blast Radius Reduction
Isolate failures to individual clusters rather than impacting all development workspaces simultaneously. If one cluster experiences issues - whether from misconfiguration, resource exhaustion, or infrastructure problems - other clusters continue operating normally.
This isolation is particularly valuable during experimentation with cluster configurations, Kubernetes version upgrades, or when testing new CDE platform features. You can safely test changes on a single cluster without risking organization-wide disruption.
Compliance and Data Residency
Meet regulatory requirements by keeping development workspaces and data within specific geographic boundaries. GDPR, CCPA, and industry-specific regulations often mandate that certain data never leaves particular regions or countries.
Multi-cluster architectures let you deploy dedicated clusters in compliant regions. European customer data workspaces run in EU clusters, healthcare workspaces in HIPAA-compliant regions, and financial services in audited data centers - all managed through a unified control plane.
Resource Isolation
Separate workload types across clusters to optimize resource allocation and prevent resource contention. AI agent workloads, LLM fine-tuning jobs, and machine learning workspaces with GPU requirements run in specialized clusters, while standard development workspaces use general-purpose infrastructure.
This separation extends to organizational structure as well. Different business units, product teams, or customer segments can have dedicated clusters with appropriate resource quotas, security policies, and operational procedures - all while sharing common platform tooling. CDE platforms like Coder and Ona can route workspace creation to the appropriate cluster based on workload type automatically.
Multi-Cluster Architectures
Different architectural patterns offer tradeoffs between complexity, scalability, and operational overhead. Choose the pattern that best fits your organization's size and requirements.
Hub-Spoke Architecture
A central hub cluster manages the CDE control plane while spoke clusters run actual development workspaces. This is the most common pattern used by Coder and similar self-hosted CDE platforms.
Advantages
- Centralized management: Single control plane simplifies administration, user management, and policy enforcement across all clusters.
- Clear separation: Management plane and workload plane are isolated, improving security and reducing risk of workspace workloads affecting platform stability.
- Easier upgrades: Control plane can be upgraded independently from workspace clusters, reducing coordination complexity.
- Scalable onboarding: New spoke clusters can be added without modifying existing infrastructure or reconfiguring workspaces.
Challenges
- Hub availability: The hub cluster becomes a single point of failure for management operations - spoke clusters may continue running but new workspace creation stops.
- Network dependencies: Spokes must maintain connectivity to the hub, requiring reliable networking and introducing latency for control operations.
- Hub resource sizing: Control plane cluster must be sized appropriately to handle metadata for all workspaces across all spokes.
- Disaster recovery complexity: Hub backup and recovery procedures become critical for restoring management capabilities.
Best for: Organizations with 5+ clusters, centralized platform teams, and requirements for consistent policy enforcement across all development infrastructure.
Mesh Architecture
Each cluster runs both control plane and workload plane components, with clusters communicating peer-to-peer.
Advantages
- No single point of failure: Each cluster can operate independently even if communication with other clusters is interrupted.
- Lower latency: Management operations execute locally within each cluster without cross-cluster API calls.
- Geographic independence: Clusters in different regions can have poor or intermittent connectivity without affecting operations.
- Simplified networking: No hub-spoke network topology to maintain, reducing firewall rules and peering connections.
Challenges
- Configuration drift: Without centralized management, each cluster may diverge in configuration, policies, and versions over time.
- State synchronization: User accounts, permissions, and configurations must be replicated across all clusters through external mechanisms.
- Operational overhead: Upgrades, configuration changes, and troubleshooting must be performed separately on each cluster.
- Resource overhead: Control plane components consume resources on every cluster, increasing overall infrastructure costs.
Best for: Organizations with 2-4 clusters in very different regions, strong automation culture, and requirements for cluster independence during network partitions.
Federated Architecture
A lightweight federation layer coordinates multiple independent clusters without requiring tight coupling or constant connectivity.
Advantages
- Loose coupling: Clusters maintain independence while benefiting from coordination where needed for user experience.
- Flexible policies: Some policies can be enforced globally while others remain cluster-specific based on regional or business unit requirements.
- Gradual adoption: Existing clusters can be federated incrementally without requiring architectural changes or downtime.
- Multi-vendor support: Can federate different CDE platforms or Kubernetes distributions under a common interface.
Challenges
- Limited federation tools: Few CDE platforms provide native federation capabilities, requiring custom tooling or third-party solutions.
- Inconsistent experiences: Without strong coordination, developers may see different capabilities or behaviors across clusters.
- Complex identity: Federating identity and access control across clusters requires integration with external identity providers.
- Monitoring gaps: Federated view of cluster health, capacity, and usage requires aggregating metrics from multiple sources.
Best for: Large enterprises with existing cluster deployments, acquisitions bringing different CDE platforms, or highly autonomous regional teams requiring local control.
Cluster Selection Strategies
Intelligently routing workspace creation requests to appropriate clusters ensures optimal developer experience while meeting compliance and cost objectives.
Region-Based Routing
Route workspace creation based on developer location or team region. Developers in North America automatically get workspaces in US clusters, European developers in EU clusters, and Asian teams in Singapore or Tokyo.
Use GeoIP lookup on developer IP addresses, explicit region selection during workspace creation, or team-based defaults configured in user profiles. This ensures low latency and can satisfy data residency requirements.
Workload-Based Routing
Direct specific workload types to specialized clusters:
- AI agent sessions and machine learning workspaces requiring GPUs route to clusters with NVIDIA hardware.
- LLM fine-tuning jobs target high-memory nodes with NVLink interconnects.
- Data processing workloads needing large amounts of memory go to clusters with large instance types.
- Standard web development uses general-purpose infrastructure.
This specialization improves resource utilization and cost efficiency. GPU clusters run at high utilization rather than having expensive accelerators sitting idle, while commodity workloads use cheaper general-purpose nodes. With agentic AI workloads growing rapidly in 2026, workload-based routing is essential for managing the cost and performance of unattended AI coding sessions.
Cost-Optimized Placement
Route workspaces to clusters with available capacity or lower cloud provider pricing. If multiple regions meet latency requirements, automatically select the cluster with spot instance availability or lower on-demand pricing.
This strategy works well for batch processing workspaces, temporary workspaces for specific tasks, or development workspaces without strict latency requirements. Spot instance usage can reduce infrastructure costs by 40-70% for these workloads.
Hybrid Selection Strategies
Most organizations use hybrid strategies combining multiple routing rules with priority ordering. Start with compliance requirements (hard constraints), then optimize for latency and specialized hardware needs, and finally apply cost optimization within acceptable regions.
Example rule chain: If workspace handles EU customer data, use EU cluster (compliance). Else if workspace is an AI agent session or requires GPU, use GPU-enabled cluster (hardware). Else if developer in Asia, use Singapore cluster (latency). Else use cheapest available cluster with capacity (cost). This ensures requirements are met while optimizing for cost where possible.
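A minimal sketch of that rule chain in Python, assuming a hypothetical request object and placeholder cluster names, might look like this:

```python
from dataclasses import dataclass

@dataclass
class WorkspaceRequest:
    handles_eu_customer_data: bool
    is_ai_agent_session: bool
    needs_gpu: bool
    developer_region: str  # e.g. "us", "eu", "apac"

def hourly_cost(cluster: str) -> float:
    # Placeholder rates; in practice these come from billing or pricing APIs.
    return {"us-east": 1.0, "us-west": 1.1, "eu-frankfurt": 1.3}.get(cluster, 1.2)

def select_cluster(req: WorkspaceRequest, clusters_with_capacity: list[str]) -> str:
    """Priority-ordered selection: compliance, then hardware, then latency, then cost."""
    if req.handles_eu_customer_data:
        return "eu-frankfurt"                      # hard compliance constraint
    if req.is_ai_agent_session or req.needs_gpu:
        return "gpu-us-east"                       # specialized hardware
    if req.developer_region == "apac":
        return "apac-singapore"                    # latency
    # Cost optimization: cheapest cluster that still has capacity.
    return min(clusters_with_capacity, key=hourly_cost)
```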
Cross-Cluster Networking
Enable workspaces in different clusters to communicate securely for microservices development, database access, shared model registries, and shared infrastructure services.
Service Mesh
Deploy Istio, Linkerd, or Consul across multiple clusters to provide service discovery, encrypted communication, and traffic management between workspace services regardless of cluster location.
Multi-cluster service mesh creates a unified service registry. A workspace in cluster A can call services in cluster B using the same DNS names and service endpoints as local services. The mesh handles routing, load balancing, and mTLS encryption automatically.
Cluster Mesh (Cilium)
Cilium Cluster Mesh provides pod-to-pod connectivity across Kubernetes clusters with native network performance. Pods in different clusters can communicate directly using pod IPs as if they were in the same cluster.
This approach works at the network layer (L3/L4) rather than requiring application-layer proxies. It offers better performance than a service mesh for high-throughput applications but requires deploying the Cilium CNI across all clusters.
VPN Peering
Establish VPC peering connections (AWS), VNet peering (Azure), or VPC Network Peering (GCP) between cluster networks. This creates network-level connectivity allowing any resource in one cluster to reach any resource in another.
This is simpler than a service mesh but less granular: all pod-to-pod traffic is allowed once networking is configured. It requires careful CIDR planning to avoid IP address overlap across clusters, plus security group management for traffic control.
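Overlap checks are easy to automate before peering. The following sketch uses Python's standard ipaddress module to flag overlapping pod or service CIDRs across clusters; the CIDR values are examples only.

```python
import ipaddress
from itertools import combinations

# Pod/service CIDRs per cluster; values are illustrative examples.
cluster_cidrs = {
    "us-east": ["10.0.0.0/16", "10.96.0.0/12"],
    "eu-frankfurt": ["10.1.0.0/16", "10.112.0.0/12"],
    "apac-singapore": ["10.2.0.0/16", "10.128.0.0/12"],
}

def find_overlaps(cidrs_by_cluster):
    """Return cluster pairs whose CIDR ranges overlap, which would break peered routing."""
    overlaps = []
    for (a, a_cidrs), (b, b_cidrs) in combinations(cidrs_by_cluster.items(), 2):
        for ca in a_cidrs:
            for cb in b_cidrs:
                if ipaddress.ip_network(ca).overlaps(ipaddress.ip_network(cb)):
                    overlaps.append((a, ca, b, cb))
    return overlaps

print(find_overlaps(cluster_cidrs) or "No overlapping CIDRs")
```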
Cross-Cluster Security Considerations
Network Segmentation
Even with cross-cluster connectivity, implement network policies to restrict which workspaces can communicate across clusters. Default-deny policies with explicit allow rules prevent unauthorized access.
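A common starting point is a namespace-wide default-deny policy plus narrowly scoped allow rules. The manifests below, expressed as Python dicts, are an illustrative sketch; the namespace, labels, CIDR, and port are assumptions.

```python
# Default-deny for everything in the workspaces namespace.
default_deny = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-all", "namespace": "workspaces"},
    "spec": {
        "podSelector": {},               # applies to every pod in the namespace
        "policyTypes": ["Ingress", "Egress"],
    },
}

# Explicit allow: workspace pods may reach only the shared database range
# in the peered cluster, on the database port.
allow_shared_db = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-shared-database", "namespace": "workspaces"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "workspace"}},
        "policyTypes": ["Egress"],
        "egress": [
            {
                "to": [{"ipBlock": {"cidr": "10.1.50.0/24"}}],
                "ports": [{"protocol": "TCP", "port": 5432}],
            }
        ],
    },
}
```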
Encryption in Transit
Ensure all cross-cluster traffic is encrypted. Service mesh provides mTLS automatically, while VPN peering requires additional configuration or application-level TLS.
Identity and Access
Use workload identity (SPIFFE/SPIRE) to authenticate services across clusters rather than relying on network-level security alone. Services prove their identity regardless of source cluster.
Audit and Monitoring
Log all cross-cluster traffic patterns and monitor for anomalies. Unexpected cross-cluster communication may indicate compromised workspaces or misconfigurations.
State Management Across Clusters
Managing persistent state, configuration, and data across multiple clusters requires careful architecture to maintain consistency and availability.
Shared Databases
Deploy regional database clusters (PostgreSQL, MySQL, MongoDB) that multiple workspace clusters can access. Use managed cloud database services (RDS, Cloud SQL, Azure Database) for operational simplicity.
Workspace metadata, user profiles, and shared application data live in these databases rather than in cluster-local storage. This enables workspace portability and consistent state across clusters.
Considerations:
- Cross-region latency for database queries
- Database becomes critical dependency
- Read replicas can improve read performance
Distributed Storage
Use object storage (S3, GCS, Azure Blob) or distributed file systems (Ceph, MinIO) for workspace persistent volumes. Workspace data can be accessed from any cluster through standard storage APIs.
Store workspace home directories, project files, and build artifacts in object storage. Workspaces can be created in any cluster and access the same data, enabling true workspace mobility across infrastructure.
Considerations:
- FUSE mounting for file system semantics
- Caching layers for improved performance
- Data transfer costs between regions
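A minimal sketch of the object-storage approach, assuming an S3 bucket (the name is a placeholder) with cross-region replication, shows how a workspace home directory can be archived on one cluster and restored on another:

```python
import tarfile
import boto3

s3 = boto3.client("s3")
BUCKET = "cde-workspace-snapshots"     # assumed, replicated across regions

def snapshot_workspace(workspace_id: str, home_dir: str) -> str:
    """Archive a workspace home directory and upload it to object storage
    so it can be restored from any cluster."""
    archive = f"/tmp/{workspace_id}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(home_dir, arcname=".")
    key = f"snapshots/{workspace_id}.tar.gz"
    s3.upload_file(archive, BUCKET, key)
    return key

def restore_workspace(key: str, home_dir: str) -> None:
    """Download and unpack a snapshot into a freshly created workspace,
    regardless of which cluster it runs in."""
    archive = "/tmp/restore.tar.gz"
    s3.download_file(BUCKET, key, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(home_dir)
```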
Configuration Sync
Replicate Kubernetes ConfigMaps, Secrets, and custom resources across clusters using GitOps tools (Flux, Argo CD) or federation controllers. Changes to workspace templates propagate automatically.
Store configuration in Git as the source of truth. Automation tools monitor Git repositories and ensure all clusters have consistent configuration, reducing drift and enabling atomic updates across infrastructure.
Considerations:
- Eventual consistency delays during updates
- Cluster-specific overrides for regional config
- Secret encryption and secure distribution
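As one illustration, assuming Flux as the GitOps tool, each cluster can reconcile a shared base plus a cluster-specific overlay from the same repository. The repository layout and names below are assumptions for this sketch.

```python
# Generate one Flux Kustomization per cluster, each pointing at a
# cluster-specific path in the shared configuration repository.
def kustomization_for(cluster: str) -> dict:
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": f"workspace-templates-{cluster}", "namespace": "flux-system"},
        "spec": {
            "interval": "5m",                                    # reconcile every 5 minutes
            "sourceRef": {"kind": "GitRepository", "name": "cde-config"},
            "path": f"./clusters/{cluster}",                     # base plus per-cluster overrides
            "prune": True,                                       # remove resources deleted from Git
        },
    }

manifests = [kustomization_for(c) for c in ["us-east", "eu-frankfurt", "apac-singapore"]]
```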
State Management Best Practices
Separate concerns: User state (profiles, preferences) goes in databases. Workspace state (files, build outputs) goes in distributed storage. Configuration goes in Git. This separation enables independent scaling and optimization.
Plan for network partitions: Design systems to degrade gracefully when clusters cannot communicate. Read-only mode, cached credentials, and eventual consistency patterns help workspaces remain functional during network issues.
Implement versioning: Version all configuration and state schemas to enable rolling updates across clusters. Older clusters can work with newer state formats during transition periods.
Monitor consistency: Track replication lag, configuration drift, and state synchronization delays. Alert when clusters diverge beyond acceptable thresholds to catch issues before they impact developers.
Developer Experience in Multi-Cluster Environments
The best multi-cluster architecture is one developers do not need to think about. Hide complexity behind intuitive interfaces and intelligent automation.
Transparent Cluster Selection
Developers should rarely need to know or care which cluster runs their workspace. The platform automatically selects the optimal cluster based on their location, project requirements, and resource availability.
When developers create a workspace, show them a single "Create Workspace" button. Behind the scenes, the system evaluates selection rules, checks cluster capacity, validates compliance requirements, and provisions the workspace in the appropriate location.
For advanced users who need control, provide optional cluster selection with clear descriptions of each cluster's capabilities, geographic location, and resource availability. Default to automatic selection for everyone else.
Consistent Tooling
Ensure workspace templates, pre-installed tools, and development environments are identical across all clusters. Whether using Coder templates, Ona workspace classes, DevPod providers, or GitHub Codespaces configurations, a workspace created in the US cluster should have exactly the same capabilities as one in Europe or Asia.
Use container image registries with multi-region replication to ensure all clusters pull from the same image versions. Configuration sync tools keep workspace templates consistent, and automated testing validates that each cluster provides the same developer experience.
Document any cluster-specific differences clearly. If GPU workspaces for AI agent sessions are only available in certain regions or some clusters have higher resource limits, surface this information during workspace creation rather than after failed attempts.
Workspace Portability
Enable developers to move workspaces between clusters when needed. A developer traveling from New York to London should be able to recreate their workspace in a European cluster for better performance during their trip.
Implement workspace backup and restore functionality that works across clusters. Developers snapshot their current workspace state (files, installed packages, git repositories) and restore it in a different cluster with a few clicks.
For stateless workspaces using remote Git repositories and package managers, portability is automatic. For workspaces with local state, provide clear guidance on what will and will not transfer between clusters to set appropriate expectations.
Unified Dashboard
Present all workspaces across all clusters in a single dashboard. Developers should not need to log into different UIs or remember which cluster hosts each workspace.
The dashboard shows workspace status, cluster location (as metadata rather than requiring user action), resource usage, and access URLs. Developers can start, stop, or delete workspaces regardless of backing cluster through the same interface.
Include cluster health indicators for transparency. If a cluster is experiencing issues or scheduled for maintenance, show affected workspaces with recommendations to migrate to alternative clusters temporarily.
Communication and Documentation
Even with transparent automation, developers benefit from understanding the multi-cluster architecture. Provide documentation explaining why multiple clusters exist, which regions they serve, and how cluster selection works.
Communicate cluster maintenance windows clearly with advance notice. If a cluster will be temporarily unavailable, give developers time to migrate workspaces or plan around the downtime. Automated migration suggestions reduce friction.
Create feedback channels for developers to report cluster-specific issues. Performance problems, networking issues, or missing capabilities in specific clusters should be easy to report and tracked for resolution.
Multi-Cluster for AI Workloads
AI agent sessions, LLM fine-tuning, and GPU-accelerated development have unique multi-cluster requirements that differ from traditional developer workspaces.
GPU Cluster Topology
Not all GPU availability is equal across cloud regions. H100 and A100 instances are scarce in many regions, while L4 and T4 GPUs are more widely available. Design your multi-cluster topology around GPU availability - place dedicated AI clusters in regions with reliable GPU capacity and preemptible pricing.
Consider multi-cloud GPU clusters to avoid dependency on a single provider's capacity. An AI workload cluster on GCP for TPU access, another on AWS for Inferentia, and a third on Azure for cost-optimized spot GPU instances gives your platform flexibility to route AI workloads where capacity is available.
Agentic Session Routing
AI coding agents like Claude Code, Cursor Agent, and Windsurf create long-running unattended sessions that consume resources differently than interactive developer workspaces. Route these sessions to dedicated clusters with policies for maximum runtime, resource caps, and automatic cleanup of abandoned sessions.
Coder and Ona both support workspace template parameters that can trigger routing to specific clusters. Define agent-specific templates that target GPU-equipped clusters with appropriate timeout policies, cost limits, and monitoring hooks to prevent runaway sessions from consuming expensive resources indefinitely.
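Template parameters handle the routing, but runtime policies still need enforcement. The following sketch is written against a hypothetical CDE API (the workspace fields and stop() method are assumptions) and reaps agent sessions that exceed runtime or idle limits:

```python
from datetime import datetime, timedelta, timezone

MAX_RUNTIME = timedelta(hours=8)       # assumed policy for agent sessions
IDLE_TIMEOUT = timedelta(hours=1)

def reap_agent_sessions(workspaces):
    """Stop agent workspaces that exceed runtime or idle limits.
    `workspaces` is a list of records from a hypothetical CDE API, each
    carrying created_at / last_activity_at timestamps and a stop() method."""
    now = datetime.now(timezone.utc)
    for ws in workspaces:
        if ws.workload_type != "ai-agent":
            continue
        if now - ws.created_at > MAX_RUNTIME:
            ws.stop(reason="max runtime exceeded")
        elif now - ws.last_activity_at > IDLE_TIMEOUT:
            ws.stop(reason="abandoned session")
```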
Model Registry Access
AI workloads frequently pull large model weights (tens to hundreds of gigabytes) from registries. In a multi-cluster setup, avoid repeatedly downloading models across regions by deploying regional model caches or registry mirrors. Place model registries in the same region as your GPU clusters to minimize transfer time and egress costs.
Use persistent volume claims or shared storage layers so that model weights downloaded by one workspace are available to all workspaces on the same cluster without re-downloading. This significantly reduces workspace startup time for AI development environments.
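One way to share downloaded weights within a cluster is a ReadWriteMany volume mounted into every AI workspace. The claim below is an illustrative sketch; the storage class name and size are assumptions and depend entirely on your storage backend.

```python
# Shared model cache claimed once per cluster and mounted by all AI workspaces.
model_cache_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "shared-model-cache", "namespace": "ai-workspaces"},
    "spec": {
        "accessModes": ["ReadWriteMany"],          # shared across workspaces on the cluster
        "storageClassName": "regional-filestore",  # assumed RWX-capable storage class
        "resources": {"requests": {"storage": "2Ti"}},
    },
}
```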
AI Cost Attribution
GPU infrastructure costs dwarf standard compute costs - a single H100 instance can cost more per hour than an entire team's standard development workspaces. Dedicated AI clusters simplify cost attribution by clearly separating GPU spending from general development infrastructure.
Implement per-workspace and per-team cost tracking on AI clusters. Tag workspaces with team, project, and workload type labels so finance teams can understand exactly which AI initiatives are driving infrastructure costs. This visibility is critical for justifying and optimizing AI development investment.
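A minimal sketch of that aggregation, assuming usage records already carry the team and project labels applied at workspace creation (the records and rates below are placeholders):

```python
from collections import defaultdict

def cost_by_team(workspace_usage):
    """Aggregate GPU-hours and cost per (team, project) from usage records."""
    totals = defaultdict(lambda: {"gpu_hours": 0.0, "cost_usd": 0.0})
    for record in workspace_usage:
        key = (record["team"], record["project"])
        totals[key]["gpu_hours"] += record["gpu_hours"]
        totals[key]["cost_usd"] += record["gpu_hours"] * record["gpu_hourly_rate"]
    return dict(totals)

# Placeholder usage records for illustration only.
usage = [
    {"team": "ml-platform", "project": "agent-eval", "gpu_hours": 12.0, "gpu_hourly_rate": 6.0},
    {"team": "search", "project": "reranker", "gpu_hours": 3.5, "gpu_hourly_rate": 1.2},
]
print(cost_by_team(usage))
```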
AI Cluster Scaling Patterns
AI workloads are inherently bursty - a team might launch dozens of agent sessions simultaneously during a sprint, then have near-zero GPU usage overnight. Use cluster autoscaling with aggressive scale-down policies and node pool preemption to match capacity to demand. Configure minimum node counts to avoid cold-start delays for the first workspace of the day, but scale to zero during extended idle periods to control costs.
For predictable batch workloads like nightly model training or scheduled fine-tuning jobs, use Kubernetes CronJobs to pre-scale GPU node pools before the workload arrives. Combine this with spot or preemptible instances for batch-tolerant workloads to reduce GPU costs by 60-80% compared to on-demand pricing.
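A minimal sketch of pre-scaling, assuming GKE and a service account with permission to resize node pools (project, cluster, and pool names are placeholders; other clouds have equivalent commands):

```python
# CronJob that resizes a GPU node pool shortly before the nightly training window.
prescale_cronjob = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "prescale-gpu-pool", "namespace": "platform-ops"},
    "spec": {
        "schedule": "30 1 * * *",        # 01:30, ahead of a 02:00 training window
        "jobTemplate": {
            "spec": {
                "template": {
                    "spec": {
                        "serviceAccountName": "node-pool-scaler",  # assumed account with resize rights
                        "restartPolicy": "Never",
                        "containers": [
                            {
                                "name": "scale-up",
                                "image": "google/cloud-sdk:slim",
                                "command": [
                                    "gcloud", "container", "clusters", "resize", "ai-cluster",
                                    "--node-pool", "h100-pool",
                                    "--num-nodes", "8",
                                    "--region", "us-central1",
                                    "--quiet",
                                ],
                            }
                        ],
                    }
                }
            }
        },
    },
}
```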
Operational Challenges
Running multiple clusters introduces operational complexity. Address these challenges with robust tooling and processes.
Monitoring Across Clusters
Traditional single-cluster monitoring approaches break down with multiple clusters. You need visibility into resource utilization, workspace health, and platform performance across all infrastructure simultaneously.
Aggregated Metrics
Deploy a central metrics aggregation system (Prometheus with Thanos, Grafana Mimir, or Datadog) that collects metrics from all clusters. This enables dashboards showing organization-wide resource usage, capacity, and performance trends.
Create cluster-scoped and global dashboards. Operators can drill down from organization-wide views into specific cluster details when investigating issues. Include cluster comparison views to identify outliers.
Health Checks
Implement synthetic monitoring that regularly tests critical workflows on each cluster. Automated tests create workspaces, connect to them, execute code, and verify expected behavior - catching issues before users report them.
Track cluster-specific SLIs like workspace creation time, connection latency, and failure rates. Alert when any cluster degrades below acceptable thresholds even if other clusters remain healthy.
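A synthetic probe can be as simple as a scheduled script per cluster. The sketch below targets a hypothetical CDE REST API (endpoint paths and payloads are assumptions) and reports whether workspace creation met its SLO:

```python
import time
import requests

SLO_CREATE_SECONDS = 120   # example threshold; tune per cluster

def check_workspace_creation(api_base: str, token: str, cluster: str) -> bool:
    """Create a canary workspace, wait for it to become ready, and compare
    the elapsed time against the SLO. Returns False on SLO breach."""
    headers = {"Authorization": f"Bearer {token}"}
    start = time.monotonic()
    resp = requests.post(f"{api_base}/workspaces",
                         json={"template": "canary", "cluster": cluster},
                         headers=headers, timeout=30)
    resp.raise_for_status()
    ws_id = resp.json()["id"]
    while time.monotonic() - start < SLO_CREATE_SECONDS:
        status = requests.get(f"{api_base}/workspaces/{ws_id}",
                              headers=headers, timeout=10).json()
        if status["state"] == "running":
            requests.delete(f"{api_base}/workspaces/{ws_id}", headers=headers, timeout=30)
            return True
        time.sleep(5)
    return False   # emit an alert / failed SLI sample for this cluster
```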
Log Aggregation
Centralize logs from all clusters into a unified logging platform (Elasticsearch, Loki, Splunk, or cloud provider logging). Developers and operators should search logs across all clusters from a single interface.
Correlation
Tag logs with cluster identifiers, workspace IDs, and user identifiers to enable correlation across infrastructure. When troubleshooting issues affecting multiple clusters, filter logs by affected users or workspaces regardless of location.
Implement distributed tracing with tools like Jaeger or Zipkin to follow requests across cluster boundaries. See the complete path of a workspace creation request from API gateway through hub cluster to destination spoke cluster.
Retention and Cost
Log volume grows with every additional cluster. Implement intelligent retention policies: keep detailed logs from recent periods, summarized logs from older periods, and aggregate high-level metrics indefinitely.
Use log sampling for high-volume but low-value logs (health check passes, routine operations) while preserving all error logs and security-relevant events. This reduces storage costs while maintaining investigative capability.
Incident Response
When incidents occur, quickly determine scope and impact. Is the issue affecting one cluster, multiple clusters, or all infrastructure? Are specific types of workspaces impacted or all workspaces?
Runbooks
Maintain cluster-specific and cross-cluster runbooks. Include procedures for common issues like cluster capacity exhaustion, network connectivity problems, and control plane failures. Document which team members have access to each cluster.
Create decision trees for incident responders: How do I determine which cluster is affected? How do I drain workspaces from a failing cluster? When should I fail over to an alternative cluster versus attempting repair?
Communication
During cluster-specific incidents, clearly communicate which developers are affected. A status page showing per-cluster health helps developers understand if their workspaces are impacted or if issues are isolated elsewhere.
Provide mitigation options when possible. If a cluster is degraded, offer automated workspace migration to healthy clusters. Give developers agency to work around issues rather than waiting for resolution.
Upgrade Coordination
Upgrading Kubernetes versions, CDE platforms (Coder, Ona, Daytona), or infrastructure components across multiple clusters requires careful planning to avoid widespread disruption and maintain compatibility.
Rolling Upgrades
Never upgrade all clusters simultaneously. Use a phased approach: test cluster first, then one production cluster, then gradually roll out to remaining clusters over days or weeks. This limits blast radius if issues emerge.
Maintain version compatibility windows where clusters on different versions can coexist. The hub cluster should support workspace clusters on N-1 or N-2 versions during transitions. Test version interoperability explicitly.
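Version skew is easy to verify automatically. This sketch uses the official Kubernetes Python client to compare the hub's minor version against each spoke's; the kubeconfig context names are placeholders.

```python
from kubernetes import client, config

def minor_version(ctx: str) -> int:
    """Read the server minor version for a given kubeconfig context."""
    api_client = config.new_client_from_config(context=ctx)
    info = client.VersionApi(api_client).get_code()
    return int(info.minor.rstrip("+"))   # some providers append '+' to the minor version

def check_skew(hub_ctx: str, spoke_ctxs: list[str], max_skew: int = 2) -> list[str]:
    """Return spokes whose Kubernetes minor version lags the hub by more
    than the allowed N-1/N-2 window."""
    hub = minor_version(hub_ctx)
    return [ctx for ctx in spoke_ctxs if hub - minor_version(ctx) > max_skew]

out_of_window = check_skew("hub", ["spoke-us-east", "spoke-eu", "spoke-apac"])
```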
Automation
Automate upgrade testing with canary workspaces on newly upgraded clusters. Create workspaces, run integration tests, verify cross-cluster communication, and validate performance before declaring the upgrade successful.
Build automated rollback capabilities. If an upgrade causes issues, the system should automatically revert to the previous version or evacuate workspaces to stable clusters without manual intervention.
Frequently Asked Questions
Common questions about multi-cluster development environments
How many clusters should we run for our development environment?
Start with 2-3 clusters based on your primary geographic regions or business requirements. A typical starting point is one cluster per major region where you have significant developer populations (US, Europe, Asia), plus a dedicated cluster for GPU and AI workloads if your teams are adopting agentic coding tools. Add more clusters as specific needs emerge: compliance requirements, specialized hardware needs, or team isolation. Most organizations with under 500 developers do well with 3-5 clusters. Large enterprises may run 10-20 clusters across regions, business units, and specialized workload types including dedicated AI infrastructure.
Can workspaces in different clusters communicate with each other?
Yes, with appropriate networking configuration. Service mesh solutions like Istio or Linkerd, cluster mesh with Cilium, or VPC peering connections enable cross-cluster communication. The choice depends on your requirements: service mesh provides the richest features but adds complexity, while VPC peering is simpler but less granular. For most development use cases, workspaces should be self-contained or use shared external services rather than requiring direct pod-to-pod communication across clusters. Reserve cross-cluster networking for specific scenarios like microservices development or shared infrastructure services.
What happens to workspaces if a cluster goes down?
Workspaces on the failed cluster become unavailable until the cluster is restored. This is why multi-cluster architecture improves resilience - only workspaces on the affected cluster are impacted rather than your entire development environment. To minimize impact, implement automated workspace migration that detects cluster health issues and recreates workspaces in healthy clusters. Store workspace data in distributed storage (object storage, distributed databases) rather than cluster-local volumes so recreated workspaces can access the same data. For critical workspaces, some organizations run active-passive pairs across clusters with automated failover, though this doubles infrastructure costs.
How do we handle different Kubernetes versions across clusters during upgrades?
Design your CDE platform and infrastructure to tolerate version skew within Kubernetes supported ranges (typically N to N-2 minor versions). Test your CDE platform components against multiple Kubernetes versions before beginning upgrades. Use a phased rollout strategy: upgrade one cluster completely, monitor for issues for several days, then proceed with the next cluster. Your hub cluster should gracefully handle spoke clusters running older Kubernetes versions during this transition period. Document which features or configurations are version-dependent and communicate any temporary limitations. Plan upgrade cycles to complete all clusters within 3-6 months to avoid supporting too many versions simultaneously.
Should AI agent workloads run on the same clusters as developer workspaces?
In most cases, no. AI agent sessions have fundamentally different resource profiles - they consume GPU memory, run unattended for extended periods, and generate bursty compute demands that can starve interactive developer workspaces of resources. Dedicated AI clusters with GPU node pools, aggressive autoscaling policies, and cost-tracking labels provide better resource isolation and cost visibility. Platforms like Coder and Ona support routing workspace creation to specific clusters based on template parameters, making it straightforward to direct agent workloads to GPU-equipped infrastructure while keeping standard developer workspaces on cost-effective general-purpose clusters. Start with a single dedicated AI cluster and scale out to multiple GPU clusters as your AI workload volume grows.
Continue Learning
Explore related topics to deepen your understanding of CDE infrastructure and operations
Agentic Engineering
How AI coding agents interact with CDE infrastructure and what platform teams need to support them.
Architecture Patterns
Architectural approaches for CDE platforms including monolithic, distributed, and hybrid models.
High Availability
Design highly available CDE platforms that maintain uptime during infrastructure failures.
Capacity Planning
Plan cluster capacity for workspace workloads, growth projections, and resource optimization.
Disaster Recovery
Backup strategies, recovery procedures, and business continuity planning for CDE infrastructure.
AI Agent Security
Security patterns for AI coding agents running in CDE workspaces across multi-cluster environments.
