What metrics should I monitor for CDEs?

Key CDE metrics include: workspace startup time, active workspace count, resource utilization (CPU/memory/storage), connection latency, build success rate, and cost per developer. Track SLIs like workspace availability and startup time p95.

How do I set up Prometheus monitoring for CDEs?

Most CDE platforms expose Prometheus metrics endpoints. Configure Prometheus to scrape the control plane metrics, add Grafana dashboards for visualization, and set up alerts for workspace failures, high latency, or resource exhaustion.

What SLOs should CDE platforms target?

Typical CDE SLOs: 99.5% workspace availability, workspace startup under 2 minutes (p95), IDE connection latency under 100ms, and 99% build success rate. Adjust based on your organization's requirements.

Monitoring & Observability Guide

Complete monitoring strategy for Cloud Development Environments. Track performance, ensure reliability, optimize costs, and maintain SLAs.


Why Monitoring Matters

CDEs are mission-critical infrastructure. Without proper monitoring, you fly blind.

Business Impact

Developer productivity tracking
Incident response automation
Capacity planning data
Compliance evidence

SLA Management

Track uptime commitments
Measure error budgets
Identify SLO violations
Prove service quality

Cost Optimization

Detect idle workspaces
Right-size resources
Forecast spend trends
Reduce cloud waste

Key Metrics to Monitor

Track these metrics across your CDE platform for comprehensive observability.

Control Plane

API Latency

P50, P95, P99 response times

Error Rates

4xx, 5xx HTTP status codes

Auth Failures

SSO, OIDC, token validation

Workspaces

Creation Time

Request to ready (target: 60s)

Startup Time

Container/VM boot duration

Failure Rate

Failed provisioning attempts

Active Count

Running workspaces per user

User Experience

Connection Latency

SSH, VS Code Remote ping time

IDE Responsiveness

File save, autocomplete lag

Session Duration

Average time in workspace

Infrastructure

CPU Usage

Per workspace & cluster-wide

Memory Usage

RAM consumption & pressure

Disk I/O

Read/write throughput & IOPS

Network Traffic

Ingress/egress bandwidth

Cost Tracking

Spend per User

Daily/monthly cost attribution

Idle Time

Workspaces with no activity

Resource Waste

Over-provisioned instances

Budget Alerts

Threshold-based notifications

Prometheus & Grafana Setup

Industry-standard monitoring stack for CDEs. Open-source, powerful, and battle-tested.

Prometheus Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Coder control plane metrics
  - job_name: 'coder'
    static_configs:
      - targets: ['coder:2112']

  # Kubernetes node metrics
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node

  # Workspace pod metrics
  - job_name: 'workspace-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - coder-workspaces
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  # Cost metrics (AWS CloudWatch exporter)
  - job_name: 'cloudwatch-exporter'
    static_configs:
      - targets: ['cloudwatch-exporter:9106']

Essential PromQL Queries

Workspace Creation Time (P95)

histogram_quantile(0.95,
  sum(rate(workspace_build_duration_seconds_bucket[5m]))
  by (le, template)
)

API Error Rate

sum(rate(http_requests_total{
  status=~"5.."
}[5m]))
/
sum(rate(http_requests_total[5m]))

Idle Workspaces (> 1 hour)

count(
  time() - workspace_last_activity_seconds
  > 3600
)

Memory Usage per Workspace

sum(container_memory_usage_bytes{
  namespace="coder-workspaces"
}) by (pod) / 1024^3
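
For intuition about what the `histogram_quantile` query above computes, here is a minimal Python sketch of quantile estimation from cumulative buckets with linear interpolation. The function name, bucket bounds, and counts are made up for illustration, and this is a simplified stand-in rather than Prometheus's actual implementation.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with a +Inf bucket, similar to Prometheus `le` buckets.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical build durations: 50 builds <= 30s, 90 <= 60s, 98 <= 120s, 100 in total.
example_buckets = [(30.0, 50), (60.0, 90), (120.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, example_buckets)
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the values you care about, such as a 60-second target.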

Import Grafana Dashboard

Quick Import: Create a new Grafana dashboard and paste this JSON to get started immediately.

This dashboard includes panels for workspace metrics, control plane health, cost tracking, and SLO monitoring.

cde-monitoring-dashboard.json (excerpt)

{
  "dashboard": {
    "title": "CDE Monitoring - Overview",
    "panels": [
      {
        "title": "Workspace Creation Time (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(workspace_build_duration_seconds_bucket[5m])) by (le))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Workspaces",
        "targets": [
          {
            "expr": "count(workspace_status{state='running'})"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Control Plane API Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Cost per User (Last 7 Days)",
        "targets": [
          {
            "expr": "sum(workspace_cost_usd) by (user) / 7"
          }
        ],
        "type": "table"
      }
    ]
  }
}

Alerting Strategy

Tiered alerting ensures the right people get notified at the right time.

Critical

Immediate page-out. Service disruption or data loss risk.

Control plane down
Database unavailable
Workspace creation failing 100%
Storage volume full

Warning

Investigate during business hours. Degraded performance.

API latency > 500ms (P95)
Error rate > 1%
Workspace creation > 90s
Free cluster capacity < 20%

Info

Non-urgent notifications. Trends and optimizations.

Cost spike detected
Idle workspaces > 10
New user onboarded
Weekly usage report
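
The three tiers above can be sketched as a small classifier. This is a minimal illustration that assumes a metrics snapshot arrives as a dictionary. The key names are hypothetical, and the thresholds are the ones listed in the tiers.

```python
def classify_alert(snapshot):
    """Return 'critical', 'warning', 'info', or 'ok' for a metrics snapshot.

    All dictionary keys are hypothetical names chosen for this sketch.
    """
    critical = (
        not snapshot["control_plane_up"]
        or not snapshot["database_up"]
        or snapshot["build_failure_ratio"] >= 1.0   # workspace creation failing 100%
        or snapshot["storage_used_ratio"] >= 1.0    # storage volume full
    )
    if critical:
        return "critical"

    warning = (
        snapshot["api_latency_p95_s"] > 0.5              # API latency > 500ms (P95)
        or snapshot["error_ratio"] > 0.01                # error rate > 1%
        or snapshot["workspace_creation_s"] > 90         # workspace creation > 90s
        or snapshot["cluster_free_capacity_ratio"] < 0.20
    )
    if warning:
        return "warning"

    if snapshot["idle_workspaces"] > 10:
        return "info"
    return "ok"

healthy = {
    "control_plane_up": True,
    "database_up": True,
    "build_failure_ratio": 0.02,
    "storage_used_ratio": 0.6,
    "api_latency_p95_s": 0.2,
    "error_ratio": 0.001,
    "workspace_creation_s": 45,
    "cluster_free_capacity_ratio": 0.4,
    "idle_workspaces": 3,
}
```

Checking the critical conditions first means a snapshot that trips several tiers is always reported at its highest severity.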

Example Alert Rules

alerting-rules.yml

groups:
  - name: cde_critical
    interval: 30s
    rules:
      - alert: ControlPlaneDown
        expr: up{job="coder"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Coder control plane is down"
          description: "Control plane has been unreachable for 1 minute"

      - alert: WorkspaceCreationFailureRate
        expr: |
          sum(rate(workspace_build_failed_total[5m]))
          /
          sum(rate(workspace_build_total[5m]))
          > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 50% of workspace builds are failing"

  - name: cde_warnings
    interval: 1m
    rules:
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m]))
            by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API latency P95 > 500ms"

      - alert: HighMemoryUsage
        expr: |
          (
            sum(container_memory_usage_bytes{namespace="coder-workspaces"})
            /
            sum(container_spec_memory_limit_bytes{namespace="coder-workspaces"})
          ) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Workspace cluster memory usage > 85%"

Log Aggregation

Centralize logs from control plane, workspaces, and infrastructure for troubleshooting and auditing.

ELK Stack

Elasticsearch, Logstash, Kibana - open-source logging platform

Self-hosted option
Powerful search queries
Custom dashboards
Free tier available

Datadog

All-in-one SaaS platform for logs, metrics, and traces

Unified observability
APM integration
AI-powered insights
Built-in alerting

Splunk

Enterprise-grade log management and security analytics

Advanced analytics
Compliance reporting
Machine learning
Security focus

Structured Log Format (JSON)

workspace-build.log

{
  "timestamp": "2025-11-28T14:32:15Z",
  "level": "info",
  "service": "coder",
  "component": "provisioner",
  "event": "workspace_build_started",
  "workspace_id": "ws-abc123",
  "workspace_name": "dev-python",
  "user_id": "user-xyz789",
  "user_email": "user@example.com",
  "template": "python-3.11",
  "region": "us-east-1",
  "instance_type": "t3.large",
  "trace_id": "1a2b3c4d5e6f"
}

{
  "timestamp": "2025-11-28T14:33:47Z",
  "level": "info",
  "service": "coder",
  "component": "provisioner",
  "event": "workspace_build_completed",
  "workspace_id": "ws-abc123",
  "duration_seconds": 92,
  "status": "success",
  "trace_id": "1a2b3c4d5e6f"
}
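
One benefit of structured logs is that events can be correlated in a few lines of code. The sketch below pairs the start and completion events above by `trace_id` and derives the build duration from the timestamps. It assumes one JSON object per line and second-resolution UTC timestamps. The function name is hypothetical.

```python
import json
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def build_durations(log_lines):
    """Pair build start/completion events by trace_id; return {trace_id: seconds}."""
    started = {}
    durations = {}
    for line in log_lines:
        event = json.loads(line)
        ts = datetime.strptime(event["timestamp"], TIMESTAMP_FORMAT)
        if event["event"] == "workspace_build_started":
            started[event["trace_id"]] = ts
        elif event["event"] == "workspace_build_completed":
            start = started.get(event["trace_id"])
            if start is not None:
                durations[event["trace_id"]] = (ts - start).total_seconds()
    return durations

# Trimmed versions of the two log events shown above.
lines = [
    '{"timestamp": "2025-11-28T14:32:15Z", "event": "workspace_build_started", "trace_id": "1a2b3c4d5e6f"}',
    '{"timestamp": "2025-11-28T14:33:47Z", "event": "workspace_build_completed", "trace_id": "1a2b3c4d5e6f"}',
]
durations = build_durations(lines)
```

The derived value matches the `duration_seconds` field of the completion event, which makes it a useful consistency check on the logging pipeline.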

SLO/SLI Definition

Define clear Service Level Objectives and track Service Level Indicators to measure platform reliability.

Understanding SLIs and SLOs

SLI (Service Level Indicator)

A quantitative measure of service performance. Example: the percentage of workspace builds that complete in under 60 seconds.

SLO (Service Level Objective)

Your target reliability goal. Example: "99.9% workspace availability per month."

Error Budget

The amount of unreliability the SLO permits. For a 99.9% availability SLO, that is 43.2 minutes of downtime per 30-day month. Once it is depleted, freeze feature releases.
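
The downtime figure follows directly from the SLO and the length of the window. A minimal sketch of the arithmetic, assuming a 30-day window (the function name is illustrative):

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

budget_999 = error_budget_minutes(0.999)  # roughly 43.2 minutes
budget_995 = error_budget_minutes(0.995)  # roughly 216 minutes
```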

Example CDE SLOs

Availability

99.9% uptime for control plane and existing workspaces

Creation Time

95% of workspaces ready in < 60 seconds

API Latency

P99 response time < 500ms

Success Rate

99% of workspace builds succeed
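
To make the SLI/SLO distinction concrete, the sketch below computes an SLI from raw measurements and then compares it with an SLO target. The build durations are made-up sample data and the function names are hypothetical.

```python
def sli_fraction_within(durations_s, threshold_s):
    """SLI: fraction of builds that finished in under threshold_s seconds."""
    if not durations_s:
        return None
    good = sum(1 for d in durations_s if d < threshold_s)
    return good / len(durations_s)

def meets_slo(sli, target):
    """SLO check: does the measured SLI reach the target?"""
    return sli is not None and sli >= target

# Made-up build durations in seconds.
builds = [42, 55, 38, 61, 47, 92, 50, 44, 58, 49]
creation_sli = sli_fraction_within(builds, 60)   # 8 of 10 builds under 60s
creation_ok = meets_slo(creation_sli, 0.95)      # target: 95% in < 60 seconds
```

Here the measured SLI is 80%, so the 95% creation-time SLO is missed even though most builds were fast.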

Tracking SLO Compliance

slo-tracking.promql

# Availability SLO (99.9% = 0.999)
# Calculate uptime percentage over 30 days
avg(avg_over_time(up{job="coder"}[30d])) * 100

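# Workspace creation time SLO (95% < 60s)
# Percentage of builds meeting target over the last 30 days
# (the le value must match a bucket boundary exactly as exposed, e.g. "60" or "60.0")
(
  sum(increase(workspace_build_duration_seconds_bucket{le="60"}[30d]))
  /
  sum(increase(workspace_build_duration_seconds_count[30d]))
) * 100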

# Error budget remaining (for 99.9% SLO)
# 1 - (actual_errors / allowed_errors)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  (sum(increase(http_requests_total[30d])) * 0.001)
)
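
The error budget query can be checked by hand with the same formula, 1 - (actual_errors / allowed_errors). A minimal sketch with made-up request counts and a hypothetical function name:

```python
def error_budget_remaining(total_requests, failed_requests, slo=0.999):
    """Fraction of the request-based error budget still unspent.

    Returns a negative number when the budget is overspent, and None when
    there is no traffic (so no budget to measure against).
    """
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures == 0:
        return None
    return 1.0 - failed_requests / allowed_failures

# 2,000,000 requests at a 99.9% SLO allow about 2,000 failures.
remaining = error_budget_remaining(2_000_000, 500)   # about 0.75
overspent = error_budget_remaining(2_000_000, 3_000) # about -0.5
```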

Cost Monitoring

Track cloud spending, identify waste, and optimize resource allocation across your CDE platform.

What to Track

Per-User Cost

Compute + storage + network per developer

Template Cost

Avg cost by workspace template type

Idle Time

Workspaces running with no activity

Trend Analysis

Month-over-month spending growth

Optimization Strategies

Auto-Stop Policies

Shut down workspaces after 2 hours of inactivity

Right-Sizing

Downgrade over-provisioned instances

Spot Instances

Use spot for non-critical workloads (savings of up to roughly 70%, varying by instance type and region)

Reserved Capacity

Commit to 1-year terms for predictable workloads
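
An auto-stop policy comes down to comparing each running workspace's last activity time with an idle limit. The sketch below shows only that selection step, assuming workspace records are dictionaries with hypothetical `id`, `state`, and `last_activity_s` fields. Actually stopping a workspace is platform-specific and left out.

```python
def workspaces_to_stop(workspaces, now_s, idle_limit_s=2 * 3600):
    """Return IDs of running workspaces idle for longer than idle_limit_s."""
    return [
        ws["id"]
        for ws in workspaces
        if ws["state"] == "running" and now_s - ws["last_activity_s"] > idle_limit_s
    ]

now_s = 100_000  # made-up "current time" in seconds
fleet = [
    {"id": "ws-a", "state": "running", "last_activity_s": now_s - 3 * 3600},
    {"id": "ws-b", "state": "running", "last_activity_s": now_s - 1800},
    {"id": "ws-c", "state": "stopped", "last_activity_s": now_s - 5 * 3600},
]
stop_candidates = workspaces_to_stop(fleet, now_s)
```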

Cost Dashboard Queries

cost-queries.promql

# Daily cost per user (last 30 days)
sum(
  aws_ec2_instance_cost_usd{namespace="coder-workspaces"}
) by (user) / 30

# Hourly cost of idle workspaces (running > 1 hour with no activity)
# "and" keeps the cost series of idle workspaces without multiplying by idle seconds
sum(
  workspace_cost_per_hour
  and on(workspace_id) (time() - workspace_last_activity_seconds > 3600)
)

# Top 10 most expensive templates
topk(10,
  sum(workspace_total_cost_usd) by (template)
)

# Projected monthly spend
sum(workspace_cost_per_hour) * 730

# Savings from auto-stop (workspaces stopped by policy)
sum(workspace_autostop_savings_usd)
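
The arithmetic behind the projection and idle-cost queries can be sketched offline. The factor 730 is the average number of hours in a month (8,760 hours per year divided by 12). The workspace records and field names below are made up for illustration.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12 months

def projected_monthly_spend(workspaces):
    """Monthly spend if every workspace kept running at its current hourly rate."""
    return sum(ws["cost_per_hour"] for ws in workspaces) * HOURS_PER_MONTH

def idle_hourly_cost(workspaces, now_s, idle_after_s=3600):
    """Combined hourly rate of workspaces with no activity for over idle_after_s."""
    return sum(
        ws["cost_per_hour"]
        for ws in workspaces
        if now_s - ws["last_activity_s"] > idle_after_s
    )

now_s = 50_000  # made-up "current time" in seconds
running = [
    {"cost_per_hour": 0.10, "last_activity_s": now_s - 7200},
    {"cost_per_hour": 0.25, "last_activity_s": now_s - 60},
    {"cost_per_hour": 0.05, "last_activity_s": now_s - 4000},
]
monthly = projected_monthly_spend(running)
idle_rate = idle_hourly_cost(running, now_s)
```

The projection is an upper bound for a fleet with auto-stop enabled, since it assumes every workspace runs around the clock.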

AIOps & AI-Powered Observability

AIOps and AI-powered observability tools can automatically detect anomalies, correlate incidents, and suggest remediation for CDE infrastructure issues.

Anomaly Detection

AI models learn normal patterns for workspace startup times, resource usage, and API latency, then flag deviations before they become incidents.

Baseline learning per template
Dynamic threshold adjustment
Seasonal pattern awareness
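
As a rough intuition for baseline learning and dynamic thresholds, the sketch below flags a value that sits far from the mean of recent history, measured in standard deviations. This is a deliberately simple statistical stand-in, not how any particular AIOps product works. The startup times and the threshold of 3 are made-up values.

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=3.0):
    """Flag value if it is more than z_threshold standard deviations from the baseline."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) / spread > z_threshold

# Made-up workspace startup times (seconds) for one template.
startup_history = [41.0, 39.5, 42.2, 40.1, 38.9, 41.7, 40.4]
slow_start_flagged = is_anomalous(startup_history, 95.0)
normal_start_flagged = is_anomalous(startup_history, 41.0)
```

Keeping a separate history per template gives each template its own threshold, which is the idea behind per-template baselines.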

Incident Correlation

AI-powered monitoring can reduce mean time to resolution (MTTR) by automatically surfacing likely root causes across complex CDE infrastructure.

Cross-service correlation
Automatic root cause analysis
Topology-aware alerting

Predictive Remediation

AI suggests and can auto-execute remediation actions such as scaling resources, restarting unhealthy services, or rerouting traffic before users are impacted.

Suggested runbooks per incident type
Auto-scaling recommendations
Capacity forecasting

AIOps Platforms for CDE Monitoring

Datadog AI

Watchdog auto-detects anomalies across metrics, logs, and traces

Dynatrace Davis AI

Causal AI engine for automatic root cause analysis

New Relic AI

AI-assisted incident response and error analysis

Grafana ML

Machine learning-powered alerting and forecasting for Prometheus

Platform-Specific Monitoring

Setup guides for popular CDE platforms.

Coder

Built-in Prometheus metrics on :2112/metrics

workspace_build_duration_seconds
http_request_duration_seconds
workspace_status (running/stopped)
provisioner_job_queue_length
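
A metrics endpoint returns plain text in the Prometheus exposition format, which is easy to inspect by hand. The sketch below parses a small sample of that format. It reuses metric names from the list above, but the help text, label values, and numbers are invented, and the parser is a simplified illustration that ignores escaping and timestamps.

```python
import re

# metric name, optional {labels} block, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')

def parse_samples(text):
    """Return a list of (metric_name, labels_text, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples

# Invented sample output for illustration only.
exposition = """\
# HELP provisioner_job_queue_length Jobs waiting for a provisioner.
# TYPE provisioner_job_queue_length gauge
provisioner_job_queue_length 3
workspace_status{state="running"} 12
workspace_status{state="stopped"} 30
"""
samples = parse_samples(exposition)
```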


Ona (formerly Gitpod)

Self-hosted: metrics-server exposes workspace stats

gitpod_ws_startup_seconds
gitpod_ws_active_count
gitpod_server_api_calls_total
gitpod_workspace_failure_total


Codespaces

Use GitHub Audit Log API + billing API

codespaces.create events
codespaces.destroy events
Billing API for usage
Export to SIEM/log platform


Dashboard Templates

Ready-to-use Grafana dashboard JSON templates. Import and customize for your environment.

CDE Overview Dashboard

High-level platform health: active workspaces, creation time, error rates, SLO compliance.

12 panels, Prometheus, Grafana 9+

Cost Analysis Dashboard

Per-user spend, template costs, idle workspace detection, projected monthly spend.

8 panels, CloudWatch, Grafana 9+

User Experience Dashboard

Connection latency, IDE responsiveness, session durations, user satisfaction metrics.

10 panels, Prometheus, Grafana 9+

Infrastructure Dashboard

CPU, memory, disk I/O, network traffic across Kubernetes cluster and workspace pods.

16 panels, cAdvisor, Grafana 9+

Note: These dashboard templates are reference examples. You'll need to:

Adjust metric names to match your CDE platform (Coder, Ona, etc.)
Configure Prometheus data source in Grafana
Customize thresholds and alert rules for your SLOs
Add cloud provider cost metrics (AWS, Azure, GCP)

Ready to Monitor Your CDE Platform?

Start with our Prometheus + Grafana setup guide and implement comprehensive observability in under an hour.
