Monitoring & Observability Guide

Complete monitoring strategy for Cloud Development Environments. Track performance, ensure reliability, optimize costs, and maintain SLAs.

Why Monitoring Matters

CDEs are mission-critical infrastructure. Without proper monitoring, you fly blind.

Business Impact

  • Developer productivity tracking
  • Incident response automation
  • Capacity planning data
  • Compliance evidence

SLA Management

  • Track uptime commitments
  • Measure error budgets
  • Identify SLO violations
  • Prove service quality

Cost Optimization

  • Detect idle workspaces
  • Right-size resources
  • Forecast spend trends
  • Reduce cloud waste

Key Metrics to Monitor

Track these metrics across your CDE platform for comprehensive observability.

Control Plane

API Latency

P50, P95, P99 response times

Error Rates

4xx, 5xx HTTP status codes

Auth Failures

SSO, OIDC, token validation

Workspaces

Creation Time

Request to ready (target: 60s)

Startup Time

Container/VM boot duration

Failure Rate

Failed provisioning attempts

Active Count

Running workspaces per user

User Experience

Connection Latency

SSH, VS Code Remote ping time

IDE Responsiveness

File save, autocomplete lag

Session Duration

Average time in workspace

Infrastructure

CPU Usage

Per workspace & cluster-wide

Memory Usage

RAM consumption & pressure

Disk I/O

Read/write throughput & IOPS

Network Traffic

Ingress/egress bandwidth

Cost Tracking

Spend per User

Daily/monthly cost attribution

Idle Time

Workspaces with no activity

Resource Waste

Over-provisioned instances

Budget Alerts

Threshold-based notifications

Prometheus & Grafana Setup

Prometheus and Grafana form a widely adopted open-source monitoring stack for CDEs. Prometheus scrapes and stores metrics; Grafana visualizes them.

Prometheus Configuration

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Coder control plane metrics
  - job_name: 'coder'
    static_configs:
      - targets: ['coder:2112']

  # Kubernetes node metrics
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node

  # Workspace pod metrics
  - job_name: 'workspace-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - coder-workspaces
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  # Cost metrics (AWS CloudWatch exporter)
  - job_name: 'cloudwatch-exporter'
    static_configs:
      - targets: ['cloudwatch-exporter:9106']
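
Each scrape target above is expected to serve metrics over HTTP in the Prometheus text exposition format. If a signal is not exported natively, a small custom exporter can render that format itself. The following is a minimal sketch using only the Python standard library; the `render_gauge` helper and the sample workspace ID are hypothetical, and the metric name simply mirrors the one used in the queries in this guide.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: dict, label: str) -> str:
    """Render one gauge family in the Prometheus text exposition format.

    `samples` maps a label value (here a workspace ID) to the sample value.
    `help_text` is assumed to contain no backslashes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in sorted(samples.items()):
        lines.append(f'{name}{{{label}="{escape_label_value(label_value)}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample: one workspace with an arbitrary Unix timestamp.
    print(render_gauge(
        "workspace_last_activity_seconds",
        "Unix time of the last observed workspace activity",
        {"ws-abc123": 1764340335},
        "workspace_id",
    ), end="")
```

Serving this string from an HTTP endpoint and adding the endpoint as a target would make the metric available to the queries below. In practice an official client library is usually the better choice.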

Essential PromQL Queries

Workspace Creation Time (P95)

histogram_quantile(0.95, sum(rate(workspace_build_duration_seconds_bucket[5m])) by (le, template))
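
`histogram_quantile` does not see individual build times. It estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified Python model of that idea, not Prometheus's actual implementation; the bucket boundaries and counts in the usage example are made-up illustration data.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical build-duration buckets in seconds: le=30, le=60, le=120, le=+Inf.
    sample = [(30.0, 40), (60.0, 90), (120.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the estimate is interpolated, its accuracy depends on bucket layout. Placing a bucket boundary at the 60-second target keeps the result meaningful near the threshold that matters.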

API Error Rate

sum(rate(http_requests_total{ status=~"5.." }[5m])) / sum(rate(http_requests_total[5m]))

Idle Workspaces (> 1 hour)

count( time() - workspace_last_activity_seconds > 3600 )

Memory Usage per Workspace

sum(container_memory_usage_bytes{ namespace="coder-workspaces", container!="" }) by (pod) / 1024^3

Import Grafana Dashboard

Quick Import: Create a new Grafana dashboard and import this JSON as a starting point.

The full dashboard includes panels for workspace metrics, control plane health, cost tracking, and SLO monitoring; the excerpt below shows four of them.

cde-monitoring-dashboard.json (excerpt)
{
  "dashboard": {
    "title": "CDE Monitoring - Overview",
    "panels": [
      {
        "title": "Workspace Creation Time (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(workspace_build_duration_seconds_bucket[5m])) by (le))"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Active Workspaces",
        "targets": [
          {
            "expr": "count(workspace_status{state='running'})"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Control Plane API Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Cost per User (Last 7 Days)",
        "targets": [
          {
            "expr": "sum(workspace_cost_usd) by (user) / 7"
          }
        ],
        "type": "table"
      }
    ]
  }
}

Alerting Strategy

Tiered alerting ensures the right people get notified at the right time.

Critical

Immediate page-out. Service disruption or data loss risk.

  • Control plane down
  • Database unavailable
  • Workspace creation failing 100%
  • Storage volume full

Warning

Investigate during business hours. Degraded performance.

  • API latency > 500ms (P95)
  • Error rate > 1%
  • Workspace creation > 90s
  • Cluster capacity < 20%

Info

Non-urgent notifications. Trends and optimizations.

  • Cost spike detected
  • Idle workspaces > 10
  • New user onboarded
  • Weekly usage report
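
The tiers above amount to a priority-ordered set of threshold checks. The sketch below expresses them as a small Python classifier, assuming a hypothetical `PlatformSnapshot` structure holding already-collected values. It reads "Cluster capacity < 20%" as remaining free capacity, which is an interpretation rather than something the list states. A real deployment would encode the same logic as Prometheus alert rules.

```python
from dataclasses import dataclass


@dataclass
class PlatformSnapshot:
    """Point-in-time readings; field names are illustrative, not a real API."""
    control_plane_up: bool
    api_latency_p95_ms: float
    error_rate: float        # fraction of requests, 0.01 means 1%
    creation_time_s: float   # workspace creation time in seconds
    free_capacity: float     # fraction of cluster capacity still free
    idle_workspaces: int


def classify(snapshot: PlatformSnapshot) -> tuple:
    """Return the highest applicable tier and the reasons that triggered it."""
    if not snapshot.control_plane_up:
        return "critical", ["Control plane down"]

    warnings = []
    if snapshot.api_latency_p95_ms > 500:
        warnings.append("API latency > 500ms (P95)")
    if snapshot.error_rate > 0.01:
        warnings.append("Error rate > 1%")
    if snapshot.creation_time_s > 90:
        warnings.append("Workspace creation > 90s")
    if snapshot.free_capacity < 0.20:
        warnings.append("Cluster capacity < 20%")
    if warnings:
        return "warning", warnings

    if snapshot.idle_workspaces > 10:
        return "info", ["Idle workspaces > 10"]
    return "ok", []
```

Returning only the highest tier keeps a paging incident from being buried under lower-priority notifications raised by the same snapshot.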

Example Alert Rules

alerting-rules.yml
groups:
  - name: cde_critical
    interval: 30s
    rules:
      - alert: ControlPlaneDown
        expr: up{job="coder"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Coder control plane is down"
          description: "Control plane has been unreachable for 1 minute"

      - alert: WorkspaceCreationFailureRate
        expr: |
          sum(rate(workspace_build_failed_total[5m]))
          /
          sum(rate(workspace_build_total[5m]))
          > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 50% of workspace builds are failing"

  - name: cde_warnings
    interval: 1m
    rules:
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m]))
            by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API latency P95 > 500ms"

      - alert: HighMemoryUsage
        expr: |
          (
            sum(container_memory_usage_bytes{namespace="coder-workspaces", container!=""})
            /
            sum(container_spec_memory_limit_bytes{namespace="coder-workspaces", container!=""})
          ) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Workspace cluster memory usage > 85%"
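
The `for:` field in each rule delays firing until the expression has been true on every evaluation for that duration, which filters out brief spikes. A rough Python model of the resulting inactive, pending, and firing states follows. The `AlertState` class is hypothetical and timestamps are plain seconds; it illustrates the behavior and is not how Prometheus is implemented.

```python
from typing import Optional


class AlertState:
    """Tracks inactive -> pending -> firing for one alert, like a rule's `for:`."""

    def __init__(self, for_seconds: float) -> None:
        self.for_seconds = for_seconds
        self.active_since: Optional[float] = None

    def evaluate(self, now: float, condition: bool) -> str:
        """Record one rule evaluation at time `now` and return the alert state."""
        if not condition:
            # Any false evaluation resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    # Models `for: 1m` with a 30-second evaluation interval.
    state = AlertState(for_seconds=60)
    for t, down in [(0, True), (30, True), (60, True), (90, False)]:
        print(t, state.evaluate(t, down))
```

A longer `for:` reduces noise at the cost of slower detection, which is why the critical rules above use shorter durations than the warnings.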

Log Aggregation

Centralize logs from control plane, workspaces, and infrastructure for troubleshooting and auditing.

ELK Stack

Elasticsearch, Logstash, Kibana - open-source logging platform

  • Self-hosted option
  • Powerful search queries
  • Custom dashboards
  • Free tier available

Datadog

All-in-one SaaS platform for logs, metrics, and traces

  • Unified observability
  • APM integration
  • AI-powered insights
  • Built-in alerting

Splunk

Enterprise-grade log management and security analytics

  • Advanced analytics
  • Compliance reporting
  • Machine learning
  • Security focus

Structured Log Format (JSON)

workspace-build.log
{
  "timestamp": "2025-11-28T14:32:15Z",
  "level": "info",
  "service": "coder",
  "component": "provisioner",
  "event": "workspace_build_started",
  "workspace_id": "ws-abc123",
  "workspace_name": "dev-python",
  "user_id": "user-xyz789",
  "user_email": "user@example.com",
  "template": "python-3.11",
  "region": "us-east-1",
  "instance_type": "t3.large",
  "trace_id": "1a2b3c4d5e6f"
}

{
  "timestamp": "2025-11-28T14:33:47Z",
  "level": "info",
  "service": "coder",
  "component": "provisioner",
  "event": "workspace_build_completed",
  "workspace_id": "ws-abc123",
  "duration_seconds": 92,
  "status": "success",
  "trace_id": "1a2b3c4d5e6f"
}
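
Because both events share a `trace_id`, build duration can be recomputed from the timestamps and cross-checked against the reported `duration_seconds`. A minimal sketch using the two records above, trimmed to the relevant fields and written one JSON object per line (the `build_durations` helper is illustrative):

```python
import json
from datetime import datetime

LOG_LINES = [
    '{"timestamp": "2025-11-28T14:32:15Z", "event": "workspace_build_started", '
    '"workspace_id": "ws-abc123", "trace_id": "1a2b3c4d5e6f"}',
    '{"timestamp": "2025-11-28T14:33:47Z", "event": "workspace_build_completed", '
    '"workspace_id": "ws-abc123", "duration_seconds": 92, "status": "success", '
    '"trace_id": "1a2b3c4d5e6f"}',
]


def parse_ts(value: str) -> datetime:
    # Swap the trailing "Z" for an explicit offset so older Python versions parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def build_durations(lines: list) -> dict:
    """Pair started/completed events by trace_id and return seconds per trace."""
    started = {}
    durations = {}
    for line in lines:
        record = json.loads(line)
        ts = parse_ts(record["timestamp"])
        trace = record["trace_id"]
        if record["event"] == "workspace_build_started":
            started[trace] = ts
        elif record["event"] == "workspace_build_completed" and trace in started:
            durations[trace] = (ts - started[trace]).total_seconds()
    return durations


if __name__ == "__main__":
    print(build_durations(LOG_LINES))
```

For the sample records this yields 92 seconds, matching the logged `duration_seconds`. The same correlation is what a log platform performs when it groups entries by trace.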

SLO/SLI Definition

Define clear Service Level Objectives and track Service Level Indicators to measure platform reliability.

Understanding SLIs and SLOs

SLI (Service Level Indicator)

A quantitative measure of service performance. Example: "the percentage of workspace builds that complete in under 60 seconds."

SLO (Service Level Objective)

Your target reliability goal. Example: "99.9% workspace availability per month."

Error Budget

The amount of unreliability the SLO permits. For a 99.9% SLO over a 30-day month: 43.2 minutes of downtime. Once the budget is depleted, freeze feature releases until reliability recovers.
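
The 43.2-minute figure follows from (1 - 0.999) × 30 days × 24 hours × 60 minutes. A small Python helper, assuming a 30-day window by default, makes it easy to check other targets:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {error_budget_minutes(target):.2f} minutes per 30 days")
```

Each additional nine shrinks the budget tenfold, which is worth weighing before committing to a stricter target.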

Example CDE SLOs

Availability

99.9% uptime for control plane and existing workspaces

Creation Time

95% of workspaces ready in < 60 seconds

API Latency

P99 response time < 500ms

Success Rate

99% of workspace builds succeed

Tracking SLO Compliance

slo-tracking.promql
# Availability SLO (99.9% = 0.999)
# Uptime percentage over the last 30 days
avg(avg_over_time(up{job="coder"}[30d])) * 100

# Workspace creation time SLO (95% < 60s)
# Percentage of builds in the last 30 days that met the target
(
  sum(increase(workspace_build_duration_seconds_bucket{le="60"}[30d]))
  /
  sum(increase(workspace_build_duration_seconds_count[30d]))
) * 100

# Error budget remaining (for 99.9% SLO)
# 1 - (actual_errors / allowed_errors)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  (sum(increase(http_requests_total[30d])) * 0.001)
)
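
The error budget query compares the failures observed in the window with the failures the SLO would tolerate. The same calculation in Python, with made-up request counts in the usage example, shows how the result should be read:

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo: float = 0.999) -> float:
    """Fraction of the error budget left; a negative value means it is overspent."""
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures <= 0:
        raise ValueError("no requests in the window or slo is 100%; budget is undefined")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical 30-day totals: 10 million requests, 4,000 of them 5xx.
    remaining = error_budget_remaining(10_000_000, 4_000)
    print(f"{remaining:.0%} of the error budget remains")
```

A value of 1 means no budget has been used, 0 means it is exactly exhausted, and anything below 0 means the SLO has been missed for the window.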

Cost Monitoring

Track cloud spending, identify waste, and optimize resource allocation across your CDE platform.

What to Track

  • Per-User Cost

    Compute + storage + network per developer

  • Template Cost

    Avg cost by workspace template type

  • Idle Time

    Workspaces running with no activity

  • Trend Analysis

    Month-over-month spending growth

Optimization Strategies

  • Auto-Stop Policies

    Shutdown after 2 hours of inactivity

  • Right-Sizing

    Downgrade over-provisioned instances

  • Spot Instances

    Use spot for non-critical workloads (savings of up to roughly 70%, varying by provider and instance type)

  • Reserved Capacity

    Commit to 1-year for predictable workloads
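
An auto-stop policy reduces to comparing each running workspace's last activity time against the inactivity limit. The sketch below assumes a hypothetical `Workspace` record; a real platform would supply these values through its own API and would perform the stop itself.

```python
from dataclasses import dataclass

IDLE_LIMIT_SECONDS = 2 * 60 * 60  # the 2-hour inactivity policy described above


@dataclass
class Workspace:
    """Illustrative record, not a real platform type."""
    workspace_id: str
    running: bool
    last_activity: float  # Unix timestamp of the last observed activity


def workspaces_to_stop(workspaces: list, now: float,
                       idle_limit: float = IDLE_LIMIT_SECONDS) -> list:
    """Return IDs of running workspaces idle for longer than the limit."""
    return [
        w.workspace_id
        for w in workspaces
        if w.running and now - w.last_activity > idle_limit
    ]


if __name__ == "__main__":
    now = 1_000_000.0  # arbitrary reference time for the example
    fleet = [
        Workspace("ws-active", True, now - 600),       # active 10 minutes ago
        Workspace("ws-idle", True, now - 3 * 3600),    # idle for 3 hours
        Workspace("ws-stopped", False, now - 86_400),  # already stopped
    ]
    print(workspaces_to_stop(fleet, now))
```

What counts as activity (an open IDE connection, terminal input, a running build) matters more than the threshold itself, since a long-running job with no user input should usually not be stopped.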

Cost Dashboard Queries

cost-queries.promql
# Daily cost per user (last 30 days)
sum(
  aws_ec2_instance_cost_usd{namespace="coder-workspaces"}
) by (user) / 30

# Hourly cost of idle workspaces (no activity for > 1 hour)
# "and" keeps only the cost series of idle workspaces; "*" would multiply cost by idle seconds
sum(
  workspace_cost_per_hour
  and on(workspace_id) (time() - workspace_last_activity_seconds > 3600)
)

# Top 10 most expensive templates
topk(10,
  sum(workspace_total_cost_usd) by (template)
)

# Projected monthly spend
sum(workspace_cost_per_hour) * 730

# Savings from auto-stop (workspaces stopped by policy)
sum(workspace_autostop_savings_usd)
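
The projection query multiplies the current hourly rate by 730, the average number of hours in a month (8,760 / 12). The idle-cost query is meant to sum the hourly rate of only those workspaces that have been inactive for more than an hour. Both calculations are sketched below in Python with hypothetical per-workspace rates and idle times:

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12 months


def projected_monthly_spend(hourly_rates: dict) -> float:
    """Extrapolate the current total hourly rate to a full month."""
    return sum(hourly_rates.values()) * HOURS_PER_MONTH


def idle_hourly_cost(hourly_rates: dict, idle_seconds: dict,
                     threshold: float = 3600) -> float:
    """Sum the hourly rate of workspaces idle for longer than the threshold."""
    return sum(
        rate
        for workspace_id, rate in hourly_rates.items()
        if idle_seconds.get(workspace_id, 0) > threshold
    )


if __name__ == "__main__":
    # Made-up data: USD per hour and seconds since last activity.
    rates = {"ws-a": 0.10, "ws-b": 0.25}
    idle = {"ws-a": 7200, "ws-b": 120}
    print(f"Projected monthly spend: ${projected_monthly_spend(rates):.2f}")
    print(f"Hourly cost of idle workspaces: ${idle_hourly_cost(rates, idle):.2f}")
```

The projection assumes every workspace keeps running around the clock, so it is an upper bound once auto-stop policies are in place.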

Platform-Specific Monitoring

Setup guides for popular CDE platforms.

Coder

Built-in Prometheus metrics on :2112/metrics

  • workspace_build_duration_seconds
  • http_request_duration_seconds
  • workspace_status (running/stopped)
  • provisioner_job_queue_length

Gitpod

Self-hosted: metrics-server exposes workspace stats

  • gitpod_ws_startup_seconds
  • gitpod_ws_active_count
  • gitpod_server_api_calls_total
  • gitpod_workspace_failure_total

Codespaces

Use GitHub Audit Log API + billing API

  • codespaces.create events
  • codespaces.delete events
  • Billing API for usage
  • Export to SIEM/log platform

Dashboard Templates

Ready-to-use Grafana dashboard JSON templates. Import and customize for your environment.

CDE Overview Dashboard

High-level platform health: active workspaces, creation time, error rates, SLO compliance.

12 panels · Prometheus · Grafana 9+

Cost Analysis Dashboard

Per-user spend, template costs, idle workspace detection, projected monthly spend.

8 panels · CloudWatch · Grafana 9+

User Experience Dashboard

Connection latency, IDE responsiveness, session durations, user satisfaction metrics.

10 panels · Prometheus · Grafana 9+

Infrastructure Dashboard

CPU, memory, disk I/O, network traffic across Kubernetes cluster and workspace pods.

16 panels · cAdvisor · Grafana 9+

Note: These dashboard templates are reference examples. You'll need to:

  • Adjust metric names to match your CDE platform (Coder, Gitpod, etc.)
  • Configure Prometheus data source in Grafana
  • Customize thresholds and alert rules for your SLOs
  • Add cloud provider cost metrics (AWS, Azure, GCP)
