Monitoring & Observability Guide
Complete monitoring strategy for Cloud Development Environments. Track performance, ensure reliability, optimize costs, and maintain SLAs.
Why Monitoring Matters
CDEs are mission-critical infrastructure. Without monitoring, outages, slow workspace builds, and runaway costs go unnoticed until developers report them.
Business Impact
- Developer productivity tracking
- Incident response automation
- Capacity planning data
- Compliance evidence
SLA Management
- Track uptime commitments
- Measure error budgets
- Identify SLO violations
- Prove service quality
Cost Optimization
- Detect idle workspaces
- Right-size resources
- Forecast spend trends
- Reduce cloud waste
Key Metrics to Monitor
Track these metrics across your CDE platform for comprehensive observability.
Control Plane
- API Latency: P50, P95, P99 response times
- Error Rates: 4xx, 5xx HTTP status codes
- Auth Failures: SSO, OIDC, token validation
Workspaces
- Creation Time: Request to ready (target: 60s)
- Startup Time: Container/VM boot duration
- Failure Rate: Failed provisioning attempts
- Active Count: Running workspaces per user
User Experience
- Connection Latency: SSH, VS Code Remote ping time
- IDE Responsiveness: File save, autocomplete lag
- Session Duration: Average time in workspace
Infrastructure
- CPU Usage: Per workspace & cluster-wide
- Memory Usage: RAM consumption & pressure
- Disk I/O: Read/write throughput & IOPS
- Network Traffic: Ingress/egress bandwidth
Cost Tracking
- Spend per User: Daily/monthly cost attribution
- Idle Time: Workspaces with no activity
- Resource Waste: Over-provisioned instances
- Budget Alerts: Threshold-based notifications
Prometheus & Grafana Setup
Industry-standard monitoring stack for CDEs. Open-source, powerful, and battle-tested.
Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Coder control plane metrics
  - job_name: 'coder'
    static_configs:
      - targets: ['coder:2112']

  # Kubernetes node metrics
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node

  # Workspace pod metrics
  - job_name: 'workspace-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - coder-workspaces
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  # Cost metrics (AWS CloudWatch exporter)
  - job_name: 'cloudwatch-exporter'
    static_configs:
      - targets: ['cloudwatch-exporter:9106']
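The `keep` relabel rule in the workspace-pods job is easy to misread, so here is a minimal Python sketch of what it does: a discovered pod survives only if its `prometheus.io/scrape` annotation fully matches the regex. The function name and the sample pods are hypothetical; the only behaviours borrowed from Prometheus are that relabel regexes are anchored at both ends and that a missing label counts as an empty string.

```python
import re

SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

def keep_targets(targets, source_label, regex):
    """Mimic a relabel 'keep' action: retain targets whose label value fully matches regex."""
    pattern = re.compile(regex)
    # Relabel regexes are anchored, so fullmatch is the closest analogue.
    # A missing label is treated as the empty string.
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]

# Hypothetical pods as service discovery might present them
discovered = [
    {"pod": "ws-alice", SCRAPE_LABEL: "true"},
    {"pod": "ws-bob", SCRAPE_LABEL: "false"},
    {"pod": "ws-carol"},  # no annotation at all
]

kept = keep_targets(discovered, SCRAPE_LABEL, "true")
print([t["pod"] for t in kept])  # ['ws-alice']
```

In practice this means a workspace pod without the annotation is silently skipped, which is worth checking first when a workspace is missing from dashboards.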
Essential PromQL Queries
Workspace Creation Time (P95)
histogram_quantile(0.95,
  sum(rate(workspace_build_duration_seconds_bucket[5m]))
  by (le, template)
)
API Error Rate
sum(rate(http_requests_total{
status=~"5.."
}[5m]))
/
sum(rate(http_requests_total[5m]))
Idle Workspaces (> 1 hour)
count(
time() - workspace_last_activity_seconds
> 3600
)
Memory Usage per Workspace
sum(container_memory_usage_bytes{
namespace="coder-workspaces"
}) by (pod) / 1024^3
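Percentile queries over histograms return an estimate, not an exact value. The sketch below approximates in Python how a quantile is derived from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. It is a simplified illustration, not Prometheus's implementation, and the bucket boundaries and counts are made up.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with at least two entries and a final +Inf bucket. Requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the overflow bucket: report the highest finite bound
        return buckets[-2][0]
    lower, prev = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - prev) / (count - prev)

# Hypothetical build durations: 100 builds, bounds in seconds
build_buckets = [(15, 40), (30, 70), (60, 94), (120, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, build_buckets), 1))  # 72.0
```

Because the result is interpolated, its precision depends on bucket layout: with a 60s target, a bucket boundary at exactly 60 gives a far more useful answer than one wide 30s to 120s bucket.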
Import Grafana Dashboard
Quick Import: Create a new Grafana dashboard and paste this JSON to get started immediately.
This dashboard includes panels for workspace metrics, control plane health, cost tracking, and SLO monitoring.
{
  "dashboard": {
    "title": "CDE Monitoring - Overview",
    "panels": [
      {
        "title": "Workspace Creation Time (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(workspace_build_duration_seconds_bucket[5m])) by (le))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Workspaces",
        "targets": [
          {
            "expr": "count(workspace_status{state='running'})"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Control Plane API Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Cost per User (Last 7 Days)",
        "targets": [
          {
            "expr": "sum(workspace_cost_usd) by (user) / 7"
          }
        ],
        "type": "table"
      }
    ]
  }
}
Alerting Strategy
Tiered alerting ensures the right people get notified at the right time.
Critical
Immediate page-out. Service disruption or data loss risk.
- Control plane down
- Database unavailable
- Workspace creation failing 100%
- Storage volume full
Warning
Investigate during business hours. Degraded performance.
- API latency > 500ms (P95)
- Error rate > 1%
- Workspace creation > 90s
- Cluster capacity < 20%
Info
Non-urgent notifications. Trends and optimizations.
- Cost spike detected
- Idle workspaces > 10
- New user onboarded
- Weekly usage report
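To make the tiers concrete, the following sketch maps a snapshot of platform health to a severity using the thresholds listed above. The `PlatformSnapshot` type and its field names are hypothetical; a real setup would express these conditions as alert rules rather than application code.

```python
from dataclasses import dataclass

@dataclass
class PlatformSnapshot:
    control_plane_up: bool
    database_up: bool
    build_failure_ratio: float   # failed builds / total builds, 0.0 to 1.0
    storage_used_ratio: float    # used / total, 0.0 to 1.0
    api_p95_seconds: float
    error_ratio: float           # 5xx responses / all responses
    creation_seconds: float
    free_capacity_ratio: float   # unused cluster capacity, 0.0 to 1.0

def classify(s: PlatformSnapshot) -> str:
    """Return 'critical', 'warning', or 'ok' for one snapshot."""
    critical = (
        not s.control_plane_up
        or not s.database_up
        or s.build_failure_ratio >= 1.0   # every build is failing
        or s.storage_used_ratio >= 1.0    # volume full
    )
    if critical:
        return "critical"
    warning = (
        s.api_p95_seconds > 0.5           # P95 latency above 500ms
        or s.error_ratio > 0.01           # error rate above 1%
        or s.creation_seconds > 90
        or s.free_capacity_ratio < 0.20
    )
    return "warning" if warning else "ok"

healthy = PlatformSnapshot(True, True, 0.02, 0.6, 0.2, 0.001, 45, 0.4)
print(classify(healthy))  # ok
```

Checking critical conditions first keeps a single snapshot from producing both a page and a warning for the same incident.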
Example Alert Rules
groups:
  - name: cde_critical
    interval: 30s
    rules:
      - alert: ControlPlaneDown
        expr: up{job="coder"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Coder control plane is down"
          description: "Control plane has been unreachable for 1 minute"

      - alert: WorkspaceCreationFailureRate
        expr: |
          sum(rate(workspace_build_failed_total[5m]))
          /
          sum(rate(workspace_build_total[5m]))
          > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 50% of workspace builds are failing"

  - name: cde_warnings
    interval: 1m
    rules:
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m]))
            by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API latency P95 > 500ms"

      - alert: HighMemoryUsage
        expr: |
          (
            sum(container_memory_usage_bytes{namespace="coder-workspaces"})
            /
            sum(container_spec_memory_limit_bytes{namespace="coder-workspaces"})
          ) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Workspace cluster memory usage > 85%"
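The `for:` clause in these rules delays firing until the condition has held continuously for the given duration. The simulation below is a simplified sketch of that behaviour, assuming evenly spaced evaluations; the function name and sample data are hypothetical, and real Prometheus has additional details that are ignored here.

```python
def alert_states(samples, hold_seconds):
    """Trace alert state over time.

    samples: list of (timestamp_seconds, condition_is_true), in time order.
    Returns one of 'inactive', 'pending', 'firing' per sample.
    """
    states = []
    active_since = None
    for timestamp, condition in samples:
        if not condition:
            # Any evaluation where the condition is false resets the timer
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= hold_seconds else "pending")
    return states

# Evaluations every 30s, mirroring the cde_critical group with `for: 1m`
evaluations = [(0, True), (30, True), (60, True), (90, False), (120, True)]
print(alert_states(evaluations, hold_seconds=60))
# ['pending', 'pending', 'firing', 'inactive', 'pending']
```

The reset on a single false evaluation is why a flapping condition may never page anyone: a longer `for:` trades detection speed for fewer false alarms.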
Log Aggregation
Centralize logs from control plane, workspaces, and infrastructure for troubleshooting and auditing.
ELK Stack
Elasticsearch, Logstash, Kibana - open-source logging platform
- Self-hosted option
- Powerful search queries
- Custom dashboards
- Free tier available
Datadog
All-in-one SaaS platform for logs, metrics, and traces
- Unified observability
- APM integration
- AI-powered insights
- Built-in alerting
Splunk
Enterprise-grade log management and security analytics
- Advanced analytics
- Compliance reporting
- Machine learning
- Security focus
Structured Log Format (JSON)
{
  "timestamp": "2025-11-28T14:32:15Z",
  "level": "info",
  "service": "coder",
  "component": "provisioner",
  "event": "workspace_build_started",
  "workspace_id": "ws-abc123",
  "workspace_name": "dev-python",
  "user_id": "user-xyz789",
  "user_email": "user@example.com",
  "template": "python-3.11",
  "region": "us-east-1",
  "instance_type": "t3.large",
  "trace_id": "1a2b3c4d5e6f"
}
{
  "timestamp": "2025-11-28T14:33:47Z",
  "level": "info",
  "service": "coder",
  "component": "provisioner",
  "event": "workspace_build_completed",
  "workspace_id": "ws-abc123",
  "duration_seconds": 92,
  "status": "success",
  "trace_id": "1a2b3c4d5e6f"
}
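A shared `trace_id` is what makes these two events useful together. As a rough sketch, the Python below pairs start and completion events and derives the build duration from their timestamps. It assumes one JSON object per line (a common shipping format, not something this guide prescribes), and the function names are hypothetical.

```python
import json
from datetime import datetime

def parse_ts(value):
    # Older Python versions reject a trailing "Z" in fromisoformat, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def build_durations(lines):
    """Pair build start/completion events and return seconds per workspace."""
    started = {}
    durations = {}
    for line in lines:
        event = json.loads(line)
        key = (event.get("workspace_id"), event.get("trace_id"))
        name = event.get("event")
        if name == "workspace_build_started":
            started[key] = parse_ts(event["timestamp"])
        elif name == "workspace_build_completed" and key in started:
            elapsed = parse_ts(event["timestamp"]) - started.pop(key)
            durations[key[0]] = elapsed.total_seconds()
    return durations

# Trimmed versions of the two events shown above
log_lines = [
    '{"timestamp": "2025-11-28T14:32:15Z", "event": "workspace_build_started", '
    '"workspace_id": "ws-abc123", "trace_id": "1a2b3c4d5e6f"}',
    '{"timestamp": "2025-11-28T14:33:47Z", "event": "workspace_build_completed", '
    '"workspace_id": "ws-abc123", "trace_id": "1a2b3c4d5e6f"}',
]
print(build_durations(log_lines))  # {'ws-abc123': 92.0}
```

The computed 92 seconds matches the `duration_seconds` field in the completion event, which is a handy consistency check when validating a new log pipeline.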
SLO/SLI Definition
Define clear Service Level Objectives and track Service Level Indicators to measure platform reliability.
Understanding SLIs and SLOs
SLI (Service Level Indicator)
A quantitative measure of service performance. Example: the percentage of workspace builds that complete in under 60 seconds.
SLO (Service Level Objective)
Your target reliability goal. Example: "99.9% workspace availability per month."
Error Budget
The amount of unreliability the SLO permits. For a 99.9% SLO over a 30-day month: 43.2 minutes of downtime. Once the budget is depleted, freeze feature releases until reliability recovers.
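The 43.2-minute figure follows directly from the SLO and the window length. A small sketch of the arithmetic, assuming a 30-day window (the function name is illustrative):

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed by an availability SLO over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes

print(f"{error_budget_minutes(0.999):.1f}")   # 43.2
print(f"{error_budget_minutes(0.9999):.2f}")  # 4.32
```

Each extra nine cuts the budget tenfold, which is worth weighing before committing to a tighter target.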
Example CDE SLOs
Availability
99.9% uptime for control plane and existing workspaces
Creation Time
95% of workspaces ready in < 60 seconds
API Latency
P99 response time < 500ms
Success Rate
99% of workspace builds succeed
Tracking SLO Compliance
# Availability SLO (99.9% = 0.999)
# Average uptime percentage over the last 30 days
avg(avg_over_time(up{job="coder"}[30d])) * 100

# Workspace creation time SLO (95% < 60s)
# Percentage of builds in the last 30 days that met the target
# (requires a histogram bucket with an upper bound of exactly 60)
(
  sum(increase(workspace_build_duration_seconds_bucket{le="60"}[30d]))
  /
  sum(increase(workspace_build_duration_seconds_count[30d]))
) * 100

# Error budget remaining (for 99.9% SLO)
# 1 - (actual_errors / allowed_errors)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  (sum(increase(http_requests_total[30d])) * 0.001)
)
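The error-budget query can be sanity-checked by hand. This sketch repeats the same calculation in Python on made-up request counts; the function name and numbers are illustrative only.

```python
def error_budget_remaining(total_requests, failed_requests, slo=0.999):
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        # No traffic means nothing has been spent
        return 1.0
    return 1 - failed_requests / allowed_failures

# Hypothetical 30-day totals: 2,000,000 requests, 500 of them 5xx
remaining = error_budget_remaining(2_000_000, 500)
print(f"{remaining:.0%}")  # 75%
```

A negative result means the budget is overspent, which is the signal to pause risky changes.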
Cost Monitoring
Track cloud spending, identify waste, and optimize resource allocation across your CDE platform.
What to Track
- Per-User Cost: Compute + storage + network per developer
- Template Cost: Avg cost by workspace template type
- Idle Time: Workspaces running with no activity
- Trend Analysis: Month-over-month spending growth
Optimization Strategies
- Auto-Stop Policies: Shut down workspaces after 2 hours of inactivity
- Right-Sizing: Downgrade over-provisioned instances
- Spot Instances: Use spot for non-critical workloads (savings of around 70% are possible, depending on instance type and region)
- Reserved Capacity: Commit to 1-year terms for predictable workloads
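An auto-stop policy boils down to comparing each workspace's last activity against an idle limit. The sketch below shows that selection step only, using the 2-hour limit mentioned above; the function name and workspace data are hypothetical, and actually stopping a workspace would go through your CDE platform's own API.

```python
from datetime import datetime, timedelta, timezone

IDLE_LIMIT = timedelta(hours=2)

def workspaces_to_stop(last_activity, now, idle_limit=IDLE_LIMIT):
    """Return IDs of workspaces idle for at least idle_limit, sorted for stable output."""
    return sorted(
        workspace_id
        for workspace_id, last_seen in last_activity.items()
        if now - last_seen >= idle_limit
    )

now = datetime(2025, 11, 28, 16, 0, tzinfo=timezone.utc)
last_activity = {
    "ws-a": datetime(2025, 11, 28, 15, 30, tzinfo=timezone.utc),  # idle 30 minutes
    "ws-b": datetime(2025, 11, 28, 13, 45, tzinfo=timezone.utc),  # idle 2h15m
    "ws-c": datetime(2025, 11, 28, 14, 0, tzinfo=timezone.utc),   # idle exactly 2h
}
print(workspaces_to_stop(last_activity, now))  # ['ws-b', 'ws-c']
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the policy easy to test.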
Cost Dashboard Queries
# Daily cost per user (last 30 days)
sum(
  aws_ec2_instance_cost_usd{namespace="coder-workspaces"}
) by (user) / 30

# Hourly cost of idle workspaces (no activity for > 1 hour)
# "and" keeps the cost series only for idle workspaces; multiplying
# would scale the cost by the number of idle seconds instead
sum(
  workspace_cost_per_hour
  and on(workspace_id) (time() - workspace_last_activity_seconds > 3600)
)

# Top 10 most expensive templates
topk(10,
  sum(workspace_total_cost_usd) by (template)
)

# Projected monthly spend (730 = average hours per month)
sum(workspace_cost_per_hour) * 730

# Savings from auto-stop (workspaces stopped by policy)
sum(workspace_autostop_savings_usd)
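For a quick check of what the projection and idle-cost queries should return, the same arithmetic can be done offline. This is a sketch on invented hourly rates; the function names are hypothetical, and 730 is the average number of hours in a month (8760 / 12).

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12

def projected_monthly_spend(hourly_costs):
    """Monthly spend if every workspace keeps running at its current hourly rate."""
    return sum(hourly_costs.values()) * HOURS_PER_MONTH

def idle_hourly_cost(hourly_costs, idle_seconds, threshold=3600):
    """Combined hourly rate of workspaces idle for longer than threshold seconds."""
    return sum(
        cost
        for workspace_id, cost in hourly_costs.items()
        if idle_seconds.get(workspace_id, 0) > threshold
    )

# Hypothetical hourly rates in USD and seconds since last activity
hourly_costs = {"ws-a": 0.10, "ws-b": 0.25, "ws-c": 0.05}
idle_seconds = {"ws-a": 120, "ws-b": 7200, "ws-c": 600}

print(f"{projected_monthly_spend(hourly_costs):.2f}")         # 292.00
print(f"{idle_hourly_cost(hourly_costs, idle_seconds):.2f}")  # 0.25
```

The projection assumes workspaces run around the clock, so it is an upper bound once auto-stop policies are in place.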
Platform-Specific Monitoring
Setup guides for popular CDE platforms.
Coder
Built-in Prometheus metrics on :2112/metrics
- workspace_build_duration_seconds
- http_request_duration_seconds
- workspace_status (running/stopped)
- provisioner_job_queue_length
Gitpod
Self-hosted: metrics-server exposes workspace stats
- gitpod_ws_startup_seconds
- gitpod_ws_active_count
- gitpod_server_api_calls_total
- gitpod_workspace_failure_total
Codespaces
Use GitHub Audit Log API + billing API
- codespaces.create events
- codespaces.delete events
- Billing API for usage
- Export to SIEM/log platform
Dashboard Templates
Ready-to-use Grafana dashboard JSON templates. Import and customize for your environment.
CDE Overview Dashboard
High-level platform health: active workspaces, creation time, error rates, SLO compliance.
Cost Analysis Dashboard
Per-user spend, template costs, idle workspace detection, projected monthly spend.
User Experience Dashboard
Connection latency, IDE responsiveness, session durations, user satisfaction metrics.
Infrastructure Dashboard
CPU, memory, disk I/O, network traffic across Kubernetes cluster and workspace pods.
Note: These dashboard templates are reference examples. You'll need to:
- Adjust metric names to match your CDE platform (Coder, Gitpod, etc.)
- Configure Prometheus data source in Grafana
- Customize thresholds and alert rules for your SLOs
- Add cloud provider cost metrics (AWS, Azure, GCP)
Ready to Monitor Your CDE Platform?
Start with our Prometheus + Grafana setup guide and implement comprehensive observability in under an hour.