CDE Troubleshooting Guide
Common issues and solutions for cloud development environments
Quick Diagnostics
- `coder ping` - Test workspace connectivity
- `df -h` - Check disk space
- `free -m` - Check memory usage
- `nvidia-smi` - Check GPU status
- `curl -s $LLM_ENDPOINT/health` - Test LLM API connectivity
- `ps aux | grep agent` - Check running AI agents
- `cat /proc/meminfo | head` - Detailed memory breakdown
- `ss -tlnp` - List open ports and listeners
AI Agent Issues
Symptoms
- Agent fails with "API unreachable" or timeout errors
- Code completions and chat features stop working
- Agent logs show HTTP 401, 403, or 502 responses
Causes
- LLM API key expired, missing, or not injected into the workspace
- Corporate proxy or firewall blocking outbound HTTPS to the LLM provider
- Workspace network policy restricting egress to API endpoints
Solution
1. Verify the API key is set: `echo $OPENAI_API_KEY | head -c 8` (should show prefix)
2. Test connectivity: `curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY" | head`
3. Check proxy settings: `env | grep -i proxy`
4. For self-hosted LLMs, verify the inference server is running and reachable from the workspace subnet
5. Ask your platform team to allowlist LLM API domains in workspace network policies
Prevention
Inject LLM API keys via secrets manager (not hardcoded). Pre-configure proxy and egress rules in workspace templates. Use health-check scripts in workspace startup to fail fast if the LLM endpoint is unreachable.
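The startup health check mentioned above can be sketched as a small script. This is a minimal sketch assuming `LLM_ENDPOINT` is injected into the workspace and that the server exposes a `/health` route (as in the Quick Diagnostics list); adapt the path to your provider.

```shell
#!/usr/bin/env bash
# Fail fast at workspace startup if the LLM endpoint is unreachable.
# Assumes LLM_ENDPOINT is injected (e.g. by the workspace template) and
# that the server exposes a /health route.

llm_healthy() {
  # -f treats HTTP errors as failures; --max-time bounds each attempt
  curl -sf --max-time 5 --retry 3 "${LLM_ENDPOINT}/health" > /dev/null
}

if [ -n "${LLM_ENDPOINT:-}" ]; then
  if llm_healthy; then
    echo "LLM endpoint OK: ${LLM_ENDPOINT}"
  else
    echo "LLM endpoint unreachable: ${LLM_ENDPOINT}" >&2
    exit 1
  fi
fi
```

Run from the workspace startup script so a broken endpoint surfaces at boot rather than mid-session.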
Symptoms
- Workspace becomes sluggish when AI agent is running
- OOM kills triggered by agent processes
- CPU pegged at 100% during agent task execution
- Disk fills up with agent context caches or logs
Causes
- Agent running unbounded loops (recursive file edits, retries)
- Large context windows loading entire repos into memory
- Multiple agent sessions running concurrently
- Agent caching embeddings or conversation history to disk
Solution
1. Identify the process: `top -o %MEM` or `top -o %CPU`
2. Set resource limits: `ulimit -v 4194304` (4GB virtual memory cap)
3. Kill runaway agents: `pkill -f "agent-process-name"`
4. Clean agent caches: check `~/.cache/` and `/tmp/` for large agent artifacts
5. Request a workspace template with higher resource limits for AI workloads
Prevention
Use microVM or container-level resource limits (cgroups) for agent processes. Configure agent timeout and max-iteration settings. Use workspace templates sized for AI workloads (8GB+ RAM, 4+ vCPUs recommended). Set up monitoring alerts for agent resource spikes.
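When template-level cgroup limits are not available, a per-process cap can be approximated from the shell. A minimal sketch; the agent command below is a placeholder, not a real tool:

```shell
# Run a command under a virtual-memory cap inside a subshell, so a runaway
# agent is killed by failed allocations instead of taking the workspace down.
# cgroup or microVM limits at the template level remain the more robust option.

run_capped() {
  ( ulimit -v 4194304 && exec "$@" )   # 4 GB virtual memory cap
}

# Usage (placeholder agent command):
#   run_capped my-agent --task "refactor utils module"
```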
Symptoms
- Agent fails with "Permission denied" on file operations
- Agent cannot install packages or run build commands
- Agent blocked from network access it needs
Causes
- Sandbox restricts agent to read-only or scoped directory access
- Network egress policy blocks agent-initiated connections
- Workspace security context (seccomp, AppArmor) too restrictive
Solution
1. Check sandbox config for the agent tool you are using (Claude Code, Cursor, Copilot, etc.)
2. Verify the workspace user has write access: `ls -la /workspace/`
3. For Coder or Ona (formerly Gitpod) workspaces, check template security policies
4. Review the Kubernetes securityContext if running in a pod-based CDE
5. Ask the platform team to adjust sandbox permissions for legitimate agent workflows
Prevention
Define agent permission tiers in workspace templates - read-only for code review agents, read-write for coding agents, and network access for agents that need external APIs. Use microVM isolation for agents that require broader permissions.
Symptoms
- Agent editing the wrong files or deleting code
- Agent retrying the same failing operation repeatedly
- Agent spawning excessive subprocesses
- Git history shows dozens of rapid automated commits
Causes
- Vague or ambiguous task instructions given to agent
- Agent lacks sufficient context about the codebase
- No guardrails or iteration limits configured
- Agent tool permissions are too broad
Solution
1. Stop the agent immediately (Ctrl+C or kill the process)
2. Review changes: `git diff` and `git log --oneline -20`
3. Revert unwanted changes: `git checkout -- .` or `git reset HEAD~N`
4. Re-run with more specific task instructions and smaller scope
5. Enable agent confirmation mode (require approval before writes)
Prevention
Always commit or stash your work before running autonomous agents. Use agent tools with built-in confirmation and diff-review modes. Set max-iteration limits. Scope agents to specific directories rather than the entire repo. For production-critical code, require human review of agent-generated changes before merge.
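The commit-before-running advice can be wrapped in a small helper; the agent invocation shown is a placeholder:

```shell
# Snapshot the working tree before an autonomous agent touches it, so every
# agent edit shows up in `git diff` and can be reverted in one command.

checkpoint() {
  git add -A &&
  git commit -q --allow-empty -m "checkpoint: before agent run"
}

# Usage (placeholder agent command):
#   checkpoint && timeout 30m my-agent --task "add input validation"
#   git diff HEAD           # review the agent's changes
#   git reset --hard HEAD   # discard them all if needed
```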
LLM Connectivity & Integration
Symptoms
- HTTP 429 "Too Many Requests" in agent logs
- Code completions stop working mid-session
- Chat responses return errors after heavy usage
Causes
- Multiple developers sharing a single API key or org quota
- Agent sending rapid-fire requests without backoff
- Token-per-minute (TPM) limits exceeded on large context windows
Solution
1. Check current rate limit status in your LLM provider dashboard
2. Switch to a higher-tier plan or request a rate limit increase
3. Use per-developer API keys so one user does not exhaust the org quota
4. Configure your agent/tool to use exponential backoff on 429 responses
5. Reduce context window size to lower token consumption per request
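For ad-hoc scripts, curl can provide the backoff itself: with `--retry`, it treats 408/429/5xx responses as transient, doubles the delay between attempts, and honors a `Retry-After` header. A sketch assuming an OpenAI-compatible endpoint in `LLM_ENDPOINT`:

```shell
# Retry transient failures (timeouts, HTTP 408/429/5xx) with exponential
# backoff. The endpoint and key variable names are assumptions.

llm_models() {
  curl -sf --retry 5 --retry-max-time 120 \
    -H "Authorization: Bearer ${OPENAI_API_KEY}" \
    "${LLM_ENDPOINT}/v1/models"
}

# llm_models | head
```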
Prevention
Provision per-team or per-developer API keys with individual rate limits. Use an LLM gateway or proxy (like LiteLLM or a custom gateway) to enforce quotas and route between providers. Monitor token usage per workspace to catch runaway consumption early.
Symptoms
- Completions fail with connection refused or timeout
- Agent works on cloud-hosted LLMs but not internal ones
- Self-hosted vLLM, Ollama, or TGI endpoints return 503
Causes
- Workspace in a different VPC or subnet than the inference server
- GPU node ran out of VRAM or the model failed to load
- Internal DNS not resolving the inference endpoint from workspace
Solution
1. Test connectivity: `curl -v http://llm-server:8000/health`
2. Verify DNS resolution: `nslookup llm-server`
3. Check if the inference server is running: `kubectl get pods -l app=vllm`
4. Verify GPU health on the inference node: `nvidia-smi`
5. Confirm workspace and inference server share network reachability (same VPC, peered VPC, or VPN)
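The checks above can be rolled into one function; host and port are placeholders for your inference server:

```shell
# DNS first, then HTTP health, with a hint when the endpoint answers 503
# (typically: model failed to load or the GPU is out of VRAM).

check_llm() {
  host="$1"; port="${2:-8000}"
  nslookup "$host" > /dev/null 2>&1 || { echo "DNS lookup failed for $host"; return 1; }
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${host}:${port}/health")
  case "$code" in
    200) echo "healthy" ;;
    503) echo "up but unhealthy (model not loaded / GPU issue?)"; return 1 ;;
    *)   echo "unreachable (status: $code)"; return 1 ;;
  esac
}

# check_llm llm-server 8000
```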
Prevention
Deploy inference servers in the same cluster or VPC as CDE workspaces. Use Kubernetes service discovery for stable endpoints. Set up GPU monitoring and auto-restart for inference pods. Configure fallback to a cloud-hosted LLM if the self-hosted endpoint goes down.
Symptoms
- Code completions take 5-10+ seconds to appear
- Chat responses stream very slowly
- Agent tasks take much longer than expected
Causes
- Oversized context windows sending too many tokens per request
- Inference server under heavy load from multiple users
- Workspace routing traffic through a high-latency proxy path
Solution
1. Measure raw latency: `time curl -s $LLM_ENDPOINT/v1/models`
2. Reduce context size: exclude large files, vendor directories, and binaries
3. Switch to a faster model for completions (use larger models only for complex tasks)
4. For self-hosted LLMs, scale up replicas or use a model with lower VRAM requirements
5. Check if your CDE provider offers edge-cached LLM endpoints
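To tell a slow network path from slow inference, curl's write-out timers break a request into phases (the `%{time_*}` fields are standard curl variables):

```shell
# Per-phase timing: a high ttfb with a low connect time points at the model
# server, not the network path.

measure() {
  curl -s -o /dev/null \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
    "$1"
}

# measure "$LLM_ENDPOINT/v1/models"
```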
Prevention
Use tiered model routing - fast, small models for inline completions and larger models for complex agent tasks. Deploy inference servers in the same region as workspaces. Monitor P99 latency on your LLM gateway and auto-scale inference capacity based on demand.
Connectivity Issues
Symptoms
- "Connection refused" error
- Timeout when connecting via SSH
- VS Code Remote shows "connecting..." forever
Causes
- Workspace stopped or failed to start
- Network/firewall blocking connection
- SSH agent not configured
Solution
1. Check workspace status in the CDE dashboard and restart if stopped
2. Run `coder config-ssh` to refresh the SSH config
3. Test with `ssh -v coder.workspace-name` for verbose output
4. Check if the corporate VPN is blocking port 22 or the CDE domain
Prevention
Configure workspace auto-start, add CDE domains to the VPN split tunnel, and set up SSH keepalive.
Symptoms
- Noticeable delay when typing
- Commands take seconds to execute
- IDE feels sluggish
Causes
- Workspace in distant region
- Network congestion or poor WiFi
- VPN routing all traffic
Solution
1. Run `ping workspace-url` to check latency (target: <50ms)
2. Request a workspace in a closer region
3. Configure VPN split-tunnel for CDE traffic
4. Switch to a wired connection if on WiFi
Prevention
Deploy CDEs in multiple regions, use edge networking, optimize SSH connection settings.
Symptoms
- "Connection reset by peer"
- IDE disconnects after idle time
- Terminal freezes then disconnects
Causes
- Firewall/NAT timeout killing idle connections
- Network instability
- Workspace auto-stop triggered
Solution
Add to `~/.ssh/config`:

    Host *
        ServerAliveInterval 60
        ServerAliveCountMax 3
        TCPKeepAlive yes
Prevention
Configure SSH keepalive globally, extend auto-stop timeout, use tmux/screen for persistent sessions.
Performance Problems
Symptoms
- Characters appear after delay
- Autocomplete is slow
- File operations lag
Causes
- High CPU usage by language server or AI agent
- Too many extensions or AI tools enabled simultaneously
- Large workspace with many files being indexed
Solution
1. Check CPU: `top` or `htop`
2. Disable unused VS Code extensions and AI tools you are not actively using
3. Add large folders to `.gitignore` and exclude them from search
4. Request a workspace with more CPU cores (4+ vCPUs for AI-heavy workflows)
Symptoms
- Builds slower than local machine
- npm install takes forever
- Docker builds very slow
Causes
- No build cache configured
- Slow network storage
- Insufficient CPU/memory allocation
Solution
1. Enable persistent build cache (node_modules, .gradle, target/)
2. Use prebuilds for dependency installation
3. Mount fast local SSD for build directories
4. Configure Docker BuildKit with cache mounts
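A BuildKit cache mount looks like this in a Dockerfile; the Node base image and npm cache path are illustrative, not required:

```dockerfile
# syntax=docker/dockerfile:1
FROM node:20
WORKDIR /app
COPY package*.json ./
# The npm download cache persists across builds without bloating the image layer.
RUN --mount=type=cache,target=/root/.npm \
    npm ci
COPY . .
```

Build with BuildKit enabled (the default in recent Docker; otherwise set `DOCKER_BUILDKIT=1`).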
Symptoms
- "JavaScript heap out of memory"
- Process killed by OOM killer
- Workspace becomes unresponsive
Causes
- Workspace memory limit too low for AI-augmented workflows
- Memory leak in application or AI agent process
- Too many processes running (IDE + language server + AI agent)
Solution
1. Check usage: `free -h`
2. For Node.js: `export NODE_OPTIONS="--max-old-space-size=4096"`
3. Request a workspace template with more RAM (8GB+ for AI workflows)
4. Kill unused background processes and idle AI agent sessions
IDE & Editor Issues
Solution
1. Update VS Code and the Remote SSH extension to the latest versions
2. Delete the server on the remote: `rm -rf ~/.vscode-server`
3. Check the SSH config: `coder config-ssh --yes`
4. Try connecting via terminal first to verify SSH works
Solution
1. Ensure the Coder plugin is installed in Gateway
2. Check that the Gateway version matches the IDE backend version
3. Clear the Gateway cache: Help > Delete Caches
4. Verify the workspace has enough RAM (8GB+ recommended for JetBrains)
Solution
1. Some extensions run locally, others remotely: check the extension docs
2. Reinstall the extension on the remote: Extensions > Install in SSH
3. Add required extensions to devcontainer.json for consistency
4. Check whether the extension requires tools not in the container (e.g., git, python, node)
5. For AI coding extensions (Copilot, Codeium, Continue), verify API key injection and network access from the workspace
Authentication Problems
Solution
1. Clear browser cookies for the CDE domain
2. Try incognito/private browsing mode
3. Verify you're using the correct identity provider
4. Contact an admin if your account is not provisioned via SCIM
Solution
1. Re-authenticate: `coder login https://cde.company.com`
2. Generate a new token in the dashboard: Settings > Tokens
3. Update the CLI: `coder update`
4. Check if your access was revoked by an admin
Solution
1. Check if the key is injected as an environment variable: `env | grep -i api_key`
2. Verify your workspace template includes the secret from your CDE secrets manager
3. For Coder: check workspace parameters and secrets in your template
4. For Ona: verify environment variables in your workspace configuration
5. Rotate the key if it may have been exposed or expired
Workspace Lifecycle
Solution
1. Check workspace logs in the dashboard for errors
2. Verify the resource quota is not exceeded
3. Check if the base image is accessible
4. Try recreating the workspace from the template
5. Contact the platform team if it is an infrastructure issue
Important
Only data in persistent volumes survives restarts. Ephemeral storage is lost.
Solution
1. Check which paths are persistent in your template
2. Commonly persistent: `/home/coder` and project directories
3. Always commit code to git regularly
4. Request a template with more persistent storage
5. AI agent conversation history may be lost: export or sync agent state to a persistent path
Solution
1. Check the auto-stop setting in the workspace config
2. Activity (SSH, web terminal) resets the idle timer
3. Request a longer TTL from the platform team if needed
4. Use scheduled jobs to maintain activity for long-running builds or agent tasks
5. For autonomous AI agents running overnight, ensure the workspace TTL exceeds the expected task duration
Database & Services
Solution
1. Check if the database is in the same VPC as the workspace
2. Verify the security group allows the workspace IP range
3. Use internal DNS names, not public endpoints
4. Check credentials in the secrets manager
5. Test with: `nc -zv db-host 5432`
Solution
1. Use Coder port forwarding: `coder port-forward workspace 3000:3000`
2. Or SSH: `ssh -L 3000:localhost:3000 coder.workspace`
3. Bind the app to 0.0.0.0, not just localhost
4. Check if your CDE has a built-in port forwarding UI
Understanding
In remote development, "localhost" in your code means the workspace, not your laptop. But your browser runs on your laptop.
Solution
1. Use port forwarding to access workspace ports from the browser
2. VS Code auto-forwards ports: check the Ports panel
3. For API calls between services in the workspace, localhost works
4. For browser access, use the forwarded port or CDE preview URL
Resource Issues
Solution
1. Find large files: `du -sh * | sort -h`
2. Clean Docker: `docker system prune -a`
3. Clear package caches (npm, pip, gradle)
4. Remove old log files and AI agent caches in `~/.cache/`
5. Request a workspace with a larger disk
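A helper for step 1 that ranks the usual cache locations by size before anything is deleted; the paths are common defaults, not guaranteed to exist:

```shell
# Rank likely disk hogs, biggest last. Inspect before deleting anything.
top_disk_users() {
  du -sh "$HOME"/.cache "$HOME"/.npm "$HOME"/* 2>/dev/null | sort -h | tail -n "${1:-10}"
}

# top_disk_users 15
# Then clean deliberately, e.g.:
#   docker system prune -a
#   npm cache clean --force
```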
Solution
1. Verify the workspace template includes a GPU
2. Check your GPU quota: you may need to request access
3. Run `nvidia-smi` to verify the driver
4. Ensure the NVIDIA container toolkit is installed
5. GPUs may be in short supply: try a different region or off-peak hours
6. For local LLM inference, verify the model fits in available VRAM
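A back-of-the-envelope check for step 6: fp16 weights take roughly 2 bytes per parameter, plus headroom for KV cache and activations. The 1.2x headroom factor is a rough assumption, not a guarantee:

```shell
# Does a model of N billion parameters roughly fit in the given free VRAM (MiB)?
fits_in_vram() {
  params_b="$1"    # model size in billions of parameters
  free_mib="$2"    # e.g. from: nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
  need_mib=$(( params_b * 2 * 1024 * 12 / 10 ))   # ~2 GiB per B params, x1.2 headroom
  [ "$free_mib" -ge "$need_mib" ]
}

# fits_in_vram 7 "$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits)" \
#   && echo "a 7B model should fit"
```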
Solution
1. Stop unused workspaces to free quota
2. Use smaller workspace templates when possible
3. Request a quota increase from the platform team
4. Check if your org/team has shared quota limits
5. Review whether autonomous AI agents are consuming quota by leaving workspaces running after task completion
Still Stuck?
Check these additional resources or contact your platform team.
