CDE Troubleshooting Guide
Common issues and solutions for cloud development environments
Quick Diagnostics
- `coder ping` - Test workspace connectivity
- `df -h` - Check disk space
- `free -m` - Check memory usage
- `nvidia-smi` - Check GPU status
- `curl -s $LLM_ENDPOINT/health` - Test LLM API connectivity
- `ps aux | grep agent` - Check running AI agents
- `cat /proc/meminfo | head` - Detailed memory breakdown
- `ss -tlnp` - List open ports and listeners
AI Agent Issues
Symptoms
- Agent fails with "API unreachable" or timeout errors
- Code completions and chat features stop working
- Agent logs show HTTP 401, 403, or 502 responses
Causes
- LLM API key expired, missing, or not injected into the workspace
- Corporate proxy or firewall blocking outbound HTTPS to the LLM provider
- Workspace network policy restricting egress to API endpoints
Solution
1. Verify the API key is set: `echo $OPENAI_API_KEY | head -c 8` (should show prefix)
2. Test connectivity: `curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY" | head`
3. Check proxy settings: `env | grep -i proxy`
4. For self-hosted LLMs, verify the inference server is running and reachable from the workspace subnet
5. Ask your platform team to allowlist LLM API domains in workspace network policies
Prevention
Inject LLM API keys via secrets manager (not hardcoded). Pre-configure proxy and egress rules in workspace templates. Use health-check scripts in workspace startup to fail fast if the LLM endpoint is unreachable.
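The startup health check mentioned above can be sketched as a small script. This is a minimal sketch assuming `LLM_ENDPOINT` is injected into the workspace and that the server exposes a `/health` route (as in the Quick Diagnostics list); adapt the path to your provider.

```shell
#!/usr/bin/env bash
# Fail fast at workspace startup if the LLM endpoint is unreachable.
# Assumes LLM_ENDPOINT is injected (e.g. by the workspace template) and
# that the server exposes a /health route.

llm_healthy() {
  # -f treats HTTP errors as failures; --max-time bounds each attempt
  curl -sf --max-time 5 --retry 3 "${LLM_ENDPOINT}/health" > /dev/null
}

if [ -n "${LLM_ENDPOINT:-}" ]; then
  if llm_healthy; then
    echo "LLM endpoint OK: ${LLM_ENDPOINT}"
  else
    echo "LLM endpoint unreachable: ${LLM_ENDPOINT}" >&2
    exit 1
  fi
fi
```

Run from the workspace startup script so a broken endpoint surfaces at boot rather than mid-session.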
Symptoms
- Workspace becomes sluggish when AI agent is running
- OOM kills triggered by agent processes
- CPU pegged at 100% during agent task execution
- Disk fills up with agent context caches or logs
Causes
- Agent running unbounded loops (recursive file edits, retries)
- Large context windows loading entire repos into memory
- Multiple agent sessions running concurrently
- Agent caching embeddings or conversation history to disk
Solution
1. Identify the process: `top -o %MEM` or `top -o %CPU`
2. Set resource limits: `ulimit -v 4194304` (4GB virtual memory cap)
3. Kill runaway agents: `pkill -f "agent-process-name"`
4. Clean agent caches: check `~/.cache/` and `/tmp/` for large agent artifacts
5. Request a workspace template with higher resource limits for AI workloads
Prevention
Use microVM or container-level resource limits (cgroups) for agent processes. Configure agent timeout and max-iteration settings. Use workspace templates sized for AI workloads (8GB+ RAM, 4+ vCPUs recommended). Set up monitoring alerts for agent resource spikes.
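When template-level cgroup limits are not available, a per-process cap can be approximated from the shell. A minimal sketch; the agent command below is a placeholder, not a real tool:

```shell
# Run a command under a virtual-memory cap inside a subshell, so a runaway
# agent is killed by failed allocations instead of taking the workspace down.
# cgroup or microVM limits at the template level remain the more robust option.

run_capped() {
  ( ulimit -v 4194304 && exec "$@" )   # 4 GB virtual memory cap
}

# Usage (placeholder agent command):
#   run_capped my-agent --task "refactor utils module"
```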
Symptoms
- Agent fails with "Permission denied" on file operations
- Agent cannot install packages or run build commands
- Agent blocked from network access it needs
Causes
- Sandbox restricts agent to read-only or scoped directory access
- Network egress policy blocks agent-initiated connections
- Workspace security context (seccomp, AppArmor) too restrictive
Solution
1. Check sandbox config for the agent tool you are using (Claude Code, Cursor, Copilot, etc.)
2. Verify the workspace user has write access: `ls -la /workspace/`
3. For Coder or Ona (formerly Gitpod) workspaces, check template security policies
4. Review the Kubernetes securityContext if running in a pod-based CDE
5. Ask the platform team to adjust sandbox permissions for legitimate agent workflows
Prevention
Define agent permission tiers in workspace templates - read-only for code review agents, read-write for coding agents, and network access for agents that need external APIs. Use microVM isolation for agents that require broader permissions.
Symptoms
- Agent editing the wrong files or deleting code
- Agent retrying the same failing operation repeatedly
- Agent spawning excessive subprocesses
- Git history shows dozens of rapid automated commits
Causes
- Vague or ambiguous task instructions given to agent
- Agent lacks sufficient context about the codebase
- No guardrails or iteration limits configured
- Agent tool permissions are too broad
Solution
1. Stop the agent immediately (Ctrl+C or kill the process)
2. Review changes: `git diff` and `git log --oneline -20`
3. Revert unwanted changes: `git checkout -- .` or `git reset HEAD~N`
4. Re-run with more specific task instructions and smaller scope
5. Enable agent confirmation mode (require approval before writes)
Prevention
Always commit or stash your work before running autonomous agents. Use agent tools with built-in confirmation and diff-review modes. Set max-iteration limits. Scope agents to specific directories rather than the entire repo. For production-critical code, require human review of agent-generated changes before merge.
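The commit-before-running advice can be wrapped in a small helper; the agent invocation shown is a placeholder:

```shell
# Snapshot the working tree before an autonomous agent touches it, so every
# agent edit shows up in `git diff` and can be reverted in one command.

checkpoint() {
  git add -A &&
  git commit -q --allow-empty -m "checkpoint: before agent run"
}

# Usage (placeholder agent command):
#   checkpoint && timeout 30m my-agent --task "add input validation"
#   git diff HEAD           # review the agent's changes
#   git reset --hard HEAD   # discard them all if needed
```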
LLM Connectivity & Integration
Symptoms
- HTTP 429 "Too Many Requests" in agent logs
- Code completions stop working mid-session
- Chat responses return errors after heavy usage
Causes
- Multiple developers sharing a single API key or org quota
- Agent sending rapid-fire requests without backoff
- Token-per-minute (TPM) limits exceeded on large context windows
Solution
1. Check current rate limit status in your LLM provider dashboard
2. Switch to a higher-tier plan or request a rate limit increase
3. Use per-developer API keys so one user does not exhaust the org quota
4. Configure your agent/tool to use exponential backoff on 429 responses
5. Reduce context window size to lower token consumption per request
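For ad-hoc scripts, curl can provide the backoff itself: with `--retry`, it treats 408/429/5xx responses as transient, doubles the delay between attempts, and honors a `Retry-After` header. A sketch assuming an OpenAI-compatible endpoint in `LLM_ENDPOINT`:

```shell
# Retry transient failures (timeouts, HTTP 408/429/5xx) with exponential
# backoff. The endpoint and key variable names are assumptions.

llm_models() {
  curl -sf --retry 5 --retry-max-time 120 \
    -H "Authorization: Bearer ${OPENAI_API_KEY}" \
    "${LLM_ENDPOINT}/v1/models"
}

# llm_models | head
```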
Prevention
Provision per-team or per-developer API keys with individual rate limits. Use an LLM gateway or proxy (like LiteLLM or a custom gateway) to enforce quotas and route between providers. Monitor token usage per workspace to catch runaway consumption early.
Symptoms
- Completions fail with connection refused or timeout
- Agent works on cloud-hosted LLMs but not internal ones
- Self-hosted vLLM, Ollama, or TGI endpoints return 503
Causes
- Workspace in a different VPC or subnet than the inference server
- GPU node ran out of VRAM or the model failed to load
- Internal DNS not resolving the inference endpoint from workspace
Solution
1. Test connectivity: `curl -v http://llm-server:8000/health`
2. Verify DNS resolution: `nslookup llm-server`
3. Check if the inference server is running: `kubectl get pods -l app=vllm`
4. Verify GPU health on the inference node: `nvidia-smi`
5. Confirm workspace and inference server share network reachability (same VPC, peered VPC, or VPN)
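The checks above can be rolled into one function; host and port are placeholders for your inference server:

```shell
# DNS first, then HTTP health, with a hint when the endpoint answers 503
# (typically: model failed to load or the GPU is out of VRAM).

check_llm() {
  host="$1"; port="${2:-8000}"
  nslookup "$host" > /dev/null 2>&1 || { echo "DNS lookup failed for $host"; return 1; }
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${host}:${port}/health")
  case "$code" in
    200) echo "healthy" ;;
    503) echo "up but unhealthy (model not loaded / GPU issue?)"; return 1 ;;
    *)   echo "unreachable (status: $code)"; return 1 ;;
  esac
}

# check_llm llm-server 8000
```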
Prevention
Deploy inference servers in the same cluster or VPC as CDE workspaces. Use Kubernetes service discovery for stable endpoints. Set up GPU monitoring and auto-restart for inference pods. Configure fallback to a cloud-hosted LLM if the self-hosted endpoint goes down.
Symptoms
- Code completions take 5-10+ seconds to appear
- Chat responses stream very slowly
- Agent tasks take much longer than expected
Causes
- Oversized context windows sending too many tokens per request
- Inference server under heavy load from multiple users
- Workspace routing traffic through a high-latency proxy path
Solution
1. Measure raw latency: `time curl -s $LLM_ENDPOINT/v1/models`
2. Reduce context size: exclude large files, vendor directories, and binaries
3. Switch to a faster model for completions (use larger models only for complex tasks)
4. For self-hosted LLMs, scale up replicas or use a model with lower VRAM requirements
5. Check if your CDE provider offers edge-cached LLM endpoints
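To tell a slow network path from slow inference, curl's write-out timers break a request into phases (the `%{time_*}` fields are standard curl variables):

```shell
# Per-phase timing: a high ttfb with a low connect time points at the model
# server, not the network path.

measure() {
  curl -s -o /dev/null \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
    "$1"
}

# measure "$LLM_ENDPOINT/v1/models"
```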
Prevention
Use tiered model routing - fast, small models for inline completions and larger models for complex agent tasks. Deploy inference servers in the same region as workspaces. Monitor P99 latency on your LLM gateway and auto-scale inference capacity based on demand.
Connectivity Issues
Symptoms
- "Connection refused" error
- Timeout when connecting via SSH
- VS Code Remote shows "connecting..." forever
Causes
- Workspace stopped or failed to start
- Network/firewall blocking connection
- SSH agent not configured
Solution
1. Check workspace status in the CDE dashboard and restart if stopped
2. Run `coder config-ssh` to refresh the SSH config
3. Test with `ssh -v coder.workspace-name` for verbose output
4. Check if the corporate VPN is blocking port 22 or the CDE domain
Prevention
Configure workspace auto-start, add CDE domains to the VPN split tunnel, and set up SSH keepalive.
Symptoms
- Noticeable delay when typing
- Commands take seconds to execute
- IDE feels sluggish
Causes
- Workspace in distant region
- Network congestion or poor WiFi
- VPN routing all traffic
Solution
1. Run `ping workspace-url` to check latency (target: <50ms)
2. Request a workspace in a closer region
3. Configure VPN split-tunnel for CDE traffic
4. Switch to a wired connection if on WiFi
Prevention
Deploy CDEs in multiple regions, use edge networking, optimize SSH connection settings.
Symptoms
- "Connection reset by peer"
- IDE disconnects after idle time
- Terminal freezes then disconnects
Causes
- Firewall/NAT timeout killing idle connections
- Network instability
- Workspace auto-stop triggered
Solution
Add to `~/.ssh/config`:

    Host *
        ServerAliveInterval 60
        ServerAliveCountMax 3
        TCPKeepAlive yes
Prevention
Configure SSH keepalive globally, extend auto-stop timeout, use tmux/screen for persistent sessions.
Performance Problems
Symptoms
- Characters appear after delay
- Autocomplete is slow
- File operations lag
Causes
- High CPU usage by language server or AI agent
- Too many extensions or AI tools enabled simultaneously
- Large workspace with many files being indexed
Solution
1. Check CPU: `top` or `htop`
2. Disable unused VS Code extensions and AI tools you are not actively using
3. Add large folders to `.gitignore` and exclude them from search
4. Request a workspace with more CPU cores (4+ vCPUs for AI-heavy workflows)
Symptoms
- Builds slower than local machine
- npm install takes forever
- Docker builds very slow
Causes
- No build cache configured
- Slow network storage
- Insufficient CPU/memory allocation
Solution
1. Enable persistent build cache (node_modules, .gradle, target/)
2. Use prebuilds for dependency installation
3. Mount fast local SSD for build directories
4. Configure Docker BuildKit with cache mounts
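A BuildKit cache mount looks like this in a Dockerfile; the Node base image and npm cache path are illustrative, not required:

```dockerfile
# syntax=docker/dockerfile:1
FROM node:20
WORKDIR /app
COPY package*.json ./
# The npm download cache persists across builds without bloating the image layer.
RUN --mount=type=cache,target=/root/.npm \
    npm ci
COPY . .
```

Build with BuildKit enabled (the default in recent Docker; otherwise set `DOCKER_BUILDKIT=1`).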
Symptoms
- "JavaScript heap out of memory"
- Process killed by OOM killer
- Workspace becomes unresponsive
Causes
- Workspace memory limit too low for AI-augmented workflows
- Memory leak in application or AI agent process
- Too many processes running (IDE + language server + AI agent)
Solution
1. Check usage: `free -h`
2. For Node.js: `export NODE_OPTIONS="--max-old-space-size=4096"`
3. Request a workspace template with more RAM (8GB+ for AI workflows)
4. Kill unused background processes and idle AI agent sessions
IDE & Editor Issues
Solution
1. Update VS Code and the Remote SSH extension to the latest versions
2. Delete the server on the remote: `rm -rf ~/.vscode-server`
3. Check the SSH config: `coder config-ssh --yes`
4. Try connecting via terminal first to verify SSH works
Solution
1. Ensure the Coder plugin is installed in Gateway
2. Check that the Gateway version matches the IDE backend version
3. Clear the Gateway cache: Help > Delete Caches
4. Verify the workspace has enough RAM (8GB+ recommended for JetBrains)
Solution
1. Some extensions run locally, others remotely: check the extension docs
2. Reinstall the extension on the remote: Extensions > Install in SSH
3. Add required extensions to devcontainer.json for consistency
4. Check whether the extension requires tools not in the container (e.g., git, python, node)
5. For AI coding extensions (Copilot, Codeium, Continue), verify API key injection and network access from the workspace
Authentication Problems
Solution
1. Clear browser cookies for the CDE domain
2. Try incognito/private browsing mode
3. Verify you're using the correct identity provider
4. Contact an admin if your account is not provisioned via SCIM
Solution
1. Re-authenticate: `coder login https://cde.company.com`
2. Generate a new token in the dashboard: Settings > Tokens
3. Update the CLI: `coder update`
4. Check if your access was revoked by an admin
Solution
1. Check if the key is injected as an environment variable: `env | grep -i api_key`
2. Verify your workspace template includes the secret from your CDE secrets manager
3. For Coder: check workspace parameters and secrets in your template
4. For Ona: verify environment variables in your workspace configuration
5. Rotate the key if it may have been exposed or expired
Workspace Lifecycle
Solution
1. Check workspace logs in the dashboard for errors
2. Verify the resource quota is not exceeded
3. Check if the base image is accessible
4. Try recreating the workspace from the template
5. Contact the platform team if it is an infrastructure issue
Important
Only data in persistent volumes survives restarts. Ephemeral storage is lost.
Solution
1. Check which paths are persistent in your template
2. Commonly persistent: `/home/coder` and project directories
3. Always commit code to git regularly
4. Request a template with more persistent storage
5. AI agent conversation history may be lost: export or sync agent state to a persistent path
Solution
1. Check the auto-stop setting in the workspace config
2. Activity (SSH, web terminal) resets the idle timer
3. Request a longer TTL from the platform team if needed
4. Use scheduled jobs to maintain activity for long-running builds or agent tasks
5. For autonomous AI agents running overnight, ensure the workspace TTL exceeds the expected task duration
Database & Services
Solution
1. Check if the database is in the same VPC as the workspace
2. Verify the security group allows the workspace IP range
3. Use internal DNS names, not public endpoints
4. Check credentials in the secrets manager
5. Test with: `nc -zv db-host 5432`
Solution
1. Use Coder port forwarding: `coder port-forward workspace 3000:3000`
2. Or SSH: `ssh -L 3000:localhost:3000 coder.workspace`
3. Bind the app to 0.0.0.0, not just localhost
4. Check if your CDE has a built-in port forwarding UI
Understanding
In remote development, "localhost" in your code means the workspace, not your laptop. But your browser runs on your laptop.
Solution
1. Use port forwarding to access workspace ports from the browser
2. VS Code auto-forwards ports: check the Ports panel
3. For API calls between services in the workspace, localhost works
4. For browser access, use the forwarded port or CDE preview URL
Resource Issues
Solution
1. Find large files: `du -sh * | sort -h`
2. Clean Docker: `docker system prune -a`
3. Clear package caches (npm, pip, gradle)
4. Remove old log files and AI agent caches in `~/.cache/`
5. Request a workspace with a larger disk
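A helper for step 1 that ranks the usual cache locations by size before anything is deleted; the paths are common defaults, not guaranteed to exist:

```shell
# Rank likely disk hogs, biggest last. Inspect before deleting anything.
top_disk_users() {
  du -sh "$HOME"/.cache "$HOME"/.npm "$HOME"/* 2>/dev/null | sort -h | tail -n "${1:-10}"
}

# top_disk_users 15
# Then clean deliberately, e.g.:
#   docker system prune -a
#   npm cache clean --force
```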
Solution
1. Verify the workspace template includes a GPU
2. Check your GPU quota: you may need to request access
3. Run `nvidia-smi` to verify the driver
4. Ensure the NVIDIA container toolkit is installed
5. GPUs may be in short supply: try a different region or off-peak hours
6. For local LLM inference, verify the model fits in available VRAM
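A back-of-the-envelope check for step 6: fp16 weights take roughly 2 bytes per parameter, plus headroom for KV cache and activations. The 1.2x headroom factor is a rough assumption, not a guarantee:

```shell
# Does a model of N billion parameters roughly fit in the given free VRAM (MiB)?
fits_in_vram() {
  params_b="$1"    # model size in billions of parameters
  free_mib="$2"    # e.g. from: nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
  need_mib=$(( params_b * 2 * 1024 * 12 / 10 ))   # ~2 GiB per B params, x1.2 headroom
  [ "$free_mib" -ge "$need_mib" ]
}

# fits_in_vram 7 "$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits)" \
#   && echo "a 7B model should fit"
```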
Solution
1. Stop unused workspaces to free quota
2. Use smaller workspace templates when possible
3. Request a quota increase from the platform team
4. Check if your org/team has shared quota limits
5. Review whether autonomous AI agents are consuming quota by leaving workspaces running after task completion
Still Stuck?
Check these additional resources or contact your platform team.
