Can I access GPUs through CDEs?

Yes, CDEs can provision GPU-enabled workspaces with NVIDIA A100, H100, V100, or consumer GPUs. This gives ML engineers access to enterprise-grade hardware without purchasing expensive local machines.

How do CDEs help with LLM fine-tuning?

CDEs provide on-demand access to high-memory GPU instances needed for LLM fine-tuning. Workspaces can be configured with PyTorch, Hugging Face Transformers, and sufficient VRAM for models like Llama, Mistral, or custom fine-tuning jobs.

Are CDEs cost-effective for AI/ML development?

Yes. CDEs with auto-stop charge only for active compute time. Instead of buying $10,000+ GPU workstations, teams pay hourly rates for cloud GPUs and spin resources down when they are not in use.

AI/ML Development with CDEs - GPU & LLM Fine-Tuning

Why CDEs Transform AI/ML Development

Cloud Development Environments solve the unique challenges of AI/ML workflows

Enterprise GPU Access Without Hardware Investment

Train models on NVIDIA A100s, V100s, or H100s without purchasing expensive hardware. Scale from single GPU development to multi-node distributed training on demand.

No Capital Investment

$15K-$50K GPU costs avoided

Latest Hardware

A100, H100, L4 instantly available

Elastic Scaling

1 to 16+ GPUs on demand

Pay Per Second

Stop paying when training completes

Reproducible ML Environments

Eliminate "CUDA version mismatch" and "works on my machine" issues. Every data scientist gets identical Python versions, CUDA drivers, and library dependencies.

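# DevContainer ensures consistency
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04

# The CUDA base image ships without Python, so install pip first
RUN apt-get update && apt-get install -y python3-pip && rm -rf /var/lib/apt/lists/*

# +cu121 wheels are hosted on the PyTorch index, not PyPI
RUN pip install torch==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121
RUN pip install transformers==4.35.0
# Everyone gets the same environment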
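One way to confirm that a running workspace really matches the pinned versions is to compare the installed packages against the pins at startup. The sketch below uses only the standard library. The `PINS` dictionary and the `check_pins` helper are illustrative names, not part of any CDE tooling.

```python
from importlib import metadata

# Hypothetical pins mirroring the Dockerfile above; adjust to your template.
PINS = {"torch": "2.1.0+cu121", "transformers": "4.35.0"}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for pkg, (want, got) in check_pins(PINS).items():
        print(f"{pkg}: expected {want}, found {got}")
```

An empty result means the environment matches the template. Anything else is a drift worth fixing before a training run.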

Seamless Team Collaboration on Models

Share running Jupyter notebooks, debug models together, and hand off training runs without "send me your environment.yml" headaches.

Shared Workspaces

Team access to experiments

Version Control

Git integration built-in

Secure Training Data Management

Training data never leaves your VPC. Medical images, customer data, and proprietary datasets stay in centralized, audited storage.

Data Locality

Datasets in VPC only

Audit Trails

Track data access

GPU Cost Optimization

Auto-stop expensive GPU instances when idle. Use spot instances for non-critical training. Pay only for actual compute time.

Always-On GPU Workstation $8,760/mo

CDE with Auto-Stop (8h/day) $2,920/mo

67% cost savings with TTL policies
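The arithmetic behind these figures can be reproduced in a few lines. This is a sketch under two assumptions that are not stated on the page: a 730-hour month (8,760 hours per year divided by 12) and an hourly rate of $12, back-solved from the $8,760/mo always-on figure.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def monthly_cost(hourly_rate, hours_per_day=24):
    """Monthly cost for an instance that runs `hours_per_day` hours every day."""
    return hourly_rate * HOURS_PER_MONTH * hours_per_day / 24


def savings_pct(always_on, with_auto_stop):
    """Percentage saved relative to the always-on cost."""
    return 100.0 * (1.0 - with_auto_stop / always_on)


rate = 12.0  # assumed $/hr, back-solved from the always-on figure
always_on = monthly_cost(rate)
auto_stop = monthly_cost(rate, hours_per_day=8)
print(f"always-on ${always_on:,.0f}/mo, auto-stop ${auto_stop:,.0f}/mo, "
      f"savings {savings_pct(always_on, auto_stop):.0f}%")
```

Because cost scales linearly with running hours, the savings percentage depends only on the fraction of the day the workspace is active, not on the hourly rate.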

Built-In Experiment Tracking

Integrate MLflow, Weights & Biases, or Neptune.ai into your CDE templates. Track hyperparameters, metrics, and artifacts automatically.

MLflow pre-configured

Model registry integration

GPU Configuration Options

Choose the right GPU for your workload - from experimentation to production training

NVIDIA A100

Large Model Training

VRAM: 40GB / 80GB

Tensor Cores: 432

Best For: LLM Training

Cost: $1.50-$3.00/hr

NVIDIA V100

General ML Training

VRAM: 16GB / 32GB

Tensor Cores: 640

Best For: CV, NLP

Cost: $0.80-$1.50/hr

NVIDIA T4

Development & Inference

VRAM: 16GB

Tensor Cores: 320

Best For: Development

Cost: $0.35-$0.60/hr

NVIDIA H100

Cutting-Edge LLMs

VRAM: 80GB

Tensor Cores: 528 (Gen 4)

Best For: GPT-4 Scale

Cost: $4.00-$8.00/hr

NVIDIA A10G

Balanced Performance

VRAM: 24GB

Tensor Cores: 192

Best For: Fine-Tuning

Cost: $0.60-$1.20/hr

AMD MI250X

ROCm Alternative

VRAM: 128GB HBM2e

Compute: 95.7 TFLOPS

Best For: ROCm Stacks

Cost: $2.00-$3.50/hr
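Choosing among these options often comes down to "the cheapest GPU with enough VRAM". The sketch below encodes the cards above as a small catalog. As a simplification, it pairs each GPU's largest listed VRAM variant with the upper end of its price range, and the names `Gpu`, `CATALOG`, and `cheapest_gpu` are illustrative.

```python
from typing import NamedTuple, Optional


class Gpu(NamedTuple):
    name: str
    vram_gb: int       # largest variant listed above
    max_hourly: float  # upper end of the listed price range, in $/hr


# Figures copied from the cards above; real prices vary by provider and region.
CATALOG = [
    Gpu("NVIDIA A100", 80, 3.00),
    Gpu("NVIDIA V100", 32, 1.50),
    Gpu("NVIDIA T4", 16, 0.60),
    Gpu("NVIDIA H100", 80, 8.00),
    Gpu("NVIDIA A10G", 24, 1.20),
    Gpu("AMD MI250X", 128, 3.50),
]


def cheapest_gpu(required_vram_gb: float) -> Optional[Gpu]:
    """Return the lowest-priced GPU with enough VRAM, or None if none fits."""
    fits = [g for g in CATALOG if g.vram_gb >= required_vram_gb]
    return min(fits, key=lambda g: g.max_hourly) if fits else None


for need in (12, 20, 64, 100):
    choice = cheapest_gpu(need)
    print(f"{need} GB -> {choice.name if choice else 'no single GPU fits'}")
```

A real selection would also weigh throughput, interconnect, and framework support (for example ROCm versus CUDA), which this price-only sketch ignores.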

GPU Allocation in Terraform Templates

# Coder Terraform Template for GPU Workspace
resource "coder_agent" "main" {
  arch = "amd64"
  os   = "linux"
}

resource "kubernetes_pod" "workspace" {
  metadata {
    name = "ml-workspace-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}"
  }

  spec {
    container {
      name  = "dev"
      image = "nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04"

      resources {
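        limits = {
          "nvidia.com/gpu" = "1"     # Single GPU (A10G on the g5.4xlarge node selected below)
          memory           = "48Gi"  # below the node's 64 GiB so the pod can be scheduled
          cpu              = "12"    # below the node's 16 vCPUs for the same reason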
        }
      }

      volume_mount {
        name       = "home"
        mount_path = "/home/coder"
      }
    }

    # Node selector for GPU instance type
    node_selector = {
      "node.kubernetes.io/instance-type" = "g5.4xlarge"  # AWS A10G
    }
  }
}

# Multi-GPU configuration
resource "kubernetes_pod" "distributed_training" {
  spec {
    container {
      resources {
        limits = {
          "nvidia.com/gpu" = "8"  # 8x A100 for distributed training
        }
      }
    }
  }
}

ML Framework Templates

Pre-configured environments for popular AI/ML frameworks

PyTorch Environment

Deep learning research and production

# DevContainer for PyTorch + CUDA
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

RUN pip install --no-cache-dir \
    torchvision==0.16.0 \
    torchaudio==2.1.0 \
    pytorch-lightning==2.1.0 \
    tensorboard==2.15.0

RUN pip install jupyter jupyterlab ipywidgets

# Install monitoring tools
RUN pip install wandb mlflow

# Expose only the first GPU; remove this line for multi-GPU workspaces
ENV CUDA_VISIBLE_DEVICES=0

PyTorch 2.1 CUDA 12.1 Lightning

TensorFlow / Keras

Production ML pipelines

# TensorFlow GPU Environment
FROM tensorflow/tensorflow:2.15.0-gpu

RUN pip install --no-cache-dir \
    keras==2.15.0 \
    tensorflow-datasets==4.9.0 \
    tf-agents==0.17.0

# MLOps tools
RUN pip install \
    tensorflow-serving-api \
    tensorboard-plugin-profile

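# Prints the GPUs TensorFlow can see; expect [] during `docker build`, because GPUs are attached only at run time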
RUN python -c "import tensorflow as tf; \
    print(tf.config.list_physical_devices('GPU'))"

TensorFlow 2.15 Keras TF Serving

Hugging Face Transformers

NLP and LLM development

# Transformers Development Environment
FROM huggingface/transformers-pytorch-gpu:4.35.0

RUN pip install --no-cache-dir \
    datasets==2.15.0 \
    evaluate==0.4.1 \
    accelerate==0.25.0 \
    peft==0.7.0 \
    bitsandbytes==0.41.0

# LLM fine-tuning tools
RUN pip install \
    trl==0.7.0 \
    wandb \
    jupyter

Transformers PEFT Accelerate

JAX / Flax

High-performance research

# JAX Environment with GPU support
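FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04

# The CUDA base image ships without Python, so install it (and a `python` alias) first
RUN apt-get update && apt-get install -y python3-pip python-is-python3 \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip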
RUN pip install --upgrade "jax[cuda12_pip]" \
    -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

RUN pip install \
    flax==0.7.5 \
    optax==0.1.7 \
    orbax==0.1.5

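# Lists the devices JAX can see; no GPU is attached during `docker build`, so repeat this check inside the running workspace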
RUN python -c "import jax; print(jax.devices())"

JAX Flax Optax

Jupyter Integration in CDEs

Run JupyterLab on cloud GPUs with seamless workspace integration

JupyterLab Server Configuration

# Coder startup script for Jupyter
resource "coder_agent" "main" {
  startup_script = <<-EOF
    #!/bin/bash
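    # Start JupyterLab with token auth. The workspace name is a guessable token,
    # so access control here really depends on the Coder proxy (share = "owner")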
    jupyter lab \
      --ip=0.0.0.0 \
      --port=8888 \
      --no-browser \
      --ServerApp.token='${data.coder_workspace.me.name}' \
      --ServerApp.allow_origin='*' \
      --ServerApp.root_dir='/home/coder/notebooks' &
  EOF
}

# Expose Jupyter via Coder proxy
resource "coder_app" "jupyter" {
  agent_id     = coder_agent.main.id
  display_name = "JupyterLab"
  slug         = "jupyter"
  url          = "http://localhost:8888"
  icon         = "/icon/jupyter.svg"
  share        = "owner"
  subdomain    = false
}

Jupyter Features in CDEs

GPU Kernel Access

torch.cuda.is_available() returns True

Custom Kernels

PyTorch, TensorFlow, JAX kernels pre-installed

Collaborative Notebooks

Share running notebooks with team members

Git Integration

nbstripout for clean commits

Extensions Supported

jupyterlab-git, jupyterlab-lsp, code formatter
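A first notebook cell can confirm what the kernel actually sees. The sketch below wraps the `torch.cuda` checks so that it reports cleanly even when PyTorch is missing from the kernel. The `gpu_report` helper is an illustrative name.

```python
import importlib.util


def gpu_report():
    """Describe GPU visibility from the current kernel without failing
    when PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    import torch

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return {
        "torch_installed": True,
        "cuda_available": available,
        "devices": [torch.cuda.get_device_name(i) for i in range(count)],
    }


print(gpu_report())
```

If `cuda_available` is false inside a GPU workspace, the usual suspects are a missing GPU resource limit on the pod or a CPU-only PyTorch wheel.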

Clean Git Commits with nbstripout

Prevent notebook output bloat in Git by stripping cell outputs before commits

# Install nbstripout in your CDE template
RUN pip install nbstripout

# Automatically clean notebooks on commit
RUN git config --global filter.nbstripout.clean 'nbstripout'
RUN git config --global filter.nbstripout.smudge cat
RUN git config --global filter.nbstripout.required true

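# Apply the filter to every repository through a global attributes file
# (a .gitattributes written at build time would not land in any repo)
RUN git config --global core.attributesfile ~/.gitattributes
RUN echo '*.ipynb filter=nbstripout' >> ~/.gitattributes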

# Result: Only code is committed, not 50MB of outputs
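To make "stripping cell outputs" concrete, here is a conceptual sketch of what such a filter does to a notebook's JSON. It is not nbstripout's implementation, which handles more cases, and the `strip_outputs` helper is an illustrative name.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print('hi')"],
            "execution_count": 7,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
    ],
}

stripped = strip_outputs(nb)
print(json.dumps(stripped["cells"][1], indent=1))
```

The source of every cell survives untouched, so diffs show only code and prose changes.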

Complete Data Science Workflows

Integrated tools for the full ML lifecycle

Data Versioning (DVC)

Track datasets like code

# DVC pre-configured in template
RUN pip install "dvc[s3]"

# Initialize DVC with S3 backend (run inside the project's Git repository;
# `dvc init` fails outside one unless --no-scm is passed)
$ dvc init
$ dvc remote add -d storage s3://ml-datasets

# Track large datasets
$ dvc add data/training_set.parquet
$ git add data/training_set.parquet.dvc data/.gitignore
$ dvc push

# Team pulls same data version
$ dvc pull

Benefit: No 10GB datasets in Git. Everyone gets the right data version.
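The idea that makes this work is content addressing: Git stores a small pointer file containing a hash of the data, while the data itself lives in remote storage. The sketch below shows that fingerprinting step with a streaming MD5. It is a simplified illustration rather than DVC's own code, and the file name is a stand-in.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large datasets never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# A throwaway file stands in for data/training_set.parquet.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_set.bin")

    with open(path, "wb") as fh:
        fh.write(b"version 1 of the dataset")
    v1 = file_md5(path)

    with open(path, "wb") as fh:
        fh.write(b"version 2 of the dataset")
    v2 = file_md5(path)

print(v1 != v2)  # any change to the bytes yields a different fingerprint
```

Because the fingerprint changes whenever the bytes change, checking out an older commit and pulling retrieves exactly the data that commit was trained on.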

Experiment Tracking

MLflow, W&B, Neptune.ai

# MLflow tracking in workspace
import mlflow

mlflow.set_tracking_uri("http://mlflow.company.com")
mlflow.set_experiment("llm-fine-tuning")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 16)

    # Train model (train_model is your own function returning a torch.nn.Module)
    model = train_model()

    mlflow.log_metric("val_loss", 0.23)
    mlflow.pytorch.log_model(model, "model")

MLflow

W&B

Neptune

Model Registry

Production model management

Version Tagging

Staging, Production, Archived

Model Comparison

Compare metrics across versions

Rollback Support

Instantly revert to previous model
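The stage and rollback behavior listed above can be pictured with a toy in-memory registry. This is a hypothetical sketch to illustrate the workflow, not the API of MLflow or any other registry. The class and method names are invented for the example.

```python
class ModelRegistry:
    """Toy in-memory registry that keeps one Production version at a time."""

    def __init__(self):
        self.stages = {}   # version -> "Staging" | "Production" | "Archived"
        self.history = []  # versions promoted to Production, oldest first

    def register(self, version):
        self.stages[version] = "Staging"

    def production(self):
        for version, stage in self.stages.items():
            if stage == "Production":
                return version
        return None

    def promote(self, version):
        if version not in self.stages:
            raise KeyError(f"unknown version: {version}")
        current = self.production()
        if current is not None:
            self.stages[current] = "Archived"
        self.stages[version] = "Production"
        self.history.append(version)

    def rollback(self):
        """Re-promote the version that was in Production before the current one."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier Production version to roll back to")
        self.history.pop()             # drop the current version
        previous = self.history.pop()  # promote() appends it again
        self.promote(previous)
        return previous


registry = ModelRegistry()
registry.register("v1")
registry.register("v2")
registry.promote("v1")
registry.promote("v2")
registry.rollback()
print(registry.production(), registry.stages)
```

A production registry adds persistence, access control, and links back to the training run that produced each version.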

Pipeline Orchestration

Automated ML workflows

# Kubeflow Pipelines in CDE (KFP v1 SDK: dsl.ContainerOp was removed in KFP v2, so pin kfp<2)
from kfp import dsl, compiler

@dsl.pipeline(name='Training Pipeline')
def ml_pipeline():
    preprocess = dsl.ContainerOp(
        name='preprocess',
        image='company/preprocess:v1'
    )

    train = dsl.ContainerOp(
        name='train-model',
        image='company/train:v1'
    ).after(preprocess).set_gpu_limit(1)

compiler.Compiler().compile(ml_pipeline, 'pipeline.yaml')

Large Language Model Development

Build, fine-tune, and deploy LLMs in cloud environments

LLM Fine-Tuning Environment

# DevContainer for LLM Fine-Tuning
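FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

# The CUDA base image ships without Python, so install pip first
RUN apt-get update && apt-get install -y python3-pip && rm -rf /var/lib/apt/lists/*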

# Install PyTorch with CUDA support
RUN pip install torch==2.1.0+cu121 --index-url \
    https://download.pytorch.org/whl/cu121

# Fine-tuning libraries
RUN pip install \
    transformers==4.35.0 \
    accelerate==0.25.0 \
    peft==0.7.0 \
    bitsandbytes==0.41.0 \
    trl==0.7.0

# Quantization for memory efficiency
RUN pip install \
    auto-gptq \
    optimum

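# Multi-GPU training: log NCCL setup details
ENV NCCL_DEBUG=INFO
# CUDA_LAUNCH_BLOCKING=1 serializes kernel launches and slows training;
# set it only while debugging CUDA errors, not in the image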

PEFT / LoRA

Memory-efficient fine-tuning

QLoRA

4-bit quantization + LoRA
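Two quick calculations show why these techniques matter for GPU sizing. The sketch below estimates memory for the model weights alone (activations, gradients, optimizer state, and quantization overhead are excluded) and counts the trainable parameters LoRA adds for one weight matrix. The 7B parameter count and the 4096 hidden size are assumed example values.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_in, d_out, rank):
    """LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full
    d_out x d_in update, so it trains rank * (d_in + d_out) parameters."""
    return rank * (d_in + d_out)


params_7b = 7e9
print(weight_memory_gb(params_7b, 16))  # 14.0 GB at 16 bits per weight
print(weight_memory_gb(params_7b, 4))   # 3.5 GB at 4 bits per weight

full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(lora, f"{100 * lora / full:.2f}% of the full matrix")
```

Quantizing the frozen base weights shrinks the first number, and LoRA shrinks the number of parameters that need gradients and optimizer state, which is why the combination fits on smaller GPUs.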

Distributed Training Setup

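# Multi-GPU / multi-node training; start the script with `accelerate launch`
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# No device_map="auto" here: it shards one model across GPUs and
# conflicts with the data-parallel wrapping done by prepare()
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# train_loader: your own DataLoader of tokenized batches that include labels

# Wrap model, optimizer, dataloader
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

# Accelerate handles multi-GPU, mixed precision
model.train()
for batch in train_loader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()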

Strategy: DDP, FSDP, DeepSpeed

Nodes: 1-64 GPU instances
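When scaling from one GPU to many under data parallelism, the effective batch size grows with the GPU count, which usually means revisiting the learning rate and step count. A small sketch of that bookkeeping follows. The function names and the example dataset size are illustrative.

```python
def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps=1):
    """Samples contributing to each optimizer step under data parallelism."""
    return per_device_batch * num_gpus * grad_accum_steps


def steps_per_epoch(dataset_size, per_device_batch, num_gpus, grad_accum_steps=1):
    """Optimizer steps per epoch, rounding up for the final partial batch."""
    global_batch = effective_batch_size(per_device_batch, num_gpus, grad_accum_steps)
    return -(-dataset_size // global_batch)  # ceiling division


print(effective_batch_size(16, 8))         # 128
print(effective_batch_size(16, 8, 4))      # 512
print(steps_per_epoch(100_000, 16, 8, 4))  # 196
```

Keeping the effective batch size constant while changing the GPU count (by adjusting gradient accumulation) is a common way to make runs comparable across workspace sizes.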

RAG Development Environment

# RAG Stack in DevContainer
RUN pip install \
    langchain==0.1.0 \
    chromadb==0.4.18 \
    sentence-transformers==2.2.2 \
    openai \
    tiktoken

# Vector store setup
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

vectorstore = Chroma(
    persist_directory="/workspace/vectors",
    embedding_function=embeddings
)

LangChain ChromaDB Embeddings
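At query time, a vector store ranks stored documents by how close their embeddings are to the query embedding. The brute-force sketch below shows only that idea, using cosine similarity over hand-made 3-dimensional vectors. Real vector stores use learned embeddings with hundreds of dimensions and indexed search, and the helper names here are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, documents, k=2):
    """documents: list of (text, vector). Returns the k most similar texts."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hand-made vectors stand in for real embeddings.
docs = [
    ("GPU pricing guide", [0.9, 0.1, 0.0]),
    ("Notebook tips", [0.1, 0.9, 0.1]),
    ("Spot instance FAQ", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], docs))
```

In a RAG pipeline, the texts returned by this step are inserted into the LLM prompt as context.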

Model Serving from CDEs

vLLM Deployment

High-throughput inference server

TensorRT-LLM

NVIDIA optimized inference

Text Generation Inference

Hugging Face production server

GPU Cost Management Strategies

Control AI/ML infrastructure costs without sacrificing productivity

Auto-Stop Policies

Shut down GPU instances after 30 minutes of inactivity. A100s cost $2.50/hr - idle time adds up fast.

Save 60-70% on GPU costs

Spot Instances

Use preemptible GPUs for non-critical experiments. Same A100 for $0.75/hr instead of $2.50/hr.

70% cheaper than on-demand

Budget Quotas

Set monthly GPU budgets per team or user. Prevent surprise $10K bills from forgotten training runs.

Predictable monthly spend

Cost Alerts

Get notified when GPU usage exceeds thresholds. Catch runaway costs before they spiral.

Real-time monitoring
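The auto-stop and budget rules above reduce to two small checks. The sketch below is a minimal illustration, not any platform's policy engine. The 30-minute timeout and the $2.50/hr rate come from the text, while the 80% alert threshold, the $1,000 budget, and the function names are assumptions.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=30)  # from the auto-stop policy above
ALERT_THRESHOLD = 0.8                 # assumption: alert at 80% of budget


def should_stop(last_activity, now, idle_timeout=IDLE_TIMEOUT):
    """True once the workspace has been idle for at least the timeout."""
    return now - last_activity >= idle_timeout


def budget_status(gpu_hours, hourly_rate, monthly_budget, threshold=ALERT_THRESHOLD):
    """Classify month-to-date spend as 'ok', 'alert', or 'over_budget'."""
    spend = gpu_hours * hourly_rate
    if spend >= monthly_budget:
        return "over_budget"
    if spend >= threshold * monthly_budget:
        return "alert"
    return "ok"


now = datetime(2024, 1, 15, 12, 0)
print(should_stop(now - timedelta(minutes=45), now))  # True
print(should_stop(now - timedelta(minutes=10), now))  # False
print(budget_status(gpu_hours=350, hourly_rate=2.50, monthly_budget=1000))  # alert
```

What counts as "activity" (an open terminal, a running kernel, GPU utilization) is the harder design question, since a long training job can look idle to a check that only watches the editor.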

GPU Cost Comparison

NVIDIA A100 80GB

Large model training

On-Demand: $3.67/hr

Spot: $1.10/hr

Monthly (spot, 8h/day for 30 days): $264/mo

NVIDIA V100 16GB

General ML workloads

On-Demand: $1.46/hr

Spot: $0.44/hr

Monthly (spot, 8h/day for 30 days): $106/mo

NVIDIA T4 16GB

Development & inference

On-Demand: $0.526/hr

Spot: $0.16/hr

Monthly (spot, 8h/day for 30 days): $38/mo

Pro Tip: Use T4s for development and A100 spot instances for training. At the rates above, that costs roughly 70-85% less than running on-demand A100s.
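The monthly figures in this comparison correspond to the spot rate at 8 hours per day over a 30-day month. The sketch below reproduces them from the listed hourly prices and also prints the on-demand equivalent and the spot discount. The helper name `monthly` is illustrative.

```python
HOURS_PER_DAY = 8
DAYS_PER_MONTH = 30

# (on-demand $/hr, spot $/hr) as listed above
PRICES = {
    "A100 80GB": (3.67, 1.10),
    "V100 16GB": (1.46, 0.44),
    "T4 16GB": (0.526, 0.16),
}


def monthly(hourly_rate):
    """Monthly cost at 8 hours per day for 30 days."""
    return hourly_rate * HOURS_PER_DAY * DAYS_PER_MONTH


for gpu, (on_demand, spot) in PRICES.items():
    discount = 100 * (1 - spot / on_demand)
    print(f"{gpu}: spot ${monthly(spot):.0f}/mo, "
          f"on-demand ${monthly(on_demand):.0f}/mo, "
          f"spot discount {discount:.0f}%")
```

Swapping in your provider's current prices gives a quick estimate for other usage patterns.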

Complete Example Configurations

Production-ready templates you can deploy today

# .devcontainer/devcontainer.json
{
  "name": "PyTorch ML Development",
  "image": "nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04",
  "runArgs": ["--gpus", "all"],

  "features": {
    "ghcr.io/devcontainers/features/python:1": {
      "version": "3.11"
    },
    "ghcr.io/devcontainers/features/git:1": {},
    "ghcr.io/devcontainers/features/github-cli:1": {}
  },

  "postCreateCommand": "bash .devcontainer/setup.sh",

  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "ms-azuretools.vscode-docker"
      ]
    }
  },

  "forwardPorts": [8888, 6006],
  "portsAttributes": {
    "8888": {"label": "JupyterLab"},
    "6006": {"label": "TensorBoard"}
  }
}

# .devcontainer/setup.sh
#!/bin/bash
set -e

# Install PyTorch with CUDA
pip install --no-cache-dir \
    torch==2.1.0+cu121 \
    torchvision==0.16.0+cu121 \
    torchaudio==2.1.0+cu121 \
    --index-url https://download.pytorch.org/whl/cu121

# ML Libraries
pip install --no-cache-dir \
    pytorch-lightning==2.1.0 \
    transformers==4.35.0 \
    datasets==2.15.0 \
    accelerate==0.25.0 \
    peft==0.7.0

# Data Science Tools
pip install --no-cache-dir \
    jupyter \
    jupyterlab \
    ipywidgets \
    matplotlib \
    seaborn \
    pandas \
    scikit-learn

# MLOps
pip install --no-cache-dir \
    mlflow \
    wandb \
    dvc[s3]

# Git configuration for notebooks
pip install nbstripout
nbstripout --install --attributes .gitattributes

# Verify GPU
python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'GPU Count: {torch.cuda.device_count()}')"

echo "Setup complete! GPU-accelerated PyTorch environment ready."

# Dockerfile for CPU-based data science
FROM python:3.11-slim

# System dependencies
RUN apt-get update && apt-get install -y \
    git \
    curl \
    vim \
    && rm -rf /var/lib/apt/lists/*

# Core data science stack
RUN pip install --no-cache-dir \
    jupyter \
    pandas==2.1.0 \
    numpy==1.24.0 \
    scikit-learn==1.3.0 \
    matplotlib==3.7.0 \
    seaborn==0.12.0 \
    plotly==5.17.0

# ML utilities
RUN pip install --no-cache-dir \
    xgboost==2.0.0 \
    lightgbm==4.1.0 \
    optuna==3.4.0

# Data versioning
RUN pip install dvc[s3]

WORKDIR /workspace

# Start Jupyter by default
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser"]

# Coder Terraform Template for LLM Fine-Tuning
resource "kubernetes_pod" "llm_workspace" {
  metadata {
    name = "llm-${data.coder_workspace.me.name}"
  }

  spec {
    container {
      name  = "llm-dev"
      image = "huggingface/transformers-pytorch-gpu:4.35.0"

      command = ["bash", "-c", <<-EOF
        pip install accelerate peft bitsandbytes trl
        pip install jupyterlab wandb

        # Start Jupyter
        jupyter lab --ip=0.0.0.0 --port=8888 --no-browser &

        # Keep container running
        tail -f /dev/null
      EOF
      ]

      resources {
        limits = {
          "nvidia.com/gpu" = "4"  # 4x A100 GPUs
          memory           = "256Gi"
          cpu              = "64"
        }
        requests = {
          "nvidia.com/gpu" = "4"
          memory           = "128Gi"
          cpu              = "32"
        }
      }

      env {
        name  = "NCCL_DEBUG"
        value = "INFO"
      }

      env {
        name  = "TRANSFORMERS_CACHE"
        value = "/workspace/.cache/huggingface"
      }

      volume_mount {
        name       = "workspace"
        mount_path = "/workspace"
      }

      volume_mount {
        name       = "shm"
        mount_path = "/dev/shm"
      }
    }

    # Shared memory for multi-GPU
    volume {
      name = "shm"
      empty_dir {
        medium     = "Memory"
        size_limit = "64Gi"
      }
    }

    # High-memory GPU node
    node_selector = {
      "node.kubernetes.io/instance-type" = "p4d.24xlarge"  # 8x A100
    }

    # Tolerate the GPU node taint so the pod can be scheduled on GPU nodes
    toleration {
      key      = "nvidia.com/gpu"
      operator = "Exists"
      effect   = "NoSchedule"
    }
  }
}

# slug is a required, URL-safe identifier for each app
resource "coder_app" "jupyter" {
  agent_id     = coder_agent.main.id
  slug         = "jupyter"
  display_name = "JupyterLab"
  url          = "http://localhost:8888"
  icon         = "https://jupyter.org/favicon.ico"
}

resource "coder_app" "wandb" {
  agent_id     = coder_agent.main.id
  slug         = "wandb"
  display_name = "Weights & Biases"
  url          = "https://wandb.ai"
  external     = true
  icon         = "https://wandb.ai/favicon.ico"
}

Ready to Scale Your AI/ML Workflows?
