AI/ML Development in Cloud Environments
Access enterprise GPUs, train large language models, and build AI applications without local hardware constraints
From Jupyter notebooks to distributed training - run cutting-edge AI/ML workloads in reproducible cloud environments
Why CDEs Transform AI/ML Development
Cloud Development Environments solve the unique challenges of AI/ML workflows
Enterprise GPU Access Without Hardware Investment
Train models on NVIDIA A100s, V100s, or H100s without purchasing expensive hardware. Scale from single GPU development to multi-node distributed training on demand.
Reproducible ML Environments
Eliminate "CUDA version mismatch" and "works on my machine" issues. Every data scientist gets identical Python versions, CUDA drivers, and library dependencies.
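# DevContainer ensures consistency
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04
# The CUDA base image ships without Python, so install it first
RUN apt-get update && apt-get install -y python3-pip
# +cu121 wheels are served from the PyTorch index, not PyPI
RUN pip install torch==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121
RUN pip install transformers==4.35.0
# Everyone gets the same environment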
Seamless Team Collaboration on Models
Share running Jupyter notebooks, debug models together, and hand off training runs without "send me your environment.yml" headaches.
Secure Training Data Management
Training data never leaves your VPC. Medical images, customer data, and proprietary datasets stay in centralized, audited storage.
GPU Cost Optimization
Auto-stop expensive GPU instances when idle. Use spot instances for non-critical training. Pay only for actual compute time.
Built-In Experiment Tracking
Integrate MLflow, Weights & Biases, or Neptune.ai into your CDE templates. Track hyperparameters, metrics, and artifacts automatically.
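Pinned versions only help if the running workspace actually matches them. As a minimal sketch (the `check_pins` helper and the `PINS` table are hypothetical names, not part of any CDE product), a workspace startup check could compare installed package versions against the template's pins using Python's standard `importlib.metadata`:

```python
from importlib import metadata

# Hypothetical pin table mirroring the versions pinned in the DevContainer above.
PINS = {"torch": "2.1.0+cu121", "transformers": "4.35.0"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that deviates."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means the workspace matches the pins; anything else is a drift worth fixing in the template rather than in the individual workspace.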
GPU Configuration Options
Choose the right GPU for your workload - from experimentation to production training
NVIDIA A100
Large Model Training
NVIDIA V100
General ML Training
NVIDIA T4
Development & Inference
NVIDIA H100
Cutting-Edge LLMs
NVIDIA A10G
Balanced Performance
AMD MI250X
ROCm Alternative
GPU Allocation in Terraform Templates
# Coder Terraform Template for GPU Workspace
resource "coder_agent" "main" {
  arch = "amd64"
  os   = "linux"
}
resource "kubernetes_pod" "workspace" {
  metadata {
    name = "ml-workspace-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}"
  }
  spec {
    container {
      name  = "dev"
      image = "nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04"
      resources {
        limits = {
          "nvidia.com/gpu" = "1" # Single GPU (an A10G on the node type selected below)
          memory           = "64Gi"
          cpu              = "16"
        }
      }
      # Assumes a matching "home" volume block is defined in this spec (omitted here)
      volume_mount {
        name       = "home"
        mount_path = "/home/coder"
      }
    }
    # Node selector for GPU instance type
    node_selector = {
      "node.kubernetes.io/instance-type" = "g5.4xlarge" # AWS A10G
    }
  }
}
# Multi-GPU configuration (fragment: metadata, image, and node selector omitted)
resource "kubernetes_pod" "distributed_training" {
  spec {
    container {
      resources {
        limits = {
          "nvidia.com/gpu" = "8" # 8x A100 for distributed training
        }
      }
    }
  }
}
ML Framework Templates
Pre-configured environments for popular AI/ML frameworks
PyTorch Environment
Deep learning research and production
# DevContainer for PyTorch + CUDA
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
RUN pip install --no-cache-dir \
    torchvision==0.16.0 \
    torchaudio==2.1.0 \
    pytorch-lightning==2.1.0 \
    tensorboard==2.15.0
RUN pip install jupyter jupyterlab ipywidgets
# Install monitoring tools
RUN pip install wandb mlflow
# Expose only the first GPU; remove this line for multi-GPU workspaces
ENV CUDA_VISIBLE_DEVICES=0
TensorFlow / Keras
Production ML pipelines
# TensorFlow GPU Environment
FROM tensorflow/tensorflow:2.15.0-gpu
RUN pip install --no-cache-dir \
    keras==2.15.0 \
    tensorflow-datasets==4.9.0 \
    tf-agents==0.17.0
# MLOps tools
RUN pip install \
    tensorflow-serving-api \
    tensorboard-plugin-profile
# Prints the GPU list; it is empty during docker build (no GPU attached),
# so rerun this check inside the running workspace
RUN python -c "import tensorflow as tf; \
    print(tf.config.list_physical_devices('GPU'))"
Hugging Face Transformers
NLP and LLM development
# Transformers Development Environment
FROM huggingface/transformers-pytorch-gpu:4.35.0
RUN pip install --no-cache-dir \
    datasets==2.15.0 \
    evaluate==0.4.1 \
    accelerate==0.25.0 \
    peft==0.7.0 \
    bitsandbytes==0.41.0
# LLM fine-tuning tools
RUN pip install \
    trl==0.7.0 \
    wandb \
    jupyter
JAX / Flax
High-performance research
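# JAX Environment with GPU support
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04
# The CUDA base image ships without Python, so install it first
RUN apt-get update && apt-get install -y python3-pip python-is-python3
RUN pip install --upgrade pip
RUN pip install --upgrade "jax[cuda12_pip]" \
    -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
RUN pip install \
    flax==0.7.5 \
    optax==0.1.7 \
    orbax==0.1.5
# Prints the devices JAX sees; no GPU is attached during docker build,
# so rerun this check inside the running workspace
RUN python -c "import jax; print(jax.devices())"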
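The templates above check GPU visibility through a specific framework. A framework-independent check can be useful when debugging a workspace that has not installed one yet. The sketch below is an assumption about how such a check might look: it shells out to `nvidia-smi` (using its `--query-gpu` and `--format` options) and returns an empty list when the tool is missing. The function names are illustrative.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text):
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append({"name": name, "memory": memory})
    return gpus


def list_gpus():
    """Return visible NVIDIA GPUs, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_list(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    print(list_gpus() or "No NVIDIA GPU visible")
```

Running this from the workspace terminal, rather than during the image build, reflects what the scheduler actually attached to the pod.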
Jupyter Integration in CDEs
Run JupyterLab on cloud GPUs with seamless workspace integration
JupyterLab Server Configuration
# Coder startup script for Jupyter
resource "coder_agent" "main" {
  startup_script = <<-EOF
    #!/bin/bash
    # Start JupyterLab with token auth
    jupyter lab \
      --ip=0.0.0.0 \
      --port=8888 \
      --no-browser \
      --ServerApp.token='${data.coder_workspace.me.name}' \
      --ServerApp.allow_origin='*' \
      --ServerApp.root_dir='/home/coder/notebooks' &
  EOF
}
# Expose Jupyter via Coder proxy
resource "coder_app" "jupyter" {
  agent_id     = coder_agent.main.id
  display_name = "JupyterLab"
  slug         = "jupyter"
  url          = "http://localhost:8888"
  icon         = "/icon/jupyter.svg"
  share        = "owner"
  subdomain    = false
}
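The startup script above uses the workspace name as the Jupyter token, which is easy to guess. One possible alternative, sketched here under the assumption that the startup script can read a token from a file, is to generate a random token with Python's `secrets` module and store it with owner-only permissions. The `write_token` helper and the file location are hypothetical.

```python
import os
import secrets
import tempfile
from pathlib import Path


def write_token(path):
    """Create a random URL-safe token and store it readable by the owner only."""
    token = secrets.token_urlsafe(32)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # os.open lets us request mode 0o600 when the file is created
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token)
    return token


if __name__ == "__main__":
    token_path = Path(tempfile.gettempdir()) / "jupyter_token"  # hypothetical location
    token = write_token(token_path)
    print(f"token of length {len(token)} stored in {token_path}")
```

The startup script could then pass the file's contents to `--ServerApp.token` instead of the workspace name.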
Jupyter Features in CDEs
Clean Git Commits with nbstripout
Prevent notebook output bloat in Git by stripping cell outputs before commits
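# Install nbstripout in your CDE template
RUN pip install nbstripout
# Automatically clean notebooks on commit
RUN git config --global filter.nbstripout.clean 'nbstripout'
RUN git config --global filter.nbstripout.smudge cat
RUN git config --global filter.nbstripout.required true
# Apply the filter to every repository through a global attributes file
# (a bare .gitattributes written at build time would not land in your repos)
RUN echo '*.ipynb filter=nbstripout' >> /etc/gitattributes-global \
    && git config --global core.attributesfile /etc/gitattributes-global
# Result: Only code is committed, not 50MB of outputs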
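To make the effect of the filter concrete, here is a simplified illustration of what "stripping outputs" means for the notebook JSON. This is not nbstripout's actual implementation, only a sketch of the idea: code cells keep their source while `outputs` and `execution_count` are reset.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


if __name__ == "__main__":
    # Tiny hand-written notebook used only for demonstration.
    demo = {
        "cells": [
            {
                "cell_type": "code",
                "source": ["print('hi')"],
                "outputs": [{"output_type": "stream", "text": ["hi\n"]}],
                "execution_count": 3,
            },
            {"cell_type": "markdown", "source": ["# Notes"]},
        ],
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    print(json.dumps(strip_outputs(demo), indent=1))
```

Because the committed file contains only source cells, notebook diffs stay small and reviewable.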
Complete Data Science Workflows
Integrated tools for the full ML lifecycle
Data Versioning (DVC)
Track datasets like code
# DVC pre-configured in template
RUN pip install "dvc[s3]"
# Initialize DVC with S3 backend (run once inside your Git repository, not at image build)
$ dvc init
$ dvc remote add -d storage s3://ml-datasets
# Track large datasets
$ dvc add data/training_set.parquet
$ git add data/training_set.parquet.dvc data/.gitignore
$ dvc push
# Team pulls same data version
$ dvc pull
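The small `.dvc` file committed to Git identifies a dataset version by a content hash (an MD5 digest for individual files), which is why two teammates on the same commit pull the same bytes. The sketch below shows only that idea, using Python's `hashlib`; the `file_md5` helper is illustrative and is not DVC's own code.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        data_path = os.path.join(workdir, "training_set.bin")  # stand-in dataset
        with open(data_path, "wb") as handle:
            handle.write(b"row-1\nrow-2\n")
        before = file_md5(data_path)
        with open(data_path, "ab") as handle:
            handle.write(b"row-3\n")
        after = file_md5(data_path)
        print("hash changed after edit:", before != after)
```

Any change to the data produces a different digest, so a changed dataset shows up in Git as a one-line diff in the pointer file.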
Experiment Tracking
MLflow, W&B, Neptune.ai
# MLflow tracking in workspace
import mlflow
mlflow.set_tracking_uri("http://mlflow.company.com")
mlflow.set_experiment("llm-fine-tuning")
with mlflow.start_run():
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 16)
    # Train model (train_model is your own function returning a PyTorch model)
    model = train_model()
    mlflow.log_metric("val_loss", 0.23)
    mlflow.pytorch.log_model(model, "model")
Model Registry
Production model management
Pipeline Orchestration
Automated ML workflows
# Kubeflow Pipelines in CDE (KFP v1 SDK; ContainerOp was removed in kfp 2.x)
from kfp import dsl, compiler
@dsl.pipeline(name='Training Pipeline')
def ml_pipeline():
    preprocess = dsl.ContainerOp(
        name='preprocess',
        image='company/preprocess:v1'
    )
    train = dsl.ContainerOp(
        name='train-model',
        image='company/train:v1'
    ).after(preprocess).set_gpu_limit(1)
compiler.Compiler().compile(ml_pipeline, 'pipeline.yaml')
Large Language Model Development
Build, fine-tune, and deploy LLMs in cloud environments
LLM Fine-Tuning Environment
# DevContainer for LLM Fine-Tuning
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
# The CUDA base image ships without Python, so install it first
RUN apt-get update && apt-get install -y python3-pip
# Install PyTorch with CUDA support
RUN pip install torch==2.1.0+cu121 --index-url \
    https://download.pytorch.org/whl/cu121
# Fine-tuning libraries
RUN pip install \
    transformers==4.35.0 \
    accelerate==0.25.0 \
    peft==0.7.0 \
    bitsandbytes==0.41.0 \
    trl==0.7.0
# Quantization for memory efficiency
RUN pip install \
    auto-gptq \
    optimum
# Multi-GPU training: log NCCL communication details
ENV NCCL_DEBUG=INFO
# Debugging only: forces synchronous kernel launches and slows training
ENV CUDA_LAUNCH_BLOCKING=1
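Why quantization matters for GPU choice can be shown with rough arithmetic. The sketch below estimates the memory needed for model weights alone, assuming a nominal 7 billion parameters and the usual bytes-per-parameter for each precision. It deliberately ignores activations, gradients, optimizer state, and quantization overhead, so treat the numbers as lower bounds.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    nominal_7b = 7_000_000_000  # nominal size; real checkpoints differ slightly
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(nominal_7b, dtype):.1f} GB")
```

Under these assumptions a 7B model drops from about 14 GB of weights in fp16 to about 3.5 GB in 4-bit form, which is the difference between needing a large GPU and fitting on a smaller development card.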
Distributed Training Setup
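# Multi-GPU / multi-node distributed training (start with `accelerate launch`)
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM
accelerator = Accelerator()
# No device_map="auto" here: Accelerate places the model on each process's GPU
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# train_loader: your own DataLoader yielding tokenized batches with labels
# Wrap model, optimizer, dataloader
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)
# Accelerate handles multi-GPU, mixed precision
for batch in train_loader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()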
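When a run moves from one GPU to several, the per-device batch size stays the same but the effective batch size grows with the number of processes. A small worked example, using the batch size of 16 from the tracking example and the 8-GPU configuration shown earlier; the dataset size is a made-up figure for illustration.

```python
import math


def effective_batch_size(per_device_batch, num_processes, grad_accum_steps=1):
    """Samples contributing to one optimizer step in data-parallel training."""
    return per_device_batch * num_processes * grad_accum_steps


def steps_per_epoch(dataset_size, per_device_batch, num_processes, grad_accum_steps=1):
    """Optimizer steps needed to see the whole dataset once."""
    batch = effective_batch_size(per_device_batch, num_processes, grad_accum_steps)
    return math.ceil(dataset_size / batch)


if __name__ == "__main__":
    hypothetical_dataset = 50_000  # illustrative dataset size
    print("effective batch:", effective_batch_size(16, 8))
    print("steps per epoch:", steps_per_epoch(hypothetical_dataset, 16, 8))
```

Learning-rate schedules tuned on a single GPU often need revisiting once the effective batch size changes by this factor.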
RAG Development Environment
# RAG Stack in DevContainer
RUN pip install \
    langchain==0.1.0 \
    chromadb==0.4.18 \
    sentence-transformers==2.2.2 \
    openai \
    tiktoken
# Vector store setup (Python, run inside the workspace)
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)
vectorstore = Chroma(
    persist_directory="/workspace/vectors",
    embedding_function=embeddings
)
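Conceptually, the retrieval step of a RAG stack ranks stored embeddings by their similarity to the query embedding. The brute-force sketch below illustrates that idea with cosine similarity over toy three-dimensional vectors. It is not how Chroma works internally (real vector stores use indexes and may use other distance metrics), and the documents and vectors are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, documents, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy embeddings; a real pipeline would get these from an embedding model.
    docs = {
        "GPU node setup": [0.9, 0.1, 0.0],
        "Billing policy": [0.0, 1.0, 0.0],
        "CUDA driver notes": [0.8, 0.0, 0.2],
    }
    for text, score in top_k([1.0, 0.0, 0.0], docs):
        print(f"{score:.3f}  {text}")
```

The retrieved texts are then placed in the LLM prompt as context, which is the "augmented" part of retrieval-augmented generation.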
Model Serving from CDEs
GPU Cost Management Strategies
Control AI/ML infrastructure costs without sacrificing productivity
Auto-Stop Policies
Shut down GPU instances after 30 minutes of inactivity. A100s cost $2.50/hr - idle time adds up fast.
Spot Instances
Use preemptible GPUs for non-critical experiments. Same A100 for $0.75/hr instead of $2.50/hr.
Budget Quotas
Set monthly GPU budgets per team or user. Prevent surprise $10K bills from forgotten training runs.
Cost Alerts
Get notified when GPU usage exceeds thresholds. Catch runaway costs before they spiral.
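The auto-stop policy above comes down to comparing the time since the last activity against a threshold. A minimal sketch of that decision follows, using the 30-minute limit and the $2.50/hr A100 figure quoted above. How "last activity" is measured (terminal input, Jupyter kernels, GPU utilization) depends on the platform and is left as an assumption; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

IDLE_LIMIT = timedelta(minutes=30)  # threshold from the auto-stop policy above


def should_stop(last_activity, now=None, idle_limit=IDLE_LIMIT):
    """True when the workspace has been idle for at least idle_limit."""
    now = now or datetime.now(timezone.utc)
    return now - last_activity >= idle_limit


def idle_cost(idle_hours, hourly_rate=2.50):
    """Money spent on a GPU that sat idle, at the quoted on-demand rate."""
    return idle_hours * hourly_rate


if __name__ == "__main__":
    last_seen = datetime.now(timezone.utc) - timedelta(minutes=45)
    print("stop workspace:", should_stop(last_seen))
    print(f"a weekend left running (60 h) costs ${idle_cost(60):.2f}")
```

The same comparison can drive cost alerts: notify at a lower threshold, then stop the workspace at the higher one.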
GPU Cost Comparison
Pro Tip: Use T4 instances for development and A100 spot instances for training to save 80%+ on GPU costs.
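The spot discount can be checked with the two A100 rates quoted above ($2.50/hr on demand, $0.75/hr spot). The sketch below works through that arithmetic; the run size of 100 hours on 8 GPUs is a hypothetical example, and T4 pricing is not given in this section, so the combined saving from developing on T4s is not computed here.

```python
ON_DEMAND_RATE = 2.50  # USD per A100-hour, figure quoted above
SPOT_RATE = 0.75       # USD per A100-hour on spot, figure quoted above


def savings_percent(baseline, discounted):
    """Relative saving of the discounted rate versus the baseline, in percent."""
    return (baseline - discounted) / baseline * 100


def training_cost(gpu_hours, rate, num_gpus=1):
    """Total cost of a run that keeps num_gpus busy for gpu_hours each."""
    return gpu_hours * rate * num_gpus


if __name__ == "__main__":
    print(f"spot saving: {savings_percent(ON_DEMAND_RATE, SPOT_RATE):.0f}%")
    # Hypothetical run: 100 hours on 8 GPUs.
    print(f"on demand: ${training_cost(100, ON_DEMAND_RATE, 8):,.2f}")
    print(f"spot:      ${training_cost(100, SPOT_RATE, 8):,.2f}")
```

With these two rates alone the spot saving is 70%; reaching a higher overall figure depends on how much work moves to cheaper development GPUs.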
Complete Example Configurations
Production-ready templates you can deploy today