AI/ML Development in Cloud Environments

Access enterprise GPUs, train large language models, and build AI applications without local hardware constraints

From Jupyter notebooks to distributed training - run cutting-edge AI/ML workloads in reproducible cloud environments

Why CDEs Transform AI/ML Development

Cloud Development Environments solve the unique challenges of AI/ML workflows

Enterprise GPU Access Without Hardware Investment

Train models on NVIDIA A100s, V100s, or H100s without purchasing expensive hardware. Scale from single GPU development to multi-node distributed training on demand.

No Capital Investment
$15K-$50K GPU costs avoided
Latest Hardware
A100, H100, L4 instantly available
Elastic Scaling
1 to 16+ GPUs on demand
Pay Per Second
Stop paying when training completes

Reproducible ML Environments

Eliminate "CUDA version mismatch" and "works on my machine" issues. Every data scientist gets the same Python version, CUDA toolkit, and library dependencies; only the GPU driver still comes from the host node.

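# DevContainer ensures consistency
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04

# The CUDA base image ships without Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# +cu121 wheels are published on the PyTorch index, not PyPI
RUN pip install torch==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121
RUN pip install transformers==4.35.0
# Everyone gets the same environment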

Seamless Team Collaboration on Models

Share running Jupyter notebooks, debug models together, and hand off training runs without "send me your environment.yml" headaches.

Shared Workspaces
Team access to experiments
Version Control
Git integration built-in

Secure Training Data Management

Training data never leaves your VPC. Medical images, customer data, and proprietary datasets stay in centralized, audited storage.

Data Locality
Datasets in VPC only
Audit Trails
Track data access

GPU Cost Optimization

Auto-stop expensive GPU instances when idle. Use spot instances for non-critical training. Pay only for actual compute time.

Always-On GPU Workstation $8,760/mo
CDE with Auto-Stop (8h/day) $2,920/mo
67% cost savings with TTL policies
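The 67% figure follows from running hours alone: an always-on instance bills 24 hours a day, while an 8h/day workspace bills a third of that. A minimal sketch of the arithmetic, assuming a flat hourly rate and a 30-day month; the rate below is back-computed from the $8,760 figure and is an assumption, not a quoted price.

```python
DAYS_PER_MONTH = 30  # assumption: 30-day billing month


def monthly_cost(hourly_rate: float, hours_per_day: float) -> float:
    """Monthly bill for an instance that runs `hours_per_day` hours every day."""
    return hourly_rate * hours_per_day * DAYS_PER_MONTH


def savings_fraction(always_on: float, auto_stop: float) -> float:
    """Fraction of the always-on bill saved by the auto-stop schedule."""
    return 1 - auto_stop / always_on


# Hypothetical hourly rate implied by $8,760/mo at 24h/day
rate = 8760 / (24 * DAYS_PER_MONTH)
always_on = monthly_cost(rate, 24)
auto_stop = monthly_cost(rate, 8)
saved = savings_fraction(always_on, auto_stop)
print(f"always-on ${always_on:,.0f}/mo, auto-stop ${auto_stop:,.0f}/mo, saved {saved:.0%}")
```

The saving depends only on the ratio of running hours, so it holds for any hourly rate as long as the workspace really is stopped for the other 16 hours.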

Built-In Experiment Tracking

Integrate MLflow, Weights & Biases, or Neptune.ai into your CDE templates. Track hyperparameters, metrics, and artifacts automatically.

MLflow pre-configured
Model registry integration

GPU Configuration Options

Choose the right GPU for your workload - from experimentation to production training

NVIDIA A100

Large Model Training

VRAM: 40GB / 80GB
Tensor Cores: 432
Best For: LLM Training
Cost: $1.50-$3.00/hr

NVIDIA V100

General ML Training

VRAM: 16GB / 32GB
Tensor Cores: 640
Best For: CV, NLP
Cost: $0.80-$1.50/hr

NVIDIA T4

Development & Inference

VRAM: 16GB
Tensor Cores: 320
Best For: Development
Cost: $0.35-$0.60/hr

NVIDIA H100

Cutting-Edge LLMs

VRAM: 80GB
Tensor Cores: 528 (Gen 4)
Best For: GPT-4 Scale
Cost: $4.00-$8.00/hr

NVIDIA A10G

Balanced Performance

VRAM: 24GB
Tensor Cores: 192
Best For: Fine-Tuning
Cost: $0.60-$1.20/hr

AMD MI250X

ROCm Alternative

VRAM: 128GB HBM2e
Compute: 95.7 TFLOPS
Best For: ROCm Stacks
Cost: $2.00-$3.50/hr
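One way to read the cards above is as a lookup: given the VRAM a job needs, pick the cheapest option that fits. The sketch below copies the largest VRAM variant and the low end of each price range from the cards; the `GpuOption` type and `cheapest_fit` helper are hypothetical names for illustration, and the required VRAM is an input you would estimate yourself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuOption:
    name: str
    vram_gb: int            # largest VRAM variant listed above
    min_cost_per_hr: float  # low end of the listed price range


# Figures copied from the cards above
OPTIONS = [
    GpuOption("NVIDIA T4", 16, 0.35),
    GpuOption("NVIDIA A10G", 24, 0.60),
    GpuOption("NVIDIA V100", 32, 0.80),
    GpuOption("NVIDIA A100", 80, 1.50),
    GpuOption("AMD MI250X", 128, 2.00),
    GpuOption("NVIDIA H100", 80, 4.00),
]


def cheapest_fit(required_gb: float) -> Optional[GpuOption]:
    """Cheapest listed GPU whose VRAM covers `required_gb`, or None if none fits."""
    fits = [gpu for gpu in OPTIONS if gpu.vram_gb >= required_gb]
    return min(fits, key=lambda gpu: gpu.min_cost_per_hr) if fits else None


for need_gb in (12, 20, 60, 200):
    choice = cheapest_fit(need_gb)
    print(need_gb, "GB ->", choice.name if choice else "no single GPU fits")
```

Price is only one axis: the MI250X requires a ROCm software stack, so a real selection would also filter on framework support.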

GPU Allocation in Terraform Templates

# Coder Terraform Template for GPU Workspace
resource "coder_agent" "main" {
  arch = "amd64"
  os   = "linux"
}

resource "kubernetes_pod" "workspace" {
  metadata {
    name = "ml-workspace-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}"
  }

  spec {
    container {
      name  = "dev"
      image = "nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04"

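      resources {
        limits = {
          "nvidia.com/gpu" = "1"     # One GPU; the node selector below makes this an A10G
          memory           = "56Gi"  # stay below the node's 64 GiB so the pod can schedule
          cpu              = "14"    # stay below the node's 16 vCPUs
        }
      }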

      volume_mount {
        name       = "home"
        mount_path = "/home/coder"
      }
    }

    # Node selector for GPU instance type
    node_selector = {
      "node.kubernetes.io/instance-type" = "g5.4xlarge"  # AWS A10G
    }
  }
}

# Multi-GPU configuration
resource "kubernetes_pod" "distributed_training" {
  spec {
    container {
      resources {
        limits = {
          "nvidia.com/gpu" = "8"  # 8x A100 for distributed training
        }
      }
    }
  }
}

ML Framework Templates

Pre-configured environments for popular AI/ML frameworks

PyTorch Environment

Deep learning research and production

# DevContainer for PyTorch + CUDA
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

RUN pip install --no-cache-dir \
    torchvision==0.16.0 \
    torchaudio==2.1.0 \
    pytorch-lightning==2.1.0 \
    tensorboard==2.15.0

RUN pip install jupyter jupyterlab ipywidgets

# Install monitoring tools
RUN pip install wandb mlflow

ENV CUDA_VISIBLE_DEVICES=0
PyTorch 2.1 CUDA 12.1 Lightning

TensorFlow / Keras

Production ML pipelines

# TensorFlow GPU Environment
FROM tensorflow/tensorflow:2.15.0-gpu

RUN pip install --no-cache-dir \
    keras==2.15.0 \
    tensorflow-datasets==4.9.0 \
    tf-agents==0.17.0

# MLOps tools
RUN pip install \
    tensorflow-serving-api \
    tensorboard-plugin-profile

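# Verify GPU from a terminal in the running workspace
# (GPUs are normally not attached during image build, so a RUN check would print [])
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"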
TensorFlow 2.15 Keras TF Serving

Hugging Face Transformers

NLP and LLM development

# Transformers Development Environment
FROM huggingface/transformers-pytorch-gpu:4.35.0

RUN pip install --no-cache-dir \
    datasets==2.15.0 \
    evaluate==0.4.1 \
    accelerate==0.25.0 \
    peft==0.7.0 \
    bitsandbytes==0.41.0

# LLM fine-tuning tools
RUN pip install \
    trl==0.7.0 \
    wandb \
    jupyter
Transformers PEFT Accelerate

JAX / Flax

High-performance research

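# JAX Environment with GPU support
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04

# The CUDA base image ships without Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip
RUN pip install --upgrade "jax[cuda12_pip]" \
    -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

RUN pip install \
    flax==0.7.5 \
    optax==0.1.7 \
    orbax==0.1.5

# Verify JAX sees the GPU from a terminal in the running workspace
# (GPUs are normally not attached during image build)
$ python3 -c "import jax; print(jax.devices())"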
JAX Flax Optax

Jupyter Integration in CDEs

Run JupyterLab on cloud GPUs with seamless workspace integration

JupyterLab Server Configuration

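# Coder startup script for Jupyter
resource "coder_agent" "main" {
  arch = "amd64"
  os   = "linux"

  startup_script = <<-EOF
    #!/bin/bash
    # Start JupyterLab with token auth.
    # Note: the workspace name is guessable and allow_origin='*' accepts any origin,
    # so access control here rests on the Coder proxy (share = "owner" below).
    jupyter lab \
      --ip=0.0.0.0 \
      --port=8888 \
      --no-browser \
      --ServerApp.token='${data.coder_workspace.me.name}' \
      --ServerApp.allow_origin='*' \
      --ServerApp.root_dir='/home/coder/notebooks' &
  EOF
}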

# Expose Jupyter via Coder proxy
resource "coder_app" "jupyter" {
  agent_id     = coder_agent.main.id
  display_name = "JupyterLab"
  slug         = "jupyter"
  url          = "http://localhost:8888"
  icon         = "/icon/jupyter.svg"
  share        = "owner"
  subdomain    = false
}

Jupyter Features in CDEs

GPU Kernel Access
torch.cuda.is_available() returns True
Custom Kernels
PyTorch, TensorFlow, JAX kernels pre-installed
Collaborative Notebooks
Share running notebooks with team members
Git Integration
nbstripout for clean commits
Extensions Supported
jupyterlab-git, jupyterlab-lsp, code formatter
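A quick first cell for a new notebook is a check that the kernel actually sees the GPU. The sketch below is one possible version, assuming PyTorch is the framework in use; `gpu_report` is a hypothetical helper name, and it returns an empty report instead of failing when PyTorch is not installed in the kernel.

```python
import importlib.util


def gpu_report() -> dict:
    """Summarize GPU visibility for the current kernel.

    Returns {"torch_installed": bool, "cuda_available": bool, "devices": [names]}.
    """
    report = {"torch_installed": False, "cuda_available": False, "devices": []}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this kernel

    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


print(gpu_report())
```

If `cuda_available` comes back False on a GPU workspace, the usual suspects are a CPU-only PyTorch wheel or a pod scheduled without a GPU resource limit.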

Clean Git Commits with nbstripout

Prevent notebook output bloat in Git by stripping cell outputs before commits

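# Install nbstripout in your CDE template
RUN pip install nbstripout

# Register the filter system-wide so it applies to every user and repository
# (--global in a Dockerfile would only configure the build user)
RUN git config --system filter.nbstripout.clean 'nbstripout'
RUN git config --system filter.nbstripout.smudge cat
RUN git config --system filter.nbstripout.required true

# Apply the filter to notebooks through a system-wide attributes file
# (alternatively, commit '*.ipynb filter=nbstripout' to each repo's .gitattributes)
RUN git config --system core.attributesfile /etc/gitattributes
RUN echo '*.ipynb filter=nbstripout' >> /etc/gitattributes

# Result: Only code is committed, not 50MB of outputs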
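To make concrete what the filter removes, here is a simplified sketch of output stripping on notebook JSON. It is not nbstripout's actual implementation: `strip_outputs` is a hypothetical helper that only clears the `outputs` and `execution_count` fields of code cells in an nbformat 4 notebook.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["print('hi')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
}

stripped = strip_outputs(nb)
print(json.dumps(stripped["cells"][0], indent=2))
```

The source of every cell survives untouched, which is why diffs on stripped notebooks show only code and text changes.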

Complete Data Science Workflows

Integrated tools for the full ML lifecycle

Data Versioning (DVC)

Track datasets like code

# DVC pre-installed in template
RUN pip install "dvc[s3]"

# One-time setup, run inside the project's Git repository
# (by default dvc init needs a Git repo, so it cannot run at image build time)
$ dvc init
$ dvc remote add -d storage s3://ml-datasets

# Track large datasets
$ dvc add data/training_set.parquet
$ git add data/training_set.parquet.dvc data/.gitignore
$ dvc push

# Team pulls same data version
$ dvc pull
Benefit: No 10GB datasets in Git. Everyone gets the right data version.
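The reason Git stays small is that it stores a short pointer to the data instead of the data itself. The sketch below illustrates the idea with a content hash and file size; it is a simplification, not DVC's code, and `pointer_for` is a hypothetical helper (the real `.dvc` file is YAML written by `dvc add`).

```python
import hashlib
import tempfile
from pathlib import Path


def pointer_for(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Build a small pointer record (hash, size, name) for a data file."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Read in chunks so large datasets do not have to fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {"md5": digest.hexdigest(), "size": path.stat().st_size, "path": path.name}


PAYLOAD = b"example bytes standing in for a large dataset"

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "training_set.parquet"
    data.write_bytes(PAYLOAD)
    pointer = pointer_for(data)

print(pointer)
```

Because the pointer changes whenever the file content changes, checking out an older commit and pulling retrieves exactly the data version that commit was trained on.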

Experiment Tracking

MLflow, W&B, Neptune.ai

# MLflow tracking in workspace
import mlflow

mlflow.set_tracking_uri("http://mlflow.company.com")
mlflow.set_experiment("llm-fine-tuning")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 16)

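    # Train model; train_model() is your own training function, assumed here
    # to return the trained model and its validation loss
    model, val_loss = train_model()

    mlflow.log_metric("val_loss", val_loss)
    mlflow.pytorch.log_model(model, "model")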
MLflow
W&B
Neptune

Model Registry

Production model management

Version Tagging
Staging, Production, Archived
Model Comparison
Compare metrics across versions
Rollback Support
Instantly revert to previous model

Pipeline Orchestration

Automated ML workflows

# Kubeflow Pipelines in CDE (KFP v1 SDK; dsl.ContainerOp was removed in kfp 2.x)
from kfp import dsl, compiler

@dsl.pipeline(name='Training Pipeline')
def ml_pipeline():
    preprocess = dsl.ContainerOp(
        name='preprocess',
        image='company/preprocess:v1'
    )

    train = dsl.ContainerOp(
        name='train-model',
        image='company/train:v1'
    ).after(preprocess).set_gpu_limit(1)

compiler.Compiler().compile(ml_pipeline, 'pipeline.yaml')

Large Language Model Development

Build, fine-tune, and deploy LLMs in cloud environments

LLM Fine-Tuning Environment

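# DevContainer for LLM Fine-Tuning
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

# The CUDA base image ships without Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch with CUDA support
RUN pip install torch==2.1.0+cu121 --index-url \
    https://download.pytorch.org/whl/cu121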

# Fine-tuning libraries
RUN pip install \
    transformers==4.35.0 \
    accelerate==0.25.0 \
    peft==0.7.0 \
    bitsandbytes==0.41.0 \
    trl==0.7.0

# Quantization for memory efficiency
RUN pip install \
    auto-gptq \
    optimum

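# Multi-GPU training diagnostics
ENV NCCL_DEBUG=INFO
# Debug only: forces synchronous CUDA kernel launches and slows training;
# remove once kernel errors have been traced
ENV CUDA_LAUNCH_BLOCKING=1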
PEFT / LoRA
Memory-efficient fine-tuning
QLoRA
4-bit quantization + LoRA
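The memory savings behind LoRA and QLoRA come down to two pieces of arithmetic: LoRA trains two small rank-r factors instead of the full weight matrix, and 4-bit quantization stores the frozen weights in a quarter of the fp16 footprint. A minimal sketch, assuming a single square 4096 x 4096 projection and rank 8 as illustrative values; both helpers are hypothetical and count weights only (no activations, gradients, or optimizer state).

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def weights_gb(params_billions: float, bits: int) -> float:
    """Weights-only memory in GB at the given precision."""
    return params_billions * bits / 8


d = 4096   # assumption: hidden size of one square projection matrix
rank = 8   # assumption: an illustrative LoRA rank
lora = lora_trainable_params(d, d, rank)
full = d * d  # parameters in the frozen dense matrix, bias ignored
print(f"LoRA: {lora:,} trainable vs {full:,} frozen ({lora / full:.2%})")

for bits in (16, 4):
    print(f"7B weights at {bits}-bit: {weights_gb(7, bits)} GB")
```

Under these assumptions the adapter is well under 1% of the matrix it modifies, which is why fine-tuning fits on GPUs that could not hold full-precision gradients for the whole model.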

Distributed Training Setup

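# Multi-GPU / multi-node training; start with: accelerate launch train.py
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# No device_map="auto" here: Accelerate places the model on each process's GPU
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# train_loader: your DataLoader of tokenized batches that include "labels"

# Wrap model, optimizer, dataloader
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

# Accelerate handles multi-GPU, mixed precision
model.train()
for batch in train_loader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()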
Strategy: DDP, FSDP, DeepSpeed
Nodes: 1-64 GPU instances

RAG Development Environment

# RAG Stack in DevContainer
RUN pip install \
    langchain==0.1.0 \
    chromadb==0.4.18 \
    sentence-transformers==2.2.2 \
    openai \
    tiktoken

# Vector store setup (Python, run in the workspace rather than the Dockerfile)
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

vectorstore = Chroma(
    persist_directory="/workspace/vectors",
    embedding_function=embeddings
)
LangChain ChromaDB Embeddings
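Conceptually, the vector store answers a query by ranking stored embeddings by similarity to the query embedding. The toy sketch below shows that retrieval step in plain Python with made-up three-dimensional vectors; the `STORE` contents, `cosine`, and `top_k` are hypothetical and stand in for what the embedding model and Chroma do at scale.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k: int = 2):
    """Texts of the k stored items most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:k]]


# Made-up embeddings; a real model such as all-MiniLM-L6-v2 produces 384 dimensions
STORE = [
    ("GPU quota policy", [0.9, 0.1, 0.0]),
    ("Jupyter setup guide", [0.1, 0.9, 0.1]),
    ("Spot instance pricing", [0.8, 0.0, 0.3]),
]

query = [1.0, 0.0, 0.1]
print(top_k(query, STORE))
```

The retrieved texts are what a RAG pipeline then places into the LLM prompt as context.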

Model Serving from CDEs

vLLM Deployment
High-throughput inference server
TensorRT-LLM
NVIDIA optimized inference
Text Generation Inference
Hugging Face production server

GPU Cost Management Strategies

Control AI/ML infrastructure costs without sacrificing productivity

Auto-Stop Policies

Shut down GPU instances after 30 minutes of inactivity. A100s cost $2.50/hr - idle time adds up fast.

Save 60-70% on GPU costs
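The policy itself is a simple comparison between the last recorded activity and a timeout. The sketch below shows that decision and the cost of ignoring it at the $2.50/hr rate quoted above; `should_stop` and `idle_cost` are hypothetical helpers, and how a platform detects "activity" (open connections, running kernels, and so on) is platform-specific and not modeled here.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=30)  # the 30-minute policy described above


def should_stop(last_activity: datetime, now: datetime,
                timeout: timedelta = IDLE_TIMEOUT) -> bool:
    """True once the workspace has been idle for at least `timeout`."""
    return now - last_activity >= timeout


def idle_cost(idle_hours: float, hourly_rate: float = 2.50) -> float:
    """Dollars spent on idle time at the given hourly rate."""
    return idle_hours * hourly_rate


now = datetime(2024, 1, 1, 12, 0)
print(should_stop(now - timedelta(minutes=45), now))  # idle past the timeout
print(should_stop(now - timedelta(minutes=10), now))  # still within the timeout
print(idle_cost(16))  # a workspace left running for the 16 non-working hours
```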

Spot Instances

Use preemptible GPUs for non-critical experiments. Same A100 for $0.75/hr instead of $2.50/hr.

70% cheaper than on-demand

Budget Quotas

Set monthly GPU budgets per team or user. Prevent surprise $10K bills from forgotten training runs.

Predictable monthly spend

Cost Alerts

Get notified when GPU usage exceeds thresholds. Catch runaway costs before they spiral.

Real-time monitoring
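Quotas and alerts can share one check: compare spend to date against the monthly limit and against a warning threshold below it. A minimal sketch, assuming an 80% alert threshold and a team name chosen purely for illustration; `Budget` and `budget_status` are hypothetical and not tied to any particular billing API.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    team: str
    monthly_limit: float          # dollars
    alert_threshold: float = 0.8  # assumption: warn at 80% of the limit


def budget_status(budget: Budget, spend_to_date: float) -> str:
    """Classify spend as "ok", "alert", or "blocked" against the budget."""
    if spend_to_date >= budget.monthly_limit:
        return "blocked"
    if spend_to_date >= budget.alert_threshold * budget.monthly_limit:
        return "alert"
    return "ok"


team = Budget(team="nlp-research", monthly_limit=10_000)
for spend in (2_500, 8_200, 10_400):
    print(spend, budget_status(team, spend))
```

In practice the "blocked" state would map to refusing new GPU workspaces, while "alert" would trigger a notification to the team owner.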

GPU Cost Comparison

GPU               Typical use              On-Demand   Spot       Monthly at spot rate (8h/day, 30 days)
NVIDIA A100 80GB  Large model training     $3.67/hr    $1.10/hr   $264/mo
NVIDIA V100 16GB  General ML workloads     $1.46/hr    $0.44/hr   $106/mo
NVIDIA T4 16GB    Development & inference  $0.526/hr   $0.16/hr   $38/mo

Pro Tip: Develop on a T4 and train on A100 spot instances. At the rates above, a T4 costs about 86% less per hour than an on-demand A100 80GB, and an A100 spot instance about 70% less.
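The monthly figures in the comparison can be reproduced from the hourly rates: 8 hours a day over a 30-day month is 240 hours, and the listed monthly amounts match the spot rate, not the on-demand rate. A short sketch of that calculation, using only the rates listed above; the 30-day month is an assumption.

```python
HOURS_PER_MONTH = 8 * 30  # 8 hours/day over an assumed 30-day month

# (on-demand $/hr, spot $/hr) copied from the comparison above
RATES = {
    "NVIDIA A100 80GB": (3.67, 1.10),
    "NVIDIA V100 16GB": (1.46, 0.44),
    "NVIDIA T4 16GB": (0.526, 0.16),
}


def monthly(rate_per_hr: float) -> float:
    """Monthly cost at `rate_per_hr` for HOURS_PER_MONTH hours of use."""
    return rate_per_hr * HOURS_PER_MONTH


for gpu, (on_demand, spot) in RATES.items():
    discount = 1 - spot / on_demand
    print(f"{gpu}: on-demand ${monthly(on_demand):,.0f}/mo, "
          f"spot ${monthly(spot):,.0f}/mo ({discount:.0%} cheaper)")
```

At on-demand rates the same 240 hours on an A100 80GB come to roughly $881, which is the number to budget for when a training run cannot tolerate spot interruptions.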
