CDEs for Data Science

Unified development environments for ML engineers and data scientists. GPU access, notebook environments, experiment tracking, and MLOps integration in Cloud Development Environments.

Why Data Scientists Need CDEs

Data science workflows have unique infrastructure challenges that traditional development environments were never designed to handle.

Data scientists and ML engineers face environment challenges that make their daily work more frustrating than it needs to be. Python dependency management is notoriously fragile: a project that requires TensorFlow 2.15 with CUDA 12.1 and specific NumPy versions will conflict with another project using PyTorch 2.2 and a different CUDA toolkit. Local virtual environments and conda installations quickly become tangled, and "it works on my machine" is an even more common complaint in data science than in traditional software development because of the complexity of GPU drivers, scientific computing libraries, and framework interdependencies.

GPU access is another persistent bottleneck. Most data scientists do not have dedicated GPU workstations, and even those who do are limited to whatever single card was purchased 2-3 years ago. When a model training job requires an A100 or H100 for reasonable iteration speed, the typical workaround involves SSH-ing into a shared server, fighting with other team members for GPU time, and manually managing CUDA installations across a machine that multiple people are configuring differently. CDEs solve this by providing on-demand GPU workspaces with pre-configured drivers and frameworks, provisioned in minutes and released when training completes.

Notebook version control and collaboration remain unsolved problems for many data science teams. Jupyter notebooks contain a mix of code, outputs, and metadata in JSON format that produces noisy, unreadable diffs in Git. Data scientists often resort to emailing notebooks, sharing them on Slack, or using shared drives - none of which provide the collaboration, review, and reproducibility workflows that software engineers take for granted. CDE-hosted notebooks solve this by providing real-time collaboration, clean version control integration, and reproducible execution environments that ensure every team member gets identical results from the same notebook.

Reproducibility is the foundation of credible data science, yet it is remarkably difficult to achieve with local development setups. A model that produces specific results on one machine may behave differently on another due to subtle differences in library versions, random seed handling, GPU precision modes, or operating system behavior. CDEs provide bit-for-bit reproducible environments defined in code (devcontainers or Docker images), ensuring that every experiment can be reproduced by any team member at any time - a requirement for both scientific rigor and regulatory compliance.
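The workspace image pins libraries and drivers, but run-to-run determinism also depends on the seed handling and GPU precision modes mentioned above, which live in the training code. A minimal PyTorch-side sketch of those controls (illustrative settings, not an exhaustive determinism guarantee):

    import os
    import random

    import numpy as np
    import torch

    def make_deterministic(seed: int = 42) -> None:
        """Pin the common sources of randomness for a training run."""
        random.seed(seed)        # Python's built-in RNG
        np.random.seed(seed)     # NumPy RNG
        torch.manual_seed(seed)  # seeds CPU and all CUDA RNGs
        # Deterministic cuBLAS workspaces, required by PyTorch when
        # deterministic algorithms are enabled on CUDA.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
        # Fail loudly if an op has no deterministic implementation.
        torch.use_deterministic_algorithms(True)
        # Disable cuDNN autotuning, which selects kernels nondeterministically.
        torch.backends.cudnn.benchmark = False

    make_deterministic(seed=42)

Baking a helper like this into the workspace's project template keeps seed handling consistent across every experiment, rather than leaving it to each notebook.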

Dependency Hell Solved

Isolated, pre-configured environments with pinned library versions eliminate Python dependency conflicts entirely.

On-Demand GPUs

Provision A100, H100, or H200 GPU workspaces in minutes. No hardware purchases, no shared server conflicts.

Notebook Version Control

Clean Git integration for Jupyter notebooks with readable diffs, code review, and collaboration workflows.

Reproducible Results

Environment-as-code ensures identical results across team members, time, and infrastructure. Critical for compliance.

Unified Delivery Pipelines

Breaking down silos between application development, ML engineering, and data science through a single internal developer platform.

Historically, data science teams have operated on completely separate infrastructure from application development teams. Software engineers use GitHub, CI/CD pipelines, and container orchestration. Data scientists use Jupyter servers, experiment tracking tools, and ad-hoc GPU clusters. ML engineers sit uncomfortably in between, trying to productionize models that were developed in an entirely different ecosystem. This separation creates handoff friction, duplicated infrastructure costs, and organizational silos that slow down the delivery of ML-powered features.

Modern platform engineering teams are converging these workflows into unified delivery pipelines served by a single Internal Developer Platform (IDP). In this model, a data scientist writes and validates a model in a CDE workspace with GPU access, an ML engineer packages it using the same CI/CD pipeline that application developers use, and the model deploys through the same infrastructure-as-code tooling as any other service. The CDE is the unifying layer - it provides consistent workspace experiences whether you are writing a React frontend, training a transformer model, or building a data pipeline in Apache Spark.

This convergence delivers measurable benefits. Organizations that unify their developer platforms commonly report 40-60% reductions in time-to-production for ML models, because the handoff between "notebook experiment" and "production service" no longer requires rewriting code, reconfiguring infrastructure, or translating between incompatible toolchains. Data scientists gain access to software engineering best practices (code review, automated testing, deployment automation), while application developers gain access to ML tooling (experiment tracking, model registries, feature stores) through the same platform they already use daily.

The Unified Platform Model

Application Development

Standard CDE workspaces with language runtimes, CI/CD integration, preview environments, and deployment automation.

ML Engineering

GPU-enabled CDE workspaces with training frameworks, model registries, serving infrastructure, and monitoring tools.

Data Science

Notebook-first CDE workspaces with data access, experiment tracking, visualization libraries, and collaboration features.

Notebook Environments in CDEs

Production-grade notebook experiences with GPU acceleration, real-time collaboration, and seamless integration into development workflows.

CDE-hosted notebook environments transform the traditional Jupyter experience from an isolated local tool into a fully integrated development platform. Instead of running JupyterLab on a laptop with limited RAM and no GPU, data scientists access notebooks running on cloud infrastructure with configurable compute resources. A typical CDE notebook workspace might include 32GB+ RAM, dedicated GPU access, pre-installed data science libraries (pandas, scikit-learn, TensorFlow, PyTorch, JAX), and direct access to data warehouses and object storage - all available within seconds of workspace launch.
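To illustrate the "direct access to data warehouses and object storage" point, a notebook cell in such a workspace can read data straight from the lake. A sketch, assuming the image ships pandas with s3fs and a parquet engine and that the platform injects credentials; the bucket and path are hypothetical placeholders:

    import pandas as pd

    # Credentials are injected into the workspace by the CDE platform,
    # so the notebook reads directly from object storage. Bucket and
    # key below are hypothetical.
    df = pd.read_parquet("s3://example-datalake/features/train.parquet")
    print(df.shape)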

VS Code's notebook support has become a compelling alternative to standalone JupyterLab for many data scientists. CDE platforms like Coder and Ona (formerly Gitpod) provide VS Code (or its open-source variant) as the primary IDE, with full notebook rendering, IntelliSense, debugging, and integrated terminal access. This means data scientists get the exploratory notebook workflow they prefer alongside the full-featured IDE capabilities (refactoring, Git integration, extensions) that software engineers expect. GPU-attached notebooks execute cells on cloud GPUs while the interface runs in the browser, eliminating the need for local GPU hardware entirely.
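The first cell in a GPU-attached notebook typically verifies that the cloud GPU is visible to the framework. A minimal PyTorch check (device names and memory will vary by workspace template):

    import torch

    # Confirm the workspace's cloud GPU is visible to PyTorch.
    assert torch.cuda.is_available(), "No CUDA device attached to this workspace"
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-80GB"
    print(torch.cuda.get_device_properties(0).total_memory // 2**30, "GiB")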

Collaboration in CDE-hosted notebooks goes far beyond sharing files. Multiple team members can work on the same notebook simultaneously with real-time cursors and cell-level conflict resolution. Code review workflows treat notebooks as first-class citizens with tools like ReviewNB or nbdime providing meaningful diffs that separate code changes from output changes. Shared CDE workspaces also mean that when a data scientist says "run this notebook," their colleague launches an identical environment and gets identical results - no environment debugging required.

JupyterLab

Full JupyterLab IDE with extensions, custom kernels, and multi-language support running on cloud infrastructure with GPU access.

VS Code Notebooks

Native notebook rendering in VS Code with IntelliSense, debugging, Git integration, and the full extension ecosystem.

Real-Time Collaboration

Multiple data scientists edit the same notebook simultaneously with live cursors, shared terminals, and cell-level conflict handling.

GPU-Attached Notebooks

Execute notebook cells on A100, H100, or H200 GPUs. Browser-based interface, cloud-powered compute. No local GPU needed.

GPU Access Patterns for Data Science

Efficient GPU provisioning strategies that balance training performance, cost optimization, and multi-tenant resource sharing.

On-demand GPU provisioning is the primary access pattern for data science CDEs. Instead of maintaining always-on GPU instances that sit idle between training runs, CDE platforms provision GPU-attached workspaces when a data scientist requests them and automatically scale them down when the training job completes or the workspace times out. This pattern can reduce GPU costs by 60-80% compared to dedicated GPU servers, because most data science workflows involve significant non-GPU work (data preparation, feature engineering, result analysis) that does not require expensive accelerator hardware.

GPU sharing technologies enable multiple data scientists to use a single physical GPU simultaneously, further improving cost efficiency. NVIDIA Multi-Instance GPU (MIG) technology partitions a single A100 or H100 into up to seven isolated GPU instances, each with dedicated memory and compute resources. This is ideal for development and small-scale experimentation where a full GPU is overkill. Multi-Process Service (MPS) offers a lighter-weight sharing model where multiple CUDA processes share a single GPU with time-slicing, suitable for inference workloads and notebook exploration where each user needs occasional GPU bursts rather than sustained compute.
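Because each MIG slice appears to CUDA as its own device, a workspace process is pinned to its assigned slice by setting CUDA_VISIBLE_DEVICES before the framework initializes. A sketch, assuming a MIG UUID copied from nvidia-smi -L (the UUID below is a placeholder):

    import os

    # A MIG slice is addressed by its UUID, as listed by `nvidia-smi -L`.
    # The value here is a placeholder; substitute one from your host.
    os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-a1b2c3d4-e5f6-7890-abcd-ef1234567890"

    # Import torch only after the variable is set, because CUDA
    # enumerates visible devices at initialization time.
    import torch

    print(torch.cuda.device_count())     # 1: only the assigned slice is visible
    print(torch.cuda.get_device_name(0))

In practice the CDE platform sets this variable in the workspace for you; the snippet shows the mechanism it relies on.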

Cost optimization for GPU workloads requires a layered strategy. Development and experimentation should use smaller GPUs (T4, L4) or shared MIG instances at $1-3/hour. Training runs should scale up to dedicated A100 or H100 instances for the duration of the training job, then scale back down. For sustained training workloads, reserved or committed-use instances from cloud providers offer 40-60% discounts over on-demand pricing. CDE platforms can automate this tiered approach by offering different workspace templates for "exploration" (small/shared GPU), "training" (dedicated GPU), and "production inference" (optimized GPU serving). For a deep dive into GPU hardware options and pricing, see our GPU computing guide.

On-Demand Provisioning

GPU workspaces spin up in minutes and release when idle. Pay only for actual GPU usage during training and experimentation.

60-80% cost savings

GPU Sharing (MIG/MPS)

Partition a single GPU into isolated instances for multiple data scientists. Ideal for exploration and small-scale experiments.

Up to 7 isolated instances

Tiered Cost Strategy

Small GPUs for exploration, dedicated GPUs for training, reserved capacity for sustained workloads. Automated by CDE templates.

40-60% reserved discounts

Experiment Tracking and MLOps

Integrated experiment management, model versioning, and production deployment pipelines within CDE workspaces.

Experiment tracking is the backbone of systematic data science, and CDEs make it dramatically easier to implement consistently. Tools like MLflow, Weights and Biases (W&B), and DVC can be pre-configured in CDE workspace images so that every training run automatically logs hyperparameters, metrics, artifacts, and environment details without requiring each data scientist to manually set up tracking infrastructure. When a CDE workspace launches, it connects to the organization's central experiment tracking server, inherits the correct project context, and begins logging from the first training call. This "tracking by default" approach ensures that no experiment goes unrecorded - a critical requirement for reproducibility, audit trails, and institutional knowledge preservation.
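A sketch of what "tracking by default" looks like with MLflow: the tracking URI and experiment name would normally be baked into the workspace image (the values below are hypothetical), and autologging records every training call with no per-run boilerplate:

    import mlflow
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # In a CDE these two calls are typically preconfigured via the
    # workspace image; shown explicitly here. URI is hypothetical.
    mlflow.set_tracking_uri("https://mlflow.internal.example.com")
    mlflow.set_experiment("churn-model")

    # Autologging captures hyperparameters, metrics, and the fitted
    # model for every training call.
    mlflow.autolog()

    X, y = load_iris(return_X_y=True)
    with mlflow.start_run(run_name="baseline"):
        LogisticRegression(max_iter=200).fit(X, y)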

Model registry integration within CDEs bridges the gap between experimentation and production. When a data scientist identifies a promising model, they can promote it to the model registry directly from their CDE workspace, triggering automated validation pipelines that test the model against production data, run fairness and bias checks, and verify serving performance requirements. This workflow replaces the traditional "throw it over the wall" handoff where a data scientist exports a model file and an ML engineer spends days figuring out how to deploy it. CDE-based MLOps makes the path from experiment to production a structured, automated pipeline.
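Promotion to the registry can then happen directly from the workspace once a run looks promising. A sketch using MLflow's registry API (the run ID and model name are placeholders):

    import mlflow

    # Promote the model logged by a finished run into the central
    # registry. Run ID and registered name are placeholders.
    run_id = "abc123def456"
    result = mlflow.register_model(
        model_uri=f"runs:/{run_id}/model",
        name="churn-classifier",
    )
    print(result.name, result.version)  # the registry assigns the next version

Registering a new version is the event that downstream validation pipelines can hook into for the automated checks described above.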

Data versioning and lineage tracking complete the reproducibility picture. DVC (Data Version Control) integrates with Git to version large datasets and model files alongside code, while tools like Pachyderm and Delta Lake provide data lineage tracking that records exactly which data was used to train each model. When these tools are pre-configured in CDE workspaces, data scientists can trace any model prediction back through the model version, training code, hyperparameters, and training dataset that produced it - a chain of custody that is essential for regulated industries like healthcare and financial services.
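DVC's Python API lets a workspace read back exactly the dataset revision a model was trained on. A sketch, assuming a hypothetical Git repo with a DVC-tracked file at the given path:

    import pandas as pd
    import dvc.api

    # Open a specific revision of a DVC-tracked dataset. Repo URL,
    # path, and rev are hypothetical placeholders.
    with dvc.api.open(
        "data/train.csv",
        repo="https://github.com/example-org/ml-project",
        rev="v1.2.0",
    ) as f:
        train = pd.read_csv(f)

Pinning the rev to the tag recorded with each model version is what closes the chain of custody from prediction back to training data.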

MLflow

Open-source platform for the complete ML lifecycle. Experiment tracking, model registry, and deployment serving. Deep integration with all major ML frameworks.

Weights and Biases

Collaborative experiment tracking with rich visualization dashboards, hyperparameter sweep automation, and model evaluation reports. Strong team collaboration features.

DVC

Git-based data and model versioning. Version large datasets and ML artifacts alongside code. Pipeline reproducibility with dependency graph tracking.

Best Practice: Configure experiment tracking tools in your CDE workspace image so that logging begins automatically when training starts. Data scientists should never need to think about setting up tracking - it should be invisible infrastructure that "just works" in every workspace.