Setting up and maintaining a functional, consistent local development environment poses significant challenges, particularly within educational institutions. Laboratory environments, often comprising hundreds of computers, demand identical and reliable setups. This difficulty is amplified in Data Science (DS) and Machine Learning (ML) labs, where experiments rely on a complex web of potentially hundreds of software packages with stringent version dependencies.
The current model is fragile:
Version Drift and Inconsistency: Managing package versions for different courses across every machine is a logistical nightmare. Inadvertent changes made by students further destabilize these setups.
Maintenance Overhead: Installing, updating, and patching packages on hundreds of individual computers consumes vast amounts of Internet bandwidth and IT staff time.
Inefficient Virtualization: The common practice of using full Virtual Machines (VMs) like Oracle VirtualBox simply for OS compatibility (e.g., to run native Linux workloads) is resource-intensive and adds an unnecessary layer of software maintenance and complexity.
This article proposes a fundamental shift in managing educational DS/ML labs: adopting a containerized solution. Containers offer a robust and efficient way to standardize development environments, directly addressing the core concerns of version control, maintenance, and resource consumption in large-scale academic settings.
[Image credit: Docker (https://www.docker.com/resources/what-container/)]
A container is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.
Unlike a traditional VM, which virtualizes the entire hardware stack to run a full guest Operating System (OS), a container virtualizes at the OS level: multiple containers run as isolated processes that share the host OS kernel.
The container architecture typically involves:
Host Operating System (OS): The base OS (e.g., Windows, macOS, Linux) that the physical machine runs.
Container Engine (e.g., Docker): The software that builds, runs, and manages containers. It interfaces with the host OS kernel.
Container Image: A read-only template containing the application and all its dependencies.
Container: The live, running instance of an image. It is an isolated process running on the host OS.
Because containers do not carry the overhead of a full OS, they are significantly more lightweight and faster to start than VMs.
Containers are an ideal fit for educational labs due to their inherent properties:
Immutability of the Image: A running container is created from a read-only image. When a student makes changes (e.g., installs a new package or deletes a file) inside the container, these changes are confined to the container layer and do not affect the original image.
Instant Recovery: If a student inadvertently breaks their environment, the container can be instantly removed and a pristine, working environment can be re-created from the original image in seconds. This eliminates troubleshooting and reinstallation time.
Quick Restoration and Distribution: The complete lab environment is packaged into a single image file. This single file can be rapidly distributed across the lab network, eliminating the need to install and maintain an individual development environment on each computer. If an image needs updating, a new image is built, archived, and deployed quickly.
Auto-Removal: Containers can be configured for auto-removal upon exit (e.g., using the --rm flag in Docker). This is crucial for educational institutions, as it prevents lab computers from accumulating a large number of inactive containers created by hundreds of students; see the sketch after this list.
Persistent Storage via Volume Binding: Containers can establish Volume Binding with the host machine. A student working on a Windows host machine can mount a local directory into the Linux-based container. This allows the student to:
Easily receive artifacts (data files, assignment code) from external sources.
Persistently save their work (code, notebooks, results) back to the host machine, ensuring their progress is saved even after the container is deleted.
Excellent Tooling Support:
Popular IDEs like Visual Studio Code (VS Code) offer first-class container support through extensions, allowing students to code directly inside the container as if it were a local environment.
Containers support port-forwarding, allowing browser-based tools like Jupyter Notebooks or JupyterLab (standard for DS/ML) running inside the container to be accessed directly from the host computer's browser.
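A minimal sketch tying these properties together (the image name, tag, and host path are illustrative placeholders, not a prescribed setup):
Bash
# Runs a disposable lab container: --rm removes it automatically on exit,
# the bind mount persists student work on the host, and -p forwards
# Jupyter's port 8888 to the host browser.
docker run -it --rm \
    --mount type=bind,source=/mnt/c/Users/Student,target=/home/host \
    -p 8888:8888 \
    --name lab-demo <lab-image>:<tag> /bin/bash

# If a long-lived (non --rm) container breaks, recovery takes two commands:
docker rm -f lab-demo                                      # Discards the damaged container
docker run -it --name lab-demo <lab-image>:<tag> /bin/bash # Re-creates a pristine environment from the image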
Docker is the most popular container technology and is an excellent choice for creating and managing lab environments.
The workflow involves creating an initial container from a base image, configuring the environment, and then saving those changes into a new, customized image.
Start with a Base Image: Use a minimal, clean base image suitable for the task, such as Miniconda, the lightweight Anaconda distribution:
Bash
# Pulls the base image
docker pull continuumio/miniconda3
Create and Configure the Environment: Start a container from the base image and make the necessary modifications (e.g., installing required data science packages such as scikit-learn and PyTorch, or setting up SSH). The -it flags allow interactive terminal access, and --name gives the container an easy-to-reference name.
Bash
# Creates and starts a container from the base image
docker run -it --name lab_setup_temp continuumio/miniconda3 /bin/bash
# [Inside the container]
# Installs required packages (e.g., installing the full Anaconda environment or specific packages)
# Example:
conda install -y numpy pandas scikit-learn jupyterlab
# Cleans up installation files
conda clean -a
# Exits the container
exit
Commit Changes to a Target Image: The changes made to the running container are saved into a new, custom image, effectively freezing the perfectly configured state.
Bash
# Commits the changes made in the container 'lab_setup_temp' to a new image 'ds-ml-lab-env:v1.0'
docker commit lab_setup_temp ds-ml-lab-env:v1.0
Clean Up: Remove the temporary container used for configuration.
Bash
# Removes the temporary container
docker rm lab_setup_temp
Archiving for Distribution: The final image is saved as an archive file (a .tar file). This single file is the portable lab environment that can be easily shared or stored on a network drive.
Bash
# Saves the target image to a .tar archive
docker save ds-ml-lab-env:v1.0 -o ds_ml_lab_env_v1.0.tar
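Note that docker save writes an uncompressed .tar; where bandwidth or storage is a concern, the stream can be piped through gzip (docker load accepts gzip-compressed archives directly):
Bash
# Saves and compresses the image in one step
docker save ds-ml-lab-env:v1.0 | gzip > ds_ml_lab_env_v1.0.tar.gz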
User Deployment (in the Lab): Students load the image from the archive onto their lab machine and start working.
Bash
# Loads the image archive on the lab computer
docker load -i ds_ml_lab_env_v1.0.tar
# Creates an interactive container named "ds-ml-lab" from image "ds-ml-lab-env:v1.0",
# mapping the host's directory "C:\Users\Student" (seen from WSL as /mnt/c/Users/Student) to the container's '/home/host' directory,
# forwarding port 8888 for Jupyter and port 6006 for TensorBoard, and
# enabling auto-removal on exit.
docker run -i -t --mount type=bind,source=/mnt/c/Users/Student,target=/home/host --name ds-ml-lab -p 8888:8888 -p 6006:6006 --rm -w /home ds-ml-lab-env:v1.0
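Once the container is running, the browser-based workflow is one command away. The sketch below assumes JupyterLab was installed in the image, as in the build step above:
Bash
# [Inside the container] Starts JupyterLab, reachable from the host browser at http://localhost:8888
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root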
Containers can also be used on Windows computers by running the Docker engine inside a Linux distribution on Windows Subsystem for Linux (WSL), following the instructions below.
Log in to Windows as a standard user.
Open PowerShell as administrator and execute the following commands to install WSL.
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart # Enables the Virtual Machine Platform (optional Windows feature)
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart # Installs WSL (optional Windows feature).
Restart the computer for the changes to take effect and log in as an administrator once again.
Execute wsl_update_x64.msi as administrator and follow the on-screen instructions to update the installed WSL to 2.6.1. The installer is available at https://wslstorestorage.blob.core.windows.net/wslblob/wsl_update_x64.msi. Refer to https://learn.microsoft.com/en-us/windows/wsl/install-manual for more details.
Check whether the WSL version has been upgraded to 2.6.1 by executing the command wsl --version. If it does not show the expected version, upgrade WSL directly from the web by executing the command wsl --update --web-download, and check the version again after installation.
Reopen PowerShell as a standard user and execute the following commands to install a Linux distribution on WSL. Ubuntu is used in this article.
mkdir C:\Users\Student\wslDistroStorage\Ubuntu # Creates storage `wslDistroStorage` for Ubuntu distribution in a folder `C:\Users\Student`.
wsl --import Ubuntu C:\Users\Student\wslDistroStorage\Ubuntu "<drive:\...>\ubuntu-24.04.3-wsl-amd64.wsl" # Imports the Ubuntu installation from the file archive, available for download at https://releases.ubuntu.com/noble/ubuntu-24.04.3-wsl-amd64.wsl. For more details, refer to https://ubuntu.com/desktop/wsl.
wsl --list --verbose # Lists the installed distributions. The newly imported distribution should appear in the list.
Exit and reopen PowerShell as a standard user and continue with the following instructions.
wsl -d Ubuntu # Runs imported Linux distribution with working directory as C:\Users\<username>. Alternatively, command wsl ~ -d Ubuntu starts instance with Ubuntu `home` as the working directory.
sudo apt update && sudo apt upgrade -y # Updates and upgrades Ubuntu packages
Continue executing the following commands in the Ubuntu shell opened in the same PowerShell window to install Docker using the apt repository.
Set up Docker's apt repository [https://docs.docker.com/engine/install/ubuntu/]:
# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update
Install the Docker packages
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Check whether the Docker service was auto-started.
docker images # Should show no images are available immediately after installation
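If Docker commands report that the daemon is not running (common on WSL distributions without systemd), start the service manually; the hello-world sanity check below is optional.
sudo service docker start # Starts the Docker daemon manually
sudo docker run --rm hello-world # Optional sanity check: pulls a tiny image, prints a greeting, and removes the container on exit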
Load Docker Image
If the Docker image is available as split files due to its large size, first concatenate the binary split files back into an archive (.tar) file. This step may take several seconds to complete depending on the size of the splits. Change directory to the one containing the split files and execute the following command in a Windows Command Prompt (copy /b is a cmd built-in).
copy /b <split file name #1> + <split file name #2> + ... + <split file name #n> <tar-file-name>.tar # '/b' indicates binary mode.
Load the Docker image from the archive. This step may take several minutes to complete. Do not exit from the console.
docker load --input <tar-file-name>
The lab setup is now complete. Exit from the PowerShell console. An "Ubuntu" app should now be available in the Windows Start menu to start working with. If the "Ubuntu" app is not available, opening the "WSL" app will start Ubuntu as the default Linux distribution.
The container technology, exemplified by Docker, represents a paradigm shift from brittle, high-maintenance local development setups to portable, reproducible, and instantly recoverable environments. For educational institutions, this solution is especially appropriate as it drastically reduces the IT overhead associated with package management and version control in large computer labs. By ensuring every student begins with an identical, perfectly configured development environment and by leveraging features like image immutability and volume binding for persistent work, containers pave the way for a more efficient, reliable, and frustration-free learning experience in Data Science and Machine Learning.
Quick reference: frequently used Docker commands.
docker images # Lists all local images on the host machine.
docker pull <image-name>:<tag> # Fetches and downloads an image from a repository (like Docker Hub).
docker image history <image name> # Lists layers in the image
docker container run [OPTIONS] IMAGE [COMMAND] [ARG...] # Creates and runs a new container from an image
Example:
docker run -i -t --mount type=bind,source=<host system dir to map>,target=<docker file system dir to map> --name <container name> -p <host port>:<container port> --rm -w <work dir on docker file system> <image name>[:tag] /bin/bash # Creates and starts a new container interactively (-i, -t) and it gets removed after user exits (--rm)
docker ps # Lists all currently running containers.
docker ps -a # Lists all containers (running and stopped).
docker stop <container-id/name> # Gracefully stops a running container.
docker rmi <image-id/name> # Removes a local image.
docker commit <container-name> <new-image-name>:<tag> # Creates a new image by committing changes from a running container.
docker save <image-name> -o <file-name.tar> # Archives a local image to a .tar file for sharing/distribution.
docker load -i <file-name.tar> # Loads an image from a .tar archive back into the local image store.
docker rm <container-id/name> # Removes a stopped container.
docker start -ai <container_id> # Starts container in interactive mode
docker attach <container_id> # Attaches the terminal & stdin to container
docker exec -it <container_id> /bin/bash # Allows executing commands in a running container over an interactive bash shell
docker inspect <object_id> # Provides detailed, low-level information about Docker objects such as containers, images, networks, volumes, and more.
docker export --output <output file name>.tar <container> # Exports a container's filesystem as a tar archive
docker import <tar file> <image name>[:tag] # Imports image from a tar archive (use this command if a container was exported as tar file)
docker volume ls # Lists volumes
docker volume inspect <volume name> # Inspects a volume
docker volume create <volume name> # Creates a volume
docker run --mount type=volume[,src=<volume-name>],dst=<mount-path>[,<key>=<value>...] # Mounts a volume
docker run --mount type=bind,src=<host-path>,dst=<container-path>[,bind-propagation=rshared] # Mounts a bind [Note: Bind propagation is an advanced topic and many users never need to configure it.]
docker volume rm <volume name> # Removes a volume
docker volume prune # Removes unused volumes
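As a minimal example of the volume commands above (names are illustrative), a named volume can carry a shared dataset across containers:
docker volume create lab-data # Creates a named volume
docker run -it --rm --mount type=volume,src=lab-data,dst=/data <image name>[:tag] /bin/bash # Mounts the volume at /data inside the container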
Splitting large files and combining them back.
split -b 4G <input file> <prefix> # Splits larger files into smaller files containing consecutive or interleaved sections of input
Example:
split -b 4G <input file name>.tar <input file name>.tar.part.
To combine the split files on Windows, concatenate them with the copy /b command shown in the "Load Docker Image" step above. On Linux or inside WSL, the parts can be concatenated with cat as shown below.
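A minimal sketch using cat; the names must match the prefix given to split above:
cat <input file name>.tar.part.* > <input file name>.tar # Concatenates the parts, in lexical order, back into the original archive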