Hands-On: Your First ML Job on the Cluster

Goal: Run a semantic segmentation model (DeepLabV3) that labels every pixel in a photograph — using both Conda and Singularity. Same script, two environments.


0 — Directory Convention

Path    Purpose                          Backed up?
$HOME   Scripts, configs, small files    No
$WORK   Datasets, model caches, results  No

Rule of thumb: Code lives in $HOME, data lives in $WORK.
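
Before copying anything, it can help to confirm that both variables point somewhere usable. The following is a minimal sketch, assuming the site exports $WORK at login; the function name show_storage is illustrative and not part of the cluster tooling. On shared filesystems, df reports the whole filesystem rather than your personal quota, so treat the numbers as a rough guide.

```shell
#!/usr/bin/env bash
# Sketch: report whether $HOME and $WORK are set and how full they are.
# Assumption: the site exports $WORK at login; the function name is illustrative.
show_storage() {
    local name path
    for name in HOME WORK; do
        path="${!name:-}"              # indirect expansion: value of $HOME / $WORK
        if [ -z "$path" ]; then
            echo "$name is not set"
        elif [ ! -d "$path" ]; then
            echo "$name=$path does not exist"
        else
            echo "$name=$path"
            df -h "$path" | tail -n 1  # usage of the filesystem holding that path
        fi
    done
}

show_storage
```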


1 — Get an Interactive GPU Session

# Adjust partition/account to your allocation
srun --partition=2080-galvani --gres=gpu:1 --mem=16G --time=02:00:00 --reservation=hands-on --pty bash

Verify you landed on a GPU node:

nvidia-smi
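
If you prefer a scripted check, the same verification can be sketched as a small function. It relies on SLURM_JOB_ID, which Slurm sets inside a job, and on nvidia-smi -L, which prints one line per visible GPU. The function name check_gpu_node is an illustrative choice.

```shell
#!/usr/bin/env bash
# Sketch: confirm the shell is inside a Slurm job and can see a GPU.
# The function name is illustrative; SLURM_JOB_ID is set by Slurm inside a job.
check_gpu_node() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Not inside a Slurm job (are you still on the login node?)"
        return 1
    fi
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on ${HOSTNAME:-this host}"
        return 1
    fi
    echo "Job $SLURM_JOB_ID on ${HOSTNAME:-this host}:"
    nvidia-smi -L    # one line per visible GPU
}

check_gpu_node || echo "GPU check failed"
```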

2 — Prepare the container and files

mkdir -p $WORK/ml-tutorial
cd $WORK/ml-tutorial

rsync -a --progress /mnt/lustre/datasets/hands-on/ $WORK/ml-tutorial
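
To confirm the copy worked, you can check for the two files that the run commands later in this tutorial expect, segment.py and input.jpg. This is a minimal sketch; the function name check_tutorial_files is hypothetical, and the dataset directory may contain other files that are not checked here.

```shell
#!/usr/bin/env bash
# Sketch: check that the files used later in this tutorial were copied.
# segment.py and input.jpg are the names used in the run commands;
# the function name is illustrative.
check_tutorial_files() {
    local dir="$1" missing=0 f
    for f in segment.py input.jpg; do
        if [ -s "$dir/$f" ]; then    # -s: file exists and is not empty
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=1
        fi
    done
    return "$missing"
}

check_tutorial_files "${WORK:-.}/ml-tutorial" || echo "Re-run the rsync step above."
```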

4A — Run with Conda

One-time setup (ALREADY DONE!)

# Install Miniconda (if you haven't already)
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $WORK/miniconda3
eval "$($WORK/miniconda3/bin/conda shell.bash hook)"

# Create environment — use pip for PyTorch (ships self-contained CUDA/MKL libs)
conda create -y -n segdemo python=3.11 numpy pillow
conda activate segdemo
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

Run

eval "$($WORK/miniconda3/bin/conda shell.bash hook)"
conda activate segdemo

# Cache model weights on $WORK to avoid filling $HOME
export TORCH_HOME=$WORK/.cache/torch

python $WORK/ml-tutorial/segment.py $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/results

Expected output (the first run also downloads ~40 MB of model weights):

Device: cuda
Loading DeepLabV3-MobileNetV3-Large …
  Model ready in 2.3s
Running inference …
  Inference done in 0.034s

Detected classes:
     background   42.1%
        bicycle    1.8%
            car   11.3%
         person   18.7%
           ...

Saved: /path/to/results/input_segmentation.png
Saved: /path/to/results/input_overlay.png

Clean up

conda deactivate

4B — Run with Singularity (NGC Container)

One-time setup

# Pull an NGC PyTorch container (includes torchvision)
# Store the .sif on $WORK; it is ~9 GB
singularity pull $WORK/ml-tutorial/pytorch_24.01.sif \
  docker://nvcr.io/nvidia/pytorch:24.01-py3

This only needs to happen once. The .sif is reusable.

Run

singularity exec --nv \
  --bind $WORK:$WORK \
  --bind $HOME:$HOME \
  --env TORCH_HOME=$WORK/.cache/torch \
  $WORK/ml-tutorial/pytorch_24.01.sif \
  python $WORK/ml-tutorial/segment.py $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/results

Flag     Purpose
--nv     Exposes host NVIDIA drivers/GPUs inside the container
--bind   Makes host directories visible inside the container
--env    Sets environment variables inside the container

You should see identical output to the Conda run above.
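
If you want to compare the two runs file by file, one option is to write each run to its own output directory and compare the PNGs byte for byte. The sketch below assumes you passed results_conda and results_singularity as the output directory in the two runs; those directory names and the function name compare_results are hypothetical. Because the two environments may ship different PyTorch builds, a few pixels can differ even when the printed class percentages match, so a "differs" line is not necessarily an error.

```shell
#!/usr/bin/env bash
# Sketch: byte-wise comparison of the PNG files in two result directories.
# Directory names and the function name are illustrative assumptions.
compare_results() {
    local a="$1" b="$2" f name status=0
    for f in "$a"/*.png; do
        # An unmatched glob stays literal, so test that the file really exists
        [ -e "$f" ] || { echo "no PNG files in $a"; return 2; }
        name=$(basename "$f")
        if [ ! -e "$b/$name" ]; then
            echo "only in $a: $name"; status=1
        elif cmp -s "$f" "$b/$name"; then
            echo "identical: $name"
        else
            echo "differs:   $name"; status=1
        fi
    done
    return "$status"
}

compare_results "${WORK:-.}/ml-tutorial/results_conda" \
                "${WORK:-.}/ml-tutorial/results_singularity" || true
```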


5 — Inspect Results

Copy the result images to your local machine (e.g. with scp or rsync) and open them:

  • input_segmentation.png — every pixel coloured by class (person = dark green, car = dark blue, bicycle = green, …)
  • input_overlay.png — the original photo blended with the segmentation map

6 — Quick Comparison

Aspect           Conda                             Singularity
Setup effort     Create env, install packages      Pull one container image
Reproducibility  Depends on conda solver           Exact image hash; fully reproducible
Disk usage       ~3 GB (env)                       ~9 GB (.sif)
Startup time     Fast (native)                     Slightly slower (container init)
Best for         Rapid prototyping, custom stacks  Production runs, shared pipelines

7 — What to Try Next

  • Use your own image: python $WORK/ml-tutorial/segment.py $WORK/my_photo.jpg $WORK/ml-tutorial/results
  • Batch job: Wrap the python … call in an sbatch script for non-interactive runs.
  • Different model: Swap deeplabv3_mobilenet_v3_large for deeplabv3_resnet101 — larger model, better accuracy, same interface.
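
The batch-job suggestion above could look roughly like the following. This is a sketch under stated assumptions, not a site-approved template: <PARTITION> is a placeholder as in the cheat sheet, Miniconda is assumed to live in $WORK/miniconda3 as in the one-time setup, and the file name segment_job.sh, the job name, and the log file pattern are illustrative. The snippet only writes the script and syntax-checks it; it does not submit anything.

```shell
#!/usr/bin/env bash
# Sketch: write a Slurm batch script for the Conda run, then syntax-check it.
# Assumptions: <PARTITION> is a placeholder for your partition, Miniconda lives
# in $WORK/miniconda3 (as in the one-time setup), the file name is illustrative.
cat > segment_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=segdemo
#SBATCH --partition=<PARTITION>
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=segdemo-%j.out

# Same steps as the interactive Conda run
eval "$("$WORK"/miniconda3/bin/conda shell.bash hook)"
conda activate segdemo
export TORCH_HOME="$WORK/.cache/torch"

python "$WORK/ml-tutorial/segment.py" \
    "$WORK/ml-tutorial/input.jpg" \
    "$WORK/ml-tutorial/results"
EOF

# Parse the script without running it
bash -n segment_job.sh && echo "segment_job.sh written"

# Submit with:  sbatch segment_job.sh
# Monitor with: squeue -u "$USER"
```

The quoted 'EOF' keeps $WORK unexpanded in the file, so it is resolved on the compute node when the job runs. In --output, %j is replaced by the job ID, which gives each run its own log file.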

Cheat Sheet

# Interactive GPU session
srun --partition=<PARTITION> --gres=gpu:1 --mem=16G --time=01:00:00 --pty bash

# Conda
conda activate segdemo
export TORCH_HOME=$WORK/.cache/torch
python $WORK/ml-tutorial/segment.py $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/results

# Singularity
singularity exec --nv --bind $WORK:$WORK --bind $HOME:$HOME \
  --env TORCH_HOME=$WORK/.cache/torch \
  $WORK/ml-tutorial/pytorch_24.01.sif \
  python $WORK/ml-tutorial/segment.py $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/results

Last update: March 6, 2026
Created: March 5, 2026