Hands-On: Your First ML Job on the Cluster

Goal: Use Singularity to run a semantic segmentation model (DeepLabV3), which labels every pixel in a photograph, as a batch job whose output can be processed later.

0 — Submitting Batch Jobs

So far we ran everything interactively. In practice, you'll submit jobs that run unattended, especially for longer training runs or processing many images. This is where Singularity really shines: the container pins the software environment, so the job script is self-contained and reproducible.

1 — Batch Job with Singularity

Save this as $WORK/ml-tutorial/segment_batch.sh:

#!/bin/bash
#SBATCH --job-name=segment
#SBATCH --partition=2080-galvani
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=00:30:00
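# Log paths are relative to the directory you submit from. #SBATCH lines do
# not expand shell variables such as $WORK, so submit from $WORK/ml-tutorial.
#SBATCH --output=logs/%j_segment.out
#SBATCH --error=logs/%j_segment.err

# ── Log directory ───────────────────────────────────────────────
# SLURM opens the log files before this script starts and does not create
# missing directories, so run this mkdir once by hand before the first sbatch.
mkdir -p "$WORK/ml-tutorial/logs"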

# ── Run inside the container ────────────────────────────────────
singularity exec --nv \
  --bind $WORK:$WORK \
  --bind $HOME:$HOME \
  --env TORCH_HOME=$WORK/.cache/torch \
  $WORK/pytorch_24.01.sif \
  python $WORK/ml-tutorial/segment.py $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/results

echo "Job finished at $(date)"
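
A missing container image or input file only shows up after the job has waited in the queue, so it can help to check the paths before submitting. A minimal sketch, assuming the paths used in the script above; preflight is a hypothetical helper name, not part of SLURM or Singularity:

```shell
#!/bin/bash
# preflight: hypothetical helper that reports missing files or directories.
# Prints one line per missing path to stderr and returns non-zero if any is missing.
preflight() {
  local missing=0 path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with the tutorial's paths (skipped when $WORK is not set).
if [ -n "${WORK:-}" ]; then
  preflight \
    "$WORK/pytorch_24.01.sif" \
    "$WORK/ml-tutorial/segment.py" \
    "$WORK/ml-tutorial/input.jpg" \
    "$WORK/ml-tutorial/logs" \
    || echo "Fix the paths above before running sbatch." >&2
fi
```

The logs directory is included in the check because SLURM does not create it for you.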

Submit and monitor:

# Submit from the tutorial directory (the logs directory must exist first)
cd $WORK/ml-tutorial
mkdir -p logs
sbatch segment_batch.sh

# Check status
squeue --me

# Once completed, check the output
cat $WORK/ml-tutorial/logs/<JOBID>_segment.out
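
If you script the submit step, the job ID can be captured rather than copied by hand. A sketch, assuming sbatch --parsable (which prints only the job ID, followed by ;cluster on multi-cluster setups); parse_job_id is a hypothetical helper, and the 30-second polling interval is an arbitrary choice:

```shell
#!/bin/bash
# parse_job_id: strip the optional ";cluster" suffix from `sbatch --parsable` output.
parse_job_id() {
  printf '%s\n' "${1%%;*}"
}

# Only attempt a real submission where SLURM is available and $WORK is set.
if command -v sbatch >/dev/null 2>&1 && [ -n "${WORK:-}" ]; then
  cd "$WORK/ml-tutorial" || exit 1
  jobid=$(parse_job_id "$(sbatch --parsable segment_batch.sh)")
  echo "Submitted job $jobid"

  # Poll until the job no longer appears in the queue.
  while [ -n "$(squeue --noheader --jobs="$jobid" 2>/dev/null)" ]; do
    sleep 30
  done

  cat "$WORK/ml-tutorial/logs/${jobid}_segment.out"
fi
```

Polling squeue is the simplest approach; for long jobs, email notifications put less load on the scheduler.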

2 — Anatomy of an sbatch Script

#!/bin/bash                          ← shell (always bash)
#SBATCH --job-name=segment           ← name shown in squeue
#SBATCH --partition=<PARTITION>      ← which queue to use
#SBATCH --gres=gpu:1                 ← request 1 GPU
#SBATCH --mem=16G                    ← memory requested per node
#SBATCH --time=00:30:00              ← walltime limit (HH:MM:SS)
#SBATCH --output=.../%j.out          ← stdout (%j = job ID)
#SBATCH --error=.../%j.err           ← stderr
#SBATCH --mail-type=END,FAIL         ← optional: email when the job ends or fails
#SBATCH --mail-user=you@example.com  ← optional: your email

# Everything below the #SBATCH block is a normal bash script.
# It runs on the allocated compute node.

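Tip: #SBATCH lines must come before any executable command. Blank lines and ordinary comments are fine, but the first executable line ends the SLURM header, and any #SBATCH line after it is silently ignored. Shell variables such as $WORK are not expanded inside #SBATCH lines.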
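
Misplaced directives are easy to miss because sbatch does not warn about them. A rough check, assuming sbatch stops reading directives at the first line that is neither blank nor a comment; late_sbatch_lines is a hypothetical helper, not a SLURM tool:

```shell
#!/bin/bash
# late_sbatch_lines: hypothetical lint helper. Prints every #SBATCH line that
# appears after the first executable (non-blank, non-comment) line of a script.
late_sbatch_lines() {
  awk '
    /^[[:space:]]*$/ { next }   # blank lines do not end the header
    /^#SBATCH/       { if (body) print FILENAME ":" FNR ": " $0; next }
    /^[[:space:]]*#/ { next }   # ordinary comments, including the shebang
                     { body = 1 }   # first real command ends the header
  ' "$1"
}

# Example: check the tutorial's job script if it exists.
if [ -n "${WORK:-}" ] && [ -f "$WORK/ml-tutorial/segment_batch.sh" ]; then
  late_sbatch_lines "$WORK/ml-tutorial/segment_batch.sh"
fi
```

No output means every #SBATCH line sits inside the header.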

3 — Processing Multiple Images

A real use case: segment every .jpg in a directory. This version uses a job array: one task per image, with tasks running concurrently as resources allow.

First, create a directory and place all your images in it:

mkdir -p $WORK/ml-tutorial/images

Then save this as $WORK/ml-tutorial/segment_array.sh:

#!/bin/bash
#SBATCH --job-name=segment-array
#SBATCH --partition=2080-galvani
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
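# Log paths are relative to the submit directory. #SBATCH lines do not
# expand $WORK, so submit from $WORK/ml-tutorial with logs/ already created.
#SBATCH --output=logs/%A_%a.out
#SBATCH --error=logs/%A_%a.err
# Adjust the range to the number of images minus 1
#SBATCH --array=0-4

mkdir -p "$WORK/ml-tutorial/logs"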

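# Build an array of input images (in the shell's glob order)
IMAGES=("$WORK"/ml-tutorial/images/*.jpg)
INPUT="${IMAGES[$SLURM_ARRAY_TASK_ID]}"

# Stop if the index points past the last image
if [ ! -f "$INPUT" ]; then
  echo "Task $SLURM_ARRAY_TASK_ID: no image at this index" >&2
  exit 1
fi

echo "Task $SLURM_ARRAY_TASK_ID processing: $INPUT"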

singularity exec --nv \
  --bind $WORK:$WORK \
  --bind $HOME:$HOME \
  --env TORCH_HOME=$WORK/.cache/torch \
  $WORK/pytorch_24.01.sif \
  python $WORK/ml-tutorial/segment.py "$INPUT" $WORK/ml-tutorial/results

echo "Task $SLURM_ARRAY_TASK_ID finished at $(date)"
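
Prepare the images, then submit and monitor:

# Prepare some images first
mkdir -p $WORK/ml-tutorial/images
cp $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/images/street1.jpg
# ... add more images ...

# Submit the array from the tutorial directory (the logs directory must exist first)
cd $WORK/ml-tutorial
mkdir -p logs
sbatch segment_array.sh

# Monitor all array tasks
squeue --me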

#SBATCH flag            Purpose
--array=0-4             Launch 5 parallel jobs (indices 0 through 4)
%A in output path       Replaced with the array master job ID
%a in output path       Replaced with the array task index
$SLURM_ARRAY_TASK_ID    Available inside the script as the current index
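
Instead of editing --array by hand whenever the number of images changes, the range can be computed at submit time. A sketch, assuming the images directory from above; array_range is a hypothetical helper, and it relies on command-line options to sbatch taking precedence over #SBATCH lines in the script:

```shell
#!/bin/bash
# array_range: hypothetical helper. Prints "0-(N-1)" for the N .jpg files in a
# directory, or prints nothing and returns 1 if the directory has no .jpg files.
array_range() {
  local files=("$1"/*.jpg)
  # With no match, bash leaves the unexpanded pattern in the array.
  [ -e "${files[0]}" ] || return 1
  echo "0-$(( ${#files[@]} - 1 ))"
}

# Only attempt a real submission where SLURM is available and $WORK is set.
if command -v sbatch >/dev/null 2>&1 && [ -n "${WORK:-}" ]; then
  cd "$WORK/ml-tutorial" || exit 1
  if range=$(array_range images); then
    # --array on the command line overrides the value in the script.
    sbatch --array="$range" segment_array.sh
  else
    echo "No .jpg files found in images/" >&2
  fi
fi
```

For large directories, a suffix such as --array=0-99%10 limits how many tasks run at the same time (here 10).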

4 — What to Try Next

  • Use your own image: python $WORK/ml-tutorial/segment.py $WORK/my_photo.jpg
  • Different model: Swap deeplabv3_mobilenet_v3_large for deeplabv3_resnet101 — larger model, better accuracy, same interface.
  • Multi-GPU: Request two GPUs with --gres=gpu:2 and modify segment.py to use torch.nn.DataParallel.
  • Email notifications: Add --mail-type=END,FAIL and --mail-user= to get notified when jobs finish.

Cheat Sheet

# ── Interactive ──────────────────────────────────────────────────
srun --partition=<PARTITION> --gres=gpu:1 --mem=16G --time=01:00:00 --pty bash

# ── Batch ────────────────────────────────────────────────────────
cd $WORK/ml-tutorial && sbatch segment_batch.sh   # single job (Singularity)
cd $WORK/ml-tutorial && sbatch segment_array.sh   # job array, one task per image

# ── Monitoring ───────────────────────────────────────────────────
squeue --me                           # your running/pending jobs
scancel <JOBID>                       # cancel a job
sacct -j <JOBID> --format=JobID,State,Elapsed,MaxRSS   # job stats

Last update: March 6, 2026
Created: March 5, 2026