Hands-On: Your First ML Job on the Cluster
Goal: Run a semantic segmentation model (DeepLabV3), which labels every pixel in a photograph, inside a Singularity container as a batch job, so the results are saved for later processing.
0 — Submitting Batch Jobs
So far we have run everything interactively. In practice, you'll submit jobs that run unattended, especially for longer training runs or for processing many images. This is where Singularity really shines: the container pins the software environment, so the job script is self-contained and reproducible.
1 — Batch Job with Singularity
Save this as $WORK/ml-tutorial/segment_batch.sh:
#!/bin/bash
#SBATCH --job-name=segment
#SBATCH --partition=2080-galvani
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=00:30:00
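#SBATCH --output=logs/%j_segment.out
#SBATCH --error=logs/%j_segment.err
# sbatch does not expand $WORK (or any other variable) in #SBATCH lines, so the
# log paths above are relative to the directory you submit from: run sbatch
# from $WORK/ml-tutorial. SLURM opens the log files before this script starts,
# so create logs/ before submitting; a mkdir in this script would come too late.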
# ── Run inside the container ────────────────────────────────────
singularity exec --nv \
--bind $WORK:$WORK \
--bind $HOME:$HOME \
--env TORCH_HOME=$WORK/.cache/torch \
$WORK/pytorch_24.01.sif \
python $WORK/ml-tutorial/segment.py $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/results
echo "Job finished at $(date)"
Submit and monitor:
# Create the log directory, then submit from the tutorial directory
cd $WORK/ml-tutorial
mkdir -p logs
sbatch segment_batch.sh
# Check status
squeue --me
# Once completed, check the output
cat $WORK/ml-tutorial/logs/<JOBID>_segment.out
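sbatch does not expand shell variables such as $WORK inside #SBATCH lines, but your login shell does expand them on the command line, and command-line options take precedence over the directives in the script. The following is a minimal sketch of a wrapper built on that, assuming the directory layout used in this tutorial. The function name submit_with_logs and the SUBMIT_CMD variable (there only to allow a dry run) are illustrative and not part of SLURM:

```shell
# Hypothetical helper: submit a job script with absolute log paths.
# Usage: submit_with_logs <log_dir> <job_script> [extra sbatch options...]
submit_with_logs() {
    local log_dir=$1 script=$2
    shift 2
    # SLURM opens the log files before the job script starts,
    # so the directory has to exist at submit time.
    mkdir -p "$log_dir" || return 1
    # %j is left untouched for SLURM to replace with the job ID.
    # SUBMIT_CMD defaults to sbatch; set SUBMIT_CMD=echo for a dry run.
    "${SUBMIT_CMD:-sbatch}" \
        --output="$log_dir/%j_segment.out" \
        --error="$log_dir/%j_segment.err" \
        "$@" "$script"
}
```

Called as submit_with_logs "$WORK/ml-tutorial/logs" "$WORK/ml-tutorial/segment_batch.sh", this should work from any directory, because the log paths no longer depend on where you submit from.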
2 — Anatomy of an sbatch Script
#!/bin/bash ← shell (always bash)
#SBATCH --job-name=segment ← name shown in squeue
#SBATCH --partition=<PARTITION> ← which queue to use
#SBATCH --gres=gpu:1 ← request 1 GPU
#SBATCH --mem=16G ← RAM limit
#SBATCH --time=00:30:00 ← walltime limit (HH:MM:SS)
#SBATCH --output=.../%j.out ← stdout (%j = job ID)
#SBATCH --error=.../%j.err ← stderr
#SBATCH --mail-type=END,FAIL ← optional: email when the job ends or fails
#SBATCH --mail-user=you@example.com ← optional: your email
# Everything below the #SBATCH block is a normal bash script.
# It runs on the allocated compute node.
Tip:
#SBATCH lines must come before any executable command. sbatch stops reading them at the first line that is neither blank nor a comment, so any #SBATCH line after that point is silently ignored.
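Because a misplaced directive is ignored without any warning, it can help to check a script before submitting it. The sketch below prints every #SBATCH line that appears after the first executable command. It is a rough approximation of sbatch's header rule (blank lines and comments are skipped, anything else ends the header), and the function name late_sbatch_lines is made up for this example:

```shell
# Hypothetical helper: print any #SBATCH line that appears after the first
# executable command, where sbatch would no longer read it.
# Usage: late_sbatch_lines <job_script>
late_sbatch_lines() {
    awk '
        /^#SBATCH/ && seen_command { printf "line %d: %s\n", NR, $0; next }
        /^[[:space:]]*$/ { next }   # blank lines do not end the header
        /^[[:space:]]*#/ { next }   # neither do comments (or the shebang)
        { seen_command = 1 }        # first real command ends the header
    ' "$1"
}
```

An empty result means no directive was found below the header; any printed line should be moved above the first command.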
3 — Processing Multiple Images
A real use case: segment every .jpg in a directory. This version uses a job array: one task per image, with the tasks running in parallel as far as free GPUs allow.
First, create a directory and place all your images in it:
mkdir -p $WORK/ml-tutorial/images
Then save the following as $WORK/ml-tutorial/segment_array.sh:
#!/bin/bash
#SBATCH --job-name=segment-array
#SBATCH --partition=2080-galvani
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
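#SBATCH --output=logs/%A_%a.out
#SBATCH --error=logs/%A_%a.err
#SBATCH --array=0-4 # adjust to number of images minus 1
# Log paths are relative to the submit directory because sbatch does not expand
# $WORK in #SBATCH lines. Create logs/ before submitting; SLURM opens the log
# files before this script starts.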
# Build an array of input images (glob results are sorted, so every task sees the same order)
IMAGES=("$WORK"/ml-tutorial/images/*.jpg)
INPUT=${IMAGES[$SLURM_ARRAY_TASK_ID]}
echo "Task $SLURM_ARRAY_TASK_ID processing: $INPUT"
singularity exec --nv \
--bind $WORK:$WORK \
--bind $HOME:$HOME \
--env TORCH_HOME=$WORK/.cache/torch \
$WORK/pytorch_24.01.sif \
python $WORK/ml-tutorial/segment.py "$INPUT" $WORK/ml-tutorial/results
echo "Task $SLURM_ARRAY_TASK_ID finished at $(date)"
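If --array covers more indices than there are images, ${IMAGES[$SLURM_ARRAY_TASK_ID]} expands to an empty string and segment.py is handed an empty path. A defensive variant of the lookup could look like the sketch below. The function name pick_image is hypothetical, and the sketch assumes the images directory does not change while the array is running:

```shell
# Hypothetical helper: print the image for a given array index, or fail
# with a message if the index has no matching file.
# Usage: pick_image <index> <image_dir>
pick_image() {
    local index=$1 dir=$2
    local images=("$dir"/*.jpg)
    # An unmatched glob stays as the literal pattern, which is not a real file.
    [[ -e ${images[0]} ]] || images=()
    if (( index < 0 || index >= ${#images[@]} )); then
        echo "No image for index $index (${#images[@]} found in $dir)" >&2
        return 1
    fi
    printf '%s\n' "${images[index]}"
}
```

Inside the array script this would replace the two lookup lines with: INPUT=$(pick_image "$SLURM_ARRAY_TASK_ID" "$WORK/ml-tutorial/images") || exit 1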
# Prepare some images first
mkdir -p $WORK/ml-tutorial/images
cp $WORK/ml-tutorial/input.jpg $WORK/ml-tutorial/images/street1.jpg
# ... add more images ...
# Create the log directory, then submit the array from the tutorial directory
cd $WORK/ml-tutorial
mkdir -p logs
sbatch segment_array.sh
# Monitor all array tasks
squeue --me
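Once the array has left the queue, the per-task logs show which tasks reached the end of the script. The sketch below lists the task indices whose .out file lacks the final "finished at" line that the array script prints. It assumes the <array job ID>_<task index>.out naming used above, and the function name unfinished_tasks is made up for this example:

```shell
# Hypothetical helper: list array task indices whose .out log lacks the
# final "finished at" line written by the job script.
# Usage: unfinished_tasks <log_dir> <array_job_id>
unfinished_tasks() {
    local log_dir=$1 job_id=$2 log index
    for log in "$log_dir/${job_id}"_*.out; do
        [ -e "$log" ] || continue          # no logs matched the pattern
        if ! grep -q "finished at" "$log"; then
            index=${log##*_}               # strip everything up to the last "_"
            echo "${index%.out}"           # strip the .out suffix
        fi
    done
}
```

A listed index usually means the task was killed (for example by the walltime limit) or is still running. The check says nothing about Python errors, because the script prints its final line even when segment.py fails; for those, look at the matching .err file or the State column of sacct.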
| #SBATCH flag | Purpose |
|---|---|
| --array=0-4 | Launch 5 parallel jobs (indices 0 through 4) |
| %A in output path | Replaced with the array master job ID |
| %a in output path | Replaced with the array task index |
| $SLURM_ARRAY_TASK_ID | Available inside the script as the current index |
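Rather than editing --array by hand whenever the number of images changes, the range can be computed at submit time and passed on the command line, where it takes precedence over the value in the script. A minimal sketch, with array_range as a hypothetical helper name:

```shell
# Hypothetical helper: print an --array range covering every .jpg in a directory.
# Usage: array_range <image_dir>
array_range() {
    local dir=$1
    local images=("$dir"/*.jpg)
    # An unmatched glob stays as the literal pattern, which is not a real file.
    if [[ ! -e ${images[0]} ]]; then
        echo "No .jpg files in $dir" >&2
        return 1
    fi
    # Indices start at 0, so the last index is the image count minus 1.
    echo "0-$(( ${#images[@]} - 1 ))"
}
```

Usage: sbatch --array="$(array_range "$WORK/ml-tutorial/images")" segment_array.sh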
4 — What to Try Next
- Use your own image: python $WORK/ml-tutorial/segment.py $WORK/my_photo.jpg
- Different model: Swap deeplabv3_mobilenet_v3_large for deeplabv3_resnet101 (larger model, better accuracy, same interface).
- Multi-GPU: Add --gres=gpu:2 and modify the script to use torch.nn.DataParallel.
- Email notifications: Add --mail-type=END,FAIL and --mail-user= to get notified when jobs finish.
Cheat Sheet
# ── Interactive ──────────────────────────────────────────────────
srun --partition=<PARTITION> --gres=gpu:1 --mem=16G --time=01:00:00 --pty bash
# ── Batch ────────────────────────────────────────────────────────
cd $WORK/ml-tutorial && sbatch segment_batch.sh # single job (Singularity)
# ── Monitoring ───────────────────────────────────────────────────
squeue --me # your running/pending jobs
scancel <JOBID> # cancel a job
sacct -j <JOBID> --format=JobID,State,Elapsed,MaxRSS # job stats
Created: March 5, 2026