Submitting Jobs
Galvani Login Nodes
To access Galvani and submit jobs, use one of the available login nodes:
galvani-login.mlcloud.uni-tuebingen.de
or
134.2.168.114
Galvani Slurm Partitions
Queue Name | Max Job Duration | Min/Max Nodes per Job | Max Nodes per User | Max Jobs per User | Cost |
---|---|---|---|---|---|
cpu-galvani | 30 days | 2 | ** | ** | CPU=0.218, Mem=0.0126G |
a100-galvani | 3 days | 1 | ** | ** | CPU=1.406, Mem=0.1034G, gres/gpu=11.25 |
2080-galvani | 3 days | 2 | ** | ** | CPU=0.278, Mem=0.0522G, gres/gpu=2.5 |
2080-preemptable-galvani | 3 days | 4 | ** | ** | CPU=0.0695, Mem=0.01305G, gres/gpu=0.625 |
a100-preemptable-galvani | 3 days | 1 | ** | ** | CPU=0.3515, Mem=0.02585G, gres/gpu=2.813 |
a100-fat-galvani | 3 days | 1 | ** | ** | CPU=0.4922, Mem=0.07238G, gres/gpu=14 |
Important
Queues and limits are subject to change. Entries marked with ** are presently in a fair-decision phase and will be updated when this process has completed.
QOS and the Job Queue
To ensure fair availability of resources, there is a limit on the number of jobs a user may have running simultaneously.
A job whose status in `squeue` is listed as `(QOSMaxJobsPerUserLimit)` is not executing because the user has reached the limit of simultaneously executing jobs.
The details of QOS are presently in a fair-decision phase and will be updated when this process has completed.
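As a rough sketch of how to spot jobs held back by this limit, the snippet below filters `squeue` output for the `QOSMaxJobsPerUserLimit` reason. The helper name `jobs_at_qos_limit` is made up for illustration; the `squeue` format fields used are `%i` (job ID), `%T` (state) and `%r` (reason). The `squeue` call only runs where Slurm is installed.

```shell
# Reads "JOBID STATE REASON" lines on stdin and prints the IDs of jobs
# whose reason is QOSMaxJobsPerUserLimit (with or without parentheses).
jobs_at_qos_limit() {
    awk '{
        reason = $3
        gsub(/[()]/, "", reason)          # tolerate "(Reason)" formatting
        if (reason == "QOSMaxJobsPerUserLimit") print $1
    }'
}

# On Galvani (requires Slurm; skipped on machines without squeue):
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o "%i %T %r" | jobs_at_qos_limit
fi
```

Jobs listed by this helper should start on their own once some of your running jobs finish.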
Job Cost
A compute job's cost is determined by which resources it uses and how long it uses them, and is booked against your fairshare allocation. The more jobs you submit, the more of your fairshare allocation you will use.
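The cost of your job is the maximum of the costs of the individual resources it consumes, where each resource's cost is the amount used multiplied by the partition's weight for that resource.
For example, if you submit a job to the `2080-galvani` partition that uses 10 CPUs, 50G RAM, and 1 GPU, then:
cost = MAX(10 * 0.278, 50 * 0.0522, 1 * 2.5) = MAX(2.78, 2.61, 2.5) = 2.78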
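The same calculation can be sketched as a small shell helper. The function name `job_cost` and its argument order are assumptions made for this illustration; the weights must be taken from the partition table above, and the result reflects only the MAX rule described here, not the final fairshare calculation.

```shell
# job_cost CPUS MEM_GB GPUS CPU_WEIGHT MEM_WEIGHT GPU_WEIGHT
# Prints the maximum of the three per-resource costs.
job_cost() {
    awk -v cpus="$1" -v mem="$2" -v gpus="$3" \
        -v wc="$4" -v wm="$5" -v wg="$6" 'BEGIN {
            c = cpus * wc      # CPU cost
            m = mem * wm       # memory cost (per G)
            g = gpus * wg      # GPU cost
            max = c
            if (m > max) max = m
            if (g > max) max = g
            printf "%.4g\n", max
        }'
}

# Example from the text: 10 CPUs, 50G RAM, 1 GPU on 2080-galvani
job_cost 10 50 1 0.278 0.0522 2.5   # prints 2.78
```

Here the CPU term dominates, so requesting fewer CPUs would lower the cost, while dropping the GPU alone would not.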
These costs are used to calculate your fairshare allocation and fairshare consumption. The details of this calculation are currently in a fair-decision phase and will be updated when this process is completed.
Warning
Accounting and fairshare are based on the resources you block rather than on what you request. For example, requesting `#gpu=1` and `#cpu=64` on a node with 64 CPU cores and 8 GPUs will still charge your fairshare for the equivalent of a whole GPU node, because nobody else can use it.
Submit a Job to Galvani
Before submitting a job (interactive job or through batch), determine the appropriate partition given your computational requirements from the partitions list (see above).
You can also find information about the current partitions by entering `sinfo -S+P -o "%18P %8a %20F"` in your terminal when connected to Galvani.
Additionally, you can list the GPUs available on the cluster by running `scontrol show node | grep "CfgTRES.*gres"`.
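Building on the `scontrol` command above, the following sketch adds up the configured GPUs across all nodes. The helper name `total_gpus` is hypothetical, and it assumes each node's `CfgTRES` line carries an untyped `gres/gpu=N` entry; if the cluster's output differs, the pattern needs adjusting.

```shell
# Reads `scontrol show node` output on stdin and prints the sum of the
# gres/gpu counts found on CfgTRES lines (AllocTRES lines are ignored).
total_gpus() {
    grep 'CfgTRES' \
        | grep -o 'gres/gpu=[0-9]*' \
        | awk -F= '{ n += $2 } END { print n + 0 }'
}

# On Galvani (requires Slurm; skipped on machines without scontrol):
if command -v scontrol >/dev/null 2>&1; then
    scontrol show node | total_gpus
fi
```

This counts configured GPUs, not free ones; nodes that are busy or down are included in the total.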
There are several ways of submitting jobs with Slurm, using `sbatch`, `srun` or `salloc`:
Sbatch
`sbatch` is the most common way to run a job on the cluster: you write a script in advance and submit it with `sbatch batch-script.sh`.
An `sbatch` script runs a non-interactive computation and writes its output and errors to the files you specify.
When using `sbatch`, the scheduler will run the job once there are resources available.
Until your batch job begins, it will wait in a queue.
You do not need to remain connected while the job is waiting or executing.
Note that the scheduler does not start jobs on a first-come-first-served basis; many variables are computed to keep the servers busy while balancing the competing needs of all users.
An `sbatch` script also has many of its command-line parameters defined within the script itself, using lines that start with `#SBATCH`.
This lets you submit the job with just `sbatch batch-script.sh`, without extra command-line options.
Putting these parameters into the `sbatch` script itself is vastly preferable to entering them on the command line, for consistency and replicability, especially if something goes wrong.
Use the script below with your modifications (note that the `output` and `error` entries especially need modification):
#!/bin/bash
# Sample Slurm job script for Galvani
#SBATCH -J changeme # Job name
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=8 # Number of CPU cores per task
#SBATCH --nodes=1 # Ensure that all cores are on the same machine with nodes=1
#SBATCH --partition=2080-galvani # Which partition will run your job
#SBATCH --time=0-00:05 # Allowed runtime in D-HH:MM
#SBATCH --gres=gpu:2 # (optional) Requesting type and number of GPUs
#SBATCH --mem=50G # Total memory pool for all cores (see also --mem-per-cpu); exceeding this number will cause your job to fail.
#SBATCH --output=/CHANGE/THIS/PATH/TO/WORK/myjob-%j.out # File to which STDOUT will be written - make sure this is not on $HOME
#SBATCH --error=/CHANGE/THIS/PATH/TO/WORK/myjob-%j.err # File to which STDERR will be written - make sure this is not on $HOME
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=ENTER_YOUR_EMAIL # Email to which notifications will be sent
# Diagnostic and Analysis Phase - please leave these in.
scontrol show job $SLURM_JOB_ID
pwd
nvidia-smi # only if you requested gpus
ls $WORK # not necessary just here to illustrate that $WORK is available here
# Setup Phase
# Add any other setup code here, e.g.
# - copy singularity images or datasets to local on-compute-node storage like /scratch_local
# - load virtual environments, e.g. with Anaconda
# - set environment variables
# - determine command-line arguments for `srun` calls
# Compute Phase
srun python3 runfile.py # srun will automatically pickup the configuration defined via `#SBATCH` and `sbatch` command line arguments
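Once the script above is saved, a minimal sketch of submitting it and checking on it could look like this. It assumes the script is stored as `batch-script.sh` in the current directory; `parse_job_id` is a hypothetical helper that relies on `sbatch --parsable` printing either `jobid` or `jobid;cluster`. The Slurm calls only run where `sbatch` is installed.

```shell
# Extract the job ID from `sbatch --parsable` output ("jobid" or "jobid;cluster").
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

# On Galvani (requires Slurm; skipped on machines without sbatch):
if command -v sbatch >/dev/null 2>&1 && [ -f batch-script.sh ]; then
    jobid="$(parse_job_id "$(sbatch --parsable batch-script.sh)")"
    echo "Submitted job $jobid"
    squeue -j "$jobid"        # show the job's current queue status
    # scancel "$jobid"        # uncomment to cancel the job
fi
```

Capturing the job ID at submission time makes it easier to match the job to its `myjob-%j.out` and `myjob-%j.err` files later, since `%j` expands to the same ID.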
Interactive sessions with srun
Begin an interactive session using `srun`.
This will log into a compute node and give you a command prompt there, where you can issue commands and run code as if you were doing so on your personal machine.
An interactive session is a great way to develop, test, and debug code.
The `srun` command submits a new job on your behalf, providing interactive access once the job starts.
You will need to remain logged in until the interactive session begins.
A simple way to start an interactive job is to run:
srun --job-name "InteractiveJob" --partition=2080-galvani --ntasks=1 --nodes=1 --gres=gpu:1 --time 1:00:00 --pty bash
- Basic `srun` example:
In this example, a user starts an interactive job on a compute node (`galvani-cn107`), activates a `conda` environment (`p312x`), checks the status of the allocated GPUs (`nvidia-smi`) and the `nvcc` CUDA compiler, then runs a basic Python interactive shell. Finally, the user logs out of the interactive job, terminating the job and returning to `galvani-login`.
[usr123@galvani-login.sdn]
[Mon Jun 24 12:10:09] ~ $ srun --job-name "bash-prompt" --partition=2080-galvani --ntasks=1 --nodes=1 --gres=gpu:1 --time 1:00:00 --pty bash
[usr123@galvani-cn107 ~]$ conda activate p312x
(p312x) [usr123@galvani-cn107 ~]$ nvidia-smi
Mon Jun 24 12:11:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti On | 00000000:DB:00.0 Off | N/A |
| 0% 35C P8 38W / 250W | 1MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
(p312x) [usr123@galvani-cn107 ~]$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(p312x) [usr123@galvani-cn107 ~]$ python3
Python 3.6.8 (default, Oct 23 2023, 19:59:56)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>>
(p312x) [usr123@galvani-cn107 ~]$ exit
[usr123@galvani-login.sdn]
[Mon Jun 24 12:15:32] ~ $
Salloc
`salloc` functions similarly to `srun --pty bash` in that it adds your resource request to the queue. However, once the allocation starts, a new bash session starts on the login node. To run commands on the allocated node, you need to use `srun`.
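If your connection might be lost, consider requesting the allocation with `salloc --no-shell`: the allocation is then created without an attached shell, so it does not depend on your login session, and you can run commands in it with `srun --jobid=<jobid>`.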
Important
You can find more about Slurm and additional useful commands in the Slurm Tutorial.
Created: September 9, 2024