
Submitting Jobs

Galvani Login Nodes

To access Galvani and submit jobs, use one of the available login nodes:

galvani-login.mlcloud.uni-tuebingen.de

or

134.2.168.114
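
For example, assuming <username> is your ML Cloud account name, you can connect via SSH:

ssh <username>@galvani-login.mlcloud.uni-tuebingen.de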

Galvani Slurm Partitions

Queue Name                 Max Job Duration   Min/Max Nodes per Job   Max Nodes per User   Max Jobs per User   Cost
cpu-galvani                30 days            2                       **                   **                  CPU=0.218, Mem=0.0126G
a100-galvani               3 days             1                       **                   **                  CPU=1.406, Mem=0.1034G, gres/gpu=11.25
2080-galvani               3 days             2                       **                   **                  CPU=0.278, Mem=0.0522G, gres/gpu=2.5
2080-preemptable-galvani   3 days             4                       **                   **                  CPU=0.0695, Mem=0.01305G, gres/gpu=0.625
a100-preemptable-galvani   3 days             1                       **                   **                  CPU=0.3515, Mem=0.02585G, gres/gpu=2.813
a100-fat-galvani           3 days             1                       **                   **                  CPU=0.4922, Mem=0.07238G, gres/gpu=14

Important

Queues and limits are subject to change. Entries marked with ** are presently in a fair-decision phase and will be updated when this process has completed.

QOS and the Job Queue

To ensure fair availability of resources, there is a limit on the number of jobs a user may have running simultaneously. A job that squeue lists with the reason (QOSMaxJobsPerUserLimit) is not executing because the user has reached this limit. The details of the QOS setup are presently in a fair-decision phase and will be updated when this process has completed.
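
One quick way to check this is to inspect the reason column squeue reports for your pending jobs; the format string below is just one possible layout:

# Show your jobs together with the reason a pending job is not starting,
# e.g. (QOSMaxJobsPerUserLimit)
squeue -u $USER -o "%.12i %.20P %.20j %.8T %.25R"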

Job Cost

A compute job's cost is determined by which resources it uses and how long it uses them, and is booked against your fairshare allocation. The more jobs you submit, the more of your fairshare allocation you will use.

The cost of your job is calculated as the maximum over the costs of the individual resources you consume. For example, if you submit a job to the 2080-galvani partition that uses 10 CPUs, 50G of RAM, and 1 GPU, then:

cost = MAX(10 * 0.278, 50 * 0.0522, 1 * 2.5) = MAX(2.78, 2.61, 2.5) = 2.78

These costs are used to calculate your fairshare allocation and fairshare consumption. The details of this calculation are currently in a fair-decision phase and will be updated when this process is completed.
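
To keep an eye on your standing, Slurm's sshare command reports raw usage and the resulting fairshare factor; since the accounting details are still being finalized, treat these numbers as indicative only:

# Show account, raw shares, raw usage, and fairshare factor for your user
sshare -U -o Account,User,RawShares,RawUsage,FairShare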

Warning

Accounting and fairshare are based on the amount of resources your job blocks, not only on what you explicitly reserve. For example, requesting 1 GPU and 64 CPUs on a node with 64 CPU cores and 8 GPUs will still charge your fairshare for the equivalent of a whole GPU node, because nobody else can use it.
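
As a rough illustration, assuming a node with 64 CPU cores and 8 GPUs (the layout from the warning above; the partition name is only an example), both requests below ask for a single GPU, but the second one blocks every core and is therefore charged like a whole GPU node:

# Proportional request: blocks 1 of 8 GPUs and 8 of 64 cores
srun --partition=2080-galvani --gres=gpu:1 --cpus-per-task=8 --pty bash

# Blocking request: 1 GPU but all 64 cores
srun --partition=2080-galvani --gres=gpu:1 --cpus-per-task=64 --pty bash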

Submit a Job to Galvani

Before submitting a job (interactive or via batch), determine the appropriate partition for your computational requirements from the partitions list above. You can also find the relevant information about current partitions by running sinfo -S+P -o "%18P %8a %20F" in a terminal connected to Galvani. Additionally, you can list the GPUs available on the cluster by running scontrol show node | grep "CfgTRES.*gres".
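
For reference, the same two commands with brief comments:

# List partitions with their availability and node counts (allocated/idle/other/total)
sinfo -S+P -o "%18P %8a %20F"

# Show the GPU resources (gres) configured on each node
scontrol show node | grep "CfgTRES.*gres"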

There are several ways of submitting jobs with Slurm: using sbatch, srun, or salloc.

Sbatch

sbatch is the most common way to run a job on the cluster: you write a batch script and submit it with sbatch batch-script.sh. An sbatch job runs a non-interactive computation and writes its output and errors to the locations you specify.

When using sbatch, the scheduler will run the job once resources become available. Until then, your batch job waits in a queue; you do not need to remain connected while it is waiting or executing. Note that the scheduler does not start jobs on a first-come, first-served basis; many factors are weighed to keep the servers busy while balancing the competing needs of all users.

An sbatch script also defines many of its command-line parameters within the script itself, using lines that start with #SBATCH, so the job can be submitted with a plain sbatch batch-script.sh. Putting these options into the script rather than on the command line is vastly preferable for consistency and reproducibility, especially if something goes wrong.

Use the script below with your modifications (in particular, the output and error paths must be changed):

#!/bin/bash

# Sample Slurm job script for Galvani 

#SBATCH -J changeme                # Job name
#SBATCH --ntasks=1                 # Number of tasks
#SBATCH --cpus-per-task=8          # Number of CPU cores per task
#SBATCH --nodes=1                  # Ensure that all cores are on the same machine with nodes=1
#SBATCH --partition=2080-galvani   # Which partition will run your job
#SBATCH --time=0-00:05             # Allowed runtime in D-HH:MM
#SBATCH --gres=gpu:2               # (optional) Requesting type and number of GPUs
#SBATCH --mem=50G                  # Total memory pool for all cores (see also --mem-per-cpu); exceeding this number will cause your job to fail.
#SBATCH --output=/CHANGE/THIS/PATH/TO/WORK/myjob-%j.out       # File to which STDOUT will be written - make sure this is not on $HOME
#SBATCH --error=/CHANGE/THIS/PATH/TO/WORK/myjob-%j.err        # File to which STDERR will be written - make sure this is not on $HOME
#SBATCH --mail-type=ALL            # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=ENTER_YOUR_EMAIL   # Email to which notifications will be sent

# Diagnostic and Analysis Phase - please leave these in.
scontrol show job $SLURM_JOB_ID
pwd
nvidia-smi # only if you requested GPUs
ls $WORK   # not strictly necessary; just illustrates that $WORK is available here

# Setup Phase
# add possibly other setup code here, e.g.
# - copy singularity images or datasets to local on-compute-node storage like /scratch_local
# - loads virtual envs, like with anaconda
# - set environment variables
# - determine commandline arguments for `srun` calls

# Compute Phase
srun python3 runfile.py  # srun automatically picks up the configuration defined via `#SBATCH` lines and `sbatch` command-line arguments
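
After saving the script, a typical submit-and-monitor workflow looks like this; <jobid> stands for the ID that sbatch prints:

sbatch batch-script.sh   # prints "Submitted batch job <jobid>"
squeue -u $USER          # watch the job while it is pending or running
scancel <jobid>          # cancel the job if it is no longer needed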

Interactive sessions with srun

Begin an interactive session using srun. This logs you into a compute node and gives you a command prompt there, where you can issue commands and run code as if you were working on your personal machine. An interactive session is a great way to develop, test, and debug code. The srun command submits a new job on your behalf, providing interactive access once the job starts. You need to remain logged in until the interactive session begins.

A simple way to start an interactive job is to run:

srun --job-name "InteractiveJob" --partition=2080-galvani --ntasks=1 --nodes=1 --gres=gpu:1 --time 1:00:00 --pty bash

Basic srun example:

In this example, a user starts an interactive job on a compute node (galvani-cn107), activates a conda environment (p312x), checks the status of the allocated GPUs (nvidia-smi) and the nvcc CUDA compiler, then runs a basic Python interactive shell. Finally, the user logs out of the interactive job, terminating the job and returning to galvani-login.

[usr123@galvani-login.sdn]
[Mon Jun 24 12:10:09] ~ $ srun --job-name "bash-prompt" --partition=2080-galvani --ntasks=1 --nodes=1 --gres=gpu:1 --time 1:00:00 --pty bash
[usr123@galvani-cn107 ~]$ conda activate p312x
(p312x) [usr123@galvani-cn107 ~]$ nvidia-smi
Mon Jun 24 12:11:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:DB:00.0 Off |                  N/A |
|  0%   35C    P8              38W / 250W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
(p312x) [usr123@galvani-cn107 ~]$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(p312x) [usr123@galvani-cn107 ~]$ python3
Python 3.6.8 (default, Oct 23 2023, 19:59:56) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> 
(p312x) [usr123@galvani-cn107 ~]$ exit
[usr123@galvani-login.sdn]
[Mon Jun 24 12:15:32] ~ $ 
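
srun can also execute a single command on a compute node without opening an interactive shell, which is handy for quick checks. A minimal sketch, reusing the partition from the example above:

# Allocate one GPU for a few minutes, print its status, and exit
srun --partition=2080-galvani --gres=gpu:1 --ntasks=1 --nodes=1 --time 0:05:00 nvidia-smi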

Salloc

salloc functions similarly to srun --pty bash in that it adds your resource request to the queue. However, once the allocation starts, a new bash session opens on the login node rather than on the compute node. To run commands on the allocated node, you need to use srun.

If your connection is lost for some reason, you can use salloc --no-shell to resume shell/job sessions.
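
A minimal salloc workflow might look like the sketch below; the partition, GPU count, and time limit are example values you should adapt:

# Request an allocation; once it is granted, a new shell opens on the login node
salloc --partition=2080-galvani --gres=gpu:1 --time 1:00:00

# Inside that shell, srun runs commands on the allocated compute node
srun nvidia-smi
srun --pty bash   # or open an interactive shell on the compute node

# Leaving the salloc shell releases the allocation
exit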

Important

You can find more about Slurm and additional useful commands from the Slurm Tutorial.

