
Storage

Backups

Warning

Perform regular backups of your data to a safe location. The ML Cloud does not perform data backups.
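For example, one simple approach is to mirror important directories to a machine outside the cluster with rsync (a minimal sketch; backup-host and the paths are placeholders, not an ML Cloud service):

rsync -av /mnt/lustre/home/<group>/<user>/project/ <user>@backup-host:/backups/project/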

The $SCRATCH File System

The $SCRATCH file system, as its name indicates, is a temporary storage space. Files that have not been accessed in ten days are subject to purge without warning. Assume that $SCRATCH is purged when your compute job finishes, so only use it as a temporary workspace.
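For example, a job script can stage results out of $SCRATCH before it exits (a minimal sketch; the destination path is a placeholder for your own storage):

# At the end of your job script: copy results to permanent storage before the job ends.
cp -r "$SCRATCH/results" /mnt/lustre/home/<group>/<user>/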

Best Practices on Lustre

Lustre is a high-performance distributed cluster file system designed to appear as a single file system to the end user. This distributed design, however, means that some common command-line operations are inefficient on Lustre compared to single-disk file systems.

Avoid accessing attributes of files and directories

Try to use the Lustre-aware lfs commands where possible to access attributes of files and directories, such as lfs find or lfs df instead of find or df.

In Lustre, accessing metadata such as file attributes (e.g. type, ownership, permissions, size, dates) is resource intensive. Commands that do this frequently or over large directories can degrade file system performance.

  • Avoid commands such as ls -R, find, locate, du, df and similar. These commands walk the file system recursively and/or perform heavy metadata operations, and can badly degrade overall file system performance. If walking the file system recursively is absolutely required, use the Lustre-optimized lfs find instead of find and similar tools. To minimize the number of Lustre RPC calls, whenever possible use the lfs commands instead of the system-provided ones:

lfs df instead of df

lfs find instead of find
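For example (a sketch; the path is a placeholder for your own Lustre directory):

lfs df -h                                                  # per-OST usage of the mounted Lustre file systems
lfs find /mnt/lustre/<group>/<user> --type f --mtime +10   # regular files not modified in the last 10 days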

Avoid having a large number of files in a single directory

When a file is accessed, Lustre places a lock on its parent directory. When many files in the same directory are opened at once, this creates contention. Writing thousands of files to a single directory puts a massive load on the Lustre metadata servers and can even take the file system offline; likewise, merely accessing a directory containing thousands of files causes heavy resource contention that degrades performance.

One alternative is to organize the data into multiple sub-directories and split the files across them.
A common approach is to use the square root of the number of files: for instance, for 90000 files the square root is 300, so you would create 300 directories containing 300 files each, as sketched below.
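A minimal shell sketch of such a split, assuming a flat directory flat_dir/ with roughly 90000 files (both directory names are placeholders):

i=0
for f in flat_dir/*; do
    d=$(( i % 300 ))            # round-robin assignment over 300 sub-directories
    mkdir -p "split_dir/$d"
    mv "$f" "split_dir/$d/"
    i=$(( i + 1 ))
done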

Better still, reorganize your data into aggregate files (e.g., WebDataset shards or tar archives). For experts, when the data is read-only, another alternative is to create a disk image and mount it read-only through loopback on each cluster node, as described in the Handling Data Tutorial. Container tools such as Singularity can also work with loopback-mounted disk images.
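As a sketch of that expert option, one way to build and use such a read-only image with Singularity is a SquashFS overlay (my_dataset/ and my_container.sif are placeholders; your site's supported workflow may differ, see the Handling Data Tutorial):

mksquashfs my_dataset my_dataset.sqsh -keep-as-directory                      # pack the directory into a single image file
singularity exec --overlay my_dataset.sqsh my_container.sif ls /my_dataset    # contents appear read-only inside the container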

File Size: Use Large Files

Lustre performance depends on the size of the files being accessed. A read that touches many small files is slower than one that reads a few large files, because the metadata for each file must be fetched individually. For optimal performance, prefer file formats that package many records inside a single large file while still allowing direct sub-file access (HDF5 and tar-based formats such as WebDataset are common examples).
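For example, many small files can be bundled into one tar archive and individual members read back without unpacking everything (a sketch; the file names are placeholders):

tar -cf train-0000.tar train_images/                           # bundle many small files into one archive
tar -xOf train-0000.tar train_images/img_00042.png > img.png   # stream a single member to a file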

Lustre Architecture & Striping, In Brief

The Lustre file system looks and acts like a single logical hard disk, but is actually a sophisticated integrated system involving hundreds of physical drives. Lustre stripes large files over several physical disks, breaking each file into chunks and storing them on multiple drives; this makes it possible to deliver the high performance needed to service input/output (I/O) requests from hundreds of users across thousands of nodes. Lustre is managed from a Management Server (MGS) that interfaces with the nodes where data and metadata are stored. There are two types of Lustre storage nodes in our configuration: MDTs and OSTs. Object Storage Targets (OSTs) manage the file system's data space: a file with 11 stripes, for example, is distributed across 11 OSTs for storage. Metadata Targets (MDTs) track which OSTs are assigned to a file, and store the file system's descriptive metadata. The MGS, MDTs, and OSTs work together to provide file system services to the client nodes.
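You can inspect and, where permitted, control striping with lfs (a sketch; the paths are placeholders):

lfs getstripe /mnt/lustre/<group>/<user>/bigfile.h5    # show the stripe count and the OSTs holding the file
lfs setstripe -c 8 /mnt/lustre/<group>/<user>/wide/    # new files created here will be striped over 8 OSTs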

Available Datasets

The following commonly used datasets are provided for all users:

Dataset Name | Location
ImageNet-ffcv | /mnt/lustre/datasets/ImageNet-ffcv
CLEVR_v1.0 | /mnt/lustre/datasets/CLEVR_v1.0
coco | /mnt/lustre/datasets/coco
Falcor3D_down128 | /mnt/lustre/datasets/Falcor3D_down128
ffcv_imagenet_data | /mnt/lustre/datasets/ffcv_imagenet_data
kitti | /mnt/lustre/datasets/kitti
laion400m | /mnt/lustre/datasets/laion400m
laion_aesthetics | /mnt/lustre/datasets/laion_aesthetics
ModelNet40 | /mnt/lustre/datasets/ModelNet40
NMR_Dataset | /mnt/lustre/datasets/NMR_Dataset
stl10_binary | /mnt/lustre/datasets/stl10_binary
WeatherBench2 | /mnt/lustre/datasets/weatherbench2
PUG Dataset | /mnt/lustre/datasets/PUG
C4 (en, noclean) | /mnt/lustre/datasets/c4
synthclip | /mnt/lustre/datasets/SynthCLIP
gobjaverse (tar version) | /mnt/lustre/datasets/gobjaverse
mlcommons | /mnt/lustre/datasets/mlcommons
objaverse | /mnt/lustre/datasets/objaverse
ImageNet-C | /mnt/lustre/datasets/ImageNet-C
Imagenet2012 | /mnt/lustre/datasets/ImageNet2012
Imagenet-r | /mnt/lustre/datasets/imagenet-r

Do You Want A New Dataset?

If you would like an additional dataset installed for general use, please use the following form and/or contact us through the ticketing system.

CEPH S3

Ceph provides large amounts of slow storage for archival purposes, also known as cold storage. Ceph is accessible through the S3 protocol, which means you must have both S3 credentials and an S3 client.

Each user group in the MLCloud environment has a dedicated Ceph S3 bucket named <group>0. For example, the group mladm maps to the bucket mladm0.

Users may only write to their personal prefix within their group's bucket: <group>0/<username>/. For instance, user mfa624 in group mladm can only upload data to mladm0/mfa624/.

Each bucket has an initial quota of 35 TB.
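Once your S3 client is configured (see below), you can check how much space your prefix uses against that quota, for example with s3cmd (a sketch; <bucket> and <user> are placeholders):

s3cmd du -H s3://<bucket>/<user>/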

How to access Galvani Ceph S3 Buckets:

1. Generating access credentials for Galvani Ceph S3 buckets

This document describes how to generate Ceph S3 access credentials for the Galvani cluster.

Log in to galvani-login or galvani-login2. First, generate a token by running the following script. When prompted, enter your LDAP username and password — these are the same credentials you use for Nextcloud. If you have forgotten your password, you can reset it via Nextcloud.

/usr/share/custom-scripts/generate_token.sh

NOTE: While entering your password, no characters will appear on screen.

The output of this command is your Access Key, which you will need in the next step. Your Access Key is a string like this (this is a random example, not a valid key):

s9uk2E00L8Y6O741jDV9d61h73092LPk498Miq8AA1n2nX7142z24u1D376tk4346734d63Qvf36U23n50891w60P818ze98Du5116b38E94Z00M3rM1u5mpO0z0j64PC4aD5EOb87vgQTGb1v801181G3IeY2GM286r34s09349125Sjn3x85a=

NOTE: DO NOT SHARE THIS TOKEN WITH ANYONE, BECAUSE IT CONTAINS CLUSTER LOGIN INFORMATION.

How to Interact with Ceph buckets:

The Ceph cluster is reachable through two endpoints:

  • Public endpoint (reachable from anywhere): https://s3.mlcloud.uni-tuebingen.de (traffic outside our network is encrypted); higher latency and slower transfer speed.

  • Private endpoint (reachable from any compute node within Galvani as well as galvani-login and galvani-login2): http://s3.mlcloud.uni-tuebingen.de (traffic is not encrypted); lower latency and faster transfer speed.
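A quick way to check that an endpoint is reachable from your current machine is to request its headers with curl (a sketch):

curl -sI https://s3.mlcloud.uni-tuebingen.de    # public endpoint, from anywhere
curl -sI http://s3.mlcloud.uni-tuebingen.de     # private endpoint, from inside Galvani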

There are multiple CLI tools for interacting with S3 buckets; two of the most popular are s3cmd and awscli. The configuration of both tools is outlined below.

How to use s3cmd

If you want to use the private endpoint, keep the same hostname s3.mlcloud.uni-tuebingen.de but answer No to the Use HTTPS protocol prompt below.

  • Generate the ~/.s3cfg configuration file:

Execute the command s3cmd --configure and answer the prompts as follows (enter new values or accept the defaults shown in brackets by pressing Enter):

Access Key: the token you generated in the previous step

Secret Key: enter the word secret

Default Region [US]: press Enter

S3 Endpoint [s3.amazonaws.com]: s3.mlcloud.uni-tuebingen.de

DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: s3.mlcloud.uni-tuebingen.de

Encryption password: press Enter

Path to GPG program [/usr/bin/gpg]: press Enter

Use HTTPS protocol [Yes]: press Enter

HTTP Proxy server name: press Enter

Test access with supplied credentials?: y

Save settings?: y

  • Verify that your access works

First, try the following command:

s3cmd ls

If this command does not output an error, your configuration is valid. You can now interact with the bucket.

s3cmd Examples:

  • List bucket contents:

s3cmd ls s3://<bucket>

  • Upload a file to a bucket:

s3cmd put /mnt/lustre/home/group/user/local/file/path/filename.txt s3://<bucket>/<user>/

  • Download a file from a bucket:

s3cmd get s3://<bucket>/<user>/<file-name> /mnt/lustre/home/group/user/local/file/path/
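s3cmd can also synchronize whole directories, transferring only files that differ (a sketch; the paths are placeholders):

s3cmd sync /mnt/lustre/home/group/user/local/dir/ s3://<bucket>/<user>/dir/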

How to use awscli

  • To create the aws config and credentials files, run

aws --profile <profile-name> configure

<profile-name> can be any identifier, for example your LDAP username.

For example, if your LDAP username is mfa624:

aws --profile mfa624 configure
AWS Access Key ID [None]: the token you generated in the previous step
AWS Secret Access Key [None]: ""
Default region name [None]: us-east-1
Default output format [None]: json

NOTE: For the Secret Access Key, enter exactly two double quotation marks (an empty string).

Decide which endpoint you want to use and verify that your access works:

aws --profile mfa624 --endpoint-url https://s3.mlcloud.uni-tuebingen.de s3 ls

If this command does not output an error, your configuration is valid. If you are within Galvani, you should use the private endpoint; in that case, --endpoint-url is http://s3.mlcloud.uni-tuebingen.de.

aws Examples:

  • List bucket contents:

aws --profile <profile-name> --endpoint-url https://s3.mlcloud.uni-tuebingen.de s3 ls s3://<bucket>

  • Upload a file to a bucket:

aws --profile <profile-name> --endpoint-url https://s3.mlcloud.uni-tuebingen.de s3 cp /mnt/lustre/home/group/user/local/file/path/filename.txt s3://<bucket>/<user>

  • Download a file from a bucket:

aws --profile <profile-name> --endpoint-url https://s3.mlcloud.uni-tuebingen.de s3 cp s3://<bucket>/<user>/<file-name> /mnt/lustre/home/group/user/local/file/path/
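Likewise, aws s3 sync copies whole directories and skips files that are already up to date (a sketch; the paths are placeholders):

aws --profile <profile-name> --endpoint-url https://s3.mlcloud.uni-tuebingen.de s3 sync /mnt/lustre/home/group/user/local/dir s3://<bucket>/<user>/dir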

AWSCLI Compatibility Issue

Recent versions of awscli add new checksum headers to the authentication request, which Ceph rejects. The fix is to disable these checksum headers in ~/.aws/config by adding a profile-specific configuration.

The example below uses the profile name mfa624; use the same name you pass to aws --profile.

[profile mfa624]
request_checksum_calculation = when_required
response_checksum_validation = when_required
s3 =
  addressing_style = path
  preferred_transfer_client = classic

Last update: May 8, 2026
Created: June 21, 2024