Frequently Asked Questions

Here you can find a list of frequently asked questions about the ML Cloud. The list is compiled from user problems submitted through the ticketing system, and we will update it periodically as new issues come in.

I am experiencing login issues

A common cause of login issues is a maxed-out quota on $HOME. Please do NOT store input data, results, or models on $HOME, and do NOT let your jobs write to $HOME. If this happens to you, open a ticket with the ML Cloud Team - we will temporarily increase your quota so you can log in and free up space.
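
Once you can log in again, it helps to see what is taking up the space. A minimal sketch, assuming GNU coreutils are available; the lfs quota line only applies if your $HOME is on a Lustre filesystem:

# List the largest items in your home directory, smallest to largest
du -sh ~/.[!.]* ~/* 2>/dev/null | sort -h

# If $HOME is on Lustre, show your current usage and limits
lfs quota -h -u $USER $HOME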

My SLURM job should be showing live output, but it isn't.

This is due to output buffering: you need to make sure your program flushes its output buffer. The problem can occur with Python inside or outside a container, and potentially with other programs as well. For Python, change python script.py to python -u script.py. For other programs, look up how to flush the stdout buffer.
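
A minimal sbatch sketch illustrating this; the job name and script.py are placeholders:

#!/bin/bash
#SBATCH --job-name=live-output
#SBATCH --output=job-%j.out

# -u turns off Python's stdout/stderr buffering so output appears immediately
python -u script.py

# Setting this environment variable before the call has the same effect:
# export PYTHONUNBUFFERED=1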

I can't activate conda in a SLURM job

Either write out the explicit path to the Python executable inside your conda environment (/mnt/lustre/.../conda_envs/my_env/bin/python), or add a source ~/.bashrc line to your sbatch script before you run conda activate .... You can also check how to do this in the PyTorch and conda tutorial; the same applies to TensorFlow.
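
A minimal sbatch sketch, assuming an environment called my_env and a script train.py (both placeholders):

#!/bin/bash
#SBATCH --job-name=conda-job
#SBATCH --output=job-%j.out
#SBATCH --gres=gpu:1

# Load the conda shell functions into the non-interactive batch shell
source ~/.bashrc
conda activate my_env

python train.py

# Alternative: skip activation and call the environment's interpreter directly,
# e.g. /path/to/conda_envs/my_env/bin/python train.py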

An individual node is having problems that other nodes are not experiencing

To pin your SLURM job to specific compute nodes, use --nodelist=galvani-cn101,galvani-cn104 and so on. To exclude specific nodes, use --exclude=galvani-cn108,galvani-cn109. Once you have verified that your job crashes on a specific node (or nodes) but works on others, please report it as a bug and include the node-specific information. The admins will need the exact commands you used, so include them in your bug report, along with log locations and whether the admins may use your account to diagnose the problem. Possible admin fixes may include --gpu-reset.
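
For example (job.sh is a placeholder for your own batch script):

# Run the same job on a suspect node and on a known-good node to compare
sbatch --nodelist=galvani-cn101 job.sh
sbatch --nodelist=galvani-cn104 job.sh

# In the meantime, steer your jobs away from the suspect nodes
sbatch --exclude=galvani-cn108,galvani-cn109 job.sh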

csh availability

csh is not available. However, tcsh has recently been installed on all SLURM nodes.

I don't see a specific package, why?

The current strategy of the ML Cloud Team is to not install packages by default unless they are required. If you need a particular package, please open a package request through the ticketing system.

How to change my default shell to zsh

Please contact the support team with your request, listing your username and the shell you want, so we can change your default shell.

Problems accessing the login nodes, with the unsuccessful login attempt partially showing the following information

.....

debug1: No credentials were supplied, or the credentials were
unavailable or inaccessible
No Kerberos credentials available (default cache: FILE:/tmp/krb5cc_1000)

.....

This is probably due to a client problem on the login node. Please inform the ML Cloud Team so we can fix it.
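
When opening the ticket, it helps to attach the full verbose client output. A sketch, with a placeholder username and login node name:

# Capture the verbose SSH output and attach ssh-debug.log to your ticket
ssh -vvv username@galvani-login 2>&1 | tee ssh-debug.log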

Problems with conda

If you see an error such as the one below:

Collecting package metadata (repodata.json): failed

 >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/conda/core/subdir_data.py", line 387, in _load
        raw_repodata_str = fetch_repodata_remote_request(
      File "/usr/lib/python3.9/site-packages/conda/core/subdir_data.py", line 858, in fetch_repodata_remote_request
        raise Response304ContentUnchanged()
    conda.core.subdir_data.Response304ContentUnchanged

    During handling of the above exception, another exception occurred:

.....

you should remove ~/.condarc from your home directory and try again.
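
A minimal sketch of the cleanup; the conda clean step is optional, and the install command stands for whatever command originally failed:

# Remove the stale conda configuration
rm ~/.condarc

# Optionally clear the cached repodata as well
conda clean --index-cache

# Re-run the command that failed, e.g.
conda install numpy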

Possible QB-client mount problem

If you receive errors similar to the ones below:

[user@galvani-login ~]$ srun --job-name "InteractiveJob" --ntasks=1 --nodes=1 --time 1-00:00:00 --gres=gpu:1 --pty bash
slurmstepd: error: couldn't chdir to `/mnt/qb/home/group/user': Transport endpoint is not connected: going to /tmp instead
slurmstepd: error: couldn't chdir to `/mnt/qb/home/group/group': Transport endpoint is not connected: going to /tmp instead
bash: /home/group/user/.bashrc: Transport endpoint is not connected
bash-5.1$ 

or

/var/spool/slurmd/job1234/slurm_script: line 20: /home/group/user/.bashrc: Transport endpoint is not connected

this is probably related to a QB-client mount error on the node your job was scheduled on. Please open a ticket and we will fix the problem.
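
When filing the ticket, please include the node your job landed on. One way to find it (the job ID is a placeholder):

# Node list for a finished or failed job
sacct -j 1234 --format=JobID,State,NodeList

# Node list for a job that is still running or pending
squeue -j 1234 -o "%i %N"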

Strange wget error

If you encounter strange wget errors, simply delete your ~/.wget-hsts file; this should solve the issue.
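
For example (the URL is only illustrative):

# Remove wget's HSTS cache and retry the download
rm ~/.wget-hsts
wget https://example.com/data.tar.gz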