Problem: 2080ti card suddenly disappeared / nvidia-smi errors

Occasionally a 2080Ti GPU hits a hardware fault and crashes. Symptoms include a failing job, nvidia-smi returning errors or not listing your GPU, and an NVIDIA Xid 79 error.

Unfortunately, this happens frequently with this hardware. The only remedy is to cancel your job and rerun it on another node. In the meantime, the ML Cloud Team will take care of the faulty card and return the node to service.
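
The cancel-and-rerun step could look roughly like the sketch below. It assumes a Slurm job environment, where `SLURMD_NODENAME` holds the node name, and uses `sbatch --exclude` to steer the rerun away from that node. The names `gpu_ok`, `suggest_resubmit`, and `job.sh` are hypothetical and chosen only for illustration.

```shell
# Return 0 if `nvidia-smi -L` (list GPUs) runs successfully, non-zero otherwise.
gpu_ok() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    nvidia-smi -L >/dev/null 2>&1
}

# Print an sbatch command line that excludes the current node.
# SLURMD_NODENAME is set by Slurm inside a job; fall back to the short hostname.
# The default script name job.sh is a placeholder.
suggest_resubmit() {
    node=${SLURMD_NODENAME:-$(hostname -s)}
    printf 'sbatch --exclude=%s %s\n' "$node" "${1:-job.sh}"
}
```

Inside a failing job, `gpu_ok || suggest_resubmit train.sh` prints a command you can run after cancelling the job with `scancel <jobid>`.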

Problem: I failed to deploy my SSH public key in time / my old SSH key is no longer accessible

Sometimes, users fail to copy their SSH public key within the initial window provided when setting up their account, or they no longer have access to their previous SSH private key. Thus, they cannot access the server.

To resolve this, please file a support ticket requesting an additional day of password access. Use this new window to upload your SSH public key.
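
Before the password window opens, it may help to confirm that the file you are about to upload really is a public key and not the private half. The check below is a rough sketch: `looks_like_pubkey` is a hypothetical helper, and it recognises only the common Ed25519, RSA, and ECDSA key types.

```shell
# Return 0 if FILE looks like an OpenSSH public key:
# a key type, a base64 blob, and an optional comment on one line.
# A private key file starts with "-----BEGIN" and will not match.
looks_like_pubkey() {
    [ -r "$1" ] || return 1
    grep -Eq '^(ssh-ed25519|ssh-rsa|ecdsa-sha2-nistp(256|384|521)) [A-Za-z0-9+/=]+( .*)?$' "$1"
}
```

If you need a fresh key pair, `ssh-keygen -t ed25519` creates one. Once `looks_like_pubkey ~/.ssh/id_ed25519.pub` succeeds, `ssh-copy-id -i ~/.ssh/id_ed25519.pub <user>@<login-host>` is one way to upload the key while password access is enabled; the user and host are placeholders for your own account details.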

If your SSH private key has been compromised, please notify the ML Cloud Team immediately via a support ticket.

Problem: My disk quota is full.

Disk space is a shared, finite resource among all cluster users. Adding quota for one user reduces the amount available to everyone else, so decisions on quota allocation must balance fairness for all users. If you genuinely need more, please direct requests for additional quota to the support ticketing system, but understand that there is a high probability your request will be denied.

Problem: I am experiencing login issues

One common reason for a user experiencing login issues is maxed-out quota on $HOME. All $HOME directories have a very low quota by design.

By default, conda places its pkgs storage in $HOME/.conda. To move this to $WORK, run the following command twice to route pkg downloads to $WORK:

conda config --add pkgs_dirs $WORK/.conda/pkgs/

If you need the space, you may then mv $HOME/.conda/pkgs $WORK/.conda.
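
Note that a plain mv behaves differently depending on whether $WORK/.conda already exists: if it does not, the pkgs directory is renamed to $WORK/.conda instead of landing inside it. A more defensive variant is sketched below. The helper name `move_pkgs` is hypothetical; it copies the cache contents into the destination and deletes the source only if the copy succeeded.

```shell
# Move the contents of SRC into DEST, creating DEST if needed.
# The source is removed only after the copy succeeded.
move_pkgs() {
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "nothing to move: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    cp -Rp "$src"/. "$dest"/ && rm -rf "$src"
}
```

A call would look like `move_pkgs "$HOME/.conda/pkgs" "$WORK/.conda/pkgs"`.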

Please do NOT store input data, results or models on $HOME. Please never let your jobs write to $HOME. If you run out of quota on $HOME, please file a support ticket with the ML Cloud Team so we can temporarily resolve the issue.
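
To see what is actually filling $HOME, something like the following sketch can help. The function name `largest_entries` is hypothetical; it relies on the `-mindepth` and `-maxdepth` options of find, which GNU and BSD find provide, and it includes hidden directories such as .conda or .cache.

```shell
# List the N largest top-level entries of DIR (sizes in KiB, largest first).
# Defaults: DIR is $HOME, N is 10.
largest_entries() {
    dir=${1:-$HOME}
    n=${2:-10}
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n "$n"
}
```

For example, `largest_entries "$HOME" 5` prints the five biggest files or directories directly under $HOME.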

Possible Quobyte-client mount error

If you receive errors similar to the ones below:

[user@galvani-login ~]$ srun --job-name "InteractiveJob" --ntasks=1 --nodes=1 --time 1-00:00:00 --gres=gpu:1 --pty bash
slurmstepd: error: couldn't chdir to `/mnt/qb/home/group/user': Transport endpoint is not connected: going to /tmp instead
slurmstepd: error: couldn't chdir to `/mnt/qb/home/group/group': Transport endpoint is not connected: going to /tmp instead
bash: /home/group/user/.bashrc: Transport endpoint is not connected
[user@galvani-login ~]$

or

/var/spool/slurmd/job1234/slurm_script: line 20: /home/group/user/.bashrc: Transport endpoint is not connected

this is probably because the Quobyte filesystem client failed to mount automatically on the node your job was assigned to. Please open a support ticket and we will fix the problem.
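
When filing the ticket, it can be useful to quote the exact error lines from the job output. A minimal sketch, assuming the job output was written to a file (sbatch writes to slurm-<jobid>.out by default); `mount_error_lines` is a hypothetical helper name.

```shell
# Print matching lines (with line numbers) from a job output file.
# Returns non-zero when the file contains no such error.
mount_error_lines() {
    grep -n 'Transport endpoint is not connected' "$1"
}
```

For example, `mount_error_lines slurm-1234.out` shows the affected lines for job 1234, and `sacct -j 1234 --format=JobID,NodeList,State` reports which node the job ran on.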

Changing user shell

If you wish to change your shell from the default (bash) to zsh, please open a support ticket and request this. We do not support all shells currently; if you wish to use a different shell, please enquire for more details.

I don't see a specific package, why?

The current strategy of the ML Cloud Team is that we only install required packages. Thus, if you have a particular need, please file a support ticket requesting the package and we will contact you to discuss further.

Problem with your conda

If you see an error such as this one below:

Collecting package metadata (repodata.json): failed

 >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/conda/core/subdir_data.py", line 387, in _load
        raw_repodata_str = fetch_repodata_remote_request(
      File "/usr/lib/python3.9/site-packages/conda/core/subdir_data.py", line 858, in fetch_repodata_remote_request
        raise Response304ContentUnchanged()
    conda.core.subdir_data.Response304ContentUnchanged

    During handling of the above exception, another exception occurred:

.....

try removing ~/.condarc from your home directory and run the command again.
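
If you would rather not lose your conda settings, renaming the file has the same effect as removing it and keeps a copy you can restore later. This is only a sketch; `set_aside` is a hypothetical helper name.

```shell
# Rename FILE to FILE.bak.<timestamp> instead of deleting it.
# Prints the backup path; succeeds without output if FILE does not exist.
set_aside() {
    [ -e "$1" ] || return 0
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    mv "$1" "$backup" && printf '%s\n' "$backup"
}
```

Running `set_aside ~/.condarc` moves the file out of the way so that conda falls back to its defaults on the next run.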

Strange wget error

If you encounter strange wget errors, delete the ~/.wget-hsts file in your home directory; this usually resolves the issue.


Last update: September 9, 2024
Created: September 9, 2024