Problem: 2080Ti card suddenly disappeared / nvidia-smi errors
Occasionally, a 2080Ti GPU develops a hardware problem and crashes. Typical symptoms are a failing job, nvidia-smi returning errors or no longer showing your GPU, or an NVIDIA Xid 79 error.
Unfortunately, this happens frequently with this hardware. The only remedy is to cancel your job and rerun it on another node. In the meantime, the ML Cloud Team will take care of the problematic card and return the node to service.
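The cancel-and-rerun step could look like the sketch below. It assumes the standard Slurm commands scancel and sbatch (Slurm is the scheduler that appears in the error logs further down this page). The job ID, node name, and script name are hypothetical placeholders, and the helper only prints the commands so that you can review them before running them yourself.

```shell
# Sketch: print the Slurm commands to cancel a job and resubmit it
# while keeping it off the node with the faulty GPU.
# All three arguments are hypothetical placeholders.
resubmit_commands() {
  job_id="$1"      # ID of the failed or hanging job
  bad_node="$2"    # node whose GPU disappeared
  job_script="$3"  # your batch script
  echo "scancel ${job_id}"
  # --exclude asks Slurm not to schedule the job on the listed node(s)
  echo "sbatch --exclude=${bad_node} ${job_script}"
}

resubmit_commands 1234 some-node my_job.sh
```

Excluding the node by name is only one way to land on a different node; simply resubmitting may also work once the ML Cloud Team has taken the node out of service.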
Problem: I failed to deploy my SSH public key in time / my old SSH key is no longer accessible
Sometimes, users fail to copy their SSH public key within the initial window provided when setting up their account, or they no longer have access to their previous SSH private key. Thus, they cannot access the server.
To resolve this, please file a support ticket requesting an additional day of password access. Use this new window to upload your SSH public key.
If your SSH private key has been compromised, please notify the ML Cloud Team immediately via a support ticket.
Problem: My disk quota is full.
Disk space is a shared, finite resource among all cluster users. Increasing one user's quota reduces the amount available to everyone else, so decisions on quota allocation must balance fairness for all users. If necessary, please direct requests for additional quota to the support ticketing system, but understand that there is a high probability that your quota extension will be denied.
Problem: I am experiencing login issues
One common reason for login issues is a maxed-out quota on $HOME. All $HOME directories have a very low quota by design.
By default, conda places its pkgs storage in $HOME/.conda.
To route package downloads to $WORK instead, run the following command twice:
conda config --add pkgs_dirs $WORK/.conda/pkgs/
If you need the space, you may then mv $HOME/.conda/pkgs $WORK/.conda.
Please do NOT store input data, results or models on $HOME. Please never let your jobs write to $HOME.
If you run out of quota on $HOME, please file a support ticket with the ML Cloud Team so we can temporarily resolve the issue.
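To see what is using your $HOME quota before moving data or filing a ticket, a rough sketch along these lines may help. It relies only on the standard find, du, sort, and tail utilities; the function name largest_entries is made up for illustration and is not a tool provided by the cluster.

```shell
# Sketch: show the ten largest entries directly under a directory
# (default: $HOME), sizes in KiB, largest last. Hidden entries such
# as .conda are included.
largest_entries() {
  dir="${1:-$HOME}"
  find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + 2>/dev/null \
    | sort -n | tail -n 10
}

largest_entries "$HOME"
```

If .conda shows up near the bottom of the list, the pkgs directory described above is a likely candidate for moving to $WORK.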
Possible Quobyte-client mount error
If you receive errors similar to the ones below:
[user@galvani-login ~]$ srun --job-name "InteractiveJob" --ntasks=1 --nodes=1 --time 1-00:00:00 --gres=gpu:1 --pty bash
slurmstepd: error: couldn't chdir to `/mnt/qb/home/group/user': Transport endpoint is not connected: going to /tmp instead
slurmstepd: error: couldn't chdir to `/mnt/qb/home/group/group': Transport endpoint is not connected: going to /tmp instead
bash: /home/group/user/.bashrc: Transport endpoint is not connected
[user@galvani-login ~]$
or
/var/spool/slurmd/job1234/slurm_script: line 20: /home/group/user/.bashrc: Transport endpoint is not connected
this is probably caused by the Quobyte filesystem client failing to mount automatically on the node your job was submitted to. Please open a support ticket and we will fix the problem.
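When reporting this, it may help to state which node and directory are affected. The sketch below is one way to gather that information; the function name check_dir is hypothetical, and SLURM_JOB_ID is only set when the command runs inside a Slurm job.

```shell
# Sketch: report whether a directory is reachable from the current node.
check_dir() {
  if ls -d "$1" >/dev/null 2>&1; then
    echo "OK: $1 is reachable on $(uname -n)"
  else
    # Falls back to "unknown" when not running inside a Slurm job
    echo "UNREACHABLE: $1 on $(uname -n) (job ${SLURM_JOB_ID:-unknown})"
    return 1
  fi
}

check_dir "$HOME" || echo "Consider including the line above in your support ticket."
```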
Changing user shell
If you wish to change your shell from the default (bash) to zsh, please open a support ticket and request this.
We do not support all shells currently; if you wish to use a different shell, please enquire for more details.
I don't see a specific package, why?
The current strategy of the ML Cloud Team is that we only install required packages. Thus, if you have a particular need, please file a support ticket requesting the package and we will contact you to discuss further.
Problem with your conda
If you see an error such as the one below:
Collecting package metadata (repodata.json): failed
>>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/conda/core/subdir_data.py", line 387, in _load
    raw_repodata_str = fetch_repodata_remote_request(
  File "/usr/lib/python3.9/site-packages/conda/core/subdir_data.py", line 858, in fetch_repodata_remote_request
    raise Response304ContentUnchanged()
conda.core.subdir_data.Response304ContentUnchanged
During handling of the above exception, another exception occurred:
.....
try removing ~/.condarc from your home directory and then retry.
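A slightly more cautious variant of the removal is to rename the file so that it can be restored later. This is a sketch, not the ML Cloud Team's prescribed procedure; the function name set_aside and the .bak suffix are arbitrary choices.

```shell
# Sketch: move a config file aside instead of deleting it outright.
# An existing "<file>.bak" from an earlier run is overwritten.
set_aside() {
  file="$1"
  if [ -e "$file" ]; then
    mv -- "$file" "$file.bak"
    echo "Moved $file to $file.bak"
  else
    echo "Nothing to do: $file does not exist"
  fi
}

set_aside "$HOME/.condarc"
```

Note that ~/.condarc is also where conda config stores user settings such as pkgs_dirs, so any such settings would need to be added again afterwards.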
Strange wget error
If you encounter strange wget errors, simply delete your ~/.wget-hsts file, which should resolve the issue.
Created: June 21, 2024