Known Issues
This page collects known issues affecting the ML Cloud systems and application software.
Note
The following list of known issues is intended to provide a quick reference for users experiencing problems on the ML Cloud systems. We strongly encourage all users to report the occurrence of problems, whether listed below or not, to ML Cloud Support.
GPU nodes may crash but still be available for scheduling jobs
Added: 2023-01-28
Affects: Galvani
Description: A GPU node may fail to detect its GPUs. Jobs on such a node fail immediately with a "CUDA not found" error, and unfortunately many jobs may get sent to the node because it still appears free to the scheduler.
Status: Open
Workaround/Suggested Action: Open a ticket. The node needs to be restarted and only the ML Cloud admins can do this.
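Until the node is restarted, a pre-flight check in the job script can at least make the job fail fast with a clear message instead of crashing deep inside the application. The sketch below (the `gpus_visible` helper is a hypothetical name, not part of any ML Cloud tooling) uses `nvidia-smi -L` to test whether the node can list any CUDA devices:

```shell
#!/bin/bash
# Hypothetical pre-flight GPU check for a batch job.
# Succeeds only when nvidia-smi exists and can list at least one device.
gpus_visible() {
    command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1
}

if gpus_visible; then
    echo "GPU check passed, starting workload"
    # ... launch the actual job here ...
else
    # Log the affected hostname so it can be included in the support ticket.
    echo "GPU check failed on $(hostname); the node likely needs a restart" >&2
fi
```

Running this at the top of a job script makes it obvious which node is affected, which speeds up the support ticket.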
Slurm cannot find its configuration file
Added: 2023-01-28
Affects: Galvani
Description: Slurm commands such as sbatch may fail to locate the Slurm configuration. This can happen during Slurm package updates. You will see an error like:
sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source
Workaround/Suggested Action: Restart the current shell (log out and log back in), or set the path manually with export SLURM_CONF=/etc/slurm/slurm.conf
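The manual workaround for the current session looks like this (the config path is the one quoted above; adjust it if your installation differs):

```shell
# Point Slurm client commands (sbatch, squeue, ...) directly at the
# configuration file, bypassing the failing DNS SRV lookup:
export SLURM_CONF=/etc/slurm/slurm.conf

# Confirm the variable is set for this session:
echo "SLURM_CONF is set to: $SLURM_CONF"
```

The export only affects the current shell; after the package update settles, a fresh login should work without it.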