Known Issues

This page collects known issues affecting the ML Cloud systems and application software.

Note

The following list of known issues is intended to provide a quick reference for users experiencing problems on the ML Cloud systems. We strongly encourage all users to report the occurrence of problems, whether listed below or not, to ML Cloud Support.

GPU nodes may crash but still be available for scheduling jobs

Added: 2023-01-28

Affects: Galvani

Description: A GPU node may fail to detect its GPUs. Jobs landing on such a node fail immediately with a CUDA not found error, and since the node still appears free to the scheduler, many jobs may be sent to it.

Status: Open

Workaround/Suggested Action: Open a ticket: the node must be restarted, which only the ML Cloud admins can do.
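Until the node is restarted, jobs can at least fail fast instead of burning their allocation. The sketch below is a hypothetical pre-flight check for a Slurm batch script (not official ML Cloud tooling; the partition name is an example):

```shell
#!/bin/bash
#SBATCH --partition=gpu        # example partition name; adjust to your setup
#SBATCH --gres=gpu:1

# Check whether the node's driver can actually see any GPU.
gpus_visible() {
    nvidia-smi --list-gpus > /dev/null 2>&1
}

# Only run the check inside an actual Slurm allocation, and abort
# immediately on a broken node so the failure is obvious and cheap.
if [ -n "${SLURM_JOB_ID:-}" ] && ! gpus_visible; then
    echo "No GPUs visible on $(hostname); please open a ticket with ML Cloud Support" >&2
    exit 1
fi

# ... actual job commands follow here ...
```

The failing job's log then names the broken node, which is useful information to include in the support ticket.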

Slurm commands cannot find the configuration file

Added: 2023-01-28

Affects: Galvani

Description: During Slurm package updates, Slurm client commands (e.g. sbatch) may temporarily fail to locate their configuration source. You will see an error like:

sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source

Workaround/Suggested Action: Restart your shell (log out and log back in), or use export SLURM_CONF=/etc/slurm/slurm.conf to set the path manually.
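The manual workaround looks like this in practice (the configuration path is the one given above; verify it exists on your login node before relying on it):

```shell
# If sbatch/squeue fail with "Could not establish a configuration source",
# point Slurm at the local configuration file for the current shell:
export SLURM_CONF=/etc/slurm/slurm.conf

# Subsequent Slurm commands in this shell now read the local file, e.g.:
#   sbatch my_job.sh    (hypothetical job script name)
```

The export only affects the current shell; a fresh login after the package update normally picks up the correct configuration again on its own.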