Good Conduct on the ML Cloud

Follow the guidelines and rules on this page when interacting with the ML Cloud in order to be respectful of your fellow users.

You share the ML Cloud with many other users, and what you do on the system affects them. All users must follow a set of good practices that limit activities which may degrade the system for others. Exercise good conduct to ensure that your activity does not adversely impact the system or the research community with whom you share it.

The ML Cloud staff has developed the following guidelines for good conduct on the ML Cloud. Please familiarize yourself especially with the first two mandates. The sections that follow discuss best practices, and we provide job submission tips to help you construct job scripts that minimize wait times in the queues.

Do Not Run Jobs on the Login Nodes

ML Cloud's login nodes are shared among all users. Dozens of users may be logged on at one time, accessing the file systems. Think of the login nodes as a prep area, where users may edit and manage files, compile code, issue file transfers, and submit new and track existing batch jobs. The login nodes provide an interface to the "back-end" compute nodes.

The compute nodes are where actual computations occur and where research is done. Dozens of jobs may be running across the compute nodes, with many more queued up to run, especially ahead of submission deadlines. All batch jobs and executables, as well as development and debugging sessions, must be run on the compute nodes.

Running jobs on the login nodes is a sure way to degrade performance for other users. Instead, run such jobs on the compute nodes, either via an interactive session or by submitting a batch job.

Dos & Don'ts on the Login Nodes

  • Do not run research applications on the login nodes. If you need interactive access, use the interactive session utility or Slurm's srun to schedule one or more compute nodes.

    DO THIS: Start an interactive session on a compute node.

     
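    For example, a minimal interactive allocation via Slurm's srun might look like the sketch below. The partition name, time limit, and resource counts are placeholders, not actual ML Cloud values; substitute the settings appropriate for your project.

    ```shell
    # Request an interactive shell on one compute node for 30 minutes.
    # "gpu-partition" is a hypothetical partition name; run `sinfo` to list real ones.
    srun --partition=gpu-partition --nodes=1 --ntasks=1 \
         --cpus-per-task=4 --time=00:30:00 --pty bash
    ```

    When the allocation is granted, you are dropped into a shell on the compute node; exit the shell to release the node back to the scheduler.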

  • That script you wrote to poll job status should probably do so once every few minutes rather than several times a second.
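As a sketch of gentle polling, the loop below checks a job's status once every five minutes rather than hammering the scheduler; the job ID is a placeholder, and the interval can be tuned to taste.

```shell
#!/bin/sh
# Poll a Slurm job's status every 5 minutes until it leaves the queue.
# $1 is the job ID; squeue -h -j prints nothing once the job is gone.
JOBID="$1"
while squeue -h -j "$JOBID" 2>/dev/null | grep -q .; do
    sleep 300   # wait 5 minutes between checks; do NOT poll in a tight loop
done
echo "Job $JOBID has left the queue."
```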

Do Not Stress the Shared File Systems


File System Usage Recommendations

Optimize Input/Output (I/O) Activity

File Transfer Guidelines

Job Submission Tips

  • Request Only the Resources You Need. Make sure your job scripts request only the resources needed for that job. Don't ask for more time or more nodes than you really need. The scheduler will have an easier time finding a slot for a job requesting 2 nodes for 2 hours than for one requesting 4 nodes for 24 hours. This means shorter queue wait times for you and everybody else.

  • Test your submission scripts. Start small: make sure everything works on 2 nodes before you try 20. Work out submission bugs and kinks with 5-minute jobs that won't wait long in the queue, using short, simple substitutes for your real workload: simple test problems.

  • Respect memory limits and other system constraints. If your application needs more memory than is available, your job will fail and may leave nodes in unusable states. Use the ML Cloud Guide sections on the clusters' hardware composition to understand the systems' limits. An additional dashboard will be made available to users to track resource availability.
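Putting these tips together, a conservative batch script might look like the following sketch. The partition name, resource values, and executable name are illustrative placeholders, not actual ML Cloud settings; adjust them to your cluster and workload.

```shell
#!/bin/bash
#SBATCH --job-name=test-run
#SBATCH --partition=cpu-partition   # hypothetical partition; run `sinfo` for real ones
#SBATCH --nodes=2                   # start small; scale up only after this works
#SBATCH --ntasks-per-node=4
#SBATCH --mem=16G                   # stay within the node's physical memory
#SBATCH --time=00:05:00             # short test job: waits less in the queue

# Short, simple stand-in for the real workload while debugging the script.
srun ./my_test_problem
```

Once the small test job runs cleanly, increase the node count, memory, and time limit to the values your production run actually needs, and no more.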