Using Slurm on Roar Collab

The Roar Collab (RC) computing cluster is a shared computational resource. To perform computationally intensive tasks, users must request compute resources and be granted access to them. This request-and-provision process lets the tasks of many users be scheduled and carried out efficiently while avoiding resource contention. RC uses Slurm (Simple Linux Utility for Resource Management) as its job scheduler and resource manager. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters, and it is used by many other HPC systems as well. Its primary functions are to

  • Allocate access to compute resources to users for some duration of time
  • Provide a framework for starting, executing, and monitoring work on the set of allocated compute resources
  • Arbitrate contention for resources by managing a queue of pending work

 

Warning

Do not perform computationally intensive tasks on submit nodes. Instead, request compute resources through Slurm so that the task runs on a compute node.
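One way to guard against accidentally running heavy work on a submit node is to check, at the top of a script, whether it is running inside a Slurm job. The sketch below assumes that SLURM_JOB_ID is set only inside a job (as it is for jobs started through sbatch or salloc). The function name in_slurm_job and the messages are illustrative and are not part of Slurm or RC.

```shell
#!/bin/bash
# Illustrative helper (not part of Slurm): succeeds only when the script
# is running inside a Slurm job, where Slurm sets SLURM_JOB_ID.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Running inside Slurm job ${SLURM_JOB_ID}; starting the computation."
    # ... computationally intensive commands would go here ...
else
    echo "Not inside a Slurm job; request compute resources before running heavy work." >&2
fi
```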

 

Slurm Resource Directives

Resource directives are used to request specific compute resources for a compute session.

 

Resource Directive      Description
-J or --job-name        Specify a name for the job
-A or --account         Charge resources used by a job to specified account
-p or --partition       Request a partition for the resource allocation
-N or --nodes           Request a number of nodes
-n or --ntasks          Request a number of tasks
--ntasks-per-node       Request a number of tasks per allocated node
--mem                   Specify the amount of memory required per node
--mem-per-cpu           Specify the amount of memory required per CPU
-t or --time            Set a limit on the total run time
-C or --constraint      Specify any required node features
-e or --error           Connect script’s standard error to a non-default file
-o or --output          Connect script’s standard output to a non-default file
--requeue               Specify that the batch job should be eligible for requeuing
--exclusive             Require exclusive use of nodes reserved for job
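As an illustration, resource directives are typically placed at the top of a batch script on lines beginning with #SBATCH, and the script is then submitted with sbatch. The following is a minimal sketch, not an RC-specific template: the account my_account, the partition my_partition, the job name, and the resource amounts are all placeholders to replace with values valid for your own allocation.

```shell
#!/bin/bash
# Lines beginning with "#SBATCH" are read by sbatch as resource directives;
# to the shell they are ordinary comments.
# Placeholder values: replace my_account and my_partition with your own.
#SBATCH --job-name=example_job
#SBATCH --account=my_account
#SBATCH --partition=my_partition
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Commands below run on the allocated compute node once the job starts.
run_host="$(uname -n)"
echo "Job ${SLURM_JOB_NAME:-example_job} running on ${run_host}"
```

Saved under a hypothetical file name such as example_job.sh, the script would be submitted with "sbatch example_job.sh". The same options can also be given on the sbatch or salloc command line.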

 

By default, standard output and standard error are both written to the same file, named slurm-%j.out, where %j is replaced by the job ID. The output and error file names can be customized with the -o and -e directives, using the replacement symbols in the table below.

 

Symbol      Description
%j          Job ID
%x          Job name
%u          Username
%N          Hostname where the job is running
%A          Job array’s master job allocation number
%a          Job array ID (index) number
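For example, the symbols can be combined in the -o and -e directives to give each job its own clearly named output and error files. This is a sketch with a placeholder job name (pattern_demo). The final lines only rebuild the expected file names from the job's environment for illustration, with fallback values so that the script also runs outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=pattern_demo
# %x expands to the job name and %j to the job ID, so a job with ID 12345
# would write pattern_demo-12345.out and pattern_demo-12345.err.
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# Rebuild the same names inside the job from Slurm's environment variables.
# The fallbacks after ":-" are used only when the script runs outside Slurm.
job_name="${SLURM_JOB_NAME:-pattern_demo}"
job_id="${SLURM_JOB_ID:-0}"
out_file="${job_name}-${job_id}.out"
err_file="${job_name}-${job_id}.err"

echo "Standard output goes to ${out_file}"
echo "Standard error goes to ${err_file}" >&2
```

For a job array, a pattern such as %x-%A_%a.out gives each array task its own file.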

 

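Slurm sets a number of environment variables within the scope of a job. Job scripts can read these variables to adapt to the allocation, for example to label output with the job ID or to return to the directory from which the job was submitted.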

 

Environment Variable      Description
SLURM_JOB_ID              ID of the job
SLURM_JOB_NAME            Name of job
SLURM_NNODES              Number of nodes
SLURM_NODELIST            List of nodes
SLURM_NTASKS              Total number of tasks
SLURM_NTASKS_PER_NODE     Number of tasks per node
SLURM_JOB_PARTITION       Queue (partition)
SLURM_SUBMIT_DIR          Directory of job submission
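A job script might use these variables as in the sketch below. The job name env_demo, the requested resources, and the shell variable names are placeholders. Some variables, such as SLURM_NTASKS_PER_NODE, are set only when the corresponding directive is given, so the script supplies fallback values rather than assuming every variable exists.

```shell
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --time=00:05:00

# Fallbacks after ":-" let the script run outside Slurm for testing.
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
job_id="${SLURM_JOB_ID:-none}"
num_nodes="${SLURM_NNODES:-1}"
num_tasks="${SLURM_NTASKS:-1}"

# Work from the directory in which the job was submitted.
cd "$submit_dir" || exit 1

# Record the shape of the allocation in the job's output.
summary="job=${job_id} nodes=${num_nodes} tasks=${num_tasks}"
echo "$summary"
echo "node list: ${SLURM_NODELIST:-$(uname -n)}"
```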

 

Further details on the available resource directives are given in the Slurm documentation for the salloc and sbatch commands.