Using Slurm on Roar Collab
The Roar Collab (RC) computing cluster is a shared computational resource. To perform computationally intensive tasks, users must request compute resources and be granted access to them. This request/provision process allows the tasks of many users to be scheduled and carried out efficiently while avoiding resource contention. RC uses Slurm (Simple Linux Utility for Resource Management) as its job scheduler and resource manager. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters, and it is used by many other HPC systems as well. Its primary functions are to
- Allocate access to compute resources to users for some duration of time
- Provide a framework for starting, executing, and monitoring work on the set of allocated compute resources
- Arbitrate contention for resources by managing a queue of pending work
Warning
Do not perform computationally intensive tasks on submit nodes. Instead, submit a resource request via Slurm so that the task runs on a compute node.
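For example, an interactive session on a compute node can be requested with `salloc`; the resource amounts below are arbitrary placeholders and should be adjusted to the task at hand.

```bash
# Request 1 task on 1 node with 4 GB of memory for 1 hour (placeholder values)
salloc --nodes=1 --ntasks=1 --mem=4GB --time=01:00:00
```

Once the allocation is granted, the requested compute resources are available for interactive work.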
Slurm Resource Directives
Resource directives are used to request specific compute resources for a compute session.
Resource Directive | Description |
---|---|
`-J` or `--job-name` | Specify a name for the job |
`-A` or `--account` | Charge resources used by the job to the specified account |
`-p` or `--partition` | Request a partition for the resource allocation |
`-N` or `--nodes` | Request a number of nodes |
`-n` or `--ntasks` | Request a number of tasks |
`--ntasks-per-node` | Request a number of tasks per allocated node |
`--mem` | Specify the amount of memory required per node |
`--mem-per-cpu` | Specify the amount of memory required per CPU |
`-t` or `--time` | Set a limit on the total run time |
`-C` or `--constraint` | Specify any required node features |
`-e` or `--error` | Connect the script's standard error to a non-default file |
`-o` or `--output` | Connect the script's standard output to a non-default file |
`--requeue` | Specify that the batch job should be eligible for requeuing |
`--exclusive` | Require exclusive use of the nodes reserved for the job |
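These directives are most commonly placed at the top of a batch script and submitted with `sbatch`. The sketch below is a minimal example; the account, partition, and program names are placeholders that must be replaced with values valid for your allocation.

```bash
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --account=my_alloc        # placeholder: replace with your allocation/account
#SBATCH --partition=open          # placeholder: replace with an available partition
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8GB
#SBATCH --time=01:00:00
#SBATCH --output=example_job-%j.out

# Commands below run on the allocated compute node(s)
echo "Starting job on $(hostname)"
srun ./my_program                 # placeholder executable
```

Submitting the script with `sbatch example_job.sh` places the job in the queue, and it runs once the requested resources become available.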
By default, both standard output and standard error are directed to the same file, named `slurm-%j.out`, where `%j` is replaced by the job ID. The output and error filenames can be customized, however, using the filename patterns in the table below.
Symbol | Description |
---|---|
`%j` | Job ID |
`%x` | Job name |
`%u` | Username |
`%N` | Hostname where the job is running |
`%A` | Job array's master job allocation number |
`%a` | Job array ID (index) number |
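For instance, the following directives (a small illustrative sketch) would name the output and error files after the job name and job ID:

```bash
#SBATCH --output=%x-%j.out    # e.g., example_job-1234567.out
#SBATCH --error=%x-%j.err     # e.g., example_job-1234567.err
```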
Slurm sets a number of environment variables within the scope of a job, and referencing these variables in job scripts is often useful.
Environment Variable | Description |
---|---|
`SLURM_JOB_ID` | ID of the job |
`SLURM_JOB_NAME` | Name of the job |
`SLURM_NNODES` | Number of allocated nodes |
`SLURM_NODELIST` | List of allocated nodes |
`SLURM_NTASKS` | Total number of tasks |
`SLURM_NTASKS_PER_NODE` | Number of tasks per node |
`SLURM_JOB_PARTITION` | Partition (queue) of the job |
`SLURM_SUBMIT_DIR` | Directory from which the job was submitted |
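As a brief sketch, a job script might reference these variables to record where and how the job ran; the results directory name below is an arbitrary illustration.

```bash
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Report basic information about the job using Slurm-provided variables
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) running on: ${SLURM_NODELIST}"
echo "Submitted from ${SLURM_SUBMIT_DIR} with ${SLURM_NTASKS} task(s)"

# Illustrative use: keep results from each run in a job-specific directory
mkdir -p "${SLURM_SUBMIT_DIR}/results_${SLURM_JOB_ID}"
```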
Further details on the available resource directives are provided in the Slurm documentation for the `salloc` and `sbatch` commands.