Slurm is used to submit jobs to a specified set of compute resources, which are variously called queues or partitions. Slurm uses the term partition. Partitions, their defaults, limits and purposes are listed on each cluster page. To see more details about how jobs are scheduled see our Job Scheduling documentation.
Please be a good Cluster Citizen. We find many jobs on the clusters that allocate multiple CPUs/nodes, but unfortunately only actually run on a single CPU, often seriously overloading that CPU. This is a problem for two reasons. The job runs much more slowly, and resources that could be used by others are wasted. Here are some tips for avoiding that problem:
- Do not use the slurm option
--ntasks/-nto ask for multiple CPUs except for MPI programs. Other than MPI, there are almost no cases where this does what you want.
- Programs that are either sequential (using only one CPU) or multi-threaded (using multiple CPUs) can only run on a single node. Allocating CPUs on multiple nodes will not speed them up magically.
- Once your job is running, double check that it is actually using all the CPUs you allocated. Use
squeue -u <netid>to determine the nodes your job is running on.
sshto the node(s) and run
top -u <netid>to check CPU usage. You should see your processes running on each node. Typically if your program didn't spread onto the entire allocation, the first node in your list will be overloaded and the rest idle.
Common Slurm Commands
Submit a submission script (see below for details)
List queued and running jobs
Cancel a queued job or kill a running job
Check status of individual job (including failed or completed)
sacct -j <job_id>
Interactive jobs can be used for testing and troubleshooting code. By requesting an interactive job, you will be allocated resources and logged onto the node in a shell.
srun --pty -p interactive bash
This will assign a free node to you, allocating the requested number of cpus, walltime, and memory, and put you within a shell on that node. You can run any number of commands within that shell. To free the allocated node, exit from the shell.
When using an interactive shell under slurm, your job is vulnerable to being killed if you lose your network connection. We recommend using
tmux alleviate this. When using
tmux, please be sure to keep track of your allocations and free those no longer needed!
To use a GUI application (such as Matlab), when in an interactive job, use the
srun --pty --x11 -p interactive [additional slurm options] bash
Note that for X11 forwarding to work, you need to have your local machine setup properly. Please see our X11 setup guide for more info.
To submit a job via Slurm, you first write a simple shell script called a "submission script" that wraps your job. A submission script is comprised of three parts:
- The program that should run the script. This is normally
- The "directives" that tell the scheduler how to setup the computational resources for your job.These lines must appear before any other commands or definitions, otherwise they will be ignored.
- The actual "script" portion, which are the commands you want executed during your job.
Here is an example script.sh that runs a job on one cpu on single node:
#!/bin/bash #SBATCH --partition=general #SBATCH --job-name=my_job #SBATCH --ntasks=1 --nodes=1 #SBATCH --mem-per-cpu=6000 #SBATCH --time=12:00:00 #SBATCH --mail-type=ALL #SBATCH --mail-user=email ./myprog -p 20 arg1 arg2 arg3 ...
As shown in the above example, "directives" are comprised of
#SBATCH followed by Slurm options. Most commonly used options include:
||Custom job name|
||Partition to run on|
||Total number of nodes|
||Number of "tasks". For use with distributed parallelism. See below.|
||# of cpus allocated to each task. For use with shared memory parallelism.|
||Number of "tasks" per node. For use with distributed parallelism. See below.|
||Maximum walltime of the job in Days-Hours:Minutes:Sec|
||specific node architecture (if applicable)|
||Memory requested per cpu in MB|
||Memory requested per node in MB|
||Mail address (alternatively, put your email address in ~/.forward)|
||Send emails to user on all job events|
Additional options can be found on in the official Slurm documentation.
Resource Limit Enforcement
Slurm uses the linux cgroup feature to enforce limits on cpus, gpus, and memory. Jobs are only permitted to run on a node if they have a valid allocation, and only within the limits specified by that limitation. Thus, if you request a single core from slurm (the default) and start a job that runs 20 parallel threads, those threads will be packed into a single cpu, and run very slowly. Similarly, if you do not explicitly request memory, your job will be granted a fairly modest default per cpu, and if your job attempts to exceed that amount, it will be killed.
Requesting Cores and Nodes
Slurm is very explicit in how one requests cores and nodes. While extremely powerful, the three flags,
--cpus-per-task can be a bit confusing at first. We attempt to disambiguate them below.
The term "task" in this context can be thought of as a "process". Therefore, a multi-process program (e.g. MPI) is comprised of multiple tasks. And a multi-threaded program is comprised of a single task, which can in turn use multiple CPUs. In Slurm, tasks are requested with the
--ntasks flag. CPUs, for the multithreaded programs, are requested with the
1. Multi-threaded & multi-process programs
To request CPUs for your multi-threaded program, use the
--cpus-per-task flag. Individual tasks cannot be split across multiple compute nodes, so requesting a number of CPUs with
--cpus-per-task flag will always result in all your CPUs allocated on the same compute node.
2. MPI programs
In Slurm, the
--ntasks flag specifies the number of MPI tasks created for your job. Note that, even within the same job, multiple tasks do not necessarily run on a single node. Therefore, requesting the same number of CPUs as above, but with the
--ntasks flag, could result in those CPUs being allocated on several, distinct compute nodes.
For many users, differentiating between
--cpus-per-task is sufficient. However, for more control over how Slurm lays out your job, you can add the
--nodes specifies how many nodes to allocate to your job. Slurm will allocate your requested number of cores to a minimal number of nodes on the cluster, so it is extremely likely if you request a small number of tasks that they will all be allocated on the same node. However, to ensure they are on the same node, set
--nodes=1 (obviously this is contingent on the number of CPUs on your cluster's nodes and requesting too many may result in a job that will never run). Conversely, if you would like to ensure a specific layout, such as one task per node for memory, I/O or other reasons, you can also set
--ntasks-per-node=1. Note that the following must be true:
ntasks-per-node * nodes <= ntasks
3. Hybrid (MPI+OpenMP) programs
For the most predictable performance for hybrid codes, you will need to use all three of the
--nodes flags, where
--ntasks equals the number of MPI tasks,
--cpus-per-task equals the number of OMP_NUM_THREADS and
--nodes is the number of nodes required to fit
--ntasks * --cpus-per-task.
Some of our clusters have nodes that contain GPU co-processors. Please refer to the cluster-specifc documentation regarding the node configurations that include gpus. In order for your job to be able to access gpus, you must request them as a slurm "Generic Resource" or gres. You spcify the gres configuration for a job with the
--gres flag, optionally a type of resource, and a number of gpus. For example, to request one nvidia p100 for your job, you would use the flag
--gres=gpu:p100:1. In cases where there are multiple gpus on a node, it is often a good idea to also specify the
--gres-flags=enforce-binding flag, which tells slurm to force the cpu(s) your job is allocated to share a PCIe Host Bridge with the gpu(s). Depending on your application, this can lead to improved performance, especially in transferring data to and from the gpu.
Example submission scripts
1. Multi-threaded programs
#!/bin/bash #SBATCH --job-name=omp_job #SBATCH --output=omp_job.txt #SBATCH --ntasks=1 #SBATCH --cpus-per-task=4 #SBATCH --time=10:00 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK ./hello.omp
2. Multi-process programs
On Omega, try to make ntasks equal to a multiple of 8.
#!/bin/bash #SBATCH --job-name=mpi #SBATCH --output=mpi_job.txt #SBATCH --ntasks=4 #SBATCH --time=10:00 mpirun hello.mpi
3. Hybrid (MPI+OpenMP) programs
#!/bin/bash #SBATCH --job-name=hybrid #SBATCH --output=hydrid_job.txt #SBATCH --ntasks=8 #SBATCH --cpus-per-task=5 #SBATCH --nodes=2 #SBATCH --time=10:00 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK mpirun hello_hybrid.mpi
3. GPU job
#!/bin/bash #SBATCH --job-name=deep_learn #SBATCH --output=gpu_job.txt #SBATCH --ntasks=1 #SBATCH --cpus-per-task=2 #SBATCH --gres=gpu:k80:2 #SBATCH --partition=gpu #SBATCH --gres-flags=enforce-binding #SBATCH --time=10:00 module load CUDA module load cuDNN # using your anaconda environment source activate deep-learn python my_tensorflow.py
Monitoring Job Status
To check on the status of your jobs, do:
$ squeue -l -u <netid>
To kill a running job, do:
$ scancel <job_id>
To check on the status of a partition(to see how many nodes are free, for example), do:
$ squeue -p partition
Using Private Partitions
If you have special permission to submit to a partition that belongs to another group, you may be asked to assign a special "account" to your jobs in that partition. You will be given the name of this account when you get access the partition and then simple add the
-A <account> flag to your submission command or as an additional directive in your submission script.
Features and Constraints
You may want to run programs that require more specific hardware than slurm may be willing to allocate to your job. To ensure your job runs on specific types of nodes, use the
--constraint flag. You can use the processor type (e.g.
E5-2660_v3) or processor codename (e.g.
haswell) to limit your job to specific node types. You can also specify an instruction set (e.g.
avx) to require that no matter what cpu your job runs on, it must understand at least these instructions. See the individual cluster pages for the exact tags for the different node types.
# run on a node with a haswell codenamed cpu (e.g. a E5-2660 v3) sbatch --constraint=haswell submit.sh # only run on nodes with E5-2660 v4 cpus sbatch --constraint=E5-2660_v4 submit.sh # run on any node that understands avx instructions # Your job could also run on an avx2 node sbatch --constraint=avx submit.sh
Use the command
scontrol show node hostname, replacing hostname with the node's hostname you're insterested in, to see more information about it including its features.
Migrating from another queuing system to Slurm
Most commands and batch script can be easily converted to slurm. SchedMD created a "rosetta stone" showing equivalent operations in various batch queuing systems. Here are some specific hints:
Migrating from Torque:
- You do not need to change directory by doing cd $PBS_O_WORKDIR. Slurm runs jobs in the submission directory by default
- You must have an interpreter command as the first line of your slurm batch script. Usually #!/bin/bash is what you want.