- Your job will adapt during the run to the available resources. As more CPUs and/or memory become available, it will grow (up to the per-user limit).
- Your job will only use the resources needed to complete remaining tasks. It will shrink as your tasks finish, giving you and your peers better access to compute resources.
- When run on the scavenge partition, only the subtasks are preempted, and the job as a whole will continue. You can then use dSQAutopsy to create a new task file that has only the tasks that didn't complete.
- All you need is Python 2.7 or higher (Python 3 works too!)
dSQ is not recommended for situations where initialization takes up most of a task's execution time and could be reused across tasks. These situations are much better handled by a worker-based job handler.
First, you'll need to generate a task file. Each line of this task file needs to specify exactly what you want run for each task, including any modules that need to be loaded or modifications to your environment variables. Empty lines or lines that begin with `#` will be ignored when submitting your job array. Note: Slurm jobs begin in the directory from which your job was submitted, so be wary of relative paths. This also means that you don't need to `cd` to the working directory if you submit your job there.
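For a large number of tasks, the task file is usually easier to generate than to write by hand. The following Python 3 sketch shows one way to do that. The module name `MyTool`, the program `mytool`, and the input file names are hypothetical placeholders, so substitute your own commands.

```python
from pathlib import Path


def build_task_lines(input_files, setup="module load MyTool", program="mytool"):
    """Return one self-contained shell command line per input file.

    `setup` and `program` are hypothetical placeholders for illustration.
    """
    lines = []
    for name in input_files:
        # Each line carries its own environment setup, because every
        # line of the task file is run as a separate task.
        lines.append(f"{setup}; {program} {name} > {name}.out")
    return lines


def write_task_file(path, lines):
    """Write the task lines to `path`, one task per line."""
    Path(path).write_text("\n".join(lines) + "\n")
```

Calling `write_task_file("taskfile.txt", build_task_lines(["sample_01.txt", "sample_02.txt"]))` would write a two-line task file. The relative file names in this sketch resolve against the directory you submit from.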
Load Dead Simple Queue onto your path with:
module load dSQ
dSQ.py takes a few arguments, then passes the rest directly to sbatch, either by writing a script to stdout or by directly submitting the job for you. Unlike SimpleQueue, the resources you request will be given to each job in the array (each line in your task file), e.g. requesting 2 GB of RAM with dSQ will run each individual task with a separate 2 GB of RAM available. If you do not specify any additional sbatch arguments, some defaults will be set. Run `sbatch --help` or see https://slurm.schedmd.com/sbatch.html for more info on sbatch options.
dSQ.py --taskfile taskfile [dSQ args] [slurm args]

Required dSQ arguments:
  --taskfile TASKFILE   Task file, one task per line

Optional dSQ arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --submit              Submit the job array on the fly instead of printing to stdout.
  --max-tasks MAX_TASKS
                        Maximum number of simultaneously running tasks from the job array
Managing Your Array
You can refer to any portion of your job array with `jobid_index` syntax, or the entire array with its jobid. The index Dead Simple Queue uses starts at zero, so the 3rd line in your task file will have an index of 2. You can also specify ranges.
# to cancel task 4 for array job 14567
scancel 14567_4
# to cancel tasks 3, 5 and 10-20 for job 14567:
scancel 14567_[3,5,10-20]
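Before canceling a set of tasks, it can help to check which task file lines an index specification such as `[3,5,10-20]` refers to. The helper below is a hypothetical illustration in Python 3, not part of dSQ or Slurm. It handles only comma-separated indices and simple inclusive ranges.

```python
def expand_indices(spec):
    """Expand a spec like "[3,5,10-20]" into a sorted list of zero-based indices."""
    indices = set()
    for part in spec.strip("[]").split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Ranges are inclusive on both ends, so 10-20 covers 11 tasks.
            indices.update(range(int(start), int(end) + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


def tasks_for_spec(task_lines, spec):
    """Map each index to its task line (index 0 is the first line of the file)."""
    return {i: task_lines[i] for i in expand_indices(spec) if i < len(task_lines)}
```

Passing the lines of your task file and a specification to `tasks_for_spec` returns the commands that the matching `scancel` call would affect.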
You can monitor the status of your tasks in Slurm by using `squeue -u <netid>`.
dSQ creates a file named `job_<jobid>_status.tsv`, which will report the success or failure of each task as it finishes. Note that this file will not contain information for any tasks that were canceled (e.g. by the user with scancel) before they began. This file contains details about the completed tasks in the following tab-separated columns:
- Task_ID: the zero-based line number from your task file
- Exit_Code: exit code returned from your task (non-zero number generally indicates a failed task)
- Time_Started: time started, formatted as year-month-day hour:minute:second
- Time_Ended: time ended, formatted as year-month-day hour:minute:second
- Time_Elapsed: in seconds
- Task: the line from your task file
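Because the status file is plain tab-separated text, it is straightforward to summarize with a script. The Python 3 sketch below assumes the file has no header row and that the columns appear in the order listed above. `read_status` and `summarize_status` are hypothetical helpers, not part of dSQ.

```python
COLUMNS = ("Task_ID", "Exit_Code", "Time_Started", "Time_Ended", "Time_Elapsed", "Task")


def read_status(lines):
    """Parse status lines (an open file or any iterable of strings) into dicts."""
    rows = []
    for line in lines:
        # Split on the first five tabs only, so a tab inside the task text survives.
        fields = line.rstrip("\n").split("\t", 5)
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["Task_ID"] = int(row["Task_ID"])
        row["Exit_Code"] = int(row["Exit_Code"])
        row["Time_Elapsed"] = float(row["Time_Elapsed"])
        rows.append(row)
    return rows


def summarize_status(rows):
    """Return (number succeeded, number failed, total elapsed seconds)."""
    failed = sum(1 for r in rows if r["Exit_Code"] != 0)
    total = sum(r["Time_Elapsed"] for r in rows)
    return len(rows) - failed, failed, total
```

For example, `with open("job_2629186_status.tsv") as fh: rows = read_status(fh)` loads the finished tasks, and `summarize_status(rows)` gives a quick count of successes and failures.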
Additionally, Slurm will honor the `-o,--output` and `-e,--error` arguments you provide to capture stdout and stderr. By default, both standard output and standard error for each task in a job array are directed to a file named "slurm-%A_%a.out", where "%A" is replaced with the job ID and "%a" with the array index, which is conveniently also the zero-based line number from your task file. We recommend inspecting these outputs for troubleshooting individual failed tasks.
Once the dSQ job is finished, you can use dSQAutopsy to create both a report of the run and a new task file that contains just the tasks that failed.
$ dSQAutopsy --help
usage: dSQAutopsy taskfile status.tsv

Dead Simple Queue Autopsy v0.4
https://github.com/ycrc/dSQ
A helper script for analyzing the success state of your tasks after a dSQ
run has completed. Specify the taskfile and the status.tsv file generated
by the dSQ job and dSQAutopsy will print the tasks that didn't run or
completed with non-zero exit codes. It will also report count of each to
stderr.

positional arguments:
  taskfile       Task file, one task per line
  statusfile     The status.tsv file generated from your dSQ run

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
We recommend that you use redirection to separate the report from the failed tasks:
dSQAutopsy taskfile.txt job_2629186_status.tsv > failedtasks.txt 2> report.txt
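For intuition, the selection described in the help text (tasks that didn't run or that exited with a non-zero code) can be approximated in a few lines of Python 3. This is an illustrative sketch, not dSQAutopsy's actual implementation. It assumes the first two tab-separated columns of the status file are the task index and the exit code, and it treats blank and `#` lines in the task file as never run.

```python
def tasks_to_rerun(task_lines, status_lines):
    """Return task lines that have no status entry or a non-zero exit code."""
    exit_codes = {}
    for line in status_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0].strip().isdigit():
            exit_codes[int(fields[0])] = int(fields[1])

    rerun = []
    for index, task in enumerate(task_lines):
        task = task.strip()
        if not task or task.startswith("#"):
            continue  # blank and comment lines are not tasks
        if exit_codes.get(index) != 0:
            # Either missing from the status file (never ran) or failed.
            rerun.append(task)
    return rerun
```

The returned list can be written out one task per line to form a new task file, which is what the redirected `failedtasks.txt` above is used for.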