Optimize Job I/O


Input/Output (I/O) is one of the major areas of difficulty on our clusters. Because storage, including your home directory and any project or scratch areas you may have, is mounted consistently across the login and compute nodes, it may not be obvious when I/O will become a problem, yet it is often the major performance issue on the cluster. By understanding the storage system and using some relatively simple techniques, you can avoid imposing unnecessary load on the storage, which will make your (and others’) codes run faster.

We discuss four techniques below: Linux pipes, named pipes, RAM filesystems, and local disk filesystems.

Linux Pipes

The liberal use of temporary files can have a dramatic impact on I/O. We often see this in genomics pipelines: because the files tend to be very large and the computation done in each step is small, I/O contention becomes severe.

Consider the following example.

$ gunzip test_export.txt.gz # produces test_export.txt

$ perl filter.pl test_export.txt test_filtered.txt # does some kind of filtering on the reads

$ perl illumina_export2sam.pl --read1=test_filtered.txt > test_filtered.sam # convert to sam format

$ samtools view -bS -t hg19.fa.fai test_filtered.sam -o test_filtered.bam # convert to bam format

$ samtools sort test_filtered.bam test_sorted # sort bam file

A pipeline designed this way takes one file, produces 4 intermediate files, and one final output file, all of roughly the same size as the input file. In addition, the “samtools sort” will almost certainly create a large number of temp files for partial results during the sort. Each of those files has to be sent across the network, written to the filesystem, and then read back from the filesystem and back across the network. Since each export file can be several gigabytes, and typical runs can involve hundreds of input files, it places a huge load on the network and storage system.

Linux pipes, in contrast, directly connect the output of one command to the input of another, completely avoiding the filesystem. In the shell, they are represented by a vertical bar “|”. Several commands can be strung together this way:

command1 | command2 | command3

The example above can be rewritten as follows:

$ gunzip -c test.export.txt.gz \
| perl filter.pl - - \
| perl illumina_export2sam.pl --read1=- \
| samtools view -bS -t hg19.fa.fai - \
| samtools sort - test.sorted

Thus, using pipes converts 5 commands into a single command line. To make this more readable, we used the shell's line-continuation character “\” to break the command into 5 lines (the “\” MUST be the last character on its line, and the continued line must follow immediately, with no blank line in between).

A few notes of caution when using Linux pipes:

  • The individual invocations had to be modified slightly so that they read from standard input (stdin) and write to standard output (stdout) rather than files. For gunzip we use the “-c” flag. Many programs follow the convention that “-” signifies stdin or stdout. Since programs may handle this differently, you will need to figure out how to do this for your programs.
  • A nice side effect of using pipes in the above example is that all 5 commands in the pipeline run in parallel. As each read flows through the pipe, it can be processed by the next step. Assuming you have several cores allocated to the job, the entire pipeline will run in the time required by the slowest step, rather than the total of all the steps. The fact that all 5 commands run in parallel implies that you should make sure to allocate an appropriate number of cores so that you are not stealing cycles from other users sharing the same node.
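The parallel behaviour described above can be observed with a toy pipeline. This is only a sketch: the stage function is a hypothetical stand-in in which "sleep 1" plays the role of one second of computation, and the timing uses whole seconds from "date +%s", so the measured value is approximate.

```shell
set -eu

# Hypothetical stand-in for one pipeline step: about one second of
# "work", then pass the data through unchanged.
stage() { sleep 1; cat; }

start=$(date +%s)
# All three stages are started at the same time by the shell.
result=$(echo "read1" | stage | stage | stage)
end=$(date +%s)
elapsed=$((end - start))

echo "result:  $result"
echo "elapsed: ${elapsed}s (run one after another, the stages would need about 3s)"
```

Because the three stages overlap, the pipeline finishes in roughly the time of a single stage. The same overlap is the reason a real pipeline needs one core per busy stage.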

Named Pipes

Occasionally you may run into a program that does not know how to handle stdin and stdout, and stubbornly insists on files. In that case, you can create a “named pipe”, which looks just like a file but is actually a pipe. Named pipes are also called fifos, and are created by the command mkfifo.

Imagine, for example, that illumina_export2sam.pl required files, and you couldn’t or didn’t want to modify it to use stdin/stdout. Here is the modified pipeline:

$ mkfifo e2si
$ mkfifo e2so

$ gunzip -c test.export.txt.gz \
| perl filter.pl - e2si  \
| perl illumina_export2sam.pl --read1=e2si --output=e2so \
| samtools view -bS -t hg19.fa.fai e2so \
| samtools sort - test.sorted

$ rm e2si e2so

Note a few things:

  • We explicitly create and delete the named pipes using mkfifo and rm.
  • Although illumina_export2sam.pl is connected by | pipes to the commands before and after it, no data flows through those two pipes; the data travels through the named pipes e2si and e2so instead. The | pipes serve only to start all 5 commands together so that they run in parallel.
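The same idea can be tried without any genomics tools. In the sketch below, file_only_tool is a hypothetical stand-in for a program that insists on file names, and the pipe names "in" and "out" are made up. Instead of joining the commands with |, it starts the producer and the file-only step in the background with "&", which is another common way to get them running concurrently. A trap removes the named pipes even if a step fails.

```shell
set -eu

# Hypothetical stand-in for a program that only accepts file names:
# reads the file named by $1, writes upper-cased text to the file named by $2.
file_only_tool() {
    tr 'a-z' 'A-Z' < "$1" > "$2"
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT      # remove the named pipes on any exit
mkfifo "$workdir/in" "$workdir/out"

# Producer and file-only step run in the background; each blocks on its
# named pipe until the process at the other end has opened it.
printf 'read2\nread1\n' > "$workdir/in" &
file_only_tool "$workdir/in" "$workdir/out" &

# Consumer reads from the second named pipe.
result=$(sort "$workdir/out")
wait                               # collect the background jobs

echo "$result"
```

No data ever lands on disk here: the two entries created by mkfifo take up no space beyond their directory entries.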

RAM and Local Filesystems

Although pipes are very useful, they only work if your program reads its data sequentially. If your application or program accesses data in a file randomly, pipes will not work. In these cases, consider using either a RAM filesystem or a local disk filesystem as described below.

RAM filesystem

On each compute node, there is a filesystem (/dev/shm) that lives in memory (RAM). Directories and files can be created there in the usual way, but using them is extremely fast and places no load on the network or the global filesystem. Within the limits of available memory, any number of files and directories can be created, just like on a normal filesystem, and all the usual file operations work, including random access, permissions, etc.

The same example using a RAM filesystem is shown below.

mkdir /dev/shm/rdb9
gunzip -c test.export.txt.gz \
| perl filter.pl - /dev/shm/rdb9/tmp1 # must finish writing tmp1 before the next step reads it
perl illumina_export2sam.pl --read1=/dev/shm/rdb9/tmp1 --output=/dev/shm/rdb9/tmp2
samtools view -bS -t hg19.fa.fai /dev/shm/rdb9/tmp2 \
| samtools sort - test.sorted
rm -rf /dev/shm/rdb9

A few things to keep in mind:

  • It is good practice to create a subdirectory for your files (preferably named after your netid).
  • It is very important to delete any files you created in /dev/shm when you are done. These files will consume RAM until the node happens to be rebooted, since there is no automated cleanup utility currently in place.
  • RAM used in this way is unavailable for other purposes, such as your program's data structures or other users' jobs. Please make sure you leave enough memory to run your program, and be considerate of other users by not allocating too much RAM to files.
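One way to make the cleanup reliable is a shell trap that removes the directory whenever the script exits, even after an error. The following is a minimal sketch, not a site-provided tool. The file reads.txt and its contents are made up for illustration, and the fallback to /tmp is an assumption whose only purpose is to let the sketch run on machines without a writable /dev/shm.

```shell
set -eu

# Prefer the RAM filesystem; fall back to /tmp if /dev/shm is not usable
# (assumption made only so that the sketch runs anywhere).
base=/dev/shm
if [ ! -d "$base" ] || [ ! -w "$base" ]; then
    base=${TMPDIR:-/tmp}
fi

# Private directory whose name starts with the user name (netid).
scratch=$(mktemp -d "$base/${USER:-job}.XXXXXX")
trap 'rm -rf "$scratch"' EXIT      # always free the RAM on exit

# Made-up data: four fixed-width records of 5 bytes each (4 letters + newline).
printf 'AAAA\nCCCC\nGGGG\nTTTT\n' > "$scratch/reads.txt"

# Random access, which a pipe cannot offer: jump straight to the third record.
third=$(dd if="$scratch/reads.txt" bs=5 skip=2 count=1 2>/dev/null)

echo "third record: $third"
```

Using mktemp rather than a fixed directory name also avoids collisions when several of your jobs land on the same node.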

Local Filesystem

Each compute node also has a local hard drive, mounted as /tmp. This filesystem is not as fast as /dev/shm, since it is a physical hard drive. However, it does have the advantage of being local to the node, and not shared. You will probably see substantially better performance using it instead of the global filesystem. Just like the RAM filesystem, you can create directories and files there in the usual way.

Our standard example, using the local filesystem, is shown below.

mkdir /tmp/rdb9
gunzip -c test.export.txt.gz \
| perl filter.pl - /tmp/rdb9/tmp1 # must finish writing tmp1 before the next step reads it
perl illumina_export2sam.pl --read1=/tmp/rdb9/tmp1 --output=/tmp/rdb9/tmp2
samtools view -bS -t hg19.fa.fai /tmp/rdb9/tmp2 \
| samtools sort - test.sorted
rm -rf /tmp/rdb9

A few things to keep in mind:

  • As with /dev/shm, please create a subdirectory in /tmp using your netid.
  • /tmp is also known as /state/partition1. However, we prefer that you refer to it as /tmp.
  • It is very important that you delete any files you create in /tmp when you are done. Otherwise these files will consume space in the local drive until it fills and we manually clean it up. We don’t currently have a way to do this automatically.
  • Space used in this way will be unavailable to other users. If you are sharing a node with other users (because you did not allocate a full node) be careful not to use more than your fair share.
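To keep an eye on your share of the local disk, the standard du and df utilities are enough. The sketch below is illustrative only: the 1 MiB file written with dd is a made-up stand-in for real intermediate data, and the directory is created with mktemp rather than under a fixed name.

```shell
set -eu

# Private directory on the node-local disk, named after the user (netid).
scratch=$(mktemp -d "${TMPDIR:-/tmp}/${USER:-job}.XXXXXX")
trap 'rm -rf "$scratch"' EXIT      # give the space back on exit

# Stand-in for an intermediate file: 1 MiB of zeros.
dd if=/dev/zero of="$scratch/tmp1" bs=1024 count=1024 2>/dev/null

# Space used by our directory, and space still free on that filesystem (in KB).
used_kb=$(du -sk "$scratch" | awk '{print $1}')
free_kb=$(df -Pk "$scratch" | awk 'NR==2 {print $4}')

echo "using ${used_kb} KB; ${free_kb} KB still free on the filesystem holding $scratch"
```

A job could run such a check before staging a large file and stop early if the free space is too small.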

Concluding remarks

In the examples used throughout this tutorial, we explicitly uncompressed the input file. Many tools accept compressed files directly. Check to see if your tools can too, or rewrite them to handle compressed files.

Samtools sort does read and write sequentially, but internally it needs to keep lists of partially sorted reads. It will try to do that in memory, but if the input file is too large, it will create temporary files named .%d.bam in the current working directory. There are a couple of ways to avoid creating these temp files on the global filesystem:

1. Use the -m flag of samtools sort to increase the amount of RAM set aside for the sort.

2. Execute samtools sort with its working directory set to either the RAM filesystem or the /tmp filesystem, as shown below:

gunzip -c test.export.txt.gz \
| perl filter.pl - /tmp/rdb9/tmp1 # must finish writing tmp1 before the next step reads it
perl illumina_export2sam.pl --read1=/tmp/rdb9/tmp1 --output=/tmp/rdb9/tmp2
samtools view -bS -t hg19.fa.fai /tmp/rdb9/tmp2 \
| (cd /dev/shm && mkdir -p rdb9 && cd rdb9 && samtools sort - /mysample_sorted)
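The working-directory trick can be tried with a stand-in tool. In this sketch, messy_sort is a hypothetical function that, like the behaviour described for samtools sort, drops a temporary file into its current working directory. The names spill.0.tmp and sorted.txt are made up. Note that the output is given as an absolute path, because a relative path would be resolved inside the scratch directory once the subshell has changed directory.

```shell
set -eu

# Hypothetical stand-in for a tool that writes temporary files into its
# current working directory and its result to the path given as $1.
messy_sort() {
    cat > spill.0.tmp              # spill stdin into the working directory
    sort spill.0.tmp > "$1"        # the spill file is left behind on purpose
}

results=$(mktemp -d "${TMPDIR:-/tmp}/${USER:-job}.XXXXXX")   # stands in for the global filesystem
scratch=$(mktemp -d "${TMPDIR:-/tmp}/${USER:-job}.XXXXXX")   # stands in for /dev/shm or /tmp
trap 'rm -rf "$results" "$scratch"' EXIT

out="$results/sorted.txt"          # absolute path for the final output

# Only the subshell changes directory; the rest of the script is unaffected.
printf 'c\na\nb\n' | (cd "$scratch" && messy_sort "$out")

echo "results directory: $(ls "$results")"
echo "scratch directory: $(ls "$scratch")"
```

Chaining the subshell commands with && rather than ; means the tool is never started in the wrong directory if the cd fails.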