Archived Sequencing Data on Ruddle

Retrieving Archived Data

In the archive, a directory exists for each run, holding one or more tarfiles. There is a main tarfile, plus a tarfile for each project directory. Most users only need the project tarfile corresponding to their data.

Although the archive physically resides on tape, you can treat it as a regular directory tree. Many operations, such as listing directories and cd'ing, are very fast, since directory structures and file metadata are kept in a disk cache. However, when you actually read the contents of a file, the tape is mounted and the file is read into the disk cache, so the first read of a file can be slow.

Archived runs are stored in the following locations:

Original location                           Archive location
/panfs/sequencers*                          /SAY/archive/YCGA-729009-YCGA/archive/panfs/sequencers*
/ycga-ba/ba_sequencers*                     /SAY/archive/YCGA-729009-YCGA/archive/ycga-ba/ba_sequencers*
/ycga-gpfs/sequencers/illumina/sequencers   /SAY/archive/YCGA-729009-YCGA/archive/ycga-gpfs/sequencers/illumina/sequencers
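In other words, the archive preserves the original directory tree under the prefix /SAY/archive/YCGA-729009-YCGA/archive, so you can construct the archive path by prepending that prefix. A quick sketch (the run path below is a made-up example; substitute your own run directory):

```shell
# The archive keeps the original tree under this prefix.
archive_prefix=/SAY/archive/YCGA-729009-YCGA/archive

# Hypothetical original run directory, for illustration only.
orig_path=/ycga-ba/ba_sequencers2/sequencerX/runs/EXAMPLE_RUN

# The corresponding archive location is just the concatenation.
echo "${archive_prefix}${orig_path}"
```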

You can directly copy or untar the project tarfile into a scratch directory.

$ cd ~/scratch60/somedir
$ tar -xvf /SAY/archive/YCGA-729009-YCGA/archive/path/to/file.tar

Inside the project tarfiles are the fastq files, which have been compressed using quip. If your pipeline cannot read quip files directly, you will need to uncompress them before using them:

$ module load Quip
$ quip -d M20_ACAGTG_L008_R1_009.fastq.qp
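A project typically contains many fastq files, so you may want to decompress them all at once. A minimal sketch, assuming the Quip module is loaded and you are in the extracted project directory:

```shell
# Decompress every quip-compressed fastq in the current directory.
# quip -d replaces each NAME.fastq.qp with NAME.fastq.
for f in *.fastq.qp; do
    [ -e "$f" ] || continue   # no matching files: skip the loop body
    quip -d "$f"
done
```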

For your convenience, we have a tool that will download a tarfile, untar it, and uncompress all quip files:

$ module load ycga-public
$ restore -t /SAY/archive/YCGA-729009-YCGA/archive/path/to/file.tar

If you have trouble locating your files, you can use the locateRun utility with any substring of the original run name. locateRun is in the same module as restore.

$ locateRun C9374AN

restore spends most of its time running quip. You can parallelize that step, and thereby speed it up, by doing:

$ restore -n 20 ...

When doing this, make sure to:

  • Run on a compute node, NOT the login node
  • Provide sufficient CPUs by passing -c 20 to sbatch/srun
  • Request sufficient memory: e.g. --mem=100G
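Putting these together, a batch job for a parallel restore might look like the following sketch. The partition and time limit are placeholders to adjust for your account, and the tarfile path stands in for your own archive path, as in the examples above:

```bash
#!/bin/bash
#SBATCH -c 20             # CPUs for restore -n 20
#SBATCH --mem=100G        # sufficient memory for parallel quip
#SBATCH -t 4:00:00        # placeholder time limit

module load ycga-public
cd ~/scratch60/somedir
restore -n 20 -t /SAY/archive/YCGA-729009-YCGA/archive/path/to/file.tar
```

Submit it with sbatch so the work runs on a compute node rather than the login node.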