General workflow

eXpress is a general abundance estimation tool that can work with any set of target sequences (an annotation) and high-throughput sequencing reads. A target sequence can be any genomic region, such as a transcript in RNA-Seq. Thus, the general workflow is:

  1. Choose your favorite dataset
  2. Generate a set of target sequences
  3. Align the fragments to target sequences
  4. Input the fragments to eXpress and perform target sequence abundance estimation
  5. ... additional downstream analysis ...

Or seen pictorially here:

general flowchart.png

In this tutorial we will use the following tools/resources:


Other useful tools (not just for RNA-Seq) can be found here.



Use case: No (or poor?) reference genome and no annotation: The yak

In some cases, you will be studying an organism that does not have a reference genome, or has a poor reference genome. This typically means that you also do not have a transcriptome. The next step is usually to de novo assemble the transcriptome. For this example we will be using Bowtie 2 to perform fragment alignments.

Obtaining the data

To acquire this data, we will use the GEO accession number. If you don't have a GEO accession number and would simply like to browse for data, you can follow the tutorial in the Appendix. This is actually how I found the data.

For this part of the demo, we will be looking at the transcriptome of the yak, Bos grunniens.

yak.jpg


Why the yak? It so happens that the yak, and even its wool, is naturally odorless. Don't believe me? Check out their Wikipedia page. The reference is legit. :P

To download the data, simply go to GEO and enter the accession number "GSE33300" into the query box for "GEO accession" and click "GO".

geoAccession.png

This will take you directly to the main experiment page. There are six samples here from different organs. We will be looking at "GSM823609 brain". Click the sample, then download the SRA file by clicking (ftp) on the same row as "SRA Experiment," clicking into the directory, and downloading SRR361433.sra. This is a paired-end experiment, so we will extract it using

$ fastq-dump --split-3 SRR361433.sra

The result is two files, SRR361433_1.fastq and SRR361433_2.fastq. Note the use of --split-3: it is necessary only when you have paired-end reads.
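As a quick sanity check that the two mate files stayed in sync, you can compare their read counts (fastq-dump writes each FASTQ record as exactly four lines). The sketch below runs on a toy FASTQ pair it creates itself; substitute SRR361433_1.fastq and SRR361433_2.fastq for the real check.

```shell
# Make a toy paired-end FASTQ pair (two records per mate) for illustration.
dir=$(mktemp -d)
for mate in 1 2; do
    printf '@read1/%s\nACGT\n+\nIIII\n@read2/%s\nTGCA\n+\nIIII\n' \
        "$mate" "$mate" > "$dir/sample_${mate}.fastq"
done

# Each FASTQ record is 4 lines, so reads = lines / 4.
left=$(( $(wc -l < "$dir/sample_1.fastq") / 4 ))
right=$(( $(wc -l < "$dir/sample_2.fastq") / 4 ))
echo "left: $left right: $right"

# Mate files produced by --split-3 should always match.
[ "$left" -eq "$right" ] && echo "mates in sync"
```

If the counts ever disagree, something went wrong during extraction and the aligner will mispair reads.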

Annotation by de novo assembly

Assembly is a beast on its own, so we will not cover it here. For this demo, we spent some time and made a completely de novo assembly without a genome using Trinity.

We ran Trinity with the following parameters:

$ Trinity.pl --seqType fq --JM 200G --left SRR361433_1.fastq --right SRR361433_2.fastq --CPU 20

Within a few hours, we had a number of output files in trinity_out_dir, including the assembly Trinity.fasta. This serves as your new annotation. From here on, the analysis is exactly the same as when you have a reference genome (our next use case).
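A quick way to see how large your new annotation is, is to count FASTA headers. This sketch uses a toy multi-FASTA it writes itself; on the real assembly you would simply run `grep -c '^>' trinity_out_dir/Trinity.fasta`.

```shell
# Toy multi-FASTA standing in for trinity_out_dir/Trinity.fasta.
fasta=$(mktemp)
printf '>comp0_c0_seq1\nACGTACGT\n>comp1_c0_seq1\nTTTTACGT\nACGTAAAA\n>comp2_c0_seq1\nGGGG\n' > "$fasta"

# Every FASTA record starts with '>', so counting those lines counts contigs.
n_contigs=$(grep -c '^>' "$fasta")
echo "contigs: $n_contigs"
```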

You can download the assembly here.

Alignment

Building an index

Before you can do any alignments, you first have to create an index of the target sequences.
$ cd trinity_out_dir
$ bowtie2-build --offrate 1 Trinity.fasta Trinity

This builds an index in trinity_out_dir with base name Trinity from the de novo assembly. This required index allows Bowtie 2 to map reads quickly to the target sequences. --offrate 1 increases the speed of the alignment at the cost of hard disk space.

Performing the alignments

We can run Bowtie 2 with one line now,
$ bowtie2 -a -X 800 -p 4 -x trinity_out_dir/Trinity \
    -1 SRR361433_1.fastq -2 SRR361433_2.fastq | samtools view -Sb - > hits.bam

  • -a - because we want Bowtie 2 to report all alignments (eXpress will handle the multi-mapping issues)
  • -X 800 - Set the maximum fragment length to 800. This is the largest fragment your typical RNA-Seq experiment will have that we will consider
  • -p 4 - Use four CPUs to perform the alignments. You should use as many as available.
  • -x ... - The Bowtie 2 index
  • -1 ... -2 ... - The left and right reads from the RNA-Seq experiment

After a few minutes (or hours, depending on how many CPUs), you should get something like:

[samopen] SAM header is present: 165714 sequences.
32959992 reads; of these:
  32959992 (100.00%) were paired; of these:
    4385612 (13.31%) aligned concordantly 0 times
    22571641 (68.48%) aligned concordantly exactly 1 time
    6002739 (18.21%) aligned concordantly >1 times
    ----
    4385612 pairs aligned concordantly 0 times; of these:
      352368 (8.03%) aligned discordantly 1 time
    ----
    4033244 pairs aligned 0 times concordantly or discordantly; of these:
      8066488 mates make up the pairs; of these:
        5942057 (73.66%) aligned 0 times
        1356189 (16.81%) aligned exactly 1 time
        768242 (9.52%) aligned >1 times
90.99% overall alignment rate

It took me about 40 minutes to run this data through Bowtie 2 using 40 cores.

Running eXpress


We can now run eXpress. Just a quick recap, you should have the following:

  • A set of target references (annotation) in multi-FASTA format
  • Alignments in the BAM format

Running eXpress is as simple as,

$ express -o xprs_out trinity_out_dir/Trinity.fasta hits.bam

Argument -o makes a new directory and stores the results in xprs_out. The next argument is the multi-FASTA set of target sequences. Finally, hits.bam is the set of aligned reads.

You will see some output that looks like the following:

Attempting to read 'hits.bam' in BAM format...
Parsing BAM header...
Loading target sequences and measuring bias background...
 
Processing input fragment alignments...
Synchronized parameter tables.
 
Fragments Processed (hits.bam): 1000000          Number of Bundles: 145443
Synchronized parameter tables.
Fragments Processed (hits.bam): 2000000          Number of Bundles: 144074

Eventually it will finish and you will have the files
  • results.xprs
  • params.xprs

On our machine, using only two cores, eXpress completed the analysis in only 30 minutes with 28,613,101 fragments.

Analyzing results.xprs

results.xprs is simply a tab-delimited file that contains:

  • bundle_id - The bundle that the particular target belongs to
  • target_id - The name of the target from the multi-FASTA annotation
  • length - The true length of the target sequence
  • effective length - The length adjusted for biases
  • total counts - The number of fragments that mapped to this target
  • unique counts - The number of unique fragments that mapped to this target
  • estimated counts - The number of counts estimated by the online EM algorithm
  • effective counts - The number of fragments adjusted for biases: est_counts * length / effective_length
  • alpha - Beta-binomial alpha parameter
  • beta - Beta-binomial beta parameter
  • FPKM - Fragments Per Kilobase per Million Mapped. Proportional to (estimated counts) / (effective length)
  • FPKM low - The lower bound of the FPKM confidence interval
  • FPKM high - The upper bound of the FPKM confidence interval
  • solvable flag - Whether or not the likelihood has a unique solution

You can then take this file (results.xprs) and import it as a spreadsheet into Excel (or LibreOffice, etc.). You might then be interested in sorting by FPKM:

assemblyResults.png
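If you prefer the command line to a spreadsheet, you can pull out the most abundant targets directly. Assuming the column order listed above (target_id in column 2, FPKM in column 11 — check the header of your own results.xprs), here is a sketch on a toy results.xprs:

```shell
# Toy tab-delimited results.xprs: a header row plus three data rows.
xprs=$(mktemp)
printf 'bundle_id\ttarget_id\tlength\teff_length\ttot_counts\tuniq_counts\test_counts\teff_counts\talpha\tbeta\tFPKM\tFPKM_low\tFPKM_high\tsolvable\n' > "$xprs"
printf '1\ttargetA\t1000\t800\t50\t40\t45\t56.25\t1\t1\t12.5\t10\t15\tT\n' >> "$xprs"
printf '1\ttargetB\t2000\t1800\t500\t450\t480\t533.3\t1\t1\t99.9\t90\t110\tT\n' >> "$xprs"
printf '2\ttargetC\t500\t300\t5\t5\t5\t8.3\t1\t1\t3.2\t2\t4\tT\n' >> "$xprs"

# Skip the header, sort numerically on FPKM (column 11) in descending order,
# and grab the target name of the top row.
top=$(tail -n +2 "$xprs" | sort -k11,11nr | head -1 | cut -f2)
echo "highest FPKM: $top"
```

On a real results.xprs the same pipeline gives you the top expressed targets without leaving the shell.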

You can download the results here



Use case: Big data with an existing annotation

Here, we will look at an example using very deeply sequenced data. We will be looking at the data from Peng et al. in Nature Biotechnology. In total, there are ~520 million reads. We will be looking only at the poly(A)- dataset, which is 402 million reads.

Obtaining the data

If you already have a dataset in mind, then typically you can find an accession number associated with the dataset. In this case, the authors provide us with the Short Read Archive (SRA) accession number at the end of the publication. If you click it, or search the SRA for the number (SRA043767.1) you will have no luck. Simply chop off the '.1' at the end and you will see the data.
sraSearch.png
There are many datasets here, but we are only interested in the poly(A)- ones for now. Sometimes you can figure out which ones to download by simply reading the descriptions, but this typically is not the case. To find this information, we read the "Supplemental Information" and found the page describing the data (Supplementary Table 1. Overview of sequencing samples).
suppTable.png
According to the table, we are interested in HUMwktTBRAAPE_* and HUMwktTBRBAPE. We first click HUMwktTBRAAPE. Whenever downloading samples from SRA, it is important to note whether they are single-end or paired-end as this will change the way you extract the FASTQ files. If we look at "Spot description", we see that the reads are paired-end. Click the links under size, and then download the two files SRR324687.sra and SRR325616.sra. Do the same for HUMwktTBRBAPE and the corresponding SRA file SRR324688.sra.

sraDownload.png

The reads are stored using SRA format, but most tools use the plaintext format, FASTQ. Luckily, we can easily extract FASTQ files from the SRA files using fastq-dump from the SRA Toolkit.

$ fastq-dump --split-3 SRR324687.sra 
$ fastq-dump --split-3 SRR324688.sra 
$ fastq-dump --split-3 SRR325616.sra

Note the use of --split-3: it is necessary only when you have paired-end reads. Once fastq-dump is complete, you should have six *.fastq files.
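With several runs to extract, a small loop saves typing. The sketch below is a dry run that only echoes each command; drop the echo to actually invoke fastq-dump:

```shell
# Dry run: print the extraction command for each SRA file.
# Remove 'echo' to actually extract the FASTQ files.
for sra in SRR324687.sra SRR324688.sra SRR325616.sra; do
    echo fastq-dump --split-3 "$sra"
done
```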

Generating a set of target sequences (with an existing annotation)


Before you do any quantification, you should determine whether there is a set of target sequences that are sufficient for your analysis. For RNA-Seq, this would be a transcript annotation, typically in Gene Transfer Format (GTF). Since eXpress works with general target sequences, the GTF file must be used to create one FASTA record per transcript.

If you are analyzing a common species such as human, there are many existing annotations. If you are analyzing something a little more obscure (for example, the yak as seen above), then you might need to do a de novo assembly.

The easiest way to create a set of target sequences is to use an existing annotation. It is even easier if your organism is on the UCSC genome browser. If your annotation is not on the genome browser, or neither your annotation nor your reference genome is, look at the appendix. Here are the steps for the simple genome browser case:

  1. Go to http://genome.ucsc.edu
  2. Click "Tables" near the top menu
  3. Select the genome
    • We will be looking at Mammal -> Human -> hg19
  4. Select the annotation
    • We will look at 'Ensembl'. Click track: "Ensembl Genes"
  5. Change the "output format" to "sequence"
  6. Type a descriptive name
    • i.e.: ensembl-hg19.fa.gz
  7. Compress to save on the download time
  8. The final screen will look something like this:
    humanTranscriptome.png
  9. Make sure everything is okay, then click "get output". At the next screen, click "genomic" and click submit. If you don't see this, you did something wrong! Hit back!
  10. IMPORTANT: At the next screen, make sure "Introns" is NOT selected. The remaining defaults are fine for most applications. Click "get sequence" and your download should begin. Your screen should look similar to the one below.
    humanNoIntrons.png

As a sanity check, let's unzip this file and take a look inside.

$ gunzip ensembl-hg19.fa.gz
$ head ensembl-hg19.fa
>hg19_ensGene_ENST00000237247 range=chr1:66999066-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AGTTTGATTCCAGAGCCCCACTCGGCGGACGGAATAGACCTCAGCAGCGG
CGTGGTGAGGACTTAGCTGGGACCTGGAATCGTATCCTCCTGTGTTTTTT
CAGACTCCTTGGAAATTAAGGAATGCAATTCTGCCACCATGATGGAAGGA
TTGAAAAAACGTACAAGGAAGGCCTTTGGAATACGGAAGAAAGAAAAGGA
CACTGATTCTACAGGTTCACCAGATAGAGATGGAATTCAGGGGAAAAAAA
AGACCCAGAAGACTCAGCTTCTCCTCACCTCTTGCTTCTGGCTCAGAGCC
CTCTCGTTAACTCTGTCTCAGAAGAAAAGCAATGGGGCACCAAATGGATT
TTATGCGGAAATTGATTGGGAAAGATATAACTCACCTGAGCTGGATGAAG
AAGGCTACAGCATCAGACCCGAGGAACCCGGCTCTACCAAAGGAAAGCAC

Now that you have your multi-FASTA target sequence file, you can build a Bowtie index.

Alignment

Building an index

Using Bowtie,

$ bowtie-build --offrate 1 ensembl-hg19.fa ensembl-hg19

bowtie-build builds a Bowtie index of the annotation. This required index allows Bowtie to map reads quickly to the target sequences. --offrate 1 increases the speed of the alignment at the cost of hard disk space.

This index took me about 25 minutes to make. Ensembl is typically a pretty large annotation.

Note that you only need to do this once if you are going to reuse the same annotation.

Performing alignments

We can now perform the actual alignments.

$ bowtie -aS --chunkmbs 128 -v 3 -X 800 -p 40 /home/lmcb/bowtie_indices/ensembl-hg19/ensembl-hg19 \ 
    -1 SRR324687_1.fastq,SRR324688_1.fastq,SRR325616_1.fastq \
    -2 SRR324687_2.fastq,SRR324688_2.fastq,SRR325616_2.fastq \
    | samtools view -Sb - > hits.bam

The parameters here are slightly different since we are using Bowtie:
  • -a - because we want Bowtie to report all alignments (eXpress will handle the multi-mapping issues)
  • -S - to output in SAM format
  • -v 3 - will allow up to three mismatches
  • -X 800 - Set the maximum fragment length to 800. This is the largest fragment your typical RNA-Seq experiment will have that we will consider
  • -p 40 - Use 40 CPUs to perform the alignments. You should use as many as available.
  • -1 ... -2 ... - The left and right reads from the RNA-Seq experiment

We then pipe (|) the SAM output into samtools to create a BAM file, then store it in hits.bam.

Of course, if you have many replicates or many time points, you will likely be doing this over and over again. It is a good idea to have one directory per experiment. In this spirit, you can use my script below in the Appendix.

[samopen] SAM header is present: 181648 sequences.
# reads processed: 421836549
# reads with at least one reported alignment: 184591763 (43.76%)
# reads that failed to align: 237244786 (56.24%)
Reported 242921711 paired-end alignments to 1 output stream(s)

With 40 cores, this alignment took approximately 11 hours to complete. You might notice that the alignment rate is somewhat low. This is somewhat expected, since the subject was of Chinese descent and thus genetically different from the reference genome.

Running eXpress for abundance estimation


Since the alignments have the same file name as in the previous use case, the eXpress command is nearly identical.

$ express -o xprs_out /home/lmcb/bowtie_indices/ensembl-hg19/ensembl-hg19.fa hits.bam

Argument -o makes a new directory and stores the results in xprs_out. The next argument is the multi-FASTA set of target sequences. Finally, hits.bam is the set of aligned reads.

You will see some output that looks like the following:

Attempting to read 'hits.bam' in BAM format...
Parsing BAM header...
Loading target sequences and measuring bias background...
 
Processing input fragment alignments...
Synchronized parameter tables.
Fragments Processed (hits.bam): 1000000      Number of Bundles: 135798
Fragments Processed (hits.bam): 2000000      Number of Bundles: 124769
Fragments Processed (hits.bam): 3000000      Number of Bundles: 118656
Fragments Processed (hits.bam): 4000000      Number of Bundles: 114476
....

Eventually it will finish and you will have the files
  • results.xprs
  • params.xprs

On our machine, using only two cores, eXpress completed the analysis in only 3 hours.

Analyzing results.xprs

results.xprs is simply a tab-delimited file that contains:

  • bundle_id - The bundle that the particular target belongs to
  • target_id - The name of the target from the multi-FASTA annotation
  • length - The true length of the target sequence
  • effective length - The length adjusted for biases
  • total counts - The number of fragments that mapped to this target
  • unique counts - The number of unique fragments that mapped to this target
  • estimated counts - The number of counts estimated by the online EM algorithm
  • effective counts - The number of fragments adjusted for biases: est_counts * length / effective_length
  • alpha - Beta-binomial alpha parameter
  • beta - Beta-binomial beta parameter
  • FPKM - Fragments Per Kilobase per Million Mapped. Proportional to (estimated counts) / (effective length)
  • FPKM low - The lower bound of the FPKM confidence interval
  • FPKM high - The upper bound of the FPKM confidence interval
  • solvable flag - Whether or not the likelihood has a unique solution

You can then take this file (results.xprs) and import it as a spreadsheet into Excel (or LibreOffice, etc.). You might then be interested in sorting by FPKM:

humanResults.png


You can download the results here



Appendix: Useful scripts


#!/bin/bash
# Author: Harold Pimentel
 
if [ $# -eq 0 ]; then
    echo "Usage: bowtie2.sh DIR1 [DIR2 DIR3 ... DIRN]" 1>&2
    exit 1
fi
 
# TODO: Configure the variables in this block!
BWT2=`which bowtie2`
IDX=/home/lmcb/bowtie2_indices/ensembl-mm9/ensembl-mm9
N_THR=40
 
for DIR in "$@"
do
    echo "Aligning reads in $DIR..."
 
    if [ ! -e $DIR ]; then
        echo "ERROR: Couldn't find directory '${DIR}'" 1>&2
        echo -e "\tskipping...." 1>&2
        continue
    fi
 
    LEFT=`ls ${DIR}/*_1.fastq | sed ':a;N;$!ba;s/\n/,/g'`
    RIGHT=`ls ${DIR}/*_2.fastq | sed ':a;N;$!ba;s/\n/,/g'`
 
    if [ -z "$LEFT"  ] || [ -z "$RIGHT" ]; then
        echo "Didn't find matching paired-end reads in $DIR"
        echo "Trying single-end..."
 
        SINGLE=`ls ${DIR}/*_1.fastq | sed ':a;N;$!ba;s/\n/,/g'`
        if [ -z "$SINGLE" ]; then
            echo "ERROR: Couldn't find any single-end reads either... skipping $DIR"
            continue
        fi
    fi
 
    if [ -z "$SINGLE" ]; then
        CMD="$BWT2 -a -X 800 -p $N_THR -x $IDX -1 $LEFT -2 $RIGHT | samtools view -Sb - > ${DIR}/hits.bam"
    else
        CMD="$BWT2 -a -X 800 -p $N_THR -x $IDX -U $SINGLE | samtools view -Sb - > ${DIR}/hits.bam"
    fi
 
    echo "Executing..."
    echo $CMD
 
    # Before you run it, you might want to comment the next line so that you can do a "dry run"
    # (eval is needed so the pipe and redirection inside $CMD are interpreted by the shell)
    eval "$CMD" 2>&1 | tee "${DIR}/bwt2.log"
 
    unset SINGLE
done

I cannot guarantee this script will work for you, but it works for me and hopefully you can use it. As noted in the comment near the end, you might want to comment out the eval line to do a "dry run": give the printed command a quick visual inspection to make sure everything looks as expected, then run the script again with the line uncommented.

If you copy and paste it into a file somewhere called bowtie2.sh and make it executable, you can use it like this:

$ chmod 755 bowtie2.sh
$ ./bowtie2.sh exp1 exp2 exp3

where exp1, exp2, exp3 are directories with *.fastq reads you want to align. These reads should follow the convention *_1.fastq for left reads and *_2.fastq for right reads.


Appendix: Extracting a multi-FASTA file from custom annotation


Using UCSC Genome Browser

Use this method if:
  • You have an annotation that is not on the Genome Browser
  • There is a reference genome on the Genome Browser
Go to the UCSC Genome Browser and click "Tables" again. Click "add custom tracks" and click "Choose file" to upload the annotation. Click submit.

Once it is done uploading, you will get to a new screen and click "go to table browser."

ucscCustom.png

Follow the same method used in the human dataset to extract the sequences.

Using your own reference on a local machine

If you have TopHat installed, there is a tool included called gtf_to_fasta. The tool requires that you have:

  • Your genome in a multi-FASTA file
  • An annotation in GFF/GTF format

The usage is then simple:

$ gtf_to_fasta ensembl-hg19.gtf hg19.fa ensembl-hg19.fa

where the arguments are [annotation.gtf] [genome.fa] [target out]. Your target sequences will then look something like:

$ head ensembl-hg19.fa 
>99061 chr10:134210672-134231367 ENST00000305233
ggcggcggcggcggcggcggcggcggcggggcgggggcggcCTGGGACGCGGCGGGAGCA
TGGAGCCGCGCGCCGGCTGCCGGCTGCCGGTGCGGGTGGAGCAGGTCGTCAACGGCGCGC
TGGTGGTCACGGTGAGCTGCGGCGAGCGGAGCTTCGCGGGGATCCTGCTGGACTGCACGA
AAAAGTCCGGCCTCTTTGGCCTACCCCCGTTGGCTCCGCTGCCCCAGGTCGATGAGTCCC
CTGTCAACGACAGCCATGGCCGGGCTCCCGAGGAGGGGGATGCAGAGGTGATGCAGCTGG
GGTCCAGCTCCCCCCCTCCTGCCCGCGGGGTTCAGCCCCCCGAGACCACCCGCCCCGAGC
CACCCCCGCCCCTCGTGCCGCCGCTGCCCGCCGGAAGCCTGCCCCCGTACCCTCCCTACT
TCGAAGGCGCCCCCTTCCCTCACCCGCTGTGGCTCCGGGACACGTACAAGCTGTGGGTGC
CCCAGCCGCCGCCCAGGACCATCAAGCGCACGCGGCGGCGTCTGTCCCGCAACCGCGACC

If you prefer to have the transcript name instead of the unique index, you can simply run the file through awk with the following script:

$ awk '{if ($1 ~ /^>/) print ">"$3; else print $0}' ensembl-hg19.fa > ensembl-hg19_fixed_names.fa
$ head ensembl-hg19_fixed_names.fa 
>ENST00000305233
ggcggcggcggcggcggcggcggcggcggggcgggggcggcCTGGGACGCGGCGGGAGCA
TGGAGCCGCGCGCCGGCTGCCGGCTGCCGGTGCGGGTGGAGCAGGTCGTCAACGGCGCGC
TGGTGGTCACGGTGAGCTGCGGCGAGCGGAGCTTCGCGGGGATCCTGCTGGACTGCACGA
AAAAGTCCGGCCTCTTTGGCCTACCCCCGTTGGCTCCGCTGCCCCAGGTCGATGAGTCCC
CTGTCAACGACAGCCATGGCCGGGCTCCCGAGGAGGGGGATGCAGAGGTGATGCAGCTGG
GGTCCAGCTCCCCCCCTCCTGCCCGCGGGGTTCAGCCCCCCGAGACCACCCGCCCCGAGC
CACCCCCGCCCCTCGTGCCGCCGCTGCCCGCCGGAAGCCTGCCCCCGTACCCTCCCTACT
TCGAAGGCGCCCCCTTCCCTCACCCGCTGTGGCTCCGGGACACGTACAAGCTGTGGGTGC
CCCAGCCGCCGCCCAGGACCATCAAGCGCACGCGGCGGCGTCTGTCCCGCAACCGCGACC

All better!
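To confirm the awk rename didn't drop anything, you can compare record counts before and after. This sketch builds a toy file mimicking the gtf_to_fasta header layout shown above (">index chrom:coords transcript" — an assumption based on the example, so check your own headers) and applies the same awk command:

```shell
# Toy gtf_to_fasta-style output: ">index chrom:coords transcript".
orig=$(mktemp)
printf '>99061 chr10:1-100 ENST00000305233\nACGT\n>99062 chr10:200-300 ENST00000400000\nTTTT\n' > "$orig"

# Rename each header to the transcript ID (third field), as in the awk above.
fixed=$(mktemp)
awk '{if ($1 ~ /^>/) print ">"$3; else print $0}' "$orig" > "$fixed"

# The record count must be unchanged by the rename.
before=$(grep -c '^>' "$orig")
after=$(grep -c '^>' "$fixed")
echo "records: $before -> $after"
head -1 "$fixed"
```

If the two counts differ, the awk pattern missed a header line and you should inspect the file before aligning against it.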



Author

Harold Pimentel wrote this tutorial. If you have any suggestions for the tutorial or any questions, his email address is on his website.

Adam Roberts is the author of eXpress.