Monday, March 21, 2011

Analyzing oomycetes RNAseq data - Part 1

This is a small tutorial on how to work with large scale RNAseq data with open source softwares. I will be discussing various options for alignment of reads to the reference sequence and little bit on assembly.

[When I started writing this post, tophat did not have colorspace support, but now it does]
A quicklist of software programs available for nextgen data analysis can be found here:

To begin with bowtie, get the pre compiled binaries from here .

In order to start with bowtie, you need to build index of your reference sequence. This could be typically a fasta file. For me running command . "./bowtie-build /data/vbi/references/p_sojae.fasta sojaeV4"
to generate index for a 70 MB genome in a 64 bit machine  with dual core and 8 GB chip - took exactly 2 minutes. This command will generate 6 .ebwt files in the working directory.
ls -l
26385951 2010-06-11 15:18 sojaeV4.1.ebwt
9702096  2010-06-11 15:18 sojaeV4.2.ebwt
8711     2010-06-11 15:17 sojaeV4.3.ebwt
19404183 2010-06-11 15:17 sojaeV4.4.ebwt
26385951 2010-06-11 15:19 sojaeV4.rev.1.ebwt
9702096  2010-06-11 15:19 sojaeV4.rev.2.ebwt 

Once created make a separate directory called as index/ in the same working area and place all these files in there.

Test installation of index using the following command:
and withing fraction of a second you will see following printed to the stdout

0       -       super_16        1314431 CTTGTTGCGTCTCAGAAACCTTTTCGCGGCCATIIIIIII
# reads processed: 1
# reads with at least one reported alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 1 alignments to 1 output stream(s)

Good thing about bowtie(version 0.12 onwards, it supports colorspace alignment)

As of version 0.12.0, bowtie can align colorspace reads against a colorspace index when -C is specified. Colorspace is the characteristic output format of Applied Biosystems' SOLiD system.Look at color coding here for more details. See ABI's Principles of Di-Base Sequencing document for details.

All input formats (FASTA -f, FASTQ -q, raw -r, tab-delimited --12, command-line -c) are compatible with colorspace (-C). When -C is specified, read sequences are treated as colors. Colors may be encoded either as numbers (0=blue, 1=green, 2=orange, 3=red) or as characters A/C/G/T (A=blue, C=green, G=orange, T=red).
Some reads include a primer base as the first character; e.g.:
What about BFAST and SHRIMP and VMATCH? 
These 3 softwares are also good for fast alignment of short reads into the genome. 
BFAST stands for "BLAT-like Fast Accurate Search Tool", and it also supports colorspace 
data. SHRIMP also supports colorspace reads and recently the major improvement is that, 
it can work on small memory computers. VMATCH on the other hand is a software for
local alignments and needs licensing. I have not worked extensively on these 3 softwares,
so, can't say much about them
Is designed to predict the splice junctions after aligning short reads to the reference. 
It usually works without a reference junction position(GFF) file, but can also work
if there is already a known gff file. This program chops the reads into smaller fragments 
and stitches the alignments later in separating introns from exons. Now tophat has a 
colorspace support.


This result represents alignment percentage of 3 different assembly version of the
same organism. 

Tophat Output: [For larger datasets try splitting them into smaller files since tophat exits with some error hard to tell why it exited]

Caveats: I work with oomycetes pathogens and the introns and exons in this pathogen could be really much smaller than what tophat defines as default.

Running tophat may not work for everyone. For example with P.sojae, tophat produced awful number of junctions(just 700) with default parameters. Tweaking the parameters only produced worse results. As a work around, we tried aligning these transcripts to the predicted gene models as well. Comparing genome vs predicted gene model alignment results, we found around 10% of the alignments were not translated into predicted models(55% matched with genomes where as 45% match with predicted transcripts)

There is an option to try and get the sequences aligned with the genome to assemble into contigs. One easier method for this is to try ABySS 

Running ABySS is quite straight forward. Unpack the distribution and follow installation instructions. Once installed you could try running ABySS with different k mers. One small shell script for running different k mers
is as belows:

export PATH=$PATH:/home/sutripa/samtools-0.1.11/

for i in {20..40};
./ABYSS -k $i /data/bowtieOutput/PS-1_F3V1.bam.sorted.bam -o tmp-k$i.fa
[NOTE: Here the input file for assembly is sorted bam files. The bam files are generated by running bowtie]

Then check the files for their N50 values [See N50 section of the blog on how to calculate N50 values]
In case of P.sojae here is the N50 results:

 N50 Vs Number of scaffolds having values larger than N50 for P.sojae.
Here, the results are not very encouraging. So, you may try running with a bunch of different k values. I will discuss about running trans-abyss and cufflink in part - 2 of this series.