Friday, September 03, 2010

Color Space to fastq

As next generation sequencing is still evolving, so also myriad of tools that are built in and around these sequences . Applied Biosystems Dibase sequencing that uses ligation based chemistry ensures system accuracy and high throughputness.
 What is SOLiD system?
SOLiD stands for "Sequencing by Oligonucleotide Ligation and Detection". Due to its 2 base encoding system, it ensures greater accuracy e.g; 99.94% .
The decoding process is kind of tricky, since each color represents a dinucleotide. Altogether there are 4 colors i.e; 0,1,2,3 representing 4 bases A,C,G,T
Color Decoding
So, if a csfasta file has the first base known, then the subsequent bases can be calculated using the decoding table as below:

Example:
[0 -> AA, GG, CC, TT ; 1 -> CA, AC, TG, GT; 2-> GA, TC, AG, CT; 3 -> TA, GC, CG, AT]
>44_35_267_F3
T20220213203000111000122223221121222

T2 -> TC (number 2 can be GA,TC,AG,CT: but only TC starts with T, so the first number is deciphered to 'TC')
0  ->  CC
2  ->  CT
2  ->  TC
0  ->  CC
2  ->  CT
1  ->  TG
3  ->  GC
2  ->  CT
0  ->  TT and so on...

So, the colorspace translates into CCTCCTGCTT......

Now how about an error? If the colorspace is represented by a number >=4 or a "." , how to decode the rest of the reads? I guess, in that case, we can designate the rest of the reads as 'N', this can be done especially because we generate abysmally large number of reads and ignoring some of them will not matter much.

Converting Quality Scores:

Now, the next step is to convert the quality scores into sanger fastq format. Sanger Fastq standard was defined by Jim Mullikin, gradually disseminated, but never formally documented. The biggest drawback with the phred quality scores is that the need to separate numbers with space which increases storage space and numbers are often 2 digits numbers, which again adds up to the space issue.

Phred value is calculated as:

Qphred = -10 X log10(Pe), where P stands for probability score.

From Phred to Sanger:

Converting phred quality scores to Sanger Quality score is quite straight forward. Phred values 0 - 93 are represented by ASCII 33 - 126. 33 was used as a offset because ASCII 32 represents a white space.
The paper describing the details of sanger fastq and colorspace can be found here :


So, in order to convert Phred to Sanger in perl language;

$q = chr(($Q<=93? $Q : 93) + 33);
The paper describing Fastq format can be found here
Now how to code the colorspace to nucleotide conversion:
# Generate hash [popular notation]
my @code = ([0,1,2,3],[1,0,3,2],[2,3,0,1],[3,2,1,0]);
my @bases = qw(A C G T);
my %decode = ();
foreach my $i(0..3) {
      foreach my $j(0..3) {
          $decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
     }
}

Here $decode hash has values like:

$decode{0}->A = A;
$decode{1}->A = C;
$decode{2}->;A =G ;
$decode{3}->A =T ;
$decode{1}->C =A ;
$decode{0}->C =C ;
$decode{3}->C =G ;
$decode{2}->C =T ;
$decode{2}->G = A;
$decode{3}->G =C ;
$decode{0}->G =G ;
$decode{1}->G =T ;
$decode{3}->T =A ;
$decode{2}->T =C ;
$decode{0}->T =G ;
$decode{1}->T =T ;
sub decode{
my $str=shift;
my @arr = split($str,'');
my $seq='';
my $base='';
my $anchor = shift(@arr); # The first anchor tag
    for(my $i=0;$i<@arr;$i++){
        $base=$decode{$arr[$i]}->{$anchor};
        $seq .= $base;
        $anchor = $base;
    }
return $seq;

} # End of subroutine
[Modified version of the script can be found here]

4 comments:

Anonymous said...

Ohhh this is a VERY dangerous approach! The data that you get after an error in the read will be complete junk and you will not be able to capture it. If the system makes a color call (0..3) you will not capture a read error, and all of the sequence after it will be junk junk junk. Such errors can be up to 10% in the SOLiD system, but an aligner that is color-space aware will know that a single color-space mismatch is junk and sort it out.

Please, for the love of everything holy in this world, DO NOT USE THIS APPROACH! You HAVE TO use an aligner or assembler that is color-space aware (like bwa/bowtie/mosaik).

For further reading, please see threads on seqanswers.com on this topic.

cheers

Sucheta Tripathy PI @ Computational Genomics Group at IICB, Kolkata said...

Thanks for pointing it out! There are several catches in the way colorspace data is analyzed. My favorite pipeline is bowtie -> cufflink route. However, I am tempted to try tophat -> cufflink approach too. At the moment, tophat does not support colorspace alignment so, I am now really willing to experiment both the routes and see the difference! I have already seen there is a significant drop in the number of reads aligned using tophat approach and we know why!

Siva said...

I have tried using bowtie for colorspace but the percentage of reads align to its own reference was not more that 0.1%. what would be the reason.? i want to do assembly of solid colorspace reads. Kidnly suggest some software which supports colorspace.(tried Rayassembler, Velvet, SOPRA but din get the meaning of the data)

Sucheta Tripathy PI @ Computational Genomics Group at IICB, Kolkata said...

What command are you trying to use? Looks very unlikely that your colorspace reads would have only 0.1% to the reference. Send me more details, I may be able to help.