Friday, September 03, 2010

Color Space to fastq

As next generation sequencing is still evolving, so also myriad of tools that are built in and around these sequences . Applied Biosystems Dibase sequencing that uses ligation based chemistry ensures system accuracy and high throughputness.
 What is SOLiD system?
SOLiD stands for "Sequencing by Oligonucleotide Ligation and Detection". Due to its 2 base encoding system, it ensures greater accuracy e.g; 99.94% .
The decoding process is kind of tricky, since each color represents a dinucleotide. Altogether there are 4 colors i.e; 0,1,2,3 representing 4 bases A,C,G,T
Color Decoding
So, if a csfasta file has the first base known, then the subsequent bases can be calculated using the decoding table as below:

Example:
[0 -> AA, GG, CC, TT ; 1 -> CA, AC, TG, GT; 2-> GA, TC, AG, CT; 3 -> TA, GC, CG, AT]
>44_35_267_F3
T20220213203000111000122223221121222

T2 -> TC (number 2 can be GA,TC,AG,CT: but only TC starts with T, so the first number is deciphered to 'TC')
0  ->  CC
2  ->  CT
2  ->  TC
0  ->  CC
2  ->  CT
1  ->  TG
3  ->  GC
2  ->  CT
0  ->  TT and so on...

So, the colorspace translates into CCTCCTGCTT......

Now how about an error? If the colorspace is represented by a number >=4 or a "." , how to decode the rest of the reads? I guess, in that case, we can designate the rest of the reads as 'N', this can be done especially because we generate abysmally large number of reads and ignoring some of them will not matter much.

Converting Quality Scores:

Now, the next step is to convert the quality scores into sanger fastq format. Sanger Fastq standard was defined by Jim Mullikin, gradually disseminated, but never formally documented. The biggest drawback with the phred quality scores is that the need to separate numbers with space which increases storage space and numbers are often 2 digits numbers, which again adds up to the space issue.

Phred value is calculated as:

Qphred = -10 X log10(Pe), where P stands for probability score.

From Phred to Sanger:

Converting phred quality scores to Sanger Quality score is quite straight forward. Phred values 0 - 93 are represented by ASCII 33 - 126. 33 was used as a offset because ASCII 32 represents a white space.
The paper describing the details of sanger fastq and colorspace can be found here :


So, in order to convert Phred to Sanger in perl language;

$q = chr(($Q<=93? $Q : 93) + 33);
The paper describing Fastq format can be found here
Now how to code the colorspace to nucleotide conversion:
# Generate hash [popular notation]
my @code = ([0,1,2,3],[1,0,3,2],[2,3,0,1],[3,2,1,0]);
my @bases = qw(A C G T);
my %decode = ();
foreach my $i(0..3) {
      foreach my $j(0..3) {
          $decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
     }
}

Here $decode hash has values like:

$decode{0}->A = A;
$decode{1}->A = C;
$decode{2}->;A =G ;
$decode{3}->A =T ;
$decode{1}->C =A ;
$decode{0}->C =C ;
$decode{3}->C =G ;
$decode{2}->C =T ;
$decode{2}->G = A;
$decode{3}->G =C ;
$decode{0}->G =G ;
$decode{1}->G =T ;
$decode{3}->T =A ;
$decode{2}->T =C ;
$decode{0}->T =G ;
$decode{1}->T =T ;
sub decode{
my $str=shift;
my @arr = split($str,'');
my $seq='';
my $base='';
my $anchor = shift(@arr); # The first anchor tag
    for(my $i=0;$i<@arr;$i++){
        $base=$decode{$arr[$i]}->{$anchor};
        $seq .= $base;
        $anchor = $base;
    }
return $seq;

} # End of subroutine
[Modified version of the script can be found here]