What is SOLiD system?
SOLiD stands for "Sequencing by Oligonucleotide Ligation and Detection". Due to its 2 base encoding system, it ensures greater accuracy e.g; 99.94% .
The decoding process is kind of tricky, since each color represents a dinucleotide. Altogether there are 4 colors i.e; 0,1,2,3 representing 4 bases A,C,G,T
Color Decoding |
Example:
[0 -> AA, GG, CC, TT ; 1 -> CA, AC, TG, GT; 2-> GA, TC, AG, CT; 3 -> TA, GC, CG, AT]
>44_35_267_F3
T20220213203000111000122223221121222
T2 -> TC (number 2 can be GA,TC,AG,CT: but only TC starts with T, so the first number is deciphered to 'TC')
0 -> CC
2 -> CT
2 -> TC
0 -> CC
2 -> CT
1 -> TG
3 -> GC
2 -> CT
0 -> TT and so on...
So, the colorspace translates into CCTCCTGCTT......
Now how about an error? If the colorspace is represented by a number >=4 or a "." , how to decode the rest of the reads? I guess, in that case, we can designate the rest of the reads as 'N', this can be done especially because we generate abysmally large number of reads and ignoring some of them will not matter much.
Converting Quality Scores:
Now, the next step is to convert the quality scores into sanger fastq format. Sanger Fastq standard was defined by Jim Mullikin, gradually disseminated, but never formally documented. The biggest drawback with the phred quality scores is that the need to separate numbers with space which increases storage space and numbers are often 2 digits numbers, which again adds up to the space issue.
Phred value is calculated as:
Qphred = -10 X log10(Pe), where P stands for probability score.
From Phred to Sanger:
Converting phred quality scores to Sanger Quality score is quite straight forward. Phred values 0 - 93 are represented by ASCII 33 - 126. 33 was used as a offset because ASCII 32 represents a white space.
The paper describing the details of sanger fastq and colorspace can be found here :
So, in order to convert Phred to Sanger in perl language;
$q = chr(($Q<=93? $Q : 93) + 33);
The paper describing Fastq format can be found here
Now how to code the colorspace to nucleotide conversion:
# Generate hash [popular notation]
my @code = ([0,1,2,3],[1,0,3,2],[2,3,0,1],[3,2,1,0]);
my @bases = qw(A C G T);
my %decode = ();
foreach my $i(0..3) {
foreach my $j(0..3) {
$decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
}
}
Here $decode hash has values like:
$decode{0}->A = A;
$decode{1}->A = C;
$decode{2}->;A =G ;
$decode{3}->A =T ;
$decode{1}->C =A ;
$decode{0}->C =C ;
$decode{3}->C =G ;
$decode{2}->C =T ;
$decode{2}->G = A;
$decode{3}->G =C ;
$decode{0}->G =G ;
$decode{1}->G =T ;
$decode{3}->T =A ;
$decode{2}->T =C ;
$decode{0}->T =G ;
$decode{1}->T =T ;
sub decode{
my $str=shift;
my @arr = split($str,'');
my $seq='';
my $base='';
my $anchor = shift(@arr); # The first anchor tag
for(my $i=0;$i<@arr;$i++){
$base=$decode{$arr[$i]}->{$anchor};
$seq .= $base;
$anchor = $base;
}
return $seq;
} # End of subroutine
[Modified version of the script can be found here]