Monday, November 29, 2010

Bayesian Probability theory and its application

The two probability rules that are considered the pillars of all other, more complicated probabilistic models are the addition rule and the multiplication rule:

Addition Rule (OR Rule):
If two events are mutually exclusive, then the probability that event 1 or event 2 occurs is the sum of the two probabilities.
Example: the probability of drawing an ace or a joker from a pack of 54 cards. There are only 2 jokers and 4 aces, so the probability is: 2/54 + 4/54 = 6/54 = 1/9.

The addition rule is slightly modified when the events are not mutually exclusive, for example drawing a diamond or a queen from a pack of 52 cards. Symbolically, it is represented as: P(A or B) = P(A) + P(B) - P(A and B). So the result will be 13/52 + 4/52 - 1/52 (reason: there are 13 diamonds, 4 queens, and only one card, the queen of diamonds, that is both). The result therefore is: 16/52 = 4/13.

Multiplication Rule (AND Rule):
When two events A and B are independent, the probability that both A and B occur is P(A) X P(B).
Example: when tossing two coins simultaneously, the probability of obtaining two heads (or two tails) is 1/2 X 1/2 = 1/4.

Bayesian Probability:
Bayesian statistics, or Bayesian probability, is also called conditional probability: the probability of one event is evaluated given another. A common example is a loaded die in a casino, where 99% of the dice are honest but 1% of the dice are loaded. With a loaded die, the probability of getting a 6 is 50%, whereas with an honest die a 6 comes up 1/6 of the time. So, if someone rolls 3 consecutive sixes, what is the probability that the die is loaded?
In theory Bayesian statistics can be applied where:
  • The sample space is partitioned into a set of mutually exclusive events { A1, A2, . . . , An }.
  • Within the sample space, there exists an event B, for which P(B) > 0.
  • The analytical goal is to compute a conditional probability of the form: P( Ak | B ).
  • You know at least one of the two sets of probabilities described below:
      ◦ P( Ak ∩ B ) for each Ak, or
      ◦ P( Ak ) and P( B | Ak ) for each Ak.
So, in the above case, we would like to know the probability (likelihood) that the die is loaded, given that we observed 3 consecutive sixes.
P(loaded | 3 sixes) = P(3 sixes | loaded) * P(loaded) / [ P(3 sixes | loaded) * P(loaded) + P(3 sixes | fair) * P(fair) ]
                    = (0.5)^3 * 0.01 / [ (0.5)^3 * 0.01 + (1/6)^3 * 0.99 ]
                    ≈ 0.21

The same can be applied to test the probability that the die is fair, given 3 consecutive sixes:

P(fair | 3 sixes) = (1/6)^3 * 0.99 / [ (0.5)^3 * 0.01 + (1/6)^3 * 0.99 ] ≈ 0.79
Meaning that even after seeing 3 consecutive sixes, the die is still more likely to be fair than loaded.
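The two posteriors above can be checked with a short Python sketch (the helper function and variable names are my own; the numbers come from the example):

```python
# Bayes' theorem for two mutually exclusive hypotheses A and B:
# P(A | data) = P(A) * P(data | A) / [ P(A) * P(data | A) + P(B) * P(data | B) ]

def posterior(prior_a, like_a, prior_b, like_b):
    """Posterior probability of hypothesis A given the observed data."""
    num = prior_a * like_a
    return num / (num + prior_b * like_b)

# 1% of dice loaded, P(six|loaded) = 0.5, P(six|fair) = 1/6; observed: 3 sixes.
p_loaded = posterior(0.01, 0.5 ** 3, 0.99, (1 / 6) ** 3)
p_fair = posterior(0.99, (1 / 6) ** 3, 0.01, 0.5 ** 3)
print(round(p_loaded, 2), round(p_fair, 2))  # 0.21 0.79
```

The two posteriors are complementary, since "loaded" and "fair" partition the sample space.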

The same probability theory can be applied to weather prediction. Suppose it rains only 5 days in a whole year, a weather channel has predicted rain for a particular day, and the accuracy of the channel is about 90%. Then the probability that it actually rains on that day is:
P(rain | predicted) = (5/365) * 0.9 / [ (5/365) * 0.9 + (360/365) * 0.1 ] ≈ 0.111

Bayesian statistics is often used in sequence analysis, for example when we have to judge whether a protein is extracellular or intracellular. These 2 situations are mutually exclusive, i.e. a protein can be either extracellular or intracellular, and the number of cysteine residues at certain locations can decide, at a certain confidence level, which is the case.
Another test case could be:

A rare genetic disease is discovered. Although only one in a million people carry it, you consider getting screened. You are told that the genetic test is extremely good; it is 100% sensitive (it is always correct if you have the disease) and 99.99% specific (it gives a false positive result only 0.01% of the time). Using Bayes' theorem, explain why you might decide not to take the test.
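A quick Python sketch of the arithmetic behind this exercise (variable names are my own; the numbers come from the problem statement):

```python
# Rare-disease screening: prevalence 1 in a million, 100% sensitivity,
# 99.99% specificity (i.e. a false-positive rate of 0.01% = 1e-4).
prior = 1e-6                # P(disease)
sensitivity = 1.0           # P(positive | disease)
false_positive_rate = 1e-4  # P(positive | no disease)

# Bayes' theorem: P(disease | positive)
p_disease_given_positive = (sensitivity * prior) / (
    sensitivity * prior + false_positive_rate * (1 - prior))
print(round(p_disease_given_positive, 4))  # 0.0099
```

So even with a "positive" result, the chance of actually having the disease is under 1%, because false positives vastly outnumber true carriers.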

Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo [Jacob 1977].

Tuesday, November 23, 2010

Dynamic programming for sequence Alignment

According to Wikipedia:
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler steps. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller[1] and optimal substructure (described below). When applicable, the method takes far less time than naïve methods.
Top-down dynamic programming simply means storing the results of certain calculations, which are later used again since the completed calculation is a sub-problem of a larger calculation. Bottom-up dynamic programming involves formulating a complex calculation as a recursive series of simpler calculations.

In biological sequence analysis, dynamic programming is often used for sequence alignment, e.g. global sequence alignment (the Needleman-Wunsch method).
Here I present the simplest interpretation of dynamic programming for aligning two sequences using global alignment (the N&W method).

We have sequence 1: G A A T T C A G T T A
     and sequence 2: G G A T C G A
Length(seq1) = 11 and Length(seq2) = 7; let's call them k and l.
In order to align them globally using the dynamic programming method, we have to do the following:
1. Build a matrix M of size (k+1) X (l+1), putting seq1 in the columns and seq2 in the rows.
2. Initialize the matrix: the first row and the first column (M[0, 0..k] and M[0..l, 0]) are set to zero [Fig - 1].

Fig - 1
The rest of the cells can be filled using the following recurrence:

     M[i,j] = MAX[ M[i-1,j-1] + S[i,j]   (match/mismatch on the diagonal),
                   M[i,j-1] + w          (gap in sequence #1),
                   M[i-1,j] + w          (gap in sequence #2) ]

where S[i,j] is 1 for a match and 0 for a mismatch, and w = 0 (no gap penalty here).
Now, let's fill in M(1,1) = MAX[ {M(0,0) + S(1,1)},
                                {M(1,0) + w},
                                {M(0,1) + w} ]
                          = MAX[ {0 + 1}, {0 + 0}, {0 + 0} ]
                          = 1

Going by this trend, M[1,2] is going to be: Max[{1+0},{0+0},{0+0} ] = 1
So, we can start filling in the matrix this way:

From this, two common observations can be drawn:
1. Increment the value of a cell by one only when there is a match, adding to the value of the diagonal (upper-left) cell.
2. If there is a mismatch, carry forward the maximum value from the previous column/row.
Going by this, we can end up filling the matrix with the following values:

Now the last phase in dynamic programming:
3) Tracing back

Tracing back almost always begins with the highest score and at each step looks to the diagonal (up-left, match/mismatch), to the left (gap in sequence 1), or up (gap in sequence 2).

This tracing gives an alignment:

G A A T T C A G T T A
|   | |   |   |     |
G G A T - C - G - - A

or, equivalently:

G - A A T T C A G T T A
|   |     | |   |     |
G G A - - T C - G - - A

For longer sequences a large number of other alignments is possible, but this dynamic programming traceback outputs only one result.

[Ref: Images adapted from Eric C. Rouchka's web page]

In real-life examples, different penalty scores are given for a mismatch and for a gap (every time you move down a column or along a row without a match, a gap penalty should be applied).
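The matrix-filling step described above can be sketched in a few lines of Python (a minimal sketch; the function name is my own, it uses the post's scoring of match = 1, mismatch = 0, w = 0, and it only fills the matrix without tracing back):

```python
def nw_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the (l+1) x (k+1) Needleman-Wunsch matrix; seq1 in columns, seq2 in rows."""
    k, l = len(seq1), len(seq2)
    # First row and first column are initialized to zero.
    M = [[0] * (k + 1) for _ in range(l + 1)]
    for i in range(1, l + 1):
        for j in range(1, k + 1):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,  # match/mismatch on the diagonal
                          M[i][j - 1] + gap,    # horizontal move (gap)
                          M[i - 1][j] + gap)    # vertical move (gap)
    return M

M = nw_matrix("GAATTCAGTTA", "GGATCGA")
print(M[-1][-1])  # 6: the best global score, i.e. six matched positions
```

With zero penalties for gaps and mismatches, the final score simply counts the matched positions of the best alignment, which agrees with the six match bars in the alignments above.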

Friday, November 05, 2010

Generating random DNA/protein sequence permutation

In general terminology, sequence permutation is defined as shuffling of the bases without disturbing the base composition. In other words, all permuted sequences should have the same base composition. It is fairly easy to implement such permutations for single-nucleotide composition, and the Fisher-Yates (1938) algorithm is a good way to do it without introducing much sequence bias. However, in real-life situations it may be necessary to generate a random DNA sequence while preserving the di-/tri-nucleotide frequencies in overlapping windows. Altschul and Erickson have explained how to generate such a sequence, conserving the di-/tri-nucleotide composition, using a graph-theoretic method that generates random Eulerian walks on a directed multigraph. [Paper Here]
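For the single-nucleotide case, a Fisher-Yates shuffle is only a few lines; here is a minimal Python sketch (the function name is my own):

```python
import random

def permute_sequence(seq):
    """Fisher-Yates shuffle: a uniform random permutation of the bases,
    so the base composition of the input is preserved exactly."""
    bases = list(seq)
    for i in range(len(bases) - 1, 0, -1):
        j = random.randint(0, i)  # pick from the not-yet-fixed prefix (inclusive)
        bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)

s = "GATTACAGATTACA"
p = permute_sequence(s)
print(sorted(p) == sorted(s))  # True: same base composition, different order
```

Note that this preserves only mononucleotide composition; preserving overlapping di-/tri-nucleotide frequencies requires the Eulerian-walk approach mentioned above.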

R's Biostrings package has a DNAString class that, although it does not implement the Fisher-Yates algorithm, generates an unbiased permuted DNA string. This approach can also be used to preserve dinucleotide/trinucleotide composition, as long as the units don't overlap:

Example Code:
> library(Biostrings)
> x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE, prob=c(0.2, 0.55, 0.1, 0.15)), collapse="")
> x <- DNAString(x)
> x                                     # will print x
> alphabetFrequency(x, baseOnly=TRUE)   # will print the base composition in numbers

In order to run it for di-/tri-nucleotides, just change the code slightly:
> x <- paste(sample(c("AT", "CC", "GT", "TA", "AA"), 1000, replace=TRUE, prob=c(0.2, 0.55, 0.1, 0.10, 0.05)), collapse="")
> x <- DNAString(x)
> x
  2000-letter "DNAString" instance
Please NOTE: here you will not get any dinucleotide other than those specified IN FRAME.

Same can be done for tri-tetra nucleotides as well.

A very simple unix implementation of random sequence generator could be:

$ echo {a,t,g,c}{a,t,g,c}{a,t,g,c}{a,t,g,c}
This generates the cross product of {a,t,g,c} with itself four times, and thus has all tetranucleotides represented. A similar expansion can be generated for dinucleotides, such as:
$ echo {aa,tt,gg,cc}{at,tg,ga,ca}{aa,tc,ag,tc}{aa,at,cg,cc}
Enjoy generating permuted sequences.

Wednesday, November 03, 2010

Some unix/perl oneliners for Bioinformatics

   File format conversion/line counting/counting number of files etc.

1.    $ wc -l : count the number of lines in a file.
2.    $ ls | wc -l : count the number of files in a directory.
3.    $ tac : print the file in reverse line order, i.e. last line first, first line last.
4.    $ rev : reverse the characters of each line.
5.    $ sed 's/.$//' or sed 's/^M$//' or sed 's/\x0D$//' : converts a DOS file into unix mode.
6.    $ sed "s/$/`echo -e \\\r`/" or sed 's/$/\r/' or (in a DOS environment) sed "s/$//" : converts unix newlines into DOS newlines.
7.    $ awk '1; { print "" }' : double space a file.
8.    $ awk '{ total = total + NF }; END { print total+0 }' : prints the number of words in a file.
9.    $ sed '/^$/d' or grep '.' : delete all blank lines in a file.
10.    $ sed '/./,$!d' : delete all blank lines at the beginning of the file.
11.    $ sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' : delete all blank lines at the end of the file.
12.    $ sed -e :a -e 's/<[^>]*>//g;/</N;//ba' : delete HTML tags, including tags that span multiple lines.
13.    $ sed 's/^[ \t]*//' : delete all leading whitespace and tabs in a file.
14.    $ sed 's/[ \t]*$//' : delete all trailing whitespace and tabs in a file.
15.    $ sed 's/^[ \t]*//;s/[ \t]*$//' : delete both leading and trailing whitespace and tabs in a file.

2.2    Working with Patterns/numbers in a sequence file
16.    $ awk '/pattern/ { n++ }; END { print n+0 }' : print the total number of lines containing the pattern.
17.    $ sed 10q : print the first 10 lines.
18.    $ sed -n '/regexp/p' : print the lines that match the pattern.
19.    $ sed '/regexp/d' : delete the lines that match the regexp.
20.    $ sed -n '/regexp/!p' : print the lines that do not match the pattern.
21.    $ sed '/regexp/!d' : delete the lines that do NOT match the regular expression.
22.    $ sed -n '/^.\{65\}/p' : print lines that are 65 characters or longer.
23.    $ sed -n '/^.\{65\}/!p' : print lines that are shorter than 65 characters.
24.    $ sed -n '/regexp/{g;1!p;};h' : print the line before the pattern match.
25.    $ sed -n '/regexp/{n;p;}' : print the line after the pattern match.
26.    $ sed -n '/^.\{65\}/ {g;1!p;};h' < sojae_seq > tmp : print the names of the sequences that are 65 nucleotides or longer.
27.    $ sed -n '/regexp/,$p' : print from the regular expression to the end of the file.
28.    $ sed -n '8,12p' : print lines 8 to 12 (inclusive).
29.    $ sed -n '52p' : print only line number 52.
30.    $ sed '/pattern1/,/pattern2/d' < inputfile > outfile : will delete all the lines between pattern1 and pattern2.
31.    $ sed '20,30d' < inputfile > outfile : will delete lines 20 through 30.
32.    $ awk '/baz/ { gsub(/foo/, "bar") }; { print }' : substitute foo with bar in lines that contain 'baz'.
33.    $ awk '!/baz/ { gsub(/foo/, "bar") }; { print }' : substitute foo with bar in lines that do not contain 'baz'.
34.    $ grep -i -B 1 'pattern' filename > out : will print the name of the sequence and the sequence having the pattern, case-insensitively (make sure the sequence name and the sequence each occupy a single line).
35.    $ grep -i -A 1 'seqname' filename > out : will print the sequence name as well as the sequence into file 'out'.

2.3    Inserting Data into a file:

36.    $ gawk --re-interval 'BEGIN{ while(a++<49) s=s "x" }; { sub(/^.{6}/,"&" s) }; 1' > fileout : will insert 49 'x' characters after the sixth character of every line.

37.    $ gawk --re-interval 'BEGIN{ s="YourName" }; { sub(/^.{6}/,"&" s) }; 1' : insert your name after the sixth character of every line.

3.    Working with Data Files[Tab delimited files]:

3.1    Error Checking and data handling:
38.    $ awk '{ print NF ":" $0 }' : print the number of fields of each line, followed by the line.
39.    $ awk '{ print $NF }' : print the last field of each line.
40.    $ awk 'NF > n' : print every line with more than 'n' fields.
41.    $ awk '$NF > n' : print every line where the last field is greater than n.
42.    $ awk '{ print $2, $1 }' : prints just the first 2 fields of a data file, in reverse order.
43.    $ awk '{ temp = $1; $1 = $2; $2 = temp; print }' : prints all the fields, with the first 2 fields swapped.
44.    $ awk '{ for (i=NF; i>0; i--) printf("%s ", $i); printf ("\n") }' : prints all the fields in reverse order.
45.    $ awk '{ $2 = ""; print }' : deletes the 2nd field in each line.
46.    $ awk '$5 == "abc123"' : print each line where the 5th field is equal to 'abc123'.
47.    $ awk '$5 != "abc123"' : print each line where the 5th field is NOT equal to abc123.
48.    $ awk '$7 ~ /^[a-f]/' : print each line whose 7th field matches the regular expression.
49.    $ awk '$7 !~ /^[a-f]/' : print each line whose 7th field does NOT match the regular expression.
50.    $ cut -f n1,n2,n3 inputfile > outfile : will cut columns (fields) n1, n2, n3 from the input file and print them to the output file. If the delimiter is other than TAB, give an additional argument, e.g. cut -d ',' -f n1,n2 inputfile > out.
51.    $ sort -n -k 2,2 -k 4,4 file > fileout : will conduct a numerical sort on column 2, then on column 4. If -n is not specified, sort does a lexicographical sort (of the ASCII values).

4.    Miscellaneous:
52.    $ uniq -u inputfile > out : will print only the unique lines of the sorted input file.
53.    $ uniq -d inputfile > out : will print only the lines that appear in duplicate in the sorted input file.
54.    $ cat file1 file2 file3 ... fileN > outfile : will concatenate files back to back into outfile.
55.    $ paste file1 file2 > outfile : will merge two files horizontally. This is good for merging files with the same number of rows but different column widths.
56.    !?pattern?:p : will print (without running) the most recent command containing 'pattern'.
57.    !! : repeat the last command entered at the shell.
58.    $ cd ~ (or just cd) : go back to the home directory.
59.    $ echo {a,t,g,c}{a,t,g,c}{a,t,g,c}{a,t,g,c} : will generate all tetramers from 'atgc'. For pentamers/hexamers etc., just increase the number of bracketed entities. NOTE: this is not an efficient sequence shuffler. If you wish to generate longer sequences, use other means.
60.    $ kill -HUP `ps -aef | grep -i firefox | sort -k 2 -r | sed 1d | awk '{ print $2 }'` : kills a hanging firefox process.
61.    $ csplit -n 7 input.fasta '/>/' '{*}' : will split the file 'input.fasta' wherever it encounters the delimiter '>'. The file names will appear as 7-digit-long strings.
62.    $ find . -name data.txt -print : finds and prints the path for file data.txt.
Sample script to perform set operations on sequence files:
63.    grep '>' filenameA > list1    # list just the sequence names in file A
grep '>' filenameB > list2           # list the names for file B
cat list1 list2 > tmp                # concatenate list1 and list2 into tmp
sort tmp > tmp1                      # sort the file
uniq -u tmp1 > uniq                  # (A U B) - (A ∩ B), i.e. (A-B) U (B-A)
uniq -d tmp1 > double                # the intersection (A ∩ B)
cat uniq double > Union              # A U B
cat list1 double > tmp
sort tmp | uniq -u > list1uniq       # A - B
cat list2 double > tmp
sort tmp | uniq -u > list2uniq       # B - A


1.    perl -pe '$\="\n"'   : double space a file
2.    perl -pe '$_ .= "\n" unless /^$/' : double space a file except blank lines
3.    perl -pe '$_.="\n"x7' : insert 7 blank lines after each line.
4.    perl -ne 'print unless /^$/' : remove all blank lines
5.    perl -lne 'print if length($_) < 20' : print all lines with length less than 20.
6.    perl -00 -pe '' : if there are multiple consecutive blank lines, squeeze them into one.
7.    perl -00 -pe '$_.="\n"x4' : Expand single blank lines into 4 consecutive blank lines
8.    perl -pe '$_ = "$. $_"': Number all lines in a file
9.    perl -pe '$_ = ++$a." $_" if /./' : Number only non-empty lines in a file
10.    perl -ne 'print ++$a." $_" if /./' : Number and print only non-empty lines in a file
11.    perl -pe '$_ = ++$a." $_" if /regex/' : Number only lines that match a pattern
12.    perl -ne 'print ++$a." $_" if /regex/' : Number and print only lines that match a pattern
13.    perl -ne 'printf "%-5d %s", $., $_ if /regex/' : Left align lines with 5 white spaces if matches a pattern (perl -ne 'printf "%-5d %s", $., $_' : for all the lines)
14.    perl -le 'print scalar(grep{/./}<>)' : prints the total number of non-empty lines in a file
15.    perl -lne '$a++ if /regex/; END {print $a+0}' : print the total number of lines that matches the pattern
16.    perl -alne 'print scalar @F' : print the total number fields(words) in each line.
17.    perl -alne '$t += @F; END { print $t}' : Find total number of words in the file
18.    perl -alne 'map { /regex/ && $t++ } @F; END { print $t }' : find total number of fields that match the pattern
19.    perl -lne '/regex/ && $t++; END { print $t }' : Find total number of lines that match a pattern
20.    perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m' : will calculate the GCD of two numbers.
21.    perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m' : will calculate the LCM of 20 and 35.
22.    perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n' : Generates 10 random numbers between 5 and 15.
23.    perl -le 'print map { ("a".."z","0".."9")[rand 36] } 1..8' : generates an 8-character password from a-z and 0-9.
24.    perl -le 'print map { ("a","t","g","c")[rand 4] } 1..20' : generates a 20-nucleotide-long random sequence.
25.    perl -le 'print "a"x50' : generates a string of 'a' 50 characters long.
26.    perl -le 'print join ", ", map { ord } split //, "hello world"': Will print the ascii value of the string hello world.
27.    perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)': converts ascii values into character strings.
28.    perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"': Generates an array of odd numbers.
29.    perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"': Generate an array of even numbers
30.    perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file : ROT13-encode the entire file (13-character offset).
31.    perl -nle 'print uc' : convert all text to uppercase.
32.    perl -nle 'print lc' : convert all text to lowercase.
33.    perl -nle 'print ucfirst lc' : convert only the first letter of the first word to uppercase.
34.    perl -ple 'y/A-Za-z/a-zA-Z/' : convert upper case to lower case and vice versa.
35.    perl -ple 's/(\w+)/\u$1/g' : camel-case every word.
36.    perl -pe 's|\n|\r\n|' : convert unix newlines into DOS newlines.
37.    perl -pe 's|\r\n|\n|' : convert DOS newlines into unix newlines.
38.    perl -pe 's|\n|\r|' : convert unix newlines into MAC newlines.
39.    perl -pe '/regexp/ && s/foo/bar/' : Substitute a foo with a bar in a line with a regexp.

Some other Perl Tricks

Want to display a progress bar while perl does your job?

For this, perl provides a nice utility called "pipe opens" ('perldoc -f open' will provide more info):

open(my $fh, '-|', 'command', 'option', 'option', ...) or die "Could not run command: $!";
while (<$fh>) {
    print "-";
}
print "\n";

This will print a '-' on the screen for each line the process emits, until it completes.

Wednesday, October 27, 2010

Protein Splicing : Inteins and Exteins

Protein splicing with inteins (protein introns) and exteins (protein exons) was discovered some 20 years ago. The process is efficient and autocatalytic: the intein excises itself from the primary protein product (the precursor protein) and then catalyzes the joining of the broken ends, forming 2 protein products: 1) the mature protein and 2) the intein itself. So, the protein fragments that are joined together to form the mature protein are called exteins (analogous to RNA exons), and inteins are the protein introns.

                   Fig reference: Elleuche S, Pöggeler S. Applied Microbial Biotechnology,2010

All these inteins have sequence similarity with homing endonucleases. Homing endonucleases are a very interesting class of proteins that cleave DNA at a particular recognition site. The recognition site is long enough (12 - 40 bp, as against restriction enzymes, which recognize 8 bp or less) that it is unlikely to occur in a genome by chance. Usually the genes encoding homing endonucleases occur inside the very DNA element that they recognize for cleavage, thus preventing cleavage of the DNA sequence that carries them; they are therefore a class of selfish genes. It is very interesting how these homing endonucleases propagate themselves to alleles that lack them. The allele that has the homing endonuclease gene is called HEG+ and the one lacking it is called HEG-. The HEG+ allele's endonuclease cleaves the HEG- allele, thus initiating homologous recombination DNA repair. Once repair is initiated, HEG+ is copied into the HEG- locus, thus propagating it.
There are currently 4 structural families of homing endonucleases:
1) LAGLIDADG: if present alone, needs to homodimerize to act on a DNA sequence.
2) GIY-YIG: acts as a monomer; the domain occurs at the N terminus.
3) His-Cys box: 2 histidines and 3 cysteines; acts as a monomer.
4) H-N-H: 2 pairs of histidines flanking one asparagine.

A list of intein integration sites is listed here [Link].
The intein database is [Here].


Friday, September 03, 2010

Color Space to fastq

As next-generation sequencing is still evolving, so too are the myriad tools built in and around these sequences. Applied Biosystems' dibase sequencing, which uses ligation-based chemistry, ensures system accuracy and high throughput.
 What is SOLiD system?
SOLiD stands for "Sequencing by Oligonucleotide Ligation and Detection". Due to its 2-base encoding system, it ensures greater accuracy, e.g. 99.94%.
The decoding process is somewhat tricky, since each color represents a dinucleotide. Altogether there are 4 colors, numbered 0, 1, 2, 3; each color stands for a set of four dinucleotides rather than a single base of A, C, G, T.
Color Decoding
So, if the first base of a csfasta read is known, then the subsequent bases can be decoded using the table below:

[0 -> AA, GG, CC, TT ; 1 -> CA, AC, TG, GT; 2-> GA, TC, AG, CT; 3 -> TA, GC, CG, AT]

T2 -> TC (number 2 can be GA,TC,AG,CT: but only TC starts with T, so the first number is deciphered to 'TC')
0  ->  CC
2  ->  CT
2  ->  TC
0  ->  CC
2  ->  CT
1  ->  TG
3  ->  GC
2  ->  CT
0  ->  TT and so on...

So, the colorspace translates into CCTCCTGCTT......
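The walk-through above can be turned into a small Python sketch (names are my own; the tables mirror the decoding table given earlier):

```python
# Color code per (previous base, next base) pair, matching the table
# [0 -> AA,GG,CC,TT; 1 -> CA,AC,TG,GT; 2 -> GA,TC,AG,CT; 3 -> TA,GC,CG,AT].
CODE = {"A": [0, 1, 2, 3], "C": [1, 0, 3, 2],
        "G": [2, 3, 0, 1], "T": [3, 2, 1, 0]}
BASES = "ACGT"

# Invert it: DECODE[color][previous_base] -> next base.
DECODE = {c: {} for c in range(4)}
for prev in BASES:
    for nxt, color in zip(BASES, CODE[prev]):
        DECODE[color][prev] = nxt

def cs_to_seq(read):
    """Decode a colorspace read (anchor base + digits) into nucleotides."""
    anchor, colors = read[0], read[1:]
    seq = ""
    for c in colors:
        anchor = DECODE[int(c)][anchor]  # each color fixes the next base
        seq += anchor
    return seq

print(cs_to_seq("T2022021320"))  # CCTCCTGCTT, as in the walk-through above
```

Note that the anchor base itself is not part of the decoded sequence; it only seeds the chain.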

Now, what about an error? If a colorspace position is represented by a number >= 4 or a ".", how do we decode the rest of the read? I guess in that case we can designate the rest of the read as 'N'. This is acceptable especially because we generate an abysmally large number of reads, and ignoring some of them will not matter much.

Converting Quality Scores:

Now, the next step is to convert the quality scores into Sanger fastq format. The Sanger fastq standard was defined by Jim Mullikin, gradually disseminated, but never formally documented. The biggest drawback of numeric phred quality scores is the need to separate the numbers with spaces, and the numbers are often 2 digits long, both of which add to the storage space.

Phred value is calculated as:

Qphred = -10 X log10(Pe), where Pe is the probability that the base call is erroneous.

From Phred to Sanger:

Converting phred quality scores to Sanger quality scores is quite straightforward. Phred values 0 - 93 are represented by ASCII characters 33 - 126; 33 was used as the offset because ASCII 32 represents a white space.
The paper describing the details of Sanger fastq and colorspace can be found here:

So, in order to convert phred to Sanger in Perl:

$q = chr(($Q<=93? $Q : 93) + 33);
The paper describing Fastq format can be found here
Now, how to code the colorspace-to-nucleotide conversion:
# Generate the decoding hash [popular notation]
my @code  = ([0,1,2,3],[1,0,3,2],[2,3,0,1],[3,2,1,0]);
my @bases = qw(A C G T);
my %decode = ();
foreach my $i (0..3) {
    foreach my $j (0..3) {
        $decode{$code[$i]->[$j]}->{$bases[$i]} = $bases[$j];
    }
}
Here the %decode hash has values like:

$decode{0}->{A} = A;  $decode{1}->{A} = C;  $decode{2}->{A} = G;  $decode{3}->{A} = T;
$decode{1}->{C} = A;  $decode{0}->{C} = C;  $decode{3}->{C} = G;  $decode{2}->{C} = T;
$decode{2}->{G} = A;  $decode{3}->{G} = C;  $decode{0}->{G} = G;  $decode{1}->{G} = T;
$decode{3}->{T} = A;  $decode{2}->{T} = C;  $decode{1}->{T} = G;  $decode{0}->{T} = T;
sub decode {
    my $str = shift;
    my @arr = split(//, $str);
    my $seq = '';
    my $anchor = shift(@arr);   # the first base is the anchor
    for (my $i = 0; $i < @arr; $i++) {
        my $base = $decode{$arr[$i]}->{$anchor};
        $seq .= $base;
        $anchor = $base;
    }
    return $seq;
} # End of subroutine
[Modified version of the script can be found here]

Friday, August 27, 2010

Plotting SAM output

Output from next-generation sequence alignment to a genome assembly comes in SAM (Sequence Alignment Map) format. SAM files can be large, so a binary format called BAM is most often used for ease of handling. While full documentation on samtools can be found here, documentation on the SAM format can be found here. For quick reference, let me put the SAM alignment fields here:

[qname]: Query Name
[flag]:      a bitwise flag represented in decimal format; when the value is converted into binary, each bit has the following meaning:
0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair
0x0002 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment) [1]
0x0004 the query sequence itself is unmapped
0x0008 the mate is unmapped [1]
0x0010 strand of the query (0 for forward; 1 for reverse strand)
0x0020 strand of the mate [1]
0x0040 the read is the first read in a pair [1][2]
0x0080 the read is the second read in a pair [1][2]
0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)
0x0200 the read fails platform/vendor quality checks
0x0400 the read is either a PCR duplicate or an optical duplicate
[1] Flags 0x02, 0x08, 0x20, 0x40 and 0x80 are only meaningful when flag 0x01 is present.
[2] If in a read pair the information on which read is the first in the pair is lost in the upstream analysis, flag 0x01 should be present and 0x40 and 0x80 are both zero.
Example: In our case, we mostly get 0 or 16 as the value, where 0 means(00000000000) forward strand and 16 means(00000010000) reverse strand. We are NOT concerned about rest of the bits because ours is not a paired end alignment.
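Decoding the FLAG field bit by bit can be sketched in Python (a minimal sketch; the short bit descriptions mirror the table above):

```python
# SAM FLAG bits and short descriptions, per the table above.
FLAG_BITS = {
    0x0001: "read paired",
    0x0002: "mapped in proper pair",
    0x0004: "query unmapped",
    0x0008: "mate unmapped",
    0x0010: "query on reverse strand",
    0x0020: "mate on reverse strand",
    0x0040: "first read in pair",
    0x0080: "second read in pair",
    0x0100: "not primary alignment",
    0x0200: "fails vendor quality checks",
    0x0400: "PCR or optical duplicate",
}

def decode_flag(flag):
    """Return the descriptions of all bits set in a decimal FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(0))   # []: an unpaired read on the forward strand
print(decode_flag(16))  # ['query on reverse strand']
```

This matches the example above: 0 means forward strand, 16 (0x0010) means reverse strand.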
The CIGAR column describes the alignment with the following operation codes:
M Alignment match (can be a sequence match or mismatch)
I Insertion to the reference
D Deletion from the reference
N Skipped region from the reference
S Soft clip on the read (clipped sequence present in SEQ)
H Hard clip on the read (clipped sequence NOT present in SEQ)
P Padding (silent deletion from the padded reference sequence)
Let's not discuss the other fields here.
The samtools view command:
samtools view $DATAFILE/sorted.bam super_0:1000-30000 | cut -f 2,3,4,10 > tmp2
[One thing to remember here is that sorted BAM files need to be indexed before using this command. In other words, keep the index file (sorted.bam.bai) in the same directory.]

If you want to see the flags in a human-readable (string) format, use '-X' after 'samtools view'.
In the output, a flag of 0 on a line means the forward strand and 16 means the reverse strand.

Sunday, June 20, 2010

Pneumococcal Fratricides

 Excerpts from César Sánchez's blog
Some bacteria produce substances that kill surrounding microbes, and use the resulting dead bodies as a source of nutrients. Sometimes, killer and victim belong to the same species, or even they are siblings. In these cases, researchers speak of cannibalism or fratricide; although if you view microbial populations as coordinated, multicellular entities, then you may prefer to use the term programmed cell death.
Among pneumococci, some cells in a population become competent in response to certain signals; which means that they are able to take up DNA from their surroundings, and incorporate this genetic information into their own chromosome. This way, competent cells can acquire new inheritable abilities—such as production of a new capsule type, or resistance to an antibiotic—that can be very important for their survival. (This was the underlying mechanism in the famous Avery-MacLeod-McCarty experiment that helped identify DNA as the hereditary material in cells.)
But competent pneumococci do something else: they encourage non-competent siblings and other closely-related bacteria to commit suicide. They do this by releasing a particular lytic enzyme, called CbpD, that diffuses through the milieu and—somehow—activates LytC and other lytic enzymes that are already present in the non-competent siblings. Cell wall weakening finally results in a big bang: that is, the explosion of the non-competent pneumococci. The materials released serve not only as nutrients and sources of genetic information (DNA), but also as virulence factors that help competent cells to survive in their human host.

The structure of the pneumococcal
autolysin, LytC. Source.
The 3D structure of LytC now provides the clues to explain the enzyme's peculiar behaviour during pneumococcal fratricide. Have a look at the model of LytC on the right: ain't it a beauty? A substrate-binding module (in blue and green in the image) recognizes and binds the cell wall peptidoglycan, whereas a catalytic module (in red) is responsible for breaking a specific linkage in the substrate. Because of the unusual hook shape of the protein, the substrate-binding module and the catalytic module partially block each other. As a result, LytC cannot bind the highly cross-linked peptidoglycan that is predominant under normal circumstances. Only when CbpD or other lytic enzymes cut specific linkages in the cell wall, LytC is able to bind the 'loosened' peptidoglycan and comes into action—with deleterious consequences for the non-competent pneumococci.

Monday, May 31, 2010

How To Embed Slideshare Into regular web pages

The following presentation describes how to embed SlideShare in your wiki space; if you want to embed it on any regular web page, just use the same "embed html" code and copy-paste it into an HTML page. One thing to remember: first make your SlideShare upload public. If you make it private, the "embed slideshare" functionality will be disabled.

Friday, May 21, 2010

JCVI Team Creates Functional Microbe Controlled By Synthetic Genome

May 20, 2010

By a GenomeWeb staff reporter

NEW YORK (GenomeWeb News) – A team of J. Craig Venter Institute researchers reported online today in Science that they have successfully created the first functional bacterial cells controlled by a synthetic genome.
The researchers amalgamated several of their previously reported approaches for the study, which involved creating a synthetic Mycoplasma mycoides genome called JCVI-syn1.0 and transplanting it into a M. capricolum strain. In so doing, the team was able to produce functional, self-replicating cells that closely resemble natural M. mycoides cells.
"This work provides a proof of principle for producing cells based on genome sequences designed in the computer," senior author Craig Venter and his colleagues wrote. "[T]he approach we have developed should be applicable to the synthesis and transplantation of more novel genomes as genome design progresses."
"This is the first self-replicating species we've had on the planet whose parent is a computer," Venter said today during a telephone briefing with reporters.
In early 2008, researchers from the Venter Institute reported on the first synthetic genome, creating four M. genitalium quarter-genomes that were assembled in yeast. The team subsequently streamlined this process so that dozens of pieces of the M. genitalium genome could be assembled in yeast in a single step.
Because M. genitalium grows very slowly, the researchers explained, they decided to design a new synthetic genome based on the sequence of another species — the M. mycoides subspecies capri — for their current synthetic transplant work.
JCVI researchers previously showed that they could transfer a natural M. mycoides genome into M. capricolum — most recently using yeast as a stop en route in this transplant process.
For the current study, funded by Synthetic Genomics, the team designed cassettes for building a synthetic M. mycoides subspecies capri GM12 genome based on finished genome sequences for two M. mycoides strains — one used as a genome donor in a previous genome transfer study and another containing a transplanted genome that was cloned in yeast.
The latter strain was primarily used as the design reference, the researchers noted, with the synthetic genome matching that genome at all but 19 harmless polymorphisms.
Similar to synthetic genomes designed at JCVI in the past, the team tossed watermark sequences into the synthetic genome, placing them at sites that weren't expected to affect cell growth or viability. Such watermarks are intended "to absolutely make clear that the DNA was synthetic," Venter said.
The watermark sequences used in the new M. mycoides synthetic genome were designed using a code containing frequent stop codons that represents all of the letters in the English alphabet as well as punctuation, Venter explained. These watermarks not only contain the names of nearly four dozen study authors and project contributors, but also a web address and three quotations, he added, including quotes from James Joyce and Richard Feynman.
The Washington-based company Blue Heron synthesized the cassettes, each about 1,080 base pairs long, and these cassettes were then assembled via a series of steps in yeast and Escherichia coli, using multiplex PCR and restriction enzyme analyses to find and verify complete synthetic genomes.
Next, the team transplanted complete synthetic genomes into M. capricolum subspecies capricolum cells lacking restriction enzymes that would chop up the transferred genome, which had been unmethylated during its detour in yeast.
Once they had found cells harboring the synthetic genome, the researchers tested the functionality, characteristics, and growth patterns of the recipient cell, comparing those transplanted with either fully synthetic or semi-synthetic genomes.
Indeed, they found that cells transplanted with the complete synthetic M. mycoides genome are capable of self-replication, have phenotypic and growth patterns resembling natural M. mycoides cells, and produce a set of proteins that appears to be nearly identical to those found in M. mycoides, though their growth rate was slightly higher than natural control cells in at least one set of experiments.
Even so, the researchers cautioned, synthetic genome transplantation is far from simple and relies on precise genetic information.
"[O]btaining an error-free genome that could be transplanted into a recipient cell to create a new cell controlled only by the synthetic genome was complicated and required many quality control steps," the team noted, pointing to a problem they encountered when they inadvertently attempted to transplant a synthetic genome carrying a lone deletion in an essential gene.
"One wrong base out of over one million in an essential gene rendered the genome inactive, while major genome insertions and deletions in non-essential parts of the genome had no observable impact on viability," they wrote.
Overall though, those involved in the study are optimistic that their approach holds potential for creating synthetic cells with a range of applications — including bugs that can be used for everything from biofuel production and environmental cleanup to vaccine production.
"If the methods described here can be generalized, design, synthesis, assembly, and transplantation of synthetic chromosomes will no longer be a barrier to the progress of synthetic biology," the researchers concluded. "We expect that the cost of DNA synthesis will follow what has happened with DNA sequencing and continue to exponentially decrease. Lower synthesis costs combined with automation will enable broad applications for synthetic genomics."
For instance, Venter noted that the team plans to tackle the problem of making synthetic algae cells in the near future.
"I have no doubt that [co-author Daniel Gibson] and the team can easily make a synthetic algae chromosome," he said. "I think the biology will be challenging because we have to find an appropriate recipient cell to boot up that chromosome. Both have to happen in parallel."
Venter said Synthetic Genomics has filed multiple patents related to the methods used in various stages of the synthetic cell research on behalf of JCVI.

Thursday, May 06, 2010

The rap guide to Evolution

-By Baba Brinkman

Brinkman is a Canadian from Vancouver, a self-styled "rap troubadour" with a master's degree in English and a history of tree-planting (according to his web site, he has personally planted more than one million trees).

This is the only hip hop show that talks about mitochondria, genetic drift, sexual selection or memes. For Brinkman has taken Darwin’s exhortation seriously. He is a man on a mission to spread the word about evolution — how it works, what it means for our view of the world, and why it is something to be celebrated rather than feared.

At the end of the show he talks about the social slime mold Dictyostelium discoideum streaming together while rapping about how cooperation evolves.
Dictyostelium is notorious, in some circles, for its strange life-style. Usually, an individual Dictyostelium lives alone as a single cell. But when food is scarce, the single cells come together and form a being known as “the slug”; this crawls off in search of better conditions. When it finds them, the slug develops into a stalked fruiting body, and releases spores. But here’s the mystery: not all members of the slug get to make spores — and thereby contribute to the next generation — so why do they cooperate?
Here it is:

Wednesday, April 28, 2010

Simple pubmed API

$eutils = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'; // E-utilities base URL, reconstructed from the NCBI docs
if ($xml = simplexml_load_file("$eutils/esearch.fcgi?db=pubmed&usehistory=y&term=" . urlencode($argv[1]))) {
    if ($xml = simplexml_load_file("$eutils/esummary.fcgi?db=pubmed&query_key={$xml->QueryKey}&WebEnv={$xml->WebEnv}&retstart=0&retmax=10")) {
        $docs = $xml->DocSum; // one DocSum element per matching PubMed record
    }
}
More can be found here

Thursday, March 25, 2010

Packaging your source code

I have written a few software packages but never officially released any of them in the nicer ./configure && make && make install format.

We have created a C package that does what GCRMA would do, plus much more, in a very memory-efficient way. Although the GCRMA C source is available for free consumption, it needed quite some work on our end to customize it. Now that it is written, I have settled on releasing it under a GNU open-source license.

A few things need to be taken care of before creating the package:

1. Put all your C sources in the src/ directory
2. Data, such as sample CEL files, go in the data/ directory
3. Accessory scripts (split file, merge file, plotting with R) go in the scripts/ directory
4. Documents go in the doc/ directory
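The layout above can be created in one go; a minimal sketch (the package directory name modified-gcrma is a placeholder):

```shell
# Create the standard source layout described above
for d in src data scripts doc man; do
    mkdir -p "modified-gcrma/$d"
done

# Empty files that automake will later expect to exist (GNU strictness)
touch modified-gcrma/NEWS modified-gcrma/README \
      modified-gcrma/AUTHORS modified-gcrma/ChangeLog

ls modified-gcrma
```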


First, run autoscan.

In all likelihood this command will exit with errors; nevertheless, it produces a configure.scan file.
Open the configure.scan file and edit the generated placeholder line

AC_INIT(FULL-PACKAGE-NAME, VERSION, BUG-REPORT-ADDRESS)

into useful inputs like

AC_INIT(ModifiedGCRMA, 1.0, your-bug-report-address)

Then rename the file ( was the name used by earlier autoconf versions):

$ mv configure.scan

Running autoconf on this file will create the configure script.

You need to create a file inside each of your src/, scripts/, man/, and doc/ directories, with appropriate values.

In your src/ directory's, you may like to keep information like:

# flags to pass to the C compiler
AM_CFLAGS = -pedantic -Wall -std=c99 -O2

# the binary to produce (the name must be a single word, so "Modified GCRMA"
# becomes modified_gcrma); automake generates the compile and link rules
# itself, so the explicit main.o / cc -c rules of a hand-written Makefile
# are not needed here
bin_PROGRAMS = modified_gcrma
modified_gcrma_SOURCES = main.c calculate.c read_seq.c read_file.c \
        rev_complement.c detect_chimera.c find_orf.c command.c \
        process.c utility.h

# extra libraries to link against
modified_gcrma_LDADD = -lm -lz

In the scripts/ directory, the lists the scripts to install:

bin_SCRIPTS = plot.R

In the man/ directory, it lists the man pages:

man_MANS = man.1 man.2 man.3 ...

Now go back to the file and add, just after the AC_INIT line:

AM_INIT_AUTOMAKE(ModifiedGCRMA, 1.0)

This will initialize automake.

And at the bottom of the file make the following changes

AC_OUTPUT(Makefile src/Makefile doc/Makefile man/Makefile scripts/Makefile)
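Putting these edits together, a minimal for this package might look like the following sketch (the bug-report address is a placeholder; AC_PROG_CC and AC_CONFIG_HEADERS are standard macros that autoscan typically generates for a C project):

```
AC_INIT(ModifiedGCRMA, 1.0, your-bug-report-address)
AM_INIT_AUTOMAKE(ModifiedGCRMA, 1.0)
AC_PROG_CC
AC_CONFIG_HEADERS([config.h])
AC_OUTPUT(Makefile src/Makefile doc/Makefile man/Makefile scripts/Makefile)
```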

Now run aclocal, followed by automake --add-missing.

automake will read and create a file, which configure will later use to build the final Makefile.

While running automake, be cautioned: you may be asked for missing files such as NEWS, README, AUTHORS, and ChangeLog.

Never mind; you can create them as empty files:

touch NEWS
touch README
touch AUTHORS
touch ChangeLog

If automake did not generate the previous time, run it again after creating these files.

You may like to rebuild the configure script using autoconf.

Once this is done, you are all set. You may run ./configure --prefix="YOUR_PATH"

followed by make and make install.

You may pack your release tarball using the command make dist.
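Condensed, the whole sequence looks like this (assuming the GNU autotools are installed; YOUR_PATH is the install prefix):

```
$ autoscan                        # generates configure.scan; edit AC_INIT as above
$ mv configure.scan
$ aclocal                         # collects the m4 macros automake needs
$ automake --add-missing          # turns into
$ autoconf                        # (re)builds the configure script
$ ./configure --prefix="YOUR_PATH"
$ make && make install
$ make dist                       # packs the release tarball
```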

A few of the common autoconf macros are listed here.

Thursday, March 04, 2010

A list of free and paid software for next-gen sequence analysis

A great list of free and paid software for next-gen sequence analysis is available at the SEQanswers forum here.

To browse through a compiled list of bioinformatics software, click here.

A list of publications from the Bioinformatics journal is here.
I am now working on samtools and bowtie and will write a detailed review soon...

Tuesday, March 02, 2010

AGBT 2010

AGBT (Advances in Genome Biology and Technology) recently concluded in Florida (24-27 February 2010). Anthony Fejes has done a great job of summarizing the presentations on his blog. Check it out for details.