genomics: November 2010

Monday, November 29, 2010

Bayesian Probability theory and its application

The two probability theories, that are considered as pillars of all other complicated probabilistic models are; Addition theory and Multiplication theory:

Addition Theory(OR Theory):
If two events are mutually exclusive, then the possibility of occurrence of event 1 or event 2 is the addition of two probabilities.
Example: Possibility of having an ace or a joker from a pack of 54 cards. There are only 2 jokers and 4 aces. So, the probability is: 2/54 + 4/54 = 6/54 = 1/9.

Addition theory is slightly modified where the events are not mutually exclusive. For example, having a diamond and a queen from a pack of 52 cards .Symbolically, it is represented as: P(A or B) = P(A) + P(B) - P(A and B). So, the result will be 13/52 + 4/52 - 1/52.(Reason: there are 13 diamonds, 4 queens and only one queen in 52 cards). The result therefore is: 16/52= 4/13.

Multiplication Theory: (AND Theory)
When two events A and B are mutually exclusive, then the possibility of occurrence of A and B is P(A) X P(B).
Example: Tossing two coins simultaneously to obtain two tails or two heads is 1/4 X 1/4 = 1/16.

Bayesian Probability:
Bayesian statistics or Bayesian probability is also called as conditional probability where two different conditions are evaluated jointly. Common example is a loaded die in a casino where there are 99% of the dies are honest but 1% of the dies are loaded. Where with the loaded die, there is a possibility of getting a 6 is 50% as against a honest die, where getting a 6 occurs 1/6th of the time. So, what is the probability that one gets 3 consecutive sixes?
In theory Bayesian statistics can be applied where:

The sample space is partitioned into a set of mutually exclusive events { A₁, A₂, . . . , A_n }.
Within the sample space, there exists an event B, for which P(B) > 0.
The analytical goal is to compute a conditional probability of the form: P( A_k | B ).
You know at least one of the two sets of probabilities described below.

P( A_k ∩ B ) for each A_k
P( A_k ) and P( B | A_k ) for each A_k.

So, in the above case, we would like to know what is the possibility(likelihood) that there are 3 consecutive sixes, if it were a loaded die.
P(D Loaded|3 sixes) =                                  P(3 sixes in Loaded die) * P(loaded die)
                                   P(3 sixes in Loaded die) * P(loaded die) + P(3 sixes in fair die) * P(Fair die)
   =    (0.5)³ * (0.01)
                                           (0.5)³ * (0.01) + (1/6)³ * 0.99
=    0.21

Same can be applied to test, what is the possibility that there are 3 consecutive sixes in a fair die:

(1/6)³ * 0.99/ [ (0.5)³ * (0.01) + (1/6)³ * 0.99 ] = 0.78
Meaning that there is a fair chance that you will get 3 sixes consecutively in a unloaded die than a loaded die.

Same probability theory can be applied for testing a case where the weather prediction channels have predicted about a sunny or a rainy day in a given day. If it rains 5 times a whole year and the weather prediction channel predicted about a rainy day for a particular day and the accuracy of the weather channel is about 90%, then we can calculate the possibility of it is raining the given day is:
(5/365) * 0.9 / [ (5/365) * 0.9 + (360/365) * 0.1] = 0.111

Bayesian statistics is often used with sequence analysis where we have a situation to judge whether a protein is extracellular or intracellular. These 2 situations are mutually exclusive i.e; A protein can either be extracellular or intracellular and having more number of cystein residues at certain location decide at a certain confidence level whether that is the case.
Another test case could be:

A rare genetic disease is discovered. Although only one in a million
people carry it, you consider getting screened. You are told that the genetic
test is extremely good; it is 100% sensitive (it is always correct if
you have the disease) and 99.99% specific (it gives a false positive result
only 0.01 % of the time). Using Bayes' theorem, explain why you might
decide not to take the test.

Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo [Jacob 19771].

Tuesday, November 23, 2010

Dynamic programming for sequence Alignment

According to wikipedia:
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler steps. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller^[1] and optimal substructure (described below). When applicable, the method takes far less time than naïve methods.
Top-down dynamic programming simply means storing the results of certain calculations, which are later used again since the completed calculation is a sub-problem of a larger calculation. Bottom-up dynamic programming involves formulating a complex calculation as a recursive series of simpler calculations.

In biological sequence analysis, dynamic programming is often used for sequence alignment using global sequence alignment(Needleman and Wunsch method).
Here I present a simplest interpretation of dynamic programming for aligning two sequences using global alignment(N&W method).

We have sequence 1: G A A T T C A G T T A
sequence 2: G G A T C G A
Here;
Length(seq1) = 11 and Length(seq2) = 7; lets call them k and l
In order to align them globally using dynamic programming method, we have to do the following;
1. Build a matrix M of size (k+1) X (l+1), putting seq1 in columns and seq2 in columns
2. Initialize the matrix where M[0,0..k] and M[0..l,0] is initialized into zero[Fig -1].

Fig - 1

The rest of the rows and columns can be filled using the following mathematical notation:

M_i,j = MAXIMUM[
     M_{i-1, j-1} + S_i,j (match/mismatch in the diagonal),
     M_i,j-1 + w (gap in sequence #1),
     M_i-1,j + w (gap in sequence #2)]

where S_i,j is 1 for a match and 0 for a mismatch

w = 0

Now, lets talk about filling M(1,1) = MAX[ {M(0,0) + S(1,1)},

                                           {M(1,0) + W},

                                           {M(0,1) + W} ]

= MAX[ {0 + 1}, {0 + 0}, {0+0} ]

= 1

Going by this trend, M[1,2] is going to be: Max[{1+0},{0+0},{0+0} ] = 1

So, we can start filling in the matrix this way:

From this, a common observation can be drawn:

1. Increment by one the value of a cell only when there is a perfect match to

that of the previous diagonal.

2. If there is a mismatch always continue with the value from the previous column/row.

Going by this, we can end up filling the matrix with the following values:

Now the last phase in dynamic programming:
3) Tracing back

Tracing back almost always begins with the highest score and looks to the diagonal up or to the left(gap in sequence-1) or up (gap in seq -2)

OR

G - A A T T C A G T T A
| | | | | |
G G A - - T C - G - - A

For longer sequences large number of other combination is possible, but the dynamic programming outputs only one result.

[Ref: Images adapted from Eric C. Rouchka's web page]

In real life examples, different penalty scores are given for a mismatch and for a gap[Every time, you go down the columns or on a row without a match should be given a penalty as a gap]

Friday, November 05, 2010

Generating random DNA/protein sequence permutation

In general terminology, sequence permutation is defined as shuffling of the bases without disturbing the base composition. In other words, all the permuted sequence should have the same base composition.It is fairly easy to implement such permutations when we talk about single nucleotide composition; and Fisher and Yates 1938, algorithm is a good one for implementing it without introducing much sequence bias. However, in real life situation, it may be necessary to generate a random DNA sequence where there is a need to preserve the di-tri nucleotide frequencies in overlapping windows.Altschul SF, Erickson BW., have explained how to generate such a sequence while conserving the di-tri nucleotide composition using a graph theory of generating random Eulerian walks on a directed multigraph. [Paper Here]

R has a bioString object with DNAString function, that although does not implement Fisher Yates algorithm, but generates non-biased permuted DNA string. For dinucleotide/trinucleotide composition preservation also, this method can be used effectively as long as they don't overlap:

Example Code:
>library(Biostrings)
> x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE,prob=c(0.2, 0.55
, 0.1, 0.15)), collapse="")
> x <- DNAString(x)
>x # will print x
> alphabetFrequency(x,baseOnly=TRUE) # Will print base composition in numbers

In order to run it for di/tri nucleotide, just change the code slightly into:
> x <- paste(sample(c("AT", "CC", "GT", "TA", "AA"), 1000, replace=TRUE,prob=c(0.2, 0.55
, 0.1, 0.10,0.05)), collapse="")
> x<-DNAString(x)
> x
2000-letter "DNAString" instance
seq: CCATCCCCCCCCTAATCCTATACCTATACCTACCGT...CCATATTATATACCTAATAACCCCCCTACCATCCAT
Please NOTE: Here you will not get any other dinucleotide than specified IN FRAME

Same can be done for tri-tetra nucleotides as well.

A very simple unix implementation of random sequence generator could be:

$echo {a,t,g,c}{a,t,g,c}{a,t,g,c}{a,t,g,c}
This generates a cross product of a,t,g,c to a,t,g,c, and thus has all tetra nucleotides represented. Similar order can be generated for dinucleotides such as:
$echo {aa,tt,gg,cc}{at,tg,ga,ca}{aa,tc,ag,tc}{aa,at,cg,cc}
Enjoy generating permuted sequences.

Wednesday, November 03, 2010

Some unix/perl oneliners for Bioinformatics

1    File format conversion/line counting/counting number of files etc.

1.    $ wc –l : count number of lines in a file.
2.    $ ls | wc –l        : count number of files in a directory.
3.    $ tac     : print the file in reverse order e.g; last line first, first line last.
4.    $ rev     : reverse the file in lines.
5.    $ sed 's/.$//' or sed 's/^M$//' or sed 's/\x0D$//' : converts a dos file into unix mode.
6.    $sed "s/$/`echo -e \\\r`/" or sed 's/$/\r/' or sed "s/$//": converts a unix newline into a DOS newline.
7.    $ awk '1; { print "" }' : Double space a file.
8.    $ awk '{ total = total + NF }; END { print total+0 }' : prints the number of words in a file.
9.    $sed '/^$/d' or [grep ‘.’] : Delete all blank lines in a file.
10.    $sed '/./,$!d' : Delete all blank lines in the beginning of the file.
11.    $sed -e :a -e '/^\n*$/{$d;N;ba' -e '}': Delete all blank lines at the end of the file.
12.    $sed -e :a -e 's/<[^>]*>//g;/
13.    $sed 's/^[ \t]*//' : deleting all leading white space tabs in a file.
14.    $ sed 's/[ \t]*$//' : Delete all trailing white space and tab in a file.
15.    $ sed 's/^[ \t]*//;s/[ \t]*$//' : Delete both leading and trailing white space and tab in a file.

2.2    Working with Patterns/numbers in a sequence file
16.    $awk '/Pattern/ { n++ }; END { print n+0 }' : print the total number of lines containing the word pattern.
17.    $sed 10q : print first 10 lines.
18.    $sed -n '/regexp/p' : Print the line that matches the pattern.
19.    $sed '/regexp/d' : Deletes the lines that matches the regexp.
20.    $sed -n '/regexp/!p' : Print the lines that does not match the pattern.
21.    $sed '/regexp/!d' : Deletes the lines that does NOT match the regular expression.
22.    $sed -n '/^.\{65\}/p' : print lines that are longer than 65 characters.
23.    $sed -n '/^.\{65\}/!p' : print lines that are lesser than 65 characters.
24.    $sed -n '/regexp/{g;1!p;};h' : print one line before the pattern match.
25.    $sed -n '/regexp/{n;p;}' : print one line after the pattern match.
26.    $sed -n '/^.\{65\}/ {g;1!p;};h' < sojae_seq > tmp : print the names of the sequences that are larger than 65 nucleotide long.
27.    $sed -n '/regexp/,$p' : Print regular expression to the end of file.
28.    $sed -n '8,12p' : print line 8 to 12(inclusive)
29.    $sed -n '52p' : print only line number 52.
30.    $seq ‘/pattern1/,/pattern2/d’ < inputfile > outfile : will delete all the lines between pattern1 and pattern2.
31.    $sed ‘/20,30/d’ < inputfile > outfile : will delete all lines between 20 and 30.   OR sed ‘/20,30/d’ < input > output will delete lines between 20 and 30.
32.    awk '/baz/ { gsub(/foo/, "bar") }; { print }' : Substitute foo with bar in lines that contains ‘baz’.
33.    awk '!/baz/ { gsub(/foo/, "bar") }; { print }' : Substitute foo with bar in lines that does not contain ‘baz’.
34.    grep –i –B 1 ‘pattern’ filename > out : Will print the name of the sequence and the sequence having the pattern in a case insensitive way(make sure the sequence name and the sequence each occupy a single line).
35.    grep –i –A 1 ‘seqname’ filename > out : will print the sequence name as well as the sequence into file ‘out’.

2.3    Inserting Data into a file:

36.    gawk --re-interval 'BEGIN{ while(a++<49) s=s "x" }; { sub(/^.{6}/,"&" s) }; 1' > fileout : will insert 49 ‘X’ in the sixth position of every line.

37.    gawk --re-interval 'BEGIN{ s="YourName" }; { sub(/^.{6}/,"&" s) }; 1' : Insert your name at the 6 th position in every line.

3.    Working with Data Files[Tab delimited files]:

3.1    Error Checking and data handling:
38.    awk '{ print NF ":" $0 } ' : print the number of fields of each line followed by the line.
39.    awk '{ print $NF }' : print the last field of each line.
40.    awk 'NF > n' : print every line with more than ‘n’ fields.
41.    awk '$NF > n' : print every line where the last field is greater than n.
42.    awk '{ print $2, $1 }' : prints just first 2 fields of a data file in reverse order.
43.    awk '{ temp = $1; $1 = $2; $2 = temp; print }' : prints all the fields in the correct order except the first 2 fields.
44.    awk '{ for (i=NF; i>0; i--) printf("%s ", $i); printf ("\n") }' : prints all the fields in reverse order.
45.    awk '{ $2 = ""; print }' : deletes the 2nd field in each line.
46.    awk '$5 == "abc123"' : print each line where the 5th field is equal to ‘abc123’.
47.    awk '$5 != "abc123"' : print each line where 5th field is NOT equal to abc123.
48.    awk '$7 ~ /^[a-f]/' : Print each line whose 7th field matches the regular expression.
49.    awk '$7 !~ /^[a-f]/' : print each line whose 7th field does NOT match the regular expression.
50.    cut –f n1,n2,n3.. > output file : will cut n1,n2,n3 columns(fields) from input file and print the output in output file. If delimiter is other than TAB then give additional argument such as cut –d ‘,’ –f n1,n2.. inputfile > out
51.    sort –n –k 2,2 –k 4,4 file > fileout : Will conduct a numerical sort of column 2, and then column 4. If –n is not specified, then, sort will do a lexicographical sort(of the ascii value).

4.    Miscellaneous:
52.    uniq –u inputfile > out : will print only the uniq lines present in the sorted input file.
53.    uniq –d inputfile > out : will print only the lines that are in doubles from the sorted input file.
54.    cat file1 file2 file3 … fileN > outfile : Will concatenate files back to back in outfile.
55.    paste file1 file2 > outfile : will merge two files horizontally. This function is good for merging with same number of rows but different column width.
56.    !:p : will print the previous command run with the ‘pattern’ in it.
57.    !! : repeat the last command entered at the shell.
58.    ~ : Go back to home directory
59.    echo {a,t,g,c}{a,t,g,c}{a,t,g,c}{a,t,g,c} : will generate all tetramers using ‘atgc’. If you want pentamers/hexamers etc. then just increase the number of bracketed entities.NOTE: This is not a efficient sequence shuffler. If you wish to generate longer sequences then use other means.
60.    kill -HUP ` ps -aef | grep -i firefox | sort -k 2 -r | sed 1d | awk ' { print $2 } ' ` : Kills a hanging firefox process.
61.    csplit -n 7 input.fasta '/>/' '{*}' : will split the file ‘input.fasta’ wherever it encounters delimiter ‘>’. The file names will appear as 7 digit long strings.
62.    find . -name data.txt –print: finds and prints the path for file data.txt.
Sample Script to make set operations on sequence files:
63.    grep ‘>’ filenameA > list1 # Will list just the sequence names in a file names.
grep ‘>’ filenameB > list2 # Will list names for file 2
cat list1 list2 > tmp # concatenates list1 and list2 into tmp
sort tmp > tmp1 # File sorted
uniq –u tmp1 > uniq    # AUB – A ∩ B (OR (A-B) U (B-A))
uniq –d tmp1 > double # Is the intersection (A ∩ B)
cat uniq double > Union # AUB
cat list1 double > tmp
sort tmp | uniq –u > list1uniq # A - B
cat list2 double > tmp
sort tmp | uniq –u > list2uniq # B - A

PERL ONELINERS:

1.    perl -pe '$\="\n"' : double space a file
2.    perl -pe '$_ .= "\n" unless /^$/' : double space a file except blank lines
3.    perl -pe '$_.="\n"x7' : 7 space in a line.
4.    perl -ne 'print unless /^$/' : remove all blank lines
5.    perl -lne 'print if length($_) < 20' : print all lines with length less than 20.
6.    perl -00 -pe '' : If there are multiple spaces, delete all leaving one(make the file a single spaced file).
7.    perl -00 -pe '$_.="\n"x4' : Expand single blank lines into 4 consecutive blank lines
8.    perl -pe '$_ = "$. $_"': Number all lines in a file
9.    perl -pe '$_ = ++$a." $_" if /./' : Number only non-empty lines in a file
10.    perl -ne 'print ++$a." $_" if /./' : Number and print only non-empty lines in a file
11.    perl -pe '$_ = ++$a." $_" if /regex/' ; Number only lines that match a pattern
12.    perl -ne 'print ++$a." $_" if /regex/' : Number and print only lines that match a pattern
13.    perl -ne 'printf "%-5d %s", $., $_ if /regex/' : Left align lines with 5 white spaces if matches a pattern (perl -ne 'printf "%-5d %s", $., $_' : for all the lines)
14.    perl -le 'print scalar(grep{/./}<>)' : prints the total number of non-empty lines in a file
15.    perl -lne '$a++ if /regex/; END {print $a+0}' : print the total number of lines that matches the pattern
16.    perl -alne 'print scalar @F' : print the total number fields(words) in each line.
17.    perl -alne '$t += @F; END { print $t}' : Find total number of words in the file
18.    perl -alne 'map { /regex/ && $t++ } @F; END { print $t }' : find total number of fields that match the pattern
19.    perl -lne '/regex/ && $t++; END { print $t }' : Find total number of lines that match a pattern
20.    perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m' : will calculate the GCD of two numbers.
21.    perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m' : will calculate lcd of 20 and 35.
22.    perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n' : Generates 10 random numbers between 5 and 15.
23.    perl -le 'print map { ("a".."z",”0”..”9”)[rand 36] } 1..8': Generates a 8 character password from a to z and number 0 – 9.
24.    perl -le 'print map { ("a",”t”,”g”,”c”)[rand 4] } 1..20': Generates a 20 nucleotide long random residue.
25.    perl -le 'print "a"x50': generate a string of ‘x’ 50 character long
26.    perl -le 'print join ", ", map { ord } split //, "hello world"': Will print the ascii value of the string hello world.
27.    perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)': converts ascii values into character strings.
28.    perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"': Generates an array of odd numbers.
29.    perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"': Generate an array of even numbers
30.    perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file: Convert the entire file into 13 characters offset(ROT13)
31.    perl -nle 'print uc' : Convert all text to uppercase:
32.    perl -nle 'print lc' : Convert text to lowercase:
33.    perl -nle 'print ucfirst lc' : Convert only first letter of first word to uppercas
34.    perl -ple 'y/A-Za-z/a-zA-Z/' : Convert upper case to lower case and vice versa
35.    perl -ple 's/(\w+)/\u$1/g' : Camel Casing
36.    perl -pe 's|\n|\r\n|' : Convert unix new lines into DOS new lines:
37.    perl -pe 's|\r\n|\n|' : Convert DOS newlines into unix new line
38.    perl -pe 's|\n|\r|' : Convert unix newlines into MAC newlines:
39.    perl -pe '/regexp/ && s/foo/bar/' : Substitute a foo with a bar in a line with a regexp.

Some other Perl Tricks

Want to display some progress bars while perl does your job:

For this perl provides a nice utility called "pipe opens" ('perldoc -f open' will provide more info)

open(my $file, '-|', 'command','option', 'option', ...) or die "Could not run tar ... - $!";
  while (<$file>) {
       print "-";
  }
  print "\n";
  close($file);

Will print - on the screen till the process is completed