Friday, November 05, 2010

Generating random DNA/protein sequence permutation

In general terminology, sequence permutation is defined as shuffling of the bases without disturbing the base composition. In other words, all the permuted sequence should have the same base composition.It is fairly easy to implement such permutations when we talk about single nucleotide composition; and Fisher and Yates 1938, algorithm is a good one for implementing it without introducing much sequence bias. However, in real life situation, it may be necessary to generate a random DNA sequence where there is a need to preserve the di-tri nucleotide frequencies in overlapping windows.Altschul SF, Erickson BW., have explained how to generate such a sequence while conserving the di-tri nucleotide composition using a  graph theory of generating random Eulerian walks on a directed multigraph. [Paper Here]

R has a bioString object with DNAString function, that although does not implement Fisher Yates algorithm, but generates non-biased permuted DNA string. For dinucleotide/trinucleotide composition preservation also, this method can be used effectively as long as they don't overlap:

Example Code:
>library(Biostrings)
>  x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE,prob=c(0.2, 0.55
, 0.1, 0.15)), collapse="")
> x <- DNAString(x)
>x # will print x
> alphabetFrequency(x,baseOnly=TRUE)   # Will print base composition in numbers

In order to run it for di/tri nucleotide, just change the code slightly into:
>  x <- paste(sample(c("AT", "CC", "GT", "TA", "AA"), 1000, replace=TRUE,prob=c(0.2, 0.55
, 0.1, 0.10,0.05)), collapse="") 
> x<-DNAString(x)
> x
  2000-letter "DNAString" instance
seq: CCATCCCCCCCCTAATCCTATACCTATACCTACCGT...CCATATTATATACCTAATAACCCCCCTACCATCCAT
Please NOTE: Here you will not get any other dinucleotide than specified IN FRAME

Same can be done for tri-tetra nucleotides as well.

A very simple unix implementation of random sequence generator could be:

$echo {a,t,g,c}{a,t,g,c}{a,t,g,c}{a,t,g,c}
This generates a cross product of a,t,g,c to a,t,g,c, and thus has all tetra nucleotides represented. Similar order can be generated for dinucleotides such as:
$echo {aa,tt,gg,cc}{at,tg,ga,ca}{aa,tc,ag,tc}{aa,at,cg,cc}
Enjoy generating permuted sequences.

No comments: