Thursday, February 12, 2009

Finding overlap


In sequence analysis, it is common to look for overlaps between two sets of features: coding sequences and repeats, coding sequences from two different prediction programs, two different sets of coding sequences, and so on. If the data are in GFF files and are read into hashes, with the gene as the key and its positions held in an array, the comparison can be made as follows:
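A minimal sketch of that data layout, assuming GFF3-style input (nine tab-separated columns, start and end in columns 4 and 5, an ID= tag in the attributes column). The sub name read_gff_positions and the sample records are made up for illustration, and only one start/end pair is kept per feature:

```perl
use strict;
use warnings;

# Hypothetical helper: turn GFF text into a hash of feature ID => [start, end].
# Assumes tab-separated GFF3 lines with an "ID=" tag in the ninth column.
sub read_gff_positions {
    my ($gff_text) = @_;
    my %positions;
    for my $line (split /\n/, $gff_text) {
        next if $line =~ /^#/ || $line !~ /\S/;    # skip comments and blank lines
        my @col = split /\t/, $line;
        next unless @col >= 9;                      # not a full GFF record
        my ($start, $end, $attr) = @col[3, 4, 8];
        my ($id) = $attr =~ /(?:^|;)ID=([^;]+)/ or next;   # no ID tag, nothing to key on
        ($start, $end) = ($end, $start) if $start > $end;  # keep left <= right
        $positions{$id} = [$start, $end];
    }
    return \%positions;
}

# Made-up sample records, for illustration only.
my $sample = join "\n",
    "##gff-version 3",
    join("\t", 'chr1', 'pred', 'gene', 100, 200, '.', '+', '.', 'ID=geneA'),
    join("\t", 'chr1', 'pred', 'gene', 500, 900, '.', '-', '.', 'ID=geneB;Name=x');

my $genes = read_gff_positions($sample);
print "$_: @{ $genes->{$_} }\n" for sort keys %$genes;
```

With a real file, the same sub could be fed the file contents; a second call on the repeat file would give the hash to compare against.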

Let's find the overlap between a gene and a repeat element.

Let geneLeft and repeatLeft be the first elements of the gene and repeat arrays respectively, and let geneRight and repeatRight be the last elements of the gene and repeat arrays respectively.

# If the repeat element starts to the left of the gene element
# and overlaps its left-hand side
repeatLeft <= geneLeft and repeatRight >= geneLeft # Note: the second test is not repeatRight <= geneRight. That condition would be true even if the repeat lies far upstream of the gene element and the two do not overlap.

# If the repeat lies towards the right-hand side of the gene and overlaps its right side

repeatLeft <= geneRight and repeatRight >= geneRight

# If the repeat element lies within the gene element

geneLeft <= repeatLeft and repeatRight <= geneRight

# If the gene element lies inside the repeat element

repeatLeft <= geneLeft and geneRight <= repeatRight

If any of these conditions is true, the two features overlap (join the conditions with 'or').
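As a rough sketch, the four conditions can be joined in one Perl sub. The names overlap_four and overlap_short and all coordinates are invented for this example. For closed intervals whose left end is not greater than the right end, the four-way test reduces to two comparisons, which the second sub shows:

```perl
use strict;
use warnings;

# The four cases from the text, joined with "or".
sub overlap_four {
    my ($geneLeft, $geneRight, $repeatLeft, $repeatRight) = @_;
    my $hit =
           ($repeatLeft <= $geneLeft   && $repeatRight >= $geneLeft)    # overlaps the left end
        || ($repeatLeft <= $geneRight  && $repeatRight >= $geneRight)   # overlaps the right end
        || ($geneLeft   <= $repeatLeft && $repeatRight <= $geneRight)   # repeat inside gene
        || ($repeatLeft <= $geneLeft   && $geneRight   <= $repeatRight); # gene inside repeat
    return $hit ? 1 : 0;
}

# Equivalent shorter test: each feature starts before the other one ends.
sub overlap_short {
    my ($geneLeft, $geneRight, $repeatLeft, $repeatRight) = @_;
    my $hit = $repeatLeft <= $geneRight && $geneLeft <= $repeatRight;
    return $hit ? 1 : 0;
}

# Invented gene at 100..200 compared against a few invented repeats.
my @cases = ([50, 120], [180, 250], [120, 150], [50, 300], [10, 20], [300, 400]);
for my $r (@cases) {
    my $four  = overlap_four(100, 200, @{$r});
    my $short = overlap_short(100, 200, @{$r});
    print "repeat @{$r}: four-way=$four short=$short\n";
}
```

The repeat at 10..20 lies upstream of the gene. It would pass a test of repeatLeft <= geneLeft and repeatRight <= geneRight, which is why the left-end case checks repeatRight >= geneLeft instead.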

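# Assuming @coord is the gene array and $coord1[$i], $coord1[$i+1] are the start and end of one repeat
if( ($coord1[$i] <= $coord[0] && $coord1[$i+1] >= $coord[-1]) ||   # gene lies inside the repeat
    ($coord1[$i] >= $coord[0] && $coord1[$i+1] <= $coord[-1]) ||   # repeat lies inside the gene
    ($coord1[$i] <= $coord[0] && $coord1[$i+1] >= $coord[0]) ||    # repeat overlaps the left end of the gene
    ($coord1[$i] >= $coord[0] && $coord1[$i] <= $coord[-1]) ){     # repeat starts inside the gene (covers the right-end overlap)
    # the gene and the repeat overlap
}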
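For context, here is a minimal sketch of how such a test might sit inside a loop. It assumes @coord holds the gene positions and @coord1 is a flat list of repeat start/end pairs; the post does not show how either array is built, and the values and the @hits array are invented:

```perl
use strict;
use warnings;

my @coord  = (100, 150, 160, 200);         # invented gene positions; first and last are the ends
my @coord1 = (10, 20, 90, 110, 190, 260);  # invented repeats stored as start, end, start, end, ...

my @hits;
# Step through the flat list two entries at a time: one start/end pair per repeat.
for (my $i = 0; $i < $#coord1; $i += 2) {
    my ($repeatLeft, $repeatRight) = @coord1[$i, $i + 1];
    if (   ($repeatLeft <= $coord[0] && $repeatRight >= $coord[-1])
        || ($repeatLeft >= $coord[0] && $repeatRight <= $coord[-1])
        || ($repeatLeft <= $coord[0] && $repeatRight >= $coord[0])
        || ($repeatLeft >= $coord[0] && $repeatLeft  <= $coord[-1]) ) {
        push @hits, "$repeatLeft-$repeatRight";   # record the overlapping repeat
    }
}
print "overlapping repeats: @hits\n";
```

Writing the hits to a file instead of printing them would only need an output filehandle in place of the final print.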

1 comment:

Nat said...

I've been looking for a way to do this for a while. Do you have a complete Perl script example that loads in two GFF files, extracts all of the overlaps between the features in them, and then stores and writes them to a file?

I'm completely new to Perl, so seeing an example would be a massive help.