genomics: February 2009

Thursday, February 19, 2009

Cool perl tips

While dealing with very huge files, it is often necessary not to read the entire file into an array. This will be a huge RAM cost to the machine. Iterating through a file handle is often a easy solution. But often we need to go back and forth the file as we find a match. The easiest way to achieve this will be to use perl 'seek' function.

while(){

some pattern..
seek(FH, - length($_), 1); # will point the File handle 
# one line back

}

NOTE: Just reading one line of a file at a time can be done by simply;

$_=; # where FH stands for file handle.

Next call to $_=; will read the next line. This is a great way to read

2 files together especially if they both are representing sequence and qual files

and the number of lines and values are in same order in both the files. Simply iterate

one file with a while loop and call $_= inside the while loop for the second

file.

Similarly, Redo operator can also be explored to move back in loop.

Searching patterns in 2 data files:

In bioinformatics work, it is often necessary to search a list of names from a file from another file, that either has a list of sequences or blast output or anything else. So, the trick to reduce the search time will be keep the larger file as the outer loop(mostly as a while loop), the inside loop could be a while loop or a for loop, depending on the file size. If there is a for loop for the inner loop, then it is possible to get rid of the element that has just had a match to the outer element. This way with every iteration, the inner loop gets shorter and the time spent in searching gets shorter. In perl it is quite easy to remove an array element by simply using a delete function.

delete($array[index]) will do the job. But just delete removes the element while leaving $array[index] space empty. In order to remove the element along with index use splice function instead.

splice(@array,index,offset) --> will remove offset number of elements and indices starting from index. i.e; splice(@array,2,7) will remove 7 indices starting from 2.

The whole construct may look something like this:

while(){

if(/Query= (\S+)/){

$flag = 0;

for(my $i; $i<=$#arr;$i++){

my @tmp = split(/\t/,$arr[$i]);

if($1 eq $tmp[0]){

print "$_";

$flag = 1;

delete($arr[$i]); # or splice(@arr,$i,1);

last;

}

}

next;

}

if($flag){

print "$_";

}


}

Passing Array by Reference
If you have a subroutine and you need to pass more than one array, then don't pass by value. If passed by value then the list is flattened and the first and second element of the array is passed as 2 arrays. Example: @arr1 = [1,2,3];
@arr2 = [6,7,8];
func(@arr1,@arr2);
will print: Array1 is 1 and array2 is 2
sub func
{
my @a = shift;
my @b = shift;
print "Array1 is @a and array2 is @b\n";
}
In order to get the values intact pass by reference:
func(\@arr1,\@arr2);
in
sub func{
$a = shift;
$b = shift;
@arr1 = @$a; # de-referencing array
@arr2 = @$b;
Here is a time and space saver if you want to sort a hash by value and iterate over the sorted list:
# sorting tmp hash by value
foreach my $sorted(sort {$tmp{$b} <=> $tmp{$a}} keys %tmp){
my %geneHash;
        for my $gene(keys %{$HOH1{$sorted}}){
        my @coord = sort{$a <=> $b} @{ $HOH1{$sorted}{$gene}{'pos'}};
        my $strand1 = $HOH1{$sorted}{$gene}{'strand'};
       $geneHash{$gene} = $coord[0];
}
foreach my $sortedgene(sort {$geneHash{$a} <=> $geneHash{$b}} keys %geneHash ){
     print FH "$sortedgene\t$serielNumber\n";
    print FH1 ">
    $serielNumber\n";
    print FH1 "$transcript{$sortedgene}\n";
`sed 's/\t$sortedgene/$serielNumber/' output.gff`;
   $serielNumber++;
}
}

# One variation if you want to sort the sequence file in descending order of their length. step -1:
Read fasta file into a hash called as %hash
# sort hash by the length of the value
foreach my $scaffold(sort {length($hash{$b}) <=> length($hash{$a}) } keys %hash) {
print ">scaffold_$num\n$hash{$scaffold}\n";
print FH "$scaffold\tscaffold_$num\t".length($hash{$scaffold})."\n"; $num++;
}
To make it to the ascending order simply change this to:
foreach my $scaffold(sort {length($hash{$a}) <=> length($hash{$b}) } keys %hash)

Thursday, February 12, 2009

Finding overlap


In sequence analysis, it is a common practice to find overlaps between 2 sets of features. This may be a coding sequence with repeats or coding sequences from 2 different prediction programs or 2 different coding sequences and so on. If data is present in gff files and are read into hashes with gene being the key holding the positions in an array, then the comparison can be made the following ways:

Lets find overlap between a gene and a repeat element.

geneLeft, repeatLeft be the first element of gene and repeat array respectively and geneRight, repeatRight be the last element of the gene and repeat array respectively.

#if repeat elements are occurring towards the left side of the gene element:
# and overlaps at the left hand side
repeatLeft <= geneLeft and repeatRight >= geneLeft # Note here Not repeatRight <= geneRight -> This condition will be true even if the repeat is occuring way upstream of the gene element and even if they don't overlap

# If repeat occurs at the right han side of the gene and overlaps at the right side

repeatLeft <= geneRight and repeatRight >= geneRight

# If repeat element lies within the gene element

geneLeft <= repeatLeft and repeatRight <= geneRight

#If gene element occurs inside the repeat element

repeatLeft <= geneLeft and geneRight <= repeatRight

If any of these conditions become true then it will find overlap(Join them with 'or')

 if( ($coord1[$i] <= $coord[0] && $coord1[$i+1] >= $coord[-1]) ||
 ($coord1[$i] >= $coord[0] && $coord1[$i+1] <= $coord[-1]) || 
 ($coord1[$i] <= $coord[0] && $coord1[$i+1]  >= $coord[0]) || 
 ($coord1[$i] >= $coord[0] && $coord1[$i] <= $coord[-1] )){

Tuesday, February 10, 2009

Using multidimensional hashes in storing sequence objects


One of the most important functions of multidimensional perl hashes has been their usage in reading sequence related data. The most common data type one reads from a genome sequencing prospective is a gff file. A gff file has the following attributes that are most likely captured in a data structure:

1. Name of the scaffold or chromosome on which the gene resides.
2. Name of the gene
3. Gene positions
4. Strand

Other attributes are less important, so lets ignore them at the moment.

So, now a scaffold will have multiple genes and a gene will have multiple locations. The ideal data structure that can store all these information will be a has of hashes and one of the hash holds arrays. In other words:

scaffold_1 => {
               gene1 => {  pos => [positions], strand => + or minus},
               gene2 => {  pos => [positions], strand => + or minus},
               .......
              },

scaffold_2 => {   
              gene1 => {pos => [positions], strand => + or minus},
              gene2 => {pos => [positions], strand => + or minus},
               .......
              },
.....


Populating this hash from gff file:

sub test_gff{

my $fileName = shift;
my %HOH;

open PRED, $fileName or die "Can't open file $!\n";

while(){

    chomp;
    my @line = split(/\t/,$_);
      
    push(@{$HOH{$line[0]}->{$line[8]}->{'pos'}}, $line[3], $line[4]);
    $HOH{$line[0]}->{line[8]}->{'strand'}, $line[5];

}

close PRED;
return %HOH;

}


Now Printing the values:


my %HOH1 = &utility::test_gff($ARGV[0]); # Read the file


for my $scaffold(keys %HOH1){


    for my $gene(keys %{$HOH1{$scaffold}}){

        my @coord =  @{ $HOH1{$scaffold}{$gene}{'pos'}};
        my $strand = $HOH1{$scaffold}{$gene}{'strand'};
       

        print "$scaffold -> $gene -> @coord -> $strand\n";


    }

}

# If you want to sort the co-ordinates,Then that can be done as below:

 my @coord = sort{$a <=> $b} @{ $HOH1{$scaffold}{$gene}{'pos'}};

# Values will be printed as :

Contig632 -> 713969 -> 1687 3391 -> +
Contig1122 -> 707832 -> 3109 3183 3194 3200 3200 3232 3232 3251 -> +
Contig1294 -> 714344 -> 3744 4473 -> +
Contig1294 -> 714343 -> 24 1652 -> -
Contig772 -> 707541 -> 14 64 126 147 147 182 -> +
Contig772 -> 713832 -> 4 81 126 147 147 194 -> +
Contig772 -> 711369 -> 1 64 130 147 147 204 -> +

Cool...

Adding one more dimension:

Mostly in draft genome sequences the genome assembly is not in great shape, so most often same genes can be found in multiple locations(Not necessarily through gene duplication). Sometimes the same gene could be present in the same scaffold in more than one location. In that case to stop over writing or appending the positions for the same gene, another check can be implemented, that is the gene serial number. So, you can just add position as another attribute with seriel number being the key. 

    push(@{$HOH{$line[0]}->{$line[8]}->{$serialNum}->{'pos'}}, $line[3], $line[4]);
    $HOH{$line[0]}->{line[8]}->{$serialNum}->{'strand'}, $line[5];

And while iterating add one more for loop:

for my $scaffold(keys %HOH1){


    for my $gene(keys %{$HOH1{$scaffold}}){

         for my $serialNum(keys %{%HOH1{$scaffold}{$gene}}){

          my @coord =  @{ $HOH1{$scaffold}{$gene}{$serialNum}{'pos'}};
          my $strand = $HOH1{$scaffold}{$gene}{$serialNum}{'strand'};
       

          print "$scaffold -> $gene -> @coord -> $strand\n";


    }

}

The ease at which we can use perl data structures make it so desirable. The coder does not need to worry about memory allocation stuff. Earlier I used to use linked list in C to store sequence data. Then I will be stuck in debugging stuff a lot. Now after I discovered this, I have almost stopped coding in C. Happy perl coding..