Thursday, February 19, 2009

Cool perl tips

While dealing with very huge files, it is often necessary not to read the entire file into an array. This will be a huge RAM cost to the machine. Iterating through a file handle is often a easy solution. But often we need to go back and forth the file as we find a match. The easiest way to achieve this will be to use perl 'seek' function.

while(){

some pattern..
seek(FH, - length($_), 1); # will point the File handle 
# one line back

} 
NOTE: Just reading one line of a file at a time can be done by simply;
$_=; # where FH stands for file handle.
Next call to $_=; will read the next line. This is a great way to read
2 files together especially if they both are representing sequence and qual files
and the number of lines and values are in same order in both the files. Simply iterate
one file with a while loop and call $_= inside the while loop for the second
file.  


Similarly, Redo operator can also be explored to move back in loop.

Searching patterns in 2 data files:

In bioinformatics work, it is often necessary to search a list of names from a file from another file, that either has a list of sequences or blast output or anything else. So, the trick to reduce the search time will be keep the larger file as the outer loop(mostly as a while loop), the inside loop could be a while loop or a for loop, depending on the file size. If there is a for loop for the inner loop, then it is possible to get rid of the element that has just had a match to the outer element. This way with every iteration, the inner loop gets shorter and the time spent in searching gets shorter. In perl it is quite easy to remove an array element by simply using a delete function.

delete($array[index]) will do the job. But just delete removes the element while leaving $array[index] space empty. In order to remove the element along with index use splice function instead.

splice(@array,index,offset) --> will remove offset number of elements and indices starting from index. i.e; splice(@array,2,7) will remove 7 indices starting from 2.

The whole construct may look something like this:
while(){

if(/Query= (\S+)/){

$flag = 0;

for(my $i; $i<=$#arr;$i++){

my @tmp = split(/\t/,$arr[$i]);

if($1 eq $tmp[0]){

print "$_";

$flag = 1;

delete($arr[$i]); # or splice(@arr,$i,1);

last;

}

}

next;

}

if($flag){

print "$_";

}


}
Passing Array by Reference
If you have a subroutine and you need to pass more than one array, then don't pass by value. If passed by value then the list is flattened and the first and second element of the array is passed as 2 arrays. Example: @arr1 = [1,2,3];
@arr2 = [6,7,8];
func(@arr1,@arr2);
will print: Array1 is 1 and array2 is 2
sub func
{
my @a = shift;
my @b = shift;
print "Array1 is @a and array2 is @b\n";
}
In order to get the values intact pass by reference:
func(\@arr1,\@arr2);
in
sub func{
$a = shift;
$b = shift;
@arr1 = @$a; # de-referencing array
@arr2 = @$b;
Here is a time and space saver if you want to sort a hash by value and iterate over the sorted list:
# sorting tmp hash by value
foreach my $sorted(sort {$tmp{$b} <=> $tmp{$a}} keys %tmp){
my %geneHash;
        for my $gene(keys %{$HOH1{$sorted}}){
        my @coord = sort{$a <=> $b} @{ $HOH1{$sorted}{$gene}{'pos'}};
        my $strand1 = $HOH1{$sorted}{$gene}{'strand'};
       $geneHash{$gene} = $coord[0];
}
foreach my $sortedgene(sort {$geneHash{$a} <=> $geneHash{$b}} keys %geneHash ){
     print FH "$sortedgene\t$serielNumber\n";
    print FH1 ">
    $serielNumber\n";
    print FH1 "$transcript{$sortedgene}\n";
  `sed 's/\t$sortedgene/$serielNumber/' output.gff`;
   $serielNumber++;
}
}

# One variation if you want to sort the sequence file in descending order of their length. step -1:
Read fasta file into a hash called as %hash
 # sort hash by the length of the value
foreach my $scaffold(sort {length($hash{$b}) <=> length($hash{$a}) } keys %hash) {
print ">scaffold_$num\n$hash{$scaffold}\n";
print FH "$scaffold\tscaffold_$num\t".length($hash{$scaffold})."\n"; $num++;
 }
To make it to the ascending order simply change this to:
foreach my $scaffold(sort {length($hash{$a}) <=> length($hash{$b}) } keys %hash)

No comments: