Monday, March 02, 2009

Implementing binary search in searching a list of sequences from a complex file

In sequence analysis work, it becomes imperative to search a list of sequences from a sequence file or a file having complex sets of data. One such complex dataset I can think of is an ACE file. An ACE file has multiple components:

AS contigs reads
CO contig_name bases reads segments compl (CAP3: segments=0)
sequence
BQ base_qualities
AF read1 compl padded_start_consensus (negatives meaning?)
AF read2 ..
BS segments
RD read1 bases info_items info_tags (latter two set to 0 by CAP3)
sequence
QA read1 qual_start qual_end align_start align_end
DS (phred header? left empty by CAP3)
RD read2 ...


Suppose you have a list of sequences in a file and you got to search which sequence belongs to which Contig from the ACE file. Then probably, you have to read the ACE file first as an hash of arrays, something like this:


while(){

if(/^CO Contig/){

$key = $_;
$flag = 1;
}

if($flag && /^[ATGC][ATGC][ATGC]/){

chomp;
$seq.=$_;
}

if(/^BQ/){

push(@arr,$seq);
$seq = '';

}

if(/^AF\s+(\S+)/){

push(@arr,$1);

}

if(/^BS/ && $flag){

$HOA{$key} = [ @arr ];
$flag = 0;
@arr = ();

}


}



After reading ACE into hash of arrays, need to iterate over the hash and get the sequence of arrays searched against each entry in the file. If the sequence file size is hugh, its good to implement a binary search(O(log2N)) instead of a serial search(O(N)). But knowing the fact that binary search can be implemented on a sorted array or list, it is more probable to sort the list file instead of sorting the array of hash. Because once sorted the list file stays sorted and does not need any sorting.


Implementation:

my @list = sort(@list); # sorting the sequence names in list file

# The following implementation is a linear search and takes
# much longer time
foreach my $key(keys %HOA){

my @arr = @{ $HOA{$key}} ;

my @found;

for(my $i=1;$i<=$#arr;$i++){

for(my $j=0;$j<=$#list;$j++){
chomp($list[$j]);

if($arr[$i] =~ $list[$j]){

push(@found, $arr[$i]);

splice(@list,$j,1); # removing the
# element from the list

last;
}
}
}


Binary search:

for(my $i=1;$i<=$#arr;$i++){

push(@found, bsearch($arr[$i],\@list);

}


sub bsearch {
my ($x, $a) = @_; # search for x in array a
my ($l, $u) = (0, @$a - 1); # lower, upper end of search interval
my $i; # index of probe
while ($l <= $u) {
$i = int(($l + $u)/2);
print($i, "\n");
if ($a->[$i] < $x) {
$l = $i+1;
}
elsif ($a->[$i] > $x) {
$u = $i-1;
}
else {
return $i; # found
}
}
return -1; # not found
}

No comments: