Thursday, July 16, 2009

Finding the difference between two files

In bioinformatics work, one often needs to find the difference between two sequence files, or the union and intersection of two datasets. This type of operation requires comparing one file against the other.

The most common way to do this job is to loop over the first file and, for each record, loop over the second file looking for a match; when a match is found, terminate the inner loop, reset the file pointer, and move on to the next element of the outer loop. This operation takes O(M * N) time. If M and N are both huge, the time requirement is humongous. A binary search would reduce the lookup cost to O(M * log2 N), but the price is paid again in sorting the second file, and if the files are non-numeric the additional burden of converting strings to numbers adds to the price tag. Recently I have found that looking elements up in a hash table, instead of looping through the second file again, is a great substitute for binary search. The files I was searching against each had 2 million records, so a serial search would rewind the second file 2 million times and, in the worst case, perform 2 million * 2 million = 4 * 10^12 word comparisons. The serial version of the program had been running for 2 days and was still going; the hash-table version took just a few minutes. Check this out.
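For completeness, here is a rough sketch of that binary-search alternative (untested, shown only to illustrate the idea): the second file is read into memory and sorted once, and each line of the first file is then located with an O(log2 N) comparison-based search.

Sketch of the binary-search version:

#!/usr/bin/perl -w
# Sketch only: print lines of file 1 (ARGV[0]) that are absent from file 2 (ARGV[1]),
# using a binary search over the sorted contents of file 2.
use strict;

open my $fh2, '<', $ARGV[1] or die "Can't open $ARGV[1]: $!\n";
chomp(my @lines = <$fh2>);
close $fh2;
my @sorted = sort @lines;          # pay the O(N log2 N) sorting cost once, up front

open my $fh1, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
while (my $line = <$fh1>) {
    chomp(my $key = $line);
    print "$key\n" unless binary_search(\@sorted, $key);
}
close $fh1;

# Classic binary search over a sorted array of strings; returns 1 if found, 0 otherwise.
sub binary_search {
    my ($array, $target) = @_;
    my ($lo, $hi) = (0, $#{$array});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $array->[$mid] cmp $target;
        return 1 if $cmp == 0;
        if ($cmp < 0) { $lo = $mid + 1; } else { $hi = $mid - 1; }
    }
    return 0;
}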

Code for serial search:

#!/usr/bin/perl -w

# This script will find the difference between two list files
# Here we find the difference FH1 - FH2 (lines of file 1 not present in file 2)

open FH1, $ARGV[0] or die "Can't open file for reading: $!\n";
open FH2, $ARGV[1] or die "Can't open file for reading: $!\n";
my $flag = 1;

while (<FH1>) {
    $flag = 1;
    my $toBeSubtracted = $_;
    # Scan the whole of file 2 for a match
    while (<FH2>) {
        if ($_ eq $toBeSubtracted) {
            $flag = 0;
            last;
        }
    }
    # No match anywhere in file 2, so this line belongs to the difference
    if ($flag == 1) {
        print $toBeSubtracted;
    }
    # Rewind file 2 before processing the next line of file 1
    seek(FH2, 0, 0);
}


Code using hash table:

#!/usr/bin/perl -w
# Subtract FH - FH1 (lines of file 1 not present in file 2)

open FH,  $ARGV[0] or die "Can't open the file to be subtracted from: $!\n";
open FH1, $ARGV[1] or die "Can't open the file to subtract: $!\n";

my %hash1;
my %hash2;

# Load every line of file 1 as a hash key
while (<FH>) {
    my $tmp = $_;
    chomp($tmp);
    $hash1{$tmp} = 1;
}
close(FH);

# Load every line of file 2 as a hash key, keeping only the first tab-delimited column
while (<FH1>) {
    my $tmp = $_;
    chomp($tmp);
    $tmp =~ s/\t.*//;
    $hash2{$tmp} = 1;
}
close(FH1);

# Any key of file 1 that is not a key of file 2 belongs to the difference
foreach my $key (keys %hash1) {
    if (exists($hash2{$key})) {
        # do nothing: present in both files
    }
    else {
        print "$key\n";
    }
}
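One further tweak worth noting: strictly speaking only the file being subtracted needs to sit in a hash. The variant sketched below (same idea as the script above, just rearranged) builds a single %seen hash from the second file and then streams the first file line by line, which roughly halves the memory use and keeps the output in the original file order.

#!/usr/bin/perl -w
# Streaming variant: hash only the second file, stream the first.

open FH1, $ARGV[1] or die "Can't open the file to subtract: $!\n";
my %seen;
while (<FH1>) {
    chomp(my $key = $_);
    $key =~ s/\t.*//;    # keep only the first tab-delimited column, as above
    $seen{$key} = 1;
}
close(FH1);

open FH, $ARGV[0] or die "Can't open the file to be subtracted from: $!\n";
while (<FH>) {
    chomp(my $key = $_);
    print "$key\n" unless exists $seen{$key};
}
close(FH);

Either way each lookup is O(1) on average, so the whole job is roughly O(M + N) instead of O(M * N).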

3 comments:

DNA said...

Good to see your activity here.
Just to add to your script repo, here is a script which calculates the union, intersection and difference between two files. It is not mine, but I have been using it for some time.

----------

#! /bin/sh
# Union, intersection or minus of two sets, where each set is a file
# and an element of the set is a line in the file. WH 1994 Dec, 2001 Jan



NAME=`basename $0`

# usage is called below but was missing from the paste; minimal version:
usage () {
    echo >&2 "Usage: $NAME file1 u|i|- file2"
    exit 1
}

if [ $# != 3 ] ; then echo >&2 "Need three arguments" ; usage ; fi
FILE1="$1"
FILE2="$3"
if [ ! -r "$FILE1" ] ; then
    echo >&2 "Error: File $FILE1 does not exist" ; usage
fi
if [ ! -r "$FILE2" ] ; then
    echo >&2 "Error: File $FILE2 does not exist" ; usage
fi
case "$2" in
    u|i|-) OP="$2" ;;
    *) echo >&2 "Error: Bad option" ; usage ;;
esac

UNION=$NAME.u.$$
DIFF=$NAME.d.$$

trap '/bin/rm "$UNION" "$DIFF" 2>/dev/null' 0
trap '/bin/rm "$UNION" "$DIFF" 2>/dev/null; exit 1' 1 2 15

## generate $FILE1 UNION $FILE2
## uniq: remove duplicate = UNION
cat "$FILE1" "$FILE2" | sort | uniq > "$UNION"

if [ "$OP" = u ] ; then cat "$UNION" ; exit 0; fi

## generate $FILE1 diff $FILE2 as $FILE2 symdiff ( $FILE1 union $FILE2 )
## uniq -u: nonrepeated lines = symdiff
cat "$FILE2" "$UNION" | sort | uniq -u > "$DIFF"

if [ "$OP" = - ] ; then cat "$DIFF" ; exit 0; fi

## generate intersection: $FILE1 symdiff ( $FILE1 diff $FILE2 )
cat "$FILE1" "$DIFF" | sort | uniq -u

Sucheta Tripathy PI @ Computational Genomics Group at IICB, Kolkata said...

Thanks for your comments!! I am wondering about its performance. How long does it take to execute for large files?

Sucheta Tripathy PI @ Computational Genomics Group at IICB, Kolkata said...

Hi Wilma,

I am really impressed with the VADLO search engine. This is completely new information for me. Please keep posting new information.

Best

Sucheta