genomics: July 2009

Tuesday, July 21, 2009

ncftp and wget

As a bioinformatics researcher, I often have to download large data sets from various servers, most frequently NCBI and EBI. While most raw data can be found at NCBI, EBI tends to host more heavily curated data. One nightmare I often face is updating the InterProScan database. The data files are about 9 GB and take a long time to download, often exceeding the data limit of the server that fetches them. Most irritating of all is when the transfer times out or fails with "the message list is too long". The best way I have found to handle this trouble is to use "wget" or "ncftp".

Wget is simple to use, reliable, and robust. It is designed to cope with slow or unstable connections, which helps if you are working from home over an unreliable network. If a download does not complete due to a network problem, Wget will automatically try to continue the download from where it left off, and repeat this until the whole file has been retrieved. It was one of the first clients to make use of the then-new Range HTTP header to support this feature.

If you are using HTTPS with wget, the command will look something like this (--no-check-certificate skips certificate validation, so use it only for servers you trust):
wget --no-check-certificate https://login:passwd@site-address/path
or, for FTP:
wget ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz
More information on the available options can be found with "man wget".
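
For big files such as the InterProScan data, the resume behaviour can also be requested explicitly. This is only a minimal sketch, assuming GNU wget; the function name fetch_with_resume and the retry numbers are illustrative choices, not something from my actual setup.

```shell
# Hypothetical helper: download one URL, resuming a partial file if present.
fetch_with_resume() {
    url=$1
    # -c             : continue a partially downloaded file
    # --tries=20     : retry up to 20 times on transient errors
    # --waitretry=30 : wait up to 30 seconds between retries
    wget -c --tries=20 --waitretry=30 "$url"
}

# Usage (not run here because it needs network access):
# fetch_with_resume ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz
```

Rerunning the same command after an interrupted session picks up the partial file instead of starting from zero.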

While wget is really cool, ncftp is a dedicated FTP client that is sometimes a better choice than other tools if you must use FTP for the download. A typical ncftp command could be:

ncftp -u login -p pass ftp://ftp.hostname.edu

Then use the get command at the ncftp prompt to fetch the files of interest.

Thursday, July 16, 2009

Finding difference between 2 files

In bioinformatics work, one often needs to find the difference between 2 sequence files, or the union and intersection of 2 datasets. Operations like these require comparing one file against the other.

The most common way to do this job is with nested loops: for each record in the first file, scan the second file for a match; if one is found, terminate the inner loop, reset the file pointer, and move on to the next record in the outer loop. This takes O(M * N) time for files of M and N records, which becomes humongous when M and N are large. A binary search can reduce this to O(M * log2 N), but the price is paid again in sorting the second file first, and with non-numeric records every comparison is a string comparison, which adds to the price tag. Recently I have found that looking each record up in a hash table, instead of looping over the second file again, is a great substitute for binary search: building the table and doing the lookups takes roughly O(M + N) time on average.

The files I was comparing had 2 million records each. A serial search therefore needs up to 2 million * 2 million = 4 * 10^12 comparisons in the worst case, plus 2 million file resets. The serial version ran for 2 days and was still running. The hash version took just a few minutes to run. Check this out.

Code for serial search:

#!/usr/bin/perl -w
use strict;

# This script finds the difference between two list files:
# it prints every line of FH1 that does not occur in FH2 (FH1 - FH2).

open(FH1, '<', $ARGV[0]) or die "Can't open $ARGV[0] for reading: $!\n";
open(FH2, '<', $ARGV[1]) or die "Can't open $ARGV[1] for reading: $!\n";

while (my $toBeSubtracted = <FH1>) {
    my $flag = 1;                  # 1 = no match found in FH2 yet
    while (my $line = <FH2>) {
        if ($line eq $toBeSubtracted) {
            $flag = 0;
            last;
        }
    }
    if ($flag == 1) {
        print $toBeSubtracted;
    }
    seek(FH2, 0, 0);               # rewind FH2 for the next outer iteration
}

Code using hash table:
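#!/usr/bin/perl -w
use strict;

# Subtract FH - FH1: print every record of the first file
# that does not occur in the second file.

open(FH,  '<', $ARGV[0]) or die "Can't open the file to subtract from: $!\n";
open(FH1, '<', $ARGV[1]) or die "Can't open the file to be subtracted: $!\n";

my %hash1;
my %hash2;

# Load the records of the first file as hash keys.
while (my $tmp = <FH>) {
    chomp($tmp);
    $hash1{$tmp} = 1;
}
close(FH);

# Load the second file, keeping only the text before the first tab.
while (my $tmp = <FH1>) {
    chomp($tmp);
    $tmp =~ s/\t.*//;
    $hash2{$tmp} = 1;
}
close(FH1);

# Print the keys found only in the first file (hash order is arbitrary).
foreach my $key (keys %hash1) {
    print "$key\n" unless exists $hash2{$key};
}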
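
The same hash-table idea also fits in a single awk command, since awk arrays are hash tables. This is a rough sketch rather than a drop-in replacement for the Perl script: it assumes one record per line, it does not strip anything after a tab in the second file, and the file names a.txt and b.txt are made up for the demo.

```shell
# Create two small illustrative input files in a temporary directory.
workdir=$(mktemp -d)
printf 'seq1\nseq2\nseq3\nseq4\n' > "$workdir/a.txt"
printf 'seq2\nseq4\n'             > "$workdir/b.txt"

# First file read (NR == FNR): remember every line of b.txt as a hash key.
# Second file read: print the lines of a.txt that are not keys in the hash.
only_in_a=$(awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' \
    "$workdir/b.txt" "$workdir/a.txt")

printf '%s\n' "$only_in_a"   # prints seq1 and seq3, one per line
rm -r "$workdir"
```

Unlike the Perl version, this keeps the records in their original order, because it streams the first list instead of iterating over hash keys. One caveat: the NR == FNR trick misbehaves if the file to be subtracted is empty.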

Cool unix tips

I work with large-scale genome sequences, so I often have to handle huge files. These files often contain unwanted characters and formatting that I try to correct with vim's substitute command. Although it is very powerful, substitution on huge files in vim is very time consuming!

A very powerful alternative to vim substitution is the sed (stream editor) utility. It is often underutilized because of a lack of good documentation, yet some very powerful file editing can be done with sed.

Simple Substitution:

sed 's/\/usr\/local\/bin/\/common\/bin/' <old >new # will replace the first /usr/local/bin on each line by /common/bin

Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underline instead of a slash as a delimiter:

sed 's_/usr/local/bin_/common/bin_' <old >new

Some people use colons:

sed 's:/usr/local/bin:/common/bin:' <old >new

Others use the "|" character.

sed 's|/usr/local/bin|/common/bin|' <old >new

Reusing the matched text in the replacement:

If a file contains something like '123 abc' and you want to turn it into '123 123 abc', do the following:
sed 's/[0-9]*/& &/' < old > new # Here '&' stands for the text matched by the pattern
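
A quick way to convince yourself is to feed sed a single line on standard input instead of a file. The variable name doubled is just for the demo.

```shell
# '&' in the replacement stands for whatever the pattern matched (here 123).
doubled=$(printf '123 abc\n' | sed 's/[0-9]*/& &/')
printf '%s\n' "$doubled"   # prints: 123 123 abc
```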

Meaning of \1, \2, up to \9

These refer to numbered sub-patterns: each part of the pattern enclosed in \( and \) is remembered, and \1 through \9 recall them in order. Altogether sed can remember up to 9 sub-patterns.
Suppose you would like to replace "1234 abcd" with "abcd 1234" in a file; then do the following:

sed 's/\([0-9]*\) \([a-z]*\)/\2 \1/'

Notice the space between the two sub-patterns.
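
Here is the swap tried on one line of input, as a small sanity check. The variable name swapped is only for illustration.

```shell
# \( \) capture the digits and the letters; \2 \1 writes them back swapped.
swapped=$(printf '1234 abcd\n' | sed 's/\([0-9]*\) \([a-z]*\)/\2 \1/')
printf '%s\n' "$swapped"   # prints: abcd 1234
```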

To substitute only on certain lines, give an address range, for example lines 101 to 532, or line 101 to the end of the file:
sed '101,532 s/A/a/' or sed '101,$ s/A/a/'

Deleting lines beginning with # is easy:

sed '/^#/ d'

This one will remove comments and blank lines:

sed -e 's/#.*//' -e '/^$/ d'

This one will also remove all tabs and blanks before the end of the line (^I stands for a literal tab character, typed as Ctrl-I):

sed -e 's/#.*//' -e 's/[ ^I]*$//' -e '/^$/ d'
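
Typing a literal tab inside the brackets is fiddly, so here is a sketch that builds the tab with printf and runs the three expressions on some made-up input. The sample lines and the variable names tab and cleaned are my own for the demo.

```shell
# Hold a literal tab character in a variable so it can go inside [ ].
tab=$(printf '\t')

# Input: a comment line, a line with a trailing comment, an empty line,
# a line of blanks only, and a line ending in a tab.
cleaned=$(printf '# header comment\nkey = value   # trailing note\n\n   \nname = x\t\n' |
    sed -e 's/#.*//' -e "s/[ $tab]*\$//" -e '/^$/ d')

printf '%s\n' "$cleaned"   # prints "key = value" and "name = x"
```

The order of the expressions matters: comments are stripped first, then the trailing whitespace they leave behind, and only then are the now-empty lines deleted.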

Before I sign off, how do you tell whether your machine is running 64-bit or 32-bit Linux?

uname -m
i386 or i686 means it is 32-bit
x86_64 means it is 64-bit
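
If you want this check inside a script, a case statement does the job. This is only a sketch: the function name arch_bits is made up, and it knows only the x86 machine names mentioned above, so anything else is reported as unknown.

```shell
# Hypothetical helper: map a `uname -m` string to a word size.
arch_bits() {
    case "$1" in
        i386|i486|i586|i686) echo "32-bit" ;;
        x86_64)              echo "64-bit" ;;
        *)                   echo "unknown" ;;
    esac
}

# Report the word size of the machine this runs on.
arch_bits "$(uname -m)"
```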

Getting cpu info:

more /proc/cpuinfo

Some more generic Unix shell commands:

Rough count of the files on the whole system (the listing starts at /, the count also includes directory headers and blank lines, and the trailing & runs the job in the background): ls / -R | wc -l &
To find a process you are interested in: ps ax | grep httpd
To display a tree of processes: pstree
Calculator program in Unix: bc (it will wait for your input)

Cool Small Applications:
Chessboard:
for (( i = 1; i <= 9; i++ )) ### Outer for loop ###
do
    for (( j = 1 ; j <= 9; j++ )) ### Inner for loop ###
    do
        tot=`expr $i + $j`
        tmp=`expr $tot % 2`
        if [ $tmp -eq 0 ]; then
            echo -e -n "\033[47m "
        else
            echo -e -n "\033[40m "
        fi
    done
    echo -e -n "\033[40m" #### set back background colour to black
    echo "" #### print the new line ###
done

Use Cut and paste command

I find these 2 commands a kind of blessing. cut extracts columns from a flat file for you, and paste merges columns back together. There is one more command, join, that joins 2 files on a common field.

These utilities are extremely powerful when you are merging 2 or more data files containing, say, background-corrected or normalized data. Perl scripts are expensive in terms of file handling by comparison.
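
To make this concrete, here is a small sketch with two made-up tab-separated tables that share a gene identifier in the first column. The file names, gene names, and values are all invented for the demo.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')

# Two tab-separated tables, both sorted on the identifier in column 1.
printf 'geneA\t1.5\t0.2\ngeneB\t2.5\t0.4\n' > "$workdir/sample1.tsv"
printf 'geneA\t3.0\t0.1\ngeneB\t4.0\t0.3\n' > "$workdir/sample2.tsv"

# cut: keep only column 2 of each file.
cut -f2 "$workdir/sample1.tsv" > "$workdir/s1.col"
cut -f2 "$workdir/sample2.tsv" > "$workdir/s2.col"

# paste: glue the identifier and the two columns side by side.
pasted=$(cut -f1 "$workdir/sample1.tsv" | paste - "$workdir/s1.col" "$workdir/s2.col")

# join: merge whole rows on the common first field (inputs must be sorted).
joined=$(join -t "$tab" "$workdir/sample1.tsv" "$workdir/sample2.tsv")

printf '%s\n' "$pasted" "$joined"
rm -r "$workdir"
```

paste trusts that line N of every file belongs to the same record, while join actually matches on the key, so join is the safer choice when the files might not line up row for row.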

Using awk:

Did you know that Perl, the present-day text-processing language, borrows much of its syntax from Unix shell scripting tools, particularly awk?

awk is a powerful pattern matcher with syntax much like Perl's, or should I say Perl has syntax much like awk's.

awk '
BEGIN { actions }
/pattern/ { actions }
/pattern/ { actions }
END { actions }
' files

In the quoted area you can put any condition, such as:

awk '{ if ($5 !~ /[0-9]/ || NF != 7) { print "Line Number " NR " has error: " $0 } }' inputfile

This command checks whether column 5 of the input file contains no digit, or whether the line does not have exactly 7 fields. If either is true, it prints the offending line along with its line number; if there is no error, it prints nothing. NR is a built-in variable holding the current line (record) number, and NF holds the number of fields in the current line. By default awk splits fields on whitespace (runs of blanks and tabs). If your file uses a different separator, you can begin awk with:

awk '
BEGIN { FS =":"; }
/pattern/ { actions }
/pattern/ { actions }
END { actions }
' files
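
Putting the two ideas together, here is a sketch of the field check run on colon-separated input. The three sample lines are invented: the first is valid, the second has a non-numeric column 5, and the third has too few fields.

```shell
report=$(printf 'a:b:c:d:5:f:g\na:b:c:d:x:f:g\na:b:c\n' |
    awk 'BEGIN { FS = ":" }
         $5 !~ /[0-9]/ || NF != 7 { print "Line Number " NR " has error: " $0 }')

printf '%s\n' "$report"
# prints:
# Line Number 2 has error: a:b:c:d:x:f:g
# Line Number 3 has error: a:b:c
```

Here the condition is written as an awk pattern in front of the action instead of an if inside it; the two forms behave the same way.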

For example, if you are scanning a file for a particular pattern and want to print each line that contains it:

awk ' BEGIN{FS = ":"; }
/pattern/ {i++; print "Line Number : " NR " : " $0}
END {printf("Number of lines having pattern %d\n",i)}
' files

This command will print something like:

Line Number : 1 : suydusydu 88 sllsilsdi
Line Number : 5 : suyfdusydu 88 sllsilsdi
Line Number : 6 : suycdusydu 88 sllsilsdi
Line Number : 7 : suyvdusydu 88 sllsilsdi
Line Number : 8 : suygdusydu 88 sllsilsdi
Line Number : 9 : suydugsydu 88 sllsilsdi
Line Number : 10 : suydusggydu 88 sllsilsdi
Line Number : 11 : suydusygdu 88 sllsilsdi
Line Number : 12 : suydusgydu 88 sllsilsdi
Line Number : 13 : suydusffydu uiuiu sllsilsdi
Line Number : 14 : suydusssydu 88 sllsilsdi
Line Number : 15 : suydussydu 88 sllsilsdi
Line Number : 16 : suydussssydu 88 sllsilsdi
Line Number : 17 : suydusssssydu 88 sllsilsdi
Line Number : 18 : suydusyssadu 88 sllsilsdi
Line Number : 19 : suydusyqqdu 88 sllsilsdi
Line Number : 20 : suydusydffu 88 sllsilsdi
Number of lines having pattern 17

Wednesday, July 01, 2009

DNA Methylation AIDS HIV Latency

Current drug therapies inhibit replication of the human immunodeficiency virus (HIV). In patients undergoing these therapies, the amount of HIV is reduced to an undetectable level and HIV-related disease subsides. However, stopping antiviral drug therapy results in the quick return of HIV and of disease. One reason for this is latently infected cells, in which virus replication is temporarily halted. When drug therapy is stopped, virus from these latently infected cells can resume infection and spread to other cells in the patient, resulting in the return of disease.

One of the world's most elusive viruses is an expert at maintaining a low profile, lying dormant in CD4+ cells even during highly active anti-retroviral treatment (HAART). A team of American and Swedish researchers found that the virus might be using DNA methylation as a cloak.

Hypermethylated CpG islands flanking the HIV transcription start site attract methyl-CpG binding domain protein 2 (MBD2), an endogenous host protein, which in turn recruits histone deacetylases and other enzymes to shut down transcription.

Using 5-aza-deoxycytidine (Aza-CdR) to strip the DNA methylation at these islands allowed the researchers to reverse the transcriptional block and bring HIV right out of hiding, indicating that Aza-CdR might be a great complement to other antiviral therapies. So there’s hope for flushing out the reservoir, clearing patients of HIV-1, and letting them live a drug-free life. Wouldn’t Nancy Reagan be proud?

See all the HAARTening details in PLoS Pathogens, June 2009.