Thursday, July 16, 2009

Cool unix tips

I work with large-scale genome sequences, so I often have to handle huge files. These files frequently contain unwanted characters and formatting that I try to correct with vim's substitute command. Although very powerful, substitution on huge files in vim is very time consuming!

A very powerful alternative to vim substitution is the sed (stream editor) utility. It is often underutilized because of a lack of proper documentation, but some very powerful file editing can be done with sed.

Simple Substitution:

sed 's/\/usr\/local\/bin/\/common\/bin/' < old > new # will replace /usr/local/bin with /common/bin

Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underscore instead of a slash as the delimiter:

sed 's_/usr/local/bin_/common/bin_' < old > new

Some people use colons:

sed 's:/usr/local/bin:/common/bin:' < old > new

Others use the "|" character.

sed 's|/usr/local/bin|/common/bin|' < old > new
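
Whichever delimiter you pick, the result is the same; a quick way to convince yourself (the path here is just an example) is to pipe a sample string through sed:

echo "/usr/local/bin/blastall" | sed 's|/usr/local/bin|/common/bin|'
# prints: /common/bin/blastall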

Replacing a pattern with something else:

In a file, if a line contains something like '123 abc' and you want to make it '123 123 abc', then do the following:
sed 's/[0-9]*/& &/' < old > new # here '&' stands for whatever the pattern matched
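
A quick way to try this without touching any file (the sample text is just an example):

echo "123 abc" | sed 's/[0-9]*/& &/'
# prints: 123 123 abc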

Meaning of \1, \2 and so on up to \9

These simply refer to the numbered groups in the pattern; sed can remember up to 9 of them.
Suppose in a file you would like to replace "1234 abcd" with "abcd 1234"; then do the following:

sed 's/\([0-9]*\) \([a-z]*\)/\2 \1/' < old > new # notice the space between the two groups
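
Again, this is easy to test straight on the command line (the sample text is made up):

echo "1234 abcd" | sed 's/\([0-9]*\) \([a-z]*\)/\2 \1/'
# prints: abcd 1234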

To substitute only on certain lines, give sed a line range:
sed '101,532 s/A/a/'   # lines 101 to 532 only
sed '101,$ s/A/a/'     # line 101 through to the last line
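
A tiny pipe makes the range restriction visible (four dummy lines, substituting only on lines 2 and 3):

printf 'A\nA\nA\nA\n' | sed '2,3 s/A/a/'
# prints: A  a  a  A  (one per line)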

Deleting lines beginning with # is easy:

sed '/^#/ d'

This one will remove comments and blank lines:

sed -e 's/#.*//' -e '/^$/ d'

This one also strips any trailing spaces and tabs before deleting the now-empty lines (here ^I stands for a literal tab character, typed as Ctrl-V then Tab):

sed -e 's/#.*//' -e 's/[ ^I]*$//' -e '/^$/ d'
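
Piping a small made-up snippet through the whole chain shows the effect ([[:blank:]] is the portable way to write the space-and-tab class if you don't want to type a literal tab):

printf 'name=value   # trailing comment\n\n# whole-line comment\nother=thing\t\n' | sed -e 's/#.*//' -e 's/[[:blank:]]*$//' -e '/^$/ d'
# prints:
# name=value
# other=thing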

Before I sign off, how do you tell whether your machine is running 32-bit or 64-bit Linux?

uname -m
i386 or i686 means it is 32-bit
x86_64 means it is 64-bit

Getting cpu info:

more /proc/cpuinfo
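
On the same note, /proc/cpuinfo prints one block per core, so a quick Linux-specific way to count cores is:

grep -c ^processor /proc/cpuinfo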

Some more generic Unix shell one-liners:

Total number of files in your working area: ls -R . | wc -l (point it at / and add & to count the whole filesystem in the background)
To see only the processes you are really interested in: ps ax | grep httpd
To display a tree of processes: pstree
Calculator program in Unix: bc (it will wait for your input)
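
bc also works nicely in a pipe; a minimal example (scale sets the number of decimal places printed):

echo "scale=4; 22/7" | bc
# prints: 3.1428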

Cool Small Applications:
Chessboard:
for (( i = 1; i <= 9; i++ ))    ### Outer for loop ###
do
  for (( j = 1; j <= 9; j++ ))  ### Inner for loop ###
  do
    tot=`expr $i + $j`
    tmp=`expr $tot % 2`
    if [ $tmp -eq 0 ]; then
      echo -e -n "\033[47m "
    else
      echo -e -n "\033[40m "
    fi
  done
  echo -e -n "\033[40m"   #### set the background colour back to black
  echo ""                 #### print the new line ###
done

Cut and paste commands:

I find these two commands to be kind of a blessing. cut will pull out columns of a flat file for you and paste will merge them back together. There is one more command, join, that joins two files on a common field...

These utilities are extremely powerful if you are using them to merge two or more data files holding background-corrected/normalized data or something like that. Perl scripts are expensive in terms of file handling by comparison.
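
A minimal sketch of how the three fit together, assuming two made-up tab-separated files sample1.txt and sample2.txt (probe ID in column 1, a value in column 2):

cut -f2 sample2.txt > values2.txt            # keep only the value column of the second file
paste sample1.txt values2.txt > merged.txt   # glue it alongside the first file's columns
join sample1.txt sample2.txt                 # or join the two files on the common ID column
                                             # (join expects both files sorted on that field)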

Using awk:

Did you know that the present-day scripting language Perl borrows much of its syntax from Unix shell tools, particularly from awk?

awk is a powerful pattern matcher with most of its syntax similar to Perl, or should I say Perl has syntax similar to awk.

awk '
BEGIN { actions }
/pattern/ { actions }
/pattern/ { actions }
END { actions }
' files

in the quoted area you can put anything like a condition such as

awk '{ if($5 !~ /[0-9]/ || NF != 7) {print "Line Number " NR " has error: " $0}}' input_file

This command checks whether column 5 of the input file has a non-numeric value and whether each record has exactly 7 fields. Any line that fails either test is printed along with its line number; if there are no errors it prints nothing. NR is a built-in awk variable holding the current record (line) number, and NF holds the number of fields. By default awk splits fields on whitespace (spaces and tabs), but if your file uses something else as the separator you can begin awk with:


awk '
BEGIN { FS =":"; }
/pattern/ { actions }
/pattern/ { actions }
END { actions }
' files
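
A classic concrete case is /etc/passwd, which is colon-separated; with FS set to ":" the first field is the username:

awk 'BEGIN { FS = ":" } { print $1 }' /etc/passwd
# prints one username per line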

For example, if you are scanning a file for a particular pattern and you want to print every line containing it:

awk ' BEGIN{FS = ":"; }
/pattern/ {i++; print "Line Number: " NR ":" $0}
END {printf("Number of lines having pattern %d\n",i)}
' input_file

This command will print something like:

Line Number : 1 : suydusydu 88 sllsilsdi
Line Number :5 : suyfdusydu 88 sllsilsdi
Line Number :6 : suycdusydu 88 sllsilsdi
Line Number :7 : suyvdusydu 88 sllsilsdi
Line Number :8 : suygdusydu 88 sllsilsdi
Line Number :9 : suydugsydu 88 sllsilsdi
Line Number :10 : suydusggydu 88 sllsilsdi
Line Number :11 : suydusygdu 88 sllsilsdi
Line Number :12 : suydusgydu 88 sllsilsdi
Line Number :13 : suydusffydu uiuiu sllsilsdi
Line Number :14 : suydusssydu 88 sllsilsdi
Line Number :15 : suydussydu 88 sllsilsdi
Line Number :16 : suydussssydu 88 sllsilsdi
Line Number :17 : suydusssssydu 88 sllsilsdi
Line Number :18 : suydusyssadu 88 sllsilsdi
Line Number :19 : suydusyqqdu 88 sllsilsdi
Line Number :20 : suydusydffu 88 sllsilsdi
Number of lines having pattern 17
