genomics: January 2009

Thursday, January 29, 2009

Editing publication quality images with GIMP

GIMP stand for GNU Image Manipulating Program. I have this software in my computer on and off for the last few years. I was adamant to not budge for paid softwares like Adobe photo shop and like. Gimp used to sit there in my computer and annoy me with its small icon, so I used to uninstall it often. When I needed to manipulate some figures for publication, I would install it back on. Not until very recently I had no clue how powerful and useful gimp can be. It can even beat the best image editing softwares in market..

Challenges:

Recently, I faced a challenge merging 3 pathway maps. They can be as complicated as a circuit diagram!! I had to basically merge Image 1 and Image 2 and make a composite image12. Then I had to merge this with image 3 in such a way that, imae3 - image12 will be in a different color and image12 - image3 will be in a different color.

Open image 1 as layer(file --> open as layer), then open image 2 as layer as well. Then go to layer menu --> Merge down --> save image as image12. In image12, you will see the extra portions of image 1 as little lighter objects, but image 2 appears as it is. Now open image12 as layer then open image 3 as layer. Then go to layer menu --> merge down --> save as image123.

Now in image123 you see the extra portions in 12 that is not present in 3 as lighter objects. Well enough..

Now do the reverse. Open image 3 as layer first followed by image 12 as layer. Then merge as before and name it as image 312.

What you will see is in image 312 --> All that is present extra in 3 are not present in image 12 will appear as solid bordered transparent object. Distinct from the upper layer. In Image 123 you will see just the reverse. The ones that are present extra in image 12 and are not present in image 3. Now after all these, you can happily open image 312 and start playing around with colors. Open image 312 and as shown in figure select the shape from the toolbox menu. Select the area neatly that need to be colored differently. Then choose color. Then select color menu and just click on the selected area in the image. Viola.. that changes the color so well....

Friday, January 16, 2009

The power of vim

vim is a very powerful text editor having many great features that can prove to be a time saver if you are editing a large file.

I use the substitution operation in vim more often than not. One can a do a number of things with just very little practice.

1. Make a newline:

%s/$/\r/g # will make the end of the line into a new line character.

2. Replace annoying ^M characters that often happens from a DOS to unix transition

%s/ctrlVM/BY anything you like/ # Remember here ctrl-V makes a ^ and M makes an M

3. If you want to substitute something from line 20,30 here is how you can do

:20,30s/old/new/ -> remember here no '%' before 's'

4. If you want to change the order of two words from Hypothetical protein -> protein Hypothetical

:%s/$Hypothetical$ $protein$/\2 \1/ -> This will change all occurrences of Hypothetical protein to protein Hypothetical.

Some More Tricks in vim

Delete lines that contain pattern:
:g/pattern/d

Delete all empty lines:
:g/^$/d
Delete lines in range that contain pattern:
:20,30/pattern/d
or with marks a and b:
:'a,'b/pattern/d

Substitute all lines for first occurance of pattern:
:%s/pattern/new/
:1,$s/pattern/new/

Substitute all lines for pattern globally (more than once on the line):
:%s/pattern/new/g
:1,$s/pattern/new/g

Find all lines containing pattern and then append -new to the end of each line:
:%s/$.*pattern.*$/\1-new/g

Substitute range:
:20,30s/pattern/new/g
with marks a and b:
:'a,'bs/pattern/new/g

Swap two patterns on a line:
:s/$pattern1$$pattern2$/\2\1/

Capitalize the first lowercase character on a line:
:%s/$[a-z]$/\u\1/
more concisely:
:%s/[a-z]/\u&/

Capitalize all lowercase characters on a line:
:%s/$[a-z]$/\u\1/g
more concisely:
:s/[a-z]/\u&/g

Capitalize all characters on a line:
:s/$.*$/\U\1\E/

Example: In a sequence file, if you have multiple headers and you are particularly
interested in substituting line Hp_ENSC_10a23 -> Hp_ENSC_10A23 without touching any
other line this is for you..

:%s/ENSC_$.*$$[a-z]$/ENSC_\1\u\2/g

Regular Expressions:

Regular expressions in unix is more or less similar like perl - ya perl borrowed regexp from Unix..

But using substitution, one need to take care of adding '\' (esc) character in order to match a value.

For example if a line with super_2[TAB]5997239[TAB]5997668 needs to change to super_2:5997239-5997668, then instead of using '\S+' as the first non space patter use $\S\+$. So, in other words the syntax will be

:%s/^$\S\+$\t$\S\+$\t/\1:\2-/g

Capitalize the first character of all words on a line:
:s/\<[a-z]/\u&/g

Uncapitalize the first character of all words on a line:
:s/\<[A-Z]/\l&/g

Change case of character under cursor:
~

Change case of all characters on line:
g~~

Change case of remaining word from cursor:
g~w

Increment the number under the cursor:

Decrement the number under the cursor:

redraw:

Turn on line numbering:
:set nu
Turn it off:
:set nonu

Number lines (filter the file through a unix command and replace with output):
:%!cat -n

Sort lines:
:%!sort

Sort and uniq:
:%!sort -u

Read output of command into buffer:
:r !ls -l

Refresh file from version on disk:
:e!

Open a new window:

n

Open a new window with the same file (split):

s

Split window vertically:

v

Close current window:

c
:q

Make current window the only window:

o

Cycle to next window:

w

Move to window below current window:

j

Move to window above current window:

k

Move to window left of current window:

h

Move to window right of current window:

l

Set textwidth for automatic line-wrapping as you type:
:set textwidth=80

Turn on syntax highlighting
:syn on
Turn it off:
:syn off

Force the filetype for syntax highlighting:
:set filetype=python
:set filetype=c
:set filetype=php

Use lighter coloring scheme for a dark background:
:set background=dark

Htmlize a file using the current syntax highlighting:
:so $VIMRUNTIME/syntax/2html.vim

Or, htmlize from a command prompt:
in 2html.sh put:

#!/bin/sh
vim -n -c ':so $VIMRUNTIME/syntax/2html.vim' -c ':wqa' $1 > /dev/null 2> /dev/null

Now just run: shell> 2html.sh foo.py

Getting rid of tab while copy pasting from a vim
edited file:

By default when you try to copy paste a block from a file edited with vim, it automatically appends character in the beginning of each line. In order to get rid of that in your .vimrc file put a line saying "set paste". This will do the job.

File Recovery:

Many times while editing a file the connection to the server may be lost. This is what I face all the time. In that case you can recover all the changes you made earlier by saying vim -r .file_name.swp

Executing Shell Commands

You can execute a command with the shell, for example, you can compile you C code when editing it, :!gcc kernel.c, or :!gcc % to compile the current editing file. Executing other commands is the same.

Keyword Matching

If you are lazy as me, you will find this function wonderful :) In insert mode, you can type a few characters of a word, e.g., if there is a variable called phosphatidylinositolphosphate, you may just type pho then C-p. Vim will find out the last word in the file started with characters pho, if it is not you want, you can re-type C-p to match the further previous ones. Similarly, C-n can do that for finding the next matches. Therefore you do not worry about the mistyping of the long variables, or uncommon-used words.

Insert/Delete an indent

C-T/C-D insert/delete an indent in the current line, no matter what column the cursor is in. There are handy hot keys when you are editing language with ambiguous block boundary such as Python, which you have to delete the indent yourself when necessary.

Settings for .vimrc

Some useful setting and mapping for .vimrc are listed as follows:

set nocp " nocompatible
set sw=4 " shiftwidth
set et " expandtab
set wm=8 " wrapmargin
set bs=2 " backspace
set ru " ruler
set ic " ignorecase
set is " incsearch
set scs " smartcase: override the 'ic' when searching
" if search pattern contains uppercase char
set vb t_vb= " set visual bell and disable screen flash
set backup

NOTE

Sometimes find replace and delete lines with a particular pattern can very quickly done using sed. Yes, vim takes longer time to complete a request if the file is large. So, instead try sed.

sed '/^--/d' lane7.1 > tmp # Will delete all lines with pattern --
# from file lane7.1 and will print in tmp
# -- comes as a pattern delimiter from 'grep'

sed 's/^@/>/g' lane7.1 > tmp # Will replace all lines beginning with @ to >
# This is useful in converting fastq files to fasta

Wednesday, January 14, 2009

Plotting Histograms and boxplot with R

Plotting huge files in R can be a cake walk. Here is how you can proceed:

Suppose say you have a huge file with 2 million entries in [TAB] format.

Then go ahead and read the file in the following format:

>data <- read.table(file="fileName",head=TRUE,sep=",")

# read.table is a file reading function
# Note here, if you have no header then say head=FALSE and if you data is separated by
# \t then say sep="\t"

#You may like to see summary of your data by saying

> summary(data)
V1 V2
HPAA-aaa57a02.b1: 1 Min. :0.000000
HPAA-aaa57a02.g1: 1 1st Qu.:0.009985
HPAA-aaa57a03.b1: 1 Median :0.020565
HPAA-aaa57a03.g1: 1 Mean :0.021674
HPAA-aaa57a04.b1: 1 3rd Qu.:0.033898
HPAA-aaa57a04.g1: 1 Max. :0.065789
(Other) :1140845

# Pretty Cool..

If you have not written down the column names in the data file, don't worry R provides one for you.

>names(data)
>[1] "V1" "V2"

# So, V1 and V2 are your column names.
# To Plot a Histogram simply say
>hist(data$V2)

# If you did not select any output device by default it will plot as a Rplots.ps file
# on your working directory

# You can however copy this to a file of your interest

>dev.copy(postscript,"myfile.ps",height=7,width=7)

# Here Height and width are in inches
# Sometimes it does not allow you to do a dev.copy when your dev.list() is NULL.
# You can populate it by first plotting something or by doing a hist(data$V2)
# or just a plot(data$V2)
# Then do
>hist(data$V2)

# You can check how many devices are attached to your screen by doing a

> dev.list()
postscript postscript postscript
2 3 4
# You can switch off your current device by doing a
>dev.off()
postscript
2
> dev.list()
postscript postscript
2 3

# I Plotted my histogram using the following command:

5
> hist(data$V2,breaks=100,main="Trace File Blast hit Frequency plot, cutoff=30"
,xlab="Y=M/(R+L - 2O)",col="red",las=1,freq=TRUE)
> lines(density(data$V2))

This will save the file in the current file "myfile.ps". Every subsequent call to a plot function will keep plotting in different pages of the same file.

Alternatively a truehist() can be called to plot histograms. But before that do a
>require(MASS)
> truehist(data$V2,nbins=100,main="Trace File Blast hit Frequency plot, cutoff=
30",xlab="Y=M/(R+L - 2O)",col="blue",las=1,prob=TRUE)

Making a boxplot:

Plotting a boxplot is equally easier using R. Follow the following commands:

data <- read.table(file="bgcorrect.out",head=FALSE,sep="\t")
summary(data)

#boxplot(data$vals,main='normalized data after synthetic normalization', ylab='G
ene Expression values');
boxplot(log2(data))
help(boxplot)

SAVING OUTPUT AS TEXT FILE

#In case you are just interested to save the histogram output file into a text file

> h <- hist(data$V2,breaks=100,plot=FALSE) # Plot= FALSE will not plot, default true
> save(h,ascii=TRUE,file="data.txt") # Will save data in ascii format, default=FALSE

# If you want to read the content of h
>h
# You will get the following
# $breaks -> break points
# $count -> frequency
# $density -> plot density
# $intensities -> plot intensity
# $mids -> mid points of data

So, just printing a single variable will be
>h$mids # Will print the mid point..

Friday, January 09, 2009

RepeatModeler Woes...

Continued from my last post..

Folks, if you are tempted to run a de novo repeat finder such as RepeatModeler, then this is for you.

I happened to get copies of all the dependencies along with RepeatModeler and got them installed in my linux machine. Here is what is not documented in any installation manual.

Installation Caveats:

1. If you installed RepeatMasker without getting a copy of RepBase, it will not complain and installation will be fine. But when you complete installation of RepeatModeler(If you could), then it will complain big time - "you did not install RepBase on repeatMasker", so go back and re-configure RepeatMasker all over again.

SO, LESSON NO 1 ALWAYS GET REPBASE BEFORE INSTALLING REPEATMASKER

2. TRFinder is named as trfLinux.. when you obtain it from the download site. The installation program looks for just "trf" program in the folder. So, RENAME the executable to trf before installation.

3. The biggest trouble comes with installation of RepeatModeler. It will run smooth till it asks you the path for wublast installation. WUBLAST is currently named as ABBLast and is no longer available. You only get the 1998 version from the web site. If you are lucky and you have a copy of wublast, then the installation complains about xdformat.
ERROR:
/home/sutripa/wublast/xdformat is too old. Must have one dated 3/16/2005 or new
er. Install a newer version of wublast and re-run configure.

In order to fix this, you can do little bit of hacking:
Go to the "configure" file in repeatmodeler package and comment the block
#unless ( $result =~ /option requires an argument -- m/ ) {
# die "$wuLocation/xdformat is too old. Must have one dated 3/16
/2005 "
# . "or newer. Install a newer version of wublast and re-run
"
# . "configure.\n";
#}

(This is located between line number 247 - 255)

4. While giving the path for RECON, say $RECON_PATH/bin. Otherwise configuration script will not find the executables.

Once all these are taken care of installation of repeatmodeler can be done.

RUNNING REPEATMODELER:

Again this is not any lesser challenge than installation.

Goto RepeatModeler file and on line 234 do the following:
Replace line
#my $dbIndex = `$RepModelConfig::XDFORMAT_PRGM -m2 -n -r $genomeDB`;
with
my $dbIndex = `$RepModelConfig::XDFORMAT_PRGM -n -r $genomeDB`;

AND ALSO Line number 500
Replace line
#my $sampleIndex = `$RepModelConfig::XDFORMAT_PRGM -m2 -n -r $sampleFa
staFile`;

with
my $sampleIndex = `$RepModelConfig::XDFORMAT_PRGM -n -r $sampleFastaF
ile`;

Use the README file for creating a database using the xdformat program from wublast.

/BuildXDFDatabase -name elephant elephant.fa

IMPORTANT:
[Before Running this, check your fasta file. Make the header of fasta as >string[SPACE]number format, where the number could ideally be the length of the sequence. Leave nothing else trailing on the header.]

If you fail to do this, while running wublast the program is going to complain
either of these Two. Either sequence not found OR Database is empty.

After doing all this, you may like to run the program as

RepeatModeler -database Ha.test >& Ha.out

Alas!!You thought the problem is finally solved and go check the run file. It still has error like:

FastaDB:_cleanIndexAndCompact - ERROR: Multiple fasta seqs appear on one line (>
seq-3_1-40000) and possibly more! Ignoring all entries on this line.
FastaDB:_cleanIndexAndCompact - ERROR: Multiple fasta seqs appear on one line (>
seq-1_1280001-1320000) and possibly more! Ignoring all entries on this line.
FastaDB:_cleanIndexAndCompact - ERROR: Multiple fasta seqs appear on one line (>
seq-1_520001-560000) and possibly more! Ignoring all entries on this line.

This is beacuse there is a silly bug in line 1590.
Go check one of the directories RM_12849.TueJan131331092009/round-1

you will see the header is printed like this: >>seq-3_1-40000

Now go to line 1590 in RepeatModeler and Replace

#print FASTA ">$seqID" . "_$start-$end\n";

With

print FASTA "$seqID" . "_$start-$end\n";

Now you can run the program.
Even after doing all these, I still have no luck in running this script. The run file ended up giving me a "unix broken pipe" error which is way beyond my control to fix.
Argh...

RepeatMasker blues..

I read rave reviews on RepeatMasker for repeat finding, so decided to get hold of a copy and get it used. I used this software before and did not like it much, but after getting some leading genome research institutes using it for their newly sequenced genomes, I got tempted. I used to use TRFinder from Boston university for finding direct repeats earlier.

I went to the web page of RepeatMasker that directs me to an array of other dependencies to be installed first. RepeatMasker depends on a repeat library RepBase which can be obtained from http://www.girinst.org. You go there and have to get yourself registered. Your registration need to be accepted by the group and then they send you an email carrying your login id and password. This may take upto 2 working days. Only after this you can download the RepBase database.

However, the trouble with RepBase is, it is a predefined library and will not detect any new repeat that may be there in your genome. So, the best option is to run a "De Novo" repeat finder. From the same web site you can get a link to RepeatModeler that acts as a wrapper around some de novo repeat finders such as RepeatScout and RECON.

Fair enough, then you go to install RepeatModeler...

This web page indicates you have to have a number of dependencies:

-- Of course you need to have perl 5.8 or above installed first.
-- you will need the Trfinder that I was mentioning earlier.
-- then you will need to download and install repeatmasker. This will not install unless you have RepBase in place.
--For RepeatMasker, you will also need phred/phrap/cross_match package. This can be done by writing to the authors and getting a copy by email. This may take a few business days.
--After you are done with repeatmasker you will need RECON to be installed
--Then RepeatScout need to be downloaded and installed.
--You will need WUBLAST also to be installed. The recommended site from the repeatmodeler page points to WUBLAST site at http://blast.wustl.edu/. When you go there you get redirected to http://www.advbiocomp.com . In advbiocomp site, they say that wublast is no longer supported and you can get an older version of WUblast for free. Then you get the binaries, inflate and keep them in place.

Now the final part: Installation of RepeatModeler. You do a ./configure. It asks you perl path, repeatMasker path etc. When it comes to WuBlast path, it exits saying that you xdformat program is older than 2005. What the hell, you back and check in wublast site, that says the program is as old as 1998. There are no newer version. There you get stuck.

Out of frustartion, I went and looked at the configure file in RepeatModeler. In line 237 there is a statement:

else {
my $result = `$wuLocation/xdformat -m 2>&1`;

# Two responses are expected:
# Good Version:
# /blah/wublast/xdformat: option requires an argument -- m
# Old Version:
# /blah/wublast/xdformat: invalid option -- m
unless ( $result =~ /option requires an argument -- m/ ) {
die "$wuLocation/xdformat is too old. Must have one dated 3/16/
5 "
. "or newer. Install a newer version of wublast and re-run

. "configure.\n";
}

I went back and asked the authors of xdformat, what this -m option means, they said to ignore it. Now I can change this, if only I find what is the substitute of this. Because my search in repeatModeler directory says me -m2 option is used:

gus@tyler-lab:~/RepeatModeler$ grep --regex=-m -r .
./configure: my $result = `$wuLocation/xdformat -m 2>&1`;
./Refiner: $malign->serializeOUT( "$wrkDir/$consName-malign-$round.ser"
)
./RepeatClassifier:# wublastx the simple-masked consensi vs the transposable ele
ment
./RepeatModeler:my $dbIndex = `$RepModelConfig::XDFORMAT_PRGM -m2 -n -r $genomeD
B`;
./RepeatModeler: my $sampleIndex = `$RepModelConfig::XDFORMAT_PRGM -m2 -n -r $s
ampleFastaFile`;
./RepeatModeler: # . " -mincount 3 -tandemdist 500 -output $fastaFile.lfreq
"
./RepeatModeler:"$workDir/sampleDB-$round.fa $workDir/consensi.fa -warnings -kap
-wordmask=dust wordmask=seg maskextra=10 -hspmax=20 V=0 B=250 Q=25 R=5 W=7 S=25
0 gapS2=250 S2=125 X=250 gapX=500 -matrix=comparison.matrix"

I am still waiting for a response...

Sunday, January 04, 2009

Author List

Recently there was a post by physioprof on the co-first author that appears second in the list. While I don't agree to with views that second listed first author to be completely ignored, but it does raise some question about authorship in papers.

While the younger lot generally go for quantity over quality and try to get themselves in any paper they find getting out of the group, many older un-productive ones that have been lying dormant for a long time get into papers through their experience. In many instances there are at least 30% of the people in a multi author paper get a free ride. There are also few others who generally get a ride through their influence or by having a good relationship with the PI. I know atleast a bunch of people who see their names in the paper by periodically checking their previous PI's website. They have no idea how and when the paper get published. Now this raises a serious question on genuine authorships in papers. In many journals there is a section to mention which author did what to the paper and I think that should be the norm

Friday, January 02, 2009

Author List and Order

Recently there was a post by physioprof on the co-first author that appears second in the list. While I don't agree to his views that second listed first author to be completely ignored, but it does raise some question about authorship in papers.

While the younger lot generally go for quantity over quality and try to get themselves in any paper they find getting out of the group, many older un-productive ones that have been lying dormant for a long time