Wednesday, December 28, 2011

Calling a perl subroutine from PHP scripts

I spent a good part of my Christmas vacation in figuring out how to call a perl subroutine from a PHP script. There are several reasons why you would like to do that. The first and foremost may be because you don't want to replicate all your perl subroutines to PHP in order to use it. The other issues may be incompatibilities. The one I face is on incompatibility of my PHP version to run oracle queries which can only be solved at the sys admin level.On the other hand the perl/CGI interface for oracle just works fine.

There are 3 levels this task can be achieved:
1. We will see how to pass absolute values to perl subroutines.
2. Pass variables to perl subroutines and
3. Collect return values from the perl subroutine

Following are some points to be remembered:

1. If your perl subroutines are packed into perl packages then they are good to go (e.g; The file should begin with a package "name"; header and the end of the file should have a 1; )
2. Do not use <include "package name"> inside the PHP script.
3. Initialize a string with perl commands e.g; $command='perl -MpackageName -e "packageName::subroutine(arg,arg,arg)"'
4. Call system($command); from the php script. Do not use backticks (`)

Here is an example of passing absolute value to perl subroutine:

##Package Test ###
#!usr/bin/perl -w
package Test;

sub printNames
my $name1 = shift;
my $name2 = shift;

print "The names are $name1 and $name2\n";



## save it as

Level1: Passing an absolute value into the perl subroutine:

# test.php
$command='perl -MTest -e "Test::printNames(Guest1,Guest2)"';
#Open browser and run test.php :

The names are Guest1 and Guest2

Level2: Passing a PHP variable into the perl subroutine:


$arg   ="guest1 and guest2";
$arg1 ="guest3 and guest4";

$command = "perl -MTest -e 'Test::printNames("$arg","$arg1")'";


# Open browser and run the command:

The names are guest1 and guest2 and guest3 and guest4

Level3: Collecting the return values from perl subroutine as PHP array

Instead of running PHP "system" command, run "exec". Print the outputs from inside the perl subroutine, that can be captured by exec. Now the perl subroutine will undergo slight modification:

##Package Test ###
#!usr/bin/perl -w
package Test;

sub calculateVal
my $val1 = shift;
my $val2 = shift;

$val1 *= 20;
$val2 /= 3;

print $val1;
print $val2;





$arg   =10;
$arg1 =300;

$command = "perl -MTest -e 'Test::printVal($arg,$arg1)'";
$out = array();
$tmp = exec($command,$out);

# Output

[0] => 200
[1] => 100

NOTE: Multidimensional perl arrays can also be passed by printing the value from inside the perl subroutine.

Thursday, October 20, 2011

Installing SQL Developer

This is slightly offtopic for this blog, but nevertheless a very important one. I have decided to blog this one becuase this particular problem dragged for atleast few days for me and after surfing countless number of sites and installing various softwares including harmful ones, I learnt the hard way how to tackle it.
If you have a 64 bit windows7  machine and you are trying your luck installing SQL Developer and failing consistently, this post is for you.
SQL developer is available at the oracle site here However, this version does not come with a compatible java version. You may already have java installed in your machine, if not, you may need to follow the instructions from the above mentioned site on how to install java.
After you downloaded and unpacked SQL Developer, when you click on the sqldeveloper.exe file, it may say "permission denied". This may be because the executable file may not have execution permission. You may like to change that with a chmod 755 <filename> . Then click on the executable file sqldeveloper.exe . It may possibly ask you the java path if it does not find one, just provide the path through the browse button option. I was stuck when the program complained about absence of msvcr100.dll and jvm.dll files . I browsed countless number of sites and in that process installed a melaware and ended up cleaning it later. Be aware, don't download .dll files from anywhere other than safe places. I looked for resources that asked me to download .NET and visual C++ that could solve the lingering msvcr100.dll problem but in vein. I have installed uninstalled .NET and visual C++ few times at least to make sure that the softwares are installed correctly, but that did not work. Since it was looking for this file from Java, I checked the java distribution and finally located it under  ProgramsFiles/Java/jdk.7.0_01/bin/ . Copy pasted it under Windows/system32/ . For jvm.dll, many sites including the java site advised that probably the java installation was incorrect. I re-installed java several times after each un-install and it still did not work. Finally I found a safe site at , from where I downloaded jvm.dll. Copy pasted this file under Windows/system32 and it solved the problem!!

Monday, October 17, 2011

RGS14 - The protein that makes us forget

In this months protein spotlight issue, there is a protein, RGS14 highlighted that makes us filter our memory. Do we not need something that will make us remember things rather forget them? Well, too much un-necessary information stored in brain will certainly make it more chaotic. However, silencing RGS14 in mice makes them smarter, I wonder if same can be said about humans.

The full article is available here

Thursday, September 22, 2011

Morph of plant embryo development

Awesome video!

To or Not to with cufflink:

Cufflink is an amazingly easy to install and use software, that lured me into using it. However, it is not without its sets of pitfalls... I am still researching on the illusive nature of the outputs from this software.
this software and the types of outputs it produces using different commands:
Here are few things to keep in mind before trying to run cufflink
1. Cufflink can be run on sorted bam files/ sam files.
2. It can run in multi-threading mode with a -p option and is much faster than single threading mode.
3. The new cuffmerge program first converts your gtf files derived from cuffcompare to sam files, merges them, sorts them, runs cufflink on them with a set of hard coded parameters and then runs cuffcompare on the finally to give you your merged.gtf file.
[The cufflink commandline option: cufflinks -o ./merged_asm/ -F 0.05 -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 16 ./merged_asm/tmp/mergeSam_filewvcXTG.]

Earlier, I used a route as follows:
1. Run cufflink on each of the libraries with a reference gtf and without a reference gtf.
2. Merge the outputs separately using cuffcompare on, with and without reference gtf and merge the output with a script to indicate which of the transcripts are represented by a gene id. But the potential pitfall with this approach is the overlapping and shorter transcripts. This clearly is a stumbling block when you are trying to produce an assembled transcript.

In the recent cufflink versions, there is a accessory program called as cuffmerge, which the manual suggests for merging the individual gtfs. As I have mentioned earlier, this is again a wrapper, that internally calls cufflink and cuffcompare, albeit with several options already pre-set. So, what I did was, merged the bam files generated from different libraries, merged them with samtools, sorted with samtools and derived sam files from the sorted bam for running cufflink [ Please note running the same sorted sam files with a reference gtf file suggests it is not sorted, where as without a gtf file runs fine..]

Output files
with reference gtf:
genes.fpkm.tracking -> has no fpkm information
transcripts.gtf -> has fpkm column but meaningless (all 0.0000)
isoforms.fpkm_tracking -> similar in size and content with genes.fpkm.tracking

Without reference gtf:
transcripts.gtf -> has transcript and exon information with fpkm and cov values; the co-ordinates are 1 based.
genes.fpkm.tracking -> has gene info with fpkm. The co-ordinates are 0 based
isoform.fpkm_tracking -> same as genes.fpkm.tracking

Comparison between all-merged-sam with cufflink VS cufflink -> gtf -> cuffcompare:

1. In both the cases, only a single exon is reported per gene(Since cufflink is run after bowtie directly without running tophat, this may be the case).
2. Much less number of transcripts are found in the first case, and they are non-overlapping and larger than the later case, where the transcripts are short, overlapping. The FPKM is slightly higher than the FPKM values in the later case. The first one seems to merge several smaller transcripts together.

So, in essence, if you have various biological replicates of a single treatment type, instead of going through the path of running them individually with cufflink, followed by merging the results with cuffmerge, merge the map bam files first and follow this route. The FPKM values are much accurate in this case...

Thursday, September 08, 2011

JavaScript: Changing drop down Lists

The select menu should work something like this. Change the first list and the second list also changed

Although this is a very small javascript trick, but nevertheless a very useful one! While creating forms, you would sometimes like to change a drop down list dynamically depending on which option was chosen in a earlier list(This may be a radio button, a list itself or anything else)
Here is a step by step procedure:

1. Write down the names and values for first drop down box e.g; Psojae V1-> name, psv1 -> value; Psojae V5-> name, psv5-> value and so on...
2. For each name in first drop down list, make a sublist of name value pairs, you want to appear on select: for example, for Psojae V1: PS1->name; ps1->value; PS2->name; ps2->value AND for Psojae V4: WI1->name; wi1->value; WI2->name; wi2->value.
3. Now write a javascript  with all these primary lists and sublist name value pairs something  like this:
<script language="javascript">
var lists = new Array();

//First List
lists['psv5']    = new Array(); // Notice here you are making a list with the value of first list
lists['psv5'][0] = new Array( // These are the names you want to appear on the second list
lists['psv5'][1] = new Array( //These are the values you want to pass on from the second list
//Second List
lists['psv1']    = new Array(); // Notice here you are making a list with the value of first list
lists['psv1'][0] = new Array( // These are the names you want to appear on the second list
lists['psv1'][1] = new Array( //These are the values you want to pass on from the second list

4. Write the second sets of javascripts having functions like:
emptyList, fillList and changeList

function emptyList( box ) {
        while ( box.options.length ) box.options[0] = null;
function fillList( box, arr ) {
        for ( i = 0; i < arr[0].length; i++ ) {
                option = new Option( arr[0][i], arr[1][i] );
                box.options[box.length] = option;
function changeList( box ) {
        list = lists[box.options[box.selectedIndex].value];
        emptyList( box.form.reads ); // Here notice I have given name 'reads' for the form object
        fillList( box.form.reads, list ); // that is the name tag on the select object in html for the second list

5. Now add the following to the body tag of your html page:

<body onload="changeList(document.forms['nextgen'].reference)"> // Here 'nextgen' is the name of the form. Notice, I add this trigger when the form gets loaded to execute changeList().

The page will be fine at this stage, only problem is you have to refresh it after making a select on the first select drop down. If you don't want that do this last thing:

6.  Write the following on the first select tag:

<select name="reference" size=1 onchange="changeList(this)">

In case, you want to be able to select multiple values from a select list(by shift + ctrl), go ahead and add the following to your second select tab:

<select name="reads" size=N multiple width=M>

Enjoy with javascript!

Monday, August 08, 2011

String matching algorithms - An overview

String matching algorithms are the most commonly used algorithms in sequence analysis. While we care less about the underlying principle that does the hard work for us, we remain elusive about the output. It helps a great deal if students get to know the algorithm that forms the basis of a search program. Here is a great link that illustrates various algorithms underlying the string search principle along with the C pseudo code. Enjoy...

Friday, July 15, 2011

Department of defense Ovarian cancer award proposal

My collaborator Prof. Alka Nanda Basu from UNT health Science Center and I go to the same Zumba class. Once while talking, we both discovered that we both are in science and could be potential collaborators. Alaka invited me to present my work at UNT and it was very well received. Since then we have been working together on few grants. One of grant proposal on OCRP from DOD got green signal and we are all set to write the full proposal. As exciting as this news is, another exciting news came from one of her summer students who worked on role of AMP activated Protein Kinase(AMPK) on cancer cells developing resistance to drug cisplatin. The hypothesis that led to a prestigious google award can be summarized as:
 AMPK inhibition may decrease cisplatin resistance in ovarian cancer cells that are resistant to the drug.

More on google award can be found at google web site at:​ampkandcisplatinresistance/hom​e

Friday, June 17, 2011

Feeding twitter rolls into your web site

This is the simplest way you could start feeding twitter rolls of your interest into your web site, just replace the word1, word2... with your suitable key words.

new TWTR.Widget({
version: 2,
type: 'search',
search: 'Word1 OR word2 OR word3 OR word4',
interval: 6000,
title: 'Oomycetes tweet feed',
subject: 'Tweet Feed',
width: 250,
height: 300,
theme: {
shell: {
background: '#8ec1da',
color: '#ffffff'
tweets: {
background: '#ffffff',
color: '#444444',
links: '#1985b5'
features: {
scrollbar: false,
loop: true,
live: true,
hashtags: true,
timestamp: true,
avatars: true,
toptweets: true,
behavior: 'default'

Monday, March 21, 2011

Analyzing oomycetes RNAseq data - Part 1

This is a small tutorial on how to work with large scale RNAseq data with open source softwares. I will be discussing various options for alignment of reads to the reference sequence and little bit on assembly.

[When I started writing this post, tophat did not have colorspace support, but now it does]
A quicklist of software programs available for nextgen data analysis can be found here:

To begin with bowtie, get the pre compiled binaries from here .

In order to start with bowtie, you need to build index of your reference sequence. This could be typically a fasta file. For me running command . "./bowtie-build /data/vbi/references/p_sojae.fasta sojaeV4"
to generate index for a 70 MB genome in a 64 bit machine  with dual core and 8 GB chip - took exactly 2 minutes. This command will generate 6 .ebwt files in the working directory.
ls -l
26385951 2010-06-11 15:18 sojaeV4.1.ebwt
9702096  2010-06-11 15:18 sojaeV4.2.ebwt
8711     2010-06-11 15:17 sojaeV4.3.ebwt
19404183 2010-06-11 15:17 sojaeV4.4.ebwt
26385951 2010-06-11 15:19 sojaeV4.rev.1.ebwt
9702096  2010-06-11 15:19 sojaeV4.rev.2.ebwt 

Once created make a separate directory called as index/ in the same working area and place all these files in there.

Test installation of index using the following command:
and withing fraction of a second you will see following printed to the stdout

0       -       super_16        1314431 CTTGTTGCGTCTCAGAAACCTTTTCGCGGCCATIIIIIII
# reads processed: 1
# reads with at least one reported alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 1 alignments to 1 output stream(s)

Good thing about bowtie(version 0.12 onwards, it supports colorspace alignment)

As of version 0.12.0, bowtie can align colorspace reads against a colorspace index when -C is specified. Colorspace is the characteristic output format of Applied Biosystems' SOLiD system.Look at color coding here for more details. See ABI's Principles of Di-Base Sequencing document for details.

All input formats (FASTA -f, FASTQ -q, raw -r, tab-delimited --12, command-line -c) are compatible with colorspace (-C). When -C is specified, read sequences are treated as colors. Colors may be encoded either as numbers (0=blue, 1=green, 2=orange, 3=red) or as characters A/C/G/T (A=blue, C=green, G=orange, T=red).
Some reads include a primer base as the first character; e.g.:
What about BFAST and SHRIMP and VMATCH? 
These 3 softwares are also good for fast alignment of short reads into the genome. 
BFAST stands for "BLAT-like Fast Accurate Search Tool", and it also supports colorspace 
data. SHRIMP also supports colorspace reads and recently the major improvement is that, 
it can work on small memory computers. VMATCH on the other hand is a software for
local alignments and needs licensing. I have not worked extensively on these 3 softwares,
so, can't say much about them
Is designed to predict the splice junctions after aligning short reads to the reference. 
It usually works without a reference junction position(GFF) file, but can also work
if there is already a known gff file. This program chops the reads into smaller fragments 
and stitches the alignments later in separating introns from exons. Now tophat has a 
colorspace support.


This result represents alignment percentage of 3 different assembly version of the
same organism. 

Tophat Output: [For larger datasets try splitting them into smaller files since tophat exits with some error hard to tell why it exited]

Caveats: I work with oomycetes pathogens and the introns and exons in this pathogen could be really much smaller than what tophat defines as default.

Running tophat may not work for everyone. For example with P.sojae, tophat produced awful number of junctions(just 700) with default parameters. Tweaking the parameters only produced worse results. As a work around, we tried aligning these transcripts to the predicted gene models as well. Comparing genome vs predicted gene model alignment results, we found around 10% of the alignments were not translated into predicted models(55% matched with genomes where as 45% match with predicted transcripts)

There is an option to try and get the sequences aligned with the genome to assemble into contigs. One easier method for this is to try ABySS 

Running ABySS is quite straight forward. Unpack the distribution and follow installation instructions. Once installed you could try running ABySS with different k mers. One small shell script for running different k mers
is as belows:

export PATH=$PATH:/home/sutripa/samtools-0.1.11/

for i in {20..40};
./ABYSS -k $i /data/bowtieOutput/PS-1_F3V1.bam.sorted.bam -o tmp-k$i.fa
[NOTE: Here the input file for assembly is sorted bam files. The bam files are generated by running bowtie]

Then check the files for their N50 values [See N50 section of the blog on how to calculate N50 values]
In case of P.sojae here is the N50 results:

 N50 Vs Number of scaffolds having values larger than N50 for P.sojae.
Here, the results are not very encouraging. So, you may try running with a bunch of different k values. I will discuss about running trans-abyss and cufflink in part - 2 of this series.



Monday, February 28, 2011

Genomics: In search of rare human variants

This paper makes a very good reading. Besides that following are the few concepts that caught my imagination...

Missing Heritability:
obesity, diabetes and cardiovascular disease — are known to have a strong genetic component, their associated genomic variants detected through GWAS cannot explain most of the experimentally identified genetic effects found in affected families. Human geneticists call this problem the 'missing heritability'. Missing heritability may be due to very rare variants rather than the common ones discovered by GWAS.
Since GWAS misses out most of the common variants and sequencing thousands of individuals is going to be rather very expansive, a new concept is now getting rounds that is called as Imputation. Imputation is defined as 'inventing data' for some individuals depending on data available for other individuals[see the figure below].


Ref  [Rasmus Nielsen
Nature Volume: 467,Pages: 1050–1051 Date published: (28 October 2010)
DOI: doi:10.1038/4671050a]
A very good read...

Saturday, February 26, 2011

Calculating N50 and L50 of a contig assembly file

N50 is most often used in draft genome assembly - can be defined as the largest entity E such that at least half of the total size of the entities is contained in entities larger than E. For example if we have a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20kbp), the N50 length is 4 because we can cover 10 kb with contigs bigger than 4kb.

L50 is the number of scaffolds that accounts for more than 50% of the genome assembly.

Here is a small step by step protocol to calculate N50:

1. Read Fasta file and calculate sequence length.
2. Sort length on reverse order.
3. Calculate Total size.
4. Calculate N50.

## Read Fasta File and compute length ###
my $length;
my $totalLength;
my @arr;
   push (@arr, $length);
   $totalLength += $length;
  $length += length($_);


my @sort = sort {$b <=> $a} @arr;
my $n50;
my $L50;
foreach my $val(@sort){
      if($n50 >= $totalLength/2){
         print "N50 length is $n50 and N50 value is: $val and L50 is $L50\n";
If each sequence length is given in fasta header then one can grep '>' inputfile > out and do proper substitution
to get only the values(see vim editor section of the blog).
Then do a
$ sort -g -r inputfile > out
$ awk '{sum+=$1}END{print "Total:", sum} out  # To calculate total
$ Total : Number

# Then use second part of the perl subroutine to get the N50 value