genomics: 2013

Saturday, July 06, 2013

Fastq format converter in Fastx

Fastx is a useful tool for quality clipping, getting quality statistics etc. Just running with -i and -o parameters, it will complain if you have file in sanger fastq format. there is an undocumented parameter that takes the quality offset. For instance if you want to run a sanger fastq file, then do the following:
./fastx_quality_stats -i tmp -Q 33 -o tmp.out
But if the file is in Solexa format, then you dont have to specify the -Q option.

However, there is always a need to have a script in your tool box that can convert one fastq format to another. I found a nice script online probably written by Heng Li

It takes about 30 mins to run in a single processor mode running on the head node of a cluster with 49 GB memory took approximately 1 hour for a 25 GB input file.

#!/usr/bin/perl -w

# Author: lh3
#
# fq_all2std.pl was originally distributed as part of the Maq software package (http://maq.sourceforge.net/)
# and is distributed under the GNU Public License (GPL) http://www.gnu.org/copyleft/gpl.html.
#
# Modified: Lance Parsons
# Added illumina2std function
# Fixed CR/LF conversion in sol2std
# Version: 0.1.7_lp

use strict;
use warnings;
use Getopt::Std;

my $usage = qq(
Usage:   fq_all2std.pl <command> <in.txt>

Command: scarf2std      Convert SCARF format to the standard/Sanger FASTQ
         fqint2std      Convert FASTQ-int format to the standard/Sanger FASTQ
         sol2std        Convert Solexa/Illumina FASTQ (Pipline 1.0 and below) to the standard FASTQ
         illumina2std Convert Illumina FASTQ (Pipeline v1.3) to the standard FASTQ
         std2sol        Convert standard FASTQ to Solexa/Illumina FASTQ (simplified)
         std2illumina Convert standard FASTQ to Illumina FASTQ (Pipeline v1.3)
         fa2std         Convert FASTA to the standard FASTQ
         seqprb2std     Convert .seq and .prb files to the standard FASTQ
         fq2fa          Convert various FASTQ-like format to FASTA
         export2sol     Convert Solexa export format to Solexa FASTQ
         export2std     Convert Solexa export format to Sanger FASTQ
         csfa2std       Convert AB SOLiD read format to Sanger FASTQ
         std2qual       Convert standard FASTQ to .seq+.qual
         instruction    Explanation to different format
         example        Show examples of various formats

Note:    Read/quality sequences MUST be presented in one line.
\n);

die($usage) if ( @ARGV < 1 );

# Solexa->Sanger quality conversion table
my @conv_table;
my @rev_conv_table;
for ( -64 .. 64 ) {
 $conv_table[ $_ + 64 ] = chr( int( 33 + 10 * log( 1 + 10**( $_ / 10.0 ) ) / log(10) + .499 ) );
 $rev_conv_table[ int( 33 + 10 * log( 1 + 10**( $_ / 10.0 ) ) / log(10) + .499 ) ] = chr( $_ + 64 );
}

# parsing command line
my $cmd      = shift;
my %cmd_hash = (
 scarf2std    => \&scarf2std,
 fqint2std    => \&fqint2std,
 sol2std      => \&sol2std,
 fa2std       => \&fa2std,
 fq2fa        => \&fq2fa,
 example      => \&example,
 instruction  => \&instruction,
 export2sol   => \&export2sol,
 export2std   => \&export2std,
 csfa2std     => \&csfa2std,
 seqprb2std   => \&seqprb2std,
 std2sol      => \&std2sol,
 illumina2std => \&illumina2std,
 std2illumina => \&std2illumina,
 std2qual     => \&std2qual
);
if ( defined( $cmd_hash{$cmd} ) ) {
 &{ $cmd_hash{$cmd} };
}
else {
 die("** Unrecognized command $cmd");
}

sub fa2std {
 my %opts = ( q => 25 );
 getopts( 'q:', \%opts );
 die("Usage: fq_all2std.pl fa2std [-q $opts{q}] <in.fa>\n") if ( -t STDIN && @ARGV == 0 );
 my $q = chr( $opts{q} + 33 );
 while (<>) {
  if (/^>(\S+)/) {
   print "\@$1\n";
   $_ = <>;
   print "$_+\n", $q x ( length($_) - 1 ), "\n";
  }
 }
}

sub csfa2std {
 my %opts = ( q => 25, Q => '', l => 0 );
 getopts( 'q:Q:l:', \%opts );
 die( "
Usage:   fq_all2std.pl csfa2std [options] <in.csfa>\n
Options: -q INT     default base quality [$opts{q}]
         -Q FILE    quality file [null]
         -l INT     output read length, 0 for auto [$opts{l}]

Note:    For paired-end alignment, Maq requires two sequence files as the
         input. The n-th read in the first file should forms a read pair with
         the n-th read in the second file. However, SOLiD reads may be
         singletons and therefore further prepocessing is needed.
\n" ) if ( -t STDIN && @ARGV == 0 );
 my ( $fh, $name, $seq );
 my $len = $opts{l};
 my $q   = chr( $opts{q} + 33 );
 if ( $opts{Q} ) {
  open( $fh, $opts{Q} ) || die("** fail to open quality file '$opts{Q}'");
 }
 while (1) {
  while (<>) { last if (/^>/); }
  last unless ($_);
  /^>(\S+)/;
  $name = $1;
  $_ = substr( <>, 2 );
  chomp;
  tr/0123./ACGTN/;
  $seq = $_;

  if ($fh) {    # .qual file is available
   while (<$fh>) { last if (/^>(\S+)/); }
   /^>(\S+)/;
   die("** unmatched seq-qual name: '$name' ne '$1'") unless ( $1 eq $name );
   $_ = <$fh>;
   s/(\s*\d+\s*)/chr(int($1) + 33)/eg;
   $_ = substr( $_, 1 );
  }
  else {
   $_ = $q x length($seq);
  }
  if ( $name =~ /^(\S+)_F\d$/ ) {    # change read name for maq
   $name = "$1/1";
  }
  elsif ( $name =~ /^(\S+)_R\d$/ ) {
   $name = "$1/2";
  }
  if ($len) {                        # chop the sequence if required
   $seq = substr( $seq, 1, $len );
   $_   = substr( $_,   1, $len );
  }
  print "\@$name\n$seq\n+\n$_\n";
 }
 close($fh) if ($fh);
}

sub fq2fa {
 while (<>) {
  if (/^@(\S+)/) {
   print ">$1\n";
   $_ = <>;
   print;
   <>;
   <>;
  }
 }
}

sub scarf2std {
 while (<>) {
  my @t = split( ':', $_ );
  my $name = join( '_', @t[ 0 .. 4 ] );
  print "\@$name\n$t[5]\n+\n";
  my $qual = '';
  @t = split( /\s/, $t[6] );
  $qual .= $conv_table[ $_ + 64 ] for (@t);
  print "$qual\n";
 }
}

sub seqprb2std {
 die("Usage: fq_all2std.pl seqprb2std <in.seq.txt> <in.prb.txt>\n") if ( @ARGV != 2 );
 my ( $fhs, $fhq );
 open( $fhs, $ARGV[0] ) || die;
 open( $fhq, $ARGV[1] ) || die;
 while (<$fhs>) {
  my @t = split;
  my $name = join( ":", @t[ 0 .. 3 ] );
  $t[4] =~ tr/./N/;
  print "\@$name\n$t[4]\n+\n";
  $_ = <$fhq>;
  @t = split;
  my $q   = '';
  my $max = -100;

  for ( 0 .. $#t ) {
   $max = $t[$_] if ( $t[$_] > $max );
   if ( ( $_ & 0x3 ) == 3 ) {
    $q .= $conv_table[ $max + 64 ];
    $max = -100;
   }
  }
  print "$q\n";
 }
 close($fhs);
 close($fhq);
}

sub export2sol {
 while (<>) {
  chomp;
  my @t = split( "\t", $_ );
  my $output = *STDOUT;
  if ( $t[21] ne 'Y' ) {
   $output = *STDERR;
  }

  my $x = ( defined( $t[7] ) && ( $t[7] == 1 || $t[7] == 2 ) ) ? "/$t[7]" : '';
  $t[0] =~ s/^(SLXA|HWI)-//;
  $t[0] =~ s/_Human//i;
  $t[0] =~ s/_PhiX//i;
  $t[0] =~ s/_R1//;
  my $rn_head = ( $t[0] =~ /(^[A-Z]+\d+_\d+)/ ) ? $1 : ( $t[1] ? "$t[0]_$t[1]" : $t[0] );
  print {$output} "\@$rn_head:$t[2]:$t[3]:$t[4]:$t[5]$x\n$t[8]\n+\n$t[9]\n";
 }
}

sub export2std {
 while (<>) {
  chomp;
  my @t = split( "\t", $_ );
  my $output = *STDOUT;
  if ( $t[21] ne 'Y' ) {
   $output = *STDERR;
  }
  my $x = ( defined( $t[7] ) && ( $t[7] eq 1 || $t[7] eq 2 ) ) ? "/$t[7]" : '';
  $t[0] =~ s/^SLXA-//;
  $t[0] =~ s/_Human//i;
  $t[0] =~ s/_PhiX//i;
  $t[0] =~ s/_R1//;
  my $rn_head = ( $t[0] =~ /(^[A-Z]+\d+_\d+)/ ) ? $1 : ( $t[1] ? "$t[0]_$t[1]" : $t[0] );
  print {$output} "\@$rn_head:$t[2]:$t[3]:$t[4]:$t[5]$x\n$t[8]\n";
  my @s = split( '', $t[9] );
  my $qual = '';
  $qual .= $conv_table[ ord($_) ] for (@s);
  print {$output} "+\n$qual\n";
 }
}

sub fqint2std {
 while (<>) {
  if (/^@/) {
   print;
   $_ = <>;
   print;
   $_ = <>;
   $_ = <>;
   my @t    = split;
   my $qual = '';
   $qual .= $conv_table[ $_ + 64 ] for (@t);
   print "+\n$qual\n";
  }
 }
}

sub sol2std {
 my $max = 0;
 while (<>) {
  if (/^@/) {
   print;
   $_ = <>;
   print;
   $_ = <>;
   $_ = <>;

   # Added to eliminate carriage return conversion
   chomp;
   my @t = split( '', $_ );
   my $qual = '';
   $qual .= $conv_table[ ord($_) ] for (@t);
   print "+\n$qual\n";
  }
 }
}

sub illumina2std {
 my $max = 0;
 while (<>) {
  if (/^@/) {
   print;
   $_ = <>;
   print;
   $_ = <>;
   $_ = <>;

   # Added to eliminate carriage return conversion
   chomp;
   my @t = split( '', $_ );
   my $qual = '';
   $qual .= chr( ord($_) - 31 ) for (@t);
   print "+\n$qual\n";
  }
 }
}

sub std2illumina {
 my $max = 0;
 while (<>) {
  if (/^@/) {
   print;
   $_ = <>;
   print;
   $_ = <>;
   $_ = <>;
   chomp;
   my @t = split( '', $_ );
   my $qual = '';
   $qual .= chr( ord($_) + 31 ) for (@t);
   print "+\n$qual\n";
  }
 }
}

sub std2sol {
 my $max = 0;
 while (<>) {
  if (/^@/) {
   print;
   $_ = <>;
   print;
   $_ = <>;
   $_ = <>;
   chomp;
   
   #tr/!-]/@-|/; 
   #print "+\n$_\n";
   
   my @t = split( '', $_ );
   my $qual = '';
   $qual .= $rev_conv_table[ ord($_) ] for (@t);
   print "+\n$qual\n";
  }
 }
}

sub std2qual {
 die("Usage fq_all2std.pl std2qual <out.prefix> <in.fastq>\n") if ( @ARGV == 0 );
 my $pre = shift(@ARGV);
 my ( $fhs, $fhq );
 open( $fhs, ">$pre.seq" )  || die;
 open( $fhq, ">$pre.qual" ) || die;
 while (<>) {
  s/^@/>/;
  print $fhs $_;
  print $fhq $_;
  $_ = <>;
  print $fhs $_;
  <>;
  $_ = <>;
  s/([!-~])/" ".(ord($1)-33)/eg;
  $_ = substr( $_, 1 );
  print $fhq $_;
 }
 close($fhs);
 close($fhq);
}

sub instruction {

 print "
FASTQ format is first used in the Sanger Institute, and therefore
we take the Sanger specification as the standard FASTQ. Although
Solexa/Illumina reads file looks pretty much like the standard
FASTQ, they are different in that the qualities are scaled
differently. In the quality string, if you can see a character
with its ASCII code higher than 90, probably your file is in the
Solexa/Illumina format.

Sometimes we also use an integer, instead of a single character,
to explicitly show the qualities. In that case, negative
qualities indicates that Solexa/Illumina qualities are used.

";

}

sub example {
 my $exam_scarf = '
USI-EAS50_1:4:2:710:120:GTCAAAGTAATAATAGGAGATTTGAGCTATTT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 19 23 23 23 18 23 23 23
USI-EAS50_1:4:2:690:87:GTTTTTTTTTTTCTTTCCATTAATTTCCCTTT:23 23 23 23 23 23 23 23 23 23 23 23 12 23 23 23 23 23 16 23 23 9 18 23 23 23 12 23 18 23 23 23
USI-EAS50_1:4:2:709:32:GAGAAGTCAAACCTGTGTTAGAAATTTTATAC:23 23 23 23 23 23 23 23 20 23 23 23 23 23 23 23 23 23 23 23 23 12 23 18 23 23 23 23 23 23 23 23
USI-EAS50_1:4:2:886:890:GCTTATTTAAAAATTTACTTGGGGTTGTCTTT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23
USI-EAS50_1:4:2:682:91:GGGTTTCTAGACTAAAGGGATTTAACAAGTTT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 20 23 23 23 23 23 23 23 23 23 23 23 18 23 23 23 23
USI-EAS50_1:4:2:663:928:GAATTTGTTTGAAGAGTGTCATGGTCAGATCT:23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23
';

 my $exam_fqint = '
@4_1_912_360
AAGGGGCTAGAGAAACACGTAATGAAGGGAGGACTC
+4_1_912_360
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 21 40 40 40 40 40 40 40 40 40 26 40 40 14 39 40 40
@4_1_54_483
TAATAAATGTGCTTCCTTGATGCATGTGCTATGATT
+4_1_54_483
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 16 40 40 40 28 40 40 40 40 40 40 16 40 40 5 40 40
@4_1_537_334
ATTGATGATGCTGTGCACCTAGCAAGAAGTTGCATA
+4_1_537_334
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 21 29 40 40 33 40 40 33 40 40 33 31 40 40 40 40 18 26 40 -2
@4_1_920_361
AACGGCACAATCCAGGTTGATGCCTACGGCGGGTAC
+4_1_920_361
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 40 40 40 40 40 40 40 40 31 40 40 40 40 40 40 15 5 -1 3
@4_1_784_155
AATGCATGCTTCGAATGGCATTCTCTTCAATCACGA
+4_1_784_155
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 31 40 40 40 40 40
@4_1_595_150
AAAGACGTGGCCAGATGGGTGGCCAAGTGCCCGACT
+4_1_595_150
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 30 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 14 40 40
';

 my $exam_sol = '
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
YYYYYYYYYYYYYYYYYYWYWYYSU
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
YYYYYYYYYWYYYYWWYYYWYWYWW
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
YYYYYYYYYYYYYWYYWYYSYYYSY
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
YYYYYYYYYYYYYYYYYYWWWWYWY
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
YYYYYYYYYYWYYYYWYWWUWWWQQ
';

 print qq(
solexa
======
$exam_sol
scarf
=====
$exam_scarf
fqint
=====
$exam_fqint
);
}

Thursday, July 04, 2013

Not mine - I found this csfasta to fastq conversion script that worked best...

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

my $usage = qq(
This script is used for converting SOLiD/ABi FASTQ to Standard(Sanger) FASTQ
Usage : perl $0 [Option]
Option: -seq seq file
-qual qual file
-fastq FASTQ file
-o output file
-help Help Information
Example: perl $0 -fastq solid.fastq -o std.fastq

Author : BENM <BinxiaoFeng\@gmail.com>
Version: 1.2
Date : 2009-10-06
Update : 2012-03-27
\n);
my ($Seq,$Qual,$Fastq,$Output,$Header,$Help);
my %opts;
GetOptions
(
\%opts,
"seq:s"=>\$Seq,
"qual:s"=>\$Qual,
"fastq:s"=>\$Fastq,
"o:s"=>\$Output,
"header:s"=>\$Header,
"help"=>\$Help
);

die($usage) if (((!defined $Seq)||(!defined $Qual))&&(!defined $Fastq)&&(!defined $Output)||($Help));

# Solexa->Sanger quality conversion table
my @conv_table;
for (-64..64) {
$conv_table[$_+64] = chr(int(33 + 10*log(1+10**($_/10.0))/log(10)+.499));
}

# SOLiD color code
my @code = ([0,1,2,3],[1,0,3,2],[2,3,0,1],[3,2,1,0]);
my @bases = qw(A C G T);
my %decode = ();
foreach my $i(0..3)
{
foreach my $j(0..3)
{
$decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
}
}

open (OUT,">$Output") || die "Fail to create FASTA file:$Output for writing\n";
if (defined $Fastq)
{
open (IN,$Fastq) || die "Fail to open FASTQ file:$Fastq for reading\n";
while (<IN>)
{
if (/^@/)
{
print OUT $_;
$_ = <IN>;
s/\s+$//;
my $seq = ($_=~/\d+/) ? col2base($_) : $_;
print OUT "$seq\n";
$_ = <IN>; print OUT $_; $_ = <IN>;
s/\s+$//;
my @t = split;
my $qual = '';
if ($_=~/^\d+\s*/)
{
my @t=split;
$qual .= $conv_table[$_+64] for (@t); #intfq: SOLiD -> Standard
}
else
{
$qual .= $_;
}
substr($qual,0,1,'') if (length($qual)>length($seq));
print OUT "$qual\n";
}
}
close IN;
}
elsif ((defined $Seq)&&(defined $Qual))
{
my ($Title,$Name,$seq,$qual)=("","","","");
open (IN1,$Seq) || die "Fail to open Seq file:$Seq for reading\n";
open (IN2,$Qual) || die "Fail to open Quality file:$Qual for reading\n";
while(<IN2>)
{
if ($_=~/^\#\s+Title\:\s+(\S+)/)
{
$Title=$1;
my $in1=<IN1>;
while($in1 !~ /^\#\s+Title\:\s+(\S+)/)
{
$in1=<IN1>;
}
}
else
{
if ($_=~/^#\s\w+/)
{
<IN2>;<IN1>;
}
elsif (/^[\@\>](\S+)/)
{
chomp $seq;
chomp $qual;
substr($qual,0,1,"") if (length($qual)>length($seq));
warn "The length of quality is not match with seqence!\nSeq: $seq\nQual: $qual\n" if (length($qual)!= length($seq));
print OUT "$Name\n$seq\n+\n$qual\n" if ($seq ne "")&&($qual ne "");
($seq,$qual)=("","");
$Name="@".$Title.$1;
$_ = <IN1>;
if (/^[\@\>]/)
{
$_ = <IN1>;
}
my $tmp_seq;
while (($_ ne "")&&($_ !~ /^[\@\>]/)&&(!eof))
{
s/\s+$//;
$tmp_seq.=$_;
$_=<IN1>;
chomp $_;
}
$tmp_seq .= $_ if ((eof)&&($_ !~ /^[\@\>]/));
chomp $tmp_seq;
$seq = (($tmp_seq=~/\d+/)) ? col2base($tmp_seq) : $tmp_seq;
$_ = <IN2>;
s/^\s+//;
s/\s+$//;
if (($_=~/\d+\s+\d+/)||($_=~/^\d+$/))
{
$qual .= " " if ($qual=~/\d+$/);
my @t = split /\s+/,$_;
$qual .= $conv_table[$_+64] for (@t);
}
else
{
$qual .= $_;
}
}
else
{
s/^\s+//;
s/\s+$//;
if (($_=~/\d+\s+\d+/)||($_=~/^\d+$/))
{
$qual .= " " if ($qual=~/\d+$/);
my @t = split /\s+/,$_;
$qual .= $conv_table[$_+64] for (@t);
}
else
{
$qual .= $_;
}
}
}
}

substr($qual,0,1,"") if (length($qual)>length($seq));
print OUT "$Name\n$seq\n+\n$qual\n" if ($seq ne "")&&($qual ne "");
close IN1;
close IN2;
}
close OUT;

sub col2base
{
my $col = shift;
my @colors = split '',$col;
my $string = $colors[0];
if($string !~ /[acgt]/i){
warn "$col has no header base\n";
return 0;
}
my $last_base = $string;
my $current_base = '';
for(my $i=1;$i<@colors;$i++)
{
if (($last_base=~/N/i)&&(($colors[$i] eq ".")||($colors[$i]==5)))
{
$current_base = $bases[int(rand(@bases))];
}
else
{
$current_base = (exists $decode{$colors[$i]}->{$last_base}) ? $decode{$colors[$i]}->{$last_base} : "N";
}
$string .= $current_base;
$last_base = $current_base;
}
substr($string,0,1,"");
return $string;
}

Sunday, June 23, 2013

Installing Samtools and BEDTools

Trouble comes as a package when you start your lab. In my effort to resurrect our old database Eumicrobedb.org, I face new challenges every day. What earlier seemed to a be a cake walk, does not look that easy. I have installed so many open source programs in my last stint that I lost count of it. When running this software, I realize that many more dependencies need to be installed. Not to mention my troubles dealing with a crashed server and subsequent loss of all installations:((((

recently I was stuck with easy installations samtools-0.1.19 and bedtools-2.17.0 for more than 3 days. Earlier versions compiled fine in my SUSE linux machine but the same sources give trouble in Red Hat Linux.

While compiling BEDTools, it complained about "undefined reference to `gzopen64'". Searching the forums I figured out that it is complaining about zlib. I checked, it was there in my server and also in the path. However, finally changing the makefile with the following compiled the program.

Look for the line in Makefile 'export LIBS'

and add the following:
export LIBS = YOUR_PATH/libz.so.1.2.7 -lz

do a make clean and do make
It should compile fine.

Samtools:

With Samtools My problem lasted for a very long time. It always exited with error
samtools error bam_import.c:76: undefined reference to gzopen64

I tried re-installing zlib, added zlib path to LD_LIBRARY_PATH, installed latest version zlib, nothing worked.
Finally I changed the line in Makefile that says :

CFLAGS= -g -Wall -O2 to
CFLAGS= -g -Wall -O2 -L /usr/local/lib (This is where my zlib libraries are located)

viola.... It got installed.

NOTE: Install as regular user. May be later you can copy the binary files to the system
directories.

Sunday, May 19, 2013

Installing CEGMA...

Installing CEGMA can be a bit of a challenge for most of the users including me! I had installed it earlier, but this time it really gave me a very hard time in RH 6.2. Apart from installing a set of pre-requisitites, installation of genewise is bit of a challenge - especially so because now its status is archival. The INSTALL instruction that comes with the package is not very useful..

Genewise has several pre-requisites:

glib -> Provides low level data structure support for C programming. Can be found here:
http://www.linuxfromscratch.org/blfs/view/svn/general/glib2.html

glib has several pre-requisites such as
1. pkg-config: This package provides tools for passing include and library path so that packages can be built using configure and make utilities.

This package can be found here: http://pkgconfig.freedesktop.org/releases/pkg-config-0.28.tar.gz

2. libffi: This library provides high level portable interface for programmers to call any function at run time and can be found at: ftp://sourceware.org/pub/libffi/libffi-3.0.13.tar.gz

This package also requires a patch that can be found at:

http://www.linuxfromscratch.org/patches/blfs/svn/libffi-3.0.13-includedir-1.patch

3. Python2.7.5 if it is not already there: http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.xz

4. PCRE library: This contains perl like regular expression libraries:
http://downloads.sourceforge.net/pcre/pcre-8.32.tar.bz2

All these installation will go smoothly, unless you have some problems with your system. Once done, go
Download wise2.4.1 source from http://wise.sourcearchive.com/downloads/2.4.1/

Do the following:

go to wise-2.4.1/src/HMMer2
then replace 'getline' to 'my_getline' in sqio.c
replace 'isnumber' in src/models/phasemodel.c into 'isdigit'

Then go and check the makefile under each directory under src and relace 'glib-config --libs' to 'pkg-config --libs glib-2.0' and also glib-config --cflags' to 'pkg-config --cflags glib-2.0'
using the following command:

find ./ -type f -name "makefile" -exec sed -i.old 's/glib-config --cflags/pkg-config --cflags glib-2.0/g' "{}" +;
and

find ./ -type f -name "makefile" -exec sed -i.old 's/glib-config --libs/pkg-config --libs glib-2.0/g' "{}" +;

Also set path for wiseconfigdir to wisecfg file that is somewhere inside the distribution.

Then do

make all

It works fine for RH6.2

Make sure blast program and Hmmer3 is on your path. Also install geneid in your system as that is also a pre-requisite. Once installed, first set the environment variables for the following:

CEGMA; CEGMATMP; PERL5LIB

PATH=$PATH:$HOME/bin:/usr/adadata/bin:/bin
export WISECONFIGDIR=/usr/adadata/wise2.4.0/wisecfg
export CEGMA=/usr/adadata/cegma_v2.4.010312
export CEGMATMP=/usr/adadata/cegma_v2.4.010312/tmp/
export PERL5LIB=$CEGMA/lib

Now finally Run CEGMA with the default parameters:

bin/cegma --genome sample/sample.dna --protein sample/sample.prot -o sample

References: http://korflab.ucdavis.edu/datasets/cegma/README

Thursday, April 18, 2013

Creating Network File System for your servers

Network File system is a great way to use your resources efficiently especially when they are connected to each other by network. There is a great article here [http://www.tldp.org/HOWTO/NFS-HOWTO/] that guides you about how to achieve this. For the impatient, I am putting a shorter step by step version here, so that you can quickly achieve it.

As a pre-requisite you need to have root access in all your servers you are trying to connect.

1. edit /etc/exports file to have the following entries:
/usr/data nnn.n.n.n(rw,sync)
/usr/data mmm.m.m.m(rw,sync)

where /usr/data in the local machine is the directory you want to share with machines with ip nnn.n.n.n and mmm.m.m.m. Here rw means read write, sync is to synchronise.
You can change this file in all the servers you want to share items with.

2. make entries in your /etc/hosts file which should read something like:
127.0.0.1 xxx.localdomain localhost
nnn.n.n.n yyy.localdomain -> where your machine has the name 'xxx' and the ip address of the machine you are trying to have in network is nnn.n.n.n and the name of that machine is: 'yyy'. This way you can keep adding the number of machines in the network that you wish to add.

3. In newer machines you can make the following entries into your /etc/hosts.deny file:
portmap:ALL
lockd:ALL
mountd:ALL
rquotad:ALL
statd:ALL
All these commands are harmless, and nothing happens if your system does not have a particular protocol. But it is a great way to protect your machine.
4. In /etc/hosts.allow file add the following lines:
service:nnn.n.n.n , mmm.m.m.m
lockd:nnn.n.n.n , mmm.m.m.m
rquotad:nnn.n.n.n , mmm.m.m.m
mountd:nnn.n.n.n , mmm.m.m.m
statd:nnn.n.n.n , mmm.m.m.m
5. Run rpcinfo -p to check if portmapper, rquotad, mountd, nfs, amd, status, nlockmgr is running.

(where nnn.n.n.n and mmm.m.m.m are the other two machines in your network you wish to give nfs access to)

6. Run exportfs -ra to make nfsd to read /etc/exports

7. Check /proc/filesystems and see if there is an entry on nfs. If it is there then you are good to go

8. Check if rpc.statd and rpc.lockd is running by doing ps aux | grep statd or ps aux | grep lockd

9. Finally mount the file system. Make sure you have mount package is installed in your computer.
be a root in machine xxx.x.x.x and you want to have /usr/data drive of machine nnn.n.n.n to be mounted in /share/data in your local file system -> do the following:
mount nnn.n.n.n:/usr/data /share/data
and keep adding mount drives for other machines as well.

Do the same for other machines if you want to have drives from your machine xxx.x.x.x to be mounted else where. Now finally you are done!!

Happy creating network file system....

Friday, March 22, 2013

Training Augustus gene predictor for your organism

Lately I have been asked by multiple people to solve the training problem of Augustus for their organism data. Although I have done it earlier, this time, I faced unusually long time in solving this issue. Augustus wants you to provide a training dataset either in genbank format or in gff format. It typically wants users to remove duplicate genes; proteins with 70% identity, in order to get over the over fitting problem. Also, remove any kind of overlapping genes from your training files. If you could not find a script that can do it efficiently, please contact me directly I will provide you one (sorry, I am yet to set my source code page up).

I have generated a genbank file from my gff and the annotation files (again script can be sent upon request). Howver, few things I have overlooked and they were leaving empty lines between 2 genomic LOCUS entries. If empty space is left then the randomSplitter.pl that is used to split the gb training file into 2 random parts fails. You either get all your data in .test file or all of them in .train file. Suppose you decide to move on with your .train file with all the data and .test as a empty file. Surprises that you will end up with std::bad_alloc() error from augustus, that has nothing to do with the memory of your system. So, if you have an autogenerated genbank file with empty lines, open the file in vim editor and get rid of empty lines using :g/^$/d.

Once done, try splitting your *.gb file with randomSplit.pl. This will do the magic. Now try to train your program and it should run file.

Here is a check list of commands what you should be doing to train for Augustus.

perl scripts/new_species.pl --species=sojae
perl scripts/randomSplit.pl psv5.gb 30
./bin/etraining --species=sojae psv5.gb.train
# mock prediction
./bin/augustus --species=sojae psv5.gb.test | tee firsttest.out
grep -A 22 Evaluation firsttest.out

This is what you get when you run the last command. This is a pretty good indicator that your training is almost good to go.

******* Evaluation of gene prediction *******

---------------------------------------------\
| sensitivity | specificity |
---------------------------------------------|
nucleotide level | 0.861 | 0.674 |
---------------------------------------------/

--------------------------------------------------------------------------------
--------------------------\
| #pred | #anno | | FP = false pos. | FN = false neg. |
| |
| total/ | total/ | TP |--------------------|--------------------|
sensitivity | specificity |
| unique | unique | | part | ovlp | wrng | part | ovlp | wrng |
| |
--------------------------------------------------------------------------------
--------------------------|
| | | | 24559 | 20733 |
| |
exon level | 34213 | 30387 | 9654 | ------------------ | ------------------ |
0.318 | 0.282 |
| 34213 | 30387 | | 8709 | 4167 | 11683 | 9312 | 5502 | 5919 |
| |
--------------------------------------------------------------------------------
--------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno | TP | FP | FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level | 10348 | 12700 | 2547 | 7801 | 10153 | 0.201 | 0.246 |
----------------------------------------------------------------------------/

Sorry this output is jumbled but in essence it tells that 2547 genes were predicted exactly. Although exon and gene level results does not look that good, you could actually try and train the software using various types of exons that are accurate.

perl scripts/optimize_augustus.pl psv5.gb.train # This will take a while....