Monday, February 28, 2011

Genomics: In search of rare human variants

This paper makes a very good reading. Besides that following are the few concepts that caught my imagination...

Missing Heritability:
obesity, diabetes and cardiovascular disease — are known to have a strong genetic component, their associated genomic variants detected through GWAS cannot explain most of the experimentally identified genetic effects found in affected families. Human geneticists call this problem the 'missing heritability'. Missing heritability may be due to very rare variants rather than the common ones discovered by GWAS.
Since GWAS misses out most of the common variants and sequencing thousands of individuals is going to be rather very expansive, a new concept is now getting rounds that is called as Imputation. Imputation is defined as 'inventing data' for some individuals depending on data available for other individuals[see the figure below].


 Remember: IT HAS ALWAYS BEEN AND IT STILL IS AN RNA EARTH LIFE.


Ref  [Rasmus Nielsen
Nature Volume: 467,Pages: 1050–1051 Date published: (28 October 2010)
DOI: doi:10.1038/4671050a]
A very good read...

Saturday, February 26, 2011

Calculating N50 and L50 of a contig assembly file

N50 is most often used in draft genome assembly - can be defined as the largest entity E such that at least half of the total size of the entities is contained in entities larger than E. For example if we have a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20kbp), the N50 length is 4 because we can cover 10 kb with contigs bigger than 4kb.

L50 is the number of scaffolds that accounts for more than 50% of the genome assembly.

Here is a small step by step protocol to calculate N50:

1. Read Fasta file and calculate sequence length.
2. Sort length on reverse order.
3. Calculate Total size.
4. Calculate N50.

## Read Fasta File and compute length ###
my $length;
my $totalLength;
my @arr;
while(){
   chomp;
   if(/>/){
   push (@arr, $length);
   $totalLength += $length;
   $length=0;
   next;
  }
  $length += length($_);
}

close(FH);

my @sort = sort {$b <=> $a} @arr;
my $n50;
my $L50;
foreach my $val(@sort){
     $n50+=$val;
     $L50++;
      if($n50 >= $totalLength/2){
         print "N50 length is $n50 and N50 value is: $val and L50 is $L50\n";
         last;
     }
}
 
If each sequence length is given in fasta header then one can grep '>' inputfile > out and do proper substitution
to get only the values(see vim editor section of the blog).
Then do a
$ sort -g -r inputfile > out
$ awk '{sum+=$1}END{print "Total:", sum} out  # To calculate total
$ Total : Number

# Then use second part of the perl subroutine to get the N50 value