**N50**is most often used in draft genome assembly - can be defined as the largest entity E such that at least half of the total size of the entities is contained in entities larger than E. For example if we have a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20kbp), the N50 length is 4 because we can cover 10 kb with contigs bigger than 4kb.

**L50**is the number of scaffolds that accounts for more than 50% of the genome assembly.

Here is a small step by step protocol to calculate N50:

1. Read Fasta file and calculate sequence length.

2. Sort length on reverse order.

3. Calculate Total size.

4. Calculate N50.

## Read Fasta File and compute length ###

my $length;

my $totalLength;

my @arr;

while(

chomp;

if(/>/){

push (@arr, $length);

$totalLength += $length;

$length=0;

next;

}

$length += length($_);

}

close(FH);

my @sort = sort {$b <=> $a} @arr;

my $n50;

my $L50;

foreach my $val(@sort){

$n50+=$val;

$L50++;

if($n50 >= $totalLength/2){

print "N50 length is $n50 and N50 value is: $val and L50 is $L50\n";

last;

}

}

If each sequence length is given in fasta header then one can grep '>' inputfile > out and do proper substitution

to get only the values(see vim editor section of the blog).

Then do a

$ sort -g -r inputfile > out

$ awk '{sum+=$1}END{print "Total:", sum} out # To calculate total

$ Total : Number

# Then use second part of the perl subroutine to get the N50 value

## 2 comments:

Excellent post, nice code! Thanks a lot!

Thanks, I am glad that you liked the post.

Post a Comment