Tuesday, February 10, 2009

Using multidimensional hashes in storing sequence objects


One of the most important functions of multidimensional perl hashes has been their usage in reading sequence related data. The most common data type one reads from a genome sequencing prospective is a gff file. A gff file has the following attributes that are most likely captured in a data structure:

1. Name of the scaffold or chromosome on which the gene resides.
2. Name of the gene
3. Gene positions
4. Strand

Other attributes are less important, so lets ignore them at the moment.

So, now a scaffold will have multiple genes and a gene will have multiple locations. The ideal data structure that can store all these information will be a has of hashes and one of the hash holds arrays. In other words:

scaffold_1 => {
gene1 => { pos => [positions], strand => + or minus},
gene2 => { pos => [positions], strand => + or minus},
.......
},

scaffold_2 => {
gene1 => {pos => [positions], strand => + or minus},
gene2 => {pos => [positions], strand => + or minus},
.......
},
.....


Populating this hash from gff file:


sub test_gff{

my $fileName = shift;
my %HOH;

open PRED, $fileName or die "Can't open file $!\n";

while(){

chomp;
my @line = split(/\t/,$_);

push(@{$HOH{$line[0]}->{$line[8]}->{'pos'}}, $line[3], $line[4]);
$HOH{$line[0]}->{line[8]}->{'strand'}, $line[5];

}

close PRED;
return %HOH;

}


Now Printing the values:


my %HOH1 = &utility::test_gff($ARGV[0]); # Read the file


for my $scaffold(keys %HOH1){


for my $gene(keys %{$HOH1{$scaffold}}){

my @coord = @{ $HOH1{$scaffold}{$gene}{'pos'}};
my $strand = $HOH1{$scaffold}{$gene}{'strand'};


print "$scaffold -> $gene -> @coord -> $strand\n";


}

}

# If you want to sort the co-ordinates,Then that can be done as below:

my @coord = sort{$a <=> $b} @{ $HOH1{$scaffold}{$gene}{'pos'}};

# Values will be printed as :

Contig632 -> 713969 -> 1687 3391 -> +
Contig1122 -> 707832 -> 3109 3183 3194 3200 3200 3232 3232 3251 -> +
Contig1294 -> 714344 -> 3744 4473 -> +
Contig1294 -> 714343 -> 24 1652 -> -
Contig772 -> 707541 -> 14 64 126 147 147 182 -> +
Contig772 -> 713832 -> 4 81 126 147 147 194 -> +
Contig772 -> 711369 -> 1 64 130 147 147 204 -> +

Cool...

Adding one more dimension:

Mostly in draft genome sequences the genome assembly is not in great shape, so most often same genes can be found in multiple locations(Not necessarily through gene duplication). Sometimes the same gene could be present in the same scaffold in more than one location. In that case to stop over writing or appending the positions for the same gene, another check can be implemented, that is the gene serial number. So, you can just add position as another attribute with seriel number being the key.

push(@{$HOH{$line[0]}->{$line[8]}->{$serialNum}->{'pos'}}, $line[3], $line[4]);
$HOH{$line[0]}->{line[8]}->{$serialNum}->{'strand'}, $line[5];

And while iterating add one more for loop:

for my $scaffold(keys %HOH1){


for my $gene(keys %{$HOH1{$scaffold}}){

for my $serialNum(keys %{%HOH1{$scaffold}{$gene}}){

my @coord = @{ $HOH1{$scaffold}{$gene}{$serialNum}{'pos'}};
my $strand = $HOH1{$scaffold}{$gene}{$serialNum}{'strand'};


print "$scaffold -> $gene -> @coord -> $strand\n";


}

}

The ease at which we can use perl data structures make it so desirable. The coder does not need to worry about memory allocation stuff. Earlier I used to use linked list in C to store sequence data. Then I will be stuck in debugging stuff a lot. Now after I discovered this, I have almost stopped coding in C. Happy perl coding..

No comments: