Aragonites are Calcium carbonate crystals found in pearl.
In December 2009 issue of ProteinSpotlight, a very interesting protein is featured. This protein complex is none other than the one that gives luster to pearl. The complex is made out of three proteins, known as the Pif complex in which is found pearlin, Pif80 and Pif97. Pif80 and 97 are part of the same sequence which is subsequently cleaved into two.Pif80 – by way of its many asparagine residues – binds to calcium carbonate and not only elicits aragonite crystal formation but also has a role in the orientation of the aragonite crystals. Here is the Full link to the article..
Monday, December 21, 2009
Top five Biology papers for 2009
ScienceWatch tracks and analyzes trends in basic science research, compiles bimonthly lists of the 10 most cited papers. From that list The scientist pulled 5 papers published within the last two which were the most cited in 2009. The two topics that dominate the top five papers this year: genomics and stem cells.
Following is the list:
1. K. Takahashi, et al., "Induction of pluripotent stem cells from adult human fibroblasts by defined factors," Cell, 131: 861-72, 2007.
Citations this year: 520
Total citations to date: 886
Findings: This work from Shinya Yamanaka's lab in Japan was the first to demonstrate that induced pluripotent stem (iPS) cells can be generated from adult human dermal fibroblasts. Previous efforts by the team showed that iPS cells could be derived from mouse somatic cells. This paper was an easy top pick, receiving the most citations this year, according to ScienceWatch.
2. K.A. Frazer, et al., "A second generation human haplotype map of over 3.1 million SNPs," Nature, 449: 854-61, 2007.
Citations this year: 389
Total citations to date: 588
Findings: Since the sequencing of the human genome in 2003, the International HapMap Project has explored single nucleotide polymorphisms (SNPs) -- differences in a single letter of the DNA -- to study how these small variations affect the development of diseases and the body's response to pathogens and drugs. HapMap I, the original report, placed one SNP at roughly every 5,000 DNA letters. The newest map, featured in this paper, sequenced an additional 2 million SNPs, increasing the map's resolution to one SNP per kilobase. The additional detail allows scientists to more closely investigate patterns in SNP differences, especially in hotspot regions, or concentrated stretches of DNA.
3. A. Barski, et al., "High-resolution profiling of histone methylations in the human genome," Cell, 129: 823-37, 2007.
Citations this year: 299
Total citations to date:: 560
Findings: This study looked at how histone modifications influence gene expression in more detail than previous attempts. Using a powerful sequencing tool called Solexa 1G, the researchers mapped more than 20 million DNA sequences associated with specific forms of histones, finding there were differences in methylation patterns between stem cells and differentiated T cells.
4. E. Birney, et al., "Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, "Nature, 447: 799-816, 2007.
Citations this year: 267
Total citations to date: 618
Findings: The ENCODE project -- ENCODE stands for the ENCyclopedia Of DNA Elements -- set out to identify all functional elements in the human genome. After examining one percent of the genome, the paper revealed several new insights about how information encoded in the DNA comes to life in a cell.
5. A M. Wernig, et al., "In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state," Nature 448: 318-24, 2007.
Citations this year: 237
Total citations to date: 512
Findings: Scientists successfully performed somatic-cell nuclear transfer (SCNT), producing stem cell lines and cloned animals for the first time using fertilized mouse eggs. This paper consistently ranked in the top 10 most cited papers in 2009, according to ScienceWatch.
Following is the list:
1. K. Takahashi, et al., "Induction of pluripotent stem cells from adult human fibroblasts by defined factors," Cell, 131: 861-72, 2007.
Citations this year: 520
Total citations to date: 886
Findings: This work from Shinya Yamanaka's lab in Japan was the first to demonstrate that induced pluripotent stem (iPS) cells can be generated from adult human dermal fibroblasts. Previous efforts by the team showed that iPS cells could be derived from mouse somatic cells. This paper was an easy top pick, receiving the most citations this year, according to ScienceWatch.
2. K.A. Frazer, et al., "A second generation human haplotype map of over 3.1 million SNPs," Nature, 449: 854-61, 2007.
Citations this year: 389
Total citations to date: 588
Findings: Since the sequencing of the human genome in 2003, the International HapMap Project has explored single nucleotide polymorphisms (SNPs) -- differences in a single letter of the DNA -- to study how these small variations affect the development of diseases and the body's response to pathogens and drugs. HapMap I, the original report, placed one SNP at roughly every 5,000 DNA letters. The newest map, featured in this paper, sequenced an additional 2 million SNPs, increasing the map's resolution to one SNP per kilobase. The additional detail allows scientists to more closely investigate patterns in SNP differences, especially in hotspot regions, or concentrated stretches of DNA.
3. A. Barski, et al., "High-resolution profiling of histone methylations in the human genome," Cell, 129: 823-37, 2007.
Citations this year: 299
Total citations to date:: 560
Findings: This study looked at how histone modifications influence gene expression in more detail than previous attempts. Using a powerful sequencing tool called Solexa 1G, the researchers mapped more than 20 million DNA sequences associated with specific forms of histones, finding there were differences in methylation patterns between stem cells and differentiated T cells.
4. E. Birney, et al., "Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, "Nature, 447: 799-816, 2007.
Citations this year: 267
Total citations to date: 618
Findings: The ENCODE project -- ENCODE stands for the ENCyclopedia Of DNA Elements -- set out to identify all functional elements in the human genome. After examining one percent of the genome, the paper revealed several new insights about how information encoded in the DNA comes to life in a cell.
5. A M. Wernig, et al., "In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state," Nature 448: 318-24, 2007.
Citations this year: 237
Total citations to date: 512
Findings: Scientists successfully performed somatic-cell nuclear transfer (SCNT), producing stem cell lines and cloned animals for the first time using fertilized mouse eggs. This paper consistently ranked in the top 10 most cited papers in 2009, according to ScienceWatch.
Thursday, October 15, 2009
Death of End Note
I guess now End Note software will die a natural death!! I am not saying this just because I am not particularly fond of this software, but lately it has fierce competitions from various resources including my NCBI. Before My NCBI came into picture, I was very fond of connotea. Connotea is very easy to link to a browser, all that you will need is to drag and drop it to the menu bar. You can create your own login and password and update all the bibliography you ever wanted. Export the bibliography as you wish.
With My NCBI, it has struck the nail right on the head of End Note. Tutorials are readily available for My NCBI for easy use. Anytime, I would prefer a browser based application than a stand alone application. End note particularly needed endless filling up forms very tedious ways. I had bad experience of interference of end note with MS office 2007. Now I will breath easy writing a manuscript.
With My NCBI, it has struck the nail right on the head of End Note. Tutorials are readily available for My NCBI for easy use. Anytime, I would prefer a browser based application than a stand alone application. End note particularly needed endless filling up forms very tedious ways. I had bad experience of interference of end note with MS office 2007. Now I will breath easy writing a manuscript.
Monday, October 05, 2009
This years Nobel Prize in Medicine
This years Nobel prize in Medicine goes jointly to Elizabeth Blackburn, Carol Greider, and Jack Szostak for their work on telomeres. In the 1970s, Blackburn identified repeating segments at the ends of DNA in Tetrahymena while Szostak found that single-stranded DNA was rapidly degraded in yeast. Blackburn and Szostak then collaborated on a project, finding that the Tetrahymena DNA protected the single-stranded DNA from degradation in yeast. In 1984, Blackburn and Greider, her graduate student, discovered telomerase. The Nobel citation lauds these researchers for their contribution to the study of aging, cancer, and other diseases. "The discoveries by Blackburn, Greider and Szostak have added a new dimension to our understanding of the cell, shed light on disease mechanisms, and stimulated the development of potential new therapies," it says.
The mysterious telomere
The chromosomes contain our genome in their DNA molecules. As early as the 1930s, Hermann Muller (Nobel Prize 1946) and Barbara McClintock (Nobel Prize 1983) had observed that the structures at the ends of the chromosomes, the so-called telomeres, seemed to prevent the chromosomes from attaching to each other. They suspected that the telomeres could have a protective role, but how they operate remained an enigma.
When scientists began to understand how genes are copied, in the 1950s, another problem presented itself. When a cell is about to divide, the DNA molecules, which contain the four bases that form the genetic code, are copied, base by base, by DNA polymerase enzymes. However, for one of the two DNA strands, a problem exists in that the very end of the strand cannot be copied. Therefore, the chromosomes should be shortened every time a cell divides – but in fact that is not usually the case (Fig 1).
Both these problems were solved when this year's Nobel Laureates discovered how the telomere functions and found the enzyme that copies it.
Telomere DNA protects the chromosomes
In the early phase of her research career, Elizabeth Blackburn mapped DNA sequences. When studying the chromosomes of Tetrahymena, a unicellular ciliate organism, she identified a DNA sequence that was repeated several times at the ends of the chromosomes. The function of this sequence, CCCCAA, was unclear. At the same time, Jack Szostak had made the observation that a linear DNA molecule, a type of minichromosome, is rapidly degraded when introduced into yeast cells.
Blackburn presented her results at a conference in 1980. They caught Jack Szostak's interest and he and Blackburn decided to perform an experiment that would cross the boundaries between very distant species (Fig 2). From the DNA of Tetrahymena, Blackburn isolated the CCCCAA sequence. Szostak coupled it to the minichromosomes and put them back into yeast cells. The results, which were published in 1982, were striking – the telomere DNA sequence protected the minichromosomes from degradation. As telomere DNA from one organism, Tetrahymena, protected chromosomes in an entirely different one, yeast, this demonstrated the existence of a previously unrecognized fundamental mechanism. Later on, it became evident that telomere DNA with its characteristic sequence is present in most plants and animals, from amoeba to man.
An enzyme that builds telomeres
Carol Greider, then a graduate student, and her supervisor Blackburn started to investigate if the formation of telomere DNA could be due to an unknown enzyme. On Christmas Day, 1984, Greider discovered signs of enzymatic activity in a cell extract. Greider and Blackburn named the enzyme telomerase, purified it, and showed that it consists of RNA as well as protein (Fig 3). The RNA component turned out to contain the CCCCAA sequence. It serves as the template when the telomere is built, while the protein component is required for the construction work, i.e. the enzymatic activity. Telomerase extends telomere DNA, providing a platform that enables DNA polymerases to copy the entire length of the chromosome without missing the very end portion.
Blackburn was also the Daily Scan poll favorite. She led the pack with 43 percent of the vote. She was followed by her co-laureate Szostak who garnered 25 percent of the vote.
References:
Szostak JW, Blackburn EH. Cloning yeast telomeres on linear plasmid vectors. Cell 1982; 29:245-255.
Greider CW, Blackburn EH. Identification of a specific telomere terminal transferase activity in Tetrahymena extracts. Cell 1985; 43:405-13.
Greider CW, Blackburn EH. A telomeric sequence in the RNA of Tetrahymena telomerase required for telomere repeat synthesis. Nature 1989; 337:331-7.
The mysterious telomere
The chromosomes contain our genome in their DNA molecules. As early as the 1930s, Hermann Muller (Nobel Prize 1946) and Barbara McClintock (Nobel Prize 1983) had observed that the structures at the ends of the chromosomes, the so-called telomeres, seemed to prevent the chromosomes from attaching to each other. They suspected that the telomeres could have a protective role, but how they operate remained an enigma.
When scientists began to understand how genes are copied, in the 1950s, another problem presented itself. When a cell is about to divide, the DNA molecules, which contain the four bases that form the genetic code, are copied, base by base, by DNA polymerase enzymes. However, for one of the two DNA strands, a problem exists in that the very end of the strand cannot be copied. Therefore, the chromosomes should be shortened every time a cell divides – but in fact that is not usually the case (Fig 1).
Both these problems were solved when this year's Nobel Laureates discovered how the telomere functions and found the enzyme that copies it.
Telomere DNA protects the chromosomes
In the early phase of her research career, Elizabeth Blackburn mapped DNA sequences. When studying the chromosomes of Tetrahymena, a unicellular ciliate organism, she identified a DNA sequence that was repeated several times at the ends of the chromosomes. The function of this sequence, CCCCAA, was unclear. At the same time, Jack Szostak had made the observation that a linear DNA molecule, a type of minichromosome, is rapidly degraded when introduced into yeast cells.
Blackburn presented her results at a conference in 1980. They caught Jack Szostak's interest and he and Blackburn decided to perform an experiment that would cross the boundaries between very distant species (Fig 2). From the DNA of Tetrahymena, Blackburn isolated the CCCCAA sequence. Szostak coupled it to the minichromosomes and put them back into yeast cells. The results, which were published in 1982, were striking – the telomere DNA sequence protected the minichromosomes from degradation. As telomere DNA from one organism, Tetrahymena, protected chromosomes in an entirely different one, yeast, this demonstrated the existence of a previously unrecognized fundamental mechanism. Later on, it became evident that telomere DNA with its characteristic sequence is present in most plants and animals, from amoeba to man.
An enzyme that builds telomeres
Carol Greider, then a graduate student, and her supervisor Blackburn started to investigate if the formation of telomere DNA could be due to an unknown enzyme. On Christmas Day, 1984, Greider discovered signs of enzymatic activity in a cell extract. Greider and Blackburn named the enzyme telomerase, purified it, and showed that it consists of RNA as well as protein (Fig 3). The RNA component turned out to contain the CCCCAA sequence. It serves as the template when the telomere is built, while the protein component is required for the construction work, i.e. the enzymatic activity. Telomerase extends telomere DNA, providing a platform that enables DNA polymerases to copy the entire length of the chromosome without missing the very end portion.
Blackburn was also the Daily Scan poll favorite. She led the pack with 43 percent of the vote. She was followed by her co-laureate Szostak who garnered 25 percent of the vote.
References:
Szostak JW, Blackburn EH. Cloning yeast telomeres on linear plasmid vectors. Cell 1982; 29:245-255.
Greider CW, Blackburn EH. Identification of a specific telomere terminal transferase activity in Tetrahymena extracts. Cell 1985; 43:405-13.
Greider CW, Blackburn EH. A telomeric sequence in the RNA of Tetrahymena telomerase required for telomere repeat synthesis. Nature 1989; 337:331-7.
Friday, October 02, 2009
Oldest Human ancestor discovered
How old are humans? Until recently I thought Lucy to be our oldest ancestor, that is around 3.9 to 2.9 million years old. A recent finding by a group of scientists reveals our oldest known ancestor "Lucy" is now replaced by "Ardi"(Ardipithecus ramidus) which is older by 1.4 million years than Lucy.This individual, 'Ardi,' was a female who weighed about 50 kilograms and stood about 120 centimetres tall.
In its 2 October 2009 issue, Science presents 11 papers, authored by a diverse international team, describing an early hominid species, Ardipithecus ramidus, and its environment. These 4.4 million year old hominid fossils sit within a critical early part of human evolution, and cast new and sometimes surprising light on the evolution of human limbs and locomotion, the habitats occupied by early hominids, and the nature of our last common ancestor with chimps.
Science is making access to this extraordinary set of materials FREE (non-subscribers require a simple registration). The complete collection, and abridged versions, are available FREE as PDF downloads for AAAS members, or may be purchased as reprints.
The last common ancestor shared by humans and chimpanzees is thought to have lived six or more million years ago. Though Ardipithecus is not itself this last common ancestor, it likely shared many of this ancestor's characteristics. Ardi is closer to humans than chimps. Measuring in at 47 in. (120 cm) tall and 110 lb. (50 kg), Ardi likely walked with a strange gait, lurching side to side, due to lack of an arch in its feet, a feature of later hominids. It had somewhat monkey-like feet, with opposable toes, but its feet were not flexible enough to grab onto vines or tree trunks like many monkeys -- rather they were good enough to provide extra support during quick walks along tree branches -- called palm walking.
Another surprise comes in Ardi's environment. Ardi lived in a lush grassy African woodland, with creatures such as colobus monkeys, baboons, elephants, spiral-horned antelopes, hyenas, shrews, hares, porcupines, bats, peacocks, doves, lovebirds, swifts and owls. Fig trees grew around much of the area, and it is speculated that much of Ardi's diet consisted of these figs.
The surprise about the environment is that it lays to rest the theory that hominids developed upright walking when Africa's woodland-grassland mix changed to grassy savanna. Under this now theory, hominids began standing and walking upright as a way of seeing predators over the tall grasses. The discovery of Ardi -- an earlier upright walker that lived in woodland -- greatly weakens this theory.
Scientists have theorized that Ardi may have formed human-like relationships with pairing between single males and females. Evidence of this is found in the male's teeth, which lack the long canines that gorillas and other non-monogamous apes use to battle for females. Describes Professor Lovejoy, "The male canine tooth is no longer projecting or sharp. It's no longer weaponry."
In its 2 October 2009 issue, Science presents 11 papers, authored by a diverse international team, describing an early hominid species, Ardipithecus ramidus, and its environment. These 4.4 million year old hominid fossils sit within a critical early part of human evolution, and cast new and sometimes surprising light on the evolution of human limbs and locomotion, the habitats occupied by early hominids, and the nature of our last common ancestor with chimps.
Science is making access to this extraordinary set of materials FREE (non-subscribers require a simple registration). The complete collection, and abridged versions, are available FREE as PDF downloads for AAAS members, or may be purchased as reprints.
The last common ancestor shared by humans and chimpanzees is thought to have lived six or more million years ago. Though Ardipithecus is not itself this last common ancestor, it likely shared many of this ancestor's characteristics. Ardi is closer to humans than chimps. Measuring in at 47 in. (120 cm) tall and 110 lb. (50 kg), Ardi likely walked with a strange gait, lurching side to side, due to lack of an arch in its feet, a feature of later hominids. It had somewhat monkey-like feet, with opposable toes, but its feet were not flexible enough to grab onto vines or tree trunks like many monkeys -- rather they were good enough to provide extra support during quick walks along tree branches -- called palm walking.
Another surprise comes in Ardi's environment. Ardi lived in a lush grassy African woodland, with creatures such as colobus monkeys, baboons, elephants, spiral-horned antelopes, hyenas, shrews, hares, porcupines, bats, peacocks, doves, lovebirds, swifts and owls. Fig trees grew around much of the area, and it is speculated that much of Ardi's diet consisted of these figs.
The surprise about the environment is that it lays to rest the theory that hominids developed upright walking when Africa's woodland-grassland mix changed to grassy savanna. Under this now theory, hominids began standing and walking upright as a way of seeing predators over the tall grasses. The discovery of Ardi -- an earlier upright walker that lived in woodland -- greatly weakens this theory.
Scientists have theorized that Ardi may have formed human-like relationships with pairing between single males and females. Evidence of this is found in the male's teeth, which lack the long canines that gorillas and other non-monogamous apes use to battle for females. Describes Professor Lovejoy, "The male canine tooth is no longer projecting or sharp. It's no longer weaponry."
Friday, September 04, 2009
Restoring your dell laptop to factory setup
I have a dell laptop that had become extremely slow. I was looking at the options of cleaning the resgistry, cleaning up temporary internet files and all other available suggestions on internet. However, this did not help much. My laptop became almost unusable!! My desperate searches in the internet led me to find something like "While booting your system hold your control and f10 key" or "While booting your system hold your control and f11 key" or "While booting your system hold your control and f12 key" to get to setup option. I tried with F10, but it did not work, so I left the rest of the options.
The best way to look for the right combination of keys is to get hold of your DELL manual that comes along with your installation. It is sometimes also called "Owners manual". Once you got hold of it look for the "solving problem" -> "Restoring your operating system" section. Inside this section, you will find specific instructions as to how to get back to your factory default mode. For example mine says the following:
To use PC Restore:
1 Turn on the computer.
During the boot process, a blue bar with www.dell.com appears at the top of the screen.
2 Immediately upon seeing the blue bar, press < Ctrl >< F11 >.
If you do not press < Ctrl >< F11 > in time, let the computer finish starting, and then restart
the computer again.
NOTICE: If you do not want to proceed with PC Restore, click Reboot in the following step.
3 On the next screen that appears, click Restore.
4 On the next screen, click Confirm.
The restore process takes approximately 6–10 minutes to complete.
5 When prompted, click Finish to reboot the computer.
NOTE: Do not manually shut down the computer. Click Finish and let the computer completely
reboot.
6 When prompted, click Yes.
The computer restarts. Because the computer is restored to its original operating state, the
screens that appear, such as the End User License Agreement, are the same ones that
appeared the first time the computer was turned on.
7 Click Next.
The System Restore screen appears and the computer restarts.
8 After the computer restarts, click OK.
And needless to mention this worked like charm!! I got the desirable result. My computer works like new!! Only trouble is, you would have lost all your data and any new installations. But I don't think this is a big price to pay for this great change in performance. Installing softwares as you need and backing up data is something does not take a lot of effort.. So happy restoring your system.
The best way to look for the right combination of keys is to get hold of your DELL manual that comes along with your installation. It is sometimes also called "Owners manual". Once you got hold of it look for the "solving problem" -> "Restoring your operating system" section. Inside this section, you will find specific instructions as to how to get back to your factory default mode. For example mine says the following:
To use PC Restore:
1 Turn on the computer.
During the boot process, a blue bar with www.dell.com appears at the top of the screen.
2 Immediately upon seeing the blue bar, press < Ctrl >< F11 >.
If you do not press < Ctrl >< F11 > in time, let the computer finish starting, and then restart
the computer again.
NOTICE: If you do not want to proceed with PC Restore, click Reboot in the following step.
3 On the next screen that appears, click Restore.
4 On the next screen, click Confirm.
The restore process takes approximately 6–10 minutes to complete.
5 When prompted, click Finish to reboot the computer.
NOTE: Do not manually shut down the computer. Click Finish and let the computer completely
reboot.
6 When prompted, click Yes.
The computer restarts. Because the computer is restored to its original operating state, the
screens that appear, such as the End User License Agreement, are the same ones that
appeared the first time the computer was turned on.
7 Click Next.
The System Restore screen appears and the computer restarts.
8 After the computer restarts, click OK.
And needless to mention this worked like charm!! I got the desirable result. My computer works like new!! Only trouble is, you would have lost all your data and any new installations. But I don't think this is a big price to pay for this great change in performance. Installing softwares as you need and backing up data is something does not take a lot of effort.. So happy restoring your system.
Tuesday, July 21, 2009
ncftp and wget
As a bioinformatics researcher, it often becomes imperative to download a large amount of data sets from various servers. The most frequent data download site is NCBI and/or EBI. While most of the raw data can be found at NCBI, EBI hosts something much more curated. One nightmare I often face is, updating interproscan database. The data files are something like 9GB and take a lot of time to download, often exceeding the data limit for our server that downloads it. The most irritating of all is when it times out or says "the message list is too long". The best way to handle this trouble could be by using "wget" or "ncftp".
It is very simple to use and is very user friendly. Wget is very reliable and robust. It is specially designed for the home network if you are working from home using an unreliable network. If a download does not complete due to a network problem, Wget will automatically try to continue the download from where it left off, and repeat this until the whole file has been retrieved. It was one of the first clients to make use of the then-new Range HTTP header to support this feature.
If you are using the http protocol with wget then the format will be something like this:
wget --no-check-certificate https://login:passwd@site-address//path
or
wget ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz
Or more command info can be found by doing "man wget".
While wget is really cool, ncftp is another ftp protocol, that is sometimes much better than other existing methods, if you must use a ftp protocol for data download. A typical ncftp command could be:
ncftp -u login -p pass ftp://ftp.hostname.edu
Then use get command to get the files of interest.
It is very simple to use and is very user friendly. Wget is very reliable and robust. It is specially designed for the home network if you are working from home using an unreliable network. If a download does not complete due to a network problem, Wget will automatically try to continue the download from where it left off, and repeat this until the whole file has been retrieved. It was one of the first clients to make use of the then-new Range HTTP header to support this feature.
If you are using the http protocol with wget then the format will be something like this:
wget --no-check-certificate https://login:passwd@site-address//path
or
wget ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz
Or more command info can be found by doing "man wget".
While wget is really cool, ncftp is another ftp protocol, that is sometimes much better than other existing methods, if you must use a ftp protocol for data download. A typical ncftp command could be:
ncftp -u login -p pass ftp://ftp.hostname.edu
Then use get command to get the files of interest.
Thursday, July 16, 2009
Finding difference between 2 files
In bioinformatics work, one often encounters something like finding difference between 2 sequence files or finding Union and intersection between 2 datasets. This type of operation often needs comparison of one file with the other.
The most common way to do this job is to loop over first file and then loop over second file finding the match, if found terminate the internal loop reset the file pointer and again move on to the second element in the outer loop. This operation need O(M * N) time space. If size of M and N are huge than the amount of time space requirement is humongous. Using a binary search may reduce the time by O(log2 M * N), but the price will be paid again in sorting the files. If the files are non numeric, then the additional burden of converting strings to numbers will add to the price tag. So, recently, I have found using hash tables for finding an element, instead of looping through a second time is a great substitute for binary search. The files I was trying to search against were each having 2 million records. So, in a serial search, the file would have iterated 2 * 2 miilion times, comparing words 4 million times and resetting file 2 million times. While I was running this using the serial version the program ran for 2 days and was still running. The second approach just few minutes to run. Check this out..
Code for serial search:
#!/usr/bin/perl -w
# This script will find the difference between two list files
# Here find the difference between FH1 - FH2
open FH1, $ARGV[0] or die "Can't open file for reading $!\n";
open FH2, $ARGV[1] or die "Can't open file for reading $!\n";
my $flag = 1;
while(){
$flag = 1;
my $toBeSubtracted = $_;
while(){
if($_ eq $toBeSubtracted){
$flag =0;
last;
}
}
if($flag == 1){
print $toBeSubtracted;
}
seek(FH2,0,0);
}
Code using hash table:
#!/usr/bin/perl -w
# Subtract FH - FH1
open FH, $ARGV[0] or die "Can't open the file to be subtracted$!\n";
open FH1, $ARGV[1] or die "Can't open the file to be subtracted$!\n";
my %hash1;
my %hash2;
while(){
my $tmp = $_;
chomp($tmp);
$hash1{$tmp} = 1;
}
close(FH);
while(){
my $tmp = $_;
chomp($tmp);
$tmp =~ s/\t.*//g;
$hash2{$tmp} = 1;
}
close(FH1);
foreach my $key(keys %hash1){
if(exists($hash2{$key})){
#do nothing
}
else {
print "$key\n";
}
}
The most common way to do this job is to loop over first file and then loop over second file finding the match, if found terminate the internal loop reset the file pointer and again move on to the second element in the outer loop. This operation need O(M * N) time space. If size of M and N are huge than the amount of time space requirement is humongous. Using a binary search may reduce the time by O(log2 M * N), but the price will be paid again in sorting the files. If the files are non numeric, then the additional burden of converting strings to numbers will add to the price tag. So, recently, I have found using hash tables for finding an element, instead of looping through a second time is a great substitute for binary search. The files I was trying to search against were each having 2 million records. So, in a serial search, the file would have iterated 2 * 2 miilion times, comparing words 4 million times and resetting file 2 million times. While I was running this using the serial version the program ran for 2 days and was still running. The second approach just few minutes to run. Check this out..
Code for serial search:
#!/usr/bin/perl -w
# This script will find the difference between two list files
# Here find the difference between FH1 - FH2
open FH1, $ARGV[0] or die "Can't open file for reading $!\n";
open FH2, $ARGV[1] or die "Can't open file for reading $!\n";
my $flag = 1;
while(
$flag = 1;
my $toBeSubtracted = $_;
while(
if($_ eq $toBeSubtracted){
$flag =0;
last;
}
}
if($flag == 1){
print $toBeSubtracted;
}
seek(FH2,0,0);
}
Code using hash table:
#!/usr/bin/perl -w
# Subtract FH - FH1
open FH, $ARGV[0] or die "Can't open the file to be subtracted$!\n";
open FH1, $ARGV[1] or die "Can't open the file to be subtracted$!\n";
my %hash1;
my %hash2;
while(
my $tmp = $_;
chomp($tmp);
$hash1{$tmp} = 1;
}
close(FH);
while(
my $tmp = $_;
chomp($tmp);
$tmp =~ s/\t.*//g;
$hash2{$tmp} = 1;
}
close(FH1);
foreach my $key(keys %hash1){
if(exists($hash2{$key})){
#do nothing
}
else {
print "$key\n";
}
}
Cool unix tips
I work with large scale genome sequences, so I often have to handle huge files. The files many times have some unwanted characters and formats that I try to correct using vim substitute command. Although very powerful, substitution operation on huge files using vim is very time consuming!!
A very powerful option to vim substitution is sed(stream editor) utility. This utility is often under utilized because of lack of proper documentation. Some of very powerful file editing can be done using sed.
Simple Substitution:
sed 's/\/usr\/local\/bin/\/common\/bin/'new # will replace /usr/local/bin by /common/bin
Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underline instead of a slash as a delimiter:
sed 's_/usr/local/bin_/common/bin_'new
Some people use colons:
sed 's:/usr/local/bin:/common/bin:'new
Others use the "|" character.
sed 's|/usr/local/bin|/common/bin|'new
Replacing a pattern with something else:
In a file if the content is something like: '123 abc' and you want to make it 123 123 abc, then do the following:
sed 's/[0-9]*/& &/' < old > new # Here '&' serves as the pattern
Meaning of \1 and \2 till \9
These simply mean the pattern number. Altogether sed can remember upto 9 patterns.
Suppose say in a file you would like to replace "1234 abcd" by "abcd 1234" then do the following;
sed 's/\([0-9]*\) \([a-z]*\)/\2 \1/'. Here notice the space between the two patterns.
Substituting only in some lime lines then use the following:
sed '101,532 s/A/a/' or sed '101,$ s/A/a/'
Deleting lines beginning with # is easy:
sed '/^#/ d'
This one will remove comments and blank lines:
sed -e 's/#.*//' -e '/^$/ d'
This one will remove all tabs and blanks before end of the line:
sed -e 's/#.*//' -e 's/[ ^I]*$//' -e '/^$/ d'
A very powerful option to vim substitution is sed(stream editor) utility. This utility is often under utilized because of lack of proper documentation. Some of very powerful file editing can be done using sed.
Simple Substitution:
sed 's/\/usr\/local\/bin/\/common\/bin/'
Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underline instead of a slash as a delimiter:
sed 's_/usr/local/bin_/common/bin_'
Some people use colons:
sed 's:/usr/local/bin:/common/bin:'
Others use the "|" character.
sed 's|/usr/local/bin|/common/bin|'
Replacing a pattern with something else:
In a file if the content is something like: '123 abc' and you want to make it 123 123 abc, then do the following:
sed 's/[0-9]*/& &/' < old > new # Here '&' serves as the pattern
Meaning of \1 and \2 till \9
These simply mean the pattern number. Altogether sed can remember upto 9 patterns.
Suppose say in a file you would like to replace "1234 abcd" by "abcd 1234" then do the following;
sed 's/\([0-9]*\) \([a-z]*\)/\2 \1/'. Here notice the space between the two patterns.
Substituting only in some lime lines then use the following:
sed '101,532 s/A/a/' or sed '101,$ s/A/a/'
Deleting lines beginning with # is easy:
sed '/^#/ d'
This one will remove comments and blank lines:
sed -e 's/#.*//' -e '/^$/ d'
This one will remove all tabs and blanks before end of the line:
sed -e 's/#.*//' -e 's/[ ^I]*$//' -e '/^$/ d'
Subscribe to:
Posts (Atom)