As biologists most of us feel little bit awkward with mathematics and statistics, and more so with the application of mathematics and statistics in interpreting biological data. Recently we see a meteoric rise in availability of biological sequences. However, interpreting them properly still is a bottleneck.
HMMs are there for a long time now, very valuable in speech recognition, weather prediction etc. But recently, we find a surge in application of HMMs in interpreting biological sequences such as gene prediction, profile searches, domain prediction etc. A beginner however gets awed by the complexity and finds it difficult to understand completely how this amazing method/tool can help interpret sequences. Here is my attempt to simplify things a bit.
What is HMM?
Profile HMMs are based on position specific scoring matrices. Lets take the example of the following diagram where we have multiple sequences aligned.
The following are the characteristics:
1. Columns in a multiple sequence alignment is called as a state.
2. Number of alphabets taken together is called as order of the markov chain e.g; zero order, first order and so on.
3. PSSM(Position specific scoring matrix) is calculated for calculating the match states probabilities.
In a HMM, there is a transition probability and an emission probability and there is a state. Lets look at the simplest HMM heuristic:
Suppose say we have a multiple sequence alignment, as represented in Figure - 1.
[ state -> is the column in a multiple sequence alignment.
Transition probabilities are determined using deletions and insertions. Emission probabilities at each state are determined by counting the occurrence of each character in each column. ]
So, lets try to generate the Emission and Transition probabilities from this alignment from the figure.
Emission Probability:
A T G C
Pos1 1.0 0 0 0
Pos2 0 0 0.2 0.8
Pos3 0.2 0 0.4 0
Pos4 0.6 0 0.2 0
Transition Probability:( There are 3 transition states, match, deletion, insertion. So, there can be 9 + 2 transition states:
m->m m->i m->d i->m i->i i->d d->m d->i d->d s->m m->e
Trans 1 0 0 0 0 0 0 0 0 0 1 0
Trans 2 1 0 0 0 0 0 0 0 0 0 0
Trans 3 0.6 0 0.4 0 0 0 0 0 0 0 0
Trans 4 0.8 0 0.2 0 0 0 0 0 0 0 0
Trans 5 0 0 0 0 0 0 0 0 0 0 1
[In this table m stands for match, i stands for insertion and d for deletion, s for start and e for end]
Constructing the HMM:
HMMER software comes with a nice user guide. I recommend you read this guide from cover to cover and if you still did not get it, then continue reading my post.
Use of HMM in aligning against another sequence or profile:
After we create the HMM, we can use it to search for sequences that are similar to the ones used to create the HMM. So, we use the model to estimate the probability that a sequence belongs to the model. This probability is often expressed as negative log likelihood score and is dependent on the length of the sequence and the length of the model. Once we have a score, we need a way to assess its significance, just as we did for pairwise sequence alignments.
HMMs have some distinct advantages. They are built on a formal probabilistic basis. They also allow for more sensitive searching than pairwise alignments do. The probabilistic theory can be used to guide the scoring parameters; the theory also allows us to train an HMM using unaligned sequences if we don’t know or trust the alignment. There is a consistent theory behind gap and insertion penalties. Finally, less skill and human intervention is needed to train a good HMM than to train a hand-constructed profile. Because of this, we can create libraries of hundreds of profile HMMs and apply them on a large scale.
Its one major drawback is that its probabilistic basis does not capture higher-order correlations between sequence positions. It assumes that the identity of a particular position is independent of the identity of all of the other positions in the sequence, and we know that not to be true.
Publicly available program HMMER3 just does this.
Components of Hmmer3 Program:
phmmer: Makes a blastp like search.
jackhmmer: Iteratively searches a sequence against a sequence database (something similar to Psi-blast).
hmmbuild: Builds a profile HMM from multiple sequence alignment.
hmmsearch: Searches a profile hmm against a sequence database(Query: HMM, target: AA).
hmmscan: Searches a sequence against a profile hmm database(Query: AA; target: HMM e.g; pfam).
hmmalign: Make a multiple sequence alignment of many sequences to a existing profile.
hmmconvert: Converts profiles into hmm format and vice versa.
hmmemit: Generate sample sequences from profiles.
hmmfetch: Get a profile HMM by name or accession number from the HMM database
hmmpress: Format an hmm database into a binary format to be used by hmmscan.
hmmstat: Show summary statistics for each profile in an hmm database.
In the next post, we will discuss about Applications of Profile search and command line options of Hmmer3 with examples.
HMMs are there for a long time now, very valuable in speech recognition, weather prediction etc. But recently, we find a surge in application of HMMs in interpreting biological sequences such as gene prediction, profile searches, domain prediction etc. A beginner however gets awed by the complexity and finds it difficult to understand completely how this amazing method/tool can help interpret sequences. Here is my attempt to simplify things a bit.
What is HMM?
Profile HMMs are based on position specific scoring matrices. Lets take the example of the following diagram where we have multiple sequences aligned.
The following are the characteristics:
1. Columns in a multiple sequence alignment is called as a state.
2. Number of alphabets taken together is called as order of the markov chain e.g; zero order, first order and so on.
3. PSSM(Position specific scoring matrix) is calculated for calculating the match states probabilities.
In a HMM, there is a transition probability and an emission probability and there is a state. Lets look at the simplest HMM heuristic:
Suppose say we have a multiple sequence alignment, as represented in Figure - 1.
Figure 1 |
[ state -> is the column in a multiple sequence alignment.
Transition probabilities are determined using deletions and insertions. Emission probabilities at each state are determined by counting the occurrence of each character in each column. ]
So, lets try to generate the Emission and Transition probabilities from this alignment from the figure.
Emission Probability:
A T G C
Pos1 1.0 0 0 0
Pos2 0 0 0.2 0.8
Pos3 0.2 0 0.4 0
Pos4 0.6 0 0.2 0
Transition Probability:( There are 3 transition states, match, deletion, insertion. So, there can be 9 + 2 transition states:
m->m m->i m->d i->m i->i i->d d->m d->i d->d s->m m->e
Trans 1 0 0 0 0 0 0 0 0 0 1 0
Trans 2 1 0 0 0 0 0 0 0 0 0 0
Trans 3 0.6 0 0.4 0 0 0 0 0 0 0 0
Trans 4 0.8 0 0.2 0 0 0 0 0 0 0 0
Trans 5 0 0 0 0 0 0 0 0 0 0 1
[In this table m stands for match, i stands for insertion and d for deletion, s for start and e for end]
Constructing the HMM:
There are multiple algorithms that can be used to build and train HMMs.
i) EM, or Baum-Welsch, algorithm.
This algorithm uses a random MSA, such as the
one constructed by ClustalW. The number of match states is
defined as the number of conserved columns. This algorithm uses a
two stage, iterative process; first, it uses the aligned residues to
build a model, and then it realigns the sequences to the model.
These two steps are repeated until the model no longer changes.
ii)Viterbi algorithm:
Uses dynamic programming
to calculate the best estimate of the emission probabilities.
iii)“Surgery” algorithm:
Uses dynamic
adjustments of the HMM length to calculate the best
models.
HMMER software comes with a nice user guide. I recommend you read this guide from cover to cover and if you still did not get it, then continue reading my post.
Use of HMM in aligning against another sequence or profile:
After we create the HMM, we can use it to search for sequences that are similar to the ones used to create the HMM. So, we use the model to estimate the probability that a sequence belongs to the model. This probability is often expressed as negative log likelihood score and is dependent on the length of the sequence and the length of the model. Once we have a score, we need a way to assess its significance, just as we did for pairwise sequence alignments.
HMMs have some distinct advantages. They are built on a formal probabilistic basis. They also allow for more sensitive searching than pairwise alignments do. The probabilistic theory can be used to guide the scoring parameters; the theory also allows us to train an HMM using unaligned sequences if we don’t know or trust the alignment. There is a consistent theory behind gap and insertion penalties. Finally, less skill and human intervention is needed to train a good HMM than to train a hand-constructed profile. Because of this, we can create libraries of hundreds of profile HMMs and apply them on a large scale.
Its one major drawback is that its probabilistic basis does not capture higher-order correlations between sequence positions. It assumes that the identity of a particular position is independent of the identity of all of the other positions in the sequence, and we know that not to be true.
Publicly available program HMMER3 just does this.
Components of Hmmer3 Program:
phmmer: Makes a blastp like search.
jackhmmer: Iteratively searches a sequence against a sequence database (something similar to Psi-blast).
hmmbuild: Builds a profile HMM from multiple sequence alignment.
hmmsearch: Searches a profile hmm against a sequence database(Query: HMM, target: AA).
hmmscan: Searches a sequence against a profile hmm database(Query: AA; target: HMM e.g; pfam).
hmmalign: Make a multiple sequence alignment of many sequences to a existing profile.
hmmconvert: Converts profiles into hmm format and vice versa.
hmmemit: Generate sample sequences from profiles.
hmmfetch: Get a profile HMM by name or accession number from the HMM database
hmmpress: Format an hmm database into a binary format to be used by hmmscan.
hmmstat: Show summary statistics for each profile in an hmm database.
In the next post, we will discuss about Applications of Profile search and command line options of Hmmer3 with examples.