genomics: June 2012

Monday, June 25, 2012

HMMs what they are good at....

As biologists most of us feel little bit awkward with mathematics and statistics, and more so with the application of mathematics and statistics in interpreting biological data. Recently we see a meteoric rise in availability of biological sequences. However, interpreting them properly still is a bottleneck.
HMMs are there for a long time now, very valuable in speech recognition, weather prediction etc. But recently, we find a surge in application of HMMs in interpreting biological sequences such as gene prediction, profile searches, domain prediction etc. A beginner however gets awed by the complexity and finds it difficult to understand completely how this amazing method/tool can help interpret sequences. Here is my attempt to simplify things a bit.

What is HMM?

Profile HMMs are based on position specific scoring matrices. Lets take the example of the following diagram where we have multiple sequences aligned.
The following are the characteristics:
1. Columns in a multiple sequence alignment is called as a state.
2. Number of alphabets taken together is called as order of the markov chain e.g; zero order, first order and so on.
3. PSSM(Position specific scoring matrix) is calculated for calculating the match states probabilities.

In a HMM, there is a transition probability and an emission probability and there is a state. Lets look at the simplest HMM heuristic:

Suppose say we have a multiple sequence alignment, as represented in Figure - 1.

Figure 1

[ state -> is the column in a multiple sequence alignment.
Transition probabilities are determined using deletions and insertions. Emission probabilities at each state are determined by counting the occurrence of each character in each column. ]

So, lets try to generate the Emission and Transition probabilities from this alignment from the figure.

Emission Probability:
                  A                 T                   G                  C
Pos1          1.0               0                    0                   0

Pos2          0                  0                    0.2                0.8
Pos3          0.2               0                    0.4                0
Pos4          0.6               0                    0.2                0

Transition Probability:( There are 3 transition states, match, deletion, insertion. So, there can be 9 + 2 transition states:
                     m->m     m->i     m->d     i->m    i->i    i->d    d->m      d->i    d->d      s->m       m->e
Trans 1         0             0           0            0         0        0         0             0         0           1             0
Trans 2         1             0           0            0         0        0         0             0         0           0             0
Trans 3         0.6    0 0.4 0 0    0 0 0 0 0 0
Trans 4         0.8          0           0.2         0         0        0         0             0         0           0             0
Trans 5         0             0           0            0         0        0         0             0         0           0             1
[In this table m stands for match, i stands for insertion and d for deletion, s for start and e for end]

Constructing the HMM:

There are multiple algorithms that can be used to build and train HMMs.

i) EM, or Baum-Welsch, algorithm.

This algorithm uses a random MSA, such as the one constructed by ClustalW. The number of match states is defined as the number of conserved columns. This algorithm uses a two stage, iterative process; first, it uses the aligned residues to build a model, and then it realigns the sequences to the model. These two steps are repeated until the model no longer changes.

ii)Viterbi algorithm:

Uses dynamic programming to calculate the best estimate of the emission probabilities.

iii)“Surgery” algorithm:

Uses dynamic adjustments of the HMM length to calculate the best models.

HMMER software comes with a nice user guide. I recommend you read this guide from cover to cover and if you still did not get it, then continue reading my post.

Use of HMM in aligning against another sequence or profile:
After we create the HMM, we can use it to search for sequences that are similar to the ones used to create the HMM. So, we use the model to estimate the probability that a sequence belongs to the model. This probability is often expressed as negative log likelihood score and is dependent on the length of the sequence and the length of the model. Once we have a score, we need a way to assess its significance, just as we did for pairwise sequence alignments.

HMMs have some distinct advantages. They are built on a formal probabilistic basis. They also allow for more sensitive searching than pairwise alignments do. The probabilistic theory can be used to guide the scoring parameters; the theory also allows us to train an HMM using unaligned sequences if we don’t know or trust the alignment. There is a consistent theory behind gap and insertion penalties. Finally, less skill and human intervention is needed to train a good HMM than to train a hand-constructed profile. Because of this, we can create libraries of hundreds of profile HMMs and apply them on a large scale.

Its one major drawback is that its probabilistic basis does not capture higher-order correlations between sequence positions. It assumes that the identity of a particular position is independent of the identity of all of the other positions in the sequence, and we know that not to be true.

Publicly available program HMMER3 just does this.

Components of Hmmer3 Program:
phmmer: Makes a blastp like search.
jackhmmer: Iteratively searches a sequence against a sequence database (something similar to Psi-blast).
hmmbuild: Builds a profile HMM from multiple sequence alignment.
hmmsearch: Searches a profile hmm against a sequence database(Query: HMM, target: AA).
hmmscan: Searches a sequence against a profile hmm database(Query: AA; target: HMM e.g; pfam).
hmmalign: Make a multiple sequence alignment of many sequences to a existing profile.

hmmconvert: Converts profiles into hmm format and vice versa.
hmmemit: Generate sample sequences from profiles.
hmmfetch: Get a profile HMM by name or accession number from the HMM database
hmmpress: Format an hmm database into a binary format to be used by hmmscan.
hmmstat: Show summary statistics for each profile in an hmm database.

In the next post, we will discuss about Applications of Profile search and command line options of Hmmer3 with examples.

Thursday, June 07, 2012

Job opportunity as an environment intern!!!

I got this email to my official email account. Thought will share it here..

www.TerraScapes.org

Environmental Organizer Intern

Job Description:

Do you want to incorporate the environment into your everyday life through environmental activism, advocacy, social justice and event planning? Interning for Terra-Scapes Environmental Consulting is a unique opportunity for a student with a passion for the environment. Engaging the community through educational initiatives about current environmental concerns will be your goal. The Environmental Organizer will work on several pertinent environmental campaigns ranging from protecting our oceans and fisheries to the prohibition of additional chicken farms on the Eastern Shore.

The Environmental Organizer will have the opportunity to gain experience and forge connections in the environmental and green business communities. The position entails working with the following prominent environmental organizations: Pew Environment Group, Sierra Club, Chesapeake Bay Foundation, Lynnhaven River Now, Nature Conservancy, and more.

Responsibilities:

· Research, organize, and execute logistics for environmental workshops and events.

· Recruit and communicate with new and existing volunteers.

· Strengthen and build coalitions with new and existing groups, scientists, students, elected officials, and businesses.

· Work with the media on press releases, advisories, letters to the editor, and opinion editorials.

· Schedule and participate in meetings with local, state, and federal legislators regarding environmental policy initiatives.

· Database organization and entry using Google docs and Excel spreadsheets.

· Research and preparation of reports, documents, and memorandums using Microsoft Office.

· Assist in creation of materials for Terra-Scapes and its coalition partners.

· Develop plan to reach goals with each individual campaign.

Qualifications:

· Grasp office-related computer (Mac and PC) and internet applications.

· Have a passion for the environment and the growth of the green economy.

· Ability to prioritize, work independently, and follow through on concurrent projects and goals.

· Superior organizational skills and ability to meet deadlines required.

· Ability to communicate easily with a diverse range of people.

· Demonstrated experience working independently.

· Owns a vehicle and willing to travel throughout VA for specific workshops or conferences.

· Willingness to meet with Congressional members in Washington, D.C.

· Ability to work from your home/dorm.

Location: Primarily in Hampton Roads, but requires some travel throughout Virginia, North Carolina, Maryland, and Washington, D.C.

To apply, please email resume and cover letter to terra@terrascapes.org

Thanks again!

Terra Pascarosa Duff

TerraScapes Environmental Director

VA Pew Environment Group Representative

Chesapeake Bay Group Sierra Club Chair

www.TerraScapes.org

http://www.pewenvironment.org/

p: 757-291-4736 | f: 866-543-6696 | e: terra@terrascapes.org

"You must be the change you wish to see in the world."
— Mahatma Gandhi