Friday, March 22, 2013

Training Augustus gene predictor for your organism

Lately I have been asked by multiple people to solve the training problem of Augustus for their organism data. Although I have done it before, this time it took me an unusually long time to sort out. Augustus wants you to provide a training dataset in either GenBank or GFF format. It also expects you to remove redundant genes (proteins with more than about 70% identity) in order to avoid overfitting, and to remove any overlapping genes from your training files. If you cannot find a script that does this efficiently, please contact me directly and I will provide you one (sorry, I am yet to set my source code page up).
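
If you want to do the redundancy filtering yourself, one option (my suggestion, not part of the Augustus package; file names here are just placeholders) is CD-HIT, which clusters proteins at a given identity threshold and keeps one representative per cluster:

# cluster the training proteins at 70% identity
# -n 5 is the word size CD-HIT recommends for thresholds around 0.7
cd-hit -i training_proteins.fa -o training_proteins_nr70.fa -c 0.7 -n 5

The identifiers that survive in training_proteins_nr70.fa can then be used to keep only the corresponding genes in your GFF/GenBank training set.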

I have generated a GenBank file from my GFF and annotation files (again, the script can be sent upon request). However, one thing I had overlooked: the script was leaving empty lines between two genomic LOCUS entries. If empty lines are left in, randomSplit.pl, which splits the GenBank training file into two random parts, fails: you either get all your data in the .test file or all of it in the .train file. Suppose you decide to move on with a .train file that holds all the data and an empty .test file. Surprise: you will end up with a std::bad_alloc() error from Augustus, which has nothing to do with the memory of your system. So, if you have an auto-generated GenBank file with empty lines, open the file in vim and get rid of the empty lines using :g/^$/d.
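
If you prefer not to open the file interactively, the same cleanup can be done from the command line, for example with sed (using the file name from the checklist below):

# delete all empty lines, keeping a backup copy of the original
sed -i.bak '/^$/d' psv5.gb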

Once done, try splitting your *.gb file with randomSplit.pl. This will do the magic. Now train your program and it should run fine.

Here is a checklist of the commands you should run to train Augustus.


perl scripts/new_species.pl --species=sojae
perl scripts/randomSplit.pl psv5.gb 30
./bin/etraining --species=sojae psv5.gb.train
# mock prediction
./bin/augustus --species=sojae psv5.gb.test | tee firsttest.out
grep -A 22 Evaluation firsttest.out



This is what you get when you run the last command; it is a pretty good indicator that your training is almost good to go.




*******      Evaluation of gene prediction     *******

---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.861 |       0.674 |
---------------------------------------------/

-----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |   FP = false pos.   |  FN = false neg.   |             |             |
           | total/ | total/ |   TP |---------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp |  wrng | part | ovlp | wrng |             |             |
-----------------------------------------------------------------------------------------------------------|
           |        |        |      |               24559 |              20733 |             |             |
exon level |  34213 |  30387 | 9654 | ------------------- | ------------------ |       0.318 |       0.282 |
           |  34213 |  30387 |      | 8709 | 4167 | 11683 | 9312 | 5502 | 5919 |             |             |
-----------------------------------------------------------------------------------------------------------/

-----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |    FN | sensitivity | specificity |
-----------------------------------------------------------------------------|
gene level | 10348 | 12700 | 2547 | 7801 | 10153 |       0.201 |       0.246 |
-----------------------------------------------------------------------------/

In essence, this output tells you that 2547 genes were predicted exactly as annotated. Although the exon- and gene-level results do not look that good, you can try to optimize the training further so that the various exon types are predicted more accurately:

perl scripts/optimize_augustus.pl psv5.gb.train # This will take a while....
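
Once the optimization finishes, the usual next step is to run etraining again so that the tuned parameters are actually used, and then repeat the accuracy test (the output file name here is just for illustration):

# retrain with the optimized parameters and re-check accuracy
./bin/etraining --species=sojae psv5.gb.train
./bin/augustus --species=sojae psv5.gb.test | tee secondtest.out
grep -A 22 Evaluation secondtest.out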