Wednesday, October 22, 2008

Substantial biases in ultra-short read data sets from high-throughput DNA sequencing

This paper appeared in the September issue of Nucleic Acids Research (September; 36(16): e105).
The authors generated and analyzed two Illumina 1G ultra-short read data sets: 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mer reads from the Helicobacter acinonychis genome. They found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads. Wrong base calls are frequently preceded by base G. Base substitution error frequencies vary by 10- to 11-fold, with A > C transversions among the most frequent and C > G transversions among the least frequent substitution errors.
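To make the two headline measurements concrete (error rate by read position, and counts per substitution type), here is a minimal Python sketch. It is not the authors' pipeline: it assumes each read has already been aligned to the reference without gaps, so a read and its reference segment have equal length, and it assumes that "A > C" means a reference A called as C. The function name and the toy reads are made up for illustration.

```python
from collections import Counter

def position_error_rates(pairs):
    """Tally mismatches per read position and per substitution type.

    pairs: iterable of (read, reference_segment) strings of equal length.
    Assumes ungapped alignments (an illustrative simplification).
    """
    totals = Counter()         # bases observed at each read position
    mismatches = Counter()     # wrong base calls at each read position
    substitutions = Counter()  # keyed by (reference_base, called_base)
    for read, ref in pairs:
        for i, (called, true) in enumerate(zip(read, ref)):
            totals[i] += 1
            if called != true:
                mismatches[i] += 1
                substitutions[(true, called)] += 1
    rates = [mismatches[i] / totals[i] for i in sorted(totals)]
    return rates, substitutions

# Toy data (hypothetical): four 4-base reads against the reference ACGT.
toy_pairs = [
    ("ACGT", "ACGT"),
    ("ACGA", "ACGT"),
    ("ACCT", "ACGT"),
    ("ACGC", "ACGT"),
]
rates, substitutions = position_error_rates(toy_pairs)
```

On real data, `rates` would be expected to rise toward the 3' end of the reads, and `substitutions` would show the uneven spread across substitution types that the paper reports.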


Sequencing Strategies

To achieve high throughput, the new approaches apply different strategies. 454 Life Sciences has adapted pyrosequencing to a microbead format to sequence 400 000 DNA fragments simultaneously, resulting in a per-run dataset of 100 Mbp with reads averaging 250 bp. SOLiD (Applied Biosystems’ Sequencing by Oligonucleotide Ligation and Detection) sequencing also uses templates immobilized onto microbeads. Here, the sequence of the template DNA is decoded by ligation assays involving oligonucleotides labeled with different fluorophores. The SOLiD read length is currently 25–35 bases, and 2–3 Gbp of data can be collected during an 8-day run. Solexa sequencing is based on amplifying single molecules attached to the surface of a flow cell to generate clusters of identical molecules, followed by sequencing using fluorophore-labeled reversible chain terminators. Solexa sequencing proceeds one base at a time, so read length depends on the number of sequencing cycles. Current Illumina sequencing instrumentation achieves read lengths of 36 bases. The Solexa flow cell is composed of eight separately loadable lanes. Since each lane has a capacity of about 5 million reads, > 40 million reads can be generated in a run of 3 days, equivalent to > 1.3 Gbp.
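The per-run yields quoted above follow from multiplying read count by read length. A quick Python check using only the figures in the paragraph (the function and variable names are just for illustration):

```python
def run_yield_bp(reads, read_length_bp):
    """Approximate bases per run: number of reads times read length."""
    return reads * read_length_bp

# Solexa/Illumina: 8 lanes x about 5 million reads per lane, 36-base reads.
solexa_reads = 8 * 5_000_000
solexa_bp = run_yield_bp(solexa_reads, 36)

# 454: about 400 000 fragments per run, reads averaging 250 bp.
roche_454_bp = run_yield_bp(400_000, 250)

print(f"Solexa: {solexa_reads / 1e6:.0f} M reads, {solexa_bp / 1e9:.2f} Gbp")
print(f"454:    {roche_454_bp / 1e6:.0f} Mbp")
```

Forty million 36-base reads come to about 1.44 Gbp, which is consistent with the quoted "> 1.3 Gbp", and the 454 figures reproduce the stated 100 Mbp per run.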

These biases are quite a finding, since many of us are rapidly moving into high-throughput sequencing...
