Wednesday, March 14, 2012

Pipelines vs. Semi-manual workflow

After working with both systems for quite some time, I am finally here to compare the two processes in a bit more detail.

Development Time and Error Detection:
Pipelines usually have a long development incubation period, and the final outcome is often not bug-free. When something breaks somewhere, it takes the entire crew sitting down and breaking their heads for some time before a solution emerges.
A semi-manual workflow system, on the other hand, is a composition of scripts that are not connected to each other. Steps are defined, but the scripts need to be run one after the other (they can be bundled up in a shell script, though) under somebody's supervision until the run completes. System development does not cost a fortune with one or two dedicated developers. Bugs may exist, but they are easy to fix since one knows exactly where they come from.
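
To make this concrete, here is a minimal sketch in Python of what such a supervised, one-step-at-a-time driver could look like. The step script names, the sh invocation, and the pause-for-inspection prompt are all assumptions for illustration, not the scripts of any particular system.

#!/usr/bin/env python3
"""Minimal sketch of a semi-manual workflow driver (hypothetical)."""
import subprocess
import sys

# Ordered but independent steps; each is a standalone script
# that could just as well be run by hand.
STEPS = [
    "01_download_genome.sh",
    "02_validate_fasta.sh",
    "03_load_sequences.sh",
    "04_run_annotation.sh",
]

for step in STEPS:
    print(f"Running step: {step}")
    if subprocess.call(["sh", step]) != 0:
        # Fail loudly: the supervisor fixes this one script and
        # resumes from here instead of re-running everything.
        sys.exit(f"Step {step} failed; fix it and re-run from this step.")
    # Pause so the person supervising can inspect the output
    # before the next step starts.
    input(f"Step {step} done. Press Enter to continue...")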

Making Changes:

Pipelines: Making any kind of change to a pipeline is not as easy as one might think. It takes a lot of time and effort to add features to an existing pipeline, because doing so requires a whole lot of development both upstream and downstream.
Semi-manual workflow system: Making changes is less cumbersome since each individual step occurs as a separate unit, and quick fixes can be inserted anywhere without making a life-altering change, which makes it relatively easier.
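
As an illustration: in the hypothetical driver sketched above, a quick fix is just one more standalone script slotted into the ordered list, and the steps before and after it are untouched.

STEPS = [
    "01_download_genome.sh",
    "02_validate_fasta.sh",
    "02b_fix_sequence_headers.sh",  # the quick fix, inserted in place
    "03_load_sequences.sh",
    "04_run_annotation.sh",
]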

Working with Dataset classes:
Pipelines: Pipelines often follow strict nomenclature and hierarchy conventions. Parameter passing is not automatic; parameters are passed through classes or some form of XML files, and these are hand-edited. Every time a pipeline fails, you need to re-run the whole process.
Semi-manual workflow system: Parameters are hand-written for each of the steps, which is a little cumbersome, but if they are documented well, one can do it effortlessly. The manual editing time for configuration files is more or less the same for a pipeline and for a semi-automated process.
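
To make the per-step parameter style concrete, here is a minimal sketch of one way it could work: each step reads its own small, hand-written parameter file, so editing one step's parameters cannot break another step. The file name, keys, and the key = value format below are all made up for the example.

# Hypothetical per-step parameter handling on the semi-manual side.

def load_params(path):
    """Parse a simple 'key = value' parameter file for one step."""
    params = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    return params

# e.g. a hand-written 03_load_sequences.params might contain:
#   organism   = Homo sapiens
#   fasta_file = genome.fasta
#   batch_size = 500
params = load_params("03_load_sequences.params")
print(params["organism"])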

Having said that, I will give an example. I work on a database schema called GUS. It is a huge schema system with quite a learning curve before one can understand and work with it. We have already passed that phase, and I have created loose semi-automated steps for genome data upload. At the same time, I am working with another complex system based on a slightly modified version of GUS, but one that has a pipeline built into it. After working incessantly for 5 months, we could upload only one genome at the same depth, compared to 4 genomes in a span of 15 days using the semi-automated system. Given the choice, I will always bet on the semi-automated system.