Thursday, May 17, 2012

Extracting Fastq files from bam/sam files

I blogged earlier about converting colorspace files into fastq format, which is pretty easy [here]. However, that method carries the risk of producing completely wrong reads if the read numbering goes out of sync partway through the file.

A more reliable approach is to extract the fastq files from the sam/bam alignment files. There are quite a few ways to achieve this. One of the easiest is to use samtools to extract the mapped and unmapped reads separately and then merge them. Something like this:

# Extract unmapped reads (flag 4 set) from the bam file
samtools view -f4 input.bam > /data/bowtieOutput/unmapped.sam
# Extract mapped reads (flag 4 not set) from the bam file
samtools view -F4 input.bam > /data/bowtieOutput/mapped.sam

Remember, this method produces files in sam format, so you will need one more step to convert the sam output into fastq.
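
A quick way to do that conversion is a short awk one-liner over the sam columns (read name, sequence and base qualities are columns 1, 10 and 11). This is just a minimal sketch for the unmapped.sam output above; note that reads aligned to the reverse strand are stored reverse-complemented in sam/bam, so mapped reads converted this way will not match the original fastq exactly (newer samtools releases also ship a samtools fastq subcommand that takes care of this).

# Build fastq records from sam columns 1 (name), 10 (sequence) and 11 (qualities)
awk '{print "@"$1"\n"$10"\n+\n"$11}' /data/bowtieOutput/unmapped.sam > /data/bowtieOutput/unmapped.fastq

If you want everything in a single file, the mapped and unmapped fastq files can simply be concatenated with cat afterwards.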

But the only problem with this approach is that a read mapping to multiple locations in the reference is extracted once per alignment, so it shows up several times in the output. For unmapped reads this problem does not exist.
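
One way to reduce this, assuming your aligner flags the extra hits as secondary alignments (not every bowtie reporting mode does), is to filter on the secondary-alignment flag (0x100) together with the unmapped flag when pulling out the mapped reads:

# Keep only primary alignments: -F 260 excludes flag 4 (unmapped) and flag 256 (secondary)
samtools view -F 260 input.bam > /data/bowtieOutput/mapped_primary.sam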

Another, even easier approach is to use bam2fastq, a cool piece of software available here.

Simply use the commands:

# Extracts aligned reads
./bam2fastq --no-unaligned -o out.fastq input.bam
# Extracts unaligned reads
./bam2fastq --no-aligned -o out.fastq1 input.bam

This is one of the fastest and easiest ways to extract mapped and unmapped reads from bam files into fastq format. However, it has the same problem as the sam-based extraction: reads mapped to multiple locations are extracted multiple times. The fastest way around this is to make a unique set from the aligned fastq file. Here are the steps to follow:

1. Sort your bam file before running bam2fastq on it. That way the redundant copies of a read end up next to each other in the output fastq file and are easy to remove (see the pipeline sketch after the script below).
2. Run bam2fastq on the sorted bam as described above.
3. Write a small perl script to fish out the unique reads from the fastq output, as below:

#!/usr/bin/perl
use strict;
use warnings;

# Keep only the first occurrence of each read in a sorted fastq file
open(my $fh, '<', $ARGV[0]) or die "Can't open file for reading: $!\n";

my $prev = "";

while (<$fh>) {
    # Fastq header lines look like @x_y_z_... (SOLiD-style read names)
    if (/\@(\d+_\d+_\d+_)/) {
        if ($prev eq $1) {
            # Duplicate read: skip its sequence, '+' and quality lines
            <$fh> for 0 .. 2;
            next;
        }
        $prev = $1;
    }
    print $_;
}
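
Putting it all together, here is a minimal sketch of the whole pipeline. The script name dedup_fastq.pl and the file names are just placeholders; I sort by read name (-n) here so that all copies of the same read are guaranteed to sit next to each other in the bam, and hence in the fastq.

# 1. Sort the bam by read name (older samtools versions take an output prefix instead of -o)
samtools sort -n -o input.namesorted.bam input.bam
# 2. Extract the aligned reads with bam2fastq
./bam2fastq --no-unaligned -o aligned.fastq input.namesorted.bam
# 3. Drop the redundant copies of multiply-mapped reads
perl dedup_fastq.pl aligned.fastq > aligned.uniq.fastq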

My Results:

          Reads recovered from bam2fastq    Unique reads
          ------------------------------    ------------
expt 1    23921306                            8909517
expt 2    22685196                            8924548
expt 3    40484226                           15248005
expt 4    13659968                            5141253
expt 5    17185935                            6862160
expt 6    19855362                            8978295
expt 7    13189432                            7067910
expt 8    14423653                            6713690


This is quite consistent with what I wanted to achieve!!