Earlier I have blogged about converting colorspace files into fastq format, which is pretty easy [here]. However, this method does not go without the risk of getting completely wrong reads if there is an error in the read number in the middle.
There is another approach to it which is reliable - that is to extract the fastq files from the sam/bam alignment files. There are quite a few different ways you can achieve this. One of the easiest is to use samtools to extract matching and unmatching reads separately and merging them together. Something like this:
# Extracting mapping and non-mapping reads from bam files
samtools view -f4 input.bam > /data/bowtieOutput/unmapped.sam
# This is for extracting mapping reads from bam files
samtools view -F4 input.bam > /data/bowtieOutput/mapped.sam
Remember, this method extracts files in sam format, so you may have to get into another step to extract fastq files from sam files
But the only problem with this approach is, you end up getting reads that are mapped to the reference sequence multiple times in multiple numbers. For unmapped reads this problem does not exist.
There is another easier approach is also to use bamtofastq . This is a cool software available here.
Simply use command:
# Extracts aligned reads
./bam2fastq --no-unaligned -o out.fastq input.bam
#Extracts un-aligned reads
./bam2fastq --no-aligned -o out.fastq1 input.bam
This is one of the fastest and easiest methods to extract mapping and unmapping reads from bam files into fastq format. However, this also has similar problem like the sam extractor - that is, it extracts reads mapped to multiple locations multiple times. In order to get around this the fastest way would be to make a uniq set from the aligned.fastq file. Here are the steps that needs to be followed to get it done faster.
1. Sort your bam file before running bam2fastq on them. This way the output fastq file will have all the redundant reads in a sorted order and hence easier to remove redundant.
2. Run bam2fastq on them as described above.
3. Write a small perl script to fish out the uniq reads from the fastq output file as below:
open FH, $ARGV[0] or die "Can't open file for reading $!\n";
my $prev="";
while(<FH>){
if(/\@(\d+_\d+_\d+_)/){
if($prev eq $1){
# Skips reading 3 lines
for(0..2){
<FH>
}
next;
}
$prev = $1;
}
print $_;
}
My Results:
Reads recovered from bam2fastq uniq reads
---------------------------------------- -------------
expt 1 23921306 8909517
expt 2 22685196 8924548
expt 3 40484226 15248005
expt 4 13659968 5141253
expt 5 17185935 6862160
expt 6 19855362 8978295
expt 7 13189432 7067910
expt 8 14423653 6713690
This is quite consistent with what I wanted to achieve!!
There is another approach to it which is reliable - that is to extract the fastq files from the sam/bam alignment files. There are quite a few different ways you can achieve this. One of the easiest is to use samtools to extract matching and unmatching reads separately and merging them together. Something like this:
# Extracting mapping and non-mapping reads from bam files
samtools view -f4 input.bam > /data/bowtieOutput/unmapped.sam
# This is for extracting mapping reads from bam files
samtools view -F4 input.bam > /data/bowtieOutput/mapped.sam
Remember, this method extracts files in sam format, so you may have to get into another step to extract fastq files from sam files
But the only problem with this approach is, you end up getting reads that are mapped to the reference sequence multiple times in multiple numbers. For unmapped reads this problem does not exist.
There is another easier approach is also to use bamtofastq . This is a cool software available here.
Simply use command:
# Extracts aligned reads
./bam2fastq --no-unaligned -o out.fastq input.bam
#Extracts un-aligned reads
./bam2fastq --no-aligned -o out.fastq1 input.bam
This is one of the fastest and easiest methods to extract mapping and unmapping reads from bam files into fastq format. However, this also has similar problem like the sam extractor - that is, it extracts reads mapped to multiple locations multiple times. In order to get around this the fastest way would be to make a uniq set from the aligned.fastq file. Here are the steps that needs to be followed to get it done faster.
1. Sort your bam file before running bam2fastq on them. This way the output fastq file will have all the redundant reads in a sorted order and hence easier to remove redundant.
2. Run bam2fastq on them as described above.
3. Write a small perl script to fish out the uniq reads from the fastq output file as below:
open FH, $ARGV[0] or die "Can't open file for reading $!\n";
my $prev="";
while(<FH>){
if(/\@(\d+_\d+_\d+_)/){
if($prev eq $1){
# Skips reading 3 lines
for(0..2){
<FH>
}
next;
}
$prev = $1;
}
print $_;
}
My Results:
Reads recovered from bam2fastq uniq reads
---------------------------------------- -------------
expt 1 23921306 8909517
expt 2 22685196 8924548
expt 3 40484226 15248005
expt 4 13659968 5141253
expt 5 17185935 6862160
expt 6 19855362 8978295
expt 7 13189432 7067910
expt 8 14423653 6713690
This is quite consistent with what I wanted to achieve!!