Friday, August 27, 2010

Plotting SAM output

Output from Nextgeneration sequence alignment to genome assembly comes in SAM(Sequence Alignment Map) format. SAM files can be large, so a binary format called BAM is used most often for ease of handling. While full documentation on samtools can be found here , documentation on SAM format can be found here. For quick reference, let me put the SAM alignment format here:


[qname][flag][rname][pos][mapq][cigar][mrnm][mpos][isize][seq][qual][tag][vtype]
[qname]: Query Name
[flag]:      Is a bitwise operator represented in  decimal format, where the value when converted into binary should have the following meaning:
0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair
0x0002 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment) 1
0x0004 the query sequence itself is unmapped
0x0008 the mate is unmapped 1
0x0010 strand of the query (0 for forward; 1 for reverse strand)
0x0020 strand of the mate 1
0x0040 the read is the first read in a pair 1,2
0x0080 the read is the second read in a pair 1,2
0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)
0x0200 the read fails platform/vendor quality checks
0x0400 the read is either a PCR duplicate or an optical duplicate
Where;
1. Flag 0x02, 0x08, 0x20, 0x40 and 0x80 are only meaningful when flag 0x01 is present.
2. If in a read pair the information on which read is the first in the pair is lost in the upstream analysis, flag 0x01 should be present and 0x40 and 0x80 are both zero.
Example: In our case, we mostly get 0 or 16 as the value, where 0 means(00000000000) forward strand and 16 means(00000010000) reverse strand. We are NOT concerned about rest of the bits because ours is not a paired end alignment.
CIGAR FORMAT:
M Alignment match (can be a sequence match or mismatch)
I Insertion to the reference
D Deletion from the reference
N Skipped region from the reference
S Soft clip on the read (clipped sequence present in )
H Hard clip on the read (clipped sequence NOT present in )
P Padding (silent deletion from the padded reference sequence)
Lets not discuss about the other fields
Samtools view command:
samtools view  $DATAFILE/sorted.bam super_0:1000-30000 | cut -f 2,3,4,10 > tmp2
[One thing to remember here is the sorted bam files need to be indexed before using this command. So, in other words keep the index files(sorted.bam.bai ) in the same directory.
super_0 25699   ATTTAAACTAAGCTACGCTTCCTCACATACACGCGTACACGTGTAAGC 

OR
If you want to see it in human readable format use '-X' after 'samtools view'
The output could be:
        super_0 35110   CGGTTGCTAGCGTTAGTGCTGAGGAAACCCTTTAGATCGTAATCCAGT
r       super_0 41561   TGTCGTGTGTACTGAGAAACTTGTATGATGTCTGAATTCTTCAGGCTG



Where the first line means the forward strand and the second line means the reverse strand