Converting SAM/BAM to FASTQ

Data Types: SAM, BAM

Usage

genocat --fastq [genozip files...]

Viewing only R1 reads:

genocat --fastq --FLAG +0x40 [genozip files...]

Viewing only R2 reads:

genocat --fastq --FLAG +0x80 [genozip files...]

Outputting all SAM fields on the FASTQ description line:

genocat --fastq=all [genozip files...]

Outputting fq.gz (BGZF-compressed FASTQ):

genocat myfile.bam.genozip --output myfile.fq.gz

Description

Displays the contents of the SAM / BAM data in FASTQ format:

  • The alignments are output as FASTQ reads in the order they appear in the SAM/BAM file.

  • Alignments with FLAG 0x10 (reverse complemented) have their SEQ reverse complemented and their QUAL reversed.

  • Alignments with FLAG 0x0800 (supplementary) or 0x0100 (secondary) are dropped.

  • Alignments with FLAG 0x40 (first segment in the template) have a /1 added after the read name unless --FLAG is specified as well.

  • Alignments with FLAG 0x80 (last segment in the template) have a /2 added after the read name unless --FLAG is specified as well.

  • If --output specifies a filename ending with .fq[.gz] or .fastq[.gz] then --fastq is activated implicitly.
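The rules listed above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not Genozip's implementation: the function name sam_line_to_fastq and the add_suffix parameter are hypothetical, only the bases A, C, G, T and N are complemented, and the handling of a FLAG with both 0x40 and 0x80 set is an assumption.

```python
# A minimal sketch of the conversion rules; names here are hypothetical.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")  # other characters pass through unchanged

SECONDARY, SUPPLEMENTARY = 0x100, 0x800
REVERSE, FIRST, LAST = 0x10, 0x40, 0x80


def sam_line_to_fastq(line, add_suffix=True):
    """Return a 4-line FASTQ record for one SAM alignment, or None if it is dropped."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]

    # Secondary and supplementary alignments are dropped.
    if flag & (SECONDARY | SUPPLEMENTARY):
        return None

    # Reverse-strand alignments are restored to the original read orientation.
    if flag & REVERSE:
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]

    # /1 or /2 is appended unless a FLAG filter is in use (modelled by add_suffix=False).
    # Assumption: if both bits are set, /1 wins.
    if add_suffix:
        if flag & FIRST:
            qname += "/1"
        elif flag & LAST:
            qname += "/2"

    return f"@{qname}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    toy = "r1\t147\tchr1\t100\t0\t4M\t=\t50\t-54\tAACG\tABCD"
    print(sam_line_to_fastq(toy), end="")
```

With the toy alignment above (FLAG 147, a reverse-strand R2 read), SEQ AACG becomes CGTT, QUAL ABCD becomes DCBA, and the read name gains /2.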

Example:

Consider a simple paired-end SAM file with three alignments, all with the same QNAME. The first two are the primary alignments of R1 and R2, and the third is a secondary alignment of R1:

$ genozip myfile.sam  # compress with genozip

$ genocat myfile.sam.genozip
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593  99      chr1    33656   0       14M1D136M       =       33901   386     CCTAATGCTATCCCCCCCCCGCCCCCCACGCCCTGACAAGCCCCCGTGTGTGATGTTTTCCGCCCCCTGTCCAAGCCTTCCCATTGTTCAATTCCCCCCTGTGAGTGAGAACATGCAGGGTTTGGGTTTCTGTCTTTGTGATAGTTTGCT  AAAFFJJJJJJJJJJJJJJ-AAJJFJJ-77F-77FFJ----AJJ---7-77-A-<FJ-FF-7-AJJ---7-A-F-A-<FJ-7<<JJFJ-AF-<7AJ-<<-7--7A----<7-A-77-77A-AF-A7FJJ7J<FJ7J<-A--AA7-AA--7  NM:i:16 MD:Z:0T13^T5A9C1A6G5A4A7C2T5A9G3T15A21T6T24     AS:i:72 XS:i:72 MQ:i:0  ms:i:5286       mc:i:34049      MC:Z:141M8S
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593  147     chr1    33901   0       141M8S  =       33656   -386    CACATTTTCTTAATCCAGTCTGTCATTAATGGACATTTGGGTTGGTTCAAAGTCTTTGCTATTGTGAATAGTGCCACAATAAACATACATGTGCATGTGTCTTTATAGTAGCACGATTTATAATCCTTTGGGTATATACCCAGTAATGG   JJFAF7-7F7<--<AA-JJF<JAJAA<JJ<A--F-<AJJFFAAAJAJJJF7FJJJ7FJFJFJFJJJJJ7FJ<JJF<FJJJJJJJJ<JFJAJJF<<-JJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA   NM:i:3  MD:Z:6C20G42C70 AS:i:126        XS:i:129        MQ:i:0  ms:i:3466       mc:i:33656      MC:Z:14M1D136M
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593  355     chr22   33656   0       14M1D136M       =       33901   386     CCTAATGCTATCCCCCCCCCGCCCCCCACGCCCTGACAAGCCCCCGTGTGTGATGTTTTCCGCCCCCTGTCCAAGCCTTCCCATTGTTCAATTCCCCCCTGTGAGTGAGAACATGCAGGGTTTGGGTTTCTGTCTTTGTGATAGTTTGCT  AAAFFJJJJJJJJJJJJJJ-AAJJFJJ-77F-77FFJ----AJJ---7-77-A-<FJ-FF-7-AJJ---7-A-F-A-<FJ-7<<JJFJ-AF-<7AJ-<<-7--7A----<7-A-77-77A-AF-A7FJJ7J<FJ7J<-A--AA7-AA--7  NM:i:16 MD:Z:0T13^T5A9C1A6G5A4A7C2T5A9G3T15A21T6T24     AS:i:72 XS:i:72 MQ:i:0  ms:i:5286       mc:i:34049      MC:Z:141M8S
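To see why the three alignments above are treated differently, it helps to decode their FLAG values (99, 147 and 355) into individual bits. The following is a small sketch using the bit meanings from the SAM specification; the helper name decode_flag and the short labels are chosen here for illustration only.

```python
# Bit meanings follow the SAM specification; labels are abbreviated for display.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in template",
    0x80: "last in template",
    0x100: "secondary",
    0x200: "fails QC",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    for flag in (99, 147, 355):
        print(flag, hex(flag), ", ".join(decode_flag(flag)))
```

FLAG 99 decodes to a forward-strand R1 read, 147 to a reverse-strand R2 read, and 355 to an R1 read that is also marked secondary.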
Let’s now display the data as FASTQ. Notice that:
1. The last SAM line is eliminated because it is a secondary alignment.
2. The read names have /1 and /2 added to them.
$ genocat myfile.sam.genozip --fastq
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1
CCTAATGCTATCCCCCCCCCGCCCCCCACGCCCTGACAAGCCCCCGTGTGTGATGTTTTCCGCCCCCTGTCCAAGCCTTCCCATTGTTCAATTCCCCCCTGTGAGTGAGAACATGCAGGGTTTGGGTTTCTGTCTTTGTGATAGTTTGCT
+
AAAFFJJJJJJJJJJJJJJ-AAJJFJJ-77F-77FFJ----AJJ---7-77-A-<FJ-FF-7-AJJ---7-A-F-A-<FJ-7<<JJFJ-AF-<7AJ-<<-7--7A----<7-A-77-77A-AF-A7FJJ7J<FJ7J<-A--AA7-AA--7
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2
CCATTACTGGGTATATACCCAAAGGATTATAAATCGTGCTACTATAAAGACACATGCACATGTATGTTTATTGTGGCACTATTCACAATAGCAAAGACTTTGAACCAACCCAAATGTCCATTAATGACAGACTGGATTAAGAAAATGTG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJ-<<FJJAJFJ<JJJJJJJJF<FJJ<JF7JJJJJFJFJFJF7JJJF7FJJJAJAAAFFJJA<-F--A<JJ<AAJAJ<FJJ-AA<--<7F7-7FAFJJ
Let’s now output only the R1 reads (the first SAM line in this case). Notice that:
1. /1 is not added, because --FLAG is specified.
2. --fq is an alternative spelling of the --fastq option.
$ genocat myfile.sam.genozip --fq --FLAG +0x40
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593
CCTAATGCTATCCCCCCCCCGCCCCCCACGCCCTGACAAGCCCCCGTGTGTGATGTTTTCCGCCCCCTGTCCAAGCCTTCCCATTGTTCAATTCCCCCCTGTGAGTGAGAACATGCAGGGTTTGGGTTTCTGTCTTTGTGATAGTTTGCT
+
AAAFFJJJJJJJJJJJJJJ-AAJJFJJ-77F-77FFJ----AJJ---7-77-A-<FJ-FF-7-AJJ---7-A-F-A-<FJ-7<<JJFJ-AF-<7AJ-<<-7--7A----<7-A-77-77A-AF-A7FJJ7J<FJ7J<-A--AA7-AA--7
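The selection in this example can be mimicked on the three FLAG values of the sample file. This is a sketch under one stated assumption: that the + prefix of --FLAG keeps alignments in which every bit of the given value is set, which is consistent with the "only R1 reads" usage shown above. The function names are hypothetical.

```python
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def passes_plus_filter(flag, mask):
    """Assumed meaning of '--FLAG +mask': all bits of mask are set in flag."""
    return (flag & mask) == mask


def is_primary(flag):
    """Secondary and supplementary alignments are dropped in FASTQ output."""
    return not flag & (SECONDARY | SUPPLEMENTARY)


flags = [99, 147, 355]  # FLAG values of the three alignments in the example
kept = [f for f in flags if passes_plus_filter(f, 0x40) and is_primary(f)]
print(kept)
```

FLAG 355 has bit 0x40 set as well, but it is removed as a secondary alignment, so only the first SAM line (FLAG 99) remains.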
Finally, we can output all SAM fields to the FASTQ description lines with --fq=all:
$ genocat myfile.sam.genozip --fq=all
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1   FLAG:99 RNAME:chr1      POS:33656       MAPQ:0  CIGAR:14M1D136M RNEXT:= PNEXT:33901     TLEN:386       NM:i:16 MD:Z:0T13^T5A9C1A6G5A4A7C2T5A9G3T15A21T6T24     AS:i:72 XS:i:72 MQ:i:0  ms:i:5286       mc:i:34049      MC:Z:141M8S
CCTAATGCTATCCCCCCCCCGCCCCCCACGCCCTGACAAGCCCCCGTGTGTGATGTTTTCCGCCCCCTGTCCAAGCCTTCCCATTGTTCAATTCCCCCCTGTGAGTGAGAACATGCAGGGTTTGGGTTTCTGTCTTTGTGATAGTTTGCT
+
AAAFFJJJJJJJJJJJJJJ-AAJJFJJ-77F-77FFJ----AJJ---7-77-A-<FJ-FF-7-AJJ---7-A-F-A-<FJ-7<<JJFJ-AF-<7AJ-<<-7--7A----<7-A-77-77A-AF-A7FJJ7J<FJ7J<-A--AA7-AA--7
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2   FLAG:147        RNAME:chr1      POS:33901       MAPQ:0  CIGAR:141M8S    RNEXT:= PNEXT:33656     TLEN:-386       NM:i:3  MD:Z:6C20G42C70 AS:i:126        XS:i:129        MQ:i:0  ms:i:3466       mc:i:33656      MC:Z:14M1D136M
CCATTACTGGGTATATACCCAAAGGATTATAAATCGTGCTACTATAAAGACACATGCACATGTATGTTTATTGTGGCACTATTCACAATAGCAAAGACTTTGAACCAACCCAAATGTCCATTAATGACAGACTGGATTAAGAAAATGTG
+
AAFFFJJ
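The description lines shown for --fq=all follow a visible pattern: the read name, then the mandatory SAM fields FLAG through TLEN as NAME:value pairs, then the optional fields unchanged. A minimal sketch of that layout follows; the function name description_line is hypothetical, and the use of a tab as the separator is an assumption based on the output above.

```python
# Names of SAM columns 2-9, as they appear in the --fq=all output above.
SAM_FIELD_NAMES = ["FLAG", "RNAME", "POS", "MAPQ", "CIGAR", "RNEXT", "PNEXT", "TLEN"]


def description_line(sam_line, suffix=""):
    """Build a FASTQ description line carrying all SAM fields except SEQ and QUAL."""
    fields = sam_line.rstrip("\n").split("\t")
    qname = fields[0]
    # Columns 2-9 are emitted as NAME:value.
    named = [f"{name}:{value}" for name, value in zip(SAM_FIELD_NAMES, fields[1:9])]
    # Optional fields (column 12 onward) are already in TAG:TYPE:VALUE form.
    optional = fields[11:]
    return "\t".join([f"@{qname}{suffix}"] + named + optional)


if __name__ == "__main__":
    toy = "r1\t99\tchr1\t33656\t0\t14M1D136M\t=\t33901\t386\tACGT\tFFFF\tNM:i:16\tAS:i:72"
    print(description_line(toy, "/1"))
```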