Filtering BAM or FASTQ reads by species using kraken2¶
kraken2 is a tool (not associated with Genozip) that analyzes a FASTQ or FASTA file against a database typically derived from reference sequences of a large number of species (for example: human plus bacteria) and assigns an NCBI Taxonomy ID (taxid) to each read - generating a kraken output file containing these assignments.
Genozip allows filtering of SAM, BAM, FASTQ and FASTA files by taxid, using the kraken data (i.e. the kraken output file). This allows, for example, filtering contamination out of an existing BAM file without needing to re-map it.
Already have a BAM (or SAM, FASTQ of FASTA) file and kraken data?
Step 1: Compress the BAM (or SAM, FASTQ of FASTA) file and the kraken data with genozip:
genozip myfile.bam genozip mykrakendata.kraken
Note: If your kraken file doesn’t have a
.kraken file name extension, use the
--input kraken option to tell genozip that this is kraken data (also accepted:
Step2: See Filter it.
Preparing the kraken file
From a BAM genozip file:
mkfifo mysample.fq genocat myfile.bam.genozip --fastq -o mysample.fq & kraken2 --db $kraken_db --report mykrakendata.kraken.report mysample.fq | genozip - --input kraken --output mykrakendata.kraken.genozip rm mysample.fq
Note that we use named pipe (generated by mkfifo) for in-memory communication from genocat to kraken2. This is to overcome kraken’s inability to use a regular pipe for the input file.
From a paird FASTQ genozip file:
mkfifo mysample.kraken mysample.R1.fq mysample.R2.fq genocat myfile.fq.genozip --component 1 -o mysample.R1.fq & genocat myfile.fq.genozip --component 2 -o mysample.R2.fq & kraken2 --db <kraken-db> --paired --output mkfifo.kraken --report mykrakendata.kraken.report mysample.R1.fq mysample.R2.fq & genozip mykrakendata.kraken rm mysample.kraken mysample.R1.fq mysample.R2.fq
Note that in this case we split the paired FASTQ data to two different pipes to overcomes kraken’s inability to consume a paired fastq in interleaved format.
Filtering an existing file with the kraken data
This will display only the lines of the BAM file containing human data (NCBI Taxonomy ID 9606):
genocat myfile.bam.genozip --kraken mykrakendata.kraken.genozip --taxid 9606
This will display all the lines of the BAM file that DO NOT contain human data:
genocat myfile.bam.genozip --kraken mykrakendata.kraken.genozip --taxid ^9606
A +0 suffix can be used include or exclude, in addition, also unclassified reads:
genocat myfile.bam.genozip --kraken mykrakendata.kraken.genozip --taxid 9606+0
The –kraken and –taxid options work the same for SAM/BAM, FASTQ and FASTA files.
Adding taxonomy information to a Genozip file
You can generate a SAM/BAM, FASTQ or FASTA file with taxonomy information baked in:
genozip myfile.bam --kraken mykrakendata.kraken.genozip
This stores internally in the Genozip file, for each line, its taxid. Unless you need the kraken file for other purposes, you can discard it at this point.
In the case of SAM and BAM (but not FASTQ and FASTA), the taxid is will be visible as an additional field tx:i, for example:
Now, you can filter the file, as before, but this time it will be a lot faster since Genozip doesn’t need to read the kraken file:
genocat myfile.bam.genozip --taxid 9606
You can see the breakdown of Taxonomy IDs in the file with:
genocat myfile.bam.genozip --show-counts TAXID
Finally, you can see the line-by-line taxid assignments with:
only included lines
only excluded lines
genocat myfile.bam.genozip --show-kraken
What if FASTQ R1 and R2 files are krakened separately?
Normally, one would run kraken2 on both R1 and R2 FASTQ files together using kraken2’s
--paired option. However, sometimes this is not possible, and kraken2 must be run on each of the R1 and R2 files separately. This can happen for example, if the R1 and R2 files were generated using
genocat --fastq (see: Converting SAM/BAM to FASTQ) which might result in the reads in R1 and R2 files being in a different order, if the SAM/BAM file is sorted instead of collated.
To handle this, we prepare a concatenated kraken.genozip file, from the output of kraken2 on both FASTQs:
genozip myfile-R1.kraken myfile-R2.kraken --output mykrakendata.kraken
--taxid taxidwill include the line if either component classifies this line as taxid.
--taxid ^taxidwill include the line if neither of component classifies this line as taxid.
+0, a line is considered unclassified if both components list it as unclassified.
Kraken data containing reads with /1 and /2
genocat --krakenand can handle kraken data containing reads with /1 or /2, similar the way two components are handled, as described above.
genocat --taxidmakes the assumption that all read names (QNAMEs in SAM terminology) that appear in the viewed file also appear (possibly with a /1 or /2 suffix) in the kraken data. If a read name appears in the viewed file, but is absent from the kraken data, its line will be silently assigned the most common taxid according to the kraken data.