Filtering BAM or FASTQ reads by species using kraken2

kraken2 is a tool (not associated with Genozip) that analyzes a FASTQ or FASTA file against a database typically derived from reference sequences of a large number of species (for example: human plus bacteria) and assigns an NCBI Taxonomy ID (taxid) to each read - generating a kraken output file containing these assignments.

Genozip allows filtering of SAM, BAM, FASTQ and FASTA files by taxid, using the kraken data (i.e. the kraken output file). This allows, for example, filtering contamination out of an existing BAM file without needing to re-map it.


Already have a BAM (or SAM, FASTQ of FASTA) file and kraken data?

Step 1: Compress the BAM (or SAM, FASTQ of FASTA) file and the kraken data with genozip:

genozip myfile.bam
genozip mykrakendata.kraken

Note: If your kraken file doesn’t have a .kraken file name extension, use the --input kraken option to tell genozip that this is kraken data (also accepted: kraken.gz kraken.bz2 kraken.xz).

Step2: See Filter it.


Preparing the kraken file

From a BAM genozip file:

mkfifo mysample.fq
genocat myfile.bam.genozip --fastq -o mysample.fq &
kraken2 --db $kraken_db --report mykrakendata.kraken.report mysample.fq | genozip - --input kraken --output mykrakendata.kraken.genozip
rm mysample.fq

Note that we use named pipe (generated by mkfifo) for in-memory communication from genocat to kraken2. This is to overcome kraken’s inability to use a regular pipe for the input file.

From a paird FASTQ genozip file:

mkfifo mysample.kraken mysample.R1.fq mysample.R2.fq
genocat myfile.fq.genozip --component 1 -o mysample.R1.fq &
genocat myfile.fq.genozip --component 2 -o mysample.R2.fq &
kraken2 --db <kraken-db> --paired --output mkfifo.kraken --report mykrakendata.kraken.report mysample.R1.fq mysample.R2.fq &
genozip mykrakendata.kraken
rm mysample.kraken mysample.R1.fq mysample.R2.fq

Note that in this case we split the paired FASTQ data to two different pipes to overcomes kraken’s inability to consume a paired fastq in interleaved format.


Filtering an existing file with the kraken data

This will display only the lines of the BAM file containing human data (NCBI Taxonomy ID 9606):

genocat myfile.bam.genozip --kraken mykrakendata.kraken.genozip --taxid 9606

This will display all the lines of the BAM file that DO NOT contain human data:

genocat myfile.bam.genozip --kraken mykrakendata.kraken.genozip --taxid ^9606

A +0 suffix can be used include or exclude, in addition, also unclassified reads:

genocat myfile.bam.genozip --kraken mykrakendata.kraken.genozip --taxid 9606+0

The –kraken and –taxid options work the same for SAM/BAM, FASTQ and FASTA files.


Adding taxonomy information to a Genozip file

You can generate a SAM/BAM, FASTQ or FASTA file with taxonomy information baked in:

genozip myfile.bam --kraken mykrakendata.kraken.genozip

This stores internally in the Genozip file, for each line, its taxid. Unless you need the kraken file for other purposes, you can discard it at this point.

In the case of SAM and BAM (but not FASTQ and FASTA), the taxid is will be visible as an additional field TX, for example: TX:i:9606.

Now, you can filter the file, as before, but this time it will be a lot faster since Genozip doesn’t need to read the kraken file:

genocat myfile.bam.genozip --taxid 9606

You can see the breakdown of Taxonomy IDs in the file with:

genocat myfile.bam.genozip --show-counts TAXID

Finally, you can see the line-by-line taxid assignments with:

--show-kraken

all lines

--show-kraken=included

only included lines

--show-kraken=excluded

only excluded lines

genocat myfile.bam.genozip --show-kraken

What if FASTQ R1 and R2 files are krakened separately?

Normally, one would run kraken2 on both R1 and R2 FASTQ files together using kraken2’s --paired option. However, sometimes this is not possible, and kraken2 must be run on each of the R1 and R2 files separately. This can happen for example, if the R1 and R2 files were generated using genocat --fastq (see: Converting SAM/BAM to FASTQ) which might result in the reads in R1 and R2 files being in a different order, if the SAM/BAM file is sorted instead of collated.

To handle this, we prepare a concatenated kraken.genozip file, from the output of kraken2 on both FASTQs:

genozip myfile-R1.kraken myfile-R2.kraken --output mykrakendata.kraken
The following logic applies with filtering a file with a kraken.genozip consisting of two components:

• Filtering with --taxid taxid will include the line if either component classifies this line as taxid.

• Filtering with --taxid ^taxid will include the line if neither of component classifies this line as taxid.

• For the purposes of +0, a line is considered unclassified if both components list it as unclassified.

Note: The R1 and R2 files must contain exactly the same reads (QNAMEs) - its ok if they are in a different order

Kraken data containing reads with /1 and /2

genozip --kraken and genocat --kraken and can handle kraken data containing reads with /1 or /2, similar the way two components are handled, as described above.

This can happen, for example, when feeding kraken2 with an interleaved FASTQ file rather than two separate FASTQ files.

Note: Genozip requires that if the kraken file has a read name with /1, then it must also contain the corresponding read name with /2 and vice versa - their order is not important and they needn’t be consecutive.

Assumptions

genocat --taxid makes the assumption that all read names (QNAMEs in SAM terminology) that appear in the viewed file also appear (possibly with a /1 or /2 suffix) in the kraken data. If a read name appears in the viewed file, but is absent from the kraken data, its line will be silently assigned the most common taxid according to the kraken data.

It is ok if the kraken data contains additional read names not present in the viewed file.