Downsampling and sharding

Data Types: VCF, SAM, BAM, FASTQ, FASTA, GFF3/GVF, 23andMe

Usage

genocat --downsample <rate>[,<shard>] [genozip files...]

Description

Shows one line (or one read, in the case of FASTQ) out of every rate lines. The optional shard parameter (0-based) selects which line within each group of rate lines is shown. The default shard is 0.
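The selection rule can be sketched in a few lines of Python. This only illustrates the behavior described above and is not genocat's implementation. The function name downsample is hypothetical, and the sketch assumes counting starts at position 0 with the first line.

```python
from itertools import islice

def downsample(records, rate, shard=0):
    """Return an iterator over records whose 0-based position modulo `rate` equals `shard`."""
    # Sanity checks for this sketch only; genocat's own argument handling may differ.
    if rate < 1 or not 0 <= shard < rate:
        raise ValueError("need rate >= 1 and 0 <= shard < rate")
    # Start at position `shard`, then take every `rate`-th record.
    return islice(records, shard, None, rate)

print(list(downsample(range(9), 3, 1)))  # → [1, 4, 7]
```

With rate 3 and shard 1, positions 1, 4 and 7 are kept. Omitting shard keeps positions 0, 3 and 6.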

Downsampling is the final filter: it is applied only after all other filters (--interleave, --grep, --regions, --no-header, --luft, etc.) have been applied.
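Why the order matters can be seen in a small Python sketch. The helpers grep and downsample below are hypothetical stand-ins written for illustration, not genocat code, and the input lines are made up.

```python
def grep(lines, text):
    """Hypothetical stand-in for a content filter such as --grep."""
    return [line for line in lines if text in line]

def downsample(lines, rate, shard=0):
    """Keep the lines at 0-based positions shard, shard+rate, shard+2*rate, ..."""
    return lines[shard::rate]

lines = ["chr1 a", "chr2 b", "chr1 c", "chr1 d", "chr2 e", "chr1 f"]

# Documented order: filter first, then downsample the surviving lines.
filtered_then_sampled = downsample(grep(lines, "chr1"), rate=2)
print(filtered_then_sampled)  # → ['chr1 a', 'chr1 d']

# The reverse order, shown only for contrast, gives a different result.
sampled_then_filtered = grep(downsample(lines, rate=2), "chr1")
print(sampled_then_filtered)  # → ['chr1 a', 'chr1 c']
```

Because downsampling comes last, the rate counts lines that survived the other filters, not lines of the original file.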

Example:

Getting the middle read of every 3 consecutive FASTQ reads (i.e. shard 1 of the 0-based positions {0,1,2}):

$ genocat my-file.fq.genozip

@A00910:85:HYGWJDSXX:1:1101:3025:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NTTGGGGGTGGGGATCCCTATCTTAGCTGTTGCAATCCCTGGGCTGCTTCAGTGTTAATAACATTCCAAA
+
#FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:8160:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NATTATGAGAGAGTGCTTTTTACAATGTTAATGACATGTTATAATAAAGTAATCTTACAATAAACAAGAA
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:9028:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NCTACAATGTGTGACAACAATAATGTAAAAGGTAGATGAAATTAAAGTACCTAGCAATATTAGGAAATTG
+
#FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFF:F:,FF
@A00910:85:HYGWJDSXX:1:1101:15067:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NTGTAGCATGCTCTTTGGTGCAAATTGACGAGCAGATTCTAAAAGTCACAGAGAAATGCAAAAGACCCTG
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:
@A00910:85:HYGWJDSXX:1:1101:16007:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NTTCAGAGGCTTCCGGCTAAATAGTAATACAAGTAGCACAAACAACAGAGTGAGAATGTTTATCACACTC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:16984:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NTTCTATTTTGCCCCTGAGGGTGCATCCCGAAGAGGGAAGCTATTGATTTTTAACACTAGACACATAAAC
+
#:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:20636:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NTATATACCTATTTTCATATTTTTGTCAGTGTTGGTCAGATTTTTAGAAGTGAGATTTGCTAGCAAAAAT
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:21811:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NCTTTCAAGAGCAGCCCCAGCTCCTTAAGCTGCTGGTCCTGGTGCATCTGCTGACTTTCATGTAGAAGAT
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:1714:1016 1:N:0:CAACGAGAGC+GAATTGAGTG
NATATTGGTCTTATGATCATAAATTTTCTCAGCATTTATATTCTGAAGAATATATATTTCCTGTTTATTT
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFF

$ genocat my-file.fq.genozip --downsample 3,1

@A00910:85:HYGWJDSXX:1:1101:8160:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NATTATGAGAGAGTGCTTTTTACAATGTTAATGACATGTTATAATAAAGTAATCTTACAATAAACAAGAA
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:16007:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NTTCAGAGGCTTCCGGCTAAATAGTAATACAAGTAGCACAAACAACAGAGTGAGAATGTTTATCACACTC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF
@A00910:85:HYGWJDSXX:1:1101:21811:1000 1:N:0:CAACGAGAGC+GAATTGAGTG
NCTTTCAAGAGCAGCCCCAGCTCCTTAAGCTGCTGGTCCTGGTGCATCTGCTGACTTTCATGTAGAAGAT
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
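The sharding side of this option can be sketched in Python as well. The sketch assumes each FASTQ read occupies exactly 4 lines (header, sequence, "+", quality), uses nine toy reads named r0 to r8 in place of the real reads above, and the helper names fastq_reads and shard_reads are hypothetical. It is an illustration of the documented behavior, not genocat's implementation.

```python
def fastq_reads(lines):
    """Group FASTQ lines into reads of 4 lines each (assumes 4-line records)."""
    lines = list(lines)
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

def shard_reads(reads, rate, shard):
    """Keep the reads at 0-based positions shard, shard+rate, shard+2*rate, ..."""
    return reads[shard::rate]

# Nine toy reads stand in for the nine reads of the example.
lines = []
for n in range(9):
    lines += [f"@r{n}", "ACGT", "+", "FFFF"]

reads = fastq_reads(lines)

# One shard per value 0, 1, 2, as with --downsample 3,0 / 3,1 / 3,2.
shards = [shard_reads(reads, 3, s) for s in range(3)]
names = [[read[0] for read in shard] for shard in shards]

print(names[1])  # → ['@r1', '@r4', '@r7']
```

Shard 1 holds reads 1, 4 and 7, the same positions kept by --downsample 3,1 above. Taken together, the three shards contain every read exactly once, so running the command once per shard value splits a file into rate disjoint parts.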