genocat
Display contents or metadata of a file compressed with genozip.
Usage: genocat [options]… [files]…
One or more file names must be given.
General options
- -f, --force Force overwrite of the output file.
- -D, --subdirs If a file name on the command line is a directory, include all files of that directory (recursively).
- -o, --output output-filename. Output to this filename.
- -p, --password password. Provide password to access file(s) that were compressed with --password.
- -x, --index Create an index file alongside the decompressed file. The index file is created with the tool listed for each data type:
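  ===========  =====================
  Data type    Tool used
  ===========  =====================
  SAM/BAM      samtools index
  FASTQ        samtools faidx
  FASTA        samtools faidx
  VCF          bcftools index
  Other types  --index not supported
  ===========  =====================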
- -z, --bgzf level. Compress the output to the BGZF format (.gz extension). Note that by default genocat does not compress with BGZF (except for BAM, which is compressed with level 1). Use this option if downstream tools require it.
- -q, --quiet Don't show the progress indicator or warnings.
- -Q, --noisy The --quiet option is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings.
- -@, --threads number. Specify the maximum number of threads. By default genozip allocates 1.1 threads per core in order to maximize usage of all available cores. The exception is Mac and Windows (including WSL), where the default allocation is 0.75 threads per core to keep the operating system's UI responsive.
- -w, --stats Show the internal structure of a genozip file and the associated compression statistics.
- -W, --STATS Show more detailed statistics.
- Note: specifying -W or -w twice prints the header line of the statistics to stderr, so that it survives piping stdout to grep.
- --show-filename Show the file name for each file.
- --validate[=valid] Validates that the file(s) are valid genozip files. By default, reports files that are invalid. With --validate=valid, reports files that are valid, and if run on a single file, the exit code indicates validity.
- -T, --files-from filename. An alternative to providing input file names on the command line. filename is a text file containing a newline-separated list of files. If filename is - (a hyphen), data is taken from stdin rather than a file.
- --log filename. Send non-file output to a log file instead of the terminal.
- --echo Output the full command line upon successful or failed completion of execution.
- -h, --help Show a link to this page.
- --help=attributions Show attributions.
- -L, --license, --licence Show the license terms and conditions for this product as accepted. Combine with --force to see the most up-to-date version of the license. If you wish to change your license to the most recent one, re-register with genozip --register.
- -V, --version Display Genozip's version number.
- -e, --reference filename. Load a reference file prior to decompressing. Used for files compressed with --reference.
- Note: this is equivalent to setting the environment variable $GENOZIP_REFERENCE to the reference filename.
- --show-reference Show the name and MD5 of the reference file that needs to be provided to uncompress this file.
Subsetting options (options resulting in modified display of the data)
- --downsample rate[,shard]. Show only one in every <rate> lines (reads in the case of FASTQ; sequences in the case of FASTA). The optional <shard> parameter indicates which of the shards is shown - it must be a value between 0 and rate-1. Other subsetting options (if any) will be applied to the surviving lines only.
- --component component-number. View a specific component of a genozip file. <component-number> is the number of the component as it appears in the genols list - the first component being number 1.
- -r, --regions [^]chr|chr:pos|pos|chr:from-to|chr:from-|chr:-to|from-to|from-|-to|from+len[,...]. (VCF SAM/BAM GFF3/GVF FASTA 23andMe Chain Reference) Show one or more regions of the file.
- Examples:
genocat myfile.vcf.genozip -r 22:1000-2000
Positions 1000 to 2000 on contig 22
genocat -e myfile.ref.genozip -r 22:2000-1000
Reverse complement of positions 1000 to 2000 on contig 22 (reference file only)
genocat myfile.sam.genozip -r 22:1000+151
151 bases, starting pos 1000, on contig 22
genocat -e myfile.ref.genozip -r 22:1000-151
Reverse complement of 151 bases, from 1000 to 850, on contig 22 (reference file only)
genocat myfile.vcf.genozip -r -2000,2500-
Two ranges on all contigs
genocat myfile.sam.genozip -r chr21,chr22
Contigs chr21 and chr22 in their entirety
genocat myfile.vcf.genozip -r ^MT,Y
All contigs, excluding MT and Y
genocat myfile.vcf.genozip -r ^-1000
All contigs, excluding positions up to 1000
genocat myfile.fa.genozip -r chrM
Contig chrM
Note: genozip files are indexed automatically during compression. There is no separate indexing step or separate index file.
Note: Indels are considered part of a region if their start position is.
Note: Multiple -r arguments may be specified - this is equivalent to chaining their regions with a comma separator in a single argument.
Note: For Reference files, use in combination with --reference (or -e).
Note: For FASTA and Chain files, only whole-contig regions are possible.
Note: For Chain files this applies to the Primary contig (qName).
- -R, --regions-file [^]filename. (VCF SAM/BAM GFF3/GVF FASTA 23andMe Chain Reference) Show regions from a list in a tab-separated file. To include all regions EXCEPT those in the file, prefix the filename with ^. If filename is - (or ^-), data is taken from stdin rather than a file.
- ::

    # Comment lines starting with a # are ignored.
    chr22 17000000 17000099
    chr22 17000000 +100
    chr22 17000000
- --grep string. Show only lines (FASTA: sequences; FASTQ: reads; CHAIN: sets) in which <string> is a case-sensitive substring of the line (FASTA: description). This does not affect showing the file header.
- -g, --grep-w string. Same as --grep, but restrict to whole words.
- -n, --lines [first]-[last] or [first]. Show a certain range of lines. <first> and <last> are numbers of lines in the file (starting from 1).
- Examples:
--lines=1000-2000
displays the 1001 lines between 1000 and 2000
--lines=1000-
displays all lines starting from 1000 (optional =)
--lines=-2000
displays lines 1 to 2000
--lines=1000
displays 10 lines starting from line 1000
Note on outputting as BAM: The numbering excludes the BAM header.
Note on FASTQ: The numbering is of reads rather than lines.
Note: The entire file header is included if any part of it is.
Note: Line numbers are taken before any additional filters are applied.
- --head [num_lines]. Show <num_lines> lines from the start of the file.
- --tail [num_lines]. Show <num_lines> lines from the end of the file.
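The way the --lines argument maps to a line range can be sketched in Python. This is only an illustration of the behavior described for --lines above, not genocat's actual parser: the function name parse_lines_arg is hypothetical, and the handling of malformed input is an assumption.

```python
from typing import Optional, Tuple

# Per the --lines description: a lone <first> shows 10 lines starting there.
DEFAULT_COUNT = 10


def parse_lines_arg(spec: str) -> Tuple[int, Optional[int]]:
    """Interpret a --lines value as (first, last), 1-based and inclusive.

    last is None when the range is open-ended ("1000-").
    Hypothetical helper; malformed input simply raises ValueError.
    """
    if "-" not in spec:
        # "[first]" alone: a fixed-size window of DEFAULT_COUNT lines.
        first = int(spec)
        return first, first + DEFAULT_COUNT - 1

    first_txt, last_txt = spec.split("-", 1)
    first = int(first_txt) if first_txt else 1       # "-2000" starts at line 1
    last = int(last_txt) if last_txt else None       # "1000-" runs to the end
    return first, last


print(parse_lines_arg("1000-2000"))
print(parse_lines_arg("1000-"))
print(parse_lines_arg("-2000"))
print(parse_lines_arg("1000"))
```

Keep in mind that, as noted above, line numbers are taken before any additional filters are applied, so a range selects lines of the original file rather than lines of the filtered output.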
VCF options
- -s, --samples [^]sample[,...] or num_samples. Show a subset of samples (individuals). No other fields (such as AF, AC) are updated.
- Examples:
genocat myfile.vcf.genozip -s HG00255,HG00256
show two samples
genocat myfile.vcf.genozip -s ^HG00255,HG00256
show all samples except these two
genocat myfile.vcf.genozip -s 5
show the first 5 samples
Note: This does not change the INFO data (including the AC and AN tags).
Note: Sample names are case-sensitive.
Note: Multiple -s arguments may be specified - this is equivalent to chaining their samples with a comma separator in a single argument.
- -G, --drop-genotypes. Output the data without the samples and FORMAT column. No other fields (such as AF, AC) are updated.
- --GT-only. Within samples, output only genotype (GT) data - dropping the other subfields.
- --snps-only. Drops variants that are not a Single Nucleotide Polymorphism (SNP).
- --indels-only. Drops variants that are not Insertions or Deletions (indel).
- --unsorted. If a file contains a "reconstruction plan" (see genozip --sort), the file will be displayed sorted by default. --unsorted overrides this behavior and shows the file in its unsorted form. This is useful if the file was highly unsorted, causing sorting during genocat to consume a lot of memory.
- -1, --header-one. Output only the last line of the header (the line with the field and sample names).
- --bcf Output as BCF. Note: bcftools needs to be installed for this option to work.
- --luft. Render a DVCF file in Luft coordinates (absent this option, a DVCF will be rendered in Primary coordinates).
- --single-coord. Remove all DVCF-specific lines from the VCF header and remove the DVCF INFO annotations. This leaves the file as a normal VCF file in single coordinates - either the Luft coordinates (when combined with --luft) or Primary coordinates.
- -y, --show-dvcf. For each variant show its coordinate system (Primary or Luft or Both) and its oStatus. May be used with or without --luft.
- --show-ostatus. Add oSTATUS to the INFO field. May be used with or without --luft.
- --show-counts=o\$TATUS. Show summary statistics of variant lift outcome.
- --show-counts=COORDS. Show summary statistics of variant coordinates.
- --no-PG. When modifying the data in a file using genocat, Genozip normally adds a "##genozip_command" line to the VCF header. With this option it doesn't.
- --gpos. Replaces (CHROM,POS) with a coordinate in GPOS (Global POSition) terms. GPOS is a single genome-wide coordinate defined by a reference file, in which contigs appear in the order of the original FASTA data used to generate the reference file. Must be used in combination with --reference. The mapping of CHROM to GPOS can be viewed with "genocat --show-ref-contigs <reference-file.ref.genozip>".
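The three forms accepted by --samples (a list of names, a ^-prefixed exclusion list, or a count) can be illustrated with a short Python sketch. It is a model of the documented behavior only: the function name select_samples is hypothetical, and two details are assumptions not stated above, namely that the output keeps the file's sample order and that an all-digit argument is always read as a count.

```python
from typing import List


def select_samples(spec: str, all_samples: List[str]) -> List[str]:
    """Model of --samples: return the samples that would be shown.

    Hypothetical helper. Assumes output follows the file's sample order
    and that an all-digit spec is a count (num_samples), not a name.
    """
    if spec.isdigit():
        # num_samples: the first N samples of the file.
        return all_samples[: int(spec)]

    negate = spec.startswith("^")
    # The ^ prefix applies to the whole comma-separated list.
    names = set((spec[1:] if negate else spec).split(","))

    if negate:
        return [s for s in all_samples if s not in names]
    # Matching is case-sensitive, as noted for sample names.
    return [s for s in all_samples if s in names]


samples = ["HG00254", "HG00255", "HG00256", "HG00257"]
print(select_samples("HG00255,HG00256", samples))
print(select_samples("^HG00255,HG00256", samples))
print(select_samples("2", samples))
```

As the notes above state, subsetting samples leaves INFO tags such as AC and AN unchanged, so those values still describe the full sample set.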
SAM and BAM options
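- --FLAG {+-^}value. Filter lines based on the FLAG value: <value> is a decimal or hexadecimal value and should be prefixed by +, - or ^:

  ======  =====================================================================
  Prefix  Effect
  ======  =====================================================================
  +       INCLUDES lines in which ALL flags in value are set in the line’s FLAG
  -       INCLUDES lines in which NO flags in value are set in the line’s FLAG
  ^       EXCLUDES lines in which ALL flags in value are set in the line’s FLAG
  ======  =====================================================================

Example: --FLAG -192 includes only lines in which neither FLAG 64 nor FLAG 128 is set. This can also be expressed as --FLAG -0xC0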
The FLAGs are defined in the SAM specification as follows:

  =======  =====  ==================================================================
  Decimal  Hex    Meaning
  =======  =====  ==================================================================
  1        0x1    template having multiple segments in sequencing
  2        0x2    each segment properly aligned according to the aligner
  4        0x4    segment unmapped
  8        0x8    next segment in the template unmapped
  16       0x10   SEQ being reverse complemented
  32       0x20   SEQ of the next segment in the template being reverse complemented
  64       0x40   the first segment in the template
  128      0x80   the last segment in the template
  256      0x100  secondary alignment
  512      0x200  not passing filters, such as platform/vendor quality controls
  1024     0x400  PCR or optical duplicate
  2048     0x800  supplementary alignment
  =======  =====  ==================================================================
- --MAPQ [^]value. Filter lines based on the MAPQ value: INCLUDE (or EXCLUDE if <value> is prefixed with ^) lines with a MAPQ greater than or equal to <value>.
- --bases [^]value. Filter lines based on the IUPAC characters (bases) of the sequence data.
- Examples:
genocat --bases ACGTN
displays only lines in which all characters of the SEQ are one of A,C,G,T,N
genocat --bases ^ACGTN
displays only lines in which NOT all characters of the SEQ are one of A,C,G,T,N
Note: In SAM/BAM, all lines missing a sequence (i.e. SEQ=*) are included in positive --bases filters (the first example above) and excluded in negative ones.
Note: The list of IUPAC characters can be found here: IUPAC codes
- --bam Output as BAM. Note: this option is implicit if --output specifies a filename ending with .bam
- --sam Output as SAM. This option is the default in genocat on SAM and BAM data and is implicit if --output specifies a filename ending with .sam
- --fastq[=all] Output as FASTQ. Note: this option is implicit if --output specifies a filename ending with .fq or .fastq. If --fastq=all is specified, all SAM fields are output to the FASTQ file.
- For more details, see: Converting SAM/BAM to FASTQ
- --no-PG. When modifying the data in a file using genocat, Genozip normally adds information about the modification in the file header. With this option it doesn't.
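The three --FLAG prefixes described above are plain bitmask tests, which a few lines of Python can make concrete. This is a sketch of the documented semantics rather than genocat's implementation; the function name flag_filter_keeps is hypothetical.

```python
def flag_filter_keeps(flag: int, spec: str) -> bool:
    """Return True if a line with this FLAG survives a --FLAG filter.

    spec is the option value, e.g. "-192", "-0xC0" or "+3".
    Hypothetical helper modelling the +, - and ^ prefixes.
    """
    op = spec[0]
    # Base 0 lets int() accept both decimal ("192") and hex ("0xC0").
    value = int(spec[1:], 0)

    if op == "+":
        # INCLUDE lines in which ALL bits of value are set.
        return (flag & value) == value
    if op == "-":
        # INCLUDE lines in which NONE of the bits of value are set.
        return (flag & value) == 0
    if op == "^":
        # EXCLUDE lines in which ALL bits of value are set.
        return (flag & value) != value
    raise ValueError("value must be prefixed by +, - or ^")


# FLAG 99 = 1 + 2 + 32 + 64 (first segment of a properly aligned pair)
print(flag_filter_keeps(99, "-192"))   # FLAG 64 is set, so the line is dropped
print(flag_filter_keeps(16, "-0xC0"))  # neither 64 nor 128 is set
print(flag_filter_keeps(99, "+3"))     # both 1 and 2 are set
```

Note that - and ^ are not opposites of each other: with a multi-bit value, - drops a line if any of the bits is set, while ^ drops it only if all of them are set.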
FASTQ options
- --interleaved[=both|either] For FASTQ data compressed with --pair: Show every pair of paired-end FASTQ files with their reads interleaved: first one read of the first file; then a read from the second file; then the next read from the first file and so on. The optional argument 'both' (default) or 'either' determines whether both reads of a pair or only one is required for the pair to survive when combining with a subsetting option such as --grep.
- --header-only. Output only the description lines.
- --seq-only. Output only the sequence (nucleotide) lines.
- --qual-only. Output only the quality lines.
- --bases [^]value. Filter lines based on the IUPAC characters (bases) of the sequence data (see SAM/BAM options).
FASTA options
- -1, --header-one. Output the sequence name up to the first space or tab.
- -H, --no-header. Don't output the header lines.
- --header-only. Output only the header lines.
- --sequential. Output in sequential format - each sequence in a single line.
- --phylip Output a Multi-FASTA in Phylip format. All sequences must be the same length.
Phylip options
- --fasta (Phylip only) Output as Multi-FASTA.
Reference file options
- --reference <file> --regions <regions> [--header-only] View one or more regions of a reference file
- Note: For reverse complement, use a reverse range, eg -r1000000-999995 or equivalently -r1000000-6
- Note: --regions-file may be used instead of --regions
- Note: Combine with --no-header to suppress output of the chromosome name.
- Note: Short forms of the options (eg -e instead of --reference) are fine too.
- --gpos. In combination with --reference and --regions or --regions-file - shows coordinates in GPOS (Global POSition) terms - a single genome-wide numeric coordinate - rather than (CHROM,POS).
- --show-ref-contigs. Show the details of the reference file contigs.
- --show-ref-iupacs. Show non-ACTGN `IUPAC <http://www.bioinformatics.org/sms/iupac.html>`_ pseudo-bases in the reference file.
Chain file options
- --show-chain Show chain file alignments.
- --show-chain-contigs Show the details of the chain file contigs.
23andMe options
- --vcf Output as VCF. --vcf must be used in combination with --reference to specify the reference file as listed in the header of the 23andMe file (usually this is GRCh37). Note: Indel variants ('DD' 'DI' 'II') as well as uncalled sites ('--') are discarded.
Filtering using kraken data
- -K, --kraken filename. Load a .kraken.genozip file for use with --taxid.
- -k, --taxid [^]taxid[+0]. Show only lines that match the Taxonomy ID <taxid>. ^ for a negative search. +0 means <taxid> AND unclassified. Requires either using in combination with --kraken or for the file to have been compressed with genozip --kraken.
- --show-kraken[=INCLUDED|EXCLUDED]. In combination with --taxid, reports whether each line is included or excluded. =INCLUDED or =EXCLUDED reports only a subset of lines accordingly. Combine with --count for a fast report without displaying the file itself.
Analysis options
- --contigs. (VCF SAM BAM FASTA GFF3/GVF 23andMe) List the names of the chromosomes (or contigs) included in the file. Alternative option names: --list-chroms --chroms.
- --sex. (SAM BAM FASTQ) Determine whether a SAM/BAM is Male or Female. Limitations apply when used on FASTQ. See "Sex assignment" for details.
- --coverage[=all|=one]. (SAM BAM FASTQ) Shows the coverage and depth of each contig. Values are approximate when used on FASTQ. See "Coverage and Depth" for details.
- --idxstats. (SAM BAM FASTQ) Shows the count of mapped and unmapped reads by contig. Values are approximate when used on FASTQ. Same output format as samtools idxstats.
- --count. Rather than displaying the file content, just report the number of lines (FASTQ: reads; CHAIN: sets), excluding the header, that would have been displayed. Useful in combination with filtering options.