genocat

Display contents or metadata of a file compressed with genozip.

Usage: genocat [options]… [files]…

One or more file names must be given.

Reference-file related options

-e, --reference filename.  Load a reference file prior to decompressing. Required only for files compressed with --reference. When no non-reference file is specified, display the reference data itself (typically used in combination with --regions).

-E, --REFERENCE filename.  With no non-reference file specified, display the reverse complement of the reference data itself. Typically used in combination with --regions.

--show-reference  Show the name and MD5 of the reference file that needs to be provided to uncompress this file.
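As an illustration of the reference options, the sketch below wraps two typical invocations in shell functions. The function names and the file names in the usage comments are hypothetical; only the flags are taken from the option list above.

```shell
# Hypothetical helper: display a file that was compressed with --reference.
# $1 = reference file, $2 = compressed file
cat_with_ref() {
    genocat --reference "$1" "$2"
}

# Hypothetical helper: display part of the reference itself.
# With no non-reference file given, genocat shows the reference data.
# $1 = reference file, $2 = region string (see --regions)
show_ref_region() {
    genocat --reference "$1" --regions "$2"
}

# Usage (file names are placeholders):
#   cat_with_ref myref.ref.genozip myfile.bam.genozip
#   show_ref_region myref.ref.genozip 22:1000-2000
```

If it is unclear which reference a file needs, --show-reference reports its name and MD5.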

Subsetting options (options resulting in modified display of the data)

--downsample rate[,shard].  Show only one in every <rate> lines (or reads in the case of FASTQ). The optional <shard> parameter indicates which of the shards is shown. Other subsetting options (if any) are applied to the surviving lines only.
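The idea of a rate and a shard can be sketched with awk on plain text lines. This is only a model of the concept, not genocat's implementation: it assumes zero-based line numbering and that shard s keeps the lines whose index modulo the rate equals s. The function name is hypothetical.

```shell
# Keep one line in every $1 lines; $2 (default 0) selects the shard.
# Assumption: zero-based line index i belongs to shard (i mod rate).
downsample_sketch() {
    awk -v rate="$1" -v shard="${2:-0}" '((NR - 1) % rate) == shard'
}

# Usage: keep every third line, taking the second shard
#   downsample_sketch 3 1 < some_lines.txt
```

Under this model, running every shard from 0 to rate-1 over the same input visits each line exactly once.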

--interleaved  For FASTQ data compressed with --pair: Show every pair of paired-end FASTQ files with their reads interleaved: first one read from the first file, then a read from the second file, then the next read from the first file, and so on.
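The output order described for --interleaved can be modeled on two uncompressed FASTQ files with awk. This is a sketch of the ordering only; it assumes every record is exactly four lines and that both files hold the same number of reads in the same order. The function name is hypothetical.

```shell
# Interleave two FASTQ files read by read.
# Assumption: 4-line records, equal read counts, matching order.
# $1 = first file (R1), $2 = second file (R2)
interleave_sketch() {
    awk -v mate="$2" '
        { print }                               # copy the current R1 line
        NR % 4 == 0 {                           # an R1 record just ended
            for (i = 0; i < 4; i++)
                if ((getline line < mate) > 0)  # pull one line of the R2 record
                    print line
        }
    ' "$1"
}

# Usage (file names are placeholders):
#   interleave_sketch reads_R1.fq reads_R2.fq > interleaved.fq
```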

-r, --regions [^]chr|chr:pos|pos|chr:from-to|chr:from-|chr:-to|from-to|from-|-to|from+len[,...].  (VCF SAM/BAM FASTA GVF 23andMe Chain) Show one or more regions of the file. Examples:

genocat myfile.vcf.genozip -r 22:1000-2000

Positions 1000 to 2000 on contig 22

genocat myfile.sam.genozip -r 22:1000+151

151 bases, starting pos 1000, on contig 22

genocat myfile.vcf.genozip -r -2000,2500-

Two ranges on all contigs

genocat myfile.sam.genozip -r chr21,chr22

Contigs chr21 and chr22 in their entirety

genocat myfile.vcf.genozip -r ^MT,Y

All contigs, excluding MT and Y

genocat myfile.vcf.genozip -r ^-1000

All contigs, excluding positions up to 1000

genocat myfile.fa.genozip  -r chrM

Contig chrM

Note: genozip files are indexed automatically during compression. There is no separate indexing step or separate index file.

Note: Indels are considered part of a region if their start position is.

Note: Multiple -r arguments may be specified - this is equivalent to chaining their regions with a comma separator in a single argument.

Note: For FASTA and Chain files, only whole-contig regions are possible.

Note: For Chain files this applies to the source contig (qName).
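The from+len form can be related to the from-to form with a little arithmetic. The sketch below assumes that both end positions of a range are inclusive, so that "151 bases, starting pos 1000" ends at position 1150; that reading is an assumption based on the example above, and the function name is hypothetical.

```shell
# Convert "from+len" to the "from-to" form.
# Assumption: both end positions are inclusive, so len bases starting at
# from end at from + len - 1.
plus_to_range() {
    from=${1%%+*}    # text before the "+"
    len=${1#*+}      # text after the "+"
    printf '%s-%s\n' "$from" "$(( from + len - 1 ))"
}

# Usage:
#   plus_to_range 1000+151
```

Under that assumption, -r 22:1000+151 and -r 22:1000-1150 would select the same positions.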

-s, --samples [^]sample[,...].  (VCF) Show a subset of samples (individuals). Examples:

genocat myfile.vcf.genozip -s HG00255,HG00256

Show two samples

genocat myfile.vcf.genozip -s ^HG00255,HG00256

Show all samples except these two

Note: This does not change the INFO data (including the AC and AN tags).

Note: Sample names are case-sensitive.

Note: Multiple -s arguments may be specified - this is equivalent to chaining their samples with a comma separator in a single argument.

-g, --grep string.  (FASTQ FASTA) Show only records in which <string> is a case-sensitive substring of the description.

-G, --drop-genotypes.  (VCF) Output the data without the samples and FORMAT column.

-H, --no-header.  Don't output the header lines.

-1, --header-one.  (VCF FASTA) VCF: Output only the last line of the header (the line with the field and sample names). FASTA: Output the sequence name up to the first space or tab.

--header-only.  Output only the header lines.

--GT-only.  (VCF) Within samples output only genotype (GT) data - dropping the other subfields.

--sequential.  (FASTA) Output in sequential format - each sequence in a single line.
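What "sequential format" means for FASTA can be shown on uncompressed text with awk: wrapped sequence lines are joined so that each sequence occupies a single line after its description line. This is a sketch of the output layout, not genocat's own code, and the function name is hypothetical.

```shell
# Join the wrapped lines of each FASTA sequence into a single line.
fasta_sequential_sketch() {
    awk '
        /^>/ {                            # description line starts a new record
            if (seq != "") print seq      # flush the previous sequence
            print
            seq = ""
            next
        }
        { seq = seq $0 }                  # join wrapped sequence lines
        END { if (seq != "") print seq }  # flush the last sequence
    '
}

# Usage (file name is a placeholder):
#   fasta_sequential_sketch < myfile.fa
```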

Analysis options

--list-chroms.  (VCF SAM BAM FASTA GVF 23andMe) List the names of the chromosomes (or contigs) included in the file.

--show-sex.  (SAM BAM) Determine whether the sample in a SAM/BAM file is male or female. See "Sex assignment" use case.

--show-coverage[=all].  (SAM BAM) Shows the coverage and depth of each contig. Without =all it shows only contigs that are chromosomes and groups the other contigs under "Other contigs". See "Coverage and Depth" use case.

--show-coverage-chrom.  (SAM BAM) Same as --show-coverage but shows only contigs that are chromosomes.

Translation options (conversion from one format to another)

--bam  (SAM and BAM only) Output as BAM. Note: this option is implicit if --output specifies a filename ending with .bam

--sam  (SAM and BAM only) Output as SAM. This option is the default in genocat on SAM and BAM data.

--no-PG  (SAM and BAM only) When converting a file from SAM to BAM or vice versa, Genozip normally adds a @PG line to the header. With this option it doesn't.

--fastq  (SAM and BAM only) Output as FASTQ. The alignments are output as FASTQ reads in the order they appear in the SAM/BAM file. Alignments with FLAG 16 (reverse complemented) have their SEQ reverse complemented and their QUAL reversed. Alignments with FLAG 4 (unmapped) or 256 (secondary) are dropped. Alignments with FLAG 64 (or 128) (the first (or last) segment in the template) have a '1' (or '2') added after the read name. Usually (if the original order of the SAM/BAM file has not been tampered with) this would result in a valid interleaved FASTQ file. Note: this option is implicit if --output specifies a filename ending with .fq[.gz] or .fastq[.gz]
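The FLAG rules listed for --fastq can be restated as a small shell sketch. It only models the rules as written above (FLAG bits 4, 16, 64, 128 and 256) and is not genocat's implementation; all function names are hypothetical.

```shell
# Report what the --fastq rules imply for a given SAM FLAG value.
# If both bit 64 and bit 128 are set, this sketch reports suffix 2.
fastq_fate() {
    flag=$1
    if [ $(( flag & 4 )) -ne 0 ] || [ $(( flag & 256 )) -ne 0 ]; then
        echo "drop"                      # unmapped or secondary
        return 0
    fi
    suffix="none"
    if [ $(( flag & 64 )) -ne 0 ]; then suffix="1"; fi    # first segment
    if [ $(( flag & 128 )) -ne 0 ]; then suffix="2"; fi   # last segment
    revcomp="no"
    if [ $(( flag & 16 )) -ne 0 ]; then revcomp="yes"; fi
    echo "keep suffix=$suffix revcomp=$revcomp"
}

# Reverse a string (used for QUAL, and as the first step for SEQ).
reverse_line() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = length($0); i > 0; i--) out = out substr($0, i, 1)
        print out
    }'
}

# Reverse complement a nucleotide sequence (A<->T, C<->G).
revcomp_seq() {
    reverse_line "$1" | tr 'ACGTacgt' 'TGCAtgca'
}

# Usage:
#   fastq_fate 83
#   revcomp_seq AACG
#   reverse_line 'IIJ#'
```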

--bcf  (VCF only) Output as BCF. Note: bcftools needs to be installed for this option to work.

--phylip  (FASTA only) Output a Multi-FASTA in Phylip format. All sequences must be the same length.

--fasta  (Phylip only) Output as Multi-FASTA.

--vcf  (23andMe only) Output as VCF. --vcf must be used in combination with --reference to specify the reference file as listed in the header of the 23andMe file (usually this is GRCh37). Note: INDEL genotypes ('DD' 'DI' 'II') as well as uncalled sites ('--') are discarded.
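A few of the translation options, wrapped as shell functions for illustration. The function names and the file names in the usage comments are hypothetical; the flags are the ones listed above. The --bam and --fastq flags are given explicitly even though, as noted, they are implied by a matching --output extension.

```shell
# $1 = compressed SAM/BAM file, $2 = output file name
to_bam() {
    genocat --bam --output "$2" "$1"
}

# $1 = compressed SAM/BAM file, $2 = output file name
to_fastq() {
    genocat --fastq --output "$2" "$1"
}

# $1 = reference file, $2 = compressed 23andMe file, $3 = output VCF name
to_vcf() {
    genocat --vcf --reference "$1" --output "$3" "$2"
}

# Usage (file names are placeholders):
#   to_bam myfile.sam.genozip myfile.bam
#   to_fastq myfile.bam.genozip myfile.fq
#   to_vcf myref.ref.genozip genome.txt.genozip genome.vcf
```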

General options

-c, --stdout  Send output to standard output instead of a file.

-f, --force  Force overwrite of the output file.

-z, --bgzf level.  Compress the output to the BGZF format (.gz extension) using libdeflate. The argument specifies the compression level, from 0 (no compression) to 12 (best yet slowest compression). If you are not sure what value to choose, 6 is a popular option. Note: by default (absent this option) genozip will attempt to re-create the same BGZF compression as in the original file. Whether genozip succeeds in re-creating the exact same BGZF compression ratio depends on the compression library used by the application that generated the original file.
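A sketch of calling --bgzf from a script, with the level checked against the documented 0 to 12 range before genocat is invoked. The function name, the error message and the file names in the usage comment are hypothetical.

```shell
# $1 = BGZF level (0-12), $2 = compressed file, $3 = output file name
cat_bgzf() {
    case $1 in
        ''|*[!0-9]*)                     # empty or not a plain integer
            echo "level must be an integer from 0 to 12" >&2
            return 1
            ;;
    esac
    if [ "$1" -gt 12 ]; then
        echo "level must be an integer from 0 to 12" >&2
        return 1
    fi
    genocat --bgzf "$1" --output "$3" "$2"
}

# Usage (file names are placeholders):
#   cat_bgzf 6 myfile.vcf.genozip myfile.vcf.gz
```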

-^, --replace  Replace the source file with the result file rather than leaving it unchanged.

-o, --output output-filename.  Output to this filename.

-p, --password password.  Provide password to access file(s) that were compressed with --password.

-x, --index  Create an index file alongside the decompressed file. The index file is created with the tool listed for each data type:

Data type    Tool used
SAM/BAM      samtools index
FASTQ        samtools faidx
FASTA        samtools faidx
VCF          bcftools index
Other types  --index not supported


-q, --quiet  Don't show the progress indicator or warnings.

-Q, --noisy  The --quiet option is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings.

-@, --threads number.  Specify the maximum number of threads. By default genozip uses all the threads it needs to maximize usage of all available cores.

-w, --show-stats   Show the internal structure of a genozip file and the associated compression statistics.

-W, --SHOW-STATS   Show more detailed statistics.

--validate  Validates that the file(s) are valid genozip files.

-h, --help[=topic]  Show this help page. Optional topic can be:

topic      description
genozip    list of genozip options
genounzip  list of genounzip options
genocat    list of genocat options
genols     list of genols options
dev        list of developer options
input      list of possible arguments of --input


-L, --license, --licence  Show the license terms and conditions for this product.

-V, --version  Display Genozip's version number.