A universal compressor for genomic files
Genozip is a universal compressor for genomic files - it is optimized to compress FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF3/GVF, PHYLIP, Chain, Kraken and 23andMe files, but it can also compress any other file (including non-genomic files).
Typically, a 2X-5X improvement over the existing compression is achieved when compressing already-compressed files such as .fastq.gz, .bam or .vcf.gz, and much higher ratios in some other cases.
Genozip can also compress files that are already compressed in other formats (.gz, .bz2, .xz, .bam, .cram).
The compression is lossless - the decompressed file is 100% identical to the original file (some exceptions apply).
Genozip consists of four command line tools:
- genozip compresses files
- genounzip decompresses files
- genols shows metadata of compressed files and directories
- genocat is the workhorse for using genozip in analytical pipelines:
  - Display the contents of a compressed file, possibly piping it into a downstream tool
  - Subset a compressed file to show a specific part of its contents
  - Translate a compressed file to another format (e.g. BAM to FASTQ or Multi-FASTA to Phylip)
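As a rough illustration of how these tools fit together, the sketch below compresses a tiny FASTQ file, lists its metadata, and checks that genocat streams back exactly the original bytes. It is a minimal example under stated assumptions, not an excerpt from the Genozip documentation: the file name and reads are made up, `roundtrip_demo` is a hypothetical helper, and it assumes that genozip writes its output next to the input with `.genozip` appended to the name. Check each tool's `--help` output before relying on this behavior.

```shell
#!/bin/sh
# Round-trip sketch: compress a tiny FASTQ file, then confirm that genocat
# reproduces it byte for byte. The file name and reads are made up.

roundtrip_demo() {
    # Skip quietly when Genozip is not installed.
    if ! command -v genozip >/dev/null 2>&1 || ! command -v genocat >/dev/null 2>&1; then
        echo "genozip not found on PATH; skipping"
        return 0
    fi

    workdir=$(mktemp -d) || return 1

    # Two hypothetical 8-base reads in FASTQ format.
    printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\nIIIIIIII\n' \
        > "$workdir/demo.fastq"

    # Assumption: genozip writes demo.fastq.genozip next to the input.
    genozip "$workdir/demo.fastq" || { rm -rf "$workdir"; return 1; }

    # Show metadata of the compressed file.
    genols "$workdir/demo.fastq.genozip"

    # genocat streams the reconstructed contents to stdout;
    # cmp -s exits non-zero if the stream differs from the original file.
    status=0
    if genocat "$workdir/demo.fastq.genozip" | cmp -s - "$workdir/demo.fastq"; then
        echo "round trip OK"
    else
        echo "round trip MISMATCH"
        status=1
    fi

    rm -rf "$workdir"
    return "$status"
}

roundtrip_demo
```

The same pattern, genocat piped into a downstream command, applies to any tool that reads from standard input. genounzip would be used instead when the reconstructed file is needed on disk.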
- From Conda (Linux & Mac):
    conda config --add channels conda-forge
    conda install genozip
- Linux binaries (x86-64, statically linked, works on most Linux systems)
- Windows installer:
- Compile it yourself from GitHub (tested on Linux, Mac and Windows):
- Download: latest release
Publications & Citing
Lan, D., et al. (2021) Genozip: a universal extensible genomic data compressor. Bioinformatics, 37, 2225–2230.
Lan, D., et al. (2020) genozip: a fast and efficient compression tool for VCF files. Bioinformatics, 36, 4091–4092.
Lan, D. (2021) The Variant Call Format - Dual Coordinates Extension (DVCF) Specification. doi:10.6084/m9.figshare.14685816 (preprint).
Technical questions, bug reports and feature requests: email@example.com
Commercial license inquiries: firstname.lastname@example.org
Requests for support for compression of additional public or proprietary file formats: email@example.com
THIS SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.