A universal compressor for genomic files
Genozip is a universal compressor for genomic files. It is optimized to compress FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF3/GVF, PHYLIP, Chain, Kraken and 23andMe files, but it can also compress any other file (including non-genomic files).
Typically, Genozip achieves a 2X-5X improvement over the existing compression when compressing already-compressed files such as .fastq.gz, .bam and .vcf.gz, and much higher ratios in some other cases.
Yes, Genozip can compress already-compressed files (.gz, .bz2, .xz, .bam, .cram).
The compression is lossless - the decompressed file is 100% identical to the original file (some exceptions apply).
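One way to check the lossless claim on your own data is to compare a checksum of the original file with a checksum of the file obtained after a compress/decompress round trip. The Python sketch below is illustrative only and is not part of Genozip. The helper names `content_sha256` and `is_lossless_round_trip` and all file paths are hypothetical. For .gz inputs it hashes the decompressed payload, on the assumption that the gzip container bytes may differ after recompression even when the content is the same.

```python
import gzip
import hashlib
from pathlib import Path


def content_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's content.

    Files ending in .gz are hashed after gzip decompression, so two
    gzip files with the same payload but different container bytes match.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    digest = hashlib.sha256()
    with opener(path, "rb") as handle:
        # Read in chunks so large genomic files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless_round_trip(original, restored):
    """True when both files carry byte-identical content."""
    return content_sha256(original) == content_sha256(restored)
```

After compressing and then decompressing a copy of a file, calling `is_lossless_round_trip` on the original and the restored copy would be expected to return `True`, barring the exceptions mentioned above.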
Genozip consists of four command line tools:
- genozip compresses files
- genounzip decompresses files
- genols shows metadata of compressed files and directories
- genocat is the workhorse for using genozip in analytical pipelines:
  - Display the contents of a compressed file - possibly piping it into a downstream tool
  - Subset a compressed file - show a specific part of its contents
  - Translate a compressed file to another format (eg BAM to FASTQ or Multi-FASTA to Phylip)
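As a rough illustration of how the four tools fit together, the Python sketch below assembles one command line per tool and can optionally run it. It assumes each tool accepts the file path as a positional argument. The function names `build_commands` and `run_if_installed` and the file names are hypothetical. No tool options are shown, since this page does not describe them.

```python
import shutil
import subprocess


def build_commands(original, compressed):
    """Return one argv list per Genozip tool, keyed by tool name.

    `original` is the input file and `compressed` is the compressed file.
    Both names are supplied by the caller; nothing is derived here.
    """
    return {
        "genozip": ["genozip", original],        # compress
        "genols": ["genols", compressed],        # show metadata
        "genocat": ["genocat", compressed],      # write contents to stdout
        "genounzip": ["genounzip", compressed],  # decompress
    }


def run_if_installed(argv):
    """Run argv when its executable is on PATH; return None otherwise."""
    if shutil.which(argv[0]) is None:
        return None
    # capture_output keeps stdout available for a downstream step.
    return subprocess.run(argv, check=True, capture_output=True)
```

For example, `build_commands("sample.bam", "sample.bam.genozip")["genocat"]` gives `["genocat", "sample.bam.genozip"]`, whose standard output could then be piped into a downstream tool.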
- From Conda (Linux & Mac):
  conda config --add channels conda-forge
  conda install genozip
- Linux binaries (x86-64, statically linked, works on most Linux systems)
- Windows installer:
- Compile it yourself from GitHub (tested on Linux, Mac and Windows):
- Download: latest release
Publications & Citing
Lan, D., et al. (2021) Genozip: a universal extensible genomic data compressor. Bioinformatics, 37, 2225–2230.
Lan, D., et al. (2020) genozip: a fast and efficient compression tool for VCF files. Bioinformatics, 36, 4091–4092.
Lan, D. (2021) The Variant Call Format - Dual Coordinates Extension (DVCF) Specification. doi:10.6084/m9.figshare.14685816 (preprint).
Technical questions, bug reports and feature requests: firstname.lastname@example.org
Commercial license inquiries: email@example.com
Requests for support for compression of additional public or proprietary file formats: firstname.lastname@example.org
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS, COPYRIGHT HOLDERS OR DISTRIBUTORS OF THIS SOFTWARE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.