Note on versioning:
- Major version change occurs when the Genozip file format is
extended. Note that a new Genozip version can always
uncompress files generated by an older Genozip version
(Genozip is backward-compatible starting v8)
- Minor version changes with bug fixes and minor feature
updates
- Some minor versions are skipped due to failed deployment
pipelines
13.0.20 13/July/2022
- Bug fixes
13.0.19 19/June/2022
- Bug fixes
- BAM: Better support for OQ:Z
13.0.18 21/May/2022
- Bug fixes
13.0.17 18/May/2022
- Bug fixes
13.0.16 7/April/2022
- Bug fixes
13.0.15 31/March/2022
- --tar: added support for long filenames
13.0.14 29/March/2022
- Support --tar, where file uids are very large - e.g. when
pulled from Active Directory using SSSD
13.0.13 25/March/2022
- Bug fixes
13.0.12 8/March/2022
- Bug fixes
13.0.11 23/Jan/2022
- VCF: Better compression of FORMAT/PS and FORMAT/PID
13.0.10 22/Jan/2022
- Bug fixes
13.0.9 21/Jan/2022
- Bug fixes
13.0.8 8/January/2022
- SAM/BAM/FASTQ: Better compression of BGI read names
- New advanced option: --debug-qname
13.0.7 2/January/2022
- SAM/BAM: bowtie2: Better compression of AS and YS fields
- Display warning if excessive dictionary size
- Fix bug where genounzip sometimes output uncompressed files
when a .gz-recompression was expected
13.0.6 14/December/2021
- Better compression of QNAME (SAM/BAM/FASTQ/Kraken)
- Improve memory efficiency when compressing BAM integer
arrays (eg B:i)
- New advanced option: --show-ref-diff - see the difference
between two Genozip reference files.
- New LongR codec for quality scores introduced in 13.0.5 now
used also in default mode (not just in --best as before)
13.0.5 3/December/2021
- SAM/BAM, FASTQ: Better compression of PacBio and Nanopore
quality scores (in --best mode)
- VCF: Better compression for FORMAT/PS, FORMAT/GT, FORMAT/GQ,
INFO/DP
- Illumina formats: Better compression for .locs files
- --multiseq (renamed from --multifasta) now works on FASTQ
files too, in addition to FASTA
13.0.4 19/November/2021
- Better support for EasyBuild installations - see
https://genozip.com/using-on-hpc.html
- new option: --subdirs to recursively compress subdirectories
- SAM/BAM: Better compression of CP:i
- Performance enhancements
13.0.3 15/November/2021
- VCF: better compression of FORMAT/GL in --best
- DVCF: more granular statuses, canceled --ext-ostatus
- CRAM: now compresses as BAM instead of SAM
13.0.2 8/November/2021
- Native (binary) handling of BAM integers --> faster
compression, up to 2X faster in BAM files with long arrays (eg
PacBio subreads, IonTorrent)
- Advanced option: --biopsy
13.0.1 4/November/2021
- Significant speed improvement in compression and
decompression due to faster dict_id->did_i mapping,
elimination of
thread synchronization bottlenecks and integration of fast
htscodecs.
- Faster --fast mode, and better --best mode
- SAM/BAM compression improvements: Better compression of
MAPQ, MQ:i, XS:i, TLEN, CIGAR, ms:i (biobambam)
- VCF compression improvements: better compression for sample
fields GL,PP,PL,PRI,GP,DS,AD and files generated by VarScan.
- VCF: Liftover fidelity improvements in DVCF
- VCF: genocat --indels-only now also excludes variants which
have an INFO/SVTYPE field
- FASTA compression improvements: Better compression of amino
acid sequences
- GFF/GFF3: Support more variations of the format, including
output of Maker and GFF (not only GFF3) output of Ensembl
- Accept human contig names eg NC_000001.10 as equivalent to 1
or chr1 (and similarly for human chromosomes 1-22,X,Y)
- Rename advanced option --debug-allthesame -> --debug-generate
- Advanced option: --show-containers can now accept an
argument for additional output
- Much faster --make-reference
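The contig-name equivalence noted above (NC_000001.10, 1 and chr1) can be illustrated with a minimal Python sketch. This is not Genozip's code: the function name normalize_human_contig is hypothetical, and the sketch assumes the RefSeq accessions NC_000001 to NC_000024 (chromosomes 1-22, X, Y) and ignores the accession's version suffix.

```python
import re

# Hypothetical helper, not part of Genozip: maps equivalent human contig
# names (e.g. "NC_000001.10", "chr1", "1") to one canonical short name.
_REFSEQ = re.compile(r"^NC_0000(\d\d)\.\d+$")
_SHORT_NAMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

def normalize_human_contig(name: str) -> str:
    m = _REFSEQ.match(name)
    if m:
        n = int(m.group(1))
        if 1 <= n <= 22:
            return str(n)
        if n == 23:
            return "X"
        if n == 24:
            return "Y"
        return name  # not a chromosome 1-22, X or Y accession
    # Strip "chr" only when the remainder is a known chromosome name,
    # so names such as "chrUn_..." are left untouched.
    if name.startswith("chr") and name[3:] in _SHORT_NAMES:
        return name[3:]
    return name
```

Two names refer to the same contig under this sketch when normalize_human_contig returns the same value for both.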
12.0.42 13/October/2021
- SAM/BAM: better compression for Z5:i, XM:i
- bug fixes
12.0.41 12/October/2021
- SAM/BAM: significantly better compression for sorted (by
POS) files
12.0.37 28/September/2021
- SAM/BAM: better NM:i, MD:Z, MQ:i, QNAME compression
- FASTQ: --optimize-DESC now generates read names similar to
the NCBI format, eg "@sample.6"
- New advanced option: --show-wrong-md
12.0.36 24/September/2021
- New: genozip --match-chrom-to-reference: rewrite contig
names to match the provided reference file (eg "22"->"chr22"),
see https://genozip.com/match-chrom.html
- Chain files can now be subsetted with --regions
- More accurate progress indicator
- Better support for contigs named by accession number, eg
"GL000192.1", "chrUn_JTFH01001867v2_decoy", "chr4_gl383528_alt"
- New advanced option: --debug-seg
- Advanced option --show-ref-alts renamed --show-chrom2ref
- faster --downsample when used with very large values
- Relax section 2f of the license
- Removed obsolete --with-chr
- Bug fixes
12.0.34 9/September/2021
- Compile without -march=native for genozip-linux-x86_64 and
Windows Installer distributions
- Better support for IUPAC "bases" in reference files
- DVCF: detect many-to-one coordinate mapping and generate an
.overlaps file
- VCF: better INFO/ANN, INFO/CLNHGVS, INFO/CSQ compression
- SAM/BAM/FASTQ/FASTA/Kraken: better compression of QNAME /
Description
- SAM/BAM: Faster compression and decompression of SA, OA, XA
with long reads
- Faster --test
- More robust compression from a URL, even on flakey
connections
- Improved --STATS report
- bug fixes
12.0.33 31/August/2021
- Fix backward compatibility issue of decompressing FASTQ
files compressed with --pair in v9.0.12 or earlier
- FASTQ: genocat --seq-only and genocat --qual-only: output
only the Sequence / Quality lines
- DVCF: genocat --single-coord : Generates single-coordinate
("normal") VCF. Can be used with or without --luft
- DVCF: genocat --contigs --luft: show the contigs of the
Luft coordinate
- FASTA: support --make-reference when contigs in FASTA are
sequential (i.e. not broken into short lines)
- SAM/BAM/FASTQ/FASTA/Kraken: better compression of QNAME /
Description
- SAM/BAM/FASTQ: Faster compression of long reads
- bug fixes
12.0.32 26/August/2021
- Faster loading of reference files
12.0.31 23/August/2021
- Better GFF3 compression
- Many minor bug fixes and cosmetics
12.0.30 15/August/2021
- Extended the genocat --fastq option for converting SAM/BAM
to FASTQ: now --fastq=all emits all the SAM/BAM fields in the
FASTQ description lines.
12.0.26 13/August/2021
- FASTA improvements: compression improvement for amino acid
FASTAs; support --downsample ; support .fas and .frn filename
extensions
12.0.25 9/August/2021
- DVCF: added tag renaming ; RengAlg attribute of ##INFO and
##FORMAT now enclosed in quotes.
- VCF: better compression for INFO/CLNDN, INFO/CLNHGVS,
INFO/RS, INFO/ALLELEID
- When loading implicit reference files with a relative path,
first try path relative to current directory, and then
relative to file's directory.
12.0.14 29/July/2021
- FASTQ: support files where the 3rd line is a copy of the 1st
line, except with '+' prefix instead of '@'
- DVCF: a. lift-over of complex indels ; b. consistent sorting
of lines that have the same CHROM/POS c. alt chrom names also
if VCF has no contigs in header
- snips with len > 512K (technical improvement)
12.0.13
- Option change: --grep for fastq now tests the entire read,
not just the description
- Option change: --interleaved now has an optional parameter
--interleave=either or --interleave=both (default: both)
describing how to handle in case of subsetting with eg --grep
12.0.12 24/July/2021
- Better support for compressing gff3 files
12.0.10 + 12.0.11 22/July/2021
- bug fixes
12.0.7 + 12.0.8 16/July/2021
- license and copyright update
12.0.6 15/July/2021
- new option: --tar to archive with genozip. See:
https://genozip.com/archiving.html
- new option: --files-from/-T - An alternative to providing
input file names on the command line
- bug fixes
12.0.5 9/July/2021
- Compressing low-coverage SAM/BAM files without a reference:
better compression ratio, better decompression performance
12.0.4 8/July/2021
- safer implementation of --replace
- updated non-commercial license
- bug fixes
12.0.3 6/July/2021
- new option: --licfile. See genozip.com/using-on-hpc.html
12.0.2 2/July/2021
a. Dual-coordinate VCF files: genozip --chain to create a
dual-coordinates file ; genocat --luft ("lifted") to view it in
Luft coordinates. See: https://genozip.com/dvcf.html
b. Filtering by taxonomy using kraken2 files - support for
compressing kraken output files --kraken, --taxid, see:
https://genozip.com/kraken.html.
c. Many other improvements:
- Support genocat --sort for VCF files ; --sort implied for
dual-coordinates VCF files. Disabled by --unsorted.
- Fixed bug with subsetting samples in VCF (genocat --sample).
The fix will work on files compressed with 11.0.9 onwards.
- Now, genounzip always unbinds files, and genocat always
concatenates
d. New options:
- new option: genocat --component <component-number> - to view
a single component of a bound file (including one of the fastq
files in a paired file)
- new option: genocat --lines [start]-[last] - show a subset
of lines of the file.
- new option: genocat --head [num_lines] - show lines from the
start of the file.
- new option: genocat --tail [num_lines] - show lines from the
end of the file.
- new option: genocat --count - displays the number of lines
(or reads in the case of FASTQ) that
survived any filters applied (--regions, --grep, --taxid,
--FLAG, --MAPQ, --bases, --component, --one-vb etc)
- new option: --show-filename - Show the file name for each
file
- new option: genocat --FLAG - filters a SAM or BAM file by
the value of the FLAG field
- new option: genocat --MAPQ - filters a SAM or BAM file by
the value of the MAPQ field
- new option: genocat --bases - filters a SAM/BAM/FASTQ for
SEQ base values
- new option: --echo displays the command line and a timestamp
upon completion of execution (successful or failed)
- new option: genocat --regions-file - reads regions from a
file, as an alternative to --regions
- new option: genocat --show-chain - for chain files - show
chain file alignments
- new option: genocat --show-chain-contigs - for chain files -
show contig list
- new option: genocat --with-chr - for chain files - changes
eg 22->chr22 and MT->chrM for all qNames
- discontinued support for GTShark codec - use genozip v11 to
decompress old VCF files compressed with --gtshark
- VCF: better compression of FORMAT/F2R1, INFO/MLEAC, INFO/AA,
FORMAT/MB, FORMAT/SB, FORMAT/ADALL, FORMAT/ADF, FORMAT/ADR,
FORMAT/AF, FORMAT/SAC
e. Option changes:
- genozip - restructured optimization options for VCF:
--optimize-phred, --GL-to-PL, --GP-to-PP
- genounzip now always unbinds files, and the --unbind option is
canceled. --prefix can now be used to set a prefix.
- genocat --fastq will NOT add /1 or /2 to R1 and R2 reads in
the case that --FLAG is specified as well
- genocat --grep now works with most data types, --grep-w
restricts matches to whole words
- genocat --samples now also accepts a number - "--samples 5"
shows the first 5 samples
- genocat --validate=valid displays filenames that are valid
genozip files. No change when used without "=valid".
- genols --unbind (-u) option is renamed --list (-l). genozip
--list option is canceled.
- --sex, --coverage, --stats and --STATS replace --show-sex,
--show-coverage, --show-stats, --SHOW-STATS respectively
- --chroms, and --contigs are accepted as alternative names
for --list-chroms
- Setting the environment variable GENOZIP_REFERENCE is now
equivalent to --reference
- Default number of threads is now 75% of cores for Windows
and Mac and 110% of cores for Linux (modifiable with --threads)
f. New advanced options:
- new option: genocat --show-dvcf - shows line-by-line result
of the liftover (applied to a dual coordinate VCF file)
- new option: genozip --show-kraken - used in combination with
--taxid
- new option: --show-uncompress. Shows uncompressing of
section data.
- new option: --show-flags. Shows internal flags after
initialization.
- new option: --show-plan. Shows reconstruction plan
- new option: --show-ref-iupacs. Show non-ACGT iupac codes in
a reference file
- new option: --debug-stats. For debugging development of
stats.c
- new option: --debug-allthesame. For debugging development of
the allthesame algorithm
g. Much improved website genozip.com
11.0.11 24/March/2021
- Fix bug with concatenating multiple files with genocat
- genocat <file1> <file2>.... now will show the header of only
the first file. To show headers of all files use
genounzip --stdout instead.
11.0.9 20/March/2021
- Fix bug with --downsample in combination with --interleaved
11.0.8 9/March/2021
- Added sharding with genocat --downsample <rate>,<shard>
- Added support for compressing UCSC chain files
- Better --show-coverage
- Bug fixes
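One way to picture sharding with --downsample <rate>,<shard> is the Python sketch below. It is an assumption-laden illustration, not Genozip's code: the function name downsample is hypothetical, and the rule that shard s keeps the items whose 0-based index modulo rate equals s is assumed here rather than taken from the documentation. For FASTQ the unit would be a read rather than a line.

```python
from typing import Iterable, Iterator

def downsample(items: Iterable[str], rate: int, shard: int = 0) -> Iterator[str]:
    """Keep one in every `rate` items; `shard` selects which one (assumed rule)."""
    if rate < 1 or not 0 <= shard < rate:
        raise ValueError("need rate >= 1 and 0 <= shard < rate")
    for i, item in enumerate(items):
        if i % rate == shard:
            yield item
```

Under this assumed rule, the shards 0 to rate-1 are disjoint and together cover the whole file, which is the property that makes sharding useful for splitting work.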
11.0.7 5/March/2021
- Added genozip --idxstats - identical output to samtools
idxstats
- Much improved genocat --show-sex and --show-coverage
- Bug fixes
11.0.6 2/March/2021
- Added genocat --show-coverage and --show-coverage-chrom
- Added "Male-XXY" result to genocat --show-sex
- Added genocat --validate
- Better hash table sizing algorithm - reduced memory
consumption
- Developer tools: Added <bytes> option to
--debug-memory[=bytes]
- Developer: add kill -USR1 - --show-memory of a running
process
- Bug fixes
11.0.5 27/Feb/2021
- Added genocat --show-sex for sex assignment of a SAM/BAM file
- Improve Windows installer
- Windows: add genozip directory to Path in registry, if not
already there
- Bug fixes
11.0.4 20/Feb/2021
- bug fixes
11.0.3 20/Feb/2021
- Added registration requirement to the non-commercial license
(2.d.)
- Bug fixes
- windows installer relocated from windows/ to docs/
11.0.2 13/Feb/2021
- Bug fixes
11.0.0 11/Feb/2021
- VCF: introduce a PBWT based codec for compression of the
haplotype matrix. Retire hapmat and gtshark codecs.
Backward compatibility is provided for decompressing VCF
files compressed in earlier versions of genozip with hapmat or
gtshark
- SAM: better handling of optional fields SA, OA, XA
- Better memory management in Linux
- Reduce core oversubscription from 1.4 to 1.2
- Add --multifasta option for better compression of a FASTA
where the contigs are quite similar to each other
10.0.9 11/Jan/2021
- bug fixes
10.0.8 10/Jan/2021
- VCF: better handling of INFO/SF
10.0.5 8/Jan/2021
- VCF: better handling of FORMAT fields DP, AD, ADF, ADR,
AD_ALL, PL and INFO fields DP, BaseCounts
10.0.4 8/Jan/2021
- VCF: better handling of FORMAT/DS
- Bug fixes
10.0.3 7/Jan/2021
- Better --gtshark mode for VCF
10.0.2 7/Jan/2021
- Better memory usage in ZIP (canceled Context.node_i)
- Better handling of VCF haplotype matrices with heterogeneous
ploidy
- Bug fixes
10.0.0 31/Dec/2020
- Increased MAX_FIELDS from 64 to 2048. This sets the maximum
number of INFO and FORMAT tags in VCF,
maximum number of optional fields in SAM/BAM and maximum
ATTR in GVF.
- Set size of vblock dynamically
- VCF: support FORMAT tags that begin with a character other
than a letter (eg a digit)
- VCF: better handling of INFO arrays
- VCF: better handling of VEP fields: CSQ, DP_HIST, GQ_HIST,
AGE_HISTOGRAM_HET, AGE_HISTOGRAM_HOM
- VCF: better handling of FORMAT/DP and FORMAT/GQ - transposed
matrix
- Several other bug fixes
- Backward compatible with Genozip 8 and 9 - v8 and
v9-compressed files can be read by v10
9.0.22 28/Dec/2020
- more consistent --bgzf, --sam, --bam behavior in genocat
- better --stats
- minor bug fixes
9.0.21
- bug fixes, including major bug with mc:i optional field in SAM
9.0.20 27/Dec/2020
- allow --output to a named pipe (fifo) (not available on
Windows)
- genounzip --bgzf now requires a level parameter (0 to 12). 0
means no compression, and hence --plain flag is canceled.
- bug fixes
9.0.17 20/Dec/2020
- refactor access to the reference file - to using memory
mapping and cache files - a lot faster and consumes less memory
- when compressing a .gz (or BAM), test BGZF blocks against
zlib too (with all compression levels), in addition to
libdeflate
- append /1 and /2 to the qname in fastq files in both
--interleave of a paired fastq file and --fastq of a sam/bam
file
- renamed --test-seg to --seg-only
- bug fixes
9.0.15 14/Dec/2020
- bug fixes and minor improvements
9.0.14 12/Dec/2020
- added better selection of stdout vs stderr for messages
(info_stream)
- bug fixes
9.0.13 10/Dec/2020
- added genocat --interleave: displays pairs of FASTQ files
compressed with --pair with their reads interleaved.
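Interleaving a pair of FASTQ files, as genocat --interleave does for files compressed with --pair, can be pictured with this minimal Python sketch. The function name interleave is hypothetical and the sketch works on in-memory lists of complete read records; it does not reflect how Genozip reconstructs the data internally.

```python
from typing import Iterator, Sequence

def interleave(r1: Sequence[str], r2: Sequence[str]) -> Iterator[str]:
    """Yield reads alternately: first read of R1, first of R2, second of R1, ...

    Each element is assumed to be one complete FASTQ record (all four lines).
    """
    if len(r1) != len(r2):
        raise ValueError("paired files must contain the same number of reads")
    for read1, read2 in zip(r1, r2):
        yield read1
        yield read2
```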
9.0.12 8/Dec/2020
- bug fix
9.0.11 7/Dec/2020
- Fixed critical bug introduced in 9.0.0 in which FASTQ files
that were compressed with BGZF (i.e. fq.gz),
and genozipped with --pair, did not compress correctly
- Added Phylip data type
- genozip --pair can now compress any number of fastq
files - every 2 consecutive files are considered a pair
- genocat --header-one now works on FASTA too: Output the
sequence name up to the first space or tab
- genocat --phylip new translator - outputs a multi-fasta
file in Phylip format
- genocat --fasta new translator - outputs a Phylip file
in multi-fasta format
- bug fixes
- Developer options:
--xthreads Use only 1 thread for the main PIZ/ZIP
dispatcher. This doesn't affect thread use of other dispatchers
--show-headers now accepts a section-type as an optional
argument
9.0.10 2/Dec/2020
- Added the --index option for genounzip / genocat to create
an index file alongside the decompressed file
9.0.8-9 2/Dec/2020
- bug fixes
9.0.7 1/Dec/2020
- New flags:
genocat --downsample <rate> - show only one in every X
lines (or reads)
genocat --one-vb <vb> - show data from a single VB
- bug fixes
9.0.1-6 1/Dec/2020
- bug fixes and minor improvements
9.0.0 29/Nov/2020
Functionality:
- Native compression of BAM (no longer using samtools for BAM)
- Native reading and writing of BGZF data
- New data type: "generic" for compressing any file beyond our
supported genomic formats
- Framework supports file translations SAM->BAM, BAM->SAM,
SAM/BAM->FASTQ, 23andMe->VCF
- Framework supports binary source files
- Backwards compatible with v8 - v8-compressed files can be
read by v9
- When decompressing a file that was originally compressed
with BGZF (eg BAM, fq.gz...) - the BGZF blocks are
reconstructed,
with an attempt to guess the original compression level
- File is now always verified - if md5 is not selected, then
Adler32 is used
- New / changed flags:
--sam (new flag) for genounzip/genocat - reconstruct a
SAM/BAM file as SAM
--bam (new flag) for genounzip/genocat - reconstruct a
SAM/BAM file as BAM
--no-PG (new flag) refrain from adding a @PG record to the
header when converting SAM->BAM or BAM->SAM
--fastq (new flag) for genounzip/genocat - reconstruct a
SAM/BAM file as FASTQ
--vcf (new flag) genounzip/genocat - reconstruct a 23andMe
file as a VCF
--plain (new flag) in genounzip / genocat - negates
implicit --bgzf
--dump-local and --dump-b250 (renamed from dump-one-local
and dump-one-b250) now output a file per VB
--bytes (new flag) for genols - show sizes in bytes
--dump-section (new flag)
--show-bgzf (new flag) for genozip - show bgzf blocks
--show-containers (new flag) for genounzip/genocat - show
flow of container reconstruction
--show-time can now accept an optional argument eg.
--show-time=compressor
--show-txt-contigs - shows contigs from the SAM/BAM header
(SQ lines)
--show-mutex - shows locks and unlocks of a particular
mutex
--unbind in genols (new flag) - shows the components of
bound files
--show-dict and --show-b250 now accept an optional parameter
+ removed --show-one-dict and --show-one-b250
--show-digest show (md5 or Adler32) updates
--stdout - flag canceled for compression (genozip),
available for decompression (genounzip, genocat, genozip -d)
--input - renamed from --input-type
Compression improvements:
- For b250 sections that have all the same entry - store the
entry only once. If the entry is word_index=0, drop the section
- Improvements in codec assignment algorithm, and use it for
dictionary and some other section types in addition to b250
and local
- 30% improvement in dictionary size on disk due to
consolidation of fragments and codec assignment.
- Multi-threaded decompression of dictionaries.
- Speed improvements by having bsc and zlib use libdeflate's
version of adler32 and crc32
Cleanup:
- removed support for Visual C compiler
8.0.4 8/Nov/2020
- 10X improvement in --gtshark speed by moving to in-memory
comms using fifo
- fix thread safety issue in bit_array.c
8.0.3 23/Oct/2020
- Support samtools with or without --no-PG
- Fix reading and writing BAM files using samtools
- Fix bug in genocat --show-headers
- Add back gtshark as a codec for VCF allele data, --gtshark
option
8.0.2 20/Oct/2020
- Bug fixes
- Improved 'genocat --show-headers'
8.0.0 16/Oct/2020
- Added libbsc codec
- Dynamic selection of codec between lzma, bz2, bsc for each
local and b250 buffer
- --show-ref-seq can now work in combination with --regions in
genocat/genounzip
- Better license registration flow
- Consume ~0.5GB (for human data) less RAM in genounzip of SAM
files compressed without a reference
- In --regions, allow specification of ranges using length eg
"chr22:1000+151" - equivalent to "chr22:1000-1150"
- Canceled optimize-SEQ (benefits were tiny if any, but it
slowed down --optimize considerably)
- Added --best to contrast --fast. --best doesn't have any
additional effect as it's the default mode of genozip.
- Added =prefix option to --unbind, to add a prefix when
unbinding
- --reference in genounzip is now optional - will use original
reference filename absent --reference
- Not backward compatible
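The length-based range syntax for --regions noted above ("chr22:1000+151" being equivalent to "chr22:1000-1150") implies an inclusive end of start + length - 1. A minimal Python sketch of that conversion follows; the function name length_to_end is hypothetical and this is not Genozip's parser.

```python
import re

# Matches "<contig>:<start>+<length>", e.g. "chr22:1000+151".
_LENGTH_RANGE = re.compile(r"^([^:]+):(\d+)\+(\d+)$")

def length_to_end(region: str) -> str:
    """Rewrite a start+length region as an equivalent start-end region."""
    m = _LENGTH_RANGE.match(region)
    if not m:
        return region  # not in start+length form: leave unchanged
    chrom, start, length = m.group(1), int(m.group(2)), int(m.group(3))
    if length < 1:
        raise ValueError("length must be at least 1")
    # The end position is inclusive, hence the -1.
    return f"{chrom}:{start}-{start + length - 1}"
```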
7.0.5 10/Oct/2020
- Add --show-stats and --SHOW-STATS to genocat/genounzip by
introducing a new section SEC_STATS ; remove limitation of
only one file when -w or -W
7.0.4 4/Oct/2020
- Bug fixes
7.0.3 3/Oct/2020
- Bug fixes
7.0.2 2/Oct/2020
- Even better SAM BD/BI codec
7.0.1 29/Sep/2020
- bug fixes
- new --test-seg debug option
- change default number of threads to 1.4 * number of cores
7.0.0 28/Sep/2020
- Re-write the VCF segmenter to use the modern infrastructure
of recursive data definition. In the process, some little-used
features were discontinued: --gtshark, --sblocks. Non-GT
subfields are now compressed as is (not transposed), and each
field on its own. Samples as well as the GT field are
defined as Structured.
- Removed gloptimization - too small of a benefit for
non-standard code
- Change all data types to be fully recursive starting at
TOPLEVEL, removing data-type specific reconstruction loop
- Added caching of Structured in PIZ
- Better BD and BI compression for SAM
- Not backward compatible
- Bug fixes
6.0.11 21/Aug/2020
- Bug fixes
6.0.3 19/Aug/2020
- Added new data type for reference files - and an option for
creating a reference file from a FASTA - --make-reference
- Added compression against reference for FASTQ, SAM and VCF -
new options --reference and --REFERENCE
- Added --pair to compress pairs of paired-end fastq files
together, resulting in significantly better compression
- Added Domqual compression method, for handling dominant
quality scores such as Illumina binned quality scores in FASTQ
and SAM
- Added ACGT compression codec for nucleotide sequences
- Added support for compressing CRAM files
- Added better compression for FORMAT/PS, INFO/AC, INFO/AF,
INFO/AN, INFO/SVLEN in VCF
- Added --optimize-DESC for FASTQ optimization
- Added --optimize-SEQ for FASTQ, FASTA, SAM optimization
- Added many options including --list-chroms,
--dump-one-local, --show-reference, --show-ref-index,
--show-ref-seq, --show-chrom2ref,
--show-ref-contigs, --show-ref-hash
- Removed backward compatibility with versions v1 and v5. Use
genozip version 5 to decompress files of all previous versions.
5.0.9 16/June/2020
- fix bug with compressing VCF / GVF with an INFO / ATTRS
field of '.'
5.0.7 2/June/2020
- bug fixes
5.0.5 31/May/2020
- Updated license
- Added user registration
- Added full support for compressing SAM/BAM, FASTQ, FASTA,
GVF and 23andMe files
- Compression improvements for VCF files with any of these:
1. lots of non-GT FORMAT subfields
2. ID data
3. END INFO subfield
4. MIN_DP FORMAT subfield
- Added genounzip output options: --bcf for VCF files and
--bam for SAM files
- Added --input-type - tell genozip what type of file this is
- if re-directing or file has non-standard extension
- Added --stdin-size - tell genozip the size of a redirected
input file, for faster execution
- Added --show-index for genounzip and genocat - see index
embedded in a genozip file
- Added --fast option for (a lot) faster compression, with
(somewhat) reduced compression ratio
- Added --grep for genocat FASTQ
- Added --debug-progress and --show-hash, useful mostly for
genozip developers
- Reduce default vblock from 128MB to 16MB
- Cancel option --strip
list
Note: some version numbers are skipped due to failed conda
builds (every build attempt consumes a version number)
4.0.11 30/March/2020
- bug fixes
4.0.10 28/March/2020
- updated license
- added --header-one to genocat
- query user whether to overwrite an existing file
- better error messages when running external tools
- bug fixes
4.0.9 27/March/2020
- improve performance for --samples --drop-genotypes --gt-only
--strip and --regions - skip reading and decompressing
all unneeded sections (previously partially implemented, now
complete)
- bug fixes
4.0.6 25/March/2020
- bug fixes
4.0.4 24/March/2020
- add support for compressing a file directly from a URL
- remove support for 32-bit Windows (it's been broken for a
while)
4.0.2 23/March/2020
- genozip can now compress .bcf .bcf.gz .bcf.bgz and .xz files
- genounzip can now de-compress into a bgzip-ed .vcf.gz file
4.0.0 21/March/2020
- fixed a bug that existed in versions 2.x.x and 3.x.x, related to
an edge case in compression of INFO subfields.
Fixing the bug resulted in the corrected, intended file
format, which is slightly different from that used in v2/3.
Because of this file format change, we are increasing the
major version number. Backward compatibility is provided
for correctly decompressing all files compressed with v2/3.
- VCF files that contain lines with Windows-style line ending
\r\n will now compress losslessly preserving the line
ending
3.0.12 20/March/2020
- added genocat --GT-only
3.0.11 20/March/2020
- added genocat --strip
3.0.9 19/March/2020
- bug fixes
3.0.2 18/March/2020
- changed default number of sample blocks from 1024 for
non-gtshark and 16384 in gtshark to 4096 for both modes.
- bug fixes
3.0.0 17/March/2020
- added --gtshark allowing the final stage of allele
compression to be done with gtshark (provided it is installed
on the computer and accessible on the path) instead of the
default bzlib. This required a change to the genozip
file format and hence increment in major version. As usual,
genozip is backward compatible -
newer versions of genozip can uncompress files compressed
with older versions.
2.1.4 16/March/2020
- rewrote the Hash subsystem -
(1) by removing a thread synchronization bottleneck, genozip
now scales better with number of cores (esp better in files
with very large dictionaries)
(2) more advanced shared memory management reduces the
overall memory consumption of hash tables, and allows to make
them bigger - improving speed
- --show-sections now shows all dictionaries, not just FORMAT
and INFO
- added optimization for VQSLOD
2.1.3 14/March/2020
- Fixed bug in optimization in GL in --optimize
2.1.2 13/March/2020
- Added --optimize and within it optimization for PL and GL
2.1.1 12/March/2020
- Reduced thread serialization to improve CPU core scalability
- New developer options --show-threads and --debug-memory
- Many bug fixes
- Improved help text
2.1.0 9/March/2020
- Rewrote VCF file data reader to avoid redundant copies and
passes on the data
- Moved to size-constrained rather than number-of-lines
constrained variant blocks - change in --vblocks logic.
- Make MD5 calculation non-default, requires --md5. genounzip
--test possible only if file was compressed with --md5
- Improved memory consumption for large VCFs with a single or
small number of samples
2.0.0 6/March/2020
- New genozip file format
- backward compatibility to decompress genozip v1 files
- Columns 1-9 (CHROM to FORMAT) are now put into their own
dictionaries (except REF and ALT that are compressed together)
- Each INFO tag is its own dictionary
- --vblock for setting the variant block size
- Allow variant blocks larger than 64K and set the default
variant block size based on the number of samples to balance
compression ratio with memory consumption.
- --sblock for setting the sample block size
- change haplotype permutation to keep within sample block
boundaries
- create "random access" (index) section
- new genozip header section with payload that is list of all
sections - at end of file
- due to random access, .genozip files must be read from a
file only and can no longer be streamed from stdin during
genounzip / genocat
- all dictionaries are moved to the end of the genozip file,
and are read upfront before any VB, to facilitate random
access.
- genocat --regions to filter specific chromosomes and
regions. these are accessed via random access
- genocat --samples to see specific samples only
- genocat --no-header to skip showing the VCF header
- genocat --header-only to show only the VCF header
- genocat --drop-genotypes to show only columns CHROM-INFO
- Many new developer --show-* options (see genozip -h -f)
- Better, more compressible B250 encoding
- --test for both genozip (compresses and then tests) and
genounzip (tests without outputting)
- Support for --output in genocat
- Added --noisy which overrides default --quiet when
outputting to stdout (using --stdout, or the default in
genocat)
- --list can now show metadata for encrypted files too
- Many bug fixes, performance and memory consumption
optimizations
1.1.3 7/Feb/2020
- --unbind option - required storing the VCF header of all
files, and keeping md5 for both the bound file and each
component
- Improvement in memory and thread management - to reduce
memory consumption when compressing very large files (100s of
GB to TBs)
- Separate --help text for each command
- Optimize MD5 performance (move to 32b and eliminate memory
copying)
- Many bug fixes.