Release Notes

Note on versioning:
Major version change occurs when genozip file format changes 
(Genozip is backward-compatible starting v8 - i.e. newer 
genounzip/genocat can decompress old files)
Minor version changes with bug fixes and minor feature updates
Some minor versions are skipped due to failed deployment 
pipelines
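
As an illustration of the compatibility rule in the note above, here is a minimal Python sketch. The helper names are hypothetical and are not part of genozip. It models only the stated "backward-compatible starting v8" rule and assumes that files written by the same major version are always readable.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a version string such as "12.0.42" into integer components."""
    return tuple(int(part) for part in version.split("."))


def can_decompress(tool_version: str, file_version: str) -> bool:
    """Hypothetical check: can a tool of tool_version read a file of file_version?

    Assumption: cross-major compatibility is modeled only for files written
    by v8 or later. Any compatibility between earlier major versions is
    deliberately not modeled here (conservative answer: False).
    """
    tool_major = parse_version(tool_version)[0]
    file_major = parse_version(file_version)[0]
    if tool_major == file_major:
        return True  # assumed: same major version means same file format
    return file_major >= 8 and tool_major > file_major
```

For example, under these assumptions a 12.x genounzip can read a file written by 8.0.0, while a 9.x genounzip cannot read a file written by 12.0.2.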

12.0.42 13/October/2021
- SAM/BAM: better compression for Z5:i, XM:i
- bug fixes

12.0.41 12/October/2021
- SAM/BAM: significantly better compression for sorted (by 
POS) files 

12.0.37 28/September/2021
- SAM/BAM: better NM:i, MD:Z, MQ:i, QNAME compression
- FASTQ: --optimize-DESC now generates read names similar to 
the NCBI format, eg "@sample.6"
- New advanced option: --show-wrong-md

12.0.36 24/September/2021
- New: genozip --match-chrom-to-reference: rewrite contig 
names to match the provided reference file (eg "22"->"chr22"), 
see https://genozip.com/match-chrom.html
- Chain files can now be subsetted with --regions
- More accurate progress indicator
- Better support for contigs named by accession number, eg 
"GL000192.1", "chrUn_JTFH01001867v2_decoy", "chr4_gl383528_alt"
- New advanced option: --debug-seg
- Advanced option --show-ref-alts renamed --show-chrom2ref
- faster --downsample when used with very large values 
- Relax section 2f of the license
- Removed obsolete --with-chr
- Bug fixes
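
The contig renaming described for --match-chrom-to-reference (eg "22"->"chr22") can be sketched as follows. This is only an illustration of the idea, not genozip's implementation: the function name is hypothetical, and the rule set (add or strip a "chr" prefix, treat "MT" and "chrM" as equivalent) is an assumption.

```python
from typing import Set


def match_chrom_to_reference(name: str, reference_contigs: Set[str]) -> str:
    """Return the reference's spelling of a contig name, if one can be found.

    Hypothetical helper: tries the name as-is, then the variant with the
    "chr" prefix added or removed. Unmatched names are returned unchanged.
    """
    if name in reference_contigs:
        return name
    if name.startswith("chr"):
        stripped = name[3:]
        candidate = "MT" if stripped == "M" else stripped
    else:
        candidate = "chrM" if name == "MT" else "chr" + name
    return candidate if candidate in reference_contigs else name
```

With a reference containing "chr22" and "chrM", the sketch maps "22" to "chr22" and "MT" to "chrM", and leaves a name such as "GL000192.1" untouched.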

12.0.34 9/September/2021
- Compile without -march=native for genozip-linux-x86_64 and 
Windows Installer distributions
- Better support for IUPAC "bases" in reference files
- DVCF: detect many-to-one coordinate mapping and generate an 
.overlaps file
- VCF: better INFO/ANN, INFO/CLNHGVS, INFO/CSQ compression
- SAM/BAM/FASTQ/FASTA/Kraken: better compression of QNAME / 
Description 
- SAM/BAM: Faster compression and decompression of SA, OA, XA 
with long reads 
- Faster --test
- More robust compression from a URL, even on flakey 
connections
- Improved --STATS report
- bug fixes

12.0.33 31/August/2021
- Fix backward compatibility issue of decompressing FASTQ 
files compressed with --pair in v9.0.12 or earlier 
- FASTQ: genocat --seq-only and genocat --qual-only: output 
only the Sequence / Quality lines
- DVCF:  genocat --single-coord : Generates single-coordinate 
("normal") VCF. Can be used with or without --luft
- DVCF:  genocat --contigs --luft: show the contigs of the 
Luft coordinate 
- FASTA: support --make-reference when contigs in FASTA are 
sequential (i.e. not broken into short lines)
- SAM/BAM/FASTQ/FASTA/Kraken: better compression of QNAME / 
Description 
- SAM/BAM/FASTQ: Faster compression of long reads
- bug fixes

12.0.32 26/August/2021
- Faster loading of reference files

12.0.31 23/August/2021
- Better GFF3 compression
- Many minor bug fixes and cosmetics

12.0.30 15/August/2021
- Extended the genocat --fastq option for converting SAM/BAM 
to FASTQ: now --fastq=all emits all the SAM/BAM fields in the 
FASTQ description lines.

12.0.26 13/August/2021
- FASTA improvements: compression improvement for amino acid 
FASTAs;  support --downsample ; support .fas and .frn filename 
extensions

12.0.25 9/August/2021
- DVCF: added tag renaming ; RengAlg attribute of ##INFO and 
##FORMAT now enclosed in quotes. 
- VCF: better compression for INFO/CLNDN, INFO/CLNHGVS, 
INFO/RS, INFO/ALLELEID
- When loading implicit reference files with a relative path, 
first try path relative to current directory, and then 
relative to file's directory.

12.0.14 29/July/2021
- FASTQ: support files where the 3rd line is a copy of the 1st 
line, except with '+' prefix instead of '@'
- DVCF: a. lift-over of complex indels ; b. consistent sorting 
of lines that have the same CHROM/POS c. alt chrom names also 
if VCF has no contigs in header
- Support for snips with len > 512K (technical improvement)
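
The FASTQ variant mentioned above (3rd line repeating the 1st line, with '+' in place of '@') can be recognized with a check like the one below. This is a minimal sketch with a hypothetical function name. It says nothing about how genozip stores such lines.

```python
def third_line_is_valid(first_line: str, third_line: str) -> bool:
    """Accept a bare '+' or a '+'-prefixed copy of the description line."""
    if not first_line.startswith("@") or not third_line.startswith("+"):
        return False
    # Either the common bare "+" form, or the description repeated after "+"
    return third_line == "+" or third_line[1:] == first_line[1:]
```

For a read named "@sample.6", both "+" and "+sample.6" pass, while "+sample.7" does not.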

12.0.13
- Option change: --grep for fastq now tests the entire read, 
not just the description
- Option change: --interleaved now has an optional parameter: 
--interleave=either or --interleave=both (default: both), 
describing how to handle subsetting with eg --grep

12.0.12 24/July/2021
- Better support for compressing gff3 files

12.0.10 + 12.0.11 22/July/2021
- bug fixes

12.0.7 + 12.0.8 16/July/2021
- license and copyright update

12.0.6 15/July/2021
- new option: --tar to archive with genozip. See: 
https://genozip.com/archiving.html
- new option: --files-from/-T - An alternative to providing 
input file names on the command line
- bug fixes

12.0.5 9/July/2021
- Compressing low-coverage SAM/BAM files without a reference: 
better compression ratio, better decompression performance

12.0.4 8/July/2021
- safer implementation of --replace
- updated non-commercial license
- bug fixes

12.0.3 6/July/2021
- new option: --licfile. See genozip.com/using-on-hpc.html

12.0.2 2/July/2021
a. Dual-coordinate VCF files: genozip --chain to create a 
dual-coordinates file ; genocat --luft to view it in Luft 
("lifted") coordinates. See: https://genozip.com/dvcf.html
b. Filtering by taxonomy using kraken2 files - support for 
compressing kraken output files --kraken, --taxid, see: 
https://genozip.com/kraken.html.
c. Many other improvements:
- Support genocat --sort for VCF files ; --sort implied for 
dual-coordinates VCF files. Disabled by --unsorted.
- Fixed bug with subsetting samples in VCF (genocat --sample). 
The fix will work on files compressed with 11.0.9 onwards.
- Now, genounzip always unbinds files, and genocat always 
concatenates
d. New options:
- new option: genocat --component <component-number> - to view 
a single component of a bound file (including one of the fastq 
files in a paired file)
- new option: genocat --lines [start]-[last] - show a subset 
of lines of the file.
- new option: genocat --head [num_lines] - show lines from the 
start of the file.
- new option: genocat --tail [num_lines] - show lines from the 
end of the file.
- new option: genocat --count - displays the number of lines 
(or reads in the case of FASTQ), that
  survived any filters applied (--regions, --grep, --taxid, 
--FLAGS, --MAPQ, --bases, --component, --one-vb etc)
- new option: --show-filename - Show the file name for each 
file
- new option: genocat --FLAG - filters a SAM or BAM file by 
the value of the FLAG field
- new option: genocat --MAPQ - filters a SAM or BAM file by 
the value of the MAPQ field
- new option: genocat --bases - filters a SAM/BAM/FASTQ for 
SEQ base values
- new option: --echo displays the command line and a timestamp 
upon completion of execution (successful or failed)
- new option: genocat --regions-file - reading regions from a 
file, as an alternative to --regions
- new option: genocat --show-chain - for chain files - show 
chain file alignments
- new option: genocat --show-chain-contigs - for chain files - 
show contig list
- new option: genocat --with-chr - for chain files - changes 
eg 22->chr22 and MT->chrM for all qNames  
- discontinued support for GTShark codec - use genozip v11 to 
decompress old VCF files compressed with --gtshark
- VCF: better compression of FORMAT/F2R1, INFO/MLEAC, INFO/AA, 
FORMAT/MB, FORMAT/SB, FORMAT/ADALL, FORMAT/ADF, FORMAT/ADR, 
FORMAT/AF, FORMAT/SAC
e. Option changes:
- genozip - restructured optimization options for VCF: 
--optimize-phred, --GL-to-PL, --GP-to-PP
- genounzip now always unbinds files, and the --unbind option 
is canceled. The --prefix option can now be used to set a prefix.
- genocat --fastq will NOT add /1 or /2 to R1 and R2 reads in 
the case that --FLAG is specified as well
- genocat --grep now works with most data types, --grep-w 
restricts to whole words
- genocat --samples now also accepts a number - "--samples 5" 
shows the first 5 samples
- genocat --validate=valid displays filenames that are valid 
genozip files. No change when used without "=valid".
- genols --unbind (-u) option is renamed --list (-l). genozip 
--list option is canceled.
- --sex, --coverage, --stats and --STATS replace --show-sex, 
--show-coverage, --show-stats, --SHOW-STATS respectively
- --chroms, and --contigs are accepted as alternative names 
for --list-chroms
- Setting the environment variable GENOZIP_REFERENCE is now 
equivalent to --reference
- Default number of threads is now 75% of cores for Windows 
and Mac and 110% of cores for Linux (modifiable with --threads)
f. New advanced options:
- new option: genocat --show-dvcf - shows line-by-line result 
of the liftover (applied to a dual coordinate VCF file)
- new option: genozip --show-kraken - used in combination with 
--taxid
- new option: --show-uncompress. Shows uncompressing of 
section data.
- new option: --show-flags. Shows internal flags after 
initialization.
- new option: --show-plan. Shows reconstruction plan
- new option: --show-ref-iupacs. Show non-ACGT iupac codes in 
a reference file
- new option: --debug-stats. For debugging development of 
stats.c
- new option: --debug-allthesame. For debugging development of 
the allthesame algorithm
g. Much improved website genozip.com
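
The new default thread count from section e above (75% of cores on Windows and Mac, 110% on Linux) can be illustrated with the sketch below. The function name and the platform strings are hypothetical, and rounding up with a minimum of one thread is an assumption, since the notes do not say how fractions are handled.

```python
def default_threads(cores: int, platform_name: str) -> int:
    """Hypothetical default thread count: 110% of cores on Linux, else 75%.

    Integer arithmetic is used to avoid floating-point surprises
    (eg 10 * 1.1 is slightly above 11 in binary floating point).
    Rounding up is an assumption, not documented behavior.
    """
    percent = 110 if platform_name == "linux" else 75
    return max(1, (cores * percent + 99) // 100)
```

Under these assumptions an 8-core Linux machine would default to 9 threads and an 8-core Windows or Mac machine to 6. In all cases the value can be overridden with --threads.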

11.0.11 24/March/2021
- Fix bug with concatenating multiple files with genocat
- genocat <file1> <file2>.... now will show the header of only 
the first file. To show headers of all files use 
  genounzip --stdout instead.

11.0.9  20/March/2021
- Fix bug with --downsample in combination with --interleaved

11.0.8  9/March/2021
- Added sharding with genocat --downsample <rate>,<shard>
- Added support for compressing UCSC chain files
- Better --show-coverage
- Bug fixes
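
One way to picture sharding with --downsample <rate>,<shard>: each shard selects a different one-in-<rate> subset, so that the shards together cover the whole file. The sketch below is an assumption about the semantics (0-based shard numbers, selection by position modulo rate), not a description of genocat's actual selection rule.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def downsample(items: Iterable[T], rate: int, shard: int = 0) -> Iterator[T]:
    """Yield one in every `rate` items; `shard` picks which one (assumed 0-based)."""
    if rate < 1 or not 0 <= shard < rate:
        raise ValueError("rate must be >= 1 and shard must be in [0, rate)")
    for index, item in enumerate(items):
        if index % rate == shard:
            yield item
```

In this model, running with shards 0, 1 and 2 at rate 3 splits the input into three disjoint parts.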

11.0.7  5/March/2021
- Added genozip --idxstats - identical output to samtools 
idxstats
- Much improved genocat --show-sex and --show-coverage
- Bug fixes

11.0.6  2/March/2021
- Added genocat --show-coverage and --show-coverage-chrom
- Added "Male-XXY" result to genocat --show-sex
- Added genocat --validate
- Better hash table sizing algorithm - reduced memory 
consumption
- Developer tools: Added <bytes> option to 
--debug-memory[=bytes]
- Developer: add kill -USR1 - --show-memory of a running 
process
- Bug fixes

11.0.5  27/Feb/2021
- Added genocat --show-sex for sex assignment of a SAM/BAM file
- Improve Windows installer
- Windows: add genozip directory to Path in registry, if not 
already there
- Bug fixes

11.0.4  20/Feb/2021
- bug fixes

11.0.3  20/Feb/2021
- Added registration requirement to the non-commercial license 
(2.d.)
- Bug fixes
- windows installer relocated from windows/ to docs/

11.0.2  13/Feb/2021
- Bug fixes

11.0.0  11/Feb/2021
- VCF: introduce a PBWT based codec for compression of the 
haplotype matrix. Retire hapmat and gtshark codecs. 
  Backward compatibility is provided for decompressing VCF 
files compressed in earlier versions of genozip with hapmat or 
gtshark
- SAM: better handling of optional fields SA, OA, XA
- Better memory management in Linux
- Reduce core oversubscription from 1.4 to 1.2
- Add --multifasta option for better compression of a FASTA 
where the contigs are quite similar to each other

10.0.9  11/Jan/2021
- bug fixes

10.0.8  10/Jan/2021
- VCF: better handling of INFO/SF

10.0.5  8/Jan/2021
- VCF: better handling of FORMAT fields DP, AD, ADF, ADR, 
AD_ALL, PL and INFO fields DP, BaseCounts

10.0.4  8/Jan/2021
- VCF: better handling of FORMAT/DS
- Bug fixes

10.0.3  7/Jan/2021
- Better --gtshark mode for VCF

10.0.2  7/Jan/2021
- Better memory usage in ZIP (canceled Context.node_i)
- Better handling of VCF haplotype matrices with heterogeneous 
ploidy
- Bug fixes

10.0.0  31/Dec/2020
- Increased MAX_FIELDS from 64 to 2048. This sets the maximum 
number of INFO and FORMAT tags in VCF, 
  maximum number of optional fields in SAM/BAM and maximum 
ATTR in GVF.
- Set size of vblock dynamically
- VCF: support FORMAT tags that begin with a character other 
than a letter (eg a digit)
- VCF: better handling of INFO arrays
- VCF: better handling of VEP fields: CSQ, DP_HIST, GQ_HIST, 
AGE_HISTOGRAM_HET, AGE_HISTOGRAM_HOM
- VCF: better handling of FORMAT/DP and FORMAT/GQ - transposed 
matrix
- Several other bug fixes
- Backward compatible with Genozip 8 and 9 - v8 and 
v9-compressed files can be read by v10

9.0.22  28/Dec/2020
- more consistent --bgzf, --sam, --bam behavior in genocat
- better --stats
- minor bug fixes

9.0.21
- bug fixes, including a major bug with the mc:i optional field in SAM

9.0.20  27/Dec/2020
- allow --output to a named pipe (fifo) (not available on 
Windows)
- genounzip --bgzf now requires a level parameter (0 to 12). 0 
means no compression, and hence --plain flag is canceled.
- bug fixes 

9.0.17  20/Dec/2020
- refactor access to the reference file - to using memory 
mapping and cache files - a lot faster and consumes less memory
- when compressing a .gz (or BAM), test BGZF blocks against 
zlib too (with all compression levels), in addition to 
libdeflate
- append /1 and /2 to the qname in fastq files in both 
--interleave of a paired fastq file and --fastq of a sam/bam 
file
- renamed --test-seg to --seg-only
- bug fixes 

9.0.15  14/Dec/2020
- bug fixes and minor improvements

9.0.14  12/Dec/2020
- added better selection of stdout vs stderr for messages 
(info_stream)
- bug fixes

9.0.13  10/Dec/2020
- added genocat --interleave: displays pairs of FASTQ files 
compressed with --pair with their reads interleaved.

9.0.12  8/Dec/2020
- bug fix

9.0.11  7/Dec/2020
- Fixed critical bug introduced in 9.0.0 in which FASTQ files 
that were compressed with BGZF (i.e. fq.gz),
  and genozipped with --pair, did not compress correctly
- Added Phylip data type
- genozip --pair       can now compress any number of fastq 
files - every 2 consecutive files are considered a pair
- genocat --header-one now works for FASTA too: Output the 
sequence name up to the first space or tab
- genocat --phylip     new translator - outputs a multi-fasta 
file in Phylip format
- genocat --fasta      new translator - outputs a Phylip file 
in multi-fasta format
- bug fixes
- Developer options:
    --xthreads Use only 1 thread for the main PIZ/ZIP 
dispatcher. This doesn't affect thread use of other dispatchers
    --show-headers now accepts a section-type as an optional 
argument
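
The --header-one behavior described above (sequence name up to the first space or tab) amounts to a simple truncation. A minimal sketch, with a hypothetical function name:

```python
import re


def header_one(description_line: str) -> str:
    """Keep a FASTA description line only up to the first space or tab."""
    return re.split(r"[ \t]", description_line, maxsplit=1)[0]
```

For example, ">chr1 Homo sapiens chromosome 1" becomes ">chr1".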
  
9.0.10  2/Dec/2020
- Added the --index option for genounzip / genocat to create 
an index file alongside the decompressed file

9.0.8-9  2/Dec/2020
- bug fixes

9.0.7  1/Dec/2020
- New flags:
    genocat --downsample <rate> - show only one in every X 
lines (or reads)
    genocat --one-vb <vb>       - show data from a single VB
- bug fixes

9.0.1-6 1/Dec/2020
- bug fixes and minor improvements

9.0.0 29/Nov/2020
Functionality:
- Native compression of BAM (no longer using samtools for BAM)
- Native reading and writing of BGZF data
- New data type: "generic" for compressing any file beyond our 
supported genomic formats
- Framework supports file translations SAM->BAM, BAM->SAM, 
SAM/BAM->FASTQ, 23andMe->VCF
- Framework supports binary source files
- Backwards compatible with v8 - v8-compressed files can be 
read by v9
- When decompressing a file that was originally compressed 
with BGZF (eg BAM, fq.gz...) - the BGZF blocks are 
reconstructed,
  with an attempt to guess the original compression level
- File is now always verified - if md5 is not selected, then 
Adler32 is used
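
The last item (Adler32 when md5 is not selected) can be illustrated with Python's standard library. This sketch only shows the choice between the two digests over streamed data. The function name is hypothetical, and how genozip stores or compares the digest is not covered.

```python
import hashlib
import zlib
from typing import Iterable


def digest(chunks: Iterable[bytes], use_md5: bool = False) -> str:
    """Return a hex digest of the data: MD5 if requested, otherwise Adler-32."""
    if use_md5:
        md5 = hashlib.md5()
        for chunk in chunks:
            md5.update(chunk)
        return md5.hexdigest()
    value = 1  # Adler-32 starts from 1
    for chunk in chunks:
        value = zlib.adler32(chunk, value)  # running checksum across chunks
    return format(value & 0xFFFFFFFF, "08x")
```

Adler-32 is much cheaper to compute than MD5 but is only a 32-bit checksum, which is presumably why MD5 remains available as an option.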

- New / changed flags: 
    --sam (new flag) for genounzip/genocat - reconstruct a 
SAM/BAM file as SAM
    --bam (new flag) for genounzip/genocat - reconstruct a 
SAM/BAM file as BAM
    --no-PG (new flag) refrain from adding a @PG record to the 
header when converting SAM->BAM or BAM->SAM
    --fastq (new flag) for genounzip/genocat - reconstruct a 
SAM/BAM file as FASTQ
    --vcf (new flag) genounzip/genocat - reconstruct a 23andMe 
file as a VCF
    --plain (new flag) in genounzip / genocat - negates 
implicit --bgzf
    --dump-local and --dump-b250 (renamed from dump-one-local 
and dump-one-b250) now output a file per VB
    --bytes (new flag) for genols - show sizes in bytes
    --dump-section (new flag)
    --show-bgzf (new flag) for genozip - show bgzf blocks
    --show-containers (new flag) for genounzip/genocat - show 
flow of container reconstruction
    --show-time can now accept an optional argument eg. 
--show-time=compressor
    --show-txt-contigs - shows contigs from the SAM/BAM header 
(SQ lines)
    --show-mutex - shows locks and unlocks of a particular 
mutex
    --unbind in genols (new flag) - shows the components of 
bound files
    --show-dict and --show-b250 now accept an optional parameter 
+ removed --show-one-dict and --show-one-b250
    --show-digest - shows digest (md5 or Adler32) updates
    --stdout - flag canceled for compression (genozip), 
available for decompression (genounzip, genocat, genozip -d)
    --input - renamed from --input-type
    
Compression improvements:
- For b250 sections that have all the same entry - store the 
entry only once. If the entry is word_index=0, drop the section
- Improvements in codec assignment algorithm, and use it for 
dictionary and some other section types in addition to b250 
and local
- 30% improvement in dictionary size on disk due to 
consolidation of fragments and codec assignment.
- Multi-threaded decompression of dictionaries.
- Speed improvements by having bsc and zlib use libdeflate's 
version of adler32 and crc32

Cleanup:
- removed support for Visual C compiler

8.0.4 8/Nov/2020
- 10X improvement in --gtshark speed by moving to in-memory 
comms using fifo
- fix thread safety issue in bit_array.c 

8.0.3 23/Oct/2020
- Support samtools with or without --no-PG
- Fix reading and writing BAM files using samtools
- Fix bug in genocat --show-headers
- Add back gtshark as a codec for VCF allele data, --gtshark 
option

8.0.2 20/Oct/2020
- Bug fixes
- Improved 'genocat --show-headers'

8.0.0 16/Oct/2020
- Added libbsc codec
- Dynamic selection of codec between lzma, bz2, bsc for each 
local and b250 buffer
- --show-ref-seq can now work in combination with --regions in 
genocat/genounzip
- Better license registration flow
- Consume ~0.5GB (for human data) less RAM in genounzip of SAM 
files compressed without a reference 
- In --regions, allow specification of ranges using length eg 
"chr22:1000+151" - equivalent to "chr22:1000-1150"
- Canceled optimize-SEQ (benefits were tiny if any, but it 
slowed down --optimize considerably)
- Added --best to contrast --fast. --best doesn't have any 
additional effect as it's the default mode of genozip.
- Added =prefix option to --unbind, to add a prefix when 
unbinding
- --reference in genounzip is now optional - will use original 
reference filename absent --reference
- Not backward compatible
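
The length-based range syntax for --regions introduced in this version ("chr22:1000+151" being equivalent to "chr22:1000-1150") implies inclusive coordinates: end = start + length - 1. A minimal parsing sketch follows. The function name is hypothetical, and it handles only the two forms shown in the example, not the full --regions grammar.

```python
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Parse "chrom:start-end" or "chrom:start+length" into (chrom, start, end).

    Coordinates are inclusive, so a range of `length` positions starting at
    `start` ends at start + length - 1.
    """
    chrom, _, span = region.partition(":")
    if "+" in span:
        start_text, length_text = span.split("+", 1)
        start = int(start_text)
        end = start + int(length_text) - 1
    else:
        start_text, end_text = span.split("-", 1)
        start, end = int(start_text), int(end_text)
    return chrom, start, end
```

Both parse_region("chr22:1000+151") and parse_region("chr22:1000-1150") return ("chr22", 1000, 1150).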

7.0.5 10/Oct/2020
- Add --show-stats and --SHOW-STATS to genocat/genounzip by 
introducing a new section SEC_STATS ; remove limitation of 
only one file when -w or -W

7.0.4 4/Oct/2020
- Bug fixes

7.0.3 3/Oct/2020
- Bug fixes

7.0.2 2/Oct/2020
- Even better SAM BD/BI codec

7.0.1 29/Sep/2020
- bug fixes
- new --test-seg debug option
- change default number of threads to 1.4 * number of cores

7.0.0 28/Sep/2020
- Re-write the VCF segmenter to use the modern infrastructure 
of recursive data definition. In the process, some little-used
  features were discontinued: --gtshark, --sblocks. Non-GT 
subfields are now compressed as is (not transposed), and each  
  field on its own. Samples as well as the GT field are 
defined as Structured.
- Removed gloptimization - too small of a benefit for 
non-standard code
- Change all data types to be fully recursive starting at 
TOPLEVEL, removing data-type specific reconstruction loop
- Added caching of Structured in PIZ
- Better BD and BI compression for SAM
- Not backward compatible
- Bug fixes

6.0.11 21/Aug/2020
- Bug fixes

6.0.3 19/Aug/2020
- Added new data type for reference files - and an option for 
creating a reference file from a FASTA - --make-reference
- Added compression against reference for FASTQ, SAM and VCF - 
new options --reference and --REFERENCE
- Added --pair to compress pairs of paired-end fastq files 
together, resulting in significantly better compression
- Added Domqual compression method, for handling dominant 
quality scores such as Illumina binned quality scores in FASTQ 
and SAM
- Added ACGT compression codec for nucleotide sequences
- Added support for compressing CRAM files
- Added better compression for FORMAT/PS, INFO/AC, INFO/AF, 
INFO/AN, INFO/SVLEN in VCF
- Added --optimize-DESC for FASTQ optimization
- Added --optimize-SEQ for FASTQ, FASTA, SAM optimization
- Added many options including --list-chroms, 
--dump-one-local, --show-reference, --show-ref-index, 
--show-ref-seq, --show-chrom2ref,
  --show-ref-contigs, --show-ref-hash
- Removed backward compatibility with versions v1 and v5. Use 
genozip version 5 to decompress files of all previous versions.

5.0.9 16/June/2020
- fix bug with compressing VCF / GVF with an INFO / ATTRS 
field of '.'

5.0.7 2/June/2020
- bug fixes

5.0.5 31/May/2020
- Updated license
- Added user registration
- Added full support for compressing SAM/BAM, FASTQ, FASTA, 
GVF and 23andMe files
- Compression improvements for VCF files with any of these:
    1. lots of non-GT FORMAT subfields 
    2. ID data 
    3. END INFO subfield 
    4. MIN_DP FORMAT subfield
- Added genounzip output options: --bcf for VCF files and 
--bam for SAM files
- Added --input-type - tell genozip what type of file this is 
- if re-directing or file has non-standard extension
- Added --stdin-size - tell genozip the size of a redirected 
input file, for faster execution
- Added --show-index for genounzip and genocat - see index 
embedded in a genozip file
- Added --fast option for (a lot) faster compression, with 
(somewhat) reduced compression ratio
- Added --grep for genocat FASTQ
- Added --debug-progress and --show-hash, useful mostly for 
genozip developers
- Reduce default vblock from 128MB to 16MB
- Cancel option --strip
Note: some version numbers are skipped due to failed conda 
builds (every build attempt consumes a version number)

4.0.11 30/March/2020
- bug fixes

4.0.10 28/March/2020
- updated license
- added --header-one to genocat
- query user whether to overwrite an existing file
- better error messages when running external tools
- bug fixes

4.0.9 27/March/2020
- improve performance for --samples --drop-genotypes --gt-only 
--strip and --regions - skip reading and decompressing
  all unneeded sections (previously partially implemented, now 
complete)
- bug fixes
  
4.0.6 25/March/2020
- bug fixes

4.0.4 24/March/2020
- add support for compressing a file directly from a URL
- remove support for 32-bit Windows (it's been broken for a 
while)

4.0.2 23/March/2020
- genozip can now compress .bcf .bcf.gz .bcf.bgz and .xz files
- genounzip can now de-compress into a bgzip-ed .vcf.gz file

4.0.0 21/March/2020
- Fixed a bug that existed in versions 2.x.x and 3.x.x, related to 
an edge case in compression of INFO subfields. 
  Fixing the bug resulted in a corrected file format that is 
slightly different from the one used in v2/3.
  Because of this file format change, we are increasing the 
major version number. Backward compatibility is provided
  for correctly decompressing all files compressed with v2/3.

- VCF files that contain lines with Windows-style line ending 
\r\n will now compress losslessly preserving the line 
  ending

3.0.12 20/March/2020
- added genocat --GT-only

3.0.11 20/March/2020
- added genocat --strip

3.0.9 19/March/2020
- bug fixes

3.0.2 18/March/2020
- changed default number of sample blocks from 1024 for 
non-gtshark and 16384 in gtshark to 4096 for both modes.
- bug fixes

3.0.0 17/March/2020
- added --gtshark allowing the final stage of allele 
compression to be done with gtshark (provided it is installed
  on the computer and accessible on the path) instead of the 
default bzlib. This required a change to the genozip 
  file format and hence increment in major version. As usual, 
genozip is backward compatible -
  newer versions of genozip can uncompress files compressed 
with older versions.

2.1.4 16/March/2020
- rewrote the Hash subsystem - 
  (1) by removing a thread synchronization bottleneck, genozip 
now scales better with number of cores (esp better in files 
with very large dictionaries)
  (2) more advanced shared memory management reduces the 
overall memory consumption of hash tables, and allows to make 
them bigger - improving speed
- --show-sections now shows all dictionaries, not just FORMAT 
and INFO
- added optimization for VQSLOD

2.1.3 14/March/2020
- Fixed bug in optimization in GL in --optimize

2.1.2 13/March/2020
- Added --optimize and within it optimization for PL and GL

2.1.1 12/March/2020
- Reduced thread serialization to improve CPU core scalability
- New developer options --show-threads and --debug-memory
- Many bug fixes
- Improved help text

2.1.0 9/March/2020
- Rewrote VCF file data reader to avoid redundant copies and 
passes on the data
- Moved to size-constrained rather than number-of-lines 
constrained variant blocks - change in --vblocks logic.
- Make MD5 calculation non-default, requires --md5. genounzip 
--test possible only if file was compressed with --md5
- Improved memory consumption for large VCFs with a single or 
small number of samples

2.0.0 6/March/2020
- New genozip file format 
- backward compatibility to decompress genozip v1 files
- Columns 1-9 (CHROM to FORMAT) are now put into their own 
dictionaries (except REF and ALT that are compressed together)
- Each INFO tag is its own dictionary
- --vblock for setting the variant block size
- Allow variant blocks larger than 64K and set the default 
variant block size based on the number of samples to balance 
  compression ratio with memory consumption.
- --sblock for setting the sample block size
- change haplotype permutation to keep within sample block 
boundaries
- create "random access" (index) section 
- new genozip header section with payload that is list of all 
sections - at end of file
- due to random access, .genozip files must be read from a 
file only and can no longer be streamed from stdin during 
genounzip / genocat
- all dictionaries are moved to the end of the genozip file, 
and are read upfront before any VB, to facilitate random 
access.
- genocat --regions to filter specific chromosomes and 
regions. these are accessed via random access
- genocat --samples to see specific samples only
- genocat --no-header to skip showing the VCF header
- genocat --header-only to show only the VCF header
- genocat --drop-genotypes to show only columns CHROM-INFO
- Many new developer --show-* options (see genozip -h -f)
- Better, more compressible B250 encoding
- --test for both genozip (compresses and then tests) and 
genounzip (tests without outputting)
- Support for --output in genocat
- Added --noisy which overrides default --quiet when 
outputting to stdout (using --stdout, or the default in 
genocat)
- --list can now show metadata for encrypted files too
- Many bug fixes, performance and memory consumption 
optimizations

1.1.3 7/Feb/2020
- --unbind option - required storing the VCF header of all 
files, and keeping md5 for both the bound file and each 
component
- Improvement in memory and thread management - to reduce 
memory consumption when compressing very large files (100s of 
GB to TBs)
- Separate --help text for each command
- Optimize MD5 performance (move to 32b and eliminate memory 
copying)
- Many bug fixes.