Matching contig names of the file to those in the reference file¶
The unfortunate reality of the bioinformatics world is that contigs may appear with different names in different reference files, causing problems in analysis pipelines.
Examples:
Genozip offers a command line option, --match-chrom-to-reference
, that updates the contigs of a file to match those of the reference file.
--match-chrom-to-reference
works for all file types that have contigs: SAM/BAM, VCF, GFF3/GVF, 23andMe, and chain files.
Example
Notice that in the example below, the contig name 1 was updated to chr1 both in the SAM header and the 3rd column of the data line, which is the RNAME field.
> cat example.sam
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 1 9997 34 28M1I6M1I39M4D68M7S = 10159 324 CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:, AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33 NM:i:12 RG:Z:1
> genozip example.sam --reference hg19.p13.plusMT.full_analysis_set.ref.genozip --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)
> genocat example.sam.genozip
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 chr1 9997 34 28M1I6M1I39M4D68M7S = 10159 324 CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:, AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33 NM:i:12 RG:Z:1
Which fields are updated?
File type |
Fields updated |
---|---|
SAM, BAM |
@SQ lines in the file header; RNAME (column 3) and RNEXT (column 7); the contig name in the optional SA, OA and XA fields |
VCF |
##contig lines in the file header; CHROM (column 1) |
Chain |
qName and tName fields |
GFF3, GVF |
SequenceId (column 1), optional “chr=” attribute |
23andMe |
Chromosome (column 2) |
How are contig names converted?
Each contig name that appears in the file, is searched for in the reference file, in its unmodified form as well as variations of its name. Once a match is found in the reference file, the length of the contig, if known, is compared to make sure it is indeed the same contig. If a match is not found in the reference file, the contig name is not changed.
The variations of the contig names considered are:
For chromosomes with a numeric suffix (eg chr22) as well as chrX, chrY, chrW and chrZ - with or without the chr prefix - the name with or without a lower case chr prefix is considered.
NC_000001 to NC_000024 with a version number (eg NC_000001.10) are considered as equivalent to 1,…22,X,Y (with or without a chr prefix)
For the Mitochondria chromosome, four options are considered - M, MT, chrM, chrMT
For contigs that contain an embedded Accession Number, the following variations are considered:
Example contig name |
Interpreted as… |
---|---|
chr4_gl383528_alt |
Accession Number GL383528 version 1 |
chrUn_JTFH01001867v2_decoy |
Accession Number JTFH01001867 version 2 |
GL000192.1 |
Accession Number GL000192 version 1 |
About the reference file
The reference file (hg19.p13.plusMT.full_analysis_set.ref.genozip
in the example above) is created from a FASTA reference genome, for example:
> genozip --make-reference hg19.p13.plusMT.full_analysis_set.fa.gz
genozip hg19.p13.plusMT.full_analysis_set.fa.gz : Done (8 minutes 59 seconds)
Questions? support@genozip.com