Part 3. CNVs Filtering and Secondary Annotations
Annotation Filtering and SVTYPE Selection
This multi-step workflow filters copy number variant (CNV) calls based on annotations (i.e, functional relevance, genomic location, allele frequency) and SVTYPE.
It is mostly based on granite software.
The output vcf file is checked for integrity to ensure the format is correct and the file is not truncated.
- CWL: workflow_granite-filtering_SV_selector_plus_vcf-integrity-check.cwl
Requirements
The input is a single annotated vcf file with the CNV calls.
The annotations must include genes and transcripts information (VEP) and allele frequency from gnomAD SV (Sansa).
Gene list
The gene list step uses granite to clean VEP annotations for transcripts that are not mapping to any gene of interest (not present in the CGAP Portal). This step does not remove any variants, but only modifies VEP annotations.
Inclusion List
The inclusion list step uses granite to filter-in exonic and functionally relevant variant based on VEP annotations. This step removes a large number of CNVs from the initial call set.
Exclusion List
The exclusion list step uses granite to filter-out common variants based on gnomAD SV population allele frequency (AF > 0.01).
Variants without gnomAD SV annotations are retained.
CNV Type Selection
This step uses SV_type_selector.py script to filter out unwanted CNV types.
Currently we are not filtering any of the possible CNV types as we retain both deletions (DEL) and duplications (DUP).
Output
The output is a filtered vcf file with fewer entries than the input vcf file.
The content of the remaining entries is identical to the input (no information added or removed) minus the information removed by the gene list step.
Secondary Annotation
This workflow runs a series of scripts to add additional annotations to the CNV calls in vcf format:
liftover_hg19.py(https://github.com/dbmi-bgm/cgap-scripts) to add lift-over coordinates for breakpoint locations for hg19/GRCh37 genome buildSV_worst_and_locations.pyto add information for breakpoint locations relative to impacted transcripts and the most severe consequence from VEP annotationsSV_cytoband.pyto add Cytoband information for the breakpoint locations
SV_worst_and_locations.py also implement some filtering and can result in fewer variants in the resulting vcf that is eventually checked for integrity.
Note: These scripts only work on DEL and DUP calls. Inversions (INV), break-end (BND), and insertions (INS) are not supported.
- CWL: workflow_SV_secondary_annotation_plus_vcf-integrity-check.cwl
Requirements
This workflow requires a single vcf file with the CNV calls that went through Annotation Filtering and SVTYPE Selection.
It also requires:
- The hg38/GRCh38 to hg19/GRCh37 chain file for lift-over (https://cgap-annotations.readthedocs.io/en/latest/liftover_chain_files.html#hg38-grch38-to-hg19-grch37)
- The hg38/GRCh38 Cytoband reference file from UCSC (https://cgap-annotations.readthedocs.io/en/latest/variants_sources.html#cytoband)
Both the Cytoband annotation and the lift-over step require the END tag in the INFO field in the vcf file.
Annotation and Possible Filtering
- For
liftover_hg19.py, three lines are added to the header:
##INFO=<ID=hg19_chr,Number=.,Type=String,Description="CHROM in hg19 using LiftOver from pyliftover">
##INFO=<ID=hg19_pos,Number=.,Type=Integer,Description="POS in hg19 using LiftOver from pyliftover (converted back to 1-based)">
##INFO=<ID=hg19_end,Number=1,Type=Integer,Description="END in hg19 using LiftOver from pyliftover (converted back to 1-based)">
The data associated with these tags are also added to the INFO field of the vcf file for qualifying variants using the following criteria:
- For the lift-over process to hg19/GRCh37 coordinates, variants with successful conversions at both breakpoints will include data for the
hg19_chrand bothhg19_pos(breakpoint 1) andhg19_end(breakpoint 2) tags in the INFO field. If the conversion fails (e.g., if the coordinates do not have a corresponding location in hg19/GRCh37), the tags and any lift-over information will not be included in the output. Note that each breakpoint is treated separately, so it is possible for a variant to have data forhg19_chrandhg19_pos, but nothg19_end, orhg19_chrandhg19_end, but nothg19_pos- Given that pyliftover does not convert ranges, the single-point coordinate in hg38/GRCh38 corresponding to each variant CHROM and POS (or
END) are used as query, and the hg19/GRCh37 coordinate (result) will also be a single-point coordinate
- For
SV_worst_and_locations.py, three new fields are added to theCSQtag in INFO field initially created by VEP. These are:
Most_severe, which will have a value of1if the transcript is the most severe, and will otherwise be blankVariant_5_prime_location, which gives the location for breakpoint 1 relative to the transcript (options below)Variant_3_prime_location, which gives the location for breakpoint 2 relative to the transcript (options below)
Options for the location fields include:
Indeterminate, Upstream, Downstream, Intronic, Exonic, 5_UTR, 3_UTR, Upstream_or_5_UTR, 3_UTR_or_Downstream, or Within_miRNA.
Additionally, for each variant this step removes annotated transcripts that do not possess one of the following biotypes: protein_coding, miRNA, or polymorphic_pseudogene.
If after this cleaning a variant no longer has any annotated transcripts, that variant is also filtered out of the vcf file.
- For
SV_cytoband.py, the following two lines are added to the header:
##INFO=<ID=Cyto1,Number=1,Type=String,Description="Cytoband for SV start (POS) from hg38 cytoBand.txt.gz from UCSC">
##INFO=<ID=Cyto2,Number=1,Type=String,Description="Cytoband for SV end (INFO END) from hg38 cytoBand.txt.gz from UCSC">
Each variant will receive a Cyto1 annotation which corresponds to the Cytoband position of breakpoint 1 (which is POS in the vcf), and a Cyto2 annotation which corresponds to the Cytoband position of breakpoint 2 (which is END in the INFO field).
Output
The output is an annotated vcf file where secondary annotations are added to qualifying variants as described above.
Some variants may be additionally filtered out as described.
Length Filtering
Note: We are NOT currently length filtering BICseq2 CNV calls. The workflow is turned off by specifying a maximum length argument that is larger than chr1 (250000000 bp), the same value used to run VEP.
This step uses SV_length_filter.py to remove the largest CNVs from the calls in the vcf file. The resulting vcf file is checked for integrity.
- CWL: workflow_SV_length_filter_plus_vcf-integrity-check.cwl
Requirements
This workflow requires a single vcf file with the CNV calls and a parameter to define the maximum length allowed for the SVs.
Filtering
Currently none.
Output
This is the final vcf file that is ingested into the CGAP Portal.
VCF Annotation Cleaning
This step uses SV_annotation_VCF_cleaner.py script to remove most of VEP annotations to create a smaller vcf file for HiGlass visualization.
This improves the loading speed in the genome browser.
The resulting vcf file is checked for integrity.
- CWL: workflow_SV_annotation_VCF_cleaner_plus_vcf-integrity-check.cwl
Requirements
This workflow expects the final vcf file that is ingested into the CGAP Portal as input.
Cleaning
VEP annotations are removed from the vcf file and the REF and ALT fields are simplified using the SV_annotation_VCF_cleaner.py script.
Output
The output is a modified version of the final vcf file that is ingested into the CGAP Portal, that has been cleaned for the HiGlass genome browser.
This file is also ingested into the CGAP Portal but only used for visualization.