Part 3. CNVs Filtering and Secondary Annotations

Annotation Filtering and `SVTYPE` Selection

This multi-step workflow filters copy number variant (CNV) calls based on annotations (i.e, functional relevance, genomic location, allele frequency) and SVTYPE. It is mostly based on granite software. The output vcf file is checked for integrity to ensure the format is correct and the file is not truncated.

CWL: workflow_granite-filtering_SV_selector_plus_vcf-integrity-check.cwl

Requirements

The input is a single annotated vcf file with the CNV calls. The annotations must include genes and transcripts information (VEP) and allele frequency from gnomAD SV (Sansa).

Gene list

The gene list step uses granite to clean VEP annotations for transcripts that are not mapping to any gene of interest (not present in the CGAP Portal). This step does not remove any variants, but only modifies VEP annotations.

Inclusion List

The inclusion list step uses granite to filter-in exonic and functionally relevant variant based on VEP annotations. This step removes a large number of CNVs from the initial call set.

Exclusion List

The exclusion list step uses granite to filter-out common variants based on gnomAD SV population allele frequency (AF > 0.01). Variants without gnomAD SV annotations are retained.

CNV Type Selection

This step uses SV_type_selector.py script to filter out unwanted CNV types. Currently we are not filtering any of the possible CNV types as we retain both deletions (DEL) and duplications (DUP).

Output

The output is a filtered vcf file with fewer entries than the input vcf file. The content of the remaining entries is identical to the input (no information added or removed) minus the information removed by the gene list step.

20 Unrelated Filtering

This step uses 20_unrelated_SV_filter.py script to identify common and artifactual CNVs in 20 unrelated individuals and filter them out. CNV calls for each of the 20 unrelated individuals were generated with BICseq2 (see: https://cgap-annotations.readthedocs.io/en/latest/unrelated_references.html).

CWL: workflow_20_unrelated_SV_filter_plus_vcf-integrity-check.cwl

Requirements

The input is a single annotated vcf file with the CNV calls, alongside a tar archive of the vcf files with the CNV calls for the 20 unrelated individuals. This step currently work with both DEL and DUP (which are provided to the SVTYPE argument), accounting for all possible CNV types.

Matching and Filtering

When comparing CNV calls from the input vcf file to the calls for an unrelated vcf file, the following logic applies to define a match:

SVTYPE must match

breakpoints at 5’ end must be +/- 50 bp from each other

breakpoints at 3’ end must be +/- 50 bp from each other

CNVs must reciprocally overlap by a minimum of 80%

This produces a filtered vcf file that only contains CNVs shared by a maximum of n individuals. The default is currently n = 1, such that CNVs shared by 2 or more of the 20 unrelated individuals are filtered out.

Output

The output is a filtered vcf file containing fewer entries compared to the input vcf file. The variants that remain after filtering will receive an additional annotation, UNRELATED=n, where n is the number of matches found within the 20 unrelated CNV calls.

Secondary Annotation

This workflow runs a series of scripts to add additional annotations to the CNV calls in vcf format:

liftover_hg19.py (https://github.com/dbmi-bgm/cgap-scripts) to add lift-over coordinates for breakpoint locations for hg19/GRCh37 genome build

SV_worst_and_locations.py to add information for breakpoint locations relative to impacted transcripts and the most severe consequence from VEP annotations

SV_cytoband.py to add Cytoband information for the breakpoint locations

SV_worst_and_locations.py also implement some filtering and can result in fewer variants in the resulting vcf that is eventually checked for integrity.

Note: These scripts only work on DEL and DUP calls. Inversions (INV), break-end (BND), and insertions (INS) are not supported.

CWL: workflow_SV_secondary_annotation_plus_vcf-integrity-check.cwl

Requirements

This workflow requires a single vcf file with the CNV calls that went through Annotation Filtering and SVTYPE Selection. It also requires:

The hg38/GRCh38 to hg19/GRCh37 chain file for lift-over (https://cgap-annotations.readthedocs.io/en/latest/liftover_chain_files.html#hg38-grch38-to-hg19-grch37)

The hg38/GRCh38 Cytoband reference file from UCSC (https://cgap-annotations.readthedocs.io/en/latest/variants_sources.html#cytoband)

Both the Cytoband annotation and the lift-over step require the END tag in the INFO field in the vcf file.

Annotation and Possible Filtering

For liftover_hg19.py, three lines are added to the header:

##INFO=<ID=hg19_chr,Number=.,Type=String,Description="CHROM in hg19 using LiftOver from pyliftover">
##INFO=<ID=hg19_pos,Number=.,Type=Integer,Description="POS in hg19 using LiftOver from pyliftover (converted back to 1-based)">
##INFO=<ID=hg19_end,Number=1,Type=Integer,Description="END in hg19 using LiftOver from pyliftover (converted back to 1-based)">

The data associated with these tags are also added to the INFO field of the vcf file for qualifying variants using the following criteria:

For the lift-over process to hg19/GRCh37 coordinates, variants with successful conversions at both breakpoints will include data for the hg19_chr and both hg19_pos (breakpoint 1) and hg19_end (breakpoint 2) tags in the INFO field. If the conversion fails (e.g., if the coordinates do not have a corresponding location in hg19/GRCh37), the tags and any lift-over information will not be included in the output. Note that each breakpoint is treated separately, so it is possible for a variant to have data for hg19_chr and hg19_pos, but not hg19_end, or hg19_chr and hg19_end, but not hg19_pos

Given that pyliftover does not convert ranges, the single-point coordinate in hg38/GRCh38 corresponding to each variant CHROM and POS (or END) are used as query, and the hg19/GRCh37 coordinate (result) will also be a single-point coordinate

For SV_worst_and_locations.py, three new fields are added to the CSQ tag in INFO field initially created by VEP. These are:

Most_severe, which will have a value of 1 if the transcript is the most severe, and will otherwise be blank

Variant_5_prime_location, which gives the location for breakpoint 1 relative to the transcript (options below)

Variant_3_prime_location, which gives the location for breakpoint 2 relative to the transcript (options below)

Options for the location fields include: Indeterminate, Upstream, Downstream, Intronic, Exonic, 5_UTR, 3_UTR, Upstream_or_5_UTR, 3_UTR_or_Downstream, or Within_miRNA.

Additionally, for each variant this step removes annotated transcripts that do not possess one of the following biotypes: protein_coding, miRNA, or polymorphic_pseudogene. If after this cleaning a variant no longer has any annotated transcripts, that variant is also filtered out of the vcf file.

For SV_cytoband.py, the following two lines are added to the header:

##INFO=<ID=Cyto1,Number=1,Type=String,Description="Cytoband for SV start (POS) from hg38 cytoBand.txt.gz from UCSC">
##INFO=<ID=Cyto2,Number=1,Type=String,Description="Cytoband for SV end (INFO END) from hg38 cytoBand.txt.gz from UCSC">

Each variant will receive a Cyto1 annotation which corresponds to the Cytoband position of breakpoint 1 (which is POS in the vcf), and a Cyto2 annotation which corresponds to the Cytoband position of breakpoint 2 (which is END in the INFO field).

Output

The output is an annotated vcf file where secondary annotations are added to qualifying variants as described above. Some variants may be additionally filtered out as described.

Length Filtering

Note: We are NOT currently length filtering BICseq2 CNV calls. The workflow is turned off by specifying a maximum length argument that is larger than chr1 (250000000 bp), the same value used to run VEP.

This step uses SV_length_filter.py to remove the largest CNVs from the calls in the vcf file. The resulting vcf file is checked for integrity.

CWL: workflow_SV_length_filter_plus_vcf-integrity-check.cwl

Requirements

This workflow requires a single vcf file with the CNV calls and a parameter to define the maximum length allowed for the SVs.

Filtering

Currently none.

Output

This is the final vcf file that is ingested into the CGAP Portal.

VCF Annotation Cleaning

This step uses SV_annotation_VCF_cleaner.py script to remove most of VEP annotations to create a smaller vcf file for HiGlass visualization. This improves the loading speed in the genome browser. The resulting vcf file is checked for integrity.

CWL: workflow_SV_annotation_VCF_cleaner_plus_vcf-integrity-check.cwl

Requirements

This workflow expects the final vcf file that is ingested into the CGAP Portal as input.

Cleaning

VEP annotations are removed from the vcf file and the REF and ALT fields are simplified using the SV_annotation_VCF_cleaner.py script.

Output

The output is a modified version of the final vcf file that is ingested into the CGAP Portal, that has been cleaned for the HiGlass genome browser. This file is also ingested into the CGAP Portal but only used for visualization.

References

granite.

Part 3. CNVs Filtering and Secondary Annotations

Annotation Filtering and SVTYPE Selection

Requirements

Gene list

Inclusion List

Exclusion List

CNV Type Selection

Output

20 Unrelated Filtering

Requirements

Matching and Filtering

Output

Secondary Annotation

Requirements

Annotation and Possible Filtering

Output

Length Filtering

Requirements

Filtering

Output

VCF Annotation Cleaning

Requirements

Cleaning

Output

References

Annotation Filtering and `SVTYPE` Selection