Part 2. CNV Annotation

Sansa and VEP

This workflow uses Sansa and VEP (Ensembl Variant Effect Predictor) to annotate the identified copy number variants (CNVs) and combines the annotations into a single vcf file. The resulting vcf file is checked for integrity.

Note: The gnomAD SV database, which we annotate against, contains only calls from SV algorithms and none generated by CNV algorithms. It is nevertheless currently the best available dataset for SVs.

CWL: workflow_sansa_vep_combined_annotation_plus_vcf-integrity-check.cwl

Sansa

The sansa.sh script runs Sansa to identify CNVs that match the hg38/GRCh38 lift-over of the gnomAD v2 structural variants database (https://cgap-annotations.readthedocs.io/en/latest/variants_sources.html#gnomad-structural-variants). The script sorts the input vcf file and then runs the following Sansa command:

sansa annotate -m -n -b 50 -r 0.8 -s all -d $gnomAD $vcf

-m returns all CNVs (including those without matches to the database)

-n allows for matches between different SV types (to allow CNV to match DEL and DUP)

-b 50 is the default parameter for maximum breakpoint offset (in bp) between the newly-identified CNV and the SV in gnomAD SV

-r 0.8 is effectively the default parameter (the default is reported as 0.800000012, the single-precision floating-point representation of 0.8) for minimum reciprocal overlap between SVs

-s all provides all matches instead of automatically selecting the single best match

-d $gnomAD is the gnomAD SV database

$vcf is the input vcf file
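To make the -b and -r thresholds concrete, the following is a minimal sketch of how a breakpoint-offset check and a reciprocal-overlap check can be combined. It is not Sansa's implementation: the function names, the (start, end) tuple representation, and the assumption that both criteria must hold for a match are illustrative only.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the reciprocal overlap of two intervals (0.0 if they are disjoint).

    Reciprocal overlap is the shared length divided by the length of the
    longer interval, i.e. the smaller of the two per-interval overlap fractions.
    """
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0
    return shared / max(end_a - start_a, end_b - start_b)


def is_match(cnv, sv, max_offset=50, min_overlap=0.8):
    """Hypothetical matching rule mirroring -b 50 and -r 0.8.

    cnv and sv are (start, end) tuples on the same chromosome.
    Assumption: both breakpoints must lie within max_offset bp of each other
    AND the reciprocal overlap must reach min_overlap.
    """
    start_ok = abs(cnv[0] - sv[0]) <= max_offset
    end_ok = abs(cnv[1] - sv[1]) <= max_offset
    return start_ok and end_ok and reciprocal_overlap(*cnv, *sv) >= min_overlap
```

For example, under these assumptions a CNV at (1000, 2000) matches a gnomAD SV at (1020, 2030), while an SV at (1000, 2100) is rejected because its end breakpoint is 100 bp away.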

VEP

The vep-annot_SV.sh script runs VEP to annotate the CNVs with genes and transcripts. The script produces an annotated vcf containing all variants.

The VEP command sets a maximum CNV size slightly larger than chr1 in the hg38/GRCh38 genome (--max_sv_size 250000000) so that large CNVs are not filtered out. The --overlaps option is also included to record the overlap between the VEP features and the CNVs (reported in bp and as a percentage). The --canonical option is included to flag canonical transcripts.

Combine Sansa and VEP

The outputs from Sansa and VEP are combined using the combine_sansa_and_VEP_vcf.py script. The vcf file generated by VEP is used as a scaffold, onto which gnomAD SV annotations from Sansa are added.

When multiple matches are identified in the gnomAD SV database, the following logic applies to select the best (and rarest) match:

Select a type-matched variant from gnomAD SV if possible; if there are multiple type-matched variants, select the rarest one (lowest AF)

If none of the matches is a type-match, select the rarest variant from gnomAD SV (lowest AF)

Note: CNV is a variant class in gnomAD SV, but not in the BICseq2 output. Since DELs and DUPs are types of CNVs, we prioritize as follows: (I) we first search for type-matches between DEL and DEL or DUP and DUP, (II) if a type-match is not found for the variant, we then search for type-matches between DEL and CNV or DUP and CNV, (III) all other combinations (e.g., INV and CNV, or DEL and DUP) are considered to not be type-matched.

These rules were set given limitations on the number of values the gnomAD SV fields can have for filtering in the CGAP Portal and to avoid loss of rare variants in the upcoming filtering steps. The final output is a vcf file with annotations for both gene/transcript and gnomAD SV population frequencies.
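The selection rules above can be sketched as a two-part sort key: match tier first, allele frequency second. This is an illustrative sketch, not the code in combine_sansa_and_VEP_vcf.py; the dictionary keys "svtype" and "af" and both function names are hypothetical, and every match is assumed to carry a numeric AF.

```python
def match_tier(call_type, gnomad_type):
    """Rank a gnomAD SV match against a DEL or DUP call (lower is better).

    0: exact type-match (DEL/DEL or DUP/DUP)
    1: DEL or DUP matched to the gnomAD SV class CNV
    2: any other combination, treated as not type-matched
    """
    if call_type in ("DEL", "DUP") and gnomad_type == call_type:
        return 0
    if call_type in ("DEL", "DUP") and gnomad_type == "CNV":
        return 1
    return 2


def select_best_match(call_type, matches):
    """Pick the best-tier match, breaking ties by the lowest AF (rarest).

    matches is a list of dicts with hypothetical keys "svtype" and "af".
    Returns None when there are no matches.
    """
    if not matches:
        return None
    return min(matches, key=lambda m: (match_tier(call_type, m["svtype"]), m["af"]))
```

With this key, a DEL call that matches gnomAD DEL variants at AF 0.2 and 0.05 plus a much rarer DUP selects the DEL at AF 0.05, since type-matching takes priority over rarity.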

Confidence Classes

This workflow assigns a confidence class to each of the CNVs identified by the pipeline.

CWL: BICseq2_add_confidence.cwl

Confidence classes are calculated and assigned using the SV_confidence.py script. A single vcf is required as input, and the file must contain the information supporting each of the calls created by BICseq2. The possible confidence classes are:

HIGH

LOW

The confidence classes are calculated based on the following parameters:

length: the length of the variant, calculated as the absolute value of the SVLEN field

log-ratio: the BICseq2_log2_copyRatio parameter calculated by BICseq2

Each variant is classified based on the following criteria:

High Confidence Calls

length > 1 Mbp && (log-ratio > 0.4 || log-ratio < -0.8)

Low Confidence Calls

All the other variants.
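The criteria above translate into a short function. This is a sketch of the stated rule only, not the code in SV_confidence.py; the function and argument names are hypothetical, and 1 Mbp is taken to mean 1,000,000 bp.

```python
def confidence_class(svlen, log2_copy_ratio):
    """Classify a CNV call as "HIGH" or "LOW" confidence.

    svlen: value of the SVLEN field (may be negative for deletions).
    log2_copy_ratio: value of BICseq2_log2_copyRatio.
    """
    length = abs(svlen)  # length is the absolute value of SVLEN
    strong_ratio = log2_copy_ratio > 0.4 or log2_copy_ratio < -0.8
    if length > 1_000_000 and strong_ratio:
        return "HIGH"
    return "LOW"
```

For example, a 2 Mbp deletion with a log-ratio of -1.0 is classified as HIGH, while a 1.5 Mbp call with a log-ratio of 0.2, or a 500 kbp call with any log-ratio, is classified as LOW.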

The calculated confidence class is added to the sample as a new FORMAT field, CF:

##FORMAT=<ID=CF,Number=.,Type=String,Description="Confidence class based on length and copy ratio (HIGH, LOW)">

References

ensembl-vep. Sansa.