Part 2. CNVs Annotation
Sansa and VEP
This workflow uses Sansa and VEP (Ensembl Variant Effect Predictor) to annotate the identified copy number variants (CNVs) and combines the annotations into a single vcf file.
The resulting vcf file is checked for integrity.
Note: The gnomAD SV database, which we annotate against, only contains calls from SV algorithms. This is currently the best available dataset for SVs, but does not contain calls generated by CNV algorithms.
- CWL: workflow_sansa_vep_combined_annotation_plus_vcf-integrity-check.cwl
Sansa
The sansa.sh script executes Sansa to identify CNVs that match the hg38/GRCh38 lift-over of the gnomAD v2 structural variants database (https://cgap-annotations.readthedocs.io/en/latest/variants_sources.html#gnomad-structural-variants).
The script sorts the input vcf file and runs the following command for Sansa:
sansa annotate -m -n -b 50 -r 0.8 -s all -d $gnomAD $vcf
- -m returns all CNVs (including those without matches to the database)
- -n allows for matches between different SV types (to allow CNV to match DEL and DUP)
- -b 50 is the default parameter for maximum breakpoint offset (in bp) between the newly-identified CNV and the SV in gnomAD SV
- -r 0.8 is nearly the default parameter (default is 0.800000012) for minimum reciprocal overlap between SVs
- -s all provides all matches instead of automatically selecting the single best match
- -d $gnomAD is the gnomAD SV database
- $vcf is the input vcf file
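To illustrate what the -b and -r thresholds measure, here is a minimal Python sketch of a breakpoint-offset and reciprocal-overlap check. It is not Sansa's implementation. The function names, the (start, end) tuple representation, and the assumption that both criteria must hold for a match are ours, for illustration only.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the smaller of the two overlap fractions for two intervals."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0  # the intervals do not overlap
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def is_match(cnv, sv, max_offset=50, min_ratio=0.8):
    """Check a CNV against a database SV on the same chromosome.

    cnv and sv are (start, end) tuples. Assumption: both the breakpoint
    offset and the reciprocal overlap criteria must be satisfied.
    """
    within_offset = (abs(cnv[0] - sv[0]) <= max_offset
                     and abs(cnv[1] - sv[1]) <= max_offset)
    return within_offset and reciprocal_overlap(*cnv, *sv) >= min_ratio
```

For example, is_match((1000, 2000), (1020, 2030)) holds under these assumptions because both breakpoints are within 50 bp and the intervals overlap by more than 80% of either length.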
VEP
The vep-annot_SV.sh script executes VEP to annotate the CNVs with genes and transcripts.
The script produces an annotated vcf containing all variants.
A maximum CNV size slightly larger than chr1 in the hg38/GRCh38 genome (--max_sv_size 250000000) is used in the VEP command in order to avoid filtering of large CNVs.
The --overlaps option is also included to record the overlap between the VEP features and the CNVs (reported in bp and percentage).
The --canonical option is included to flag canonical transcripts.
Combine Sansa and VEP
The outputs from Sansa and VEP are combined using the combine_sansa_and_VEP_vcf.py script.
The vcf file generated by VEP is used as a scaffold, onto which gnomAD SV annotations from Sansa are added.
When multiple matches are identified in the gnomAD SV database, the following logic applies to select the best (and rarest) match:
- Select a type-matched CNV (if possible), and the rarest type-matched variant from gnomAD SV (using AF) if there are multiple matches
- If none of the options are a type-match, select the rarest variant from gnomAD SV (using AF)
Note: CNV is a variant class in gnomAD SV, but not in the BICseq2 output. Since DELs and DUPs are types of CNVs, we prioritize as follows: (I) we first search for type-matches between DEL and DEL or DUP and DUP, (II) if a type-match is not found for the variant, we then search for type-matches between DEL and CNV or DUP and CNV, (III) all other combinations (e.g., INV and CNV, or DEL and DUP) are considered to not be type-matched.
These rules were set given limitations on the number of values the gnomAD SV fields can have for filtering in the CGAP Portal and to avoid loss of rare variants in the upcoming filtering steps.
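The prioritization described above can be sketched in Python as a two-part sort key: type-match rank first, then allele frequency. This is an illustrative sketch, not the code in combine_sansa_and_VEP_vcf.py. The function names and the dictionary keys "svtype" and "AF" are hypothetical, and AF is assumed to be a float.

```python
def type_match_rank(cnv_type, sv_type):
    """Rank a gnomAD SV match: 0 = DEL/DEL or DUP/DUP, 1 = DEL or DUP
    against CNV, 2 = not type-matched."""
    if cnv_type in ("DEL", "DUP"):
        if sv_type == cnv_type:
            return 0
        if sv_type == "CNV":
            return 1
    return 2


def select_best_match(cnv_type, matches):
    """Pick one match from a list of dicts with 'svtype' and 'AF' keys.

    The best-ranked type-match wins; ties are broken by the lowest AF.
    If nothing is type-matched, every match has rank 2, so the rarest
    variant overall is returned. Returns None when there are no matches.
    """
    if not matches:
        return None
    return min(matches,
               key=lambda m: (type_match_rank(cnv_type, m["svtype"]), m["AF"]))
```

With this key, a DEL call that matches both a gnomAD DEL (AF 0.05) and a rarer gnomAD CNV (AF 0.0001) is assigned the DEL, since the type-match takes precedence over frequency.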
The final output is a vcf file with annotations for both gene/transcript and gnomAD SV population frequencies.
Confidence Classes
This workflow assigns a confidence class to each of the CNVs identified by the pipeline.
- CWL: BICseq2_add_confidence.cwl
Confidence classes are calculated and assigned using the SV_confidence.py script.
A single vcf is required as input, and the file must contain the information supporting each of the calls created by BICseq2.
The possible confidence classes are:
- HIGH
- LOW
The confidence classes are calculated based on the following parameters:
- length: the length of the variant, calculated as the absolute value of the SVLEN field
- log-ratio: the BICseq2_log2_copyRatio parameter calculated by BICseq2
Each variant is classified based on the following criteria:
High Confidence Calls
length > 1 Mbp & (log-ratio > 0.4 || log-ratio < -0.8)
Low Confidence Calls
All the other variants.
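The criteria above translate directly into a small function. The following is a minimal sketch, not the code in SV_confidence.py. The function and parameter names are hypothetical, and 1 Mbp is assumed to mean 1,000,000 bp.

```python
def confidence_class(svlen, log2_copy_ratio):
    """Return 'HIGH' or 'LOW' from SVLEN and the BICseq2 log2 copy ratio."""
    length = abs(svlen)  # SVLEN is negative for deletions
    # Assumption: 1 Mbp = 1,000,000 bp; both comparisons are strict
    if length > 1_000_000 and (log2_copy_ratio > 0.4 or log2_copy_ratio < -0.8):
        return "HIGH"
    return "LOW"
```

For example, a 2 Mbp deletion (SVLEN of -2000000) with a log-ratio of -1.0 is classified as HIGH, while the same deletion with a log-ratio of -0.5 is classified as LOW.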
The calculated confidence classes are added as the new FORMAT field CF to the sample:
##FORMAT=<ID=CF,Number=.,Type=String,Description="Confidence class based on length and copy ratio (HIGH, LOW)">
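As a rough illustration of what adding a FORMAT field involves, the Python sketch below appends a key to the FORMAT column and the corresponding value to the sample column of a tab-separated VCF data line. It assumes a single-sample vcf and is not taken from SV_confidence.py. The function name and the example record are hypothetical.

```python
def add_format_field(vcf_line, key, value):
    """Append key to the FORMAT column and value to the sample column.

    Assumes a single-sample VCF data line: columns 0-7 are the fixed
    fields, column 8 is FORMAT, and column 9 is the sample.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    fields[8] = fields[8] + ":" + key
    fields[9] = fields[9] + ":" + value
    return "\t".join(fields)
```

Calling add_format_field(line, "CF", "HIGH") on a record whose last two columns are GT and 0/1 yields GT:CF and 0/1:HIGH in those columns.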