Part 2. Structural Variants Annotation
Sansa and VEP
This workflow uses Sansa and VEP (Ensembl Variant Effect Predictor) to annotate the identified structural variants (SVs) and combines the annotations into a single vcf file.
The resulting vcf file is checked for integrity.
- CWL: workflow_sansa_vep_combined_annotation_plus_vcf-integrity-check.cwl
Sansa
The sansa.sh script runs Sansa to identify SVs that match the hg38/GRCh38 lift-over of the gnomAD v2 structural variants database (https://cgap-annotations.readthedocs.io/en/latest/variants_sources.html#gnomad-structural-variants).
The script sorts the input vcf file and runs the following command for Sansa:
sansa annotate -m -n -b 50 -r 0.8 -s all -d $gnomAD $vcf
- -m returns all SVs (including those without matches to the database)
- -n allows for matches between different SV types (to allow SV to match DEL and DUP)
- -b 50 is the default parameter for maximum breakpoint offset (in bp) between the newly-identified SV and the SV in gnomAD SV
- -r 0.8 is nearly the default parameter (default is 0.800000012) for minimum reciprocal overlap between SVs
- -s all provides all matches instead of automatically selecting the single best match
- -d $gnomAD is the gnomAD SV database
- $vcf is the input vcf file
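To make the -b and -r thresholds concrete, here is a minimal sketch of one plausible reading of them: two SVs match when both breakpoints lie within the maximum offset and the reciprocal overlap reaches the minimum. This is not taken from Sansa's source code; the function names, the (start, end) tuple representation, and the choice to require both conditions together are assumptions made for illustration.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Fraction of the longer interval covered by the shared region.

    Dividing by the longer length is equivalent to taking the smaller of
    the two per-interval overlap fractions.
    """
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return overlap / max(end_a - start_a, end_b - start_b)


def is_match(sv_a, sv_b, max_offset=50, min_overlap=0.8):
    """Hypothetical match test mirroring '-b 50 -r 0.8' (assumed semantics)."""
    (start_a, end_a), (start_b, end_b) = sv_a, sv_b
    breakpoints_close = (
        abs(start_a - start_b) <= max_offset
        and abs(end_a - end_b) <= max_offset
    )
    overlap_ok = reciprocal_overlap(start_a, end_a, start_b, end_b) >= min_overlap
    return breakpoints_close and overlap_ok
```

For example, is_match((1000, 2000), (1020, 2030)) holds under these assumptions, while shifting one start by 100 bp fails the breakpoint test.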
VEP
The vep-annot_SV.sh script runs VEP to annotate the SVs with genes and transcripts.
The script produces an annotated vcf containing all variants.
A maximum SV size slightly larger than chr1 in the hg38/GRCh38 genome (--max_sv_size 250000000) is used in the VEP command to avoid filtering large SVs.
The --overlaps option is also included to record the overlap between the VEP features and the SVs (reported in bp and percentage).
The --canonical option is included to flag canonical transcripts.
Combine Sansa and VEP
The outputs from Sansa and VEP are combined using the combine_sansa_and_VEP_vcf.py script.
The vcf file generated by VEP is used as a scaffold, onto which gnomAD SV annotations from Sansa are added.
When multiple matches are identified in the gnomAD SV database, the following logic applies to select the best (and rarest) match:
- Select a type-matched SV (if possible), and the rarest type-matched variant from gnomAD SV (using AF) if there are multiple matches
- If none of the options is a type-match, select the rarest variant from gnomAD SV (using AF)
Note: CNV is a variant class in gnomAD SV, but not in the Manta output. Since DELs and DUPs are types of CNVs, we prioritize as follows: (I) we first search for type-matches between DEL and DEL or DUP and DUP, (II) if a type-match is not found for the variant, we then search for type-matches between DEL and CNV or DUP and CNV, (III) all other combinations (e.g., INV and CNV, or DEL and DUP) are considered to not be type-matched.
These rules were set given limitations on the number of values the gnomAD SV fields can have for filtering in the CGAP Portal and to avoid loss of rare variants in the upcoming filtering steps.
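The selection rules above can be sketched as a ranking by type-match tier and then by allele frequency. This is an illustrative sketch, not the code in combine_sansa_and_VEP_vcf.py: the function names and the "svtype" and "af" dictionary keys are hypothetical, and treating identical types other than DEL and DUP (e.g., INV and INV) as type-matched is an assumption.

```python
def match_tier(query_type, gnomad_type):
    """Rank a candidate: 0 = same type, 1 = DEL/DUP vs CNV, 2 = not type-matched."""
    if query_type == gnomad_type:
        return 0
    if query_type in ("DEL", "DUP") and gnomad_type == "CNV":
        return 1
    return 2


def select_best_match(query_type, matches):
    """Pick the best tier first, then the rarest variant (lowest AF) within it.

    `matches` is a list of dicts with hypothetical keys 'svtype' and 'af'.
    Returns None when there are no matches.
    """
    if not matches:
        return None
    return min(matches, key=lambda m: (match_tier(query_type, m["svtype"]), m["af"]))
```

With a DEL query and candidates of type DUP (AF 0.001), CNV (AF 0.01), and DEL (AF 0.05), this sketch returns the DEL candidate even though it is not the rarest overall, because type-matching takes precedence over frequency.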
The final output is a vcf file with annotations for both gene/transcript and gnomAD SV population frequencies.
Confidence Classes
This workflow assigns a confidence class to each of the SVs identified by the pipeline.
- CWL: manta_add_confidence.cwl
Confidence classes are calculated and assigned using the SV_confidence.py script.
A single vcf is required as input, and the file must contain the information supporting each of the calls created by Manta.
The possible confidence classes are:
- HIGH
- MEDIUM
- LOW
- NA (Not Available): assigned only to insertions, which are currently not ingested into the portal
Confidence classes are calculated based on the following parameters:
- length: the length of the variant, calculated as the absolute value of the SVLEN field
- split-reads: the number of alternative split reads, based on the SR field
- spanning-reads: the number of alternative spanning reads, based on the PR field
- split-read-ratio: the proportion of alternative split reads out of the total number of reference and alternative split reads, based on the SR field
- spanning-read-ratio: the proportion of alternative spanning reads out of the total number of reference and alternative spanning reads, based on the PR field
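As an illustration of how the read counts and ratios could be derived, the sketch below parses one sample's SR or PR value. It assumes the value is a "ref,alt" pair of integers, as Manta typically writes these fields; the function name and the choice to report a ratio of 0.0 when there are no supporting reads at all are assumptions, not details taken from SV_confidence.py.

```python
def read_support(field_value):
    """Return (alt_reads, alt_ratio) from a 'ref,alt' string such as '10,5'."""
    ref, alt = (int(x) for x in field_value.split(","))
    total = ref + alt
    # Assumption: with zero total reads the ratio is reported as 0.0.
    ratio = alt / total if total else 0.0
    return alt, ratio
```

For a sample with SR equal to "10,5", this gives 5 split reads and a split-read-ratio of 5/15.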
For each variant, all the samples are classified according to the following criteria:
High Confidence Calls
length > 250bp & split-reads >= 5 & split-read-ratio >= 0.3 & spanning-reads >= 5 & spanning-read-ratio >= 0.3
or
length <= 250bp & split-reads > 5 & split-read-ratio > 0.3
Note: In the case of translocations, the length parameter is not taken into consideration. These SVs are examined based on the number of split reads and spanning reads and have the same priority as variants which are greater than 250 bp.
Medium Confidence Calls
length > 250bp & split-reads >= 3 & split-read-ratio >= 0.3 & spanning-reads >= 3 & spanning-read-ratio >= 0.3
or
length <= 250bp & split-reads > 3 & split-read-ratio > 0.3
Low Confidence Calls
All the other variants.
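The criteria above can be summarized in a short sketch that classifies one sample for one variant. It follows the thresholds exactly as listed, including the inclusive comparisons for longer variants and the strict ones for shorter variants. The function name and signature are hypothetical, and the use of "INS" and "BND" as the SVTYPE values for insertions and translocations is an assumption based on Manta's usual output; this is not the SV_confidence.py implementation.

```python
def confidence_class(svtype, length, split_reads, split_ratio,
                     spanning_reads, spanning_ratio):
    """Return 'HIGH', 'MEDIUM', 'LOW', or 'NA' for one sample."""
    if svtype == "INS":
        return "NA"  # insertions are not classified
    # Translocations ignore length and follow the > 250 bp rules.
    use_long_rules = svtype == "BND" or length > 250
    for label, min_reads in (("HIGH", 5), ("MEDIUM", 3)):
        if use_long_rules:
            if (split_reads >= min_reads and split_ratio >= 0.3
                    and spanning_reads >= min_reads and spanning_ratio >= 0.3):
                return label
        elif split_reads > min_reads and split_ratio > 0.3:
            return label
    return "LOW"
```

Because HIGH is tested before MEDIUM, a sample that satisfies both sets of thresholds is reported as HIGH.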
The calculated confidence classes are added as the new FORMAT field CF to each sample. The definition is added to the header:
##FORMAT=<ID=CF,Number=.,Type=String,Description="Confidence class based on length and copy ratio (HIGH, LOW)">
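As a rough sketch of what adding CF to each sample involves at the text level, the function below appends the new key to the FORMAT column and one value to each sample column of a tab-separated VCF record. The function name is hypothetical, and the sketch assumes a standard VCF layout (FORMAT in the ninth column, samples after it); the actual script may use a VCF parsing library instead.

```python
def add_cf(record_line, classes):
    """Append CF to FORMAT and one confidence class per sample column."""
    fields = record_line.rstrip("\n").split("\t")
    samples = fields[9:]
    if len(classes) != len(samples):
        raise ValueError("expected one confidence class per sample")
    fields[8] += ":CF"  # FORMAT is the ninth column (index 8)
    for i, cls in enumerate(classes):
        fields[9 + i] += ":" + cls
    return "\t".join(fields)
```

Applied to a record with FORMAT GT:PR:SR and two samples, this yields FORMAT GT:PR:SR:CF with the class appended to each sample's values.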