Part 2. Structural Variants Annotation

Sansa and VEP

This workflow uses Sansa and VEP (Ensembl Variant Effect Predictor) to annotate the identified structural variants (SVs) and combines the annotations into a single vcf file. The resulting vcf file is checked for integrity.

CWL: workflow_sansa_vep_combined_annotation_plus_vcf-integrity-check.cwl

Sansa

The sansa.sh script executes Sansa to identify SVs that match the hg38/GRCh38 lift-over of the gnomAD v2 structural variants database (https://cgap-annotations.readthedocs.io/en/latest/variants_sources.html#gnomad-structural-variants). The script sorts the input vcf file and then runs the following Sansa command:

sansa annotate -m -n -b 50 -r 0.8 -s all -d $gnomAD $vcf

-m returns all SVs (including those without matches to the database)

-n allows for matches between different SV types (to allow CNV to match DEL and DUP)

-b 50 is the default parameter for maximum breakpoint offset (in bp) between the newly-identified SV and the SV in gnomAD SV

-r 0.8 is nearly the default parameter (default is 0.800000012) for minimum reciprocal overlap between SVs

-s all provides all matches instead of automatically selecting the single best match

-d $gnomAD is the gnomAD SV database

$vcf is the input vcf file
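
The flags listed above can also be assembled programmatically. The following is a minimal Python sketch, assuming the input vcf has already been sorted; the function name and the file names are hypothetical, and sansa.sh itself may invoke the tool differently.

```python
import shlex


def build_sansa_command(gnomad_db, vcf):
    """Return the argument list for the sansa call described above."""
    return [
        "sansa", "annotate",
        "-m",             # keep SVs without a match in the database
        "-n",             # allow matches between different SV types
        "-b", "50",       # maximum breakpoint offset (bp)
        "-r", "0.8",      # minimum reciprocal overlap
        "-s", "all",      # report all matches, not only the best one
        "-d", gnomad_db,  # gnomAD SV database
        vcf,              # sorted input vcf
    ]


# Hypothetical file names, used only for illustration.
cmd = build_sansa_command("gnomad_sv.vcf.gz", "sorted_input.vcf.gz")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value and avoids quoting problems if the list is later passed to subprocess.run.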

VEP

The vep-annot_SV.sh script executes VEP to annotate the SVs with genes and transcripts. The script produces an annotated vcf containing all variants.

A maximum SV size slightly larger than chr1 in the hg38/GRCh38 genome (--max_sv_size 250000000) is used in the VEP command to avoid filtering large SVs. The --overlaps option is also included to record the overlap between the VEP features and the SVs (reported in bp and percentage). The --canonical option is included to flag canonical transcripts.

Combine Sansa and VEP

The outputs from Sansa and VEP are combined using the combine_sansa_and_VEP_vcf.py script. The vcf file generated by VEP is used as a scaffold, onto which the gnomAD SV annotations from Sansa are added.

When multiple matches are identified in the gnomAD SV database, the following logic applies to select the best (and rarest) match:

Select a type-matched SV (if possible), and the rarest type-matched variant from gnomAD SV (using AF) if there are multiple matches

If none of the options are a type-match, select the rarest variant from gnomAD SV (using AF)

Note: CNV is a variant class in gnomAD SV, but not in the Manta output. Since DELs and DUPs are types of CNVs, we prioritize as follows: (I) we first search for type-matches between DEL and DEL or DUP and DUP, (II) if a type-match is not found for the variant, we then search for type-matches between DEL and CNV or DUP and CNV, (III) all other combinations (e.g., INV and CNV, or DEL and DUP) are considered to not be type-matched.
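
The selection logic above can be sketched in Python as a two-level sort key. This is an illustration only, not the code in combine_sansa_and_VEP_vcf.py: the function names and the dictionary keys (svtype, af) are hypothetical, and treating any pair of identical types (for example INV and INV) as a type-match is an assumption.

```python
def type_match_rank(sv_type, gnomad_type):
    """Rank a candidate: 0 = same type, 1 = DEL or DUP against CNV, 2 = not type-matched."""
    if sv_type == gnomad_type:
        return 0  # assumption: any identical pair counts as a type-match
    if sv_type in ("DEL", "DUP") and gnomad_type == "CNV":
        return 1
    return 2


def select_best_match(sv_type, matches):
    """Pick the best-ranked match and, within that rank, the lowest AF."""
    if not matches:
        return None
    return min(matches, key=lambda m: (type_match_rank(sv_type, m["svtype"]), m["af"]))


# Hypothetical gnomAD SV matches for a single DEL call.
candidates = [
    {"svtype": "DUP", "af": 0.0001},
    {"svtype": "CNV", "af": 0.001},
    {"svtype": "DEL", "af": 0.05},
    {"svtype": "DEL", "af": 0.01},
]
best = select_best_match("DEL", candidates)
print(best)
```

In this example the DEL entry with AF 0.01 is selected: the rarer DUP and CNV entries are passed over because a same-type match is available.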

These rules were chosen because the gnomAD SV fields can hold only a limited number of values for filtering in the CGAP Portal, and to avoid losing rare variants in the upcoming filtering steps. The final output is a vcf file with annotations for both gene/transcript and gnomAD SV population frequencies.

Confidence Classes

This workflow assigns a confidence class to each of the SVs identified by the pipeline.

CWL: manta_add_confidence.cwl

Confidence classes are calculated and assigned using the SV_confidence.py script. A single vcf is required as input, and the file must contain the information supporting each of the calls created by Manta. The possible confidence classes are:

HIGH

MEDIUM

LOW

NA (Not Available): assigned only to insertions, which are currently not ingested into the portal

Confidence classes are calculated based on the following parameters:

length: the length of the variant, calculated as the absolute value of the SVLEN field

split-reads: the number of alternative split reads, based on the SR field

spanning-reads: the number of alternative spanning reads, based on the PR field

split-read-ratio: the proportion of alternative split reads out of the total number of reference and alternative split reads, based on the SR field

spanning-read-ratio: the proportion of alternative spanning reads out of the total number of reference and alternative spanning reads, based on the PR field
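
As an illustration of how these parameters could be derived for one sample, the sketch below assumes that SR and PR are given as "ref,alt" count strings, as in Manta output. The function names are hypothetical, the handling of a zero read total (ratio set to 0.0) is an assumption, and records that lack one of the two fields are not handled; SV_confidence.py may treat these cases differently.

```python
def read_support(ref_alt):
    """Parse a 'ref,alt' count string and return (alt_count, alt_ratio)."""
    ref, alt = (int(value) for value in ref_alt.split(","))
    total = ref + alt
    ratio = alt / total if total else 0.0  # assumption: no reads gives ratio 0.0
    return alt, ratio


def confidence_parameters(svlen, sr, pr):
    """Collect the five parameters used to assign a confidence class."""
    split_reads, split_read_ratio = read_support(sr)
    spanning_reads, spanning_read_ratio = read_support(pr)
    return {
        "length": abs(svlen),
        "split_reads": split_reads,
        "split_read_ratio": split_read_ratio,
        "spanning_reads": spanning_reads,
        "spanning_read_ratio": spanning_read_ratio,
    }


# Hypothetical values for a 300 bp deletion.
params = confidence_parameters(-300, sr="10,6", pr="12,4")
print(params)
```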

For each variant, all the samples are classified according to the following criteria:

High Confidence Calls

length > 250bp & split-reads >= 5 & split-read-ratio >= 0.3 & spanning-reads >= 5 & spanning-read-ratio >= 0.3
or
length <= 250bp & split-reads > 5 & split-read-ratio > 0.3

Note: For translocations, the length parameter is not taken into consideration. These SVs are evaluated using the number of split reads and spanning reads, and are treated in the same way as variants longer than 250 bp.

Medium Confidence Calls

length > 250bp & split-reads >= 3 & split-read-ratio >= 0.3 & spanning-reads >= 3 & spanning-read-ratio >= 0.3
or
length <= 250bp & split-reads > 3 & split-read-ratio > 0.3

Low Confidence Calls

All the other variants.
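
The criteria above can be expressed as a small Python function. This is a sketch that follows the thresholds exactly as written in this section, not the code in SV_confidence.py; the function and argument names are hypothetical. It classifies a single sample and does not cover the NA class used for insertions.

```python
def classify(length, split_reads, split_read_ratio,
             spanning_reads, spanning_read_ratio, translocation=False):
    """Return HIGH, MEDIUM or LOW for one sample of one variant."""
    # Translocations ignore length and follow the rules for variants > 250 bp.
    long_rules = translocation or length > 250

    def passes(min_reads):
        if long_rules:
            return (split_reads >= min_reads and split_read_ratio >= 0.3
                    and spanning_reads >= min_reads and spanning_read_ratio >= 0.3)
        # Variants of 250 bp or less rely on split reads only (strict inequalities).
        return split_reads > min_reads and split_read_ratio > 0.3

    if passes(5):
        return "HIGH"
    if passes(3):
        return "MEDIUM"
    return "LOW"


print(classify(300, 5, 0.3, 5, 0.3))  # long variant meeting the high thresholds
print(classify(100, 5, 0.5, 0, 0.0))  # short variant with exactly 5 split reads
```

With the thresholds as stated, the first call returns HIGH and the second returns MEDIUM, because the rule for variants of 250 bp or less requires strictly more than 5 split reads for the high class.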

The calculated confidence classes are added as the new FORMAT field CF to each sample. The definition is added to the header:

##FORMAT=<ID=CF,Number=.,Type=String,Description="Confidence class based on length and copy ratio (HIGH, LOW)">
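
Adding a FORMAT field to each sample amounts to extending the FORMAT column and every sample column of a vcf record. The following minimal sketch shows the idea on a plain tab-separated line; the function name and the example record are hypothetical, and SV_confidence.py may use a vcf parsing library instead.

```python
def add_cf_field(record, classes):
    """Append CF to the FORMAT column and one class to each sample column."""
    columns = record.rstrip("\n").split("\t")
    samples = columns[9:]  # sample columns start after the FORMAT column
    if len(samples) != len(classes):
        raise ValueError("expected one confidence class per sample")
    columns[8] += ":CF"
    columns[9:] = [f"{sample}:{cls}" for sample, cls in zip(samples, classes)]
    return "\t".join(columns)


# Hypothetical two-sample record, for illustration only.
line = "chr1\t1000\tid1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-300\tGT:PR:SR\t0/1:12,4:10,6\t0/0:20,0:15,0"
print(add_cf_field(line, ["MEDIUM", "LOW"]))
```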

References

ensembl-vep. Sansa.