
Genome-Phenome Analyzer: Variant table fields

The SimulConsult Genome-Phenome Analyzer imports genomic data from an annotated variant file in plain text format. The details of the file are specified on this page, and the variant file in our tiny demo (choose trio or beyond trio version) can be examined as a guide. It is most convenient to examine the file in a spreadsheet program; accordingly it is best to use a file extension of .txt or .tsv. To facilitate examination in a spreadsheet program, values in the file are tab-separated, allowing the data to be human-readable as columns and “cells”. Be careful, however, when saving the file, because Excel will convert the (now deprecated) HGNC gene symbol SEPT9 or the zygosity 1/1 to a date. The free spreadsheet LibreOffice Calc is used by many genomics groups because it does not perform as many such conversions as Excel does. Newline characters are used in the file to separate lines (any of the newline conventions, LF, CR+LF, or CR, is allowed if used consistently throughout the file).

The structure of the variant file is as follows:

Row 1: the first “cell” should be:
fileformat=generic
or some other fileformat assigned specifically to your group.  This is followed by newline character(s) as discussed above.

Row 2: for a trio this has the 43 column headers listed below. The column headers are tab-separated, with newline character(s) at the end. All headers must be included, even if the column is left blank (e.g., when not using a particular conservation score) or an individual in the proband trio is absent (e.g., for a proband with no parental data, the parental columns are blank except for the headers). If there are any individuals beyond the proband trio, new columns are added after those for the proband’s father. For example, if the proband, father and proband’s sister are included, there would be 4 zygosity column headers: zygProband (with data in rows below), zygMother (header without data), zygFather (with data in rows below), and zygSister (with data in rows below). Additionally, for each individual beyond the trio, 3 column headers and the data in rows below would be added (total depth, variant depth and quality).
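As an illustration of the row 2 layout, the following Python sketch builds the header row for a trio and for a family with additional individuals. The 43 names are taken from the field table on this page. The function names, the names of the extra depth and quality columns (e.g., totDepthSister), and the grouping of those columns per individual are assumptions made for the example, not confirmed behavior of the software.

```python
# The 43 trio headers, in the order given in the field table on this page.
TRIO_HEADERS = [
    "hgncSymbol", "geneNameLong", "chrPos",
    "cSeqAnnotation", "cPosition", "cRef", "cAlt",
    "pSeqAnnotation", "pPosition", "pRef", "pAlt",
    "rsid", "zygProband", "zygMother", "zygFather",
    "effect", "freq1", "freq2", "homoShares", "heteroShares",
    "omimNumber", "omimDiseaseNames", "variantAccession", "variantPathogenicity",
    "polyPhen", "mutationTaster", "sift", "gerp", "grantham",
    "phat", "phast", "phyloP", "strandBias", "knownSplice",
    "totDepthProband", "varDepthProband", "qualProband",
    "totDepthMother", "varDepthMother", "qualMother",
    "totDepthFather", "varDepthFather", "qualFather",
]


def build_headers(extra_individuals=()):
    """Return the header list, adding columns for individuals beyond the trio."""
    headers = list(TRIO_HEADERS)
    # Extra zygosity columns go directly after zygFather.
    insert_at = headers.index("zygFather") + 1
    headers[insert_at:insert_at] = [f"zyg{name}" for name in extra_individuals]
    # Assumption: the three extra depth/quality columns are grouped per
    # individual and appended after qualFather.
    for name in extra_individuals:
        headers += [f"totDepth{name}", f"varDepth{name}", f"qual{name}"]
    return headers


def header_row(extra_individuals=()):
    """Return row 2 as tab-separated text ending in a newline."""
    return "\t".join(build_headers(extra_individuals)) + "\n"
```

For example, `build_headers(["Sister"])` yields 47 headers: the 43 trio headers, zygSister, and the three depth and quality columns for the sister.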

Rows of variant data follow the headers as row 3 and so on (tab-separated, newline at end). Many of the fields are not required and can be left blank if you don’t plan to use the related functionality, but even if a field is blank, you need to include the tabs in the variant rows and the headers in row 2. Examining the sample variant file in our tiny demo (choose trio or beyond trio version) illustrates how several columns can be blank but headers are used anyway.
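The file structure described above (a fileformat row, a header row, then variant rows with the same number of tab-separated cells) can be checked before upload with a short script. This is a minimal sketch: the function name check_variant_text and the wording of the messages are hypothetical, and the requirement that every variant row has exactly as many cells as the header row is an assumption based on the instruction to include the tabs for blank fields.

```python
import re


def check_variant_text(text):
    """Return a list of structural problems found in variant-file text."""
    problems = []
    # Accept any of the three newline conventions: CR+LF, CR, or LF.
    lines = re.split(r"\r\n|\r|\n", text)
    # A final newline leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) < 2:
        return ["file needs a fileformat row and a header row"]
    # Row 1: the first cell names the file format.
    first_cell = lines[0].split("\t")[0]
    if not first_cell.startswith("fileformat="):
        problems.append("row 1 should start with fileformat=")
    # Row 2 defines the number of columns; rows 3 onward must match it.
    n_cols = len(lines[1].split("\t"))
    for row_number, line in enumerate(lines[2:], start=3):
        n_fields = len(line.split("\t"))
        if n_fields != n_cols:
            problems.append(
                f"row {row_number}: {n_fields} fields, expected {n_cols}"
            )
    return problems
```

An empty result means the basic row and tab structure is consistent; it says nothing about whether the values in each cell are supported.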

CNV: The variant table can also include large copy number variation (CNV) regions. Three of the fields below are different: hgncSymbol has a comma-separated list of the genes in the region (e.g., GOLGA6A,RN7SL429P), chrPos has a chromosomal range (e.g., 15:74363000-74365200), and effect has DEL or DUP.
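To illustrate the chrPos and hgncSymbol formats for both single positions and CNV ranges, here is a small Python sketch. The function names are hypothetical, and the normalization of 23, 24 and 25 to X, Y and MT is an assumption made for the example; the software's own parsing may differ.

```python
import re

# Assumed normalization of the alternative chromosome designations.
_CHR_ALIASES = {"23": "X", "24": "Y", "25": "MT", "M": "MT"}

_CHROM = r"(?:[0-9]{1,2}|X|Y|MT|M)"
_CHRPOS = re.compile(
    rf"^(?:chr)?(?P<chrom>{_CHROM}):(?P<start>\d+)"
    # Optional CNV range end, with or without a repeated chromosome.
    rf"(?:-(?:(?:chr)?{_CHROM}:)?(?P<end>\d+))?$",
    re.IGNORECASE,
)


def parse_chr_pos(value):
    """Return (chromosome, start, end); end equals start for a single position."""
    match = _CHRPOS.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized chrPos value: {value!r}")
    chrom = match.group("chrom").upper()
    chrom = _CHR_ALIASES.get(chrom, chrom)
    start = int(match.group("start"))
    end = int(match.group("end")) if match.group("end") else start
    return chrom, start, end


def split_symbols(hgnc_symbol):
    """Split a comma-separated hgncSymbol cell into individual gene symbols."""
    return [s.strip() for s in hgnc_symbol.split(",") if s.strip()]
```

For example, `parse_chr_pos("15:74363000-74365200")` returns `("15", 74363000, 74365200)` and `split_symbols("GOLGA6A,RN7SL429P")` returns the two gene symbols as a list.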

Problems? If your data has values that we don’t support, such as values for the “effect” field, let us know, and we can add support for them. In many instances blank, “.”, “-”, -9 and -99 are allowed and are interpreted as not used. In the fields indicated, “NA” and “na” also count as blanks.
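When preparing a file, it can help to apply the same blank conventions in your own scripts. The sketch below assumes the blank tokens listed above apply to the field being checked (the text says "in many instances", so this does not hold for every field); the function name is_blank and its na_counts_as_blank parameter are hypothetical.

```python
# Tokens described above as "not used" in many fields.
BLANK_TOKENS = {"", ".", "-", "-9", "-99"}
# Accepted as blanks only in the fields that say so.
NA_TOKENS = {"NA", "na"}


def is_blank(value, na_counts_as_blank=False):
    """Return True if a cell should be read as 'not used'."""
    token = value.strip()
    if token in BLANK_TOKENS:
        return True
    return na_counts_as_blank and token in NA_TOKENS
```

For a field such as freq1, where the table indicates that “NA” and “na” are accepted, a script would call `is_blank(cell, na_counts_as_blank=True)`.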

The software uploads and analyzes variant files with ~20,000 variants in ~1 second.  The variant file loading text box on the “Load or save patient” screen of the software will report the results of file reading or problems encountered, for example stated gender not matching chromosomal gender.

| Field (column header) | Sample value | Required | Action | Comments |
| --- | --- | --- | --- | --- |
| hgncSymbol | GBA or GOLGA6A,RN7SL429P | yes (all) | label | Multiple symbols separated by commas are supported. For CNV analysis, this field should contain all gene symbols in the CNV. HGNC symbols for genes with known human phenotype that have been replaced since 1 January 2010 are recognized and tied to the current HGNC symbol. |
| geneNameLong | glucosidase, beta, acid | no | report | |
| chrPos (or specify the genome assembly using HG19:chrPos (the default) or HG38:chrPos) | chr1:155206167 or 1:155206167, or for CNV, 1:155200000-1:155444444 | yes (all) | compute | The chromosome number and position are displayed in the gene variants display, hyperlinked to the UCSC genome browser, using the genome assembly indicated in the header using the format shown at left. The chromosome number is also used in choosing the inheritance model used in the Gene Discovery display. Also, unusual distributions of variants over the chromosomes are reported in the variant table processing text area. Entries may start with chr or the number or letter for the chromosome. For chromosome designations, 23 or X, 24 or Y, 25, M or MT are supported. Characters past a colon are used to display position information. |
| cSeqAnnotation | NM_014208:ex5 | some of the 8 sequence columns should be used | link | Text (if desired, can combine this + next 7 sequence annotations in this field, and multiple sequence annotations separated by commas are supported). If a DNA position and change is recognized, it is used to construct a ClinVar query. |
| cPosition | 38 | some of the 8 sequence columns should be used | link | If this and the next 2 fields are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene. |
| cRef | A | some of the 8 sequence columns should be used | link | If this and the fields before and after are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene. |
| cAlt | G | some of the 8 sequence columns should be used | link | If this and the previous 2 fields are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene. |
| pSeqAnnotation | NP_00100574.1 or p.E1149D or E1149D | some of the 8 sequence columns should be used | link | If no DNA position and change is recognized, the protein change is used to construct a ClinVar query. |
| pPosition | 13 | some of the 8 sequence columns should be used | report | |
| pRef | K | some of the 8 sequence columns should be used | report | |
| pAlt | R | some of the 8 sequence columns should be used | report | |
| rsid | rs12345678 | no | compute | The percent of variants with rsID numbers is reported in the variant table processing text area. |
| zygProband (to use identifiers within the software, use terms such as zygPaula here and Paula will be used as the identifier within the software) | Het | yes | compute | Zygosity of proband: Accepts integers from 0 to 100, or case-insensitive text (het, heterozygous, hap, hom, homozygous, hemi, wt). Inputs treated as wt: unknown, ., -, none. Inputs of the form x/x (or the phased equivalents using \|) are treated as hom for nonzero x; otherwise wt. Inputs of the form x/y can contribute to compound heterozygotes, even at the same locus. If the / or \| forms are used, 1 (with no pipe or slash) is interpreted as hap (hemi). |
| zygMother (to use identifiers within the software, use terms such as zygINDIV_35 here and INDIV_35 will be used as the identifier within the software) | 50 | no; if no genomic data is available from the proband’s mother, leave this blank but use a header | compute | Zygosity of mother, as above. |
| zygFather (to use identifiers within the software, use terms such as zygGeorge here and George will be used as the identifier within the software) | 0 | no; if no genomic data is available from the proband’s father, leave this blank but use a header | compute | Zygosity of father, as above. |
| (for beyond the trio, the next additional zygosity column goes here) | | | | |
| effect | missense, or for CNV, DEL or DUP | yes (most) | compute | Effect terms that are recognized are listed at this link, though each group typically uses only a few of these, such as the core terms missense, frameshift, and synonymous. The listing here is periodically updated to include all SnpEff effect prediction terms, but let us know if we are missing any. The terms are case-insensitive. If multiple terms are used (separated by “\|” or “&” or “,” or “/”), each effect term is considered and the one with the highest severity is chosen. |
| freq1 (to use identifiers within the software, use terms such as freqExAC here and ExAC will be used as the identifier in the mini variant table) | .0015 | yes | compute | Real numbers between 0 and 1. “NA” and “na” are interpreted as 0, in addition to the usual blank characters. |
| freq2 (to use identifiers within the software, use terms such as freqAfrica here and Africa will be used as the identifier in the mini variant table) | .02 | no | compute | Real numbers between 0 and 1. “NA” and “na” are interpreted as 0, in addition to the usual blank characters. |
| homoShares | 0 | no | compute | Count of times this particular mutation was seen in homozygous form in unaffected individuals (for example, at your lab); non-negative integer. |
| heteroShares | 3 | no | compute | Count of times this particular mutation was seen in heterozygous form in unaffected individuals (for example, at your lab); non-negative integer. |
| omimNumber | 606463 | no | report | Six-digit number corresponding to the gene. |
| omimDiseaseNames | Gaucher disease | no | report | Multiple diseases can be strung together for display. |
| variantAccession | CM065215 | no | report | Variant accession number. |
| variantPathogenicity | DM or 5 | no | compute | Variant pathogenicity report. For HGMD variant pathogenicity, DM is treated as severity 5, DM? as 4, and DP as 3. ClinVar values from 2 to 5 and their verbal equivalents (capitalized or uncapitalized) are treated as follows: 2 or Benign are severity 1, 3 or Likely benign are severity 2, 4 or Likely pathogenic are severity 3, and 5 or Pathogenic are severity 5. These values override other variant severity determinations, though the other scoring is described in the mini variant table that is displayed for non-benign variants. |
| polyPhen | probably-damaging | no | compute | Either verbal (probably-damaging, possibly-damaging, benign, DP, FP, DFP, D, P, B), or numerical (real number between 0 and 1, with damaging values near 1) but not both. If two terms are present the first one is used. |
| mutationTaster | 0.73 | no | compute | A, D, N, P (case-insensitive) or real number between 0 and 1, with damaging values near 1. |
| sift | 0.68 | no | compute | Either verbal (D, T (case-insensitive)) or numerical (real number between 0 and 1, with damaging values near 0 (can be configured for customers to use damaging values near 1)) but not both. If two terms are present the first one is used. |
| gerp | 5.34 | no | compute | Real number, with higher numbers more damaging. |
| grantham | 29 | no | compute | 0-215, with higher numbers more damaging. |
| phat | 0.82 | no | compute | Real number between 0 and 1, with higher numbers more damaging. “NA” and “na” are interpreted as 0, in addition to the usual blank characters. |
| phast | 0.95 | no | compute | Real number between 0 and 1, with higher numbers more damaging. “NA” and “na” are interpreted as 0, in addition to the usual blank characters. |
| phyloP | 0.75 | no | compute | Rankscore real number between 0 and 1, with higher numbers more damaging. “NA” and “na” are interpreted as 0, in addition to the usual blank characters. |
| strandBias | | no | | |
| knownSplice | | no | compute | Real number between -1 and 1 reflecting disruption of a known splice site, with values near +1 being more damaging. |
| totDepthProband | 105 | no | compute | Total read depth for proband; non-negative integer. Blank or low values result in exclusion of the variant, but blanks for all variants result in this metric being ignored. |
| varDepthProband | 74 | no | report | Variant read depth for proband; non-negative integer. |
| qualProband | 162 | no | compute | Read quality score for proband; non-negative integer. Blank or low values result in exclusion of the variant, but blanks for all variants result in this metric being ignored. |
| totDepthMother | 99 | no | report | Total read depth for mother; non-negative integer. |
| varDepthMother | 99 | no | report | Variant read depth for mother; non-negative integer. |
| qualMother | 99 | no | report | Read quality score for mother; non-negative integer. |
| totDepthFather | 250 | no | report | Total read depth for father; non-negative integer. |
| varDepthFather | 99 | no | report | Variant read depth for father; non-negative integer. |
| qualFather | 99 | no | report | Read quality score for father; non-negative integer. |
| (for beyond the trio, the next additional total depth column goes here) | | | | |
| (for beyond the trio, the next additional variant depth column goes here) | | | | |
| (for beyond the trio, the next additional quality column goes here) | | | | |
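
The zygosity and variantPathogenicity rules in the field table can be sketched in Python as follows. This is an illustration under stated assumptions, not the software's implementation: the function names and the slash_forms_in_use parameter are hypothetical, integer inputs from 0 to 100 are only recognized and not interpreted, HGMD codes are matched case-insensitively for convenience, and blank cells or genotype forms other than digits separated by / or | are outside the sketch.

```python
import re

_ZYG_TEXT = {
    "het": "het", "heterozygous": "het",
    "hom": "hom", "homozygous": "hom",
    "hap": "hap", "hemi": "hap",
    "wt": "wt",
    # Inputs the table says are treated as wt.
    "unknown": "wt", ".": "wt", "-": "wt", "none": "wt",
}
_GENOTYPE = re.compile(r"^(\d+)[/|](\d+)$")

# Severity values as stated in the variantPathogenicity row.
_PATHOGENICITY = {
    "dm": 5, "dm?": 4, "dp": 3,
    "2": 1, "benign": 1,
    "3": 2, "likely benign": 2,
    "4": 3, "likely pathogenic": 3,
    "5": 5, "pathogenic": 5,
}


def classify_zygosity(value, slash_forms_in_use=False):
    """Map a zygosity cell to 'het', 'hom', 'hap', 'wt' or 'numeric'."""
    token = value.strip().lower()
    if token in _ZYG_TEXT:
        return _ZYG_TEXT[token]
    match = _GENOTYPE.match(token)
    if match:
        first, second = int(match.group(1)), int(match.group(2))
        if first == second:
            # x/x is hom for nonzero x, otherwise wt.
            return "hom" if first != 0 else "wt"
        # x/y: treated here as heterozygous.
        return "het"
    if token == "1" and slash_forms_in_use:
        # A bare 1 in a file that uses / or | forms means hap (hemi).
        return "hap"
    if token.isdigit() and int(token) <= 100:
        # Integer input; its interpretation is not covered by this sketch.
        return "numeric"
    raise ValueError(f"unrecognized zygosity value: {value!r}")


def pathogenicity_severity(value):
    """Return the severity for an HGMD or ClinVar value, or None if unknown."""
    return _PATHOGENICITY.get(value.strip().lower())
```

For example, `classify_zygosity("1/1")` gives `"hom"`, `classify_zygosity("0|1")` gives `"het"`, and `pathogenicity_severity("DM?")` gives `4`.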