General Statistics
13.1 General Statistics
General Marker Statistics
The following subsections further explain the methods used in obtaining General Marker Statistics, which may be invoked using a separate window (Section Genotype Statistics by Marker) or as a tab in the Genotypic Association Test module (Section Genotype Association Tests).
Hardy-Weinberg Equilibrium Computation
Suppose we have a marker with alleles $1, \ldots, k$ having frequencies $p_1, \ldots, p_k$. We may write the genotype count for alleles $i$ and $j$ as $n_{ij}$. Due to phase ambiguity, if $i \neq j$, we count occurrences of allele $i$ on the first chromosome and allele $j$ on the second chromosome, along with occurrences of allele $j$ on the first chromosome and allele $i$ on the second chromosome, in both the notations $n_{ij}$ and $n_{ji}$.
Thus, we may write the count for allele $i$ as $n_i = 2n_{ii} + \sum_{j=1, j \neq i}^{k} n_{ij}$. We may also express the genotype frequency for allele $i$ occurring homozygously as $p_{ii} = n_{ii}/n$, and the genotype frequency for heterozygous alleles $i$ and $j$ as $p_{ij} = n_{ij}/n$, where $n$ is the population count. The frequency of allele $i$ may be expressed as:
$$p_i = \frac{n_i}{2n}.$$
We wish to check the agreement of $p_{ii}$ with $p_i^2$ and the agreement of $p_{ij}$, where $i \neq j$, with $2p_ip_j$. We multiply by two because of how we deal with the phase ambiguity (see above).
Thus, we will define the Hardy-Weinberg equilibrium coefficient $D_{ii}$ or $D_{ij}$ for alleles $i$ and $j$ such that
$$p_{ii} = p_i^2 + D_{ii},$$
$$p_{ij} = 2p_ip_j - 2D_{ij} \quad (\text{for } i \neq j).$$
(It may be shown that for a bi-allelic marker, $D_{11} = D_{12} = D_{22}$.)
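We then have a statistic following a chi-squared distribution with $k(k-1)/2$ degrees of freedom,
$$\begin{aligned}
X^2 &= n\sum_{i=1}^{k}\frac{D_{ii}^2}{p_i^2} + n\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}\frac{(2D_{ij})^2}{2p_ip_j} \\
    &= \sum_{i=1}^{k}\frac{(n_{ii} - np_i^2)^2}{np_i^2} + \sum_{i=1}^{k-1}\sum_{j=i+1}^{k}\frac{(n_{ij} - 2np_ip_j)^2}{2np_ip_j}.
\end{aligned}$$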
From this, we obtain the distribution’s p-value $p = \chi^2(X^2, k(k-1)/2)$, and the correlation, $R$, from the inverse distribution for one degree of freedom (where $F$ is the chi-squared distribution), which is

Fisher’s Exact Test HWE P-Values
In this test, all of the possible sets of genotypic counts consistent with the observed allele totals are cycled through, and the probabilities of all sets of counts which are as extreme or more extreme (equally probable or less probable) than the observed set of counts are summed.
See [Emigh 1980].
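As a rough sketch of this enumeration for a bi-allelic marker, the code below cycles through every heterozygote count compatible with the observed allele totals and sums the probabilities that do not exceed the observed one. The function name `hwe_exact_p` and the use of the standard conditional probability of the genotype counts given the allele counts are assumptions for illustration; the source does not spell out the formula it uses.

```python
from math import exp, lgamma, log

def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE p-value for a bi-allelic marker (illustrative sketch)."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab          # copies of allele a
    n_b = 2 * n - n_a              # copies of allele b

    def log_prob(het):
        # Log-probability of the genotype counts implied by `het` heterozygotes,
        # conditional on the allele totals n_a and n_b.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (lgamma(n + 1) - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + het * log(2) + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    observed = log_prob(n_ab)
    total = 0.0
    # The heterozygote count must have the same parity as n_a.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = log_prob(het)
        if lp <= observed + 1e-9:  # equally probable or less probable
            total += exp(lp)
    return total
```

With two samples and two copies of each allele, `hwe_exact_p(1, 0, 1)` sums only the all-homozygous configuration (probability 1/3), whereas `hwe_exact_p(0, 2, 0)` sums both configurations and returns 1.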
Signed HWE Correlation R
NOTE:
- This statistic applies only to bi-allelic markers.
We define the signed HWE correlation R as

where

$n$ is the total genotype count, and $n_{DD}$ and $n_{Dd}$ are the counts for genotypes $DD$ and $Dd$, respectively.
This is derived from the formula for (signed) correlation between two sets of observations, xi and yi,

where we take the xi to be 0 if the first allele is d and 1 if the first allele is D, and the yi to be 0 if the second allele is d and
1 if the second allele is D.
Because of phase ambiguity, we set each of the counts of (d, D) and (D, d) to be one-half of the (phase-ambiguous)
observed count of Dd. The correlation then simplifies to the formula first given above.
If there is a high homozygous count, xi and yi will often be 1 or often be 0 at the same time, and therefore there will be a
positive correlation between the xi and the yi. Similarly, if there is a high heterozygous count, xi and yi will often be 1 at
opposite times, causing an anti-correlation to exist.
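The simplification described above can be checked numerically. The sketch below computes the correlation twice: once directly from the 0/1 encodings with the Dd count split evenly between (D, d) and (d, D), and once from a closed form that follows from that encoding. The function names and the symbol `p` (the frequency of allele D) are assumptions made for this illustration.

```python
def signed_hwe_r(n_DD, n_Dd, n_dd):
    """Closed form implied by the 0/1 encoding (illustrative sketch)."""
    n = n_DD + n_Dd + n_dd
    p = (n_DD + 0.5 * n_Dd) / n          # frequency of allele D; also the mean of x and of y
    return (n_DD / n - p * p) / (p * (1 - p))

def weighted_correlation(n_DD, n_Dd, n_dd):
    """Correlation of (x, y) with the Dd count split evenly between (1, 0) and (0, 1)."""
    cells = [(1, 1, n_DD), (1, 0, n_Dd / 2), (0, 1, n_Dd / 2), (0, 0, n_dd)]
    n = sum(w for _, _, w in cells)
    mx = sum(x * w for x, _, w in cells) / n
    my = sum(y * w for _, y, w in cells) / n
    cov = sum((x - mx) * (y - my) * w for x, y, w in cells)
    vx = sum((x - mx) ** 2 * w for x, _, w in cells)
    vy = sum((y - my) ** 2 * w for _, y, w in cells)
    return cov / (vx * vy) ** 0.5
```

An excess of homozygotes, such as counts (30, 40, 30), gives a positive value (0.2 here); counts in exact Hardy-Weinberg proportions, such as (25, 50, 25), give zero; an all-heterozygous sample gives -1.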
Call Rate
The call rate is the fraction of genotypes present and not missing for the given marker.

Minor Allele Frequency (MAF)
The minor allele frequency is the fraction of the total alleles of the given marker that are minor alleles.

Statistics Available for Genotype Association Tests
Correlation/Trend Test
The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). It may also be thought of as a test for any “trend” which either one of the numeric variables may have taken against the other one.
If we have n pairs of observations (xi, yi), the (signed) correlation R between them is

Meanwhile,

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be
obtained.
NOTE:
- In the special case of the additive model (and no PCA correction) for a case/control study, if we were to use, instead of the above formula,
$$\chi^2 = nR^2,$$
we would have the mathematical equivalent of the Armitage Trend Test.
- This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead
$$\chi^2 = (n - k - 1)R^2,$$
where k is the number of principal components that have been removed from the data. The premise is that the PCA correction has removed k degrees of freedom from the data, and only the remaining degrees need to be tested.
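The following sketch shows one way the correlation and its chi-squared statistic might be computed. It assumes the statistic takes the form (n - k - 1) R², with k = 0 when no principal components have been removed; that form, and the names `correlation` and `trend_test`, are assumptions for illustration rather than a statement of the SVS implementation.

```python
from math import erfc, sqrt

def correlation(xs, ys):
    """Signed Pearson correlation between paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def trend_test(xs, ys, removed_components=0):
    """Return (R, chi-squared, p-value); the (n - k - 1) scaling is an assumption."""
    n = len(xs)
    r = correlation(xs, ys)
    chi2 = (n - removed_components - 1) * r * r
    # Chi-squared upper tail for one degree of freedom.
    return r, chi2, erfc(sqrt(chi2 / 2))

genotype = [0, 0, 1, 1, 2, 2]   # e.g. additive encoding: number of minor alleles
status = [0, 0, 0, 1, 1, 1]     # e.g. case/control encoded as 1/0
r, chi2, p = trend_test(genotype, status)
```

Here R² is 2/3, so the statistic is 5 × 2/3 with no correction and 4 × 2/3 if one principal component is treated as removed.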
Armitage Trend Test
The Armitage Trend Test tests the “trend” in an ordered case/control contingency table. In SVS, the ordering is by
number of minor alleles in the genotype – zero, one, or two.
Let $n_{10}$, $n_{11}$, and $n_{12}$ be the counts for cases with 0, 1, and 2 minor alleles, respectively, and $n_{00}$, $n_{01}$, and $n_{02}$ be the counts for controls with 0, 1, and 2 minor alleles, respectively. Also, let $s_0 = 0$, $s_1 = 1$, and $s_2 = 2$.
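If we let $N$ be the total count,
$$p_{case} = \frac{n_{10} + n_{11} + n_{12}}{N},$$
$$p_{1i} = \frac{n_{1i}}{n_{0i} + n_{1i}},$$
$$\bar{s} = \frac{1}{N}\sum_{i=0}^{2}(n_{0i} + n_{1i})\,s_i, \text{ and}$$
$$b = \frac{\sum_{i=0}^{2}(n_{0i} + n_{1i})(p_{1i} - p_{case})(s_i - \bar{s})}{\sum_{i=0}^{2}(n_{0i} + n_{1i})(s_i - \bar{s})^2},$$
then the prediction equation under ordinary least-squares fit is
$$\hat{p}_{1i} = p_{case} + b\,(s_i - \bar{s}).$$
The statistic for the Armitage Trend Test is
$$\chi^2 = \frac{b^2}{p_{case}(1 - p_{case})}\sum_{i=0}^{2}(n_{0i} + n_{1i})(s_i - \bar{s})^2,$$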
which is asymptotically chi-squared with one degree of freedom. This is used to obtain the chi-squared based p-value for this
test.
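A minimal sketch of the trend statistic follows, using the standard least-squares formulation of the Cochran-Armitage test with scores 0, 1, and 2. The function name `armitage_trend` and the argument layout (case and control counts ordered by number of minor alleles) are assumptions for the example.

```python
from math import erfc, sqrt

def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Return (slope b, chi-squared, p-value) for an ordered 2 x 3 case/control table."""
    columns = [a + b for a, b in zip(cases, controls)]
    total = sum(columns)
    p_case = sum(cases) / total
    s_bar = sum(s * c for s, c in zip(scores, columns)) / total
    sxx = sum(c * (s - s_bar) ** 2 for s, c in zip(scores, columns))
    # Weighted least-squares slope of the per-column case proportion on the score.
    # Algebraically equal to sum(col * (p_1i - p_case) * (s - s_bar)) / sxx,
    # but written with raw case counts so an empty column does not divide by zero.
    b = sum(n1 * (s - s_bar) for s, n1 in zip(scores, cases)) / sxx
    chi2 = b * b * sxx / (p_case * (1 - p_case))
    return b, chi2, erfc(sqrt(chi2 / 2))  # one degree of freedom

b, chi2, p = armitage_trend(cases=(10, 20, 30), controls=(30, 20, 10))
```

For this table the slope is 0.25 and the statistic is 20, which is also N times the squared correlation between the allele count and case status for the same 120 samples.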
Exact Form of Armitage Test
The exact form of this test yields the exact probability under the null hypothesis of having a “trend” at
least as extreme as the one observed, assuming an equal probability of any permutation of the dependent
variable.
To perform the exact Armitage test, we define the trend score for the contingency table m as




(Pearson) Chi-Squared Test
This is the most commonly used way to obtain a p-value for (the extremeness of) an (unordered) m×n contingency table. It tests the null hypothesis that the proportions in the rows and columns of the table differ from the proportions of the margin column totals and the margin row totals, respectively, only as much as they would by chance alone.
If the contingency table with elements xij has N observations, we make an “expected” contingency table based on the marginal totals:

We then obtain a p-value from the fact that

For the 2 × 2, 2 × 3, and 2 × 4 tables for which this technique is used in this program, the degrees of freedom are 1, 2 and
3 respectively.
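As an illustration, the sketch below builds the expected table from the marginal totals (row total times column total, divided by N), sums the squared deviations scaled by the expected counts, and reports the degrees of freedom as (rows - 1)(columns - 1). The function name `pearson_chi_squared` is hypothetical, and the code assumes every row and column total is nonzero.

```python
from math import exp

def pearson_chi_squared(table):
    """Return (chi-squared, degrees of freedom) for a list-of-rows contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof

chi2, dof = pearson_chi_squared([[10, 20, 30], [30, 20, 10]])
# For exactly two degrees of freedom the chi-squared upper tail is exp(-x / 2).
p_value = exp(-chi2 / 2)
```

This 2 × 3 example has two degrees of freedom, matching the counts given above for the table shapes used in the program.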
Fisher’s Exact Test
The output of this test is the sum of the probabilities of all contingency tables whose marginal sums are the same as those
of the observed contingency table and which are as extreme or more extreme (equally probable or less probable) than the
observed contingency table.
The probability of a 2 ×r contingency table with elements xrc and row totals rc and column totals cr and N elements is given by

To reduce the amount of computation, techniques developed by Mehta and Patel [Mehta and Patel 1986] are used for
computing Fisher’s Exact Test.
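For a 2 × 2 table, the sum can be written out by brute force, as in the sketch below. It enumerates every table with the observed margins using the hypergeometric probability and is meant only to illustrate the definition; it does not use the Mehta and Patel techniques, and the function name `fisher_exact_2x2` is an assumption.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of the table whose top-left cell is x,
        # given the fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the tables that are equally probable or less probable than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

p_value = fisher_exact_2x2(3, 1, 1, 3)
```

In this example the possible top-left cells 0 through 4 have probabilities 1/70, 16/70, 36/70, 16/70, and 1/70, so the p-value is 34/70.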
Odds Ratio with Confidence Limits
For the purposes of this method’s description, we define a 2 × 2 contingency table as being organized as “(Case/Control) vs. (Yes/No)”, as shown in the table below.
| | Yes | No | Total |
| --- | --- | --- | --- |
| Case | $y_{case}$ | $n_{case}$ | $y_{case} + n_{case}$ |
| Control | $y_{control}$ | $n_{control}$ | $y_{control} + n_{control}$ |
| Total | $y_{case} + y_{control}$ | $n_{case} + n_{control}$ | $N$ |
The odds ratio is defined as the ratio of the odds for “Case” among the Yes’s to the odds for “Case” among the No’s, or equivalently the ratio of the odds for “Yes” among the cases to the odds for “Yes” among the controls, or equivalently


The 95% confidence interval then ranges from $e^{\log(OR) - 1.96s}$ to $e^{\log(OR) + 1.96s}$.
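A short sketch of the calculation is given below. It assumes that s is the usual standard error of the log odds ratio, the square root of the sum of the reciprocals of the four cell counts; that choice, and the function name `odds_ratio_ci`, are assumptions for illustration. All four cells must be nonzero.

```python
from math import exp, log, sqrt

def odds_ratio_ci(y_case, n_case, y_control, n_control, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for a 2 x 2 table."""
    odds_ratio = (y_case * n_control) / (n_case * y_control)
    # Assumed standard error of log(OR): sqrt of summed reciprocal cell counts.
    s = sqrt(1 / y_case + 1 / n_case + 1 / y_control + 1 / n_control)
    lower = exp(log(odds_ratio) - z * s)
    upper = exp(log(odds_ratio) + z * s)
    return odds_ratio, lower, upper

odds_ratio, lower, upper = odds_ratio_ci(y_case=20, n_case=80, y_control=10, n_control=90)
```

For these counts the odds ratio is 2.25 and the interval runs from just under 1 to a little above 5, so it narrowly includes 1.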
Analysis of Deviance
This is a maximum-likelihood based technique for analyzing a case-control contingency table with k columns. Let s be the proportion of cases in the entire sample, nj be the number of observations in column j of the contingency table, and pj be the proportion of cases in column j. Then, to perform an analysis of deviance test, we define

$$F_k = \sum_{j=1}^{k}\left[-2n_j\left(p_j\log(p_j) + (1 - p_j)\log(1 - p_j)\right)\right].$$
The test statistic is then $F_0 - F_k$, which approximately follows a chi-squared distribution with $k - 1$ degrees of freedom. A p-value is then obtained based on this chi-squared approximation.
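The sketch below illustrates the statistic under one explicit assumption: each deviance term is taken as -2 times a binomial log-likelihood, so that F0 is computed from the overall case proportion s across all observations, Fk is the sum of the per-column terms, and F0 - Fk is non-negative. The function name `deviance_statistic` and the argument layout are hypothetical.

```python
from math import log

def _xlogx(p):
    # p * log(p), with the limiting value 0 at p = 0.
    return p * log(p) if p > 0 else 0.0

def _deviance(n, p):
    # -2 times the maximized binomial log-likelihood for n observations with case proportion p.
    return -2 * n * (_xlogx(p) + _xlogx(1 - p))

def deviance_statistic(cases, totals):
    """Return (F0 - Fk, degrees of freedom) for per-column case counts and column sizes."""
    s = sum(cases) / sum(totals)
    f0 = _deviance(sum(totals), s)
    fk = sum(_deviance(n_j, c_j / n_j) for c_j, n_j in zip(cases, totals) if n_j > 0)
    return f0 - fk, len(totals) - 1

statistic, dof = deviance_statistic(cases=(10, 20, 30), totals=(40, 40, 40))
```

For this table the statistic is about 20.93 on two degrees of freedom, close to the Pearson chi-squared value of 20 for the same counts.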
F-Test
The F-Test applies to a quantitative trait being subdivided into two or more groups according to the category of the
predictor variable.
This test asks whether the distributions of the dependent variable within each category are significantly different between the various categories of the predictor variable. Another way to phrase this question is whether the variation of the trait between the categories is substantial by comparison to the variation of the trait within the categories.
If there are n observations xi subdivided into k groups, we define

and

If v1 = (k − 1) and v2 = (n − k), then



The p-value is obtained by subtracting from one the cumulative probability of the F statistic under an $F_{v_1,v_2}$ distribution (where $v_1$ is the numerator degrees of freedom and $v_2$ is the denominator degrees of freedom).

Linear Regression
See Linear Regression.
Logistic Regression
See Logistic Regression.
Statistics for Numeric Association Tests
Correlation/Trend Test
The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). It may also be thought of as a test for any “trend” which either one of the numeric variables may have taken against the other one.
If we have n pairs of observations (xi, yi), the (signed) correlation R between them is

Meanwhile,

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be
obtained.
NOTE:
- This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead
$$\chi^2 = (n - k - 1)R^2,$$
where k is the number of principal components that have been removed from the data. The premise is that the PCA correction has removed k degrees of freedom from the data, and only the remaining degrees need to be tested.
T-Test
The T-Test is a special form of the F-Test in which distributions in only two categories are being compared. (The T
statistic is the square root of the corresponding F statistic for two categories.)
In the CNV Association Test, the T-Test is used for a quantitative predictor (independent variable) and a case/control
(binary) dependent variable.
The test asks whether the distributions of the quantitative predictor within the two categories of case versus control are significantly different. Another way to phrase this question is whether the variation of the predictor between the categories is substantial by comparison to the variation of the predictor within the categories.
If there are $n_t$ observations $x_{ti}$ corresponding to a true dependent variable value and $n_f$ observations $x_{fi}$ corresponding to a false dependent variable value, we define
$$S_t = \sum_i x_{ti},$$
$$S_f = \sum_i x_{fi},$$
$$S_q = \sum_{\text{observations}} x_i^2.$$
![](_images/manual48x.png)
If $S_d$ is less than a threshold ($10^{-6}$), then the p-value returned is 1.0. Otherwise,

The p-value may be calculated on the basis of this T value as a “two-sided p-value” using Student’s t distribution with
nt + nf − 2 degrees of freedom.
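The sketch below computes a standard pooled-variance two-sample T statistic with nt + nf - 2 degrees of freedom, which is the form whose square equals the two-category F statistic. The function name `pooled_t` is hypothetical, and the guard on the within-group sum of squares is only an assumed stand-in for the threshold check described above, since the exact quantity being compared is not reproduced in the text.

```python
from math import sqrt

def pooled_t(true_group, false_group, threshold=1e-6):
    """Return (T, degrees of freedom); T is 0.0 when there is essentially no within-group variation."""
    nt, nf = len(true_group), len(false_group)
    mean_t = sum(true_group) / nt
    mean_f = sum(false_group) / nf
    ss_within = (sum((x - mean_t) ** 2 for x in true_group)
                 + sum((x - mean_f) ** 2 for x in false_group))
    dof = nt + nf - 2
    if ss_within < threshold:
        # Assumed degenerate-case handling: T = 0 corresponds to a two-sided p-value of 1.0.
        return 0.0, dof
    pooled_variance = ss_within / dof
    return (mean_t - mean_f) / sqrt(pooled_variance * (1 / nt + 1 / nf)), dof

t_stat, dof = pooled_t([1, 2, 3], [5, 6, 7])
```

For these two groups T is about -4.899 on 4 degrees of freedom, and its square, 24, is the F statistic obtained when the same data are treated as two categories.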
False Discovery Rate
When testing multiple hypotheses, there is always the possibility that one or more tests have appeared significant just by chance. Various techniques have been proposed to adjust the p-values or to otherwise correct for multiple testing issues. Among these are the Bonferroni adjustment and the False Discovery Rate. The following technique is used in SVS specifically to correct for multiple testing over many different predictors.
Suppose that m hypotheses are tested, and R of them are rejected (positive results). Of the rejected hypotheses, suppose that V of them are really false positive results, that is V is the number of type I errors. The False Discovery Rate is defined as

Suppose we are rejecting (the null hypothesis) on the basis of the p-values p1,…,pm from these m tests, specifically, when a p-value is less than a parameter γ. If we can treat the p-values as being independent, then we can estimate Pr(p ≤ γ) as


When this is computed for γ equal to any particular p-value, these expressions simplify to


See [Storey 2002]. (We use π0 = 1 here.)
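One common reading of this estimate, with π0 = 1, is sketched below: Pr(p ≤ γ) is estimated by the fraction of the m p-values that are at most γ, so the estimated False Discovery Rate at γ equal to a particular p-value is that p-value times m divided by its rank. This simplified form, and the function name `fdr_estimates`, are assumptions for illustration; the sketch applies no capping at 1 and no monotonicity adjustment.

```python
from bisect import bisect_right

def fdr_estimates(p_values):
    """Estimated FDR at each p-value, assuming pi0 = 1 (illustrative sketch)."""
    m = len(p_values)
    ordered = sorted(p_values)
    estimates = []
    for p in p_values:
        # Number of tests rejected when the threshold equals this p-value
        # (tied p-values share the largest rank).
        rejected = bisect_right(ordered, p)
        estimates.append(p * m / rejected)
    return estimates

fdr = fdr_estimates([0.01, 0.04, 0.03, 0.5])
```

With four tests, the smallest p-value 0.01 has rank 1 and an estimate of 0.04, while 0.03 (rank 2) gets 0.06 and 0.04 (rank 3) gets about 0.053.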
+ 
+ 




