
General Statistics

13.1 General Statistics

General Marker Statistics

The following subsections further explain the methods used in obtaining General Marker Statistics, which may be invoked using a separate window (Section Genotype Statistics by Marker) or as a tab in the Genotypic Association Test module (Section Genotype Association Tests).

Hardy-Weinberg Equilibrium Computation

Suppose we have a marker with alleles $1, \ldots, k$ having frequencies $p_1, \ldots, p_k$. We may write the genotype count for alleles $i$ and $j$ as $n_{ij}$. Due to phase ambiguity, if $i \neq j$, both $n_{ij}$ and $n_{ji}$ denote the same count: occurrences of allele $i$ on the first chromosome and allele $j$ on the second chromosome, together with occurrences of allele $j$ on the first chromosome and allele $i$ on the second chromosome.

Thus, we may write the count for allele $i$ as $n_i = 2n_{ii} + \sum_{j=1, j \neq i}^{k} n_{ij}$. We may also express the genotype frequency for allele $i$ occurring homozygously as $p_{ii} = n_{ii}/n$, and the genotype frequency for heterozygous alleles $i$ and $j$ as $p_{ij} = n_{ij}/n$, where $n$ is the population count. The frequency of allele $i$ may be expressed as:

\[ p_i = \frac{n_i}{2n} = p_{ii} + \frac{1}{2} \sum_{j=1, j \neq i}^{k} p_{ij}. \]

We wish to check the agreement of $p_{ii}$ with $p_i^2$ and the agreement of $p_{ij}$, where $i \neq j$, with $2p_ip_j$. We multiply by two because of how we deal with the phase ambiguity (see above).

Thus, we will define the Hardy-Weinberg equilibrium coefficient $D_{ii}$ or $D_{ij}$ for alleles $i$ and $j$ such that

\[ p_{ii} = p_i^2 + D_{ii}, \]
\[ p_{ij} = 2p_ip_j - 2D_{ij} \quad (\text{for } i \neq j). \]

(It may be shown that for a bi-allelic marker, $D_{11} = D_{12} = D_{22}$.)

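We then have a statistic that approximately follows a chi-squared distribution with $k(k-1)/2$ degrees of freedom,

\[
\begin{aligned}
X^2 &= n \sum_{i=1}^{k} \frac{(p_{ii} - p_i^2)^2}{p_i^2} + n \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \frac{(p_{ij} - 2p_ip_j)^2}{2p_ip_j} \\
    &= \sum_{i=1}^{k} \frac{(n_{ii} - np_i^2)^2}{np_i^2} + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \frac{(n_{ij} - 2np_ip_j)^2}{2np_ip_j}.
\end{aligned}
\]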

From this, we obtain the distribution's p-value $p = \chi^2(X^2, k(k-1)/2)$, and the correlation, $R$, from the inverse distribution for one degree of freedom (where $F$ is the chi-squared distribution), which is

\[ R = \sqrt{\frac{F^{-1}(p)}{n}}. \]
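
As an illustration of the computation above, here is a minimal Python sketch for the bi-allelic case only ($k = 2$, one degree of freedom). The function name and argument names are hypothetical and are not part of SVS. With one degree of freedom the chi-squared upper tail can be written with `math.erfc`, and $F^{-1}(p)$ is simply $X^2$, so $R = \sqrt{X^2/n}$. The sketch assumes both alleles are present, so that no expected count is zero.

```python
import math

def hwe_chi2_biallelic(n11, n12, n22):
    """Chi-squared HWE test for a bi-allelic marker (hypothetical helper)."""
    n = n11 + n12 + n22                    # population (genotype) count
    p1 = (2 * n11 + n12) / (2 * n)         # allele frequency p_1 = n_1 / (2n)
    p2 = 1.0 - p1
    observed = (n11, n12, n22)
    expected = (n * p1 * p1, 2 * n * p1 * p2, n * p2 * p2)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(x2 / 2))  # chi-squared upper tail, 1 d.f.
    r = math.sqrt(x2 / n)                   # F^{-1}(p) equals X^2 when d.f. = 1
    return x2, p_value, r
```

For example, genotype counts of 25, 50, and 25 match the Hardy-Weinberg expectation exactly, so the statistic is zero and the p-value is one.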

Fisher’s Exact Test HWE P-Values

In this test, all of the possible sets of genotypic counts consistent with the observed allele totals are cycled through, and the probabilities of all sets of counts which are as extreme or more extreme (equally probable or less probable) than the observed set of counts are summed.

See [Emigh 1980].
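
The enumeration can be sketched as follows for a bi-allelic marker. This is an illustrative sketch, not the SVS implementation: the function name is hypothetical, and the probability of each set of counts is taken to be the standard conditional probability given the allele totals, $n!\,2^{n_{12}}\,n_A!\,n_B! / (n_{11}!\,n_{12}!\,n_{22}!\,(2n)!)$, which is an assumption about the exact form used.

```python
import math

def hwe_exact_biallelic(n11, n12, n22):
    """Exact HWE p-value for a bi-allelic marker (hypothetical helper)."""
    n = n11 + n12 + n22
    n_a = 2 * n11 + n12            # copies of the first allele
    n_b = 2 * n - n_a              # copies of the second allele

    def log_prob(het):
        # Log-probability of a table with `het` heterozygotes, given n_a and n_b.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (math.lgamma(n + 1) - math.lgamma(hom_a + 1)
                - math.lgamma(het + 1) - math.lgamma(hom_b + 1)
                + het * math.log(2)
                + math.lgamma(n_a + 1) + math.lgamma(n_b + 1)
                - math.lgamma(2 * n + 1))

    log_obs = log_prob(n12)
    p_value = 0.0
    # The heterozygote count must have the same parity as n_a.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = log_prob(het)
        if lp <= log_obs + 1e-9:   # equally probable or less probable
            p_value += math.exp(lp)
    return min(p_value, 1.0)
```

With two individuals and genotype counts (1, 0, 1), the only consistent tables have zero or two heterozygotes, with probabilities 1/3 and 2/3, so the p-value is 1/3.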

Signed HWE Correlation R

NOTE:

  • This statistic applies only to bi-allelic markers.

We define the signed HWE correlation R as

\[ R_{\text{signed}} = \frac{n\,n_{DD} - n_D^2}{n\,n_D - n_D^2}, \]

where

\[ n_D = n_{DD} + \frac{n_{Dd}}{2}, \]

$n$ is the total genotype count, and $n_{DD}$ and $n_{Dd}$ are the counts for genotypes $DD$ and $Dd$, respectively.

This is derived from the formula for (signed) correlation between two sets of observations, $x_i$ and $y_i$,

\[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}, \]

where we take the $x_i$ to be 0 if the first allele is $d$ and 1 if the first allele is $D$, and the $y_i$ to be 0 if the second allele is $d$ and 1 if the second allele is $D$.

Because of phase ambiguity, we set each of the counts of (d, D) and (D, d) to be one-half of the (phase-ambiguous) observed count of Dd. The correlation then simplifies to the formula first given above.

If there is a high homozygous count, $x_i$ and $y_i$ will often both be 1 or both be 0 for the same observation, and therefore there will be a positive correlation between the $x_i$ and the $y_i$. Similarly, if there is a high heterozygous count, $x_i$ and $y_i$ will often take opposite values, causing an anti-correlation to exist.
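
A minimal sketch of the formula above, using a hypothetical function name. It assumes the marker is polymorphic, so that the denominator is nonzero.

```python
def signed_hwe_r(n_DD, n_Dd, n_dd):
    """Signed HWE correlation R for a bi-allelic marker (hypothetical helper)."""
    n = n_DD + n_Dd + n_dd          # total genotype count
    n_D = n_DD + n_Dd / 2           # half of the heterozygotes count toward D
    return (n * n_DD - n_D ** 2) / (n * n_D - n_D ** 2)
```

An all-homozygous sample such as (50, 0, 50) gives $R = 1$, an all-heterozygous sample such as (0, 100, 0) gives $R = -1$, and counts in exact Hardy-Weinberg proportions such as (25, 50, 25) give $R = 0$.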

Call Rate

The call rate is the fraction of genotypes for the given marker that are present and not missing.

\[ \text{call rate} = \frac{\text{number of complete and non-missing genotypes}}{\text{total number of genotypes}} \]

Minor Allele Frequency (MAF)

The minor allele frequency is the fraction of the total alleles of the given marker that are minor alleles.

\[ \text{MAF} = \frac{\text{minor allele count}}{\text{total allele count}} \]
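
The two ratios above can be sketched as follows. The data representation is an assumption made for illustration: each genotype is a pair of allele labels, a missing genotype is `None`, and a genotype with one missing allele is treated as not called. The function name is hypothetical, and the sketch assumes at least one genotype is called.

```python
from collections import Counter

def call_rate_and_maf(genotypes):
    """Call rate and minor allele frequency for one marker (hypothetical helper)."""
    # Keep only complete, non-missing genotypes.
    called = [g for g in genotypes if g is not None and None not in g]
    call_rate = len(called) / len(genotypes)
    allele_counts = Counter(a for g in called for a in g)
    total = sum(allele_counts.values())
    # For a monomorphic marker there is no minor allele, so the count is zero.
    minor = min(allele_counts.values()) if len(allele_counts) > 1 else 0
    return call_rate, minor / total
```

For five samples with genotypes AA, AB, missing, BB, and AA, four of five genotypes are called, and allele B occurs 3 times among 8 called alleles.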

Statistics Available for Genotype Association Tests

Correlation/Trend Test

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as testing for any “trend” of one of the numeric variables against the other.

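If we have $n$ pairs of observations $(x_i, y_i)$, the (signed) correlation $R$ between them is

\[ R = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sqrt{\left(\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right)\left(\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right)}}. \]

Meanwhile,

\[ \chi^2 = (n - 1)R^2 \]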

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

NOTE:

  1. In the special case of the additive model (and no PCA correction) for a case/control study, if we were to use, instead of the above formula,
    \[ \chi^2 = nR^2, \]
    we would have the mathematical equivalent of the Armitage Trend Test.
  2. This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead
    \[ \chi^2 = (n - 1 - k)R^2, \]
    where $k$ is the number of principal components that have been removed from the data. The premise is that PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.
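
The correlation and the chi-squared statistic can be sketched as below. The function name and the `k` keyword are hypothetical; `k` only adjusts the multiplier $(n - 1 - k)$ and the sketch does not itself perform any PCA correction. The p-value uses the one-degree-of-freedom chi-squared upper tail written with `math.erfc`.

```python
import math

def correlation_trend_test(x, y, k=0):
    """Correlation/trend test on paired numeric data (hypothetical helper)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y)) - sx * sy / n
    sxx = sum(a * a for a in x) - sx * sx / n
    syy = sum(b * b for b in y) - sy * sy / n
    r = sxy / math.sqrt(sxx * syy)            # signed correlation R
    chi2 = (n - 1 - k) * r * r                # k = removed principal components
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared upper tail, 1 d.f.
    return r, chi2, p_value
```

As a usage example, `x` may hold the number of minor alleles (0, 1, or 2) per sample under the additive model and `y` the case/control status coded as 1 or 0.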

Armitage Trend Test

The Armitage Trend Test tests the “trend” in an ordered case/control contingency table. In SVS, the ordering is by number of minor alleles in the genotype – zero, one, or two.

Let $n_{10}$, $n_{11}$, and $n_{12}$ be the counts for cases with 0, 1, and 2 minor alleles, respectively, and $n_{00}$, $n_{01}$, and $n_{02}$ be the counts for controls with 0, 1, and 2 minor alleles, respectively. Also, let $s_0 = 0$, $s_1 = 1$, and $s_2 = 2$.

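If we let $N$ be the total count,

\[ p_{\text{case}} = \frac{n_{10} + n_{11} + n_{12}}{N}, \]
\[ p_{1i} = \frac{n_{1i}}{n_{0i} + n_{1i}}, \]
\[ \bar{s} = \frac{\sum_i (n_{0i} + n_{1i}) s_i}{N}, \text{ and} \]
\[ b = \frac{\sum_i (n_{0i} + n_{1i})(p_{1i} - p_{\text{case}})(s_i - \bar{s})}{\sum_i (n_{0i} + n_{1i})(s_i - \bar{s})^2}, \]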

then the prediction equation under ordinary least-squares fit is

\[ \hat{p}_{1i} = p_{\text{case}} + b(s_i - \bar{s}). \]

The statistic for the Armitage Trend Test is

\[ z^2 = \frac{b^2}{p_{\text{case}}(1 - p_{\text{case}})} \sum_i (n_{0i} + n_{1i})(s_i - \bar{s})^2, \]

which is asymptotically chi-squared with one degree of freedom. This is used to obtain the chi-squared based p-value for this test.
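
A minimal sketch of the slope $b$ and the statistic $z^2$, with a hypothetical function name and the scores 0, 1, 2 as defaults. It uses the identity $(n_{0i} + n_{1i})\,p_{1i} = n_{1i}$, so columns with zero total need no special handling. The p-value is the one-degree-of-freedom chi-squared upper tail.

```python
import math

def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Armitage trend test on a 2 x 3 table (hypothetical helper)."""
    cols = [c + d for c, d in zip(cases, controls)]   # n_0i + n_1i
    N = sum(cols)
    p_case = sum(cases) / N
    s_bar = sum(n * s for n, s in zip(cols, scores)) / N
    sxx = sum(n * (s - s_bar) ** 2 for n, s in zip(cols, scores))
    # (n_0i + n_1i) * (p_1i - p_case) equals n_1i - (n_0i + n_1i) * p_case.
    num = sum((c - n * p_case) * (s - s_bar)
              for c, n, s in zip(cases, cols, scores))
    b = num / sxx
    z2 = b * b * sxx / (p_case * (1 - p_case))
    p_value = math.erfc(math.sqrt(z2 / 2))            # chi-squared, 1 d.f.
    return b, z2, p_value
```

For cases (10, 20, 30) and controls (30, 20, 10), the slope is 0.25 and the statistic is 20.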

Exact Form of Armitage Test

The exact form of this test yields the exact probability under the null hypothesis of having a “trend” at least as extreme as the one observed, assuming an equal probability of any permutation of the dependent variable.

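To perform the exact Armitage test, we define the trend score for the contingency table $m$ as

\[ T_m = \sum_i n_{1i}^{(m)} s_{1i}, \]

where $n_{1i}^{(m)}$ is the case count in column $i$ of table $m$ and

\[ s_{1i} = \frac{s_i - \bar{s}}{\sum_j (n_{0j} + n_{1j})(s_j - \bar{s})^2}. \]

The exact permutation p-value is evaluated as

\[ p_{\text{exact}} = \sum_{|T_m| \geq |T_{\text{observed}}|} p_m, \]

where

\[ p_m = \frac{\prod_i \binom{n_{0i} + n_{1i}}{n_{1i}}}{\binom{N}{n_{10} + n_{11} + n_{12}}}, \]

with the case counts $n_{1i}$ in the numerator taken from table $m$.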
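
The exact p-value can be sketched by brute-force enumeration of every table that keeps the column totals and the number of cases fixed. This is only an illustration with a hypothetical function name; it is practical for small tables only, and nothing here is meant to describe how SVS organizes the computation. A small tolerance is used when comparing floating-point trend scores.

```python
from itertools import product
from math import comb, prod

def armitage_exact(cases, controls, scores=(0, 1, 2)):
    """Exact Armitage trend p-value by enumeration (hypothetical helper)."""
    cols = [c + d for c, d in zip(cases, controls)]   # fixed column totals
    N, n_case = sum(cols), sum(cases)
    s_bar = sum(n * s for n, s in zip(cols, scores)) / N
    denom = sum(n * (s - s_bar) ** 2 for n, s in zip(cols, scores))
    s1 = [(s - s_bar) / denom for s in scores]        # the weights s_1i

    def trend(table_cases):
        return sum(n1 * w for n1, w in zip(table_cases, s1))

    t_obs = abs(trend(cases))
    total = comb(N, n_case)
    p_exact = 0.0
    for m in product(*(range(n + 1) for n in cols)):
        if sum(m) != n_case:
            continue                                  # wrong number of cases
        if abs(trend(m)) >= t_obs - 1e-12:
            p_exact += prod(comb(n, k) for n, k in zip(cols, m)) / total
    return p_exact
```

With cases (0, 0, 2) and controls (2, 0, 0), three tables are possible, with probabilities 1/6, 4/6, and 1/6. The two extreme tables have $|T_m| = |T_{\text{observed}}|$, giving an exact p-value of 1/3.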

(Pearson) Chi-Squared Test

This is the most often used way to obtain a p-value for (the extremeness of) an (unordered) $m \times n$ contingency table. It tests the null hypothesis that the proportions in the rows and columns of the table differ from the proportions of the marginal column totals and the marginal row totals, respectively, only as much as they would by chance alone.

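If the contingency table with elements $x_{ij}$ has $N$ observations, we make an “expected” contingency table based on the marginal totals:

\[ e_{ij} = \frac{r_i c_j}{N}, \]

where $r_i$ is the total of row $i$ and $c_j$ is the total of column $j$.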

We then obtain a p-value from the fact that

\[ \chi^2 = \sum_{i,j} \frac{(x_{ij} - e_{ij})^2}{e_{ij}} \]

approximates a chi-squared distribution with $(m - 1)(n - 1)$ degrees of freedom.

For the 2 × 2, 2 × 3, and 2 × 4 tables for which this technique is used in this program, the degrees of freedom are 1, 2 and 3 respectively.
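
The statistic and its p-value can be sketched as follows. Both function names are hypothetical. To stay within the Python standard library, the chi-squared upper tail for integer degrees of freedom is built from the closed forms for one and two degrees of freedom and the recurrence $Q(x; k+2) = Q(x; k) + (x/2)^{k/2} e^{-x/2} / \Gamma(k/2 + 1)$. The sketch assumes every marginal total is positive.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable, integer df >= 1."""
    if df % 2:                       # odd df: start from the 1-d.f. form
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:                            # even df: start from the 2-d.f. form
        q, k = math.exp(-x / 2), 2
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

def pearson_chi2(table):
    """Pearson chi-squared test on a list-of-rows table (hypothetical helper)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            e = row_totals[i] * col_totals[j] / N     # expected count e_ij
            chi2 += (x - e) ** 2 / e
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, chi2_sf(chi2, df)
```

For the 2 × 2 table with rows (10, 20) and (20, 10), every expected count is 15 and the statistic is 20/3 with one degree of freedom.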

Fisher’s Exact Test

The output of this test is the sum of the probabilities of all contingency tables whose marginal sums are the same as those of the observed contingency table and which are as extreme or more extreme (equally probable or less probable) than the observed contingency table.

The probability of a $2 \times r$ contingency table with elements $x_{ij}$, row totals $r_1$ and $r_2$, column totals $c_1, \ldots, c_r$, and $N$ observations is given by

\[ p = \frac{(r_1!\, r_2!)(c_1!\, c_2! \cdots c_r!)}{x_{11}!\, x_{12}! \cdots x_{2r}!\, N!}. \]

To reduce the amount of computation, techniques developed by Mehta and Patel [Mehta and Patel 1986] are used for computing Fisher’s Exact Test.
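
For small tables, the definition can be applied directly by enumerating every first row consistent with the margins. The sketch below does exactly that with a hypothetical function name; it is a plain enumeration and not the Mehta and Patel technique. The table probability is written as $\prod_j \binom{c_j}{x_{1j}} / \binom{N}{r_1}$, which is algebraically the same as the factorial formula above.

```python
from math import comb, prod

def fisher_exact_2xr(table):
    """Fisher's exact p-value for a 2 x r table by enumeration (hypothetical)."""
    row1, row2 = table
    cols = [a + b for a, b in zip(row1, row2)]        # fixed column totals
    r1, N = sum(row1), sum(cols)
    total = comb(N, r1)

    def prob(first_row):
        return prod(comb(c, x) for c, x in zip(cols, first_row)) / total

    def first_rows(j, remaining):
        # Yield every way to place `remaining` counts in columns j, j+1, ...
        if j == len(cols) - 1:
            if remaining <= cols[j]:
                yield (remaining,)
            return
        for x in range(min(cols[j], remaining) + 1):
            for rest in first_rows(j + 1, remaining - x):
                yield (x,) + rest

    p_obs = prob(row1)
    # Sum tables that are equally probable or less probable than the observed one.
    return sum(p for p in map(prob, first_rows(0, r1))
               if p <= p_obs * (1 + 1e-9))
```

For the table with rows (3, 1) and (1, 3), the five possible tables have probabilities 1/70, 16/70, 36/70, 16/70, and 1/70, and the p-value is 34/70.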

Odds Ratio with Confidence Limits

For the purposes of this method’s description, we define a 2 × 2 contingency table as being organized as “(Case/Control) vs. (Yes/No)” demonstrated in the table below.

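            Yes                                         No                                          Total
  Case      $y_{\text{case}}$                           $n_{\text{case}}$                           $y_{\text{case}} + n_{\text{case}}$
  Control   $y_{\text{control}}$                        $n_{\text{control}}$                        $y_{\text{control}} + n_{\text{control}}$
  Total     $y_{\text{case}} + y_{\text{control}}$      $n_{\text{case}} + n_{\text{control}}$      $N$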

The odds ratio is defined as the ratio of the odds for “Case” among the Yes’s to the odds for “Case” among the No’s, or equivalently the ratio of the odds for “Yes” among the cases to the odds for “Yes” among the controls, or equivalently

\[ OR = \frac{y_{\text{case}}\, n_{\text{control}}}{n_{\text{case}}\, y_{\text{control}}}. \]

To obtain confidence limits, we use the standard error of $\log(OR)$, which is

\[ s = \sqrt{\frac{1}{y_{\text{case}}} + \frac{1}{y_{\text{control}}} + \frac{1}{n_{\text{case}}} + \frac{1}{n_{\text{control}}}}. \]

The 95% confidence interval then ranges from $e^{\log(OR) - 1.96s}$ to $e^{\log(OR) + 1.96s}$.
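
A short sketch of the odds ratio and its confidence limits, with a hypothetical function name. It assumes all four cell counts are positive, since a zero count makes the standard error undefined.

```python
import math

def odds_ratio_ci(y_case, n_case, y_control, n_control, z=1.96):
    """Odds ratio with confidence limits for a 2 x 2 table (hypothetical)."""
    odds_ratio = (y_case * n_control) / (n_case * y_control)
    # Standard error of log(OR).
    s = math.sqrt(1 / y_case + 1 / y_control + 1 / n_case + 1 / n_control)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * s), math.exp(log_or + z * s)
```

For 20 Yes and 10 No among the cases and 10 Yes and 20 No among the controls, the odds ratio is 4. Because the interval is symmetric on the log scale, the product of the two limits equals the square of the odds ratio.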

Analysis of Deviance

This is a maximum-likelihood based technique for analyzing a case-control contingency table with $k$ columns. Let $n$ be the total number of observations, $s$ be the proportion of cases in the entire sample, $n_j$ be the number of observations in column $j$ of the contingency table, and $p_j$ be the proportion of cases in column $j$. Then, to perform an analysis of deviance test, we define

\[ F_0 = -2n\left(s \log(s) + (1 - s)\log(1 - s)\right) \]

and

\[ F_k = \sum_{j=1}^{k} \left[ -2n_j\left(p_j \log(p_j) + (1 - p_j)\log(1 - p_j)\right) \right]. \]

The test statistic is then $F_0 - F_k$, which approximates a chi-squared distribution with $k - 1$ degrees of freedom. A p-value is then obtained based on this chi-squared approximation.
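
The test statistic can be sketched as below, taking $F_0$ and $F_k$ as minus twice the maximized log-likelihoods of the pooled model and the per-column model, respectively. The function names are hypothetical, and the convention $0 \log 0 = 0$ is an assumption made here so that columns containing only cases or only controls do not cause an error. The sketch returns the statistic and its degrees of freedom; the p-value would then come from a chi-squared upper-tail function.

```python
import math

def xlogx(p):
    """p * log(p), with the assumed convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0

def analysis_of_deviance(cases, controls):
    """Deviance statistic F0 - Fk and its d.f. (hypothetical helper)."""
    cols = [c + d for c, d in zip(cases, controls)]   # n_j
    n = sum(cols)
    s = sum(cases) / n                                # overall case proportion
    f0 = -2 * n * (xlogx(s) + xlogx(1 - s))
    fk = sum(-2 * nj * (xlogx(c / nj) + xlogx(1 - c / nj))
             for c, nj in zip(cases, cols) if nj > 0)
    return f0 - fk, len(cols) - 1
```

When every column has the same case proportion, $F_0$ and $F_k$ coincide and the statistic is zero.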

F-Test

The F-Test applies to a quantitative trait being subdivided into two or more groups according to the category of the predictor variable.

This test is on whether the distributions of the dependent variable within each category are significantly different between the various categories of the predictor variable. Another way to phrase this question is whether the variation of the trait between the categories is substantial by comparison to the variation of the trait within the categories.

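If there are $n$ observations $x_i$ subdivided into $k$ groups, we define

\[ F_0 = \sum_{\text{observations}} (x_i - \bar{x})^2 = \sum_{\text{observations}} x_i^2 - n\bar{x}^2, \]

and

\[ F_k = \sum_{\text{groups}} \left( \sum_{\text{group}} x_i^2 - \frac{\left(\sum_{\text{group}} x_i\right)^2}{n_{\text{group}}} \right). \]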

If $v_1 = (k - 1)$ and $v_2 = (n - k)$, then

\[ \frac{F_0 - F_k}{v_1} \]

is proportional to the variance between the groups, and

\[ \frac{F_k}{v_2} \]

is proportional to the variance within the groups. The F statistic becomes

\[ F = \frac{(F_0 - F_k)/v_1}{F_k/v_2} = \frac{(F_0 - F_k)\,v_2}{F_k\,v_1}. \]

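The p-value is the probability that a random variable following an $F_{v_1,v_2}$ distribution (where $v_1$ is the numerator degrees of freedom and $v_2$ is the denominator degrees of freedom) is at least as large as the observed F statistic, that is, one minus the cumulative probability of the F statistic:

\[ p\text{-value} = P(X \geq |F_{\text{statistic}}|) \quad \text{where } X \sim F_{v_1,v_2}. \]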
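
A minimal sketch of the sums of squares and the F statistic, using a hypothetical function name. It returns the statistic together with $v_1$ and $v_2$; turning these into a p-value requires an F-distribution upper-tail function, which is not part of the Python standard library and is therefore left out. The sketch assumes the within-group sum of squares is nonzero.

```python
def f_statistic(groups):
    """One-way F statistic for a list of groups of numbers (hypothetical)."""
    all_x = [x for g in groups for x in g]
    n, k = len(all_x), len(groups)
    mean = sum(all_x) / n
    f0 = sum(x * x for x in all_x) - n * mean * mean          # total SS
    fk = sum(sum(x * x for x in g) - sum(g) ** 2 / len(g)     # within-group SS
             for g in groups)
    v1, v2 = k - 1, n - k
    return ((f0 - fk) / v1) / (fk / v2), v1, v2
```

For the groups (1, 2, 3), (2, 3, 4), and (3, 4, 5), $F_0 = 12$ and $F_k = 6$, so $F = 3$ with 2 and 6 degrees of freedom.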

Linear Regression

See Linear Regression.

Logistic Regression

See Logistic Regression.

Statistics for Numeric Association Tests

Correlation/Trend Test

The Correlation/Trend Test tests the significance of any correlation between two numeric variables (or two variables which have been encoded as numeric variables). This test may also be thought of as testing for any “trend” of one of the numeric variables against the other.

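If we have $n$ pairs of observations $(x_i, y_i)$, the (signed) correlation $R$ between them is

\[ R = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sqrt{\left(\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right)\left(\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right)}}. \]

Meanwhile,

\[ \chi^2 = (n - 1)R^2 \]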

follows an approximate chi-squared distribution with one degree of freedom, from which a p-value may be obtained.

NOTE:

  • This correlation/trend test is also available to be used after PCA correction. However, the formula for the chi-squared statistic is instead
    \[ \chi^2 = (n - 1 - k)R^2, \]
    where $k$ is the number of principal components that have been removed from the data. The premise is that the PCA correction has removed $k$ degrees of freedom from the data, and only the remaining degrees need to be tested.

T-Test

The T-Test is a special form of the F-Test in which distributions in only two categories are being compared. (The T statistic is the square root of the corresponding F statistic for two categories.)

In the CNV Association Test, the T-Test is used for a quantitative predictor (independent variable) and a case/control (binary) dependent variable.

The test is on whether the distributions of the quantitative predictor within the two categories of case versus control are significantly different. Another way to phrase this question is whether the variation of the predictor between the categories is substantial by comparison to the variation of the predictor within the categories.

If there are $n_t$ observations $x_{ti}$ corresponding to a true dependent variable value and $n_f$ observations $x_{fi}$ corresponding to a false dependent variable value, we define

\[ S_t = \sum_i x_{ti}, \qquad S_f = \sum_i x_{fi}, \qquad S_q = \sum_{\text{observations}} x_i^2. \]

Then

\[ S_d = \frac{S_q - \frac{S_t^2}{n_t} - \frac{S_f^2}{n_f}}{n_t + n_f - 2}. \]

If $S_d$ is less than a threshold ($10^{-6}$), then the p-value returned is 1.0. Otherwise,

\[ T = \frac{\frac{S_t}{n_t} - \frac{S_f}{n_f}}{\sqrt{\frac{S_d}{n_t} + \frac{S_d}{n_f}}}. \]

The p-value may be calculated on the basis of this T value as a “two-sided p-value” using Student’s t distribution with $n_t + n_f - 2$ degrees of freedom.
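
The pooled variance $S_d$ and the T statistic can be sketched as follows. The function name is hypothetical. Returning `None` for T when $S_d$ falls below the threshold is a choice made for this sketch, standing in for the rule that the p-value is then reported as 1.0. The two-sided p-value itself needs a Student's t distribution function, which the Python standard library does not provide, so only T and the degrees of freedom are returned.

```python
import math

def t_statistic(x_true, x_false, threshold=1e-6):
    """Two-sample T statistic with pooled variance (hypothetical helper)."""
    nt, nf = len(x_true), len(x_false)
    st, sf = sum(x_true), sum(x_false)
    sq = sum(x * x for x in x_true) + sum(x * x for x in x_false)
    df = nt + nf - 2
    sd = (sq - st * st / nt - sf * sf / nf) / df      # pooled variance S_d
    if sd < threshold:
        return None, df      # the p-value is reported as 1.0 in this case
    t = (st / nt - sf / nf) / math.sqrt(sd / nt + sd / nf)
    return t, df
```

For the observations (1, 2, 3) and (3, 4, 5), $S_d = 1$ and $T^2 = 6$, which matches the F statistic computed for the same two groups.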

False Discovery Rate

When testing multiple hypotheses, there is always the possibility that one or more tests appear significant just by chance. Various techniques have been proposed to adjust the p-values or to otherwise correct for multiple testing issues. Among these are the Bonferroni adjustment and the False Discovery Rate. The technique described below is used in SVS specifically to correct for multiple testing over many different predictors.

Suppose that m hypotheses are tested, and R of them are rejected (positive results). Of the rejected hypotheses, suppose that V of them are really false positive results, that is V is the number of type I errors. The False Discovery Rate is defined as

\[ FDR = E\left( \frac{V}{R} \,\middle|\, R > 0 \right) \Pr(R > 0), \]

that is, the expected proportion of false positive findings among all rejected hypotheses times the probability of making at least one rejection.

Suppose we are rejecting (the null hypothesis) on the basis of the p-values $p_1, \ldots, p_m$ from these $m$ tests, specifically, when a p-value is less than or equal to a parameter $\gamma$. If we can treat the p-values as being independent, then we can estimate $\Pr(P \leq \gamma)$ as

\[ \widehat{\Pr}(P \leq \gamma) = \frac{\max(R(\gamma), 1)}{m}, \]

where $R(\gamma)$ is the number of $p_i$ less than or equal to $\gamma$, and use this to estimate the False Discovery Rate $FDR$ as

\[ \widehat{FDR}(\gamma) = \frac{\gamma}{\widehat{\Pr}(P \leq \gamma)}. \]

When this is computed for $\gamma$ equal to any particular p-value, these expressions simplify to

\[ \widehat{\Pr}(P \leq \gamma) = \frac{R(\gamma)}{m}, \]

and

\[ \widehat{FDR}(\gamma) = \frac{m\gamma}{R(\gamma)} = \frac{m\gamma}{j}, \]

where $j$ is the number of p-values less than or equal to $\gamma$.

See [Storey 2002]. (We use π0 = 1 here.)
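
The simplified estimate $m\gamma/j$ can be sketched for a whole list of p-values as below. The function name is hypothetical. The sketch evaluates the estimate at each p-value in turn and returns the results in the original order; it applies no cap at 1 and no monotonicity adjustment, since none is described above.

```python
from bisect import bisect_right

def fdr_estimates(p_values):
    """Estimated FDR at each p-value, using m * gamma / j (hypothetical)."""
    m = len(p_values)
    ordered = sorted(p_values)
    estimates = []
    for gamma in p_values:
        j = bisect_right(ordered, gamma)   # number of p-values <= gamma
        estimates.append(m * gamma / j)
    return estimates
```

For the p-values 0.5, 0.01, 0.03, and 0.02, the estimates are 0.5, 0.04, 0.04, and 0.04.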