‹‹ Back to SVS Home

Formulas for Principal Component Analysis

15.6 Formulas for Principal Component Analysis

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” which implements this technique. They describe the PCA correction technique in [Price 2006]. A more thorough discussion of stratification, principal components analysis and the eigenvalues involved may be found in [Patterson 2006].

NOTE:

Suppose X is the m (marker) by n (subject or sample) matrix of either numeric values or the numeric equivalent values of the genotypes, with the exception that for each marker, what had been the average value over that marker has first been subtracted out from the markers values (data centering by marker).

NOTE:

  • In our notation, a spreadsheet in SVS on which PCA may be performed may be thought of as XT, since such a spreadsheet suggests an n (subject or sample) by m (marker) matrix.
Motivation

The idea is that X may be thought of as decomposed into USV T, where U is an m by n matrix whose kth column contains coordinates of each marker along a kth principal component, S is an n by n singular-value matrix, and V is an n by n matrix of ancestries relating the samples to the singular values.

It turns out that for this analysis, XTX (which is like a Wishart matrix) is a useful matrix for computational purposes. (See Further Motivation: Relation to the Variance-Covariance Matrix.) This is equal to:

XT X = VSU TU SV T,
but UTU = I by definition, so
  T       2 T
X  X = V S V .
If XTX is decomposed into its eigenvectors and eigenvalues, then
XT XV  = V S2VTV = V S2,
or
XT Xvk = λkvk,
where λk and vk are the eigenvalues and eigenvectors of XTX.

Thus, we conclude that the vk are ancestry vectors, with the most important ones being those corresponding to the highest λk. Our object will be to, in some way, subtract these most important vk out from the rest of the data.

NOTE:

  • In actual fact, SVS normalizes XTX by the number of markers, m. The corresponding decomposition of X becomes XTX
-m-- = V (S2∕m)V T, and S2
m- is the matrix which contains the eigenvalues that we can obtain and output from SVS.
Technique

When accumulating the elements of XTX for genotypic data, the marker data may be optionally weighted in favor of those markers with less variance by normalizing them according to their variance, either theoretical or actual. See Formulas for PCA Normalization of Genotypic Data regarding marker normalizing for genotypic data.

Otherwise, the technique is to:

  • Scan the markers and obtain the Wishart-like XTX matrix.
  • Find the desired number of eigenvectors and eigenvalues from XTX.
  • If PCA-corrected testing is being performed, apply the PCA corrections, first from the dependent variable (for genotypic data or if correcting the dependent variable) and then during the testing scan from each marker of the genetic data just before it is tested.
Applying the PCA Correction

Suppose r is a vector containing the dependent variable (response) values or numeric-equivalent values. Exercising data centering by marker, we subtract the average element value from each element to get r0; then,

r1 = r0 − ρ1v1,
where ρ1 = v1 r0. Similarly,
r2 = r1 − ρ2v2,
where ρ2 = v2 r1, etc., up to
rk = rk−1 − ρkvk,
where ρk = vk rk1.

Note that the ρ values are obtained through dot products, where component wise,

    ∑
ρ = --ja∑jkrj(k−1).
 k       ja2jk

The corrected r is obtained (as a part of data centering by marker) by adding back the average element value of the original r to each element of rk.

The idea is for each principal component, we find what magnitude of it is present in r (or rk1)–then we subtract that magnitude of that component from rk1 to get rk.

Exactly the same technique is applied for every vector g of data from an individual marker. After subtracting the average element value to get g0, we compute

g1 = g0 − γ1v1,
where γ1 = v1 g0. Similarly,
g2 = g1 − γ2v2,
where γ2 = v2 g1, etc., up to
g = g   − γ v ,
k    k−1   k k
where γk = vk gk1.

Component-wise, we have

     ∑
γk = --ja∑jkgj(k−1).
          ja2jk

We finally get the corrected g back (as a part of data centering by marker) by adding back the average element value of the original g to each element of gk.

Formulas for PCA Normalization of Genotypic Data

The possibilities for marker normalization are:

  • The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This means the standard deviation of this marker’s data would have to be over the population if, in the population, it were in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are actually measured for this marker. This is the standard method used by the EIGENSTRAT program.

    This theoretical standard deviation is as follows according to the genetic model being used:

    • Additive (the only model supported by EIGENSTRAT itself): 2∘ ---------
  (p(1− p)), where
             ∑
p = (1+---gj)
    (2+ 2n)
      is a Bayesian estimate of the minor allele probability in terms of the numeric marker values gj.
    • Dominant: ∘ ---------
  (p(1− p)), where
               ∑
p = (.5-+---gj)
      (1 + n)
      is a Bayesian estimate of the probability of being a DD or Dd genotype.
    • Recessive: ∘ (p(1-−-p))-, where
          (.5 + ∑ gj)
p = --(1-+-n)--
      is a Bayesian estimate of the probability of being a DD genotype.
  • The actual standard deviation of this markers data.
  • Don’t normalize.

NOTE:

  • Numeric data is not normalized in SVS.
Further Motivation: Relation to the Variance-Covariance Matrix

For the purposes of PCA correction in SVS, the marker data may be treated as if it comes from a multivariate normal distribution, where the data from marker r, xr = (xr1,...,xrn), is picked from a random variable x = (x1,...,xn) (where n is equal to the number of samples). If x indeed has a multivariate normal distribution, then there is a random vector z = (z1,...,zk) whose components are independent standard normal random variables, a vector μ = (μ1,...,μn), and a k by n matrix A such that x = zA + μ.

The vector μ is the expected value of x, and the matrix ATA is the variance-covariance matrix of the components xj.

Thus, the vectors (xr1,...,xrn) in matrix X may be treated as coming from the expression

(xr1,...,xrn) = (zr1,...,zrk)A +(μ1,...,μn ),

or

     ∑
xrp =   zrqaqp + μp,
      q

where aqp are elements in matrix A.

Let us examine XTX. One element of XTX may be written as

  T       ∑
(X  X )ij =    xrixrj.
           r

The term xrixrj expands out to

         ∑              ∑
xrixrj = [( zrqaqi)+ μi][(   zrqaqj)+ μj]
          q              q

   ∑         ∑            ∑             ∑
= (   zrqaqi)(   zrqaqj)+ μi(  zrqaqj) +μj(   zrqaqi)+ μiμj.
    q         q            q             q

Expanding the first term above, we have

(∑  zrqaqi)(∑   zrqaqj) = ∑ zrqaqizrqaqj + ∑ ∑  zrqaqizrsasj.
  q        q           q              q s⁄=q

We may thus write

          ∑          ∑  ∑               ∑  ∑                ∑             ∑
(XT X )ij =   xrixrj =   [  zrqaqizrqaqj +      zrqaqizrsasj + μi(  zrqaqj)+ μj(  zrqaqi)+ μiμj]
           r          r  q               q s⁄=q                q             q

  ∑  ∑              ∑   ∑  ∑                ∑  ∑            ∑  ∑         ∑
=       zrqaqizrqaqj +          zrqaqizrsasj + μi     zrqaqj +μj       zrqaqi +   μiμj
   r  q              r   q s⁄=q                r  q            r  q         r

  ∑       ∑       ∑  ∑        ∑           ∑     ∑         ∑     ∑
=    aqiaqj    zr2q +      aqiasj   zrqzrs + μi  aqj   zrq + μj  aqi   zrq + m μiμj,
   q       r       q s⁄=q       r           q     r         q     r

where m is the number of markers.

However, since the components of the random variable from which z is assumed to be drawn are independent standard normal random variables, we may make the following estimates:

∑   2
   zrq = m
 r
∑  z z  = 0,(q ⁄= s)
 r  rqrs
∑
    zrq = 0
 r

Thus, we may estimate the elements i and j of XTX as

  T         ∑
(X X )ij = m    aqiaqj + m μiμj,
             q

or

(XT X )ij   ∑
---m----=    aqiaqj + μiμj.
           q

Note that

  T      ∑
(A A )ij =   aqiaqj;
          q

thus, we have the following estimate:

XT X     T     T
--m-- = A A + μ μ,

where ATA is the covariance matrix of the multivariate distribution of x and μTμ is an outer product of the sample averages.

NOTE:

  •   T
Xm-XμTμ is a Wishart matrix (based on (x rjμj)). The distribution of this random matrix is a generalization of the chi-square distribution.
Centering by Marker vs. Not Centering by Marker

In the EIGENSTRAT technique (Formulas for Principal Component Analysis), the data is centered by marker–that is, the average of the data within a given marker is first subtracted out from each of that marker’s data values both before they are used for the XTX matrix and before they are PCA-corrected, with the averages being added in after correction.

An alternative approach, available in SVS for numeric or copy number data, is to not center by marker, but instead allow the overall average value of each marker to be a part of the principal components analysis.

Two cases can be made that this approach is preferable to centering by marker for segmenting copy number data. First, if a variation in the data due to a batch effect is “corrected out” using centering by marker, a slight displacement over the remaining markers will ensue. If this displacement is large enough, it might be interpreted as a variation large enough to require a segment boundary, even over samples where there should not be a segment boundary.

Second, leaving the marker average in the principal components analysis may result in overall data smoothing and removing unnecessary segment boundaries.

On the other hand, a true copy-number variation over enough samples may influence the marker average, which, under this approach, could be picked up as a large component or as large parts of two or three large components by principal components analysis. This might result in apparent variations large enough for segment boundaries for samples for which there were no actual variations. However, the good news is that the apparent variations would be in the opposite direction from the true copy-number variations.

Regarding data centering by marker, it should be noted that after the marker averages are subtracted out, we have

ixri = 0, where xri is the element of X corresponding to the r-th marker and the i-th sample. Since

          ∑
(XT X )ij =    xrixrj,
           r

for any pair of samples i and j, we have

∑    T      ∑  ∑          ∑  ∑          ∑  ∑
  (X  X )ij =      xrixrj =      xrixrj =   (  xri)xrj = 0.
 i           i  r          r  i          r   i

Multiplying a constant vector (1,...,1)T into the above XTX will therefore obtain a vector with zero for every element, demonstrating that when data centering by marker is used, one of the eigenvalues is zero, corresponding to an “average vector” of constant elements.

If data is not centered by marker, but instead marker averages are incorporated into the components, this zero eigenvalue will not be present–instead, if the data’s averages are indeed not zero, one “extra” eigenvalue will show up. Often, this will be among the larger eigenvalues, before the eigenvalues start becoming very similar to eigenvalues (with one smaller subscript) that occur when data centering by marker is used.

Centering by Sample vs. Not Centering by Sample

A second alternative approach to PCA, available in SVS for numeric or copy number data, is to center by sample, either in addition to centering by marker or instead of centering by marker. This entails effectively subtracting out the sample average before contributing to the matrix XTX (and adding it back in while correcting). (In SVS, if centering by both sample and marker, the overall average is subtracted out only once.)

Whether it is appropriate to subtract out sample averages and add them back in after correction, rather than including the sample averages as part of what may be corrected, depends upon how much of the batch effect or stratification may be built into the various sample averages and their products vs. built into the variation across both markers and samples.

An example of when centering by sample can be useful is when obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y (See Applying PCA to a Superset of Markers). Centering by sample insures that the different averages of X chromosome intensities between men and women and the different averages of Y chromosome intensities between men and women will be preserved through PCA correction.

Applying PCA to a Superset of Markers

Sometimes, it is desirable to obtain principal components for a given set of samples from one set of markers, and then apply these components to a different set of markers. One example is, within a given set of samples, obtaining the components from a set of markers with a known population structure and then applying these to genotypic data over other markers. Another is the example mentioned above, which is to, for a given set of samples, obtain components of copy number data from chromosomes 1 through 22 and then apply these to copy number data for chromosomes X and Y.

Whether applying PCA to a superset of markers is reasonable to do is governed by whether we can assume that the behavior of the data over the different set of markers, especially the parts of the data contributing to population structure or batch effects, will be governed by essentially the same variance-covariance matrix (and the same set of sample averages, unless data centering by sample is being used) that the original data is governed by. Under this assumption, the same eigenvectors that describe the original data’s population structure or batch effects will also govern, and may be used to remove, the “new” data’s population structure or batch effects.

Applying PCA to a Subset of Samples

It is also sometimes desirable to obtain components for one set of samples, then use these components to correct data from a subset of these samples, either for the same markers or for a superset of markers. To do this, a subset of the components is first created to match the data’s sample subset, then are applied to the data.

Here again, if we are correcting over any superset of the markers from which we obtained principal components, we assume consistency in the variance-covariance matrix (and the sample averages, if applicable) over this superset of markers.

However, now we must look at whether two components which are orthogonal in the main sample set are still orthogonal in the sample subset. Suppose (aTbT)T and (cTdT)T are two components of the full set, where a and c are in the original sample set but not the sample subset, and b and d are in the sample subset. (Without loss of generality, we are writing the vector as if all of the elements in the subset follow after all of the elements not in the subset.) Because these are eigenvectors of XTX, they will be orthogonal. That is, their dot product will be zero.

This dot product may be written ac + bd. We see that if the individual dot products, both outside of and inside of the subset, are zero, then the full dot product will be zero. Thus, components that are orthogonal in the subset and its complement will be orthogonal for the overall set.

Unfortunately, the converse is not generally true. It is true that because the two dot products must sum to zero, any variations of dot products for the subset are perfectly compensated for by corresponding variations in the opposite direction in the product components over data that has not been subset out. On that basis, we could try to say that neither dot product “should” deviate too much from zero, and thus that the components as input “should” be more or less orthogonal over the subset, especially if the subset is of a reasonably large size.

However, because the subset components are not precisely orthogonal except in special cases, a modified Gram-Schmidt process is applied by SVS to make the components that belong to the subset orthonormal (both orthogonal and normalized). Effectively, this is equivalent to “obtaining a PCA correction” for every component vk based upon the previous components v1, v2, ..., up to vk1.