‹‹ Back to SVS Home
Formulas for Principal Component Analysis
15.6 Formulas for Principal Component Analysis
This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” which implements this technique. They describe the PCA correction technique in [Price 2006]. A more thorough discussion of stratification, principal components analysis and the eigenvalues involved may be found in [Patterson 2006].
NOTE:
- SVS currently uses PCA with association tests performed against genotypes, using their numeric equivalent values (Genotypic Principal Component Analysis), and also with association tests performed against numerical data (Numeric Principal Component Analysis).
Suppose X is the m (marker) by n (subject or sample) matrix of either numeric values or the numeric equivalent values of
the genotypes, with the exception that for each marker, what had been the average value over that marker has first been
subtracted out from the markers values (data centering by marker).
NOTE:
- In our notation, a spreadsheet in SVS on which PCA may be performed may be thought of as XT, since such a spreadsheet suggests an n (subject or sample) by m (marker) matrix.
Motivation
The idea is that X may be thought of as decomposed into USV T, where U is an m by n matrix whose kth column contains coordinates of each marker along a kth principal component, S is an n by n singular-value matrix, and V is an n by n matrix of ancestries relating the samples to the singular values.
It turns out that for this analysis, XTX (which is like a Wishart matrix) is a useful matrix for computational purposes. (See Further Motivation: Relation to the Variance-Covariance Matrix.) This is equal to:




Thus, we conclude that the vk are ancestry vectors, with the most important ones being those corresponding to the highest λk. Our object will be to, in some way, subtract these most important vk out from the rest of the data.
NOTE:
- In actual fact, SVS normalizes XTX by the number of markers, m. The corresponding decomposition of X
becomes
= V (S2∕m)V T, and
is the matrix which contains the eigenvalues that we can obtain and
output from SVS.
Technique
When accumulating the elements of XTX for genotypic data, the marker data may be optionally weighted in favor
of those markers with less variance by normalizing them according to their variance, either theoretical or
actual. See Formulas for PCA Normalization of Genotypic Data regarding marker normalizing for genotypic
data.
Otherwise, the technique is to:
- Scan the markers and obtain the Wishart-like XTX matrix.
- Find the desired number of eigenvectors and eigenvalues from XTX.
- If PCA-corrected testing is being performed, apply the PCA corrections, first from the dependent variable (for genotypic data or if correcting the dependent variable) and then during the testing scan from each marker of the genetic data just before it is tested.
Applying the PCA Correction
Suppose r is a vector containing the dependent variable (response) values or numeric-equivalent values. Exercising data centering by marker, we subtract the average element value from each element to get r0; then,



Note that the ρ values are obtained through dot products, where component wise,

The corrected r is obtained (as a part of data centering by marker) by adding back the average element value of the
original r to each element of rk.
The idea is for each principal component, we find what magnitude of it is present in r (or rk−1)–then we subtract that
magnitude of that component from rk−1 to get rk.
Exactly the same technique is applied for every vector g of data from an individual marker. After subtracting the average element value to get g0, we compute



Component-wise, we have

We finally get the corrected g back (as a part of data centering by marker) by adding back the average element value of the original g to each element of gk.
Formulas for PCA Normalization of Genotypic Data
The possibilities for marker normalization are:
- The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This means
the standard deviation of this marker’s data would have to be over the population if, in the population, it were
in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are actually measured
for this marker. This is the standard method used by the EIGENSTRAT program.
This theoretical standard deviation is as follows according to the genetic model being used:
- Additive (the only model supported by EIGENSTRAT itself): 2
, where
is a Bayesian estimate of the minor allele probability in terms of the numeric marker values gj.
- Dominant:
, where
is a Bayesian estimate of the probability of being a DD or Dd genotype.
- Recessive:
, where
is a Bayesian estimate of the probability of being a DD genotype.
- Additive (the only model supported by EIGENSTRAT itself): 2
- The actual standard deviation of this markers data.
- Don’t normalize.
NOTE:
- Numeric data is not normalized in SVS.
Further Motivation: Relation to the Variance-Covariance Matrix
For the purposes of PCA correction in SVS, the marker data may be treated as if it comes from a multivariate normal distribution, where the data from marker r, xr = (xr1,...,xrn), is picked from a random variable x = (x1,...,xn) (where n is equal to the number of samples). If x indeed has a multivariate normal distribution, then there is a random vector z = (z1,...,zk) whose components are independent standard normal random variables, a vector μ = (μ1,...,μn), and a k by n matrix A such that x = zA + μ.
The vector μ is the expected value of x, and the matrix ATA is the variance-covariance matrix of the components xj.
Thus, the vectors (xr1,...,xrn) in matrix X may be treated as coming from the expression

or

where aqp are elements in matrix A.
Let us examine XTX. One element of XTX may be written as

The term xrixrj expands out to
![∑ ∑
xrixrj = [( zrqaqi)+ μi][( zrqaqj)+ μj]
q q](manual197x.png)

Expanding the first term above, we have

We may thus write
![∑ ∑ ∑ ∑ ∑ ∑ ∑
(XT X )ij = xrixrj = [ zrqaqizrqaqj + zrqaqizrsasj + μi( zrqaqj)+ μj( zrqaqi)+ μiμj]
r r q q s⁄=q q q](manual200x.png)


where m is the number of markers.
However, since the components of the random variable from which z is assumed to be drawn are independent standard normal random variables, we may make the following estimates:



Thus, we may estimate the elements i and j of XTX as

or

Note that

thus, we have the following estimate:

where ATA is the covariance matrix of the multivariate distribution of x and μTμ is an outer product of the sample averages.
NOTE:
−μTμ is a Wishart matrix (based on (x
rj−μj)). The distribution of this random matrix is a generalization
of the chi-square distribution.
Centering by Marker vs. Not Centering by Marker
In the EIGENSTRAT technique (Formulas for Principal Component Analysis), the data is centered by marker–that is, the average of the data within a given marker is first subtracted out from each of that marker’s data values both before they are used for the XTX matrix and before they are PCA-corrected, with the averages being added in after correction.
An alternative approach, available in SVS for numeric or copy number data, is to not center by marker, but instead allow the overall average value of each marker to be a part of the principal components analysis.
Two cases can be made that this approach is preferable to centering by marker for segmenting copy number data. First, if a variation in the data due to a batch effect is “corrected out” using centering by marker, a slight displacement over the remaining markers will ensue. If this displacement is large enough, it might be interpreted as a variation large enough to require a segment boundary, even over samples where there should not be a segment boundary.
Second, leaving the marker average in the principal components analysis may result in overall data smoothing and removing unnecessary segment boundaries.
On the other hand, a true copy-number variation over enough samples may influence the marker average, which, under this approach, could be picked up as a large component or as large parts of two or three large components by principal components analysis. This might result in apparent variations large enough for segment boundaries for samples for which there were no actual variations. However, the good news is that the apparent variations would be in the opposite direction from the true copy-number variations.
Regarding data centering by marker, it should be noted that after the marker averages are subtracted out, we have
∑ ixri = 0, where xri is the element of X corresponding to the r-th marker and the i-th sample. Since

for any pair of samples i and j, we have

Multiplying a constant vector (1,...,1)T into the above XTX will therefore obtain a vector with zero for every element, demonstrating that when data centering by marker is used, one of the eigenvalues is zero, corresponding to an “average vector” of constant elements.
If data is not centered by marker, but instead marker averages are incorporated into the components, this zero eigenvalue will not be present–instead, if the data’s averages are indeed not zero, one “extra” eigenvalue will show up. Often, this will be among the larger eigenvalues, before the eigenvalues start becoming very similar to eigenvalues (with one smaller subscript) that occur when data centering by marker is used.
Centering by Sample vs. Not Centering by Sample
A second alternative approach to PCA, available in SVS for numeric or copy number data, is to center by sample, either in addition to centering by marker or instead of centering by marker. This entails effectively subtracting out the sample average before contributing to the matrix XTX (and adding it back in while correcting). (In SVS, if centering by both sample and marker, the overall average is subtracted out only once.)
Whether it is appropriate to subtract out sample averages and add them back in after correction, rather than including the sample averages as part of what may be corrected, depends upon how much of the batch effect or stratification may be built into the various sample averages and their products vs. built into the variation across both markers and samples.
An example of when centering by sample can be useful is when obtaining components of copy number data from chromosomes 1 through 22 and then applying these to copy number data for chromosomes X and Y (See Applying PCA to a Superset of Markers). Centering by sample insures that the different averages of X chromosome intensities between men and women and the different averages of Y chromosome intensities between men and women will be preserved through PCA correction.
Applying PCA to a Superset of Markers
Sometimes, it is desirable to obtain principal components for a given set of samples from one set of markers, and then apply these components to a different set of markers. One example is, within a given set of samples, obtaining the components from a set of markers with a known population structure and then applying these to genotypic data over other markers. Another is the example mentioned above, which is to, for a given set of samples, obtain components of copy number data from chromosomes 1 through 22 and then apply these to copy number data for chromosomes X and Y.
Whether applying PCA to a superset of markers is reasonable to do is governed by whether we can assume that the behavior of the data over the different set of markers, especially the parts of the data contributing to population structure or batch effects, will be governed by essentially the same variance-covariance matrix (and the same set of sample averages, unless data centering by sample is being used) that the original data is governed by. Under this assumption, the same eigenvectors that describe the original data’s population structure or batch effects will also govern, and may be used to remove, the “new” data’s population structure or batch effects.
Applying PCA to a Subset of Samples
It is also sometimes desirable to obtain components for one set of samples, then use these components to correct data from a subset of these samples, either for the same markers or for a superset of markers. To do this, a subset of the components is first created to match the data’s sample subset, then are applied to the data.
Here again, if we are correcting over any superset of the markers from which we obtained principal components, we assume consistency in the variance-covariance matrix (and the sample averages, if applicable) over this superset of markers.
However, now we must look at whether two components which are orthogonal in the main sample set are still orthogonal in the sample subset. Suppose (aTbT)T and (cTdT)T are two components of the full set, where a and c are in the original sample set but not the sample subset, and b and d are in the sample subset. (Without loss of generality, we are writing the vector as if all of the elements in the subset follow after all of the elements not in the subset.) Because these are eigenvectors of XTX, they will be orthogonal. That is, their dot product will be zero.
This dot product may be written a⋅c + b⋅d. We see that if the individual dot products, both outside of and inside of the subset, are zero, then the full dot product will be zero. Thus, components that are orthogonal in the subset and its complement will be orthogonal for the overall set.
Unfortunately, the converse is not generally true. It is true that because the two dot products must sum to zero, any variations of dot products for the subset are perfectly compensated for by corresponding variations in the opposite direction in the product components over data that has not been subset out. On that basis, we could try to say that neither dot product “should” deviate too much from zero, and thus that the components as input “should” be more or less orthogonal over the subset, especially if the subset is of a reasonably large size.
However, because the subset components are not precisely orthogonal except in special cases, a modified Gram-Schmidt process is applied by SVS to make the components that belong to the subset orthonormal (both orthogonal and normalized). Effectively, this is equivalent to “obtaining a PCA correction” for every component vk based upon the previous components v1, v2, ..., up to vk−1.