Linear Regression
15.3 Linear Regression
Genotype or Numeric Association Test – Linear Regression
The response, y, is fit to every genetic predictor variable or encoded genotype, x, in the spreadsheet using linear regression.
The results include the regression p-value, intercept, and slope, which are output in a new spreadsheet along with other
genotypic association test results and any multiple testing correction results. The response is represented with the formula
$y = b_1 x + b_0 + \epsilon$, where the model is the expression $b_1 x + b_0$ and the error term, $\epsilon$, expresses the
difference, or residual, between the model of the response and the response itself. For missing values of the predictor, the
mean of the response is used.
The regression hypothesis test is the test of:

$H_0: b_1 = 0$ versus $H_1: b_1 \neq 0$
Assumptions:
$\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \ldots, n$, where the $\epsilon_i$ denote the residuals, the $\epsilon_i$ are all
independent and follow a normal distribution, and all $\epsilon_i$ have equal variance $\sigma^2$.
The sums of squares and mean sum of squared errors are calculated as follows:
| Number of Observations: | $n$ |
| Rank of the Coefficient Matrix: | $m$ |
| Mean of the response: | $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |
| Mean of the predictors or genotypes: | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Solution to the normal equations: | $\hat{\beta} = b = (X^T X)^{-1} X^T y$, where $\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$ |
| Total Sum of Squares: | $SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg + SSE$ |
| Regression Sum of Squares: | $SSReg = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}^T X^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones |
| Error Sum of Squares: | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}^T X^T y$ |
| Residual Sum of Squares: | $SE_{resid}$ = [equation missing in source] |
| Coefficient of determination: | $R^2 = \frac{SSReg}{SST}$ |
| Adjusted coefficient of determination: | $R^2_{adj} = 1 - \frac{SSE/(n - m)}{SST/(n - 1)}$ |
| Test Statistic: | $F^* = \frac{SSReg/(m - 1)}{SSE/(n - m)}$ |
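The quantities in the table can be computed directly for a single predictor. The following is a minimal sketch in plain Python, not the SVS implementation; the function name `simple_linear_regression` and the returned dictionary keys are hypothetical, and the adjusted R-squared and F statistic use the standard forms with rank m = 2 (intercept plus one slope), which is an assumption.

```python
def simple_linear_regression(x, y):
    """Fit y = b1*x + b0 by least squares and return summary statistics."""
    n = len(y)
    m = 2  # assumed rank of the coefficient matrix: intercept plus one slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Closed-form solution of the normal equations for one predictor.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    fitted = [b1 * xi + b0 for xi in x]
    sst = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n          # total
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)          # regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # error

    return {
        "b1": b1,
        "b0": b0,
        "sst": sst,
        "ss_reg": ss_reg,
        "sse": sse,
        "r2": ss_reg / sst,
        "r2_adj": 1 - (sse / (n - m)) / (sst / (n - 1)),
        "f_star": (ss_reg / (m - 1)) / (sse / (n - m)),
    }


# Genotypes encoded additively as 0, 1, 2 (illustrative values only).
result = simple_linear_regression([0, 1, 2], [1.0, 2.0, 4.0])
print(result["b1"], result["b0"], result["f_star"])
```

The sketch requires at least three observations and a predictor that is not constant; otherwise one of the divisions fails.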
Multiple Linear Regression Model
The Regression Analysis window performs multiple linear regression on the regressors unless only one regressor is specified. A
multiple linear regression model takes one or more regressors and fits a regression model to one dependent variable. This
model is a generalization of the simple linear regression model used for linear regression in the analysis test
dialogs.
Full Model Only Regression Equation
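The regression hypothesis test is the test of:

$H_0$: $\beta_j = 0$ for every regressor $j$, versus $H_1$: $\beta_j \neq 0$ for at least one regressor $j$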
Assumptions:
$\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \ldots, n$, where the $\epsilon_i$ denote the residuals, the $\epsilon_i$ are all
independent and follow a normal distribution, and all $\epsilon_i$ have equal variance $\sigma^2$.
The sums of squares and mean sum of squared errors are calculated as follows:
| Number of Observations: | $n$ |
| Rank of the Coefficient Matrix: | $m$ |
| Mean of the response: | $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |
| Mean of the predictors or genotypes: | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Solution to the normal equations: | $\hat{\beta} = b = (X^T X)^{-1} X^T y$, where $\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$ |
| Total Sum of Squares: | $SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg + SSE$ |
| Regression Sum of Squares: | $SSReg = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}^T X^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones |
| Error Sum of Squares: | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}^T X^T y$ |
| Residual Sum of Squares: | $SE_{resid}$ = [equation missing in source] |
| Coefficient of determination: | $R^2 = \frac{SSReg}{SST}$ |
| Adjusted coefficient of determination: | $R^2_{adj} = 1 - \frac{SSE/(n - m)}{SST/(n - 1)}$ |
| Test Statistic: | $F^* = \frac{SSReg/(m - 1)}{SSE/(n - m)}$ |
The test statistic follows the F distribution, where $p\text{-value} = P(X > F^*)$ and $X \sim F(m - 1, n - m)$.
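As an illustration of the matrix forms above, the following plain-Python sketch solves the normal equations and evaluates SSReg, SSE, and F* for several regressors. It is not the SVS implementation: the function names `solve_linear_system` and `fit_multiple_regression` are hypothetical, an intercept column is assumed, and the design is assumed to have full column rank so that m equals the number of columns.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (rank-deficient design)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple_regression(rows, y):
    """rows: one list of regressor values per observation; y: responses."""
    n = len(y)
    x = [[1.0] + list(r) for r in rows]   # prepend the intercept column
    m = len(x[0])                         # rank, assuming full column rank
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(m)] for a in range(m)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(m)]
    beta = solve_linear_system(xtx, xty)  # beta = (X^T X)^{-1} X^T y

    yty = sum(yi * yi for yi in y)                   # y^T y
    bxty = sum(b * v for b, v in zip(beta, xty))     # beta^T X^T y
    correction = sum(y) ** 2 / n                     # (1/n) y^T J y
    ss_reg = bxty - correction
    sse = yty - bxty
    f_star = (ss_reg / (m - 1)) / (sse / (n - m))
    return beta, ss_reg, sse, f_star


# Illustrative data: two regressors, six observations.
rows = [[0, 1], [1, 0], [2, 1], [0, 0], [1, 1], [2, 0]]
y = [1.0, 2.0, 4.1, 0.5, 2.9, 3.6]
beta, ss_reg, sse, f_star = fit_multiple_regression(rows, y)
print(beta, f_star)
```

Explicitly forming and solving $X^T X$ keeps the sketch close to the formulas; production code would more typically use a QR or similar factorization for numerical stability.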
Full Versus Reduced Model Regression Equation
In the full versus reduced model regression equation, the regression sums of squares are calculated both for the reduced
and for the full model the same way that they are calculated for a regression on just one model. An F test is then performed
to find the significance of the full versus the reduced model.
The test is a model comparison test: the null hypothesis is that the reduced model is the true model and that the full
model is not necessary.
The sums of squares and mean sum of squared errors for the reduced model are calculated as follows:
| Number of Observations: | $n$ |
| Rank of the Reduced Model Coefficient Matrix: | $r$ |
| Mean of the response: | $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |
| Mean of the predictors or genotypes: | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Solution to the normal equations: | $\hat{\beta}_R = b_R = (X^T X)^{-1} X^T y$, where $\hat{\beta}_R \sim N(\beta, \sigma^2 (X^T X)^{-1})$ |
| Total Sum of Squares: | $SST_R = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg_R + SSE_R$ |
| Regression Sum of Squares: | $SSReg_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}_R^T X^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones |
| Error Sum of Squares: | $SSE_R = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}_R^T X^T y$ |
The sums of squares and mean sum of squared errors for the full model are calculated similarly:
| Number of Observations: | $n$ |
| Rank of the Full Model Coefficient Matrix: | $m$ |
| Mean of the response: | $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |
| Mean of the predictors or genotypes: | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Solution to the normal equations: | $\hat{\beta}_F = b_F = (X^T X)^{-1} X^T y$, where $\hat{\beta}_F \sim N(\beta, \sigma^2 (X^T X)^{-1})$ |
| Total Sum of Squares: | $SST_F = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg_F + SSE_F$ |
| Regression Sum of Squares: | $SSReg_F = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}_F^T X^T y - \frac{1}{n} y^T J y$, where $J$ is a matrix of ones |
| Residual Sum of Squares: | $SE_{resid}$ = [equation missing in source] |
| Error Sum of Squares: | $SSE_F = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}_F^T X^T y$ |
The test statistic is:

$F^* = \frac{(SSE_R - SSE_F)/(m - r)}{SSE_F/(n - m)}$

The p-value is calculated by: $p\text{-value} = P(X > F^*)$, where $X \sim F(m - r, n - m)$.
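The comparison can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SVS implementation: the statistic is taken to be the usual partial F statistic with m − r and n − m degrees of freedom, the upper-tail probability is evaluated through the regularized incomplete beta function using a textbook continued fraction, and all function names are hypothetical.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for i in range(1, max_iter + 1):
        i2 = 2 * i
        # Even step of the recurrence.
        aa = i * (b - i) * x / ((qam + i2) * (a + i2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + i) * (qab + i) * x / ((a + i2) * (qap + i2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_survival(f_value, df1, df2):
    """P(X > f_value) for X ~ F(df1, df2)."""
    x = df2 / (df2 + df1 * f_value)
    return regularized_incomplete_beta(df2 / 2.0, df1 / 2.0, x)


def full_vs_reduced_f_test(sse_reduced, sse_full, r, m, n):
    """Return (F*, p-value) for the full model versus the reduced model."""
    f_star = ((sse_reduced - sse_full) / (m - r)) / (sse_full / (n - m))
    return f_star, f_survival(f_star, m - r, n - m)


# Illustrative numbers only: SSE_R = 10, SSE_F = 4, ranks r = 2 and m = 4, n = 8.
print(full_vs_reduced_f_test(10.0, 4.0, r=2, m=4, n=8))
```

A small p-value indicates that the extra regressors in the full model explain significantly more of the response than the reduced model alone.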
Regressor Statistics
The coefficient of the jth regressor is calculated with the equation:

where $\bar{x}_j$ is the mean of the jth regressor and $\bar{y}$ is the mean of the response.
The Y-intercept of the regression equation is calculated with the equation:

$\hat{\beta}_0 = \bar{y} - \sum_{j} \hat{\beta}_j \bar{x}_j$

where $\bar{x}_j$ is the mean of the jth regressor and the sum runs over all regressors.
The standard error for the jth regressor is computed from an auxiliary full model regression in which the jth regressor is
set as the dependent variable and all other regressors are the predictors. Let
$SSR = \sum_{i=1}^{n} (x_{i,j} - \bar{x}_j)^2$ be the regressor sum of squares, $R_j^2$ be the coefficient of determination
for the model of the jth regressor versus all other regressors, $MSE$ be the mean square error for the regression model, and
$SSE$ be the error sum of squares. Let the total number of regressors in the model be $k$. Then the standard error of the
regressor, $SE_j$, is calculated as follows:
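A standard expression that uses exactly the quantities defined above is the variance-inflation form of the coefficient standard error. It is given here as an assumption about the intended equation, with $MSE$ taken as the error sum of squares divided by its degrees of freedom for a model with an intercept:

```latex
SE_j = \sqrt{\frac{MSE}{SSR \,\left(1 - R_j^2\right)}},
\qquad
MSE = \frac{SSE}{n - k - 1}
```

When the jth regressor is nearly a linear combination of the others, $R_j^2$ approaches 1 and the standard error grows without bound.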

The value of the t-statistic for the jth regressor is obtained from the equation:

$t_j = \frac{\hat{\beta}_j}{SE_j}$

where $\hat{\beta}_j$ is the estimated coefficient for the jth regressor.
The p-value of the t-statistic for the jth regressor is the probability of a value as extreme or more extreme than the observed t-statistic from a Student’s T distribution with $n - 2$ degrees of freedom.

The p-value for the univariate fit is obtained from a Student’s T distribution where the t-statistic is calculated assuming
that the jth regressor is the only regressor in the model against the dependent variable.
Categorical Covariates and Interaction Terms
If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” for
the observation indicates that it is equal to the category the dummy variable represents. Similarly, if the
observation is not equal to the category for the dummy variable, then it is assigned the value of “0”. As the
values of one dummy variable can be determined by examining all other dummy variables for a covariate, in
most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression
equation.
A first-order interaction term is a new covariate created from the product of two covariates as
specified in either the full- or reduced-model covariates. If one of the two covariates is categorical, the dummy
variables for each of its categories are multiplied by the other covariate to create the first-order interaction
terms. If both covariates are categorical, the dummy variables from both covariates are multiplied by each
other.
For example, consider the following covariates for six samples.
| Sample | Lab | Dose | Age |
| sample01 | A | Low | 35 |
| sample02 | A | Med | 31 |
| sample03 | A | High | 37 |
| sample04 | B | Low | 32 |
| sample05 | B | Med | 36 |
| sample06 | B | High | 33 |
Using dummy variables for the categorical covariates the above table would be:
| Sample | Lab=A | Lab=B | Dose=Low | Dose=Med | Dose=High | Age |
| sample01 | 1 | 0 | 1 | 0 | 0 | 35 |
| sample02 | 1 | 0 | 0 | 1 | 0 | 31 |
| sample03 | 1 | 0 | 0 | 0 | 1 | 37 |
| sample04 | 0 | 1 | 1 | 0 | 0 | 32 |
| sample05 | 0 | 1 | 0 | 1 | 0 | 36 |
| sample06 | 0 | 1 | 0 | 0 | 1 | 33 |
Interactions Lab*Dose and Lab*Age would be specified as:
| Sample | A*Low | A*Med | A*High | B*Low | B*Med | B*High | A*Age | B*Age |
| sample01 | 1 | 0 | 0 | 0 | 0 | 0 | 35 | 0 |
| sample02 | 0 | 1 | 0 | 0 | 0 | 0 | 31 | 0 |
| sample03 | 0 | 0 | 1 | 0 | 0 | 0 | 37 | 0 |
| sample04 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 32 |
| sample05 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 36 |
| sample06 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 33 |
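The encoding in the tables above can be reproduced with a few lines of plain Python. This is a minimal sketch; the helper name `dummy_columns` and the variable names are hypothetical, and, like the tables, it keeps every dummy column rather than dropping the last one.

```python
samples = [
    ("sample01", "A", "Low", 35),
    ("sample02", "A", "Med", 31),
    ("sample03", "A", "High", 37),
    ("sample04", "B", "Low", 32),
    ("sample05", "B", "Med", 36),
    ("sample06", "B", "High", 33),
]


def dummy_columns(values, levels):
    """One 0/1 indicator column per level, in the given level order."""
    return {level: [1 if v == level else 0 for v in values] for level in levels}


labs = [s[1] for s in samples]
doses = [s[2] for s in samples]
ages = [s[3] for s in samples]

lab_dummies = dummy_columns(labs, ["A", "B"])
dose_dummies = dummy_columns(doses, ["Low", "Med", "High"])

# Categorical x categorical: multiply every pair of dummy columns.
lab_by_dose = {
    f"{lab}*{dose}": [a * b for a, b in zip(lab_col, dose_col)]
    for lab, lab_col in lab_dummies.items()
    for dose, dose_col in dose_dummies.items()
}

# Categorical x numeric: multiply each dummy column by the numeric covariate.
lab_by_age = {
    f"{lab}*Age": [a * age for a, age in zip(lab_col, ages)]
    for lab, lab_col in lab_dummies.items()
}

for name, column in {**lab_by_dose, **lab_by_age}.items():
    print(name, column)
```

In an actual regression, one dummy column per categorical covariate would normally be dropped, as described above, to avoid a rank-deficient matrix.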
Stepwise Regression
If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these
variables. The methods for the two types of stepwise regression, forward selection or backward elimination, are described
below.
Forward Selection
Starting with either the null model or the reduced model (depending on which type of regression was
specified), successive models are created, each one using one more regressor (or covariate) than the previous
model.
Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of
the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest
p-value is used as the next model. This method adds the next most significant variable to the current model. If the current
model had the smallest p-value, or if no p-value is better than the p-value cut-off specified, then the forward selection
method stops and declares the current model as the final model as determined by stepwise forward selection.
If the model with all regressors has the smallest p-value then this full model is determined to be the final
model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential
regressors.
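The forward selection loop can be sketched as below. This is an illustration, not the SVS implementation: `forward_selection` and the callable `p_value_vs(current, trial)`, which is assumed to return the p-value of the trial (full) model versus the current (reduced) model, are hypothetical names, and the default cut-off of 0.05 is an arbitrary choice.

```python
def forward_selection(candidates, start, p_value_vs, cutoff=0.05):
    """Greedy forward selection.

    candidates: regressors that may be added.
    start: regressors already in the null or reduced model.
    p_value_vs(current, trial): hypothetical callable returning the p-value
        of the trial (full) model versus the current (reduced) model.
    """
    current = list(start)
    unused = [c for c in candidates if c not in current]
    while unused:
        # Build one trial model per unused regressor and score it.
        trials = [(p_value_vs(current, current + [c]), c) for c in unused]
        best_p, best = min(trials, key=lambda t: t[0])
        if best_p >= cutoff:
            break  # no trial model beats the cut-off: keep the current model
        current.append(best)
        unused.remove(best)
    return current


# Stub scoring for demonstration: each regressor has a fixed p-value.
fixed = {"a": 0.001, "b": 0.2, "c": 0.01}
selected = forward_selection(["a", "b", "c"], [],
                             lambda cur, trial: fixed[trial[-1]])
print(selected)
```

In practice `p_value_vs` would run the full versus reduced model F test on the two regressor sets, so the p-value of a regressor changes as other regressors enter the model.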
Backward Elimination
Starting with the full model, successive models are created, each one using one fewer regressor (or covariate) than the
previous model.
Each of the regressors currently in the model is removed to create a “trial” model excluding that regressor. The p-value of
the current model (or full model) versus the trial model (or reduced model) is calculated, and the model with the largest
p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value
is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if
all variables have been removed from the model, or if all variables left are included in the original reduced
model.
From the standpoint of further analysis, the final model becomes the “full model” for this set of potential
regressors.
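A matching sketch of backward elimination follows. It is an illustration, not the SVS implementation: `backward_elimination` and the callable `p_value_vs(reduced, full)` are hypothetical names, the default cut-off is arbitrary, and "least significant" is interpreted as the regressor whose removal gives the largest full versus reduced p-value.

```python
def backward_elimination(full, keep, p_value_vs, cutoff=0.05):
    """Greedy backward elimination.

    full: regressors in the starting full model.
    keep: regressors of the original reduced model, which are never removed.
    p_value_vs(reduced, full): hypothetical callable returning the p-value
        of the current (full) model versus the trial (reduced) model.
    """
    current = list(full)
    while True:
        removable = [c for c in current if c not in keep]
        if not removable:
            break  # nothing left, or only reduced-model regressors remain
        trials = [(p_value_vs([v for v in current if v != c], current), c)
                  for c in removable]
        worst_p, worst = max(trials, key=lambda t: t[0])
        if worst_p < cutoff:
            break  # every remaining regressor is significant
        current.remove(worst)
    return current


# Stub scoring for demonstration: each regressor has a fixed p-value.
fixed = {"a": 0.001, "b": 0.2, "c": 0.01, "d": 0.6}


def stub(reduced, full_model):
    removed = [v for v in full_model if v not in reduced][0]
    return fixed[removed]


print(backward_elimination(["a", "b", "c", "d"], ["d"], stub))
```

Here regressor "d" stays in the model despite its large p-value because it belongs to the original reduced model.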