
Linear Regression

15.3 Linear Regression

Genotype or Numeric Association Test – Linear Regression

The response, y, is fit to every genetic predictor variable or encoded genotype, x, in the spreadsheet using linear regression. The results include the regression p-value, intercept, and slope, which are output in a new spreadsheet along with other genotypic association test results and any multiple testing correction results. The response is modeled as $y = b_1 x + b_0 + \epsilon$, where $b_1 x + b_0$ is the model and the error term, $\epsilon$, is the difference, or residual, between the model of the response and the response itself. For missing values of the predictor, the mean of the response is used.

The regression hypothesis test is the test of:

$H_0: \beta_0 = \beta_1 = 0$
$H_a: \beta_i \neq 0$ for at least one $i$

Assumptions:

$\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \dots, n$
where $\epsilon_i$ denote the residuals
and $\epsilon_i$ are all independent and follow a normal distribution
and all $\epsilon_i$ have equal variance $\sigma^2$

The sums of squares and mean sum of squared errors are calculated as follows:

Number of Observations: n
Rank of the Coefficient Matrix: m
Mean of the response: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Solution to the normal equations: $\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = (X^T X)^{-1} X^T y$
where $\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$
Total Sum of Squares: $SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg + SSE$
Regression Sum of Squares: $SSReg = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{\beta}^T X^T y - \frac{1}{n}\left(y^T J y\right)$
where $J$ is a matrix of ones
Error Sum of Squares: $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}^T X^T y$
Residual Standard Error: $SE_{resid} = \sqrt{\frac{SSE}{n - m}}$
Coefficient of determination: $R^2 = \frac{SSReg}{SST}$
Adjusted coefficient of determination: $R^2_{adj} = 1 - \left((1 - R^2)\frac{n - 1}{n - m}\right)$
Test Statistic: $F = \frac{R^2 / (m - 1)}{(1 - R^2) / (n - m)}$
The test statistic follows the F distribution, where p-value $= P(X > F)$ and $X \sim F(1, n - m)$.
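
The quantities listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `simple_linear_regression`, the dictionary keys, and the sample data are hypothetical, the rank m is taken to be 2 (intercept plus one predictor), and the p-value lookup from the F distribution is left out.

```python
import math

def simple_linear_regression(x, y):
    """Fit y = b1*x + b0 and return the sums of squares and test statistic."""
    n = len(y)
    m = 2  # assumed rank: intercept plus one predictor
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    r2 = ss_reg / sst
    return {
        "slope": b1,
        "intercept": b0,
        "SST": sst,
        "SSReg": ss_reg,
        "SSE": sse,
        "SE_resid": math.sqrt(sse / (n - m)),
        "R2": r2,
        "F": (r2 / (m - 1)) / ((1 - r2) / (n - m)),
    }

# Hypothetical data: additive genotype codes (0, 1, 2) and a numeric response
genotypes = [0, 1, 2, 0, 1, 2]
response = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
fit = simple_linear_regression(genotypes, response)
print(fit["slope"], fit["intercept"], fit["R2"], fit["F"])
```

In this sketch SST is computed from the raw sums and SSReg and SSE from the fitted values, so checking that SST equals SSReg + SSE is a quick sanity test of the decomposition.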
Multiple Linear Regression Model

The Regression Analysis window performs multiple linear regression on the regressors unless only one regressor is specified. A multiple linear regression model takes one or more regressors and fits a regression model to one dependent variable. This model is a generalization of the simple linear regression model used for linear regression in the analysis test dialogs.

Full Model Only Regression Equation

The regression hypothesis test is the test of:

$H_0: \beta_1 = \beta_2 = \beta_3 = \dots = 0$
$H_a: \beta_i \neq 0$ for at least one $i$

Assumptions:

$\epsilon_i \sim N(0, \sigma^2)$ for all $i = 1, \dots, n$
where $\epsilon_i$ denote the residuals
and $\epsilon_i$ are all independent and follow a normal distribution
and all $\epsilon_i$ have equal variance $\sigma^2$

The sums of squares and mean sum of squared errors are calculated as follows:

Number of Observations: n
Rank of the Coefficient Matrix: m
Mean of the response: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Solution to the normal equations: $\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = (X^T X)^{-1} X^T y$
where $\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$
Total Sum of Squares: $SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg + SSE$
Regression Sum of Squares: $SSReg = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{\beta}^T X^T y - \frac{1}{n}\left(y^T J y\right)$
where $J$ is a matrix of ones
Error Sum of Squares: $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}^T X^T y$
Residual Standard Error: $SE_{resid} = \sqrt{\frac{SSE}{n - m}}$
Coefficient of determination: $R^2 = \frac{SSReg}{SST}$
Adjusted coefficient of determination: $R^2_{adj} = 1 - \left((1 - R^2)\frac{n - 1}{n - m}\right)$
Test Statistic: $F = \frac{R^2 / (m - 1)}{(1 - R^2) / (n - m)}$

The test statistic follows the F distribution, where p-value $= P(X > F)$ and $X \sim F(m - 1, n - m)$.
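
As a rough illustration of the matrix forms above, the following sketch solves the normal equations in plain Python and derives the sums of squares from $\hat{\beta}^T X^T y$. The helper names `solve` and `multiple_regression`, the dictionary keys, and the data are hypothetical. The sketch assumes an intercept column is included and that the design matrix has full column rank, so that m equals the number of columns.

```python
import math

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    p = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[r][p] / M[r][r] for r in range(p)]

def multiple_regression(columns, y):
    """Full-model regression of y on an intercept plus the given regressor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    m = len(X[0])  # assumed rank: full column rank
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(m)] for a in range(m)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(m)]
    beta = solve(XtX, Xty)                  # (X^T X)^{-1} X^T y
    bXty = sum(bj * v for bj, v in zip(beta, Xty))
    ss_reg = bXty - sum(y) ** 2 / n         # y^T J y equals (sum of y)^2
    sse = sum(v * v for v in y) - bXty      # y^T y - beta^T X^T y
    sst = ss_reg + sse
    r2 = ss_reg / sst
    return {
        "beta": beta, "SSReg": ss_reg, "SSE": sse, "SST": sst,
        "SE_resid": math.sqrt(sse / (n - m)),
        "R2": r2,
        "R2_adj": 1 - (1 - r2) * (n - 1) / (n - m),
        "F": (r2 / (m - 1)) / ((1 - r2) / (n - m)),
    }

# Hypothetical data: a genotype code, an age covariate, and a numeric response
x1 = [0, 1, 2, 0, 1, 2]
x2 = [35, 31, 37, 32, 36, 33]
y = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
result = multiple_regression([x1, x2], y)
print(result["beta"], result["R2"], result["R2_adj"], result["F"])
```

Computing SSE as $y^T y - \hat{\beta}^T X^T y$ follows the formula in the text but subtracts two nearly equal numbers when the fit is very good; a production implementation would more likely sum squared residuals directly or use a QR decomposition.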

Full Versus Reduced Model Regression Equation

In the full versus reduced model regression equation, the regression sums of squares are calculated both for the reduced and for the full model the same way that they are calculated for a regression on just one model. An F test is then performed to find the significance of the full versus the reduced model.

This is a model comparison test: the null hypothesis is that the reduced model is the true model and that the full model is not necessary.

The sums of squares and mean sum of squared errors for the reduced model are calculated as follows:

Number of Observations: n
Rank of the Reduced Model Coefficient Matrix: r
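Mean of the response: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Solution to the normal equations: $\hat{\beta}_R = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = (X^T X)^{-1} X^T y$
where $\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$
Total Sum of Squares: $SST_R = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg_R + SSE_R$
Regression Sum of Squares: $SSReg_R = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{\beta}_R^T X^T y - \frac{1}{n}\left(y^T J y\right)$
where $J$ is a matrix of ones
Error Sum of Squares: $SSE_R = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}_R^T X^T y$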

The sums of squares and mean sum of squared errors for the full model are calculated similarly:

Number of Observations: n
Rank of the Full Model Coefficient Matrix: m
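Mean of the response: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
Mean of the predictors or genotypes: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Solution to the normal equations: $\hat{\beta}_F = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = (X^T X)^{-1} X^T y$
where $\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$
Total Sum of Squares: $SST_F = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SSReg_F + SSE_F$
Regression Sum of Squares: $SSReg_F = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{\beta}_F^T X^T y - \frac{1}{n}\left(y^T J y\right)$
where $J$ is a matrix of ones
Error Sum of Squares: $SSE_F = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = y^T y - \hat{\beta}_F^T X^T y$
Residual Standard Error: $SE_{resid} = \sqrt{\frac{SSE_F}{n - m}}$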

The test statistic is:

$$F^* = \frac{\left(\frac{SSReg_F}{SSE_F} - \frac{SSReg_R}{SSE_R}\right) \times (n - m)}{\left(1 + \frac{SSReg_R}{SSE_R}\right) \times (m - r)}.$$

The p-value is calculated by: p-value $= P(X > F^*)$, where $X \sim F(m - r, n - m)$.
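
A small sketch of the full-versus-reduced statistic follows. Because both models share the same total sum of squares, the expression in the text simplifies algebraically to the familiar form $\frac{(SSE_R - SSE_F)/(m - r)}{SSE_F/(n - m)}$, and the code computes both so they can be compared. The function names and the numeric inputs are hypothetical, and the p-value lookup is omitted.

```python
def full_vs_reduced_f(ss_reg_f, sse_f, ss_reg_r, sse_r, n, m, r):
    """F* as written in the text, from the sums of squares of both models."""
    numerator = (ss_reg_f / sse_f - ss_reg_r / sse_r) * (n - m)
    denominator = (1 + ss_reg_r / sse_r) * (m - r)
    return numerator / denominator

def extra_sum_of_squares_f(sse_f, sse_r, n, m, r):
    """Equivalent form: drop in error sum of squares per added degree of freedom."""
    return ((sse_r - sse_f) / (m - r)) / (sse_f / (n - m))

# Hypothetical sums of squares sharing the same total (SST = 100)
n, m, r = 20, 4, 2
f_text = full_vs_reduced_f(ss_reg_f=70.0, sse_f=30.0, ss_reg_r=40.0, sse_r=60.0, n=n, m=m, r=r)
f_alt = extra_sum_of_squares_f(sse_f=30.0, sse_r=60.0, n=n, m=m, r=r)
print(f_text, f_alt)  # the two forms agree up to floating-point rounding
```

The equivalence only holds when SSReg + SSE is the same for both models, which is the case when both are fit to the same response.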

Regressor Statistics

The coefficient of the jth regressor is calculated with the equation:

$$b_j = \frac{\sum_{i=1}^{n}(x_{i,j} - \bar{x}_j)(y_i - \bar{y})}{\sum_{i=1}^{n}(x_{i,j} - \bar{x}_j)^2}$$
where $n$ is the sample size, $\bar{x}_j$ is the mean of the jth regressor and $\bar{y}$ is the mean of the response.

The Y-intercept of the regression equation is calculated with the equation:

$$b_0 = \bar{y} - \sum_{j=1}^{k} b_j \bar{x}_j$$
where $k$ is the number of regressors, $b_j$ is the coefficient and $\bar{x}_j$ is the mean of the jth regressor.

To compute the standard error for the jth regressor, a full model regression equation is fit with the jth regressor set as the dependent variable and all other regressors as predictors. Let $SSR_j = \sum_{i=1}^{n}(x_{i,j} - \bar{x}_j)^2$ be the regressor sum of squares, $R_j^2$ be the coefficient of determination for the model of the jth regressor versus all other regressors, $MSE$ be the mean squared error for the regression model, and $SSE$ be the error sum of squares. Let the total number of regressors in the model be $k$. Then the standard error of the regressor, $SE_j$, is calculated as follows:

$$SE_j = \sqrt{\frac{MSE}{(1 - R_j^2) \times SSR_j}} = \sqrt{\frac{SSE}{(1 - R_j^2) \times SSR_j \times (n - k - 1)}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \left(b_0 + \sum_{l=1}^{k} b_l x_{i,l}\right)\right)^2}{(1 - R_j^2) \times SSR_j \times (n - k - 1)}}.$$
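
To make the standard error formula concrete, here is a minimal sketch restricted to the case of exactly two regressors (k = 2), where $R_j^2$ is simply the squared correlation between the two regressors. The function name, the returned keys, and the data are hypothetical, and the coefficients are taken from the joint least-squares fit of the two-regressor model.

```python
import math

def standard_errors_two_regressors(x1, x2, y):
    """Standard errors SE_1 and SE_2 for y ~ b0 + b1*x1 + b2*x2 (k = 2 only)."""
    n, k = len(y), 2
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products
    s11 = sum((a - m1) ** 2 for a in x1)   # SSR_1
    s22 = sum((b - m2) ** 2 for b in x2)   # SSR_2
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Joint least-squares coefficients in closed form
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = sum((c - (b0 + b1 * a + b2 * b)) ** 2 for a, b, c in zip(x1, x2, y))
    mse = sse / (n - k - 1)
    # R_j^2 of one regressor on the other is their squared correlation
    r2_j = s12 ** 2 / (s11 * s22)
    se1 = math.sqrt(mse / ((1 - r2_j) * s11))
    se2 = math.sqrt(mse / ((1 - r2_j) * s22))
    return {"coef": (b0, b1, b2), "mse": mse, "se": (se1, se2)}

# Hypothetical data
x1 = [0, 1, 2, 0, 1, 2]
x2 = [35, 31, 37, 32, 36, 33]
y = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
out = standard_errors_two_regressors(x1, x2, y)
print(out["coef"], out["se"])
```

The factor $1 - R_j^2$ shrinks toward zero as the jth regressor becomes more predictable from the others, which is why correlated regressors inflate each other's standard errors.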

The value of the t-statistic for the jth regressor is obtained from the equation:

$$t = \frac{\hat{\beta}_j}{SE_j},$$
where $\hat{\beta}_j$ is the estimated coefficient for the jth regressor.

The p-value of the t-statistic for the jth regressor is the probability of a value as extreme or more extreme than the observed t-statistic from a Student's T distribution with $n - 2$ degrees of freedom.

$P(> |T|) = \text{p-value} = 2 \times P(X > |T|)$, where $X \sim t(n - 2)$
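
The two-sided p-value can be approximated without a statistics library by numerically integrating the Student's t density. The sketch below uses Simpson's rule; the function names and the `steps` parameter are hypothetical, and the degrees of freedom argument would be n − 2 to match the text. For very small p-values the subtraction from 1 loses precision, so this is an illustration rather than a replacement for a library routine.

```python
import math

def t_pdf(x, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """Approximate 2 * P(X > |t|) for X ~ t(df); steps must be even."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    # Simpson's rule on [0, |t|]
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3          # P(0 < X < |t|)
    return 1 - 2 * half_area           # 1 - P(-|t| < X < |t|)

# Hypothetical example: t-statistic of 2.0 from a sample of n = 12 (df = n - 2 = 10)
print(two_sided_p_value(2.0, 10))
```

With one degree of freedom the t distribution is the Cauchy distribution, for which $P(|X| > 1) = 0.5$ exactly; that gives a convenient check on the integration.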

The p-value for the univariate fit is obtained from a Student’s T distribution where the t-statistic is calculated assuming that the jth regressor is the only regressor in the model against the dependent variable.

Categorical Covariates and Interaction Terms

If a covariate is categorical, dummy variables are used to indicate the category of the covariate. A value of “1” for the observation indicates that it is equal to the category the dummy variable represents. Similarly, if the observation is not equal to the category for the dummy variable, then it is assigned the value of “0”. As the values of one dummy variable can be determined by examining all other dummy variables for a covariate, in most cases the last dummy variable is dropped. This avoids using a rank-deficient matrix in the regression equation.

A first-order interaction term is a new covariate created from the product of two covariates, as specified in either the full- or reduced-model covariates. If one of the two covariates is categorical, the dummy variable for each of its categories is multiplied by the other covariate to create the first-order interaction terms. If both covariates are categorical, the dummy variables from both covariates are multiplied by each other.

For example, consider the following covariates for six samples.





Sample     Lab   Dose   Age
sample01   A     Low    35
sample02   A     Med    31
sample03   A     High   37
sample04   B     Low    32
sample05   B     Med    36
sample06   B     High   33




Using dummy variables for the categorical covariates the above table would be:








Sample     Lab=A   Lab=B   Dose=Low   Dose=Med   Dose=High   Age
sample01   1       0       1          0          0           35
sample02   1       0       0          1          0           31
sample03   1       0       0          0          1           37
sample04   0       1       1          0          0           32
sample05   0       1       0          1          0           36
sample06   0       1       0          0          1           33







Interactions Lab*Dose and Lab*Age would be specified as:










Sample     A*Low   A*Med   A*High   B*Low   B*Med   B*High   A*Age   B*Age
sample01   1       0       0        0       0       0        35      0
sample02   0       1       0        0       0       0        31      0
sample03   0       0       1        0       0       0        37      0
sample04   0       0       0        1       0       0        0       32
sample05   0       0       0        0       1       0        0       36
sample06   0       0       0        0       0       1        0       33
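
The dummy-variable and interaction tables above can be reproduced with a short sketch. The helper names `dummies` and `encode` are hypothetical. Like the tables, the sketch keeps every dummy column; dropping the last dummy variable of each covariate to avoid a rank-deficient matrix, as described earlier, is not shown.

```python
samples = [
    ("sample01", "A", "Low", 35),
    ("sample02", "A", "Med", 31),
    ("sample03", "A", "High", 37),
    ("sample04", "B", "Low", 32),
    ("sample05", "B", "Med", 36),
    ("sample06", "B", "High", 33),
]
LABS = ["A", "B"]
DOSES = ["Low", "Med", "High"]

def dummies(value, categories):
    """One 0/1 indicator per category."""
    return [1 if value == c else 0 for c in categories]

def encode(lab, dose, age):
    """Return (main-effect columns, interaction columns) for one sample."""
    lab_d = dummies(lab, LABS)
    dose_d = dummies(dose, DOSES)
    # Lab*Dose: every Lab dummy times every Dose dummy (A*Low, A*Med, ..., B*High)
    lab_x_dose = [l * d for l in lab_d for d in dose_d]
    # Lab*Age: every Lab dummy times the numeric Age covariate (A*Age, B*Age)
    lab_x_age = [l * age for l in lab_d]
    return lab_d + dose_d + [age], lab_x_dose + lab_x_age

for name, lab, dose, age in samples:
    main_effects, interactions = encode(lab, dose, age)
    print(name, main_effects, interactions)
```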









Stepwise Regression

If only a few variables (regressors or covariates) drive the outcome of the response, Stepwise Regression can isolate these variables. The methods for the two types of stepwise regression, forward selection and backward elimination, are described below.

Forward Selection

Starting with either the null model or the reduced model (depending on which type of regression was specified), successive models are created, each one using one more regressor (or covariate) than the previous model.

Each of the unused regressors is added to the current model to create a “trial” model for that regressor. The p-value of the trial model (or full model) versus the current model (or reduced model) is calculated, and the model with the smallest p-value is used as the next model. This method adds the next most significant variable to the current model. If the current model had the smallest p-value, or if no p-value is better than the p-value cut-off specified, then the forward selection method stops and declares the current model as the final model as determined by stepwise forward selection. If the model with all regressors has the smallest p-value then this full model is determined to be the final model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
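
The forward selection loop can be sketched as follows, under several simplifying assumptions that are not part of the description above: the starting point is the null (intercept-only) model, and trial models are ranked by their full-versus-reduced F statistic rather than by p-value. Since every trial model at a given step adds exactly one regressor, all trials share the same degrees of freedom, so the largest F corresponds to the smallest p-value. The stopping rule uses a hypothetical `f_cutoff` as a stand-in for the p-value cut-off. The names `sse` and `forward_selection` and the data are illustrative only.

```python
def sse(columns, y):
    """Error sum of squares for a least-squares fit of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X^T X | X^T y]
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def forward_selection(candidates, y, f_cutoff):
    """Add one regressor at a time; candidates maps names to data columns."""
    n = len(y)
    selected = []
    current_sse = sse([], y)  # null model: intercept only
    while len(selected) < len(candidates) and n - (len(selected) + 2) > 0:
        m = len(selected) + 2  # rank of each trial model: intercept + selected + 1
        best = None
        for name in candidates:
            if name in selected:
                continue
            trial_sse = sse([candidates[s] for s in selected] + [candidates[name]], y)
            f_stat = (current_sse - trial_sse) / (trial_sse / (n - m))
            if best is None or f_stat > best[1]:
                best = (name, f_stat, trial_sse)
        if best[1] < f_cutoff:   # no trial model is significant enough: stop
            break
        selected.append(best[0])
        current_sse = best[2]
    return selected

# Hypothetical data in which the response is driven mainly by x1
candidates = {
    "x1": [0, 1, 2, 0, 1, 2, 1, 0],
    "x2": [5, 3, 6, 2, 7, 1, 4, 8],
    "x3": [1, 0, 1, 0, 1, 0, 1, 0],
}
y = [1.1, 2.8, 5.05, 0.9, 3.15, 4.95, 3.1, 0.95]
print(forward_selection(candidates, y, f_cutoff=50.0))
```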

Backward Elimination

Starting with the full model, successive models are created, each one using one fewer regressor (or covariate) than the previous model.

Each of the regressors currently in the model is removed in turn to create a "trial" model excluding that regressor. The p-value of the current model (or full model) versus the trial model (or reduced model) is calculated, and the trial model with the largest p-value is used as the next model. This method removes the least significant variable from the current model. If every p-value is smaller than the p-value cut-off specified, the backward elimination method stops. The method also stops if all variables have been removed from the model, or if all variables left are included in the original reduced model.

From the standpoint of further analysis, the final model becomes the “full model” for this set of potential regressors.
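
Backward elimination can be sketched in the same spirit, again as an illustration under assumptions rather than a description of the actual implementation. Each trial model drops exactly one regressor, so the smallest full-versus-reduced F statistic marks the least significant variable; a hypothetical `f_cutoff` stands in for the p-value cut-off, and the hypothetical `keep` argument names regressors that belong to the original reduced model and are never removed.

```python
def sse(columns, y):
    """Error sum of squares for a least-squares fit of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X^T X | X^T y]
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def backward_elimination(candidates, y, f_cutoff, keep=()):
    """Remove one regressor at a time; candidates maps names to data columns."""
    n = len(y)
    current = list(candidates)
    while any(name not in keep for name in current):
        m = len(current) + 1  # rank of the current model: intercept + regressors
        current_sse = sse([candidates[s] for s in current], y)
        worst = None
        for name in current:
            if name in keep:
                continue
            trial_sse = sse([candidates[s] for s in current if s != name], y)
            f_stat = (trial_sse - current_sse) / (current_sse / (n - m))
            if worst is None or f_stat < worst[1]:
                worst = (name, f_stat)
        if worst[1] >= f_cutoff:  # every remaining regressor is significant: stop
            break
        current.remove(worst[0])
    return current

# Hypothetical data in which the response is driven mainly by x1
candidates = {
    "x1": [0, 1, 2, 0, 1, 2, 1, 0],
    "x2": [5, 3, 6, 2, 7, 1, 4, 8],
    "x3": [1, 0, 1, 0, 1, 0, 1, 0],
}
y = [1.1, 2.8, 5.05, 0.9, 3.15, 4.95, 3.1, 0.95]
print(backward_elimination(candidates, y, f_cutoff=50.0))
```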