8.6 Regression Analysis
The Regression window allows you to perform linear and logistic regression, stepwise linear and logistic regression, and
permutation tests with numeric variables in a moving window along with numeric or categorical covariates, against one
dependent variable. Regressions may either be performed with all variables and covariates together (“full
model”) or with some of the covariates grouped into a “reduced model” (yielding a full-vs-reduced model
p-value).
For an overview of the theories behind regression analysis in SVS, see Linear Regression and Logistic Regression.
Performing Analysis
To perform regression analysis, open a spreadsheet and select a column for the dependent variable. The dependent variable
must be either quantitative (real-valued or integer-valued) or a binary case/control status column. To open the
Regression window, select the Analysis > Regression Analysis menu item. This feature is currently supported
for spreadsheets with only one column set as dependent. Categorical dependent columns are currently not
supported.
The Regression Analysis window (see Figure 49) allows for various regression options to be set or changed. A list and brief description of these options is as follows:
- Regression Parameters: The first tab of the Regression Analysis window allows for the general regression parameters
to be set. The parameters are:
- Stepwise Regression: If this box is checked then either backward elimination or forward selection of the predictors is used. Otherwise all predictors are used in the regression equation.
- Full Model Regressors: These options allow the selection of how the predictor columns are treated.
- If there are only a few predictors it is recommended the “Regress once on each data column” parameter is used.
- If the regression analysis is on the entire genome, then it is recommended the “Use a moving window of regressors” option is selected.
- If only certain covariates/predictors are desired then using the “Perform regression with selected covariates only” allows the selection of covariates/predictors to be used in the full model.
- Regression Options: Allows the selection of the regression model and whether or not the residual spreadsheet should be output. The two types of regression models are full model only or full versus reduced model.
- Reduced Model Regressors: If the full versus reduced model regression option is selected then covariates can be selected to be in the reduced model. All covariates in the reduced model will be required to be part of the final regression model: forward selection starts from these covariates, and backward elimination cannot remove them from the model.
- Output Parameters: This tab (see Figure 50) allows multiple testing corrections to be set. Data can also be output for creation of P-P or Q-Q plots.
Type of Regression
The type of regression method used is indicated on the top right of the Regression Analysis window. The two types of regression methods are:
- Linear or Logistic Regression: If the stepwise regression option is not selected, then a single linear or logistic regression will take place. For a binary dependent variable, the regression will be logistic; for an integer or real-valued dependent column, the regression will be linear.
- Stepwise Regression: Selecting this option specifies that the linear or logistic regression should be performed using the chosen stepwise procedure, either backward elimination or forward selection. A p-value cut-off must be specified when running stepwise regression. Backward elimination starts with all of the full model covariates and repeatedly removes the least significant covariate until removing any remaining covariate would be significant at the specified stepwise p-value cut-off. “Significant” here means testing the current model as a “full model” and the current model without a regressor as a “reduced model” and finding a full-vs-reduced p-value. Forward selection selects the most significant covariate and keeps adding the next most significant covariate until adding a further covariate is no longer significant. “Significant” here means testing the current model plus a covariate as a “full model” and the current model itself as a “reduced model” and finding a full-vs-reduced p-value.
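The control flow of the two stepwise procedures described above can be sketched in Python. This is only an illustration, not the SVS implementation: the function names, the `fvr_p_value` callback (assumed to return the full-vs-reduced p-value for two lists of regressor names), and the rule that a p-value equal to the cut-off counts as significant are all assumptions made for this example.

```python
from typing import Callable, List

# Hypothetical callback: full-vs-reduced p-value for (full model, reduced model).
PValueFn = Callable[[List[str], List[str]], float]


def forward_selection(candidates: List[str], forced: List[str],
                      cutoff: float, fvr_p_value: PValueFn) -> List[str]:
    """Start from the forced (reduced model) covariates and add regressors."""
    model = list(forced)
    remaining = [c for c in candidates if c not in model]
    while remaining:
        # Test "current model plus c" (full) against "current model" (reduced).
        scored = [(fvr_p_value(model + [c], model), c) for c in remaining]
        best_p, best = min(scored)
        if best_p > cutoff:  # adding a further covariate is no longer significant
            break
        model.append(best)
        remaining.remove(best)
    return model


def backward_elimination(candidates: List[str], forced: List[str],
                         cutoff: float, fvr_p_value: PValueFn) -> List[str]:
    """Start from all covariates and drop the least significant one each step."""
    model = list(forced) + [c for c in candidates if c not in forced]
    while True:
        removable = [c for c in model if c not in forced]  # forced ones stay
        if not removable:
            break
        # Test "current model" (full) against "current model without c" (reduced).
        scored = [(fvr_p_value(model, [m for m in model if m != c]), c)
                  for c in removable]
        worst_p, worst = max(scored)
        if worst_p <= cutoff:  # every remaining covariate is significant
            break
        model.remove(worst)
    return model
```

In practice the callback would fit both models and run the appropriate test (an F-test for linear regression); a lookup table of made-up p-values is enough to exercise the loops.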
Full Model Regressors
The Regression Analysis allows the selection of the full model regressors or covariates. These options are detailed below:
- Regress once on each data column: Uses all numeric columns in the spreadsheet for the analysis except the dependent column and any specified full-model or reduced-model covariates.
- Use a moving window of regressors: There are two options for the moving window – either a moving window of a
fixed number of columns, or, if a marker map is applied, a dynamic moving window size with a fixed number of base
pairs.
- Fixed window size: Specifies that a fixed number of numeric columns should be used for the moving window.
- Dynamic window size in base pairs: Specifies both the genetic distance in base pairs and size of the moving window. It will define which columns are considered to be within the window. The “kb” field defines a maximum genetic distance in kilo-base pairs that the moving window will include, and the “max columns” field, if used, specifies the maximum number of columns within the maximum genetic distance to be included in the window. The window will not cross over chromosome boundaries as defined in the marker map. This option is only available for spreadsheets where a marker map has been applied.
- Perform regression with selected covariates only: Allows you to select the numeric or categorical covariate
columns to include in the full model, or first-order interactions between covariates in the regression.
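As an illustration of the dynamic window option in the list above, the following Python sketch selects the columns that fall inside one window. It assumes that the window begins at a given column and extends forward along the marker map; the function name, argument names, and data layout are hypothetical and are not taken from SVS.

```python
from typing import List, Optional, Tuple


def dynamic_window(markers: List[Tuple[str, int]], start: int,
                   max_kb: float, max_columns: Optional[int] = None) -> List[int]:
    """Return the column indexes in the window that begins at `start`.

    `markers` holds (chromosome, position in base pairs) for each column,
    sorted by chromosome and then by position.
    """
    chromosome, start_position = markers[start]
    window: List[int] = []
    for index in range(start, len(markers)):
        marker_chromosome, position = markers[index]
        if marker_chromosome != chromosome:
            break  # the window never crosses a chromosome boundary
        if position - start_position > max_kb * 1000:
            break  # beyond the maximum distance given in the "kb" field
        if max_columns is not None and len(window) >= max_columns:
            break  # the "max columns" limit has been reached
        window.append(index)
    return window


markers = [("1", 1000), ("1", 3000), ("1", 9000), ("2", 500)]
print(dynamic_window(markers, start=0, max_kb=5))
```

With a 5 kb limit the window starting at the first column holds only the first two columns, because the third marker lies 8,000 base pairs away.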
To include a covariate in the analysis, click on the Add Covariate button. This will open a dialog allowing you to select the covariate(s) to use in the regression equation. Then, select the covariate(s) to include and click Add. If you would like to add all of the covariates in the list, click Add All. The selected covariates will be shown in the “Full model covariates” list. To remove a covariate, select the covariate(s) to remove, and click Remove Selected. This will remove the item from the “Full model covariates” list and from the regression equation. To remove all covariates click Clear List.
To include first-order interactions, click the Add Interaction button. This will open a dialog which displays two lists, each containing all of the non-genetic covariate column names within the spreadsheet. Select the term(s) from each of the two lists which you would like to include and click Add. All selected items from the list on the left will be paired with all the selected items from the list on the right, and an item for each pair will be added to the “Full model covariates” list. If any of the selected items in either window represent categorical columns, then sub-items representing the dummy variables used in regression for each category will be paired with the items or sub-items from the other window. (Values from each pair are multiplied to create a “new” covariate, which is then used in the regression equation.)
When you have added all of the interactions, click Close to return to the regression window. All listed interactions will be included in the analysis, so unwanted interactions must be removed in order to exclude them. To remove an interaction, select the item(s) to remove and click Remove Interaction.
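The pairing of interaction terms described above can be sketched as follows. The example assumes reference-category dummy coding (one 0/1 column per category except the first); the source does not state which coding SVS uses, and the function and column names are hypothetical.

```python
from typing import Dict, List, Sequence


def dummy_columns(name: str, values: Sequence[str]) -> Dict[str, List[float]]:
    """Assumed coding: one 0/1 column per category, skipping the first level."""
    levels = sorted(set(values))
    return {f"{name}={level}": [1.0 if v == level else 0.0 for v in values]
            for level in levels[1:]}


def interactions(left: Dict[str, List[float]],
                 right: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Pair every left column with every right column and multiply the values."""
    return {f"{left_name}*{right_name}": [a * b for a, b in zip(left_col, right_col)]
            for left_name, left_col in left.items()
            for right_name, right_col in right.items()}


age = {"age": [30.0, 40.0, 50.0]}
site = dummy_columns("site", ["A", "B", "C"])
print(interactions(age, site))
```

Each product column is a new covariate, so a numeric covariate paired with a three-level categorical covariate yields two interaction terms under this coding.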
Create Residual Spreadsheet With Covariates
If this option is checked, a residual spreadsheet will be created along with the results view from the regression. This spreadsheet will contain the actual, predicted, and residual values for each sample, as well as the spreadsheet values for the regressors.
Reduced Model Regressors
If the regression option for computing the significance of the full model versus the reduced model is selected then covariates and interaction terms to be included in the reduced model can be specified in the same manner as for full model covariates. See Performing Analysis for more information on how to add covariate and interaction terms to the model.
Full Versus Reduced Model Regression Equation
Sometimes it is desired to “correct for” binary, continuous, or categorical variables, otherwise known as “covariates”.
These covariates, or first-order interactions between covariates, may be influencing the dependent variable
response. Correcting for the covariates allows the user to see specifically what effects there are on the remaining
variables.
To do this, first a linear regression equation, which includes only the dependent and the reduced model covariates, is
calculated (the “reduced model”). Next, a linear regression which includes all of the variables including all full model
covariates is calculated (the “full model”). The significance of the full versus the reduced model is calculated with an
F-test.
See Multiple Linear Regression Model for more information.
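For orientation, the standard textbook form of the full versus reduced F-test is given below. The notation is an assumption made for this note and may differ from the notation in Multiple Linear Regression Model: RSS denotes a residual sum of squares, k the number of regressors excluding the intercept, n the sample size, and the subscripts f and r the full and reduced models.

```latex
F = \frac{\left(\mathrm{RSS}_{r} - \mathrm{RSS}_{f}\right) / \left(k_{f} - k_{r}\right)}
         {\mathrm{RSS}_{f} / \left(n - k_{f} - 1\right)}
```

If the additional regressors have no effect, this statistic follows an F distribution with $k_{f} - k_{r}$ and $n - k_{f} - 1$ degrees of freedom, which is where the full-vs-reduced p-value comes from.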
Note on Missing Values
All missing values will be dropped from the analysis both from the predictor variables and from the dependent variable.
Multiple Testing Corrections
It may be possible to obtain a good test statistic by chance alone. Multiple testing corrections are designed to help ensure, if possible, this is not the case. You may optionally select one or more of the following multiple testing corrections.
Permutation Tests
The permutation testing of the SVS linear and logistic regression models permutes the dependent variable, then runs the
regressions over again, checking the significance of these regressions. This is distinct from checking the “fit” of the permuted
dependent to the original regression results from a given set of regressors. The object is to see whether by chance, a different
set of dependents could have had a better relationship or “fit” with the covariates and regressors. This is tested through
performing a new regression for each permutation.
See Permutation Testing Methodology for a more detailed explanation and examples of permutation testing.
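The idea of permuting the dependent variable and running the regression again can be sketched for the simplest case of one regressor. This is a minimal illustration under stated assumptions, not the SVS procedure: it uses the absolute correlation as the test statistic (for a single-regressor linear regression the F-statistic increases with it), and it reports the plain fraction of permutations that fit at least as well as the original data. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def abs_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation between a regressor and the dependent."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5


def permuted_p_value(x: Sequence[float], y: Sequence[float],
                     n_permutations: int = 1000, seed: int = 0) -> float:
    """Fraction of permutations whose refitted statistic matches or beats the original."""
    rng = random.Random(seed)
    observed = abs_correlation(x, y)
    shuffled: List[float] = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute the dependent variable
        if abs_correlation(x, shuffled) >= observed:  # a new fit per permutation
            hits += 1
    return hits / n_permutations
```

Some conventions add one to both the count and the number of permutations so that the p-value is never exactly zero; which convention SVS applies is not stated here.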
Output and Running the Regression
Click Run to start the regression analysis procedure.
NOTE:
- Sometimes the regression may fail due to insufficient rank in the coefficient matrix. This can be a result of not enough observations or due to the inclusion of “collinear” regressors. A collinear regressor is one which is a linear combination of one or more other regressors.
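A rank check of the kind mentioned in the note above can be sketched with plain Gaussian elimination. This is an illustrative stand-in, not the method SVS uses; the function name and tolerance are assumptions.

```python
from typing import List, Sequence


def matrix_rank(rows: Sequence[Sequence[float]], tol: float = 1e-9) -> int:
    """Rank of a matrix by Gaussian elimination with partial pivoting."""
    m: List[List[float]] = [[float(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == len(m):
            break
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Columns: intercept, x1, x2, x3, where x3 = x1 + x2 is collinear.
design = [[1, 1, 2, 3], [1, 2, 1, 3], [1, 3, 5, 8], [1, 4, 0, 4], [1, 0, 1, 1]]
print(matrix_rank(design), "of", len(design[0]), "columns are independent")
```

A rank lower than the number of columns means at least one regressor should be dropped before the regression is run.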
Residual Spreadsheet
If a residual spreadsheet is produced (see Figure 51), it will contain the actual, predicted and residual values
of the dependent variable for each sample, as well as the remaining spreadsheet values. The residual value
of a sample is defined as the difference between the sample’s actual value and its predicted value from the
regression.
NOTE:
- Strictly speaking, residuals do not make as much sense for logistic regression as they do for linear regression because the distribution of a logistic regression residual separates into two parts. However, this spreadsheet is produced to report any covariate and interaction terms used in the model. The residual spreadsheet in this case can be used as a crude gauge of how well the regression model predicts the observed values of the dependent variable.
Regression Results Spreadsheet
A spreadsheet of regression results for each regression model calculated will be output (see Figure 52). The rows of this
spreadsheet correspond to unique regression models. The row label is the first regressor in a moving window, or
the column used in the case of regressing once on each column. The row label does not reflect any covariates
used.
NOTE:
- Detailed results for any interesting regression models can either be found in the Regression Statistics Results Viewer, if the p-value or $R^2$ value meets the specified criteria, or by running a covariate only regression model, including all regressors and any full or reduced model covariates used in the “interesting” model.
NOTE:
- If Regress once on each data column is selected, certain detailed results for the model and the column being
regressed are shown in the spreadsheet. These include:
- The intercept or $\beta_0$ value and its standard error.
- The slope or $\beta_1$ value and its standard error.
- The F value (linear regression) or chi-square value (logistic regression) for the model. If applicable, this is shown for both the full model and the reduced model.
- The sample size.
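For a single numeric column regressed against a quantitative dependent, the quantities listed above can be computed with the usual least-squares formulas. The sketch below is a plain-Python illustration with hypothetical names; it is not SVS code and it ignores covariates.

```python
from typing import Dict, Sequence


def simple_linear_regression(x: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Intercept, slope, their standard errors, F value, and sample size."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    residual_variance = rss / (n - 2)  # two estimated coefficients
    return {
        "intercept": intercept,
        "intercept_se": (residual_variance * (1.0 / n + mean_x ** 2 / sxx)) ** 0.5,
        "slope": slope,
        "slope_se": (residual_variance / sxx) ** 0.5,
        # Regression sum of squares over residual variance, on 1 and n - 2 df.
        "f_value": slope * sxy / residual_variance,
        "n": n,
    }


result = simple_linear_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(result["slope"], result["intercept"])
```

With one regressor, the F value equals the square of the slope divided by its standard error.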
Regression Statistics Results Viewer
A Regression Statistics Results Viewer (see Figure 53) will be displayed for a single regression or all regressions that meet
the criteria specified on the Output Parameters tab of the Regression Analysis window.
Linear Regression Model Statistics
Full Model Only Regression
If only a full model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:
- Name of the response variable.
- Unsigned multiple correlation coefficient $R$, where $R = \sqrt{R^2}$.
- Coefficient of determination $R^2$.
- Adjusted $R^2$. This statistic is meant to compensate for many regressors, each explaining small portions of the variation by chance alone.
- Sample size.
- Residual standard error $SE_{resid}$.
- Unbiased standard deviation of the response.
- Value of the F-statistic.
- P-value of the F-statistic for the regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom.
- Residual degrees of freedom.
- Total degrees of freedom.
- Y-intercept.
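Several of the statistics in the list above follow directly from the actual and predicted values, as in a residual spreadsheet. The sketch below uses the standard definitions for a least-squares fit with an intercept and at least one regressor; the function and key names are hypothetical, and SVS may differ in detail.

```python
from typing import Dict, Sequence


def linear_model_statistics(actual: Sequence[float], predicted: Sequence[float],
                            n_regressors: int) -> Dict[str, float]:
    """Standard summary statistics for a fitted linear model with an intercept."""
    n = len(actual)
    mean = sum(actual) / n
    ss_total = sum((a - mean) ** 2 for a in actual)
    ss_resid = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    df_regression = n_regressors  # assumed to be at least 1
    df_residual = n - n_regressors - 1
    df_total = n - 1
    r_squared = 1.0 - ss_resid / ss_total
    return {
        "R": max(r_squared, 0.0) ** 0.5,
        "R2": r_squared,
        # Penalizes regressors that explain little beyond chance.
        "adjusted_R2": 1.0 - (1.0 - r_squared) * df_total / df_residual,
        "SE_resid": (ss_resid / df_residual) ** 0.5,
        "response_sd": (ss_total / df_total) ** 0.5,
        "F": ((ss_total - ss_resid) / df_regression) / (ss_resid / df_residual),
        "df_regression": df_regression,
        "df_residual": df_residual,
        "df_total": df_total,
    }


stats = linear_model_statistics([1, 2, 3, 4, 5], [1.2, 1.9, 3.0, 4.1, 4.8], 1)
print(stats["R2"], stats["adjusted_R2"])
```

The adjusted value is always at most the unadjusted one, and the gap widens as regressors are added without a matching drop in the residual sum of squares.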
Full Versus Reduced Model Regression
If a full versus reduced model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:
- Name of the response variable.
- Coefficient of determination $R^2$ for the full model.
- Coefficient of determination $R^2$ for the reduced model.
- Adjusted $R^2$ for the full model. This statistic is meant to compensate for many regressors, each explaining small portions of the variation by chance alone.
- Sample size.
- Residual standard error $SE_{resid}$.
- Unbiased standard deviation of the response.
- Value of the F-statistic for the full model.
- Value of the F-statistic for the full versus reduced model.
- P-value of the F-statistic for the full regression model.
- P-value of the F-statistic for the full versus reduced regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom of the full model.
- Regression degrees of freedom of the reduced model.
- Residual degrees of freedom of the full model.
- Total degrees of freedom of the full model.
- Y-intercept of the full model.
- Y-intercept of the reduced model.
Logistic Regression Model Statistics
Full Model Only Regression
If only a full model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:
- Name of the response variable.
- Regression likelihood $L_1$.
- Null model likelihood $L_0$.
- Sample size.
- Value of the Chi-Squared ($\chi^2$) statistic.
- P-value of the Chi-Squared statistic for the regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom.
- Residual degrees of freedom.
- Total degrees of freedom.
- $\beta_0$.
- $\beta_0$ standard error.
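The two likelihoods and the chi-squared statistic in the list above are related in the usual likelihood-ratio test. The sketch below assumes that SVS reports the standard likelihood-ratio statistic, which this section does not state explicitly; see Logistic Regression for the definition actually used. The function names are hypothetical, and the p-value helper covers only the one-degree-of-freedom case, where a closed form exists in the Python standard library.

```python
import math


def likelihood_ratio_chi2(log_l1: float, log_l0: float) -> float:
    """Assumed statistic: 2 * ln(L1 / L0), computed from log-likelihoods."""
    return 2.0 * (log_l1 - log_l0)


def chi2_p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


chi2 = likelihood_ratio_chi2(log_l1=-60.0, log_l0=-65.0)
print(chi2, chi2_p_value_1df(chi2))
```

For models with more than one regressor, the degrees of freedom equal the number of regressors tested and a general chi-squared distribution function is needed instead.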
Full Versus Reduced Model Regression
If a full versus reduced model was used for the regression equation, the following model statistics are displayed for both normal and stepwise regression:
- Name of the response variable.
- Full model likelihood $L_1$.
- Reduced model likelihood $L_0$.
- Chi-squared ($\chi^2$) statistic of the full model.
- Chi-squared statistic of the full versus reduced model.
- P-value of the Chi-Squared statistic for the full regression model.
- P-value of the Chi-Squared statistic for the full versus reduced regression model.
- Single-value permuted p-value, if single-value permutation testing was selected.
- Full-scan permuted p-value, if full-scan permutation testing was selected.
- Number of permutations, if permutation testing was selected.
- Regression degrees of freedom of the full model.
- Regression degrees of freedom of the reduced model.
- Residual degrees of freedom of the full model.
- Total degrees of freedom of the full model.
- $\beta_0$ for the full model.
- Standard error for $\beta_0$ for the full model.
- $\beta_0$ for the reduced model.
Linear Model Regressor Statistics
For all types of linear regressions, the Y-intercept for the full model is displayed, and for full versus reduced linear
regression models, the Y-intercepts for both the full and reduced models are displayed.
The following statistics are displayed for each regressor:
- Name
- Coefficient
- Standard error
- T-statistic for adding this regressor
- P-value for adding this regressor
- Univariate fit p-value
Logistic Model Regressor Statistics
For all types of logistic regressions, the Y-intercept for the full model is displayed, and for full versus reduced logistic
regression models, the Y-intercepts for both the full and reduced models are displayed.
The following statistics are displayed for each regressor:
- Name
- Coefficient
- Standard error
- P-value for adding this regressor
- Odds ratio
- Univariate fit p-value
The regression odds ratio for the coefficient $\beta_i$ is $e^{\beta_i}$. This odds ratio is interpreted as the ratio of the odds of the
dependent being one (“true”) if the given regressor were increased by one unit to the odds of the dependent being one
(“true”) when the given regressor has its current value.
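The interpretation above can be checked numerically. The sketch below uses a one-regressor logistic model with made-up coefficients (the values and function names are illustrative assumptions) and confirms that raising the regressor by one unit multiplies the odds by the exponential of its coefficient.

```python
import math


def probability(intercept: float, coefficient: float, x: float) -> float:
    """Logistic model: probability that the dependent is one."""
    return 1.0 / (1.0 + math.exp(-(intercept + coefficient * x)))


def odds(p: float) -> float:
    """Odds corresponding to a probability."""
    return p / (1.0 - p)


def odds_ratio(coefficient: float) -> float:
    """Odds ratio for a one-unit increase in the regressor."""
    return math.exp(coefficient)


b0, b1, x = -1.0, 0.7, 2.0  # made-up coefficients and regressor value
ratio = odds(probability(b0, b1, x + 1.0)) / odds(probability(b0, b1, x))
print(ratio, odds_ratio(b1))  # the two values agree up to rounding
```

The ratio does not depend on the starting value of the regressor, which is why a single odds ratio per coefficient can be reported.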
Left Out Regressors
This list will include all regressors excluded from the final model of a stepwise regression model.
Moving Window Regressors
This will list the regressors used for the last moving window.
Caveats for Logistic Regression
Under some circumstances, the iteration procedure for the logistic regression algorithm will be unstable and the
regression may fail, even when the coefficient matrix has sufficient rank and significant regressors are included.
Such a circumstance can be when the regression algorithm tries to emulate a step function, or otherwise tries
to accommodate independent values for which the dependent variable is either exclusively 1 or exclusively
0.
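One simple screen for the situation described above is to test whether a single regressor splits the samples cleanly into cases and controls. This is a hypothetical helper for illustration only, not an SVS feature, and it covers just the one-regressor, single-threshold case.

```python
from typing import Sequence


def separates_perfectly(x: Sequence[float], y: Sequence[int]) -> bool:
    """True if one threshold on x divides the samples into all 0s and all 1s."""
    zeros = [value for value, status in zip(x, y) if status == 0]
    ones = [value for value, status in zip(x, y) if status == 1]
    if not zeros or not ones:
        return True  # degenerate case: the dependent takes only one value
    return max(zeros) < min(ones) or max(ones) < min(zeros)


print(separates_perfectly([1, 2, 3, 10, 11], [0, 0, 0, 1, 1]))
print(separates_perfectly([1, 2, 3, 4], [0, 1, 0, 1]))
```

When such a regressor is present the fitted curve approaches a step function and the coefficient grows without bound, which is the kind of instability described above.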
If a stepwise regression model approach is used, similar circumstances resulting in instability may cause “paradoxical” phenomena such as:
- The final regression used to get the model statistics failing, even though it is “the same as” the last model tried in the stepwise regression algorithm. Actually, it is possible that a different order will be used for the regressors in the final model compared to the last model tried for stepwise regression. If the problem is highly unstable, the different order may be enough to cause failure.
- For some regressors, the p-value Pr(Chi) associated with dropping the regressor from the regression equaling 1 (Pr(Chi) = 1). This happens where the regression fails after removing the regressor. This is only possible for a regressor other than the last one added to the model.
The best workaround is to filter out the data causing such instabilities. If one covariate of a regression has
a coefficient above 15 or 20 or below -15 or -20 and the regressors from a stepwise regression won’t regress
directly, or if a certain covariate does not regress by itself, the data should be filtered. Consider making a row
subset spreadsheet based on ranges of values of the covariates and performing the desired regression model on
each. Alternatively, consider stepwise regression if not already applied to the model. If stepwise regression is
failing, changing the method from forward selection to backwards elimination or vice versa could result in a
solution.