PBAT Genotype Analysis

9.4 PBAT Genotype Analysis

Summary

The tools implemented in PBAT for the analysis of quantitative and dichotomous traits are discussed in a series of papers by Lange and Laird [Lange 2002a, Lange 2002b, Lange 2002c]. They support a variety of analyses:

  • Computation of a large variety of FBAT-statistics and their power for nuclear families and for extended pedigrees.
  • Multivariate FBATs for multiple phenotypes: FBAT-GEE, and FBAT-PC. FBAT-GEE is based on the generalized estimating equation approach. FBAT-PC is based on principal components that maximize the heritability.
  • FBATs for time to onset data/survival data (logrank-FBAT and Wilcoxon-FBAT, FBAT-EXP).
  • Permutation tests for certain FBAT statistics.
  • Transformation tools for continuous phenotypes that are not normally distributed.
  • Conditional power calculations for all implemented FBATs.
  • Construction of the most powerful FBAT-statistic.
  • Including predictor variables in the FBAT.
  • Including gene-environment/drug interactions in the FBAT statistic.
  • Various estimation routines to estimate the genetic effect size.
  • Screening methods to select the most “promising” combinations of markers and phenotypes without biasing the significance level of the FBAT statistic.

The default settings can be changed and saved by clicking Save Options at the bottom of the PBAT Genotype Analysis dialog window.

To restore the defaults, select Restore Defaults.

To access this section of the manual from the analysis dialog, select Help.

Using PBAT Genotype Analysis

Getting Started

The first step is to open an existing project or create a new project where you want to perform the data analysis and save the results. See Getting Started for more information on creating a new project or opening an existing one.

Once you have opened or created a project, you must import your pedigree and/or phenotype data into SVS. See Importing PBAT Family-Based Data for information on how to import pedigree and phenotype files. A properly imported pedigree file will have the six required pedigree columns at the front of the spreadsheet and the column name headers will have a blue background. See Special Features of a Pedigree Spreadsheet for more information about pedigree spreadsheets.

NOTES:

  1. When creating your pedigree, remember to list the parents, even if their genotype information is not known. This ensures that siblings are grouped together properly into families.
  2. If unrelated families are listed together using the same family ID, the results will be unpredictable.
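
The first note can be checked before import with a short script. The following is a minimal sketch, not part of SVS or PBAT: it assumes a simplified, hypothetical row layout of (family ID, individual ID, father ID, mother ID) with "0" marking an unknown parent, and it reports every parent ID that is referenced but has no row of its own in the same family.

```python
from collections import defaultdict

# Hypothetical rows: (family_id, individual_id, father_id, mother_id).
# "0" is assumed to mark an unknown parent; adjust to your own file's convention.
pedigree_rows = [
    ("F1", "1", "0", "0"),  # father, listed even if he has no genotypes
    ("F1", "2", "0", "0"),  # mother
    ("F1", "3", "1", "2"),
    ("F1", "4", "1", "2"),
    ("F2", "3", "1", "2"),  # parents 1 and 2 are not listed in family F2
]

def missing_parents(rows, unknown="0"):
    """Return (family, child, parent) triples whose parent has no row of its own."""
    members = defaultdict(set)
    for fam, ind, _, _ in rows:
        members[fam].add(ind)
    problems = []
    for fam, ind, father, mother in rows:
        for parent in (father, mother):
            if parent != unknown and parent not in members[fam]:
                problems.append((fam, ind, parent))
    return problems

print(missing_parents(pedigree_rows))
```

An empty result means every referenced parent is listed. The sketch does not detect unrelated families that share one family ID.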

If there is additional phenotype information to be used for the PBAT analysis, join the pedigree and phenotype spreadsheets together, keeping unmatched rows. See Figure 66. The resulting spreadsheet will keep the pedigree columns at the front of the spreadsheet, followed by the phenotype columns and then the genotypes. See Figure 67.

NOTE:

  • You don’t have to have additional phenotype columns to perform a PBAT analysis, but if you do, you need to follow the above steps to join the phenotype dataset to the pedigree dataset.

[Picture]

Figure 66: Join or Merge dialog to Join a Pedigree spreadsheet to a Phenotype spreadsheet


[Picture]

Figure 67: Pedigree spreadsheet joined to a Phenotype Spreadsheet

PBAT Genotype Analysis can be performed by opening a pedigree spreadsheet, activating the markers to be analyzed, and by selecting Analysis > PBAT Genotype Analysis. A parameter selection dialog will open.

NOTE:

  • If you have many markers in your pedigree spreadsheet, it may be easiest to use Select > Column > Inactivate All Columns, to inactivate all columns. Then activate any phenotype columns as well as the columns for those markers you wish to analyze before opening the PBAT Genotype Analysis dialog.

The parameters for PBAT Genotype Analysis include phenotype (and other variable) selections, the type of analysis, type of screening, and parameters for phenotypes, haplotypes, test statistic and computational algorithm. In the parameter selection dialog, the parameters are organized into four tabs, which are:

Select Phenotypes

[Picture]

Figure 68: PBAT Genotype Analysis dialog – Select Phenotypes tab


[Picture]

Figure 69: PBAT Genotype Analysis dialog with extra Phenotypes – Select Phenotypes tab

The Select Phenotypes tab of the dialog allows you to select the phenotypes to test. See Figure 68 if this dialog was opened from a spreadsheet that does not contain additional phenotype columns. Figure 69 illustrates what the tab of this dialog looks like if there are additional phenotype columns joined to the pedigree spreadsheet.

Phenotypes

In this list, select the phenotype or phenotypes to be analyzed for association with the selected markers or with haplotypes from the selected markers. Multi-select operations are valid in these dialog boxes. These are: <Ctrl>-left-click selects multiple phenotypes one at a time, and <Shift>-left-click selects all phenotypes between the first and last selected phenotypes.

Phenotypes as predictor variables (covariates)

It may be possible that the selected phenotypes are not only associated with certain markers or haplotypes, but also are predicted by other phenotype variables (covariates for the test statistic). Select these other variables in this box to better determine the actual genetic effect after adjusting for the selected predictor variables.

When important covariates for the selected phenotypes are known, adding them to the conditional mean model [Lange 2002b, Lange 2002c] and also using them for the offset computation can increase the power of the FBAT statistic substantially.

Double-click on an item in this list to select or deselect it. An option dialog will appear. To select the variable, select the top radio button and enter the maximum power/order of the predictor variable. This determines the covariates that are added to the conditional mean model and to the offset value. For instance, entering “3” will add Xj, Xj^2, and Xj^3, where Xj is the selected predictor variable, to the model. To remove all orders of this predictor variable from the model, select the bottom radio button.
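
To illustrate what a maximum order means, the sketch below expands one predictor into its powers 1 through the chosen order. The function name and data are hypothetical. It only shows which terms are generated, not how PBAT fits them.

```python
def polynomial_terms(values, max_order):
    """Return one column per power 1..max_order of a predictor variable."""
    return [[v ** k for v in values] for k in range(1, max_order + 1)]

x = [1.0, 2.0, 3.0]
columns = polynomial_terms(x, 3)
# columns[0] holds X, columns[1] holds X^2, columns[2] holds X^3
for power, column in enumerate(columns, start=1):
    print(power, column)
```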

Phenotypes as interaction variables

To account for interactions of one or more phenotypic variables with the marker or haplotype being tested (“gene/covariate interactions”), select the interaction variables in this box.

Double-click on an item in this list to select it or deselect it. An offset selection dialog will appear. There are three options in this dialog; select the appropriate one:

  • Offset = mean: To use the mean of the selected variable as the offset, select this option.
  • Specify offset: Use this option to specify an offset for the selected variable, and enter the offset value in the Offset value box.
  • Deselect this interaction variable: To remove the selected variable as an interaction variable select this option.

NOTE:

  • It is recommended that you use a particular offset choice here only when its effects need to be examined. In a standard data analysis, it is preferable to use “mean” here and allow all offsets to be computed by using one of the estimating procedures specified in the Offset drop-down menu on the next tab.

Subgroups

PBAT analyses may be divided into subgroups of patients (a stratified analysis). The outputs for the separate analyses of the subgroups will be provided on the same output spreadsheet, separated and categorized by subgroup.

To divide your patients into subgroups, click the box labeled Use a variable to define subgroups, and select one of the phenotype variables listed (this will be the grouping variable). Only binary, integer, and categorical variables can be used as grouping variables.

Select subgroup categories

Once the subgroup option is selected, this box becomes available and all subgroups for the selected variable are listed. Select the category or categories from the grouping variable to calculate the PBAT statistics on. Multi-select operations are available in this list box.

Censoring Variables for Time-to-Onset Analysis

To do time-to-onset analysis:

  • Select the time-to-onset variable as the phenotype variable in the upper left-hand box.
  • Use the lower right-hand box to select a censoring variable. A censoring variable denotes whether the disease or condition has occurred at all during the study. It should be set to:
    • not censored, if the condition occurred (affected), and
    • censored, if the condition did not occur (unaffected).
  • Select other parameters (phenotype, haplotype, and computational) as necessary. FBAT-LOGRANK will have been automatically selected as the test statistic when the option to use a censoring variable is selected.

Phenotype and Haplotype Parameters

[Picture]

Figure 70: PBAT Genotype Analysis dialog – Phenotype and Haplotype Parameters tab

The next tab in the PBAT Genotype Analysis dialog is the Phenotype and Haplotype Parameters tab (see Figure 70).

Maximum and Minimum Number of Phenotypes per Group

  • FBAT-GEE statistic:
    If more than one phenotype is selected, the test can be performed against all of the phenotypes as one group, just one phenotype at a time, or any number of phenotypes combined together. Testing against more than one phenotype at a time will result in a multivariate test. To select the number of phenotypes to “group together” when testing, set the minimum and maximum number in the Min number of phenotypes per group and Max number of phenotypes per group.
  • FBAT-PC statistic:
    The FBAT-PC statistic may be used to find the relative weights of many phenotypes within a PBAT principal component analysis. Set both Max number of phenotypes per group and Min number of phenotypes per group to the number of phenotypes selected. FBAT-PC tests against every phenotype individually as a part of its analysis. Select the non-compact output format (Test Statistic and Computational) to see the weight of each phenotype within the principal component.

Offset Choice

The phenotype offset may be specified in this menu and, when applicable, the following text box.

The final trait used in FBAT calculations is the original phenotype value minus the offset.

The offset accomplishes two purposes:

  1. Increases the power of the FBAT statistic by offsetting the mean of the original phenotype from the trait.
  2. Incorporates covariates and interaction variables into the FBAT statistic.

The offset choices in this menu are:

  • No offset: No offset is used; only the original phenotype value is used. Neither covariates nor interaction variables are incorporated into the FBAT statistic. (Useful for affected-only analyses.)
  • Optimal power: Use the offset that maximizes the power of the FBAT-statistic (computationally slow, efficiency dependent on the correct choice of the mode of inheritance).
  • Phenotypic residuals (including E(X|H0)): Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes the expected marker score (E(X|H0)) as well as all covariates and interaction variables. (This differs from standard phenotypic residuals only in the inclusion of the expected marker score.)
  • Standard phenotypic residuals: Offset is based on standard phenotypic residuals obtained by GEE-estimation which includes all covariates and interaction variables.

    In other words, the offset will be equal to the difference between the actual observed phenotype and a predicted phenotype. This predicted phenotype comes from a regression model that regresses the observed phenotype on all of the covariates in the dataset. If there are no covariates or interaction variables selected, this will constitute subtracting the mean phenotype value (for a continuous phenotype), or the sample prevalence (for a dichotomous phenotype).

  • Specify here: (User-specified offset.) Enter the offset to use in the text box to the right of this menu. (Useful for unaffected-only studies, where an offset of 1 is used, or when the effects of a particular offset need to be examined.)
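
The arithmetic behind the residual-based choices can be sketched as follows. This is an illustration only: ordinary least squares with at most one covariate stands in for PBAT's GEE estimation, and the function name and data are hypothetical. With no covariate the prediction is simply the phenotype mean, so the resulting trait is the mean-centred phenotype.

```python
from statistics import mean

def residual_trait(phenotype, covariate=None):
    """Phenotype minus a least-squares prediction (the mean if no covariate is given)."""
    y_bar = mean(phenotype)
    if covariate is None:
        return [y - y_bar for y in phenotype]
    x_bar = mean(covariate)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, phenotype))
    slope = sxy / sxx
    # Predicted phenotype for each individual, then the residual.
    return [y - (y_bar + slope * (x - x_bar)) for x, y in zip(covariate, phenotype)]

y = [2.0, 4.0, 6.0, 8.0]
age = [1.0, 2.0, 3.0, 4.0]
print(residual_trait(y))        # no covariate: phenotype minus its mean
print(residual_trait(y, age))   # phenotype fully explained by age: residuals near 0
```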

Normally, it is recommended to use Standard phenotypic residuals, except in the case of affected-only studies, where it is normally recommended to use No offset.

Other possibilities include:

  • Unaffected-only studies (use an offset of 1).
  • Other studies using binary traits (use the disease prevalence).
  • Total population samples and ascertained samples where the quantitative trait is not highly correlated with the ascertainment criteria (the offset should approximate the phenotypic mean – use Standard phenotypic residuals).
  • Ascertained samples where the quantitative trait is highly correlated with the ascertainment criteria (dichotomize and set the offset to 0–No offset).

Compute All Predictor Sub-Models

Check the Compute all predictor sub-models box to test the covariates (predictors) in all possible combinations, with each covariate either used or not used, in separate tests.

Uncheck this box to use all of the covariates combined together in one test.
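
The number of tests this option generates grows quickly with the number of covariates. As a rough sketch (covariate names are hypothetical, and whether PBAT also runs the model with no covariates at all is an assumption here), the sub-models can be enumerated like this:

```python
from itertools import combinations

def predictor_submodels(covariates):
    """Every subset of the covariates, from the empty model to the full model."""
    return [subset
            for size in range(len(covariates) + 1)
            for subset in combinations(covariates, size)]

models = predictor_submodels(["age", "sex", "bmi"])
print(len(models))  # 2**3 subsets
for model in models:
    print(model)
```

With k covariates there are 2^k subsets, so the run time and the size of the output spreadsheet can double with each added covariate.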

Transformations

The phenotypes can be used as is without a transformation, or the selected phenotypes can be transformed to ranks or Z-scores (normal scores). There is a similar choice for the selected predictor variables and also for the selected interaction variables. In practice, it is recommended to transform the data to normal scores, since the asymptotic convergence of the FBAT-statistic is then more robust against outliers and skewed data [Lange 2002a].
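
The effect of the two transformations can be seen in a small sketch. The exact conventions PBAT uses are not stated here, so the choices below are assumptions: tied values share their average rank, and a rank r out of n is mapped to the standard normal quantile at r / (n + 1).

```python
from statistics import NormalDist

def to_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def to_normal_scores(values):
    """Map ranks to standard normal quantiles using r / (n + 1)."""
    n = len(values)
    return [NormalDist().inv_cdf(r / (n + 1)) for r in to_ranks(values)]

skewed = [1.0, 2.0, 3.0, 100.0, 2.0]
print(to_ranks(skewed))
print(to_normal_scores(skewed))
```

The outlier 100.0 ends up only one rank above 3.0, which is why rank-based scores are less sensitive to extreme values than the raw phenotype.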

MFBAT (Multi-Marker/Multi-Phenotype) Test Parameters

For the most common SNP and haplotype tests, multiple markers and/or multiple phenotypes may be subjected to both FBAT-GEE multivariate tests and FBAT-PC tests after the original analysis has finished. These tests are collectively referred to as “MFBAT” tests.

MFBAT testing may be done along with genotypic testing, with rapid analysis, or with haplotype testing performed using sub-haplotypes with adjacent markers. All combinations of M phenotypes and all combinations of N markers will be tested, where M and N take on all values within the bounds specified for the number of phenotypes and the number of markers, respectively.
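
To estimate how many MFBAT tests a given set of bounds will request, the groupings can be enumerated. This is a counting sketch only, with hypothetical phenotype and marker names. It does not account for the sub-haplotype case described in the note that follows.

```python
from itertools import combinations

def mfbat_groupings(phenotypes, markers, pheno_bounds, marker_bounds):
    """All (phenotype group, marker group) pairs within the inclusive size bounds."""
    pheno_groups = [g for m in range(pheno_bounds[0], pheno_bounds[1] + 1)
                    for g in combinations(phenotypes, m)]
    marker_groups = [g for n in range(marker_bounds[0], marker_bounds[1] + 1)
                     for g in combinations(markers, n)]
    return [(p, k) for p in pheno_groups for k in marker_groups]

tests = mfbat_groupings(["fev1", "fvc"], ["rs1", "rs2", "rs3"], (1, 2), (1, 2))
print(len(tests))  # 3 phenotype groups x 6 marker groups
```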

NOTE:

  • When performing MFBAT testing using sub-haplotypes of length greater than one, the “marker” referred to in the MFBAT test will mean the first marker of the sub-haplotype being tested.

MFBAT output will be shown in either of two modes:

  • Multiple Marker Testing: If you specify more than one marker as the maximum marker grouping for MFBAT output, the outputs will follow after the outputs for all of the individual markers.
  • Testing with Multiple Phenotypes Only: If you specify only one marker as the maximum marker grouping for MFBAT output, the MFBAT output for any marker or haplotype will follow after the output for that marker’s test.

Check Perform MFBAT test to perform these tests. Fill in the maximum and minimum numbers of phenotypes and of markers to be tested at a time.

The outputs will be identified by marker names separated by plus signs or a single marker name with a plus sign after it. The phrase “FBAT-GEE^2” or “FBAT-PC^2” will be used in place of an allele number or haplotype designation to identify the test.

A p-value will appear in the normal p-value column (either “p-value(FBAT)” or “FBAT-Wilcoxon”) for the “FBAT-GEE^2” test, and two power-related values will appear for the “FBAT-PC^2” test, appearing in the normal p-value column and the next column to the right of the p-value column.

NOTES:

  1. MFBAT testing is valid for either the FBAT-GEE or FBAT-LOGRANK test statistic.
  2. To perform MFBAT testing, no interactions may be specified, no grouping of phenotypes is allowed, and Compute all predictor sub-models must be unchecked.
  3. MFBAT testing with combinations of multiple markers may only be performed when 20 or fewer markers are active in the pedigree spreadsheet.
  4. For MFBAT FBAT-LOGRANK (time-to-onset) testing using only multiple phenotypes, an FBAT-GEE test is made for the censor variable under the marker or haplotype being tested. This test is output after the original test and before the MFBAT test. Its output fields are the same as for an FBAT-GEE test except for the “1” used as the phenotype field indicator.

Check Use simplified variance structure to simplify (average out rows in) the variance/covariance matrix used in the FBAT-PC calculations, thereby improving performance for larger groups of phenotypes and markers.

Alternative Rapid Pedigree Algorithm

Check Use alternative rapid pedigree algorithm to use a new algorithm for processing extended pedigrees. This is currently the default pedigree algorithm. Uncheck this box to use the standard pedigree algorithm.

This new algorithm combines the advantages of the following two strategies:

  • Breaking up the extended pedigrees into trios before analysis.
  • Analyzing the extended pedigrees directly.

Breaking up the extended pedigrees into trios, which is a computationally fast strategy, does not take full advantage of the structure of the known extended pedigree. On the other hand, analyzing extended pedigrees as such, which takes full advantage of all the information and is the most powerful option, can be computationally slow when many of the genotypes in a pedigree are missing.

The standard extended pedigree algorithm is particularly slow when families in an extended pedigree for which all genotypes are known are linked only by two or more family members for whom genotypic information is not available. Another such situation is an extended pedigree with “isolated genotypes”, that is, sparse genotypic information spread across the entire pedigree. In either situation, the power gain is minimal and sometimes even jeopardized by the possibility that the linking family member or members have to be removed when the maximum number of founders is reached in PBAT.

The new rapid pedigree algorithm in PBAT identifies clusters of trios within extended pedigrees that share the same parents, and analyzes such clusters as extended pedigrees. At the same time, clusters of trios that do not share the same parents are broken up into separate extended-pedigree clusters. All resulting clusters are analyzed in the same way that extended pedigrees would be under the standard algorithm, but independently of each other.

The extra information provided to the computation of the genetic distribution under the original algorithm by linking together the extended pedigree clusters is minimal, while the effort required for taking advantage of this information is disproportionately enormous. This puts the standard algorithm at a severe disadvantage.

Under the new hybrid approach, however, such links between family clusters within extended pedigrees are dropped. The increased computational speed of a pure nuclear-family analysis is, therefore, achieved while almost completely keeping the statistical power of the standard extended pedigree algorithm.
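
The clustering idea can be pictured with a much-simplified sketch. It is not PBAT's algorithm: it only groups non-founders by their parent pair, using the same hypothetical (family ID, individual ID, father ID, mother ID) layout with "0" for an unknown parent, to show how a three-generation pedigree separates into independent sibship clusters once the links between them are dropped.

```python
from collections import defaultdict

def trio_clusters(rows, unknown="0"):
    """Group non-founders by their (father, mother) pair within each family."""
    clusters = defaultdict(list)
    for fam, ind, father, mother in rows:
        if father != unknown and mother != unknown:
            clusters[(fam, father, mother)].append(ind)
    return dict(clusters)

# Hypothetical three-generation pedigree: 1 and 2 are grandparents,
# 3 and 4 their children, 5 marries in, 6 and 7 are children of 3 and 5.
extended = [
    ("F1", "1", "0", "0"), ("F1", "2", "0", "0"),
    ("F1", "3", "1", "2"), ("F1", "4", "1", "2"),
    ("F1", "5", "0", "0"),
    ("F1", "6", "3", "5"), ("F1", "7", "3", "5"),
]
print(trio_clusters(extended))
```

Individual 3 appears as a child in one cluster and as a parent in the other. In this sketch that link is simply ignored, which mirrors the dropped links described above.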

Perform Rapid Analysis

To perform a rapid scan of markers using only one test per marker, check the Perform rapid analysis box. The minor allele for the marker being tested will be used as the “haplotype” for a haplotype test using any or all of the four genetic models. This is repeated over all selected markers.

Because this rapid analysis approach focuses on just the minor allele, it will yield results roughly twice as fast as the standard genotypic approach. For certain extended pedigrees having many siblings, the speed-up can be more than twofold, due to the differing algorithms which these approaches use to infer expected marker scores.

Permutation Testing

Permutation testing may be selected for either the Rapid Analysis or for other modes of haplotype analysis. Check the Use permutation testing to obtain p-values box, and enter the Number of permutations to use in the text box.

Haplotype Analysis

To perform haplotype analysis, check the Perform Analysis for Haplotypes box. The haplotype-related choices delineated in the following paragraphs will then become active.

NOTE:

  • If any enabled marker is multi-allelic, haplotype testing will select only those two alleles that are most prevalent, and treat the marker as if it is bi-allelic with these two alleles.
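
To preview which two alleles would be retained for a multi-allelic marker, allele counts can be tallied as below. This is a sketch with hypothetical genotype calls. How PBAT treats genotypes carrying one of the discarded alleles is not described here and is not modelled.

```python
from collections import Counter

def two_most_prevalent(genotypes):
    """Return the two most frequent alleles among (allele, allele) genotype pairs."""
    counts = Counter(allele for pair in genotypes for allele in pair)
    return [allele for allele, _ in counts.most_common(2)]

calls = [("A", "A"), ("A", "C"), ("C", "T"), ("A", "C")]
print(two_most_prevalent(calls))
```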

Check Overall haplotype test to additionally perform an overall haplotype test. This is a multivariate test performed on all the haplotypes whose frequency is greater than the specified cut-off frequency.

NOTE: Checking this option is only valid:

  1. when the Analyze all sub-haplotypes option is not checked,
  2. if no interactions have been specified, or
  3. if only one level of grouping (using the Subgroups box on the first tab–see Select Phenotypes) is used, or if no explicit subgrouping is used at all.

When the Overall haplotype test box is checked, the Cut-off frequency for overall haplotype tests box is active. Use this box to enter the minimum frequency a haplotype must have for inclusion in the overall test.

Check Analyze all sub-haplotypes to analyze haplotypes that are defined by subsets of the currently selected markers. Checking this box will also activate the Length of sub-haplotypes box. If “0” is entered, haplotypes from every subset (proper or not) of the SNPs will be analyzed. Entering “0” is not allowed when more than 9 SNPs are active in the pedigree spreadsheet. If a non-zero number is entered in this box, only sub-haplotypes of length equal to the specified number of SNPs will be analyzed. The sub-haplotype length is not allowed to exceed 8.

In addition, if a number greater than one and less than the total number of active markers is entered for Length of sub-haplotypes, the Only sub-haplotypes defined by adjacent SNPs check box is activated. Checking this will effectively cause the sub-haplotypes to be analyzed in a moving window. Unchecking this, which is not allowed for more than 20 total active SNPs, will go through all combinations of the selected SNPs taking the specified length of sub-haplotypes at a time, and can be very slow because of the large quantity of calculation and output requested.
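
The difference between the two modes is easiest to see by listing the marker sets each one produces. The sketch below uses hypothetical SNP names and assumes the markers are given in spreadsheet order.

```python
from itertools import combinations

def sub_haplotypes(markers, length, adjacent_only=True):
    """Marker sets analysed for a given sub-haplotype length."""
    if adjacent_only:
        # Moving window over markers in their given order.
        return [tuple(markers[i:i + length])
                for i in range(len(markers) - length + 1)]
    # Every combination of the requested length.
    return list(combinations(markers, length))

snps = ["rs1", "rs2", "rs3", "rs4"]
print(sub_haplotypes(snps, 2))
print(sub_haplotypes(snps, 2, adjacent_only=False))
```

For n markers and length k, the moving window yields n - k + 1 sets, while all combinations yield "n choose k" sets, which is why the second mode becomes slow as the number of active SNPs grows.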

If the Analyze all sub-haplotypes box is not checked, only the haplotypes defined by all the active SNPs in the pedigree spreadsheet are analyzed, while no haplotypes defined by any (proper) subset of these SNPs are analyzed. Only 8 SNPs may be active for analysis in this mode.

Check Infer missing genotypes in haplotypes to include individuals with missing genotype information in the analysis. The algorithm of [Horvath 2004] is applied to all individuals, even if they have missing genotype information. Unfortunately, this can result in a greater number of ambiguous haplotypes, and is much more compute-intensive.

If Infer missing genotypes in haplotypes is not checked, individuals with missing genotype information will be excluded from the analysis.

Check Remove ambiguous haplotypes from the analysis to exclude ambiguous haplotypes from the analysis.

Normally, ambiguous haplotypes (possible haplotypes which cannot be inferred from the parental genotypes) are included in the analysis and are weighted according to their estimated frequencies in the probands.

Enter the Maximal number of mating types for computation for use in the haplotype analysis. One mating type is one combination of what the father’s haplotype pair (diplotype) and the mother’s haplotype pair (diplotype) might be. Using 100 is sufficient for most haplotype calculations. Use fewer to speed up the calculations, and use more to be more certain to get all mating types.

Test Statistic and Computational

[Picture]

Figure 71: PBAT Genotype Analysis dialog – Test Statistic and Computational tab

The next tab in the PBAT Genotype Analysis dialog is the Test Statistic and Computational tab (see Figure 71). On this tab there are options to specify the test statistic parameters, computational parameters and the screening type.

Test Statistic Parameters

  • Test Statistics: select one of the following test statistics as appropriate.
    • FBAT-GEE: generalized estimating equation for FBAT. If one phenotype is selected, the FBAT-GEE statistic simplifies to the standard univariate FBAT-statistic. If several phenotypes are selected, all phenotypes are tested simultaneously using FBAT-GEE.

      For FBAT-GEE:

      • Both binary and continuous phenotypes will work.
      • Can combine phenotypes with different distributions (e.g. continuous and ordinal).
      • For each phenotype, an additional degree of freedom is used.
      • This statistic is not as good for a large number of phenotypes.

      Generally, the FBAT-GEE statistic can handle a moderate amount of any type of multivariate data, including groups of dichotomous phenotypes.

    • FBAT-PC: principal components FBAT extension for longitudinal phenotypes, repeated measurements and correlated phenotypes.

      This method tests a weighted sum of all the measurements, with the weights determined so as to maximize the genetic component of the overall phenotypes and to minimize the phenotypic/environmental variance. Generalized principal component analysis is used to determine these weights.

      For FBAT-PC:

      • All phenotypes must have the same distribution.
      • Degrees of freedom always equals one regardless of how many phenotypes are used.
      • As the number of phenotypes increases the power increases.
      • Quantitative phenotypes are preferable.
      • Good for a large number of phenotypes.
      • Can be its own type of marker “screening” test, since small genetic effects are amplified.

      Generally, FBAT-PC is more powerful than FBAT-GEE if the phenotypes are correlated and quantitative.

    • FBAT-LOGRANK: this option also includes FBAT-Wilcoxon. These test statistics are FBAT extensions of the classical LOGRANK and Wilcoxon tests for time-to-onset data.
  • Genetic Model: The mode of inheritance of the target/disease allele and the underlying genetic model can be selected here. The choices available are:
    • Additive
    • Dominant
    • Recessive
    • Heterozygous Advantage
    • All (calculates outputs for all four possible models)
  • Null Hypothesis: Specify the applicable null hypothesis from among the following options.
    • No linkage and no association: Standard hypothesis
    • Linkage and no association: Use if testing in a region with known linkage.
    • Linkage and no association (sw): Use if testing in a region with known linkage and there are large pedigrees. The empirical variance requires estimation of the correlation between all pedigree members, which can be unstable in large pedigrees. Here “sw” stands for “sandwich variance”, which is used to provide a more robust variance estimate.
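
The genetic models differ in how a genotype is converted into a marker score. The sketch below uses the conventional codings from the association-testing literature, with the genotype given as the number of copies (0, 1, or 2) of the target allele. PBAT's internal coding is not documented in this section, so treat this as an illustration under that assumption.

```python
def marker_score(copies, model):
    """Code a genotype, given as 0, 1, or 2 copies of the target allele."""
    if model == "additive":
        return copies
    if model == "dominant":
        return 1 if copies >= 1 else 0
    if model == "recessive":
        return 1 if copies == 2 else 0
    if model == "heterozygous advantage":
        return 1 if copies == 1 else 0
    raise ValueError(f"unknown genetic model: {model}")

for model in ("additive", "dominant", "recessive", "heterozygous advantage"):
    print(model, [marker_score(c, model) for c in (0, 1, 2)])
```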

Screening Type

Screening is useful when the phenotypes with the strongest genetic components are not known prior to the analysis and several markers have to be analyzed. The screening technique can also deal with the multiple comparison problem in genome-wide association studies. Additionally, screening can help the user to decide whether a study has sufficient power to detect a significant association. See Output Spreadsheet for how screening is output from PBAT.

Screening is an integral part of the workflow of PBAT, which, for continuous phenotypes, is called the “Conditional Mean Model”.

Two types of screening are available for continuous phenotypes. Both are based on a genetic effect size estimate (i.e. β) which is obtained by regressing the observed offspring phenotypes on the expected offspring genotype (given the parental genotypes). The larger the genetic effect size, the larger the estimated power of the FBAT test.
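
A minimal sketch of this regression, under simplifying assumptions that are not taken from PBAT: trios with both parental genotypes known, an additive coding (0, 1, or 2 copies of the target allele), and an ordinary least-squares slope as the effect size estimate. Under Mendelian transmission each parent passes on the target allele with probability equal to half its copy count, so the expected offspring count is the average of the parental counts.

```python
from statistics import mean

def expected_offspring_copies(father_copies, mother_copies):
    """Mendelian expectation of the offspring allele count given both parents."""
    return (father_copies + mother_copies) / 2

def effect_size(phenotypes, expected_scores):
    """Least-squares slope of offspring phenotype on expected marker score."""
    x_bar, y_bar = mean(expected_scores), mean(phenotypes)
    sxx = sum((x - x_bar) ** 2 for x in expected_scores)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(expected_scores, phenotypes))
    return sxy / sxx

# Hypothetical trios: (father copies, mother copies, offspring phenotype).
trios = [(1, 1, 2.0), (2, 1, 3.0), (0, 1, 1.0), (2, 2, 4.0)]
expected = [expected_offspring_copies(f, m) for f, m, _ in trios]
beta = effect_size([y for _, _, y in trios], expected)
print(expected, beta)
```

Because the regression uses the expected genotype rather than the observed offspring genotype, the estimate does not use the information that the FBAT statistic itself is later computed from.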

The two screening types are:

  • Screening based on conditional power calculations (parametric approach). The conditional power is the probability that the FBAT test is rejected given the offspring phenotype and the parental genotypes. Under the “Conditional Mean Model”, the genetic effect size (β) is used to obtain the expected value and the variance of the marker scores (i.e. offspring genotypes) under the alternative hypothesis, and thus to obtain the conditional power.
  • Screening based on non-parametric approach (Wald tests). For the Wald test, the genetic effect size is directly tested (i.e. H0 : β = 0). This method is recommended for use with continuous phenotypes that have extended pedigrees.

In general, the conditional power test is recommended over the Wald test (non-parametric approach) because the Wald test is a population-based estimate of the genetic effect size. Unlike the conditional power calculation, it does not require model assumptions under the alternative hypothesis, which is why it is called a non-parametric screening approach.

However, since the Wald test is a purely population-based approach, it is generally less powerful than the conditional power, especially when population stratification may be present [Lange 2002c].

Unfortunately, the conditional power method is more computationally intensive if there are very large pedigrees in the dataset. The non-parametric Wald test will run more quickly in these cases.

For other types of studies which do not use continuous phenotypes, use Screening based on conditional power calculations and see Test Statistic and Computational below.

GFBAT

To adjust the FBAT statistic for environmental correlation between the traits of multiple siblings in a family (GFBATs), select this option [Lange 2002b].

Computational Parameters

The following several options allow for the selecting of other necessary computational parameters.

  • Number of non-founders in one pedigree must be less than: Enter one more than the maximum number of non-founders (siblings) allowed in one pedigree. If a pedigree is found to have this number of non-founders or more, it will be broken up into smaller pedigrees. For instance, to restrict pedigrees to at most two siblings, enter 3 in this box.

    NOTES:

    1. Under the alternative pedigree algorithm, this parameter refers to the maximum number of non-founders within the family clusters identified by this algorithm, rather than to the maximum number of non-founders within any original extended pedigree.
    2. For haplotype analysis or rapid analysis, a pedigree with too many non-founders will simply not be used.
    3. If you select fewer non-founders than the actual pedigrees have, and you are not using haplotype analysis or rapid analysis, the results may depend on how the data is sorted. This is because the process of breaking up a larger pedigree into smaller pedigrees (or reducing a cluster size in the case of the alternative pedigree algorithm) which can occur in this mode is dependent on the order in which the larger pedigree is read into PBAT.
    4. Also, when you are not using haplotype analysis or rapid analysis, selecting more than approximately seven non-founders when the actual pedigrees have more than seven non-founders can become computationally intensive, especially if screening by conditional power. This can especially happen under the standard pedigree algorithm. Screening based on the non-parametric approach can reduce much of the computation and is one possible remedy for this situation. Other possible remedies are to use the alternative pedigree algorithm (if you have been using the standard algorithm) or to use rapid analysis.
  • Empirical distribution for phenotypes: (This parameter only adjusts the power in the output.) The main technique of using screening to filter which FBAT tests are considered uses the “Conditional Mean Model”. However, the “Conditional Mean Model” assumes continuous phenotypes are being used. Otherwise, a different method of obtaining conditional power needs to be selected. This is because the expected value/variance of the marker score under HA must be estimated to obtain the conditional power.

    The following distributions for phenotypes may be selected:

    • Continuous phenotypes: The “Conditional Mean Model” will be used for power calculations.
    • Approach by Jiang et al (2006): Use this option for time-to-onset calculations. Also use if there is no a priori belief that association will only be observed in affected individuals.
    • Approach by Murphy et al (2006): Use for affected-only studies or categorical phenotypes.
    • Naive allele freq estimator: The allele frequencies used for screening are estimated from the parents’ genotypes. This is an alternative to the Murphy method for affected-only studies or categorical phenotypes if there is a reason why the assumption about the relationship of the penetrance functions under the alternative hypothesis made by the Murphy method might be violated.
    • Observed allele frequencies: Another alternative to the Murphy method for affected-only studies or categorical phenotypes.

    NOTE:

    • If you select any distribution for your phenotype other than Continuous phenotypes, your phenotype variable should either be the Affection Status or have non-missing category numbers ranging between 0 and 199 inclusive.
  • Min. number of informative families: “Informative families” are those families which were included in the calculation for power and p-value statistics.

    Specify the minimum number of informative families required for the display of the FBAT-statistics. If “0” is entered, statistics on all tests will be displayed. In a typical analysis, it is not recommended to include markers with fewer than 20 informative families.

  • Maximal iterations for GEE: Enter the maximal number of iteration steps in the GEE-estimation procedure. Enter “0” to use least-squares residuals. Otherwise, GEE residuals are computed (useful when multiple correlated phenotypes are analyzed). This choice will be active only if the FBAT-GEE statistic is selected.
  • Significance level: Enter the significance level to be used for the power calculations.

    Typically, 0.0005 might be used. However, for logrank tests, a higher significance level, such as 0.01, is preferable.
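
The effect of the Min. number of informative families setting described above can be pictured with a minimal Python sketch. It assumes the results are available as a list of dictionaries keyed by the nbr_info_fam column of the expanded output. The function name and the sample rows are hypothetical and are not part of SVS or PBAT.

```python
def filter_by_informative_families(rows, min_families=20):
    """Keep rows whose informative-family count meets the threshold.

    A threshold of 0 keeps every row, mirroring the "0" setting.
    """
    return [row for row in rows if row["nbr_info_fam"] >= min_families]


# Hypothetical result rows, for illustration only.
results = [
    {"marker": "snp_a", "nbr_info_fam": 12},
    {"marker": "snp_b", "nbr_info_fam": 20},
    {"marker": "snp_c", "nbr_info_fam": 57},
]

kept = filter_by_informative_families(results, min_families=20)
print([row["marker"] for row in kept])
```

With the threshold at 20, the marker with only 12 informative families is dropped from the displayed results.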

Output Format

The parameters in this box allow for indicating alternative and/or additional outputs to be included in the resulting spreadsheet.

  • Use compact output format: Select this option to output the shorter format that was developed for the database at the Channing Laboratories. This format is guaranteed to contain 17 columns plus a row label column for the marker names if Output -log10 p-values is not selected. If -log10 p-values are included in the output, an additional 3 columns are added.
  • Display p-values as signed numbers to show the direction of the main effect: Select this option to place a negative sign on the p-value when there is a negative correlation between the phenotype and the number of transmitted target/disease alleles. If this option is not selected, all p-values will be displayed as positive numbers.

    NOTES:

    1. The signed p-value is a more reliable indicator of the direction of the effect than the heritability output, which is only an approximation to the direction of the effect.
    2. Signed p-values are not available when more than one phenotype is being tested at a time under FBAT-GEE, or when testing for interactions.
  • Output -log10 p-values: Select this option to output -log10(p-value) for all p-values in the output, in addition to the p-values themselves.
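
As a rough illustration of how a signed p-value and its -log10 transform relate, the following Python sketch splits a signed value into a direction and a magnitude. The function name is hypothetical. Taking the -log10 of the absolute value is an assumption made here for illustration, since this section does not state how SVS treats the sign in the -log10 columns.

```python
import math


def describe_signed_pvalue(signed_p):
    """Split a signed p-value into (direction, p-value, -log10 p-value).

    A negative sign is read as a negative correlation between the
    phenotype and the number of transmitted target/disease alleles.
    """
    direction = "negative" if signed_p < 0 else "positive"
    p = abs(signed_p)
    # Assumption: the transform is applied to the magnitude only.
    neg_log10 = math.inf if p == 0 else -math.log10(p)
    return direction, p, neg_log10


direction, p, neg_log10 = describe_signed_pvalue(-0.001)
print(direction, p, round(neg_log10, 3))
```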

Multiple Processes

[Picture]

Figure 72: PBAT Genotype Analysis dialog – Multiple Processes tab

The next tab in the PBAT Genotype Analysis dialog is the Multiple Processes tab (see Figure 72). The options on this tab let you run PBAT in multiple processes. This allows you to take advantage of multiple processors on a single machine by selecting Local Machine, or of multiple machines in a distributed environment by selecting Run on Condor® Pool. If the option Divide Jobs Into Multiple Processes is not checked, PBAT will run normally on the current computer.

NOTE:

  • Dividing jobs into multiple processes is not allowed for haplotype analysis.

Local Machine

With dual-core and multiple-processor systems now common as desktop configurations, it makes sense to take full advantage of the extra CPU resources available. It may also be convenient to divide the analysis into multiple jobs to keep memory usage low when analyzing hundreds of thousands of SNPs.

When running multiple processes on a local machine, setting Maximum number of simultaneous jobs to less than the total number of jobs limits the number of jobs that can run at one time. It is recommended to run only one concurrent job per processor. This avoids memory access contention, which severely impacts performance. Typically, this number should therefore equal the number of processors and/or cores available on the current machine.
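
One way to arrive at a value for Maximum number of simultaneous jobs is sketched below in Python. The helper name is hypothetical and nothing here is part of SVS. It only applies the one-job-per-core guideline above, using the logical core count reported by the operating system.

```python
import os


def suggested_max_simultaneous_jobs(total_jobs, cores=None):
    """Suggest one concurrent job per core, capped at the total job count."""
    if cores is None:
        # os.cpu_count() may return None, so fall back to a single core.
        cores = os.cpu_count() or 1
    return max(1, min(total_jobs, cores))


print(suggested_max_simultaneous_jobs(total_jobs=10))
```

Note that os.cpu_count() reports logical cores, which on some systems is higher than the number of physical cores, so a lower value may still be preferable.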

Run on Condor® Pool

Condor® is a freely available, specialized, batch system for managing compute-intensive jobs on a distributed network environment. Condor® and its extensive user manuals can be found at http://www.cs.wisc.edu/condor/. As Condor® is cross-platform, you can easily set up a Condor® pool on Windows, Linux or Mac OS X based systems and take advantage of a distributed computing environment with PBAT Genotype Analysis.

To run multiple jobs through Condor®, select the Run on Condor® Pool option and browse to the location of the bin folder inside the directory where Condor® was installed on the system. Click Test to have SVS check that Condor® is configured and connected to a central manager.

It may be advantageous to specify the creation of more jobs than the number of machines available in the Condor® pool. Condor® will properly queue jobs and even out the effect of slower and faster computers taking longer or shorter times on each job.
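
The benefit of creating more jobs than machines can be illustrated with a toy Python simulation. It is only a sketch under simplifying assumptions and does not model the actual Condor® scheduler: the work is split into equal jobs, each machine has a fixed relative speed, and whichever machine becomes free first takes the next queued job. The function name and the numbers are hypothetical.

```python
import heapq


def makespan(total_work, n_jobs, speeds):
    """Time until all jobs finish when idle machines pull from one queue."""
    job_size = total_work / n_jobs
    # Heap of (time at which the machine becomes free, machine index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_jobs):
        start, machine = heapq.heappop(free_at)
        done = start + job_size / speeds[machine]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, machine))
    return finish


speeds = [1, 2, 3]  # one slow, one medium, one fast machine
print(makespan(120, 3, speeds))   # one job per machine
print(makespan(120, 12, speeds))  # four times as many jobs as machines
```

In this toy setup the twelve-job split finishes in about half the time of the three-job split, because the faster machines pick up additional jobs instead of idling while the slowest machine works through its single large job.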

For instructions on how to install Condor® on your network, see Appendix 15.

Output Spreadsheet

When all of the parameters are set, click Run to begin the analysis. A progress dialog will appear. The analysis may be stopped by pressing Cancel on the progress dialog.

If the PBAT analysis finishes normally, and results were obtained using the selected parameters, a results spreadsheet will be created and displayed. If no test has enough informative families for display, no output spreadsheet will be created.

Using Output for Screening

Screening, which filters the FBAT tests that are considered, mainly relies on the “Conditional Mean Model”.

In PBAT, the screening results are output into the same spreadsheet as the results from the actual FBAT tests. This allows sorting by the screening (power) results and selecting only those rows with the highest power. The FBAT tests in these rows may be treated as if they had been calculated separately from the other FBAT tests, with the multiple-test correction applied only to these FBAT tests. This is valid because the screening tests are independent of the offspring genotype component of the FBAT tests themselves. Both the screening tests and the FBAT tests are conditioned on the same known quantities, namely the parental genotypes and the offspring phenotype(s).
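
The workflow described above can be sketched in Python as follows. This is a minimal illustration, assuming the rows carry the power and pvalue columns of the compact format. The function name, the sample rows, and the choice of a Bonferroni correction are assumptions made for the example. The text above only requires that the multiple-test correction be applied to the selected tests.

```python
def screen_then_test(rows, top_k, alpha=0.05):
    """Rank rows by conditional power, keep the top_k, and test only those.

    The Bonferroni threshold divides alpha by the number of selected
    tests rather than by the total number of tests.
    """
    ranked = sorted(rows, key=lambda row: row["power"], reverse=True)
    selected = ranked[:top_k]
    if not selected:
        return []
    threshold = alpha / len(selected)
    # abs() allows for signed p-values in the output.
    return [row for row in selected if abs(row["pvalue"]) <= threshold]


# Hypothetical rows, for illustration only.
rows = [
    {"marker": "snp_a", "power": 0.90, "pvalue": 0.01},
    {"marker": "snp_b", "power": 0.80, "pvalue": 0.04},
    {"marker": "snp_c", "power": 0.20, "pvalue": 0.0001},
    {"marker": "snp_d", "power": 0.10, "pvalue": 0.5},
]

significant = screen_then_test(rows, top_k=2, alpha=0.05)
print([row["marker"] for row in significant])
```

Here snp_c is never tested, despite its small p-value, because it was not among the two rows with the highest screening power. Only snp_a passes the threshold of 0.05 / 2.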

Compact Format

This shorter format was developed for the database at the Channing Laboratories. It is guaranteed to contain 17 columns plus a row label column for the marker names (or the first marker of a moving window for haplotype analysis) unless Output -log10 p-values is selected. An additional 3 columns will be added if -log10 p-values are included in the output.

The 17 columns are as follows:

  • Groupname: this is the grouping variable, if grouping is used. Otherwise, the column will be filled with the missing value “?”.
  • Group: this is the group variable value, if grouping is used. Otherwise, the column will be filled with the missing value “?”.
  • Allele: the allele or haplotype tested.
  • Freq: allele or allele combination frequency.
  • HWE: p-value of the Hardy-Weinberg test for the parents.
  • phenos: phenotype(s) used.
  • cov: covariate(s) used, if any.
  • inter: interaction variable(s) used, if any.
  • model: the genetic model used for this test.
    • additive
    • dominant
    • recessive
    • heterozygous advantage
  • test: statistical test used.
    • FBAT-GEE
    • FBAT-PC
    • FBAT-LOGRANK
    • FBAT-Wilcoxon
    • optimal FBAT-LOGRANK (naive weights)
  • #infofam: the number of families that were informative for this test.
  • pvalue: p-value for the FBAT statistic. This is for the main genetic effect, if this test had an interaction term.
    NOTES:
    1. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
    2. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.
  • power: conditional power estimate, if screening with conditional power has been selected.
  • wald: the result of the Wald test. The values here will only be meaningful if the conditional mean model would have been meaningful for this test.
  • herit: the heritability of this trait. The heritability is defined as the proportion of phenotypic variance explained by the analyzed marker. A negative sign denotes a negative correlation between the phenotype and the number of transmitted target/disease alleles.
  • FBATI: joint p-value for the main effect and the interaction term. If no interaction term was selected, then a value of “1” will be returned.
  • powerFBATI: power for the FBAT interaction statistic, if an interaction term and screening with conditional power was selected.

If -log10 p-values are included in the output, the following additional columns will be included:

  • -log10 pvalue: -log10(pvalue), inserted to the right of the pvalue column
  • -log10 wald: -log10(wald), inserted to the right of the wald column
  • -log10 FBATI: -log10(FBATI), inserted to the right of the FBATI column

Normal Expanded Format

The normal expanded format output will have a varying number of columns, depending on the parameters selected and how many phenotypes are in the phenotype spreadsheet. Since a column will be present for every possible phenotype, the spreadsheet may be quite wide. However, all output statistics are visible in this format.

See Output for Time-to-Onset Analysis for the time-to-onset analysis output fields in the expanded format. Otherwise, the output spreadsheet columns in the expanded format may be divided into several categories:

  • Row label with marker information
  • Subgroup designation
  • Allele information and genetic model
  • P-value and Power
  • Phenotype columns
  • Extra columns for powers of predictor phenotypes, if necessary
  • Heritability
  • Extra columns relating to FBAT-PC, if necessary
  • Extra columns relating to interactions, if necessary
  • -log10 columns for p-values (if this output option is selected)

The column groups are:

  • Marker information:
    For SNP analysis, the marker (SNP) name is set as the row label. For haplotype analysis, the first marker (SNP) of the haplotype is set as the row label.
  • Subgroup designation:
    If you have defined sub-groups of the population, the subgroup to which the analysis was restricted is shown in the first column. The missing value “?” in the first column means that all of the samples were analyzed.
  • Marker/allele information and genetic model:
    For SNP analysis, the allele being tested is shown, followed by the following information:
    • freq: Allele frequency overall
    • HW: Hardy-Weinberg p-value overall
    • freq_parent: Allele frequency for the parents
    • HW_parents: Hardy-Weinberg p-value for the parents

    For haplotype analysis, the outputs are instead:

    • markers used: SNPs used in defining the haplotype
    • haplotype: the respective alleles separated by colons
    • hap freq: the haplotype frequency

    These columns are followed (for both SNP and haplotype analysis) by a column for the genetic model. The values in this column (model) represent:

    • additive
    • dominant
    • recessive
    • heterozygous advantage

    If “All” was selected for the genetic model, the analysis will have been run not only for each marker and allele, but also for each model. In this case, an entry in this column will show which genetic model was used for that row’s analysis.

    Following the genetic model is a column (nbr_info_fam) that contains the number of informative families for the marker specified by the row label and allele (or for the haplotype listed for the row).

  • P-values and Power:
    After the marker/allele information and the genetic model are listed in the spreadsheet, the statistical outputs are listed in the following columns:
    • pvalue(FBAT): P-value for the FBAT statistic. This is for the main genetic effect, if this test also included an interaction term.
      NOTES:
      1. If the GFBAT adjustment for environmental correlation has been specified, this statistic will reflect that adjustment.
      2. If you have specified Display p-values as signed numbers to show the direction of the main effect, a negative sign on the p-value will denote a negative correlation between the phenotype and the number of transmitted target/disease alleles.
    • pvalue(FBATI): Joint p-value for the main effect and the interaction term. If no interaction term was selected, this column will be filled with ones.
    • power(FBAT): Conditional power estimate, if screening with conditional power has been selected.
    • power(FBATI): Power for the FBAT interaction statistic, if this test had an interaction term and screening with conditional power has been selected. Otherwise, this column will be filled with the value 0.05.
    • pvalue(Wald): P-value of the overall Wald test for a genetic effect in the conditional mean model. These values will be meaningful only if the conditional mean model would have been appropriate for this test.
    • pvalue(WaldI): P-value of the overall Wald test for a gene/covariate interaction in the conditional mean model. These values will be meaningful only if the conditional mean model would have been appropriate for this test.
  • Phenotype columns:
    A column for every phenotype (including Affection Status) that is used in the model is shown. The following notation is used:
    • Not used in the analysis for this row.
    • Selected as a phenotype/trait and tested for association with FBATs in this row’s results.
    • Selected and used as a covariate/predictor variable. The 1’s indicate that the covariate/predictor variable is significant at both the 5% and the 1% significance levels in the conditional mean model.
    • Selected and used as a covariate/predictor variable. The 1 indicates that the covariate/predictor variable is significant at the 5% level, and the 0 indicates that it is not significant at the 1% level in the conditional mean model.
    • Selected and used as a covariate/predictor variable. The 0’s indicate that the covariate/predictor variable is not significant at either the 5% or the 1% significance levels in the conditional mean model.
    • Selected and used as an interaction variable in this row.
  • Extra columns for powers of predictor phenotypes, if necessary:
    If you used predictor variables with a maximum power greater than one, extra columns are included for the higher power phenotypes. The values for this column will be the same for predictor variables as above.
  • Heritability:
    The heritability of the selected phenotype(s) will have associated columns.

    The heritability is defined as the proportion of phenotypic variance explained by the analyzed marker. A negative sign denotes a negative correlation between the phenotype and the number of transmitted target/disease alleles.

    If you selected more than one phenotype, and you also asked for a maximum of more than one phenotype in a group, one column corresponding to each selected phenotype will appear here, and display the heritability whenever the phenotype was involved in the calculations. A value of zero will be used for uninvolved phenotypes.

  • Extra columns relating to FBAT-PC, if necessary:
    If FBAT-PC has been selected as the test statistic, one additional column will be included in the output spreadsheet for every phenotype, indicating that phenotype’s weight in the FBAT-PC calculation.
  • Extra columns relating to interactions, if necessary:
    If one or more interaction variables are selected, additional columns will be included in the output spreadsheet. These columns are (in order):
    • main effect: An estimate of the regression coefficient for the main effect.
    • Std error: Standard error for the main effect coefficient.
    • p-value: P-value for the main effect coefficient.
    • interaction: An estimate of the regression coefficient for the interaction term.
    • Std error: Standard error for the interaction coefficient.
    • p-value: P-value for the interaction term coefficient.
    • FBAT-I: The FBAT statistic p-value for the interaction term coefficient (analogous to the above p-value for the interaction term and should have a similar value).
    • h-main: The heritability of the main effect.
    • h-interaction: The heritability of the interaction.
  • -log10 columns for p-values:
    Additional columns containing -log10(p-value) will be added if this output option is selected. The additional columns will be:
    • -log10 pvalue(FBAT): -log10(pvalue(FBAT)), inserted to the right of the pvalue(FBAT) column
    • -log10 pvalue(FBATI): -log10(pvalue(FBATI)), inserted to the right of the pvalue(FBATI) column
    • -log10 pvalue(Wald): -log10(pvalue(Wald)), inserted to the right of the pvalue(Wald) column
    • -log10 pvalue(WaldI): -log10(pvalue(WaldI)), inserted to the right of the pvalue(WaldI) column

Output for Time-to-Onset Analysis

For time-to-onset analysis, the outputs are somewhat different. This output may be divided into the following categories:

  • Row label with marker information
  • Subgroup designation
  • Allele information and genetic model
  • P-value and Power
  • -log10 columns for p-values (if this output option is selected)

The column groups are:

  • Marker information:
    For SNP analysis, the marker (SNP) name is set as the row label. For haplotype analysis, the first marker (SNP) of the haplotype is set as the row label.
  • Subgroup designation:
    If you have defined sub-groups of the population, the subgroup to which the analysis was restricted is shown in the first column. The missing value “?” in the first column means that all of the samples were analyzed.
  • Marker/allele information and genetic model:
    For SNP analysis, the allele being tested is shown, followed by the following information:
    • freq: Allele frequency overall
    • HW: Hardy-Weinberg p-value overall
    • freq_parent: Allele frequency for the parents
    • HW_parents: Hardy-Weinberg p-value for the parents

    For haplotype analysis, the outputs are instead:

    • markers used: SNPs used in defining the haplotype
    • haplotype: the respective alleles separated by colons
    • hap freq: the haplotype frequency

    These columns are followed (for both SNP and haplotype analysis) by a column for the genetic model. The values in this column (model) represent:

    • additive
    • dominant
    • recessive
    • heterozygous advantage

    If “All” was selected for the genetic model, the analysis will have been run not only for each marker and allele, but also for each model. In this case, an entry in this column will show which genetic model was used for that row’s analysis.

    Following the genetic model is a column (nbr_info_fam) that contains the number of informative families for the marker specified by the row label and allele (or for the haplotype listed for the row).

  • P-values and Power:
    After the marker/allele information and the genetic model are listed in the spreadsheet, the statistical outputs are listed in the following columns:
    • FBAT-Wilcoxon: P-value for the FBAT-Wilcoxon statistic.
    • power: Power for the FBAT-Wilcoxon statistic.
    • FBAT-LOGRANK: P-value for the FBAT-LOGRANK statistic.
    • power: Power for the FBAT-LOGRANK statistic.
    • optimal FBAT-LOGRANK (FH-weights): P-value for the optimal FBAT-LOGRANK statistic (with FH-weights).
    • power: Power for the optimal FBAT-LOGRANK statistic (with FH-weights).
    • optimal FBAT-LOGRANK (naive-weights): P-value for the optimal FBAT-LOGRANK statistic (with naive-weights).
    • power: Power for the optimal FBAT-LOGRANK statistic (with naive-weights).
  • -log10 columns for p-values:
    Additional columns containing -log10(p-value) will be added if this output option is selected. The additional columns will be:
    • -log10 FBAT-Wilcoxon: -log10(FBAT-Wilcoxon), inserted to the right of the FBAT-Wilcoxon column
    • -log10 FBAT-LOGRANK: -log10(FBAT-LOGRANK), inserted to the right of the FBAT-LOGRANK column
    • -log10 optimal FBAT-LOGRANK (FH-weights): -log10(optimal FBAT-LOGRANK (FH-weights)), inserted to the right of the optimal FBAT-LOGRANK (FH-weights) column
    • -log10 optimal FBAT-LOGRANK (naive-weights): -log10(optimal FBAT-LOGRANK (naive-weights)), inserted to the right of the optimal FBAT-LOGRANK (naive-weights) column