‹‹ Back to SVS Home
Spreadsheet Overview
5.1 Spreadsheet Overview
Spreadsheets provide the basis for all data manipulation, plotting and analysis in SVS.
All operations performed on spreadsheets are logged in the Node Change Log of the Project Navigator for reference and for replication of procedures.
Navigating the Spreadsheet
Within a spreadsheet there are six regions to note and become familiar with (see Figure 26):
- Spreadsheet Title Bar: contains the name of the spreadsheet as named in the Project Navigator Window, as well as the Navigator Window Node ID number in square brackets. See Project Navigator Window for more information.
- Menu Bar contains the menus: File, Edit, Select, Quality Assurance, Analysis, Plot, Scripts, and Help. See Spreadsheet Menus Overview below for more information.
- Tool Bar: contains icons for shortcuts to certain menu items frequently used. Hovering over an item will pop up a tooltip explaining what the icon does. The tool bar can be removed by right-clicking on the menu bar and clicking on the Navigator Toolbar option. It can be re-shown by right-clicking on the menu bar and clicking on Navigator Toolbar again.
- The Spreadsheet: contains ten distinct regions. These regions are the Genetic Marker Map Information (only available for marker mapped spreadsheets), Row-Information Columns, Column Headers, Column Headers, Genetic Marker Map Information (if shown on a marker mapped spreadsheet), Row-Information Columns, Row-Information Columns, data fields, left-right scroll bar, and up-down scroll bar.
- Spreadsheet Tabs: at the bottom of a spreadsheet, represent all spreadsheets sharing a common parent node in the Project Navigator. Certain actions on a spreadsheet (such as activating rows or columns) when the spreadsheet has child nodes, will cause a new spreadsheet to be created, which also creates a new child node in the Project Navigator and its corresponding tab at the bottom of the spreadsheet. Clicking a tab will make its corresponding spreadsheet active.
Special Features of a Pedigree Spreadsheet
A pedigree spreadsheet has six required columns, Family ID, Patient ID, Father ID, Mother ID, Sex, and Affection
Status. Any spreadsheet that has these six column headers and the columns are the correct type will be identified as a
pedigree spreadsheet. This is indicated not only with a special node icon in the Project Navigator, but also by setting the
background color for the column name headers to blue, see Figure 27.
Whenever a pedigree spreadsheet is joined to another spreadsheet the six pedigree columns will be kept as the left most columns regardless of the join order. See Joining or Merging Spreadsheets for more information on joining two spreadsheets.
Relationships and Dependencies Between Spreadsheets
If the navigator node for a spreadsheet has child nodes - either subset spreadsheets, joined or appended spreadsheets, analysis
results or plots - then any changes made to the original spreadsheet will result in a new navigator node being created, and a
new spreadsheet tab at the bottom of the spreadsheet view. This keeps the parent node in the same state it was when the
first child node was created. This is why child nodes are named as such, since they are descendants of the parent
nodes.
If you try to delete a parent node, you can only do so by deleting all associated child nodes. In some cases, such as joining
two spreadsheets, a dependency is created where one or more spreadsheets rely on the data contained in another spreadsheet.
Dependencies are denoted in the Project Navigator by highlighting all associated nodes when the dependent node is selected,
and by listing all linked nodes in the “(Linked To)” column. A spreadsheet cannot be deleted if one or more nodes is
dependent on it.
If a spreadsheet navigator node does not have any child nodes, then any changes to the spreadsheet will not result in a
new node being created.
Row-Information Columns
In any spreadsheet, there are two static columns which always remain visible even when scrolling. They are named the row
number and row label columns.
Row Label Headers
The Row Label Header describes what the row labels represent. The default Row Label Header is “Label” if
generic row labels are created on import of data. Left-clicking on the Row Label Header sorts the labels first in
ascending order and a second click sorts in descending order. Ascending order is indicated by a triangle pointing
toward the top of the spreadsheet, and descending is indicated by a triangle pointing toward the bottom of the
spreadsheet.
Row Numbers
Row Numbers are static; they never change regardless of any sort operations performed on the spreadsheet.
Right-clicking on a Row Number brings up the Row Menu, which contains, at a minimum, the Edit This Spreadsheet
option. If the spreadsheet has a genetic marker map applied to row labels, and if the marker map has RS ID
information, rows containing RS ID’s will also have hyperlinks to websites referencing the RS ID in the Row
Menu.
Row Labels
Row Labels are identifiers for the rows. Row labels can be used to sort spreadsheet data ascending or descending.
Left-clicking on the row label will set the active/inactive state of the entire row. Right-clicking brings up the Row Menu,
which contains, at a minimum, the “Edit This Spreadsheet” option. If the spreadsheet has a genetic marker map applied to
row labels, and if the marker map has RS ID information, rows containing RS ID’s will also have hyperlinks to websites
referencing the RS ID in the Row Menu.
Row States
There are two possible row states: inactive or active. Inactive means a row will be excluded from any analysis or plot. Active
means a row will be included in any analysis or plot. Multiple rows can simultaneously be made active or inactive by toggling
the first column to the desired state, then <Shift>-left-clicking on a distant column. All columns in between will be set to the
state of the first column clicked.
Column Headers
In any spreadsheet, there are two static rows which always remain visible even when scrolling. They are named the Column Number Headers and Column Label Headers.
Column Number Headers
A Column Number Header indicates the numerical index of a column. On a given Column Number Header the column
type is displayed in a large blue letter. See Column Data Types below for a description of the possible column
types and icons. Either left- or right-clicking on a Column Number Header opens the Column Header Menu.
The Column Header Menu gives options for manipulating the spreadsheet data based on the values in the
selected column, or for plotting the selected column. See Column and Row Spreadsheet Operations for more
information.
Column Name Headers
A Column Name Header is the name of a column as specified on import of the spreadsheet data, or after editing the
spreadsheet. Column Name Headers should be informative of the data contained in the column.
Left-clicking on a Column Name Header sets the column state. See Column States for more information.
Right-clicking on a Column Name Header opens the Column Header Menu. See Column and Row Spreadsheet Operations
for more information.
Column Data Types
Some column operations are specific to the type of column, indicated by a large blue letter on a column number header. The types are as follows:
- ‘B’: Indicates a binary column (values 0, 1, ?).
- ‘C’: Indicates a categorical column (values such as “Low”, “Medium”, or “High”).
- ‘G’: Indicates a genotype column (bi- or multi-allelic markers with alleles separated by an underscore such as “A_B” or “2_2”).
- ‘I’: Indicates an integer-valued column (values such as -1, 0, 1, 2, 10, etc.).
- ‘R’: Indicates a real-valued column (values containing decimal places encoded as single or double precision floating point values).
Column types can be changed or edited using the Spreadsheet Editor, see Editing a Spreadsheet for more information.
Column States
There are three possible column states:
- Active Means a column is set as an independent variable and included in analysis (except for certain situations see Genotype Association Tests) The text and data in this column is black.
- Inactive Means that the column is excluded from any analysis. The text and the data in this column is gray.
- Dependent Means that the column is set as a dependent variable, multiple columns are allowed to be set as dependent for a multivariate analysis. The text and data in this column is magenta.
The column state of a column can be set by left-clicking on a column’s column label header. Repeated clicking will cycle
through the column states (active, dependent, inactive).
Column state can also be set by right-clicking on either column header (or by left-clicking on the column number header)
and selecting Make Active, Make Inactive or Make Dependent. Multiple columns can be made active, dependent, or inactive
by toggling the first column to the desired state, then <Shift>-left-clicking on a distant column. All columns in between will
be set to the state of the first column clicked.
Genetic Marker Map Information
Applying a genetic marker map to a spreadsheet saves the marker map and associated annotations as special marker map
fields in the spreadsheet and reorders all genetic markers first by chromosome and then by position number.
This spreadsheet is referred to as a “marker mapped spreadsheet”. This marker map information will be used
for certain analyses and can be used for selecting subsets of data. See Genetic Marker Maps Overview for
more information. See Figure 28, A identifies the Map button, and B identifies the marker map information
fields.
Map Button
If a spreadsheet has a genetic marker map applied, then a green button with the word “Map” will appear to the left of
the Row Label Header above the Row Numbers. Left-clicking on this button will show all available marker map fields either
in rows or columns depending on whether the marker names are row labels or column name headers. If no genetic marker
map is applied this button will be grayed out.
Marker Map Fields
A marker map field is a row or column of marker map data, such as Chromosome or Position, which may or may not be
unique to the marker names. If there is no marker map information for a particular column or row then the marker map
fields will be blank for that column or row. If a row or column contains marker map information (at the very minimum
Chromosome and Position as required), right-clicking on that field will list hyperlinks to websites that look
up the particular marker by either chromosome and position or RS ID (if available) in the supported online
databases.
Right-clicking on the Map button allows the user to choose which marker map fields are shown or hidden. The user is not
allowed to hide all fields. Clicking on a field name will toggle between showing and hiding the field. To hide all fields, left-click
on the Map button.
Marker map information can be searched using the Edit > Find search dialog. Only one marker map field can be
searched at a time. See Finding Strings or Values in a Spreadsheet for more information.
Saving/Exporting Spreadsheets
For more information on saving or exporting data from spreadsheets see Exporting Spreadsheet Data.
Spreadsheet Menus Overview
A brief description of all spreadsheet menu items will follow along with links for more information on the topic.
File Menu
- Save As...:
A spreadsheet can be saved and exported as a text file, any supported third party format, as family-indexed PED/TPED/BED files, as a Golden Helix, Inc. DSF file, as a Golden Helix, Inc. Legacy GHD file, as a Genome Browser Track, or as CNT files. For more information see Exporting Spreadsheet Data. - Apply Genetic Marker Map:
A genetic marker map can be applied to a spreadsheet when marker names are either row labels or column name headers. See Applying a Genetic Marker Map to a Spreadsheet for more information.
- Export Genetic Marker Map:
A genetic marker map applied to a spreadsheet can be exported and saved as a DSM file. This allows the marker map to be applied to other spreadsheets in the current project or in other projects. See Exporting an Applied Genetic Marker Map to a DSM file for more information.
- Drop Genetic Marker Map:
It may be desirable to drop or remove an applied marker map. This option does just that. See Dropping a Genetic Marker Map from a Spreadsheet for more information.
- Append Spreadsheets:
Appending spreadsheets is used for combining two datasets that in general do not share any row labels, but do share all or most column name headers. For example, adding additional samples to a spreadsheet. See Appending Spreadsheets for more information.
- Join or Merge Spreadsheets:
Joining or merging spreadsheets is used for combining two datasets that generally share row labels. For example, joining genotype and phenotype data for the same samples. If two spreadsheets do not share row labels, an option is available to join on sorted row labels. See Joining or Merging Spreadsheets for more information.
- Create Top-Level Spreadsheet:
Creates a new dataset with no dependencies as a child of the project node from the current spreadsheet. See Create Top-Level Spreadsheet for more information.
- Create Marker Map from Spreadsheet:
Create a marker map from the current spreadsheet. Requires columns containing Marker Name, Chromosome, and Position information.
- Close:
Closes the spreadsheet.
Edit Menu
- Edit This Spreadsheet:
Opens the Spreadsheet Editor. See Editing a Spreadsheet for more information.
- Transpose Spreadsheet:
Transposes all columns in a spreadsheet of a selected type. See Transposing Spreadsheets for more information.
- Convert to Pedigree Spreadsheet:
Converts the spreadsheet to a pedigree spreadsheet by either selecting the pedigree fields from among the available spreadsheet columns or by generating generic values. The only required column is a binary affection status column. See Convert to Pedigree Spreadsheet for more information.
- Recode:
This sub-menu contains several options for recoding spreadsheet data, such as genotypes and alleles, and renaming row labels and column headers. See Recode: for more information.
- Go To...:
Opens a dialog to shift the visible data fields to the specified row and column number in the top left corner.
- Find:
Searches the spreadsheet for a string or value, or partial string or value, as indicated. See Finding Strings or Values in a Spreadsheet for more information.
- Find Next:
Searches for the next occurrence of a string or value as specified in the Find dialog. See Finding Strings or Values in a Spreadsheet for more information.
Select Menu
- Row:
This sub-menu contains several options for row operations. See Row Select Operations for more information.
- Column:
This sub-menu contains several options for column operations. See Column Select Operations for more information.
- Activate All:
Activates all columns and rows in a spreadsheet.
- Activate By Chromosomes:
This option allows for selection of rows or columns based on the chromosome field of a genetic marker map. Requires a marker mapped spreadsheet. See Activate By Chromosome for more information.
- Activate By Column Type:
This option allows for selection of columns based on the their data type. See Activate By Column Type for more information.
- Subset Active Data:
Creates a subset spreadsheet based on active columns and rows. See Creating Subset Spreadsheets for more information.
- Filter by Annotation:
This sub-menu contains several options for filtering based on annotation track information, such as coding region. See Filtering with Annotation Tracks for more information.
- Activate Columns By Row Labels:
This function activates all columns in a spreadsheet based on the active row labels of a second spreadsheet.
- Activate or Inactivate Based on Second Spreadsheet:
This function activates or inactivates all columns or rows in a spreadsheet based on the active row labels or column headers of a second spreadsheet.
- Inactivate Columns By Row Labels:
This function inactivates all columns in the spreadsheet based on the active row labels of a second spreadsheet.
Quality Assurance Menu
- Genotype:
This sub-menu contains several options to improve the quality of genotype data, include genotype statistics.
- CNV:
This sub-menu contains several options for identifying poor markers or samples in copy number data. These markers or samples could then be deactivated before performing any analyses.
- Numeric Principal Component Analysis:
Runs principal component analysis on numeric data or pre-computed principal components. See Principal Component Analysis Overview and Numeric Principal Component Analysis for more information.
- Genotype Principal Component Analysis:
Runs principal component analysis on genotypic data using one of three genetic models. See Principal Component Analysis Overview and Genotypic Principal Component Analysis for more information.
- PBAT Family-Based QC:
Calculates Family-Based genotype statistics by marker or by proband. A pedigree spreadsheet is required to use this feature. See PBAT Family-Based QC Statistics for more information.
- Multidimensional Outlier Detection:
This function determines outliers based on user-specified columns. A distance score is computed by summing the squared distances from the median in each column, then taking the square root of the sum. The sample is considered an outlier if its distance score is greater than a threshold, based on a user-specified multiplier and the quartiles of each column.
For more information about the details of this function, see Quality Assurance Procedures.
- Compare Columns:
This function compares the values of two columns and inactives the rows in which mismatched values lie. The user can optionally create a subset of the rows with matching data values and/or the rows with differing data values.
For more information about the details of this function, see Quality Assurance Procedures.
- Column Statistics:
This function calculates and/or reports the following approximate values for each real-, integer-valued and binary (optional) active column: Lower Outlier Threshold = Q1 - M*IQR, Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Upper Outlier Threshold = Q3 - M*IQR, Interquartile Range (IQR), Variance and Standard Deviation. M is a user defined multiplier to define outlier thresholds based on IQR (Inter Quartile Range).
For more information about the details of this function, see Quality Assurance Procedures.
- Row Average By Chromosome:
For a spreadsheet with real valued data, such as log ratio data, this function calculates the mean of each row, creating a new spreadsheet with the respective row means. If the data is marker mapped, it calculates the means by chromosome.
For more information about the details of this function, see Quality Assurance Procedures.
Analysis Menu
- Numeric Association Tests:
Performs association tests on phenotype or other numeric columns. See Numeric Association Tests for more information.
- Numeric Regression Analysis:
Performs linear or logistic regression on numeric, categorical or binary covariates. See Regression Analysis for more information.
- Genotype Association Tests:
Performs genotype association tests on genotype columns using a quantitative or binary column as the dependent variable. See Genotype Association Tests for more information.
- Haplotype Association Tests:
Calculates haplotype frequencies of blocks of genotypes and the association between those frequencies and a binary outcome. See Haplotype Association Tests for more information.
- Haplotype Block Detection:
Detect blocks of haplotypes that represent biological groupings of genotypes. See Haplotype Block Detection for more information.
- Runs of Homozygosity:
Calculates runs of homozygosity in genotypic data. This feature requires a genetic marker map be applied to the spreadsheet. See Runs of Homozygosity for more information.
- CNAM Optimal Segmenting:
Performs the optimal segmenting algorithm on numeric copy number data. This feature requires a marker map applied to the spreadsheet’s columns. See Using CNAM Optimal Segmenting for more information.
- CNV Association Tests:
Performs association tests on log2 ratio copy number data or CNV segment covariates. This feature requires a genetic marker map be applied to the spreadsheet’s columns. See CNV Association Tests for more information.
- CNAM Output Analysis:
This sub-menu contains several options for performing analytic measures on copy number data and the spreadsheets created by CNAM, such as the Segment List spreadsheet. See Manipulating and Analyzing CNAM Output for more information.
- PBAT Genotype Analysis
Runs the PBAT algorithm from a pedigree spreadsheet containing genotypic data. See PBAT Genotype Analysis for more information.
- PBAT CNV Analysis
Runs the PBAT algorithm from a marker-mapped pedigree spreadsheet containing numeric CNV intensity data. See PBAT CNV Analysis for more information.
- Fisher’s Exact Test for Binary Predictors
A Fisher’s Exact Test can be performed on binary predictors and a binary case/control dependent variable using Analysis > Fishers Exact Test for Binary Predictors. See Fisher’s Exact Test for Binary Predictors for more information.
- Combined Multivariate and Collapsing (CMC) Test
The Combined Multivariate and Collapsing (CMC) method by Li and Leal [Li and Leal 2008] first bins variants within each of a number of regions according to a criterion such as minor allele frequency, then collapses the variants within each bin, and finally performs multivariate testing on the counts across the various bins. See Combined Multivariate and Collapsing (CMC) Test for more information.
- Count Variants per Gene
In order to determine the influence of a variant in a gene or other region it is useful to know how many variants exist for a gene per sample, which can be done with this function. See Count Variants per Gene for more information.
Plot Menu
- Histograms:
Plots one or more histograms of column data in one or more graphs. See Histograms for more information.
- XY Scatter Plots:
Plots one column of data as an independent variable against one or more columns of data as dependent variables. See XY Scatter Plots for more information.
- Numeric Values:
Plots one or more columns of data against row labels or genetic marker map information. See Numeric Value Plots for more information.
- Linkage Disequilbrium
Plots linkage disequilibrium between genotypic columns. LD is plotted on a uniform scale unless a marker map is applied, in that case it is plotted based on genomic distance. See LD Plots for more information.
- Heat Map
Plots an intensity plot for all rows and columns. Intensity values are colored based on a three-color scale based on the minimum, mean and maximum values over all active values. See Heat Maps for more information.
Scripts Menu
This menu is available for launching any scripts to be run on entire spreadsheets. Scripts saved in the directory “../SVS /AppData/userScripts/Spreadsheet/Scripts” will be available from this menu.
Help Menu
- Spreadsheet Help:
Launches the integrated SVS manual chapter on Spreadsheets.
- Email Feedback:
Launches an email message with the appropriate address and subject line depending on the type of email feedback selected.