‹‹ Back to SVS Home
Working with a Single Spreadsheet
5.2 Working with a Single Spreadsheet
There are several options available within a spreadsheet to prepare data for analysis. These operations are detailed below,
along with a few other essential spreadsheet operations.
Copying Spreadsheet Information to Clipboard
Right-click menus for Column Headers, Row Labels, and Cells allow spreadsheet fields to be copied into the clipboard for easy copy and pasting either within the program or into a third-party program such as a document editor. If a marker map is applied to a spreadsheet, then information in the marker map fields for a particular row or column (depending on the orientation of the marker map) can also be selected and copied through the right-click menu.
Finding Strings or Values in a Spreadsheet
SVS provides an enhanced dialog for finding strings or parts of strings in not only the spreadsheet data fields, but also in
column name headers, row labels and marker map fields. The Find dialog can be opened by either clicking on the magnifying
glass in the tool bar or by going to Edit > Find (see Figure 29).
To find a particular string or value, enter it in the text box. There are three matching methods available, Full Match, Partial Match and Regular Expression.
The “Full Match” option requires that the data field or value fully match the search string or value.
If it is desired that all fields contain the string or value be found, including partial matches, then the
“Partial Match” option (default) should be selected. For example, if “1” is entered in the text box and the
“Partial Match” option is selected, then the first field that contains a “1” anywhere in the specified search
area will be found. If the “Full Match” option is selected then the first field containing only a “1” will be
found.
NOTE:
- This search (with the Case sensitive check box unchecked) is case-insensitive. This means that capital letters
will be considered to match lower-case letters when searching for matching column headers, row labels, or data
cells.
If you check the Case sensitive check box, you will force capital letters to match capital letters and lower-case letters to
match lower-case letters in the search.
It is necessary to specify where to search for a string or value. The string or value can be searched for in:
- Column Names
- Row Labels
- Map Field (A particular map field available in an applied genetic marker map must be selected, such as chromosome)
- Data
The search process can either start at the beginning of a spreadsheet, or at the upper left corner of the
selected search field. The first occurrence of the string will be visible in the upper left corner of the searched
field.
In cases where there are multiple occurrences of a search string, click on F3 or go to Edit > Find Next to find the next
field. Repetitive use of Find Next will cycle through all occurrences of the string or value to be found. When no other fields
match the search string, a message will be displayed indicating as such.
Recode:
This sub-menu contains several options for recoding spreadsheet data, such as genotypes and alleles, and renaming row labels and column headers. The functions available include:
- Recode Genotypes
- Add Allele Delimiter
- Recode AGCT Alleles to AB
- Rename Alleles
- Rename Marker Mapped Row Labels
- Rename Marker Mapped Column Headers
Each option is explained in detail below.
Recode Genotypes
This feature provides options for recoding genotypic data based on major or minor alleles, flipping DNA strands for
AGCT encoded genotypes, or transcoding AB to AGCT encoding based on a marker map field including Allele A/B
information.
The genotypic data for each marker is recoded based on the frequency of each allele for each marker. For example, if there
are only three occurrences of the A allele in a column and 2507 occurrences of the B allele in a column, then the
A alleles will be replace with “D” (the minor allele) and the B alleles will be replaced with “d” (the major
allele). If the genotypes for one marker are all homozygous in the same allele (say there are zero occurrences of
the A allele and 2510 occurrences of the B allele) then the non-missing alleles in the entire column will be
replaced with the major allele “d”. If genotypes are missing for an entire marker, the values will remain missing,
“?_?”
To open this window go to Edit > Recode > Recode Genotypes.
The options for recoding are as follows, (see Figure 30):
- Encode genotypes as “DD”, “Dd”, or “dd”: Produces a new spreadsheet where the bi-allelic genotypes are coded as minor allele homozygous (DD), heterozygous (Dd) and major allele homozygous (dd) for each genotypic column. If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (dd). This is useful as it gives insight into whether the genetic models view a specific sample as having two minor alleles (DD), one minor allele (Dd) or no minor alleles (dd).
- Encode genotypes numerically based on genetic model: Produces a new spreadsheet where the bi-allelic
genotypes are coded as 2, 1, or 0 based on a specific genetic model and the number of either major or minor
alleles.
- Additive model: Encodes bi-allelic genotypes based on the number of copies of the minor allele. Thus “DD” = 2, “Dd” = 1, and “dd” = 0. If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (“dd” = 0).
- Dominant model: Encodes bi-allelic genotypes based on the presence of a copy of the minor allele. Thus “DD” = 1, “Dd” = 1, and “dd” = 0.If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (“dd” = 0).
- Recessive model: Encodes bi-allelic genotypes based on the presence of two copies of the minor allele. Thus “DD” = 1, “Dd” = 0, and “dd”=0. If the genotypes for one marker are homozygous in the same allele then the genotypes are coded as major allele homozygous (“dd” = 0).
- Flip DNA strands for AGCT encoded genotypes: Takes a bi-allelic genotype encoded in the AGCT format and flips the order of the alleles. In other words, the genotypes “A_A”, “A_G”, and “G_G” would become respectively “T_T”, “T_C”, and “C_C”.
- Transcode AB to AGCT encoding using mapping: Takes a marker map field such as “Allele A/B” and transforms bi-allelic genotypes “A_B” to AGCT genotypes using the marker map field. For example, if for a given marker the “Allele A/B” field contains [C/T] then the genotypes “A_A”, “A_B”, and “B_B” will be transcoded to “C_C”, “C_T”, and “T_T” respectively.
Recoding will create a new spreadsheet. The new spreadsheet can be created as a child of the project root or a child of the current spreadsheet. In either case no dependencies are created, but in the case of creating the spreadsheet as a child of the project root a new dataset is created.
Add Allele Delimiter
This function will convert categorical genotypes such as AA, AG without a delimiter to the standard genotype format accepted by SVS (A_A, A_G, etc). The user is prompted for a missing value indicator, such as ’?’.
The appropriate columns for this function will be recognized as Categorical upon import, then be converted to Genotype by the function. Only the appropriate categorical columns should be active in the spreadsheet, as the function works on all categorical columns, which may not be appropriate.
Recode AGCT Alleles to AB
This function recodes AGCT alleles to the standard genotype format of A_A, A_B, and B_B. The marker map fields are scanned and a list of appropriate fields is generated. The user is prompted to choose a marker map strand field to use for recoding from the generated list.
The first allele listed in the reference allele string is recoded as the ’A’ allele, and the second allele listed is recoded as the ’B’ allele. Missing genotypes remain missing. The resulting recoded spreadsheet has the original marker map reapplied.
Rename Alleles
This function scans genotype columns for allele names (e.g. 1,2,3,4), then opens a dialog listing the current alleles and prompts for how the alleles should be renamed. Missing alleles remain missing.
The new assignment will apply to all genotype columns, not on an individual column basis. The marker map on the original spreadsheet will be reapplied to the new spreadsheet.
Rename Marker Mapped Column Headers
Renaming column headers with information from a marker map file requires a marker mapped spreadsheet. It allows the
Column Name Headers to be replaced with a string field in the applied marker map.
Only string fields in the marker map are displayed in the drop-down box, as only string fields can be used as Column
Name Headers. Columns that are not renamed can either be kept in the new spreadsheet with their original
column headers or dropped. The columns that are not renamed will be dropped if the appropriate box is
checked.
Once a marker map string field is selected to be used as the new Column Name Headers, a new spreadsheet is created as
a child of the project root or a child of the current spreadsheet. In either case, no dependencies are created, but in the case of
creating a spreadsheet as a child of the project root a new dataset is created.
A new marker map is also created with the new marker names.
Rename Marker Mapped Row Labels
Renaming row labels with information from a marker map file requires a marker mapped spreadsheet. It allows the Row
Labels to be replaced with a string field in the applied marker map.
Only string fields in the marker map are displayed in the drop-down box, as only string fields can be used as Row Labels.
Rows that are not renamed can either be kept in the new spreadsheet with their original row labels or dropped. The columns
that are not renamed will be dropped if the appropriate box is checked.
Once a marker map string field is selected to be used as the new Row Labels, a new spreadsheet is created as a child of
the project root or a child of the current spreadsheet. In either case, no dependencies are created, but in the case of creating
a spreadsheet as a child of the project root a new dataset is created.
A new marker map is also created with the new marker names.
Convert to Pedigree Spreadsheet
A binary case/control or affection status column is required to use this feature. All other fields can be generated.
Select the pedigree fields from the drop down menus (see Figure 32). Only columns of the correct type will be available in the menus. The required column types are:
- Affection Status: Binary Column (Values of: 0/1 or ?/0/1)
- Patient ID: Row Label column, or any Integer or Categorical column
- Sex: Binary Column (Values of: 0/1 or ?/0/1)
- Family ID: Row Label column, or any Integer or Categorical column
- Mother ID: Any Integer or Categorical column
- Father ID: Any Integer or Categorical column
The spreadsheet can be created as either a child of the Project Root or of the Current Spreadsheet.
Row Select Operations
There are several operations for row selection as described below.
Activate/Inactivate All Rows
To activate all rows use the Select > Row > Activate All Rows command.
To inactivate all rows simultaneously use the Select > Row > Inactivate All Rows command. No analysis operations
can be performed with all rows inactivated. This option is used in the case when it is desired that most, but not all rows, be
set to inactive.
Invert Row Selection
Using the command Select > Row > Invert Row Selection flips the state of the active and inactive rows. By inverting,
all of the inactive rows become active and all of the active rows become inactive.
Selecting a Random Subset of Rows
A random subset of rows can be selected using the Select > Row > Select Row Subset tool. A random subset can be chosen based on:- Random fraction: Specifies what percentage of records to use (default = 0.5).
- Random selection size: Specifies the number of records (default = total number of rows).
- First N items: Specifies the first N rows to use from the spreadsheet (default = total number of rows).
- Reset random seed: Changes the random seed although the default of 1 will do in most cases. Resetting the random seed to 1 picks the same random subset as would have occurred if the program had just started up and no random number generation had taken place.
Randomizing Row Order
Rows can also be randomized using the Select > Row > Randomize Row Order tool. Rows can be reverted to the original sort order by either clicking Unsort at the top left of the spreadsheet, or by clicking on a column header to get the Column Header Menu and selecting Unsort.Column Select Operations
There are several operations for column selection as described below.
Activate/Inactivate All Columns
To activate all columns use the Select > Column > Activate All Columns command.
To inactivate all columns simultaneously use the Select > Column > Inactivate All Columns command. If there are
many columns, it is easier to inactivate all columns and then activate the few columns to be analyzed. At
least one column state needs to be set to either active or dependent for any analysis or subset spreadsheet
options.
Invert Column Selection
For convenience, states of columns can be inverted Select > Column > Invert Column Selection. By inverting, all the
inactive columns become active, and all the active and dependent columns become inactive. This can be useful for splitting
analysis up into two groups of chromosomes.
Selecting Columns by Chromosome/Region Range
In a marker mapped spreadsheet, the map information can be used to select columns (or rows depending on the
orientation of the marker map) within one or more desired ranges of chromosome(s), region(s), or gene(s). Choose Select >
Column > Select Columns by Chromosome/Region Range and the selection dialog will open, see Figure 33. All
genetic columns within the selected ranges will be activated and the others will be inactivated. The states of dependent and
non-genetic columns will be left as is.
Activate By Chromosome
In a marker mapped spreadsheet, the map information can be used to activate all columns corresponding to specified
chromosomes. Choose Select > Activate by Chromosomes, and the chromosome selection dialog will open. All
chromosomes available in the spreadsheet as determined by the corresponding marker map Chromosome field will be listed.
Un-checking all the chromosomes will inactivate all columns corresponding to the chromosomes in the spreadsheet. The
Uncheck All button is best used to clear all check marks and then to select a few chromosomes to activate. Chromosomes
that have active columns in the spreadsheet will be listed at the top of the dialog. All chromosomes can be selected by
clicking on the Check All button.
Activate By Column Type
It is often useful to only want columns of certain types to be active in a spreadsheet. Choose Select > Activate by Column
Type, and the type selection dialog will open. All data types present in the spreadsheet will be listed as check boxes. Check
the data types of the columns you would like to make active. All column that are of an un-checked data type will be made
inactive.
Creating Subset Spreadsheets
If certain rows or columns have been inactivated and need to be removed from a spreadsheet, a subset spreadsheet can be
created. There are three ways to create a subset spreadsheet.
Row Subset Spreadsheet
A new spreadsheet can be created using only the active rows of a spreadsheet. Using the Select > Row > Row Subset
Spreadsheet option creates a new spreadsheet with only the active rows, and all of the columns from the original
spreadsheet. This option keeps the states of all the columns the same in the subset spreadsheet. The Row Subset icon can
also be used to create a row subset spreadsheet.
Column Subset Spreadsheet
A new spreadsheet can be created using only the active columns of a spreadsheet. Using the Select > Column >
Column Subset Spreadsheet option creates a new spreadsheet with only the active or dependent columns, and all of the
rows from the original spreadsheet. This option keeps the states of all the rows the same in the subset spreadsheet. The
Column Subset icon can also be used to create a column subset spreadsheet.
Subset Active Data
A new spreadsheet can be created using only the active rows and active columns of a spreadsheet. Using the Select >
Subset Active Data option creates a new spreadsheet with only the active rows and columns from the original spreadsheet.
This option keeps the state of any dependent columns in the original spreadsheet dependent in the subset
spreadsheet.
Column and Row Spreadsheet Operations
A menu for column operations can be viewed by either left- or right-clicking on a Column Number Header, or by right-clicking on a Column Name Header.
Sorting on a Column
Columns can either be sorted in ascending or descending order. Sorting can be performed on a column by left- or right-clicking on the Column Number Header, or by right-clicking on the Column Name Header and selecting one of the sorting methods (Sort Ascending, Sort Descending, or Unsort). Sorting can be removed by clicking on the Unsort button at the upper left corner of the spreadsheet or by selecting Unsort in the Column Header Menu.Activating/Inactivating Rows Using Column Values
Rows can be activated or inactivated based on the values of a column. The operation depends on the type of column, and all operations are found in the Column Header Menu.- A Binary (B) Column has the following options:
- Activate 1’s: Activates all rows with a 1 in the selected column and inactivates all other rows.
- Activate 0’s: Activates all rows with a 0 in the selected column and inactivates all other rows.
- Inactivate Missings: Inactivates all rows with a “?” in the selected column and keeps the state of all other columns the same. This option will only appear if the column or the column in its associated parent spreadsheet contains missing data.
- A Categorical (C) or Genotype (G) Column has the option to Activate By Category.
- Clicking on this menu operation brings up the category selection dialog. One or more categories can be selected by
either using the <Ctrl> or <Shift> key and left-clicking on the desired categories to have active. All categories that
were not selected will be inactive.
NOTE:- Missing data is considered a category of categorical or genotypic data.
- Clicking on this menu operation brings up the category selection dialog. One or more categories can be selected by
either using the <Ctrl> or <Shift> key and left-clicking on the desired categories to have active. All categories that
were not selected will be inactive.
- A Real or Integer column has the option to Activate By Threshold.
- The parameters needed for this dialog are the direction of the threshold to activate and the numerical threshold value. For example, if you want to activate all rows with column values greater than or equal to 0.986, then you would select >= in the drop-down menu and type 0.986 into the text box in the Activate By Threshold dialog. Rows with missing data for integer or real columns can be inactivated by selecting Inactivate Missings.
Plotting from a Column
A column of numeric values can be plotted using the options in the Column Header Menu:
- Plot Variable: Plots the columns of data against row labels or genetic marker map information. See Numeric Value Plots for more information.
- Plot Histogram: Plots a histogram of the selected column data. See Histograms for more information.
- Plot X: [Column Name Header] vs Y:Dependent(s): Plots the selected column of data as an independent variable against one or more columns of data set as dependent variables. This option requires at least one column state be set to dependent (column is magenta). See XY Scatter Plots for more information.
Copy Column Header, Row Label and Marker Map Information
Column Header, Row Label and Marker Map information can be copied from the appropriate column or row menu. Marker Map fields are viewable from the Copy submenu.
Online Database Links
Several hyperlinks to online databases are available when accessing a menu from a marker-mapped column.
Column Average
This function calculates the average of real- and integer-valued columns.Value Counts
This function counts the occurrences of values of a categorical, binary, genotypic and integer column.Transposing Spreadsheets
Spreadsheets can be transposed to facilitate analysis and/or plotting. It is important to note that only columns of the same
data type can be transposed. If there were three columns, a binary column, a categorical column, and a real-valued column,
transposing would create a binary row, a categorical row and a real-valued row, however, the column type would have to be
the lowest common divisor, which would be categorical, so most analysis and plotting properties would be
lost.
The Transpose Spreadsheet option is located in the Edit menu, Edit > Transpose Spreadsheet.
Transpose Spreadsheet Options
There are several options to use for transposing the data in a spreadsheet.
- New Dataset Name: The default dataset name can be quite long and is based on the name of the spreadsheet being transposed. It is recommended that this default name be edited to a more informative name.
- Data Type to Transpose: In the drop down list box, the possible types of data to be transposed are listed as well as the number of columns with that data type in parentheses.
- Label for Column Names: The new Row Label Header default is a generic “Columns”, a more informative name should be chosen such as “SNPs” or “Genetic Markers”.
- Memory to Use: (Only available for x64-bit versions.) Depending on the size of the data set to be transposed, the process could be memory intensive. Changing this default changes the size of the memory cache. A larger value could make the transposing process run faster, but this value should not exceed 70% of the available RAM on the computer.
- Spreadsheet as Child of: The spreadsheet can either be created as a child of the project root or of the current spreadsheet. The advantage to creating the transposed spreadsheet as a child of the project root is that the original spreadsheet can be deleted. However, sometimes it is advantageous to have all related spreadsheets in a parent/child chain to easily identify how the spreadsheets were created.
Create Top-Level Spreadsheet
A top-level spreadsheet can be created to remove all dependences between navigator window nodes so that unnecessary
navigator window nodes can be deleted. A project that only contains top-level spreadsheets will be larger in size than a
project that allows some dependent relationships between navigator window nodes.
This option is located in the File menu, File > Create Top-Level Spreadsheet. A dialog prompting for a new dataset
name will open. It is recommended that the dataset name be changed to an informative dataset name from the default, if the
default is not informative.