Using CNAM Optimal Segmenting
10.3 Using CNAM Optimal Segmenting
CNAM Optimal Segmenting represents the second step for performing copy number analysis after importing log2 ratio data
into a project and applying an appropriate genetic marker map. See Genetic Marker Maps Overview for more information.
CNAM optimal segmenting uses both the genetic marker map information and the log2 ratios in the spreadsheet to discover
regions of markers in which the log2 ratios vary significantly from segment to segment. While the genome has numerous
regions of copy number variation, these regions are approximated by the segments found with the CNAM optimal segmenting
algorithm. With high probability, these segments correspond to regions of copy number loss, neutral copy number, or gain in the
data.
Upon segmenting, at least two new spreadsheets are created in the current SVS project: the segment means spreadsheet
and the covariates spreadsheet. The segment means spreadsheet lists every region computed, its beginning and ending
marker, and the segment mean log2 ratio value for every sample within that region. A covariate segment is created for all
start and end positions for all samples. Each sample will have exactly the same number of covariates. The value of a sample’s
covariates is determined by the segment mean for the segment that the covariate start and end positions are
contained in. The covariates spreadsheet can be output in one of two formats: either a column is created for
every active marker in the spreadsheet that was segmented, or a column is created for the first marker in
every covariate segment. Optionally, a Wiggle file may also be generated which contains the locations of these
regions.
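As a rough illustration of how segment means become covariates in the two output formats, the following Python sketch uses made-up log2 ratios and cut-points. The function names and the cut-point representation (indices at which a new segment begins) are hypothetical and are not part of SVS.

```python
from statistics import mean

def segment_means(log2_ratios, cut_points):
    """Return (first_marker, last_marker, mean) for each segment of one sample.

    cut_points holds the marker indices at which a new segment begins
    (an assumed representation; index 0 is implied).
    """
    bounds = [0] + sorted(cut_points) + [len(log2_ratios)]
    return [(s, e - 1, mean(log2_ratios[s:e]))
            for s, e in zip(bounds[:-1], bounds[1:])]

def per_marker_covariates(log2_ratios, cut_points):
    """Format 1: one column per marker, each holding its segment's mean."""
    columns = []
    for first, last, seg_mean in segment_means(log2_ratios, cut_points):
        columns.extend([seg_mean] * (last - first + 1))
    return columns

def first_marker_covariates(log2_ratios, cut_points):
    """Format 2: one column per segment, keyed by the segment's first marker."""
    return {first: seg_mean
            for first, _, seg_mean in segment_means(log2_ratios, cut_points)}

# Made-up data: eight markers with a gain spanning markers 3 to 5.
sample = [0.0, 0.1, -0.1, 0.9, 1.1, 1.0, 0.0, 0.0]
cuts = [3, 6]
print(per_marker_covariates(sample, cuts))
print(first_marker_covariates(sample, cuts))
```

The per-marker format repeats each segment mean across all markers in the segment, while the first-marker format keeps one value per segment.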
Options and other fields within the CNAM Optimal Segmenting tool are described below (see Figure 81).
Log2 Ratio Spreadsheet
In order to use CNAM optimal segmenting on a spreadsheet, the spreadsheet must contain log2 ratios and have a genetic marker map applied. From this spreadsheet, select Analysis > CNAM Optimal Segmenting.
Selecting Chromosomes
For large datasets, it is better to segment only one chromosome or a few chromosomes at a time. Because CNAM optimal
segmenting does not segment across chromosomal boundaries, results will not change when the segmenting is subdivided by
chromosome.
To select a chromosome or a few chromosomes, use the Select > Activate by Chromosomes option then select the chromosomes you wish to segment. It is not necessary to create a subset spreadsheet, as the segmenting algorithm will only run on active numeric columns.
Chromosome Segmenting Options
Variations of the CNAM optimal segmenting algorithm for obtaining the regions of CNV are documented in the Formulas and Theories chapter; see CNAM Optimal Segmentation Algorithm. Certain parameters for this algorithm may be changed within these segmenting options.
Algorithm
CNAM offers two types of segmenting methods, univariate and multivariate. These methods are based on the same
algorithm, but use different criteria for determining cut-points denoting CNV boundaries.
The multivariate method segments all samples simultaneously, finding general CNV regions which may be similar across
all samples. This method is preferable for finding very small CNV regions. For a given sample, the covariate is the mean of
the log2 ratios within each segment for that sample. These covariates can then be used for association analysis. This model
makes the tenuous assumption that for a given disease, the beginning and end of a CNV region will be similar for subsets of
the cases. That is, if the regions are conserved for enough cases it is expected there is sufficient power to find a statistical
association. If this assumption holds true, very small CNV regions can be found because the signal will be assessed over
multiple samples.
In reality, there may not always be consistent CNV regions across multiple samples. The univariate method segments each sample separately, finding the cut-points of each segment for each sample individually, and creates a spreadsheet showing all unique cut-points found among all samples. It discovers the optimal segments for each sample and outputs the mean, for every sample, of every unique segment found across all samples. This output can be displayed in one of two formats, ready for subsequent association analysis or for plotting results. The output spreadsheets are discussed in Outputs from CNAM Optimal Segmenting.
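The way univariate results are combined across samples can be sketched as follows. This is a minimal illustration with made-up data, assuming each sample's cut-points are already known; the function names are hypothetical and the segmenting itself is not shown.

```python
from statistics import mean

def union_cut_points(per_sample_cuts):
    """All unique cut-points found among all samples, in marker order."""
    return sorted(set().union(*per_sample_cuts.values()))

def means_on_unique_segments(data, per_sample_cuts):
    """Mean log2 ratio, for every sample, of every unique segment."""
    n_markers = len(next(iter(data.values())))
    bounds = [0] + union_cut_points(per_sample_cuts) + [n_markers]
    segments = list(zip(bounds[:-1], bounds[1:]))
    return {name: [mean(values[s:e]) for s, e in segments]
            for name, values in data.items()}

# Made-up log2 ratios for two samples over six markers.
data = {
    "S1": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
    "S2": [0.0, 0.0, 0.0, 0.0, -1.0, -1.0],
}
# Cut-points found separately for each sample (assumed, not computed here).
cuts = {"S1": [2, 5], "S2": [4]}

result = means_on_unique_segments(data, cuts)
print(union_cut_points(cuts))
print(result)
```

Because every sample is summarized over the same union of cut-points, each sample ends up with the same number of covariates even though its own segmentation differed.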
Univariate outlier removal
The univariate outlier removal option helps to address the influence of large negative or large positive values on determining segment boundaries. It works by excluding found cut-points that bracket single-marker segments before running permutation tests to determine the strength of the segment. This option is only valid when the minimum number of markers per segment is set to 1. If outliers are not removed and the minimum number of markers per segment is set to a number greater than one, a single-marker outlier could force adjacent markers to create a segment that is driven only by the single outlier. This would inflate the number of segments that have the minimum number of markers allowed, and would incorrectly specify boundaries if the number of markers in the region was actually less than the minimum number of markers allowed in a segment. If the minimum number of markers in a segment is set to one and the univariate outlier removal box is not checked, then single-marker segments will be found, but they will not be deemed significant with permutation testing. As a result, the algorithm looks for fewer segments at the expense of the larger, real segments. See CNAM Optimal Segmentation Algorithm for more details on this option.
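The idea of excluding cut-points that bracket single-marker segments can be sketched as below. This is a simplification under assumed conventions (cut-points are the indices at which a new segment begins); the function name is hypothetical, and the actual procedure is the one documented in CNAM Optimal Segmentation Algorithm.

```python
def drop_single_marker_cuts(cut_points, n_markers):
    """Remove cut-points that bracket a segment containing exactly one marker."""
    bounds = [0] + sorted(cut_points) + [n_markers]
    bracketing = set()
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end - start == 1:          # single-marker segment [start, end)
            bracketing.update((start, end))
    return [c for c in sorted(cut_points) if c not in bracketing]

# Marker 3 forms a one-marker segment between cut-points 3 and 4,
# so both are dropped and only the cut-point at 8 remains.
print(drop_single_marker_cuts([3, 4, 8], n_markers=12))
```

After this step the outlier marker is absorbed into the surrounding segment, so it no longer consumes part of the segment budget before permutation testing.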
Use Moving Window
If “Moving Window” is selected, the segmenting is performed using a moving window of markers that sweeps across each chromosome. Segmentation is constrained to the window, and then the results from each region are combined to produce the whole-chromosome results. This can greatly reduce the run time of the algorithm for large chromosomes, but may also introduce edge effects. CNAM chooses window boundaries in such a way that edge effects are reduced, but still cannot guarantee globally optimal results when using a moving window. The run log contains details on what window boundaries are chosen by CNAM.
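A simplified sketch of windowed segmenting is shown below. It assumes fixed, non-overlapping windows and a toy stand-in segmenter, neither of which reflects how CNAM actually chooses window boundaries or finds segments; all names are hypothetical. Its purpose is only to show why a window boundary can introduce an edge effect.

```python
def window_bounds(n_markers, window_size):
    """Fixed, non-overlapping windows (an assumption; CNAM picks boundaries itself)."""
    return [(start, min(start + window_size, n_markers))
            for start in range(0, n_markers, window_size)]

def segment_with_windows(values, window_size, segment_fn):
    """Run segment_fn inside each window and combine the cut-points."""
    cuts = []
    for start, end in window_bounds(len(values), window_size):
        local_cuts = segment_fn(values[start:end])
        cuts.extend(start + c for c in local_cuts)   # shift to chromosome indices
    return cuts

def toy_segmenter(window):
    """Stand-in segmenter: cut wherever the value changes."""
    return [i for i in range(1, len(window)) if window[i] != window[i - 1]]

inside = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
print(segment_with_windows(inside, 5, toy_segmenter))    # changes fall inside windows

on_edge = [0] * 5 + [1] * 5
print(segment_with_windows(on_edge, 5, toy_segmenter))   # change sits on a window edge
```

In the second call the change falls exactly on a window boundary and is missed by this naive scheme, which is the kind of edge effect that careful boundary selection is meant to reduce.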
Moving window size (markers per window)
The number of consecutive markers analyzed in the moving window. This option is only available if “Use Moving Window” is selected. Smaller moving window sizes speed up the run-time of the algorithm, especially when not using hardware acceleration. Note, however, there is a somewhat higher risk of false discoveries using a moving window approach as there is the potential for anomalies due to looking at a window of data instead of all of it. Permutation testing does minimize this.
Max segments (per 10,000 markers)
Set this to be greater than or equal to the number of CNV regions expected for every 10,000 markers. This puts an upper
limit on the number of segments found per chromosome (or per window if enabled). For a chromosome with fewer than
10,000 markers, this parameter is used as the upper limit. For larger chromosomes, CNAM scales this limit up with the number of markers,
since larger chromosomes are likely to have more CNV regions.
When performing univariate segmentation, smaller values for “Max segments” are ideal for detecting large rare variants. Smaller values also help keep the run-time manageable. Large “Max segments” values can detect common smaller variants, but also suffer from increased false positives due to additional multiple testing. Consider using the multivariate algorithm to detect small, common CNVs.
When performing multivariate segmentation, more segments are usually needed to detect smaller, common CNVs. The multivariate algorithm’s performance also scales better with more segments, so increasing this will have less effect on run-time compared to the univariate algorithm.
Min #markers per segment
This constrains the algorithm to only find CNV regions with this minimum number of markers in each
segment.
This parameter allows you to prevent finding CNV regions based on short spans of noise. In general, the permutation testing should prevent small spurious segments from showing up, but a good default for this parameter is 1 marker with univariate outlier removal on for univariate analysis. For multivariate analysis, a minimum of 1 marker is still a good default. It is important to take into account any outliers in the log2 ratios for a sample. Outliers can still drive the segmentation results even after permutation testing, although their effect is minimized. To remove their effect, use the univariate outlier removal option.
Max pairwise segment p-value
The “Max segments” parameter sets an upper bound on the number of segments found. However, the problem remains to
determine the actual number of valid CNV regions in the data. The process is as follows: once a set of k segments is found, each
pair of adjacent segments is compared through a permutation testing procedure. If every pair is statistically significant
according to the “Max pairwise segment p-value”, then the k-way split is retained. Otherwise, the algorithm repeatedly
decreases k by one until every segment is significantly different from its neighbors or no segments are found,
whichever comes first.
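The loop that reduces k can be sketched as follows. Both callables passed in are hypothetical stand-ins: `split_into` represents whatever produces a k-way split, and `pair_p_value` represents the permutation test for two adjacent segments. The toy versions used here exist only to make the sketch runnable.

```python
def select_split(values, max_segments, split_into, pair_p_value, p_max):
    """Decrease k until every adjacent pair of segments is significant.

    split_into(values, k)   -> boundaries [0, ..., len(values)] giving k segments
    pair_p_value(left, right) -> p-value for two adjacent segments
    Returns the retained boundaries, or None if no segments are found.
    """
    for k in range(max_segments, 1, -1):
        b = split_into(values, k)
        adjacent = [(values[b[i]:b[i + 1]], values[b[i + 1]:b[i + 2]])
                    for i in range(len(b) - 2)]
        if all(pair_p_value(left, right) <= p_max for left, right in adjacent):
            return b
    return None

# Toy stand-ins (assumptions for illustration only).
def equal_width_split(values, k):
    return [len(values) * i // k for i in range(k + 1)]

def toy_p_value(left, right):
    diff = abs(sum(left) / len(left) - sum(right) / len(right))
    return 0.0 if diff > 0.5 else 1.0

values = [0.0] * 4 + [1.0] * 4
print(select_split(values, 4, equal_width_split, toy_p_value, p_max=0.01))
```

With these toy inputs the 4-way and 3-way splits each contain an insignificant pair, so k is reduced to 2 before the split is retained.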
Larger p-values increase sensitivity by rejecting fewer segment pairs, but also increase the false-discovery rate. Conversely,
smaller p-values decrease the false-discovery rate but also decrease sensitivity. Smaller p-values also require more
permutations to accurately test, and can significantly increase the segmentation running-time.
CNAM uses random permutation testing to estimate the p-value for each segment pair. CNAM evaluates
a number of random permutations of the log ratios from the segment pair that depends on pmax, where pmax is this parameter. Each permutation is checked to see if it
has a better split (smaller sum of squared deviations from the means) than the original input segments. If the
proportion of random permutations that have a better split is greater than pmax, then the pair is rejected as
insignificant.
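A minimal sketch of this kind of permutation test is given below. The number of permutations is passed in as a hypothetical `n_permutations` argument rather than derived from pmax, and the sketch assumes that the "better split" of a permutation is the best two-way split at any position; both are assumptions, as are all function names.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split_sse(values):
    """Smallest total SSE over all two-way splits of the sequence."""
    return min(sse(values[:i]) + sse(values[i:]) for i in range(1, len(values)))

def pair_is_significant(left, right, p_max, n_permutations, seed=0):
    """Keep the pair unless too many permutations split better than the original."""
    rng = random.Random(seed)
    observed = sse(left) + sse(right)
    pooled = list(left) + list(right)
    better = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                  # random permutation of the log ratios
        if best_split_sse(pooled) < observed:
            better += 1
    return better / n_permutations <= p_max

# A clean step between two segments versus two segments of pure noise.
step_left, step_right = [0.0] * 10, [1.0] * 10
noise_left = [0.1, -0.1, 0.05, -0.05]
noise_right = [-0.1, 0.1, -0.05, 0.05]

print(pair_is_significant(step_left, step_right, p_max=0.01, n_permutations=200))
print(pair_is_significant(noise_left, noise_right, p_max=0.01, n_permutations=200))
```

The clean step cannot be beaten by any permutation, so the pair is kept, while the noise pair is beaten by shuffled data and is rejected as insignificant.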
Segment means output
These options select which segment means output to generate, see Outputs from CNAM Optimal Segmenting for details.
Log output
Here you can enable the Full Logging option. This option outputs extra messages that more thoroughly detail CNAM’s activity.
Hardware Options
Several options exist to improve CNAM’s performance on modern computers.
Number of CPU Threads
Both the Univariate and Multivariate algorithms can take advantage of multi-processor or multi-core machines by
performing some of their work in parallel threads. It is usually a good idea to match this number to the number of
computational cores you have available on your system. The number of cores detected will be displayed to the right of this
option.
This option affects only the number of threads on the system CPU, not the number on any hardware-accelerated devices (such as GPUs). For accelerated devices, CNAM automatically chooses the ideal number of threads. However, some operations (such as permutation testing) do not use hardware acceleration, so this option should still be set correctly even when using hardware acceleration.
Use Hardware Acceleration
CNAM can now take advantage of Graphics Processing Units (GPUs) and other OpenCL-compatible devices to speed up
segmentation. This option can dramatically improve performance without sacrificing accuracy. To use this option, you will
need a device that supports OpenCL, such as a modern graphics card. You will also need up-to-date drivers to ensure full
support.
If you have more than one OpenCL capable device, you can use the device drop-down to choose which one you want
CNAM to use. Currently, CNAM does not support using multiple OpenCL devices simultaneously. For device details and
troubleshooting information, click the OpenCL Info... button.
Note for Windows Remote Desktop users: Most GPUs cannot be used when running SVS via Remote Desktop. This is because Remote Desktop sessions use a special video driver that is incompatible with OpenCL. Hopefully a workaround will be available in the future.
Specify Memory Limit
This option allows users to fine-tune the memory usage for multivariate segmentation. When left unchecked, SVS will estimate a good memory limit based on your current hardware. Specifying this parameter can improve performance on high-end hardware, or improve stability on low-end hardware.
If you are using a GPU, this option limits the amount of video memory CNAM will use. If using a CPU, it will limit the amount of system RAM used. It is a good idea to make this limit smaller than the total amount of memory available in order to leave room for the operating system, device drivers, and other software.
Optional Output Files
On the Optional Output tab, checking the Optional Bookmark File Output box exports the segment means to a
UCSC Wiggle Track (WIG) format file for Genome Browser import. Use the Browse button for file name
selection.
If WIG files are output while using the univariate segmenting algorithm, the Browse button will instead have you select a directory, as a WIG file will be generated for each sample. These files will be named using the sample name from the file.
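For orientation, the sketch below builds the lines of a simple UCSC wiggle track in variableStep form, with one value per marker position. It is an illustration of the public WIG format with made-up data, not necessarily the exact layout SVS writes; the function name and sample values are hypothetical.

```python
def wig_lines(track_name, chrom, positions, segment_values):
    """Build the lines of a variableStep wiggle track.

    positions are 1-based marker positions on the chromosome;
    segment_values holds the segment mean assigned to each marker.
    """
    lines = [
        f'track type=wiggle_0 name="{track_name}"',
        f"variableStep chrom={chrom}",
    ]
    lines += [f"{pos} {val:.4f}" for pos, val in zip(positions, segment_values)]
    return lines

# Made-up positions and segment means for one sample.
positions = [1000, 2500, 4000, 9000]
values = [0.02, 0.02, 0.85, 0.85]

for line in wig_lines("S1", "chr1", positions, values):
    print(line)
```

For univariate output, one such track per sample would be written to its own file in the chosen directory.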
Excluding Markers
If desired, markers can be excluded from the segmenting algorithm and its results by inactivating the columns corresponding to those markers.
Run Log
A log is shown in this window informing you of the progress in segmenting sub-regions of the total region of
markers being analyzed. If the number of segments found in a given window is equal to the maximum number of
segments per window, a warning message will be printed in red, suggesting the user consider increasing that
parameter.
NOTE:
- During processing, a normal progress bar is also shown in a separate window.