Quality Assurance Procedures
7.2 Quality Assurance Procedures
Derivative Log Ratio Spread
The derivative log ratio spread (DLRS) is a measurement of point-to-point consistency or noisiness in log ratio data. Samples with higher values of DLRS tend to have poor signal-to-noise properties. DLRS was originally developed for use in aCGH analysis. The measurement is based on absolute differences in LR values at consecutive points, rather than deviations from a baseline value. This property makes DLRS robust against signals from true copy number variants, because only the first and last markers in each segment (rather than all markers in a segment) will have a large deviation from normal values.
To calculate the DLRS, open a spreadsheet containing marker-mapped log ratio data and choose Quality Assurance >CNV >Derivative Log Ratio Spread.
The spreadsheet output contains DLRS values for each sample, per chromosome and overall, as well as the median DLRS value per chromosome.
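As a rough illustration of why a difference-based measure is robust to true copy number segments, the sketch below uses one common DLRS definition from the aCGH literature: the interquartile range of consecutive log ratio differences, scaled by 1.349 and the square root of 2 so that it estimates the noise standard deviation. This section does not state the exact formula the software uses, so the scaling and the function name `dlrs` are assumptions.

```python
import math
import statistics


def dlrs(log_ratios):
    """Spread of consecutive log ratio differences for one sample.

    Assumed definition (not confirmed by this manual):
    IQR(differences) / (1.349 * sqrt(2)).
    """
    if len(log_ratios) < 3:
        raise ValueError("need at least three markers")
    # Differences between each marker and the next one along the chromosome.
    diffs = [b - a for a, b in zip(log_ratios, log_ratios[1:])]
    q1, _median, q3 = statistics.quantiles(diffs, n=4)
    return (q3 - q1) / (1.349 * math.sqrt(2))


# A noise-free profile with one gained segment: only the two segment
# boundaries produce non-zero differences.
clean_with_cnv = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
# A profile with no segment but large point-to-point noise.
noisy = [0.3, -0.3] * 6

print(dlrs(clean_with_cnv), dlrs(noisy))
```

The segment in `clean_with_cnv` does not raise the score, while the point-to-point noise in `noisy` does.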
Percentile Based Winsorizing
Calculates thresholds for the top and bottom percentiles of log ratio data, as specified by the user, for the purpose of winsorizing - replacing extreme log ratio values with the calculated thresholds. Winsorizing data prevents segmentation algorithms from being driven by outlier values and results in a more accurate determination of regions of copy number variation.
For autosomes, the median threshold is used to winsorize the data. Values less than the lower threshold are replaced with the lower threshold, and values that are greater than the upper threshold are replaced with the upper threshold. For non-autosomes, the thresholds for each chromosome are used.
To use, open a marker-mapped spreadsheet containing log ratio data, with samples as columns. Select Quality Assurance >CNV >Percentile Based Winsorizing and enter percentile thresholds to be used for winsorizing in the window, or leave the defaults of 0.002 and 0.998.
The spreadsheet output will contain the same information, with the extreme values winsorized.
NOTE:
- This takes about 56 minutes to process 3500 samples by 500k markers on a 32-bit Windows Dual-Core 2.33 GHz computer.
- The resulting spreadsheet can be used for plotting individual samples. It must be transposed before it can be used for analysis, in particular Copy Number Segmentation.
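The replacement step can be sketched for a single vector of log ratios as follows. The percentile method (linear interpolation between order statistics) and the function names are assumptions for illustration; the sketch does not reproduce the per-chromosome thresholds or the median threshold used for autosomes.

```python
def percentile(sorted_values, p):
    """Percentile for p in [0, 1] using linear interpolation (assumed method)."""
    if not sorted_values:
        raise ValueError("no values")
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def winsorize(values, lower_p=0.002, upper_p=0.998):
    """Replace values outside the percentile thresholds with the thresholds."""
    ordered = sorted(values)
    lower = percentile(ordered, lower_p)
    upper = percentile(ordered, upper_p)
    return [min(max(v, lower), upper) for v in values]


# Wide percentiles are used here only so the effect is visible on five values.
print(winsorize([1, 2, 3, 4, 100], lower_p=0.25, upper_p=0.75))
```

With the default percentiles of 0.002 and 0.998, only the most extreme values in a long vector are changed.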
Autosome Heterozygosity
Calculates the heterozygosity rate of autosomes from a marker-mapped spreadsheet containing genotypic data. From an appropriate spreadsheet, choose Quality Assurance >Genotype >Autosome Heterozygosity. The spreadsheet output (Autosome Heterozygosity Rate) contains the heterozygosity rate for each sample, by chromosome and overall.
Filter Samples by Call Rate
Call rates are calculated for all samples, and samples with call rates below a user-specified threshold are deactivated. If at least one sample, but not all of the samples, is deactivated, a subset of active rows is created.
From a spreadsheet containing several genotypic columns, choose Quality Assurance >Genotype >Filter Samples by Call Rate. The spreadsheet output (Sample Statistics) contains two columns; the first column contains the number of markers (not including missing values) and the second column contains the call rate, defined as the number of non-missings divided by the number of columns.
If at least one row, but not all of the rows, is deactivated, a subset spreadsheet will also be created.
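The two output columns and the deactivation rule can be sketched as below. Representing a missing genotype as `None`, the dictionary layout, and the function name are assumptions for illustration only.

```python
def call_rate_filter(samples, min_call_rate):
    """Return {sample: (non_missing_count, call_rate, is_active)}.

    `samples` maps a sample name to its list of genotype calls,
    with None standing in for a missing call (assumed encoding).
    """
    result = {}
    for name, calls in samples.items():
        non_missing = sum(1 for c in calls if c is not None)
        # Call rate: non-missing calls divided by the number of columns.
        rate = non_missing / len(calls) if calls else 0.0
        # Samples with a call rate below the threshold are deactivated.
        result[name] = (non_missing, rate, rate >= min_call_rate)
    return result


samples = {
    "s1": ["A_A", "A_B", "B_B", "A_A"],
    "s2": ["A_A", None, None, None],
}
print(call_rate_filter(samples, min_call_rate=0.5))
```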
SNP Concordance
SNP concordance rates are calculated from two spreadsheets that are assumed to contain the same samples, assayed for the same SNPs on two separate occasions. The two spreadsheets are assumed to be in the same row order, but not necessarily the same column order. Two output spreadsheets are created; the first contains concordance rates by SNP and the second contains concordance rates by sample.
From an appropriate spreadsheet, select Quality Assurance >Genotype >SNP Concordance by SNP. You will be prompted to select another spreadsheet. This spreadsheet should match the description above.
The first output spreadsheet, SNP Concordance by SNP, has the following columns:
- The first column contains the percentage of calls in the first spreadsheet that match the calls in the second spreadsheet, or were missing in the first spreadsheet, for each SNP. The column header contains the average value over all SNPs. For example, if the column header were Call1/0.96, then on average 96% of the calls in the first spreadsheet matched the calls in the second spreadsheet or were missing in the first spreadsheet.
- The second column contains the percentage of calls in the second spreadsheet that match the calls in the first spreadsheet, or were missing in the second spreadsheet.
- The third column contains the percentage of calls that matched exactly in both spreadsheets.
- The fourth column contains the percentage of calls that differed between the two spreadsheets, not including missing calls.
- The fifth column contains the percentage of calls that were missing in the second spreadsheet, but were not missing in the first spreadsheet.
- The sixth column contains the percentage of calls that were missing in the first spreadsheet, but were not missing in the second spreadsheet.
- The seventh column contains the percentage of calls that were similarly missing in both spreadsheets.
The second output spreadsheet contains the same information as described above, except calculated for each sample instead of each SNP.
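The third through seventh columns partition every pair of calls into five mutually exclusive outcomes, which can be sketched for a single SNP (or a single sample) as follows. Missing calls are represented as `None` and genotypes are compared as plain strings; both choices, and the function name, are assumptions. The first two columns are left out because their exact definition is not fully specified here.

```python
def concordance_breakdown(calls1, calls2):
    """Fraction of call pairs falling in each of five exclusive categories."""
    if len(calls1) != len(calls2) or not calls1:
        raise ValueError("need two non-empty call lists of equal length")
    counts = {
        "match": 0,
        "mismatch": 0,
        "missing_in_second_only": 0,
        "missing_in_first_only": 0,
        "missing_in_both": 0,
    }
    for a, b in zip(calls1, calls2):
        if a is None and b is None:
            counts["missing_in_both"] += 1
        elif a is None:
            counts["missing_in_first_only"] += 1
        elif b is None:
            counts["missing_in_second_only"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    total = len(calls1)
    return {key: value / total for key, value in counts.items()}


first = ["A_A", "A_B", None, "B_B", None]
second = ["A_A", "B_B", "A_A", None, None]
print(concordance_breakdown(first, second))
```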
SNP Density
Reports various SNP density statistics across all markers in a marker-mapped spreadsheet. To calculate the statistics, open a marker-mapped spreadsheet and select Quality Assurance >Genotype >SNP Density. The following statistics will appear in a window: Minimum Gap (bp), Maximum Gap (kb), Average Gap (kb), and SNP Density (1 SNP per X.XXkb).
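The gap statistics can be sketched for the markers of a single chromosome as below. The function name and the restriction to one chromosome are assumptions; the SNP Density figure is not computed because its exact definition is not given here.

```python
def gap_statistics(positions_bp):
    """Minimum, maximum and average gap between neighbouring markers."""
    ordered = sorted(positions_bp)
    if len(ordered) < 2:
        raise ValueError("need at least two marker positions")
    # Gap between each marker and the next one, in base pairs.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "min_gap_bp": min(gaps),
        "max_gap_kb": max(gaps) / 1000,
        "avg_gap_kb": sum(gaps) / len(gaps) / 1000,
    }


print(gap_statistics([1000, 3000, 6000, 10000]))
```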
X Heterozygosity Gender Inference
Predicts the gender of samples by examining genotype columns mapped to the X chromosome. It must therefore be run from a spreadsheet that contains marker-mapped genotypic columns from the X chromosome, with samples as rows.
From an appropriate spreadsheet, select Quality Assurance >Genotype >X Heterozygosity Gender Inference. The output spreadsheet, X Heterozygosity, contains six columns.
- The first column is integer-valued and contains the number of heterozygous calls found on the X chromosome in each sample. Females are expected to have many more heterozygous calls than males. This would be evident in a histogram of the column.
- The second column contains the Heterozygosity Rate for each sample, which is defined as the number of heterozygous calls divided by the number of non-missing calls on the X chromosome. Two distinct distributions are also visible in a histogram of this column.
- The third column contains the number of missing values for each sample.
- The fourth column contains the missing rate for each sample, which is defined by the number of missing values divided by the number of columns mapped to the X-Chromosome.
- The fifth column contains the predicted gender of each sample. Females are predicted as having a heterozygosity rate > 0.1.
- The sixth column also contains predicted gender, but coded as a binary column with Females corresponding to 1.
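The six output columns can be sketched for one sample as below. Genotypes written as `"A_B"` strings, `None` for a missing call, the dictionary keys, and the function name are assumptions for illustration; the 0.1 cutoff is exposed as a parameter.

```python
def x_heterozygosity(calls, female_threshold=0.1):
    """Summarise X chromosome calls for one sample.

    `calls` holds genotype strings such as "A_B", with None for a
    missing call (assumed encoding).
    """
    called = [c for c in calls if c is not None]
    n_missing = len(calls) - len(called)
    # A call is heterozygous when its two alleles differ.
    n_het = sum(1 for c in called if c.split("_")[0] != c.split("_")[1])
    het_rate = n_het / len(called) if called else 0.0
    missing_rate = n_missing / len(calls) if calls else 0.0
    is_female = het_rate > female_threshold
    return {
        "het_count": n_het,
        "het_rate": het_rate,
        "missing_count": n_missing,
        "missing_rate": missing_rate,
        "gender": "Female" if is_female else "Male",
        "is_female": 1 if is_female else 0,
    }


print(x_heterozygosity(["A_B", "A_A", "A_B", None, "B_B"]))
print(x_heterozygosity(["A_A"] * 10))
```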
Multidimensional Outlier Detection
A median centroid vector is calculated as [median(column1), median(column2), ... , median(columnN)] based on N columns (dimensions) specified by the user. A distance score is then calculated for each sample or row as follows:

where N is the number of dimensions (columns) included in the calculation and median_n is the nth value of the median centroid vector. The outlier threshold is calculated as follows:

where Q3 and IQR are the third quartile and interquartile range of each column (1...N) and M is a user-specified multiplier.
To determine outliers in N dimensions, open a spreadsheet containing several integer- or real-valued columns. From the spreadsheet, select Quality Assurance >Multidimensional Outlier Detection. The default multiplier value is 1.5 but can be changed at the user’s specification. Click Add Columns to add integer- or real-valued columns to be included in the outlier calculation, then click OK to select the columns. Click OK to begin the calculations.
The spreadsheet output, Multidimensional Outlier Detection, will contain two columns. The first column contains the distance score for each sample and the second column is a binary column, where a 1 indicates an outlier. The threshold is specified in the second column header as Outlier >= threshold, e.g. Outlier >= 0.28.
A common use of this function is to calculate outliers in two dimensions, then filter a scatterplot of the two columns based on outlier status. For example, one could run Principal Component analysis, then plot the first two principal components against each other. Then determine outliers in two dimensions (the first two principal components). If you merged the resulting Multidimensional Outlier Detection spreadsheet with the principal components spreadsheet, you could filter on the binary Outlier column. Outliers would fall outside of an imaginary circle created by the median centroid and threshold values.
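The two output columns can be sketched as below, assuming the distance score is the Euclidean distance of each row from the median centroid. That form is an assumption, chosen because it is consistent with the circular boundary described for the two-dimensional case. The threshold is taken as an input here and is not derived from Q3, IQR, and the multiplier; the function names are hypothetical.

```python
import math
import statistics


def distance_scores(rows):
    """Distance of each row from the median centroid (assumed Euclidean)."""
    # One median per column gives the median centroid vector.
    centroid = [statistics.median(column) for column in zip(*rows)]
    return [
        math.sqrt(sum((x - m) ** 2 for x, m in zip(row, centroid)))
        for row in rows
    ]


def flag_outliers(scores, threshold):
    """Binary column: 1 where the distance score is >= threshold."""
    return [1 if score >= threshold else 0 for score in scores]


# Two dimensions, e.g. the first two principal components.
rows = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (3, 4)]
scores = distance_scores(rows)
print(scores)
print(flag_outliers(scores, threshold=2.0))
```

In this example the last row lies outside a circle of radius 2.0 around the centroid and is the only one flagged.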
Column Statistics
Column statistics can be calculated on all real-valued, integer-valued and (optionally) binary columns in a spreadsheet. Each of the following output statistics is optional: Minimum, Q1 (first quartile), Median, Mean, Q3 (third quartile), Maximum, Variance, Standard Deviation, and the Lower and Upper outlier thresholds, defined by Q1 - x*IQR and Q3 + x*IQR, where x is a user-defined multiplier and IQR is the interquartile range, defined by IQR = Q3 - Q1.
The resulting spreadsheet will have the original active columns as row labels and the selected summary statistics in columns. If a marker map was applied to the original spreadsheet’s columns it will be reapplied to the new spreadsheet’s rows.
To calculate the Column Statistics, open a spreadsheet containing several quantitative columns. From the spreadsheet, select Quality Assurance >Column Statistics. A dialog will appear that allows the user to specify which statistics to output. If the outlier thresholds are selected for output, a multiplier must be specified for use in the formulas described above. Binary columns can also be included in the calculations; however, not all of the statistics may be appropriate for binary columns.
The following summary statistics may be reported for every active integer-, real-valued or binary column in the spreadsheet:
- Lower Outlier Threshold: defined as Q1 - x*IQR, where x is the user-specified multiplier. This threshold can be used to identify outliers that fall below the threshold.
- Minimum: The minimum value found in the column. If the minimum value is less than the Lower Outlier Threshold, there are outliers present in the column.
- Q1: The first quartile is defined as the value below which 25% of the data fall. Equivalently, the first quartile could be thought of as the median of the first half of the data.
- Median: The median is defined as the value below which 50% of the data fall.
- Mean: The mean or mathematical average of the data. Comparing the mean and median values of the data can provide information about the skewness or normality of the data.
- Q3: The third quartile is defined as the value below which 75% of the data fall. Equivalently, the third quartile could be thought of as the median of the second half of the data.
- Maximum: The maximum value of the data. If the maximum value is more than the Upper Outlier Threshold, there are outliers present in the column.
- Upper Outlier Threshold: defined as Q3 + x*IQR, where x is the user-specified multiplier. This threshold can be used to identify outliers that fall above the threshold.
- IQR: The interquartile range is defined as the first quartile subtracted from the third quartile, Q3 - Q1. The IQR is used in the outlier threshold equations and is a measure of the variability in the data.
- Variance: The variance of the data values in the column. This is also the square of the standard deviation.
- Standard Deviation: The standard deviation of the data values in the column. Also the square root of the variance.
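The statistics listed above can be sketched for a single column as follows. The quartile method (the standard library's "inclusive" interpolation) and the use of the sample variance are assumptions, since this section does not say which conventions the software follows; the function name is hypothetical.

```python
import statistics


def column_statistics(values, multiplier=1.5):
    """Summary statistics and outlier thresholds for one column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "Lower Outlier Threshold": q1 - multiplier * iqr,
        "Minimum": min(values),
        "Q1": q1,
        "Median": median,
        "Mean": statistics.mean(values),
        "Q3": q3,
        "Maximum": max(values),
        "Upper Outlier Threshold": q3 + multiplier * iqr,
        "IQR": iqr,
        "Variance": statistics.variance(values),  # sample variance (assumed)
        "Standard Deviation": statistics.stdev(values),
    }


stats = column_statistics([1, 2, 3, 4, 5, 6, 7, 8, 9, 40], multiplier=1.5)
# A maximum above the upper threshold indicates at least one outlier.
print(stats["Maximum"] > stats["Upper Outlier Threshold"])
```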
Compare Columns
Compares two columns and inactivates rows in which the data values in the two columns differ. The user also has the option to create a subset spreadsheet containing the rows with matching data values in the two columns and/or a subset spreadsheet containing the rows with differing data values in the two columns.
To compare two columns in a spreadsheet, choose Quality Assurance >Compare Columns from the desired spreadsheet. Add the two columns by clicking Add Columns and selecting the appropriate column headers.
- NOTE: Exactly two columns must be selected with the column chooser. An error is thrown if fewer or more than two columns are chosen.
The subset spreadsheets can be created by checking the appropriate check boxes under Create subset spreadsheet(s) of: Rows with matching data values and/or Rows with differing data values.
Row Average by Chromosome
Calculates the mean of the integer and real-valued columns for each row, creating a new spreadsheet with the respective row means. If the data is marker mapped, the row means are calculated by chromosome and overall.
To calculate the row averages by chromosome, open a spreadsheet that contains several integer- or real-valued columns. From the spreadsheet, select Quality Assurance >Row Average by Chromosome. The output spreadsheet, Row Averages, will contain a column with row averages for each chromosome in the marker map or, if a marker map was not applied, one column with the overall row averages.
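The calculation for one row can be sketched as below. Passing the marker map as a list of chromosome labels aligned with the columns, the "Overall" key, and the function name are assumptions for illustration.

```python
def row_average_by_chromosome(row_values, chromosomes=None):
    """Mean of a row's values per chromosome and overall.

    `chromosomes` gives the chromosome label of each column; when it is
    None (no marker map), only the overall mean is returned.
    """
    if not row_values:
        raise ValueError("row has no values")
    averages = {}
    if chromosomes is not None:
        grouped = {}
        for value, chrom in zip(row_values, chromosomes):
            grouped.setdefault(chrom, []).append(value)
        for chrom, values in grouped.items():
            averages[chrom] = sum(values) / len(values)
    averages["Overall"] = sum(row_values) / len(row_values)
    return averages


print(row_average_by_chromosome([1.0, 3.0, 5.0, 7.0], ["1", "1", "2", "2"]))
print(row_average_by_chromosome([1.0, 3.0, 5.0, 7.0]))
```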