Achieving Genome-Wide Success Part III: Quality Assurance and Data Prep in SNP and CNV Studies

Presenter: Dr. Christophe Lambert, Golden Helix CEO and Co-Chair of the FDA MAQC CNV Team

Date: December 9th, 2009

Presentation

Well good morning, everyone.  I’m Christophe Lambert, president and CEO of Golden Helix.  I’d like to welcome all of you who’ve come to hear the webcast this morning.  It’s the third in a series called Achieving Genome-Wide Success.  The last one, which was very well attended, was all about experimental design.  And today’s presentation is all about once you’ve got your data, how to do quality assurance and prepare it for your SNP and copy number variation studies. 

Just a housekeeping detail – you cannot speak to me, but I can speak to you.  It’s just unwieldy.  We have, literally, over 100 people – sometimes 200 people – attending these webinars.  And so if you have any questions, please, by all means – throughout the course of the presentation, there’s a question and answers pane in the webinar view for the Go To Meeting webinar.  Type in your question.  If you have some sort of technical problem with the webinar, we have people standing by who can answer those questions. 

But the technical and scientific questions I’ll take towards the end of the presentation.  One common question that’s often asked is: “Can I get a copy of your slides?”  We will be recording the webinar and making it available on our website.  And I believe we also make available the slides as a PDF file that you can download.  So that’ll be one less question that we’ll, perhaps, have to answer today. 

So I’d like to go over the agenda.  We originally had planned to talk about genotype calling.  And as we were building the content for the presentation, there was just so much we had to cover, we realized that genotype and copy number calling algorithms warrant a webinar all their own, and we intend to do one in the new year. 

So today’s agenda is principles of quality assurance.  We’ve really stepped back and looked at our various quality control procedures and questioned our assumptions behind them, and really find that there’s a lot to think about in doing quality assurance properly, and a lot of assumptions that may have held in the past, that might not hold in the future. 

So we’ll talk a bit about that and thread that theme throughout the presentation, as we talk about sample quality assurance, SNP and copy number based versions thereof, and then looking at individual SNP quality assurance methodologies – both the rationale behind them, as well as what they are.  And then spend some time, going into a little more depth than we have in the past, on the current approach we’ve been using for correcting batch effects in copy number variation studies.  And then we’ll have a summary and have plenty of time for questions and answers.  

So in thinking about quality assurance, I was looking up the definition of quality assurance and quality control, and a lot of it has to do with – you’re trying to bring a process into some bounds, with respect to the intended purpose.  And so in the context of science, our intended purpose that we have for generating a data set today – for instance, a genome-wide SNP study – may change over time.  Tomorrow, we might be doing a genome-wide copy number variation study on the same data. 

And one thing we’ve seen is quality assurance is often done in isolation from the intended purpose, i.e. a core lab will do a certain level of quality assurance, when the intended purpose is to do a genome-wide association study or copy number study.  It’s not necessarily their role to do it, and yet we’re finding that the downstream analysis is very much impacted, of course, by some of the decisions that are done upfront.  And we’ll talk a bit about that some more. 

So another principle we need to be aware of, besides what is the purpose of our experiment, and knowing that may change over time, is really trying to hypothesize the sources of departure from expectation, and verbalize them.  We discourage the blind use of filters and so forth, without really questioning why we have differences from expectation in our data. 

Now one interesting shift that’s occurred in quality assurance thinking, perhaps, as we’ve gone to genome-wide studies with such massive data sets is: in the past, you would have something like an outlier in a data set, and you could look at each outlier individually, and try to track it back and see what caused it.  Was it real?  Or was it some sort of anomaly? 

And with so many data points, we’ve now turned to automated filters to do that, and I think we need to be careful about that.  So in looking at things like filters and so forth, in traditional quality control, you would not judge the validity of data by its unusualness, i.e. an outlier.  There could be a good reason why something has an extreme value, which may even be biologically relevant, and is not an experimental error of some sort. 



So a great example of that was, in fact, the discovery of common copy number variation in genome-wide SNP data – was someone saying, “Why do I have –?”  I think Charles Lee and some others discovered there were low call rates for certain SNPs.  And they looked at the cluster plots and found more than three clusters.  And if they’d just discarded that data, without thinking what could be causing it, perhaps it would have been longer before we’d begun looking at copy number with genome-wide SNP arrays. 

So again, another principle – and I’ve said it before – is beware blind application of rules of thumb.  And we’re gonna talk about some of the rules of thumb that we’re using.  But really, try to state and verbalize at least most of the assumptions that are behind various filtering and quality assurance procedures. 

And the key thing is: the assumptions that may have held on one data set may not hold in another.  And really, we’ve got to beware inertia, inertia meaning getting into a habit, or unthinkingly using a procedure, without thinking about the assumptions under which it’s valid. 

Finally, we drop a whole lot of data in genome-wide studies, and we’re gonna talk about the conditions under which that’s appropriate for reducing Type I and Type II error, as well as recognizing that dropping data’s not the only way to remove bias.  And if you have a chance to hear or watch a recording of my webinar on experimental design, a theme that’s gonna be threaded throughout this presentation is prevention versus treatment, that careful experimental design will remove most of the biases that the QA filters were designed to treat. 

So a lot of the filters we use in discarding SNPs and so forth were designed in the era before a lot of careful experimental design was done.  And as we’re seeing more experiments coming out these days that very carefully randomize the phenotypes and the plates, randomize with respect to phenotypes and gender and so forth, the number of actual spurious associations due to technical problems goes down dramatically. 

And we can actually start moving back to not having to automatically make decisions, but really engage our thought processes on all of the statistically significant findings, and be able to question their validity through careful inspection, as opposed to having to throw away, say, 200,000 SNPs, because there are so many spurious associations. 

So as we look first then at sample quality assurance, let’s try to look at what these assumptions are, under which we’ve been discarding certain samples and so forth, and when they may be valid, and when they may not be valid.  So, of course, sample quality assurance involves – of course, you can have experimental problems where the experiment just doesn’t measure the biological phenomena of interest.  And that, of course, can be a continuum. 

And so what’s done, often, is drop samples and/or rerun them, if possible.  We want to identify experimental errors and biases, because those inevitably lead to problems with spurious associations downstream, if we’re not cognizant of them and don’t correct for them.  Phenotypic errors are another thing.  We find, in almost every study, a gender misclassification for several samples, for instance. 

And there are other errors we’ll talk about.  Population structure’s a big one as well.  We’re very concerned, again, about the biases introduced by mixtures of different populations.  And when we have statistical tests that assume a certain homogeneity of population structure, we can have associations that are spurious, just representing differences between populations, rather than an underlying biological phenomenon.  

So when we look at sample quality, there are assumptions behind a lot of the methods we’ll be talking about today.  These methods, which we still use and which are very good, are appropriate when the quality problems (for instance, low call rates, wave effects, you name it) are correlated with the phenotypes, introducing biases downstream.  We often assume study outcomes might be biased by population heterogeneity. 

And often, when we do our data quality assurance, we assume that the data set cleaned for one purpose can also be analyzed for other purposes.  And we’ll see for that one, on the right column here – an immediate counter example is SNP quality assurance filters do not find all the copy number variation problems.  And so we’ll talk about that some more.  With population heterogeneity, it’s not necessarily the best solution to throw away samples, but rather to do a statistical test that can account for population heterogeneity.

For instance, the Eigenstrat procedure.  There are also reasons why you might want to do both.  And this theme I’ll repeat several times in the presentation: we see that with careful experimental design, the noise is expected to be random, rather than having correlation between the errors and the phenotypes.  If we’ve designed our experiment well, that is reduced dramatically, and so problems with bad data become much mitigated. 

And so what we’ve found, as we’ve done our genome-wide studies of SNP and copy number data, is that generally samples that do poorly on SNP quality criteria give poor copy number results.  So the filters that are being used today for SNP based QA – if you’ve got a sample that you would throw away for a SNP study, it’s generally gonna perform poorly for a copy number based study. 

But the converse is not necessarily true.  There are samples that can pass all your SNP criteria, and still have poor copy number properties.  And often, again, you can’t know until the very end of the final downstream analysis that you’ve captured all of the problems with the biases in your data, and they can only be seen in the context of association.  So you see a certain conundrum in the division of labor between the running of the experiments and the core labs and the analysis that’s done afterwards, if there’s an opportunity to rerun samples. 



And ideally, you want to run them as soon as possible, once you’ve found a problem, so the experimental conditions are kept as constant as possible.  It would be best if we could somehow do the full downstream analysis, or at least a first-pass version.  We find we can actually do analyses almost quick enough to be able to turn around answers, to get those types of questions answered. 

But traditionally, analyses take months.  And the opportunity to correct things – that window is often lost.  So the QA procedures we’ll be talking about for finding that a sample’s got a problem are – on the SNP side, we’ll look at the call rate, the heterozygosity, and all these various things, as well as the copy number side.  And so we’ll go, in turn, and look at each of these methods, talk about them, and talk about how to then combine them into some sort of a unified quality assurance procedure. 

So for SNP based QA, often we are looking to optimize our call rates.  We want to get the highest possible yield out of our SNP studies.  And you can see this is a histogram of a couple of different calling methods – CRLMM and Birdseed.  We’ve done a consensus calling approach of the two methods, which actually gave an improved call rate. 

So you see, picking an arbitrary cutoff would depend a lot on the calling methodology.  So we’re gonna talk a little later about: how do you pick an appropriate cutoff, that’s not just an arbitrary eyeballing of the data.  X heterozygosity – the basic concept, of course, is males have one copy of X.  Females have two.  And so the male should be homozygous. 

It’s very easy to calculate the heterozygosity rate.  MARS Software has a script to do that.  I’m sure it’s fairly straightforward.  And you’ll see two distributions – on the left the males, and on the right, the females.  But then there’s this spread of the tail, where there’s a lower heterozygosity rate for certain females.  And generally, this happens when there’s some sort of a genotype calling error, where systematically, heterozygotes are miscalled as homozygotes, and so you get this lower heterozygosity rate. 
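
As a rough illustration of the X heterozygosity check just described, here is a minimal Python sketch. The genotype encoding ("AA", "AB", "BB", with None for a no-call) and the two cutoffs are assumptions made for the example, not values from the presentation; real cutoffs should be read off the histogram of your own study.

```python
def x_heterozygosity_rate(genotypes):
    """Fraction of called X-chromosome genotypes that are heterozygous.

    genotypes: list of two-letter strings such as "AA", "AB", "BB";
    None marks a no-call and is ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    het = sum(1 for g in called if g[0] != g[1])
    return het / len(called)


def infer_sex(het_rate, male_max=0.02, female_min=0.15):
    """Classify a sample from its X heterozygosity rate.

    male_max and female_min are hypothetical cutoffs for illustration only.
    Samples between the two are the 'tail' worth inspecting by hand.
    """
    if het_rate <= male_max:
        return "male"
    if het_rate >= female_min:
        return "female"
    return "ambiguous"


rate = x_heterozygosity_rate(["AA", "AB", None, "BB"])
print(rate, infer_sex(rate))
```

Comparing the inferred label with the recorded gender is what surfaces the misclassifications; the "ambiguous" band corresponds to the females with an unexpectedly low heterozygosity rate.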

Now there’s a new method that we’ve developed.  It’s not rocket science, but it’s surfaced some problems that we’ve never seen anybody talk about.  But in a number of studies we’ve looked at, where there’s a combination of arrays – and in this study, it was a 500K array from Affymetrix.  But you could also envision – often people are combining arrays from even the more modern experiments, when a new version of an array comes out. 

And often – thankfully not too often – a mix up can happen in the samples, where two arrays are supposed to be the same person – one from NSP and one from STY – and actually, when you look at the data, how do you tell that they actually are the same person, versus a mismatch?  We’ve developed an approach in which you take the nearest neighbor of a SNP in the other array, subject to the constraint that it’s no more than 2,500 base pairs away – so you take only pairs that have a neighbor within 2,500 base pairs – and you do the correlation of the minor allele frequency between these pairs, between the NSP and STY, and you get a correlation coefficient. 

And because alleles are inherited as part of a haplotype block, you’d expect a high correlation – not terribly high, but at least a non-zero correlation between nearby markers.  And if they’re from different individuals, that correlation is much lower.  So this was a study that had 1,879 samples.  And 20 of them were actually mismatched for their NSP-STY arrays. 

And so putting that into a study would obviously lead to problems with interleaving an NSP and STY from a person who has, perhaps, the correct phenotype, and a person who was accidentally pulled from the wrong well – the person’s DNA, rather.  And we validated this method on gender-mismatched NSP-STY pairs, where you can see there’s a male/female mismatch, as measured by heterozygosity or by X – the mean of the log ratios of the X chromosome. 
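
The nearest-neighbor pairing and correlation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: each array is represented for one sample and one chromosome as a position-sorted list of (position, minor allele count) tuples, with genotypes coded 0, 1, or 2, and the function names are made up for the example. It is not the implementation used in the study.

```python
from bisect import bisect_left
from math import sqrt


def neighbor_pairs(nsp, sty, max_dist=2500):
    """Pair each NSP SNP with its nearest STY SNP on the same chromosome.

    nsp, sty: lists of (position, minor_allele_count), sorted by position.
    Pairs farther apart than max_dist base pairs are discarded.
    """
    sty_positions = [pos for pos, _ in sty]
    pairs = []
    for pos, count in nsp:
        i = bisect_left(sty_positions, pos)
        # The nearest STY SNP is either just before or just after pos.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sty)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(sty_positions[k] - pos))
        if abs(sty_positions[j] - pos) <= max_dist:
            pairs.append((count, sty[j][1]))
    return pairs


def pearson(pairs):
    """Pearson correlation of the paired values; NaN if undefined."""
    n = len(pairs)
    if n < 2:
        return float("nan")
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


nsp = [(100, 0), (5000, 2), (9000, 1), (50000, 2)]
sty = [(300, 0), (5200, 2), (9100, 1), (90000, 0)]
print(pearson(neighbor_pairs(nsp, sty)))
```

In practice the pairs from all chromosomes would be pooled per sample, and the samples whose coefficient falls well below the rest of the study are the candidates for an NSP-STY mismatch.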

So another sample quality procedure that’s useful is to look at cryptic relatedness, as well as sample contamination.  Both of these can be elucidated through looking at estimates of IBD relatedness.  I’ve used here the PLINK package to do an all-pairs comparison of every sample in the study.  And in this study, what we did here is we plotted, in fact, the NSP versus STY, so you actually see there are big mismatches off diagonally here, again, from mismatched samples. 

And when you see a pi-hat close to 1, basically the SNPs are nearly identical between the two samples.  And that corresponds either to monozygotic twins.  Or in the case of this study we were looking at, these individuals – there weren’t supposed to be any twins.  But accidentally, a number of samples had been rerun, or pulled from the wrong well.  And so you’d see either twins or the same individual run at 1.  A half is first-degree relatives like siblings, parent/child relationship, and so on down. 

Also, what you can see – occasionally, there’ll be a particular individual that’s highly correlated with a lot of other individuals, with something like a .06/.07 pi-hat.  And that’s an indication of sample contamination.  Now when you’re looking at relatedness in the context of a family study, it’s, again, a different environment.  There, what you’ll want to do is, of course, verify that the pedigree information lines up with the relatedness as measured. 
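
To make the reading of the pi-hat values concrete, here is a small Python sketch that bins a pairwise pi-hat estimate and flags samples showing low-level relatedness to many others. It assumes the pairwise estimates have already been computed (for instance with PLINK) and loaded as (sample_a, sample_b, pi_hat) tuples; all the numeric cutoffs and the min_partners parameter are hypothetical choices for illustration.

```python
from collections import Counter


def classify_pi_hat(pi_hat):
    """Rough relationship category for one pair. Cutoffs are illustrative."""
    if pi_hat >= 0.9:
        return "duplicate or monozygotic twin"
    if pi_hat >= 0.4:
        return "first-degree relative"
    if pi_hat >= 0.2:
        return "second-degree relative"
    return "unrelated or distant"


def contamination_suspects(pairs, low=0.05, high=0.2, min_partners=3):
    """Samples with weak apparent relatedness to many other samples.

    pairs: iterable of (sample_a, sample_b, pi_hat).
    A sample is flagged when at least min_partners of its pairs fall in
    the [low, high) band.
    """
    partners = Counter()
    for a, b, pi_hat in pairs:
        if low <= pi_hat < high:
            partners[a] += 1
            partners[b] += 1
    return sorted(s for s, n in partners.items() if n >= min_partners)


pairs = [
    ("S1", "S2", 0.06), ("S1", "S3", 0.07), ("S1", "S4", 0.06),
    ("S2", "S3", 0.00), ("S5", "S6", 0.98),
]
print(contamination_suspects(pairs))
print(classify_pi_hat(0.98))
```

In a real study min_partners would need to scale with the number of samples, since a contaminated sample tends to look weakly related to a large share of the cohort rather than to just three individuals.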

And so this is an all-pairs heat map of the identity by descent.  And along the diagonal, these are just the samples, and we’ve ordered the samples in these blocks here according to the families that were recorded within the pedigree data.  And this particular individual here was actually found to be – was supposed to be within this family of 1, 2, 3, 4, 5 individuals – actually is related to this family, and most likely was actually a sample mix up or, perhaps, a pedigree error.  But that has to be looked at, and those samples either dropped or rerun, to correct the problem. 



Now population structure is something that – there’s a whole lot to it.  But in terms of kind of a fast and ready quality procedure, what we like to do is – you can download – if you’re doing an Affy study, you can download the Affymetrix 6.0 Hapmap3 data from the – Hapmap3 has pointers to an appropriate FTP site, and you can merge it with your study. 

And even if you have like an Affy 500K study, there’s enough SNPs in common, that you can just join up the matching SNPs and do the Eigenstrat or principal components decomposition.  And this is a plot of our study in blue – of interest – versus the different populations in the Hapmap3, as opposed to the Hapmap2, that just had the Chinese, Japanese, Yoruban, and CEPH populations.  There are various other populations, including Mexican and some indigenous cultures from various places – various African populations to the lower left here. 

So you see this blue point from one of our studies actually is way over here towards the African population.  And so a sample like that – you could decide to exclude it from your analysis, or to include it and either adjust for the principal components in a regression-based model or use the Eigenstrat approach to do a correction for multiple principal components. 

So again, the outliers can cause a bias in association.  But deciding to remove them is something that you may want to do downstream.  Flag them in your initial QA, but then make a decision to, perhaps, run analysis two different ways, to see to what degree the population structure does bias your findings.  So now we’ll swing over to copy number variation based quality assurance. 

And we’ve done a lot of copy number studies, both Affymetrix and Illumina, and we’ve found these metrics work broadly in both of those domains, as well as in – we haven’t done as much with, say, Agilent or NimbleGen.  But these same metrics – in fact, this first one was one we kind of borrowed from Agilent’s quality assurance pipeline, and found it’s very useful – is the Derivative Log Ratio Spread. 

Derivative Log Ratio Spread is better than just doing a standard deviation of the log ratios.  Because the point-to-point variability, the standard deviation between adjacent points, will not be skewed if you do have large cytogenetic changes.  There’ll only be a delta between the last point of the previous segment and the first point of the new segment.  And you won’t be calculating a variance versus a copy neutral mean.  You’ll just be calculating a variance versus the point-wise, pair-wise differences. 
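
A minimal sketch of the idea follows, assuming the log ratios are given in genomic order for one sample. It uses the plain standard deviation of adjacent differences divided by the square root of 2 (the difference of two independent values with variance sigma squared has variance 2 sigma squared); production pipelines often use a more robust spread estimate, so treat this as a simplification rather than any vendor's exact formula. The simulated sample is made up for the demonstration.

```python
import random
from math import sqrt
from statistics import stdev


def dlrs(log_ratios):
    """Derivative Log Ratio Spread: spread of adjacent-probe differences.

    log_ratios must be ordered by genomic position. Dividing by sqrt(2)
    puts the result on the same scale as the per-probe noise.
    """
    if len(log_ratios) < 3:
        raise ValueError("need at least three log ratios")
    diffs = [b - a for a, b in zip(log_ratios, log_ratios[1:])]
    return stdev(diffs) / sqrt(2)


# Simulated sample: a copy-neutral stretch followed by a large loss,
# both with the same per-probe noise (standard deviation 0.1).
rng = random.Random(0)
neutral = [rng.gauss(0.0, 0.1) for _ in range(200)]
loss = [rng.gauss(-1.0, 0.1) for _ in range(200)]
sample = neutral + loss

print("plain standard deviation:", round(stdev(sample), 3))
print("DLRS:", round(dlrs(sample), 3))
```

The plain standard deviation is inflated by the large loss, while the DLRS stays close to the underlying probe noise, because only the single difference at the segment boundary is affected.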

Anyhow, if you look at a histogram of a study, you see high spreads.  Basically, it’s noisy data.  And we find when you have very noisy data, two things can happen.  Either you can have problems detecting copy number variants, and you’ll find very few segments, or, depending on the structure of that noise, you can have a bunch of spurious findings as well.  So it can either go towards a Type I error or Type II error. 

And we find, interestingly, that either the excessive number of segments – the highest Derivative Log Ratio Spread samples often do have an excessive number of segments.  But sometimes, like here, you find very few segments as well.  So it goes both ways.  And so, generally, what we want to do is be cutting off the high end of the tail.  The low end is always good for Derivative Log Ratio Spread.  The lower the noise, the better. 

Wave effects are a problem that we’ve run into.  What is a wave effect?  There’s a nice paper by Diskin et al that describes a PennCNV tool called Genomic Wave, which can calculate this wave effect.  But what it is are the intensities, and even the log ratios.  Despite normalization, there’s this kind of a long-range wave that seems to be correlated not just with the GC content of the probes themselves, but the GC content of the neighborhood around the probes. 

So if you take like a mega-base window of GC content – and that’s what the PennCNV approach is – and you do a regression, you can fit this wave and try to knock it down.  We’ve worked with their methodology and done segmentation after correcting for the waves, and we see some distortions in the final results.  For instance, here on the right is not using wave effect correction. 

And you’ll see a nice signal separation between neutral and the central peak of this histogram of log ratio segments, a gain of extra copy – Copy 3, Copy 1, and the hump here – Copy 0.  And that all seems to go away for us.  So the technology – maybe we’re misusing it.  But at least in this particular study, we didn’t get great results.  So what we’ve been finding is the best approach is to actually remove samples that have large effects. 

And so the tool, Genomic Wave, has a metric, and you can find samples that have particularly bad problems with these wave effects.  And the problem is that a wave can, of course, cause a problem with the segmentation or a hidden Markov model approach for finding copy number regions. 
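
To show what regressing log ratios on regional GC content looks like in its simplest form, here is a hedged Python sketch. It is not the PennCNV Genomic Wave implementation or its waviness metric; it is an ordinary least-squares line fit on made-up inputs, where regional_gc is assumed to hold the GC fraction of a window around each probe.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def gc_wave_adjust(log_ratios, regional_gc):
    """Remove the linear dependence of log ratio on regional GC content.

    Returns (slope, adjusted_log_ratios). The sample mean is preserved,
    because only the GC deviation from its mean is subtracted.
    """
    slope, _ = fit_line(regional_gc, log_ratios)
    mean_gc = sum(regional_gc) / len(regional_gc)
    adjusted = [lr - slope * (gc - mean_gc)
                for lr, gc in zip(log_ratios, regional_gc)]
    return slope, adjusted


gc = [0.35, 0.40, 0.45, 0.50, 0.55]
lr = [2.0 * (g - 0.45) for g in gc]   # a pure GC-driven trend
slope, adjusted = gc_wave_adjust(lr, gc)
print(round(slope, 6), [round(a, 6) for a in adjusted])
```

The magnitude of the fitted slope gives a crude per-sample indication of how strongly the intensities track GC content, which is the kind of quantity one could use to flag samples for removal rather than correcting them.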

So we found, interestingly, when we were looking at the number of metrics, we were really trying to figure out: how can we filter and find problems with as many of the samples as possible that are gonna have an excess number of copy number calls, without actually making copy number calls?  In other words, is there an approach that a core lab could measure as many things as possible and find problems, without actually doing the downstream analysis of making copy number calls? 

And we found that the extreme value distribution captured a lot of problematic samples that were not flagged by other methods.  And so when we looked at the lower 1 percent and upper 1 percent, which is on the X-axis here, versus the segment count – finding an excess number of copy number regions – a large majority of those samples were flagged by there being a lot of extreme values in the lower or upper 1 percent. 



And so you can have kind of a symmetric, noisier sample, and this would probably be captured by a Derivative Log Ratio Spread, where if you look – these are the same axes – the log ratios go from -2 to +2 in this, whereas they’re in a much narrower band for this cleaner sample.  But we also find the skewness indicates quality problems. 

So if you take the ratio of the top and bottom 1 percent cutoff, that’s a useful metric.  And so you see skews going upward and skews going downward with these particularly poor samples, and sometimes they can pass other quality metrics.  So you may not have a really high noise, but the skewness of the noise indicates that there’s a problem there. 
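
The two tail cutoffs and their ratio can be computed along these lines. This is a sketch using Python's standard library percentile estimate; the function name and the choice to compare absolute values are assumptions for the example.

```python
from statistics import quantiles


def tail_metrics(log_ratios):
    """Lower 1 percent cutoff, upper 1 percent cutoff, and their ratio.

    A ratio far from 1 means the extreme values are skewed upward
    (ratio > 1) or downward (ratio < 1).
    """
    cuts = quantiles(log_ratios, n=100)   # 99 percentile cut points
    low, high = cuts[0], cuts[-1]
    ratio = abs(high) / abs(low) if low != 0 else float("inf")
    return low, high, ratio


symmetric = [i / 100 for i in range(-100, 101)]
skewed_up = [x if x < 0 else 3 * x for x in symmetric]

print(tail_metrics(symmetric))
print(tail_metrics(skewed_up))
```

The symmetric sample gives a ratio of about 1 regardless of how wide it is, which is why this metric complements a spread measure rather than replacing it.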

Now this quality assurance procedure is complementary, and often gives very similar results, to looking at the X heterozygosity of the SNPs, as well as looking, I suppose, at the Y call rate of the SNPs.  It is to just take the average intensity across the X and Y chromosome, which we have in the lower right plot here, and use it to surface problems, where we’ve color coded individuals whose phenotype said that they were female, in green, and male, in blue.

And you see there’s a blue dot here, where this person was most likely misclassified as a male, because their genetic data is totally consistent with them being female.  But you also see some samples where there appear to be XXX and XXY, up here.  And those individuals – not to say you necessarily need to discard them.  It may well be that XXX is correlated with a particular condition of interest.  But it’s just very useful to know that and make an informed decision with respect to the purposes of your study: do you wish to discard such samples? 

Also, you can look, in a study like an Affymetrix 500K, where there’s a separate array for the NSP and STY probes, and see if there’s a mismatch.  And if you see a large mismatch, it’s potentially indicative of either a problem with the quality of one or the other, or, in the case of very different intensities with respect to the X chromosome, that you’ll have an NSP from the males and an STY from the females, or vice versa, as you see in a point like out here. 

Here is actually a copy number that’s consistently very low for a female sample.  But it’s the same for NSP and STY, so it’s probably a good experiment.  And what it’s measuring is X mosaicism, where the female has a certain fraction of her cells that have one copy of X and a certain fraction that have two copies, and so you see an intensity that’s somewhere between the male and the female average distributions. 

Another quality assurance criterion that you want to look for is outliers, in terms of the average intensity of all the other chromosomes.  So this is a histogram of all the autosomes, and you’ll see there are a number of samples that have very high means for certain chromosomes.  And, in fact, this Hapmap sample NA18540 has an extra copy of five different chromosomes.  And it’s clearly a cell line artifact, because no one would be alive having an extra copy of five different chromosomes. 

And we see this from time to time in cell line data.  Not always.  But it’s something to be aware of, particularly if you’re using a cell line as some sort of a control.  Run an array first to verify you don’t have this problem, before using it as some sort of a control.  Again, when you get, finally, to the end game of your copy number analysis, and you do your segmentation, and you find copy number variants, and do a histogram of how many segments are found, you can sometimes see a very low or very high number of segments found.

Now the general midpoint of this distribution’s rather high, because we used a very lax threshold.  Generally in our studies, we’re usually around 400 or so, including both neutrals and gains and losses.  But these tails are what we want to flag as samples that are problematic.  Fortunately, a lot of the metrics we’ve already shown will capture samples that are likely to have these problems.  But still, occasionally some of them do get through all the other filters. 

Often a particular individual sample – when you look at the histogram of the log ratios, if it’s a good sample, it should have nice multimodal distribution, versus not a very good spread.  And that’s consistent with, again, a problem with the sample – either the DNA or the experiment.  And it’s hard to know where the error occurred.  We’ve seen some studies where the rerun of the sample was just as bad as the original.  We’ve seen other samples where the rerun gets beautiful data.  So it can be both causes. 

So given all these metrics, how do you make informed decisions about which samples to keep and which samples to discard, or at least flag as problematic?  So we’re gonna talk about picking thresholds for each QA metric.  Once you’ve picked thresholds, ideally your best samples – they should pass all of the quality assurance thresholds.  Now depending on the particular experiment you’re running, which metrics you choose could be different. 

For instance, if you have very high call rates and no particular problem with heterozygosity or any of the SNP metrics, but you see large wave effects in the copy number data, if you’re doing a SNP association study first, perhaps you’d leave that sample in the SNP association study.  But then when you move to the copy number association, you might remove it. 

Now when you’ve got a lot of metrics – and we’ve talked about, perhaps, nine or ten of them.  You might find a sample passes all metrics except one.  And you’ll have a look at it and say, “I’d like to keep it.”  It may be a judgment call in some cases.  Again, rationalize that when you’re writing up your study.  We have found that following a rather Draconian policy of – if it fails any of the SNP or CNV tests, drop it – has given us the cleanest final data, obviously, but potentially at the expense of lower power. 

So you’ll have to make those deliberations yourself.  In some studies we’ve found that have very good SNP quality criteria – we’ve found as much as 25 percent of the data might end up having to be dropped, based on CNV thresholds.  Also, we have to recognize the thresholds we’re using for dropping are within the limitations of the downstream analysis approaches that we use today.  If we can find downstream analysis approaches that totally overcome wave effects, or can somehow boost the signal, despite it being very noisy, we might change our approach. 
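
The "drop it if it fails any test" policy, together with a tally of which metrics were the sole reason for a failure, can be sketched as below. The input layout (a dict from sample ID to the set of metric names it failed) and the metric names are hypothetical.

```python
from collections import Counter


def summarize_failures(failures):
    """Apply the strict policy and count single-metric failures.

    failures: dict mapping sample_id -> set of failed metric names
              (an empty set means the sample passed everything).
    Returns (dropped_samples, sole_failure_counts).
    """
    dropped = sorted(s for s, failed in failures.items() if failed)
    sole = Counter(next(iter(failed))
                   for failed in failures.values() if len(failed) == 1)
    return dropped, sole


failures = {
    "A": set(),
    "B": {"waviness"},
    "C": {"waviness", "dlrs"},
    "D": {"dlrs"},
    "E": {"waviness"},
}
dropped, sole = summarize_failures(failures)
print(dropped)
print(dict(sole))
```

The sole-failure counts are the useful part when deciding whether a metric earns its place: a metric that is never the only reason a sample fails is redundant with the others, while one that uniquely flags many samples deserves a closer look before relaxing it.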



So the recommendations for today are conditioned, of course, on what we’re gonna do downstream.  So I’d like to credit Bryce Christensen, who works with us.  He’s the statistical geneticist on this work.  How do you come up with a defensible, non-arbitrary threshold for various quality assurance procedures? 

The interquartile range, or IQR, is a very useful one.  It’s a very standard methodology in statistics.  It’s depicted visually here.  You sort your data from lowest to highest.  And for the middle half of the distribution, you look at the distance between the Quartile 1 boundary and the Quartile 3 boundary, and that distance is your interquartile range.  And then the cutoff that you use for calling something an outlier is one and a half times the length of this interquartile range in red here, to the left of Q1. 

So that’s here.  And one and a half times to the right of Q3.  And for normally distributed data it actually corresponds to about 2.7 sigma, similar to a 99 percent confidence interval.  So if you look at what it does to a uniform distribution, it actually would not drop any points.  Because one and a half times the interquartile range is actually to the left here of any of the data, and to the right of any of the data at the upper end. 
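
The interquartile range rule is short enough to write out. The sketch below uses Python's standard library quartile estimate (other quartile conventions give slightly different fences on small data sets) and also works out where the fences fall for a normal distribution; the function names are made up for the example.

```python
from statistics import NormalDist, quantiles


def iqr_fences(values, k=1.5):
    """Lower and upper outlier cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """Values falling outside the IQR fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Uniformly spread data: the fences lie beyond both ends, nothing is dropped.
uniform = list(range(101))
print(iqr_fences(uniform), outliers(uniform))

# One extreme value is flagged.
print(outliers(uniform + [500]))

# For a standard normal, Q3 = z and IQR = 2z, so the upper fence is 4z.
z = NormalDist().inv_cdf(0.75)
fence = z + 1.5 * (2 * z)
coverage = 2 * NormalDist().cdf(fence) - 1
print(round(fence, 3), round(coverage, 4))
```

The same two functions can be applied unchanged to call rate, Derivative Log Ratio Spread, or segment count; for metrics where only one tail is bad, only the corresponding fence would be used.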

Here’s the chi-squared random data set, and you’ll notice it cuts off some points at the right of the distribution.  Here’s some real data – the segment count.  We found a cutoff here – a lower bound, samples that were so noisy, we found very few segments, and an upper bound, samples that had other problems that resulted in finding an excess of segments. 

And these are reasonably defensible bounds for this particular data set.  So we can do this also with the Derivative Log Ratio Spread here, and we can do it with the autosomal SNP call rate.  In this case, the lower bound of 95 percent, or roughly thereabout, was what was found empirically, using this interquartile range approach. 

Now when we looked at all of the different methods for removing samples, whether it was waviness factor, segment count, or what have you, we did a correlation of the failures, in terms of which ones failed for a given reason and how they were correlated.  And we see waviness was actually the least correlated with some of the other metrics. 

A number of the metrics are correlated.  But there do tend to be certain metrics that uniquely find individuals.  So 53 individuals in this study of 1,550 or so samples only failed due to the waviness factor.  And only 1 failed on its own, just due to Derivative Log Ratio Spread.  So there’s obviously a correlation between these various metrics.  And in this case, we actually threw away about 28 percent of the samples in this study.  Some of the samples were known ones that were rerun anyhow. 

So, in the end, it was probably more like 25 percent of the samples.  So what’s the impact of doing this QC filtration?  So if we do the segmentation on all the samples without dropping any of them, this is a histogram of the segment means – neutral in the middle.  It’s cut off at the top.  This is copy number 1, copy number 0 somewhere in this distribution, copy number 3, and maybe a 4 in there.  After filtering, it’s much more well defined and clean. 

You can pick much nicer cutoffs, of where you’d say: below, say, -0.2 is a loss, and above +0.2 or +0.15 is a gain.  Whereas with the unfiltered samples it’s much harder to make that kind of delineation.  And so you see by cleaning up these samples, we can have much more certainty in the copy number calls that we’re ultimately making. 
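The cutoff idea can be sketched as a simple classifier over segment means. The -0.2 and +0.2 defaults are the example values mentioned above, not recommended thresholds, and the function names are hypothetical:

```python
def call_segment(segment_mean, loss_cutoff=-0.2, gain_cutoff=0.2):
    """Classify one segment mean (log ratio) as loss, gain, or neutral."""
    if segment_mean < loss_cutoff:
        return "loss"
    if segment_mean > gain_cutoff:
        return "gain"
    return "neutral"


def summarize_calls(segment_means, loss_cutoff=-0.2, gain_cutoff=0.2):
    """Count how many segments fall into each class."""
    counts = {"loss": 0, "neutral": 0, "gain": 0}
    for mean in segment_means:
        counts[call_segment(mean, loss_cutoff, gain_cutoff)] += 1
    return counts


print(summarize_calls([-0.5, 0.0, 0.3, 0.25]))
```

In practice the cutoffs would be read off the histogram of segment means after sample QC, since that is when the peaks separate cleanly.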

So now let’s talk about SNP quality assurance and, again, the concepts and assumptions behind it.  Of course, we’re gonna drop SNPs, because there are biases in the SNPs.  We want to reduce Type I error and/or reduce Type II error.  The general assumption behind all the methods for SNP filtration is that there are large biases in genotype calls correlated with a phenotype, and that’s gonna lead to many false positives.  

And further, because there are gonna be so many of them, it’s gonna be inefficient to manually examine all Type I errors that arise from these biases.  So what we choose to do is have less Type I error, because it’s embarrassing to publish a false positive.  But at the expense, perhaps, that we might miss some potential findings, if we had taken the time to look at every single association, the majority of which were spurious. 

So we also might make an assumption that the filters we’re gonna use for the analysis are adequate for additional analyses we don’t presently envision.  And I talked a bit about that already.  So these assumptions though – as I showed in my previous webinar – can dramatically not hold. 

Here’s a Q-Q plot of a GWAS we did on a 500-case study.  It was carefully randomized.  There were two real signals.  This Q-Q plot did not drop a single SNP from the GWAS.  We had done sample QC, including removing mismatched NSP-STYs.  But there were no spurious associations whatsoever, despite not removing any SNPs for departures from Hardy-Weinberg, for low call rate, etcetera. 

So it turns out that with careful experimental design, we can really question whether we need to be doing these filters.  Or if we’re gonna do these filters, I think we ought to run a GWAS both with the filters off and with the filters on, and then use the traditional principles of statistical quality assurance: let’s then carefully look at every association, because there are very few of them, spurious or real, and we can track down what, exactly, the cause is of the association. 

Is it an experimental problem, that somehow there’s a bias?  Or not?  So when you look at the additional filters, like filtering SNPs for low minor allele frequency, there are those same assumptions I talked about, as well as – we assume rare alleles are likely to be genotype errors.  Another assumption is that the Chi-square test will be used for association.



If you’re aware of this Chi-square test – it’s like a 2 x 2 table.  If you have very small counts in a table, consistent with what can happen with very low minor allele frequencies, you can get overestimates in the significance of the test.  Another reason why people might decide to filter SNPs for low minor allele frequency is there’s a low power to detect association. 

And so why not just remove that SNP to reduce multiple testing?  If there are only five individuals who even have that rare minor allele, we’re not gonna have power to find an association.  These are reasonable rationales under these conditions.  And, actually, these assumptions really were true in these first GWASs.  Particularly, you look at the Wellcome Trust study, which had seven common diseases with a common set of controls. 

Genotype errors were highly correlated with phenotypes, because the cases and controls were done differently.  The rare alleles tended to be genotype errors and so forth.  However, as we’ve found with good experimental design, these errors don’t have to be correlated.  Precise assays can have little or no genotype error.  You don’t have to do a Chi-square test. 

The Fisher’s exact test does not have a problem with overestimating significance.  And here’s a key thing: low minor allele frequencies can be significant, if you have a very large sample size.  Just a one percent minor allele, and you have 10,000 individuals – if it’s very penetrant, it could be highly significant.  And here’s another one: there are statistical tests that have been published recently, based on testing for rare variants, that aggregate the rare genotypes. 

So while you may not have power for one genotype, if you take a neighborhood of genotypes, say in a gene, and test the presence of a rare variant in any of several loci, you can have quite large power to detect associations for rare variants.  And so indiscriminately throwing away low minor allele SNPs in that regime would not be appropriate. 
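Two points from the last few paragraphs can be sketched together: how the Chi-square approximation and Fisher’s exact test differ on a sparse 2 x 2 table, and how rare variants in a gene can be collapsed into a single carrier indicator before testing. This is a simple collapsing illustration, not any of the specific published rare-variant tests alluded to above; all function names and the 0.01 cutoff are assumptions.

```python
from math import comb, erfc, sqrt


def chi_square_p(a, b, c, d):
    """Pearson chi-square p-value (1 df, no continuity correction) for the
    2 x 2 table [[a, b], [c, d]]. Assumes no row or column total is zero."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum the probabilities of all tables
    with the same margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    return sum(p for p in map(prob, range(low, high + 1))
               if p <= observed * (1 + 1e-9))


def collapse_rare_carriers(genotypes, mafs, maf_cutoff=0.01):
    """genotypes[i][j] is sample i's minor-allele count at locus j.
    Returns True for each sample carrying a rare allele at any locus."""
    rare = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return [any(sample[j] > 0 for j in rare) for sample in genotypes]


def carrier_table(carriers, is_case):
    """Return (case carriers, case non-carriers,
    control carriers, control non-carriers)."""
    pairs = list(zip(carriers, is_case))
    a = sum(1 for c, y in pairs if c and y)
    b = sum(1 for c, y in pairs if not c and y)
    c_ = sum(1 for c, y in pairs if c and not y)
    d = sum(1 for c, y in pairs if not c and not y)
    return a, b, c_, d


# Five carriers among 500 cases, none among 500 controls: the chi-square
# approximation reports a smaller p-value than the exact test.
print(chi_square_p(5, 495, 0, 500), fisher_exact_p(5, 495, 0, 500))
```

The table returned by `carrier_table` can be passed straight to `fisher_exact_p`, so a gene with several individually underpowered rare variants is tested once on carrier status.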

So similarly, we could go through the same story for filtering SNPs for low call rate.  The assumption that genotypes are three state, of course, is made there.  And that, of course, may not be true when you have copy variable regions.  So we do not advocate actually filtering copy probes based on low call rates, because it could well be that there are just multiple clusters in the genotype algorithm which assumed a three state model. 

Eventually, we’ll get calling algorithms that don’t assume a three state model, and there’ll be no reason to – we’ll probably retrieve those genotype calls.  Nevertheless, I should state about filtering SNPs for low call rate – this is probably the best filter, if you do have problems with batch effects, and to remove Type I error.  And so we’re not saying it’s a bad filter; it’s a very good one – but when these assumptions hold. 

So departures from Hardy-Weinberg equilibrium – again, we’ve done studies where, yes, there can be SNPs that have large errors due to problems with genotypes.  But if you randomize your experiment appropriately, those errors will not be correlated with the phenotypes, and they don’t cause a problem. 

Also, we might assume departure from Hardy-Weinberg equilibrium is just due to genotype calling errors.  Sometimes, departures are due to population structure.  And there are, again, ways to test association without dropping the samples, where you can correct for the population structure, such as Eigenstrat.  So with all the caveats and so forth, when we use CRLMM or, perhaps, Birdseed, which are approaches we’ve been using – and there are some novel and interesting ones coming up, like BEAGLECALL from Brian Browning, that we’d like to perhaps investigate some more and talk about in our next webinar. 

But when we use the various standard genotype calling approaches, and, we underscore, when we have badly batched studies, the standard thresholds that people are using are pretty good.  And again, the call rates are the most important leverage point.  We find if you have a very stringent call rate per SNP threshold – set it up at around 99 percent – almost all the other filters don’t really matter, and you’ve removed almost all the spurious Type I error. 
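For reference, the two per-SNP quantities discussed above, call rate and departure from Hardy-Weinberg equilibrium, can be computed as follows. This is a sketch using the textbook Chi-square test for Hardy-Weinberg (exact tests also exist); the function names and the `None` convention for no-calls are assumptions.

```python
from math import erfc, sqrt


def call_rate(calls):
    """Fraction of samples with a genotype call; None marks a no-call."""
    return sum(call is not None for call in calls) / len(calls)


def hwe_chi_square_p(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg
    equilibrium, given counts of the three genotype classes."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


print(call_rate(["AA", "AB", None, "BB"]))
print(hwe_chi_square_p(25, 50, 25))   # genotype counts in exact HWE proportions
print(hwe_chi_square_p(50, 0, 50))    # no heterozygotes at all
```

These values can be used either as filters or simply as annotations on each association result, so that a significant SNP can be checked for a quality problem afterwards.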

So family studies – there’s the additional thing of, of course, Mendelian errors.  We do have an approach in our software for measuring that now.  We’ve recently added it.  And so do these filtration procedures.  When there are Mendelian errors, we typically drop for the family.  If there’s an error in a single family, we’ll just drop that SNP for that family. 

Whereas if we see the SNP having Mendelian errors over and over in multiple families, we might drop the SNP entirely.  Verify the Q-Q plots are well behaved.  And then when you get to your association results, examine the cluster plots for significant findings.  One thing I like to do is – I don’t filter anything, but then I create a plot where I’ll plot the GWAS association. 

And then I’ll add an extra track with the Hardy-Weinberg departure p-values, an extra plot track with the call rates and so forth, and then drill down into the findings, and then look to see if they had a quality problem.  We should also be aware that when we’re dropping things because they have low call rate, that’s a loss of data.  Traditionally in statistics, if you could somehow correct the data point, that’s a good alternative. 

And so imputation is one way to do that.  When you’re looking at Q-Q plots – here’s, I think, a Wellcome Trust study of type 1 diabetes before any QC.  So there’s a huge inflation of Type I error.  We did QC.  There were still a lot of associations.  Turned out it was the HLA region.  So we wanted to verify that removing that cleaned up the Q-Q plot.  And so when we did, we had a very good Q-Q plot. 

As I mentioned before, if you do good experimental design, before dropping anything, you’ll see a beautiful Q-Q plot, in all cases that we’ve seen.  So correcting for batch effects – the real painful thing about batch effects in copy number studies – they’re a big problem in SNP studies, but they’re even worse in copy number studies.  When you do a principal component decomposition of the log ratios in copy number studies, you often see a shift between cases and controls. 

Even big subpopulations within case and controls have large shifts.  Whereas the SNP studies may not show much of a difference.  So whereas Eigenstrat’s showing there’s no deviation in population structure, the Eigenstrat or PCA approach we’re using on the log ratios is showing that there are large departures, not due to population structure, but due to batch effect differences. 



So we’ve modeled our approach of PCA off of Eigenstrat.  I’ve talked about this many times in previous webinars, where the gist of it is: when you have correlated shifts across populations and across thousands of alleles, that’s gonna show up in the first few principal components of a covariance matrix calculated off of the data.  And so that’s true in allele frequencies. 

It’s also true in batch effects, where you have large batches, perhaps a plate of 96 samples that all have systematic shifts in various directions.  So we can factor out those systematic shifts, hopefully leaving the small effects associated with disease, and potentially also correct for CNV population stratification. 

So the process we use is to perform PC analysis, excluding a top few components.  We’ve got to choose the right number of components through scree plots and Q-Q plots.  And I’m gonna talk about that more next.  But we have to be aware of the assumptions of the method, and I’ll talk about that some more. 
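A minimal sketch of the general idea, assuming NumPy is available: decompose the samples-by-markers log ratio matrix and rebuild it without the top few components. This is not the Eigenstrat code or the speaker’s implementation; centering by marker means and the function name are assumptions made for illustration.

```python
import numpy as np


def remove_top_components(log_ratios, n_components):
    """Remove the top principal components from a samples x markers matrix.

    Each marker (column) is mean-centered, the centered matrix is decomposed
    by SVD, the largest n_components singular values are zeroed, and the
    matrix is rebuilt with the marker means added back.
    """
    x = np.asarray(log_ratios, dtype=float)
    marker_means = x.mean(axis=0)
    centered = x - marker_means
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:n_components] = 0.0  # drop the dominant systematic shifts
    return (u * s_kept) @ vt + marker_means


# Toy data: 40 samples, 200 markers, with the first 20 samples (one "plate")
# sharing a systematic shift across markers.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.1, size=(40, 200))
data[:20] += rng.normal(0.0, 0.5, size=200)
corrected = remove_top_components(data, n_components=1)
print(np.abs(data[:20].mean(axis=0) - data[20:].mean(axis=0)).mean())
print(np.abs(corrected[:20].mean(axis=0) - corrected[20:].mean(axis=0)).mean())
```

The two printed values are the average per-marker difference between the two plates before and after correction. Removing too many components also removes real signal, which is why the number of components has to be chosen carefully.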

So the approach that we now recommend – and again, never park your brains when you’re using somebody’s method, but this has worked pretty well for us – what we do is we run association for a case control study on the phenotype, correcting for 0 component, 1 component, 2 components, all the way up to 30, 40, or so components. 

And here’s a plot of what the Q-Q plot looks like with no correction.  This is on the Wellcome Trust type 1 diabetes study.  And at 31 components, the least significant 90 percent of the data actually fits the X versus Y line quite well.  And so what we did here was – we actually have an automated procedure that does this, and we ran 1 through 60 components. 

And you see right here, around 31, was when the slope of the line of the expected versus actual of the less significant 90 percent of the data came very close to 1.  And then you can actually overcorrect, where you can suppress the signal, as well as the noise.  And in some of the previous work we’ve done, I think we’ve done some overcorrection.  In some old, old webinars of ours, we corrected the Wellcome Trust data with 100 components. 
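The selection rule can be sketched like this: for each candidate number of components, compute the slope of observed versus expected -log10 p-values over the least significant 90 percent of tests, then take the smallest number of components whose slope is close to 1. The automated procedure mentioned above is not described in detail, so the through-the-origin fit, the tolerance, and the function names here are assumptions.

```python
from math import log10


def qq_slope(p_values, fraction=0.9):
    """Slope (through the origin) of observed vs expected -log10(p) for the
    least significant `fraction` of tests. Assumes every p-value is > 0 and
    that at least one test falls in the selected fraction."""
    n = len(p_values)
    observed = sorted(-log10(p) for p in p_values)  # least significant first
    expected = sorted(-log10((i + 0.5) / n) for i in range(n))
    m = int(n * fraction)
    numerator = sum(e * o for e, o in zip(expected[:m], observed[:m]))
    denominator = sum(e * e for e in expected[:m])
    return numerator / denominator


def pick_component_count(slope_by_k, tolerance=0.05):
    """Smallest component count whose Q-Q slope is within tolerance of 1,
    or None if no candidate qualifies."""
    for k in sorted(slope_by_k):
        if abs(slope_by_k[k] - 1.0) <= tolerance:
            return k
    return None


# Hypothetical slopes from association runs corrected with k components.
print(pick_component_count({0: 1.8, 10: 1.3, 31: 1.02, 60: 0.9}))
```

Taking the smallest qualifying count, rather than the largest, is one way to guard against the overcorrection described above, where signal is suppressed along with noise.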

So it looks like 31, with our 1,500 controls from the National Blood Service and 2,000 cases, is appropriate to get good Type I error suppression.  And here’s before PCA correction on the Wellcome Trust type 1 diabetes.  You see maybe a signal here, but everything else is just ridiculously significant and dwarfs all the signal.  Whereas we’ve cleaned up the signal, and you see a few – there’s probably plenty of other technical artifacts in here. 

I’m not saying all these are true signals, by any means.  But we’ve substantially got Type I error under control.  We can now look at these associations and see if they make any sense.  A lot of associations, by the way, are T-cell artifacts that, again, are different, are experimental biases that come in.  So some benefits also of correcting for batch effects.  When you look at the log ratios – they clean up tremendously.  The signal and the noise improve. 

But there are some real limitations, again, as with every method.  And what are our assumptions?  We assume modestly large sample size.  This doesn’t work if you’ve got 20 samples.  We like to have hundreds or thousands.  We assume the first principal components contain the undesirable differences.  That might not be true for large pedigrees.  Certainly not true for the sex chromosomes.  So we have a way to exclude the sex chromosomes when we calculate our components. 

There are still some problems.  Small batch effects can remain uncorrected.  And unfortunately, because the copy variable regions represent a smaller proportion of the variability, they’re often the ones that can have problems with not being corrected as much as the neutral regions.  So there are approaches we’ve been doing to try to improve those things – correcting after centering by marker or sample means, calculating components on a subset of markers, and then using them for the rest.  

Well, I won’t go into them, for want of time, but probably the biggest take home message, again, is from my last webinar: better design of experiments is really the panacea for batch effects.  We’ve spent so much time fighting with batch effects, folks.  And even with these methods, there are just limitations to them that make us scream from the rooftops, “Please do better design of experiments.”  Because it makes life so much easier.   

We found another quality thing to beware of is outliers in CNV detection.  We’ve developed in our – and we’ll talk, probably in a future webinar, about this method – but we’ve developed a segmentation algorithm that’s not overly skewed by outliers – particularly in Illumina data, where homozygous deletions come in at like -6 units on the log ratio scale. 

And what we’ve found – we’ve investigated that Winsorizing is a very useful approach that will probably improve most methods for copy number calling, where you just do a first pass of knocking the data down.  Anything above, say, the top 0.001 percent – knock it down to the threshold of 0.001, and similarly at the bottom end of the scale.  Median smoothing is not a solution.  It helps in visually looking at the data, but it induces order that messes up calling algorithms. 
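A minimal sketch of Winsorizing as a first pass over log ratios: values beyond a small tail on each side are pulled back to the value at that tail boundary. The `tail_fraction` parameter and its default are illustrative assumptions, not the exact threshold used in the study.

```python
def winsorize(values, tail_fraction=0.001):
    """Clamp the most extreme values on each side to the boundary values.

    The int(n * tail_fraction) smallest values are raised to the next
    smallest value, and likewise the largest values are lowered.
    Assumes the list is non-empty and tail_fraction is below 0.5.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * tail_fraction)
    low, high = ordered[k], ordered[n - 1 - k]
    return [min(max(v, low), high) for v in values]


# One extreme negative probe (like a homozygous deletion on the log ratio
# scale) among otherwise ordinary values.
log_ratios = [0.01 * i for i in range(-500, 500)]
log_ratios[0] = -6.0
clamped = winsorize(log_ratios)
print(min(log_ratios), min(clamped))
```

Unlike median smoothing, this leaves the order and spacing of the non-extreme values untouched, so it does not introduce the artificial local structure that can confuse segmentation algorithms.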

By the way, just as, again, another sort of confirmation of the message: just because something’s unusual doesn’t mean it’s not real.  We looked at those single marker outliers that we see in these Illumina studies and Affymetrix studies, and found that when we filtered all of the next-generation sequencing deletions found in the Bentley study, the single marker SNPs that fell in those deletions confirmed by next-generation sequencing had, overwhelmingly, a very large negative mean. 

Although there were some single marker things that fell in the deletions that had high means as well.  So it’s hard to believe a single marker thing.  But many of the single marker, large negative, large positive values are, actually, a biological phenomenon, and not just noise. 

Another take home message of this slide is – here’s a region found in Affy and Illumina platforms as having a large deletion at Chromosome 15, and it was not found with next-generation sequencing.  So I think there’s plenty of room for use for the genome-wide arrays for copy number variation finding in the years to come, as the next-gen sequencing has its limitations as well. 

So I’d like to summarize and then open up for questions.  But I guess we’re scientists.  Sound scientific principles and scientific methods should be used in our QA procedures.  Don’t blindly use methods without verifying their assumptions hold in your environment.  And, hopefully, we’ve highlighted some of the assumptions and shown where sometimes they hold, and sometimes they don’t hold. 

And as the design of experiments is getting better, a lot of the assumptions on which the QA methods were founded actually no longer hold.  And so we have to adjust accordingly.  Expect the intended purpose of your experiment to change over time.  Never throw away your raw data.  And SNP QA methods are insufficient to surface CNV quality problems. 

We’ve found we’ve had to flag 20, 30 percent of some studies as having problematic copy number variation data, after it had passed through the usual rigorous quality criteria at very reputable core labs.  However, it is a fairly safe thing to say that SNP QA’s a good starting point for CNV QA, as well as – a CNV quality problem doesn’t mean you have to discard it for a SNP study, obviously. 



And again, careful experimental design changes the rules of the game.  And please see the previous webinar, where we spoke about this at length.  So just like to say if you’d like to learn more about how to do this yourself, we have a free trial of our software.  We have a lot of tutorials that go through these various methods that I’ve talked about.  Not every method I’ve talked about has a tutorial, but we’re working on them. 

And so if you’ve got a demanding need for one of them, please bug us about it, and we can prioritize our tutorial writing accordingly.  We’ve got some interesting data sets, archived webcasts – I think there are over 20 of them now – and links to a lot of published articles, primarily by our customers, who are finding all sorts of things using our software and using their expertise as well. 

Documentation.  A lot of the methods that I’ve talked about today are implemented in add-on scripts, and we have a scripts page, where you can download methods to basically plug into the software and add to them.  And even if you never use our software, and you’re a statistical genetics expert and want to know how we implemented a method, you could download the Python script and look at it.

And by all means, feel free to use it and make improvements to it and send it back to us.  And we can provide online training, if you’re trying to learn more about the methods.  A couple of years ago, we opened our doors to doing analytic engagements with our customers.  And we’ve found most of them have been very productive. 

We do now have a small amount of analytic bandwidth available, where we can do studies of all sorts with you.  How these engagements work – it’s very much a collaborative effort, where we work back and forth, typically meeting on a weekly basis during the most intense period of analysis. 

And we find we’re not a replacement for analysts.  We learn from you; you learn from us.  And so it can be a very educational and productive engagement for both sides, really.  We’re gonna send out an email with an application.  Just if you’re interested in working with us – no one’s making any commitments at this point.  But we’d like to send out an application.  And I’ll personally talk with each one of you who are interested in working with us. 

And we do everything from study design, plating strategies, genotype calling, CNV calling, QA steps, as well as the final stuff of the genome-wide SNP and CNV association.  And we’ve also done some interesting work on predictive modeling for diagnostics. 

So I’d like to really acknowledge a lot of the people that we’ve worked with, without whom we’d not really be able to have all of these learnings that we’ve communicated today.  I won’t read them all out loud here.  But we’ve collaborated with a number of these people on their studies, and are really grateful for the scientific exchange that’s occurred.  And we hope we can do more of it with you and with them.  

So I’d finally like to open up for the questions you have.  I’ve gone rather long here, but I’m willing to stay as long as you’re willing to stay.  And we will record this, as well as the question and answer period.  And so I’d like to also, while you’re typing in questions, make a plug for – Christophe Lange is going to be doing a webcast, we anticipate probably in January of next year, on a new paper of his – a useful quality control procedure he’s developed, both in the context of family-based studies as well as population-based studies, where you can look at a sample at a time and detect problems with departures from Hardy-Weinberg equilibrium. 

And we hope to put on, as part four of this series, a detailed discussion of genotype and CNV calling, covering methods that are state of the art, as well as some of the traditional ones – compare them and talk about really how to use them, and how we’ve used them, and some of the pitfalls and learnings that we’ve had.  

So with that, I’d like to again thank you all for your attention and interest in continuing to come to these webinars.  We hope they’re informative.  We try to bring something new every time.  And we love your feedback.  And I’d now like to open the floor for questions.  If you have to go, we understand.  But we will continue to answer the questions as long as there’s interest.  So thank you so much. 



“Do you have an option for relatedness in Helix – in our software right now?”  Actually, we do not.  I have been using PLINK myself.  There is an export procedure, so after you’ve done various QC work, you can easily export a PLINK format and run PLINK, if that’s a tool you’d like to use.  And we’d like to have that eventually.  But there have just been other more pressing things that we’ve been building so far.  But good point. 

So here’s a question: “Why do contaminated samples have excess relatedness to many samples?  What if a sample was contaminated by only one DNA sample at a different degree – like 1:1, 10:1, 20:1, or 100:1?”  You know I guess I don’t believe that it has to be – I think that’s a good point.  I don’t think you have to have excess relatedness to many samples. 

And so what you would then see is some sort of relatedness between two samples that would be – but then you’d look at the data and try to figure out what went wrong.  You’d probably see some other quality problems popping up.  So that’s a good point.  In this particular study that I was showing the data for, we did have that phenomenon, and so I’m not exactly sure why this particular sample showed up as being correlated in some way with everybody else.  I’ll have to think some more about that. 

I actually have Bryce Christensen in the room – has a thought of why that might be the case. 

Bryce Christensen:      I think it’s partially because when a sample has been contaminated, the genotypes for that sample will appear to be heterozygous at a large number of markers.  And by being heterozygous, it gives them a chance of being related to just about everybody, whether they are homozygous rare, homozygous common, or heterozygous as well.  So the contamination just increases the overall heterozygosity rate, which confuses the relatedness algorithms.

Good, good point.  Thanks Bryce.  “How can you tell these are XXY from the graph?”  I was looking at the mean value of the log ratios for a population.  And what we saw was the average intensity for X was consistent with females right in the middle of the distribution.  And the average intensity for Y was consistent with males. 

So if I pop that slide up again, the – let’s look at that.  Where was that?  Where’s that slide?  Is this it?  Yeah.  So up in the upper right here, you notice that the mean value of X is the same as that for all the other females.  And the mean value of Y is the same as that for all the other males. 

Now down here, we’ve got a – now it could also just be a cell line artifact too.  So that’s something to check as well.  So we can’t say for sure.  But we’d say there’s evidence for that, given the data.    

So again, if you have questions, you can type 'em in the question and answer pane.  We’ll go to the next question.  “When merging your data with HapMap samples for population stratification using PCA, distances between samples vary according to the size of the samples you merge – your sample versus the HapMap sample.  What do you recommend?  Maximizing the HapMap sample?  Same number of samples in your study?  Or minimize HapMap samples?”  

So yeah, this is true.  If you just do your study, and your study is fairly homogenous, then the spread will be very small.  Versus if you put it with the HapMap samples.  But the HapMap samples generally have some good anchors of these different populations.  We’re not necessarily looking for some absolute value of the principal components.  Certainly sample size and so forth could change. 

What you’re really looking at is relative differences between those anchor populations of, say, the African, the Asian, and the Caucasian.  And then in the HapMap 3, we’ve got those additional populations in there.  So it sounds like the problem that you’re trying to address is: how do I decide if something is sufficiently departed from my population of interest. 

And some approaches we’ve looked at there are – you could do a measure of the distance from the centroid, and then do the interquartile range quality criteria, to determine if your samples fall sufficiently far from, say, the CEPH samples.  And so you might want to just take CEPH samples in your population for that purpose, and do the principal components analysis.  Hopefully, that’s helpful.
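The centroid-distance idea from this answer might be sketched as follows. Deriving the cutoff from the reference samples’ own distances is one possible reading of the approach, so treat that choice, the function names, and the quartile convention as assumptions.

```python
import math
import statistics


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def flag_far_from_reference(reference_pcs, study_pcs, k=1.5):
    """Flag study samples whose distance from the reference centroid exceeds
    Q3 + k * IQR of the reference samples' own distances.

    Each sample is a tuple of its first few principal component scores.
    """
    center = centroid(reference_pcs)
    ref_dist = [math.dist(p, center) for p in reference_pcs]
    q1, _median, q3 = statistics.quantiles(ref_dist, n=4, method="inclusive")
    cutoff = q3 + k * (q3 - q1)
    return [math.dist(p, center) > cutoff for p in study_pcs]


# Toy reference cluster around the origin, plus two study samples.
reference = [(x * 0.01, y * 0.01) for x in range(-2, 3) for y in range(-2, 3)]
study = [(0.0, 0.01), (0.5, 0.5)]
print(flag_far_from_reference(reference, study))
```

Because only the upper fence is used, samples are flagged for being unusually far from the reference cluster, never for being unusually close to its center.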

“How to distinguish between true CNV and a cell line artifact, when there is only cell line DNA available?” Well, the cell line artifacts I’ve seen have mostly been entire chromosomes that have had extra copies or deletions – less deletions though than extra copies, for some reason, probably because of the amplification that happens in the cell line process. 

So probably if you have the deletion of an entire chromosome – or more often than not, it’s the gain of an entire chromosome – it’s pretty rare, and you should see some very serious phenotypic consequences for the individual.  Now are there cell line artifacts that are much smaller changes, in which case it’d be much more difficult to potentially tease those out?  So I guess go back to the phenotype. 

And, ultimately, you may not be able to disentangle the two.  So I’m not a big fan of using cell lines, if other DNA is available.  One of the messages I gave in the experimental design discussion of my previous webinar was all about having very homogenous sources for your DNA.  We see in the principal components analysis that you can often very easily tell cell line samples versus non cell line samples when you look at the log ratio data. 



“Could you mention those QA measures which cannot be tested by SVS, and need other external software?”  Well, I guess the identity by descent, one.  I’d have to look at the measures we’ve done and probably go address them in turn.  So sample call rate, we do.  X heterozygosity, we do. 

NSP-STY mismatches, we do.  But that script has not yet been made available on our website.  We have to kind of tidy it up a bit, before it’s ready for primetime.  Cryptic relatedness, we don’t do.  Population stratification, we do.  Derivative Log Ratio Spread, yes.  Wave effects, no.  That’s something we’ve used the PennCNV for – to measure it.  String value distribution, yes, via scripts. 

Again, mean X and Y intensity, yes.  Cell line artifacts – we have a script that can measure the average intensity across all chromosomes for a marker mapped data set.  And then segment over/under abundance, yes.  So it looks like we have some work to do on wave effects and cryptic relatedness, to have it all within the software. 

But we don’t necessarily advocate that we have everything, or we’ll always have everything.  I freely use the other academic packages, and give them credit where credit is due.  And we make interfaces available to export to some of them.  So it doesn’t have to be an either/or, if some of the features that we advocate using are not necessarily in our software. 

So here’s another question: “When talking about batch effects, are you distinguishing between laboratory effects that come about when samples are genotyped in different batches?  Or analyses effects when the data are analyzed in different batches?” 

That’s a good question.  When I’m talking about batch effects, I’m generally talking about laboratory effects when samples are genotyped in different batches.  When data are analyzed in different batches – I generally don’t analyze them in different batches.  So I will probably talk about it next time, when we talk about genotype calling approaches. 

The general consensus that I’ve come to, with various colleagues also at the MicroArray Quality Control (MAQC) consortium where we’ve worked on a number of studies, is that the best results come from analyzing all the samples in a single batch, and then having measures of confidence that are appropriately diminished when there are big differences between laboratory batches. 

Now if, indeed, you do have to analyze your data in batches, or you’re doing some sort of a meta analysis, where that’s unavoidable, then, of course, a lot of caveats I’ve made about batch effects – in a sense, you’re going to have both, potentially, a computational as well as a laboratory difference.  And again, those are some of the more challenging problems to address in GWAS. 

If I’m saying your name right – apologize if I’m not – asks: “Can you give us an example of a population sample, not case control, that the genotype error is correlated with phenotype?”  So I’m not sure what you mean by population sample, not case control.  Because you know that a case control is the population we study.  Perhaps you mean like a quantitative trait. 

But the gist of it is in a case control study, you may capture the cases and controls separately, or run them on different sets of plates – or you’re borrowing cases from someone else.  An example that’s not case control, again, would be if you’re borrowing samples from another study, and doing combined analysis.  Or another example is you’ve run a series of arrays over time, and the order in which you chose to run those arrays is somehow correlated with phenotype. 

Perhaps you just chose to analyze – you supplemented your data with extreme values of a quantitative trait at the end of your study.  And then you genotyped those all together.  Well, then if you’re looking at – you could have a confounding between differences in the phenotype, versus the fact that you actually ran those experiments, and did not intersperse those extreme value phenotypes along the way, as you ran your plates over time. 

So we’ve found that the biggest problem of where biases are introduced is the plate-to-plate difference.  And so what we seek to do is have a balanced, homogenous mixture of your different phenotypes on each plate.  And in our previous webinar, which is recorded on our website if you’d like to watch it, we discuss in-depth how we’ve done that.  And we could certainly talk to you more, if it’s a question you’re trying to answer for a specific problem.  There are special considerations you have to have for quantitative traits. 

So another question is: “How long does it take to run the CNV segmentation for a case control study with about 4,000 subjects?”  Well, I ran, over the weekend, a 5,000-person CNV study for the Illumina 610.  And it was done before I came back Monday morning, and I started it Friday evening.  But I was using five computers that each had 8 cores.  So it is quite compute intensive. 

But we’ve developed methodologies to take advantage of the multi-core processors.  And so relative to the amount of time it takes to genotype or do the laborious QC process, we find the actual segmentation process is not particularly burdensome.  If you’re running it on a dual core machine though, yes, it could take you weeks for a study of that size.  And so we wouldn’t recommend doing that. 

So a number of people have said thank you.  And I’d like to thank you all.  That’s all the questions that we’ve had, so far.  And so we will, of course, make this recording available, as soon as Josh can get it edited.  And again, feel free to browse our website content, to learn more.  We’ve really put up a lot of resources. 

And just very recently, we overhauled our website, so got to give Josh Forsythe a lot of credit for that.  Josh has also helped a lot with product management.  And so feel free to communicate to him your thanks, as well as suggestions for improvements that we can make, to make your genome-wide SNP and copy number studies easier, faster, and more productive.

So with that, I’d like to close the webinar.  And again, thank you for attending.  And look for announcements of both an opportunity to do a services engagement with us, if that’s of interest to you, as well as announcements of our upcoming webcasts in the new year.  So wish you all wonderful holidays, and hope you get some rest, or at least get some good analysis done on your vacation.  And we’ll look forward to talking more in the weeks to come and in the new year.  Take care now.  Bye-bye. 



© 2012 Golden Helix, Inc