Genome-wide complex trait analysis


Genome-wide complex trait analysis (GCTA), or genome-based restricted maximum likelihood (GREML), is a statistical method for variance component estimation in genetics which quantifies the total narrow-sense contribution of a particular subset of genetic variants to a trait's heritability. This is done by directly quantifying the chance genetic similarity of unrelated individuals and comparing it to their measured similarity on a trait; if two unrelated individuals are relatively similar genetically and also have similar trait measurements, then the measured genetics are likely to causally influence that trait, and the correlation can to some degree tell how much. This can be illustrated by plotting the squared pairwise trait differences between individuals against their estimated degree of relatedness.
The GCTA framework can be applied in a variety of settings. For example, it can be used to examine changes in heritability over aging and development. It can also be extended to analyse bivariate genetic correlations between traits. There is an ongoing debate about whether GCTA generates reliable or stable estimates of heritability when used on current SNP data. Critics argue that the method rests on an outdated and false dichotomy of genes versus the environment, and that it suffers from serious methodological weaknesses, such as susceptibility to population stratification.
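The comparison of pairwise genetic similarity with pairwise trait differences can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not the REML procedure used by the GCTA software: it uses a Haseman-Elston-style regression of squared trait differences on relatedness, a simpler moment-based estimator. All function names, sample sizes, and the simulated heritability of 0.5 are assumptions made for the example.

```python
import numpy as np

def simulate_cohort(n=2000, m=500, h2=0.5, seed=1):
    """Simulate n unrelated people, m independent SNPs, and an additive trait."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.05, 0.5, size=m)                 # allele frequencies
    geno = rng.binomial(2, freqs, size=(n, m)).astype(float)  # 0/1/2 allele counts
    p_hat = geno.mean(axis=0) / 2.0                        # sample allele frequencies
    z = (geno - 2.0 * p_hat) / np.sqrt(2.0 * p_hat * (1.0 - p_hat))
    g = z @ rng.normal(0.0, 1.0, size=m)                   # additive genetic values
    g *= np.sqrt(h2 / g.var())                             # rescale to variance h2
    y = g + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)     # add environmental noise
    return z, y

def genetic_relationship_matrix(z):
    """Average product of standardised genotypes for every pair of people."""
    return z @ z.T / z.shape[1]

def he_regression_h2(grm, y):
    """Regress squared trait differences on relatedness over all distinct pairs.

    With a standardised trait, E[(y_i - y_j)^2] = 2 - 2 * h2 * A_ij,
    so the heritability estimate is minus half the fitted slope.
    """
    y = (y - y.mean()) / y.std()
    i, j = np.triu_indices(len(y), k=1)
    sq_diff = (y[i] - y[j]) ** 2
    slope, _intercept = np.polyfit(grm[i, j], sq_diff, 1)
    return -slope / 2.0

if __name__ == "__main__":
    z, y = simulate_cohort()
    A = genetic_relationship_matrix(z)
    print("estimated SNP heritability:", round(he_regression_h2(A, y), 3))
```

Because the simulated individuals are unrelated, the off-diagonal relatedness values are all close to zero, which is why a large number of pairs is needed before the slope becomes estimable.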
GCTA heritability estimates are useful because they provide lower bounds for the genetic contributions to traits such as intelligence without relying on the assumptions used in twin studies and other family and pedigree studies, thereby corroborating them and enabling the design of well-powered genome-wide association studies (GWAS). Conversely, a GCTA estimate near zero would imply one of three things: a) there is no genetic contribution, b) the genetic contribution is entirely in the form of genetic variants not included, or c) the genetic contribution is entirely in the form of non-additive effects such as epistasis/dominance. Running GCTA on individual chromosomes and regressing the estimated proportion of trait variance explained by each chromosome against that chromosome's length can reveal whether the responsible genetic variants cluster, are distributed evenly across the genome, or are sex-linked. Chromosomes can of course be replaced by more fine-grained or functionally informed subdivisions. Examining genetic correlations can reveal to what extent observed correlations, such as between intelligence and socioeconomic status, are due to the same genetic influences, and in the case of diseases, can indicate shared causal pathways such as can be inferred from the genetic variation jointly associated with schizophrenia and other mental diseases or reduced intelligence.
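The per-chromosome regression described above can be sketched as follows. The per-chromosome heritability values here are synthetic (generated as proportional to chromosome length plus noise) and are not results from any study; the chromosome lengths are approximate rounded values for the GRCh38 autosomes, and the function names are hypothetical.

```python
import numpy as np

# Approximate autosome lengths in megabases (GRCh38, rounded), chromosomes 1-22.
CHROM_LENGTH_MB = np.array(
    [248, 242, 198, 190, 182, 171, 159, 145, 138, 134, 135,
     133, 114, 107, 102, 90, 83, 80, 59, 64, 47, 51], dtype=float)

def synthetic_per_chromosome_h2(total_h2=0.45, noise_sd=0.003, seed=0):
    """Made-up per-chromosome estimates: proportional to length plus noise."""
    rng = np.random.default_rng(seed)
    expected = total_h2 * CHROM_LENGTH_MB / CHROM_LENGTH_MB.sum()
    return expected + rng.normal(0.0, noise_sd, size=CHROM_LENGTH_MB.size)

def regress_on_length(h2_by_chrom, lengths=CHROM_LENGTH_MB):
    """Least-squares line of variance explained against chromosome length."""
    slope, intercept = np.polyfit(lengths, h2_by_chrom, 1)
    r_squared = np.corrcoef(lengths, h2_by_chrom)[0, 1] ** 2
    return slope, intercept, r_squared

if __name__ == "__main__":
    slope, intercept, r2 = regress_on_length(synthetic_per_chromosome_h2())
    print(f"slope per Mb: {slope:.2e}, intercept: {intercept:.4f}, R^2: {r2:.2f}")
```

A strong linear fit is what an evenly distributed, highly polygenic trait would be expected to produce; chromosomes falling far from the line would suggest clustering of the responsible variants.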

History

Estimating variance components such as heritability, shared environment, and maternal effects in biology and animal breeding with standard ANOVA/REML methods typically requires individuals of known relatedness, such as parents and children. Such pedigree data is often unavailable or unreliable, which either prevents the methods from being applied or requires strict laboratory control of all breeding. Several authors noted that relatedness could instead be measured directly from genetic markers, leading Kermit Ritland to propose in 1996 that directly measured pairwise relatedness could be compared to pairwise phenotype measurements.
As genome sequencing costs dropped steeply over the 2000s, acquiring enough markers on enough subjects for reliable estimates using very distantly related individuals became possible. An early application of the method to humans came with Visscher et al. 2006/2007, which used SNP markers to estimate the actual relatedness of siblings and to estimate heritability directly from the genetic data. In humans, unlike the original animal/plant applications, relatedness is usually known with high confidence in the 'wild population', and the benefit of GCTA lies more in avoiding the assumptions of classic behavioral genetics designs, verifying their results, and partitioning heritability by SNP class and chromosome. The first use of GCTA proper in humans was published in 2010, finding that 45% of the variance in human height can be explained by the included SNPs. The GCTA algorithm was then described and a software implementation published in 2011. It has since been used to study a wide variety of biological, medical, psychiatric, and psychological traits in humans, and has inspired many variant approaches.

Benefits

Robust heritability

Twin and family studies have long been used to estimate variance explained by particular categories of genetic and environmental causes. Across a wide variety of human traits studied, there is typically minimal shared-environment influence, considerable non-shared environment influence, and a large genetic component, which is on average ~50% and sometimes much higher for some traits such as height or intelligence. However, the twin and family studies have been criticized for their reliance on a number of assumptions that are difficult or impossible to verify, such as the equal environments assumption, that there is no misclassification of zygosity, that twins are representative of the general population, and that there is no assortative mating. Violations of these assumptions can result in both upwards and downwards bias of the parameter estimates.
The use of SNP or whole-genome data from unrelated participants bypasses many of these criticisms: twins are often entirely uninvolved, there are no questions of equal treatment, relatedness is estimated precisely, and the samples are drawn from a broad variety of subjects.
In addition to being more robust to violations of the twin study assumptions, SNP data can be easier to collect since it does not require rare twins, so heritability can also be estimated for rare traits.

GWAS power

GCTA estimates can be used to resolve the missing heritability problem and design GWASes which will yield genome-wide statistically-significant hits. This is done by comparing the GCTA estimate with the results of smaller GWASes. If a GWAS of n=10k using SNP data fails to turn up any hits, but the GCTA indicates a high heritability accounted for by SNPs, then that implies that a large number of variants are involved and thus that much larger GWASes will be required to accurately estimate each SNP's effect and directly account for a fraction of the GCTA heritability.
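The sample-size argument above can be made concrete with a rough power calculation. The sketch below assumes, purely for illustration, that a SNP heritability of 0.3 is spread evenly over 10,000 causal variants, and it uses a normal approximation to a single-SNP association test at the conventional genome-wide threshold of 5e-8. The function name and scenario numbers are assumptions, not values from any study.

```python
from math import sqrt
from statistics import NormalDist

def gwas_power(n, variance_explained, alpha=5e-8):
    """Approximate power of a two-sided 1-df test for a single SNP.

    variance_explained is the share of trait variance due to that SNP.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Expected z-score of the SNP at sample size n.
    shift = sqrt(n * variance_explained / (1.0 - variance_explained))
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

if __name__ == "__main__":
    snp_h2, n_causal = 0.3, 10_000      # hypothetical scenario
    per_snp = snp_h2 / n_causal         # variance explained by each variant
    for n in (10_000, 100_000, 1_000_000):
        power = gwas_power(n, per_snp)
        print(f"n={n:>9,}: power per SNP = {power:.3g}, "
              f"expected hits = {n_causal * power:.1f}")
```

Under these assumptions, a study of 10,000 people has essentially no chance of detecting any single variant even though the aggregate SNP heritability is substantial, which is the pattern the comparison of GCTA and GWAS results is meant to expose.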

Disadvantages

  1. Limited inference: GCTA estimates are inherently limited in that they cannot estimate broad-sense heritability like twin/family studies, as they only estimate the heritability due to SNPs. Hence, while they serve as a critical check on the unbiasedness of the twin/family studies, GCTA estimates cannot replace them for estimating total genetic contributions to a trait.
  2. Substantial data requirements: the number of SNPs genotyped per person should be in the thousands and ideally the hundreds of thousands for reasonable estimates of genetic similarity; and the number of persons, for somewhat stable estimates of plausible SNP heritability, should be at least n>1000 and ideally n>10000. In contrast, twin studies can offer precise estimates with a fraction of the sample size.
  3. Computational inefficiency: the original GCTA implementation scales poorly with increasing data size, so even if enough data is available for precise GCTA estimates, the computational burden may be infeasible. This has motivated the creation of faster implementations and variant algorithms which make different assumptions, such as using moment matching. GCTA estimates can also be meta-analyzed with a standard precision-weighted fixed-effect meta-analysis, so research groups sometimes estimate cohorts or subsets separately and then pool them meta-analytically.
  4. Need for raw data: GCTA requires the genetic similarity of all subjects and thus their raw genetic information; due to privacy concerns, individual patient data is rarely shared. GCTA cannot be run on the summary statistics reported publicly by many GWAS projects, and if multiple GCTA estimates are to be pooled, a meta-analysis must be performed.
     In contrast, there are alternative techniques which operate on summaries reported by GWASes without requiring the raw data. For example, "LD score regression" contrasts linkage disequilibrium statistics with the public summary effect sizes to infer heritability and estimate genetic correlations/overlaps of multiple traits. The Broad Institute runs a service which provides a public web interface to >=177 traits with LD score regression. Another method using summary data is HESS.
  5. Unreliable confidence intervals: confidence intervals may be incorrect, may fall outside the 0-1 range of heritability, and may be highly imprecise because they rely on asymptotic approximations.
  6. Underestimation of SNP heritability: GCTA implicitly assumes all classes of SNPs, rarer or commoner, newer or older, more or less in linkage disequilibrium, have the same effects on average; in humans, rarer and newer variants tend to have larger and more negative effects as they represent mutation load being purged by negative selection. As with measurement error, this will bias GCTA estimates towards underestimating heritability.
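The precision-weighted fixed-effect pooling mentioned in the list above, and the way a symmetric confidence interval can stray outside the 0-1 range, can be sketched as follows. The cohort estimates and standard errors are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt

def fixed_effect_meta(estimates, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    return pooled, sqrt(1.0 / total_weight)

def wald_interval(estimate, standard_error, z=1.96):
    """Symmetric normal-approximation interval; not constrained to [0, 1]."""
    return estimate - z * standard_error, estimate + z * standard_error

if __name__ == "__main__":
    # Hypothetical SNP-heritability estimates from three cohorts.
    h2 = [0.30, 0.40, 0.25]
    se = [0.10, 0.20, 0.05]
    pooled, pooled_se = fixed_effect_meta(h2, se)
    print("pooled estimate:", round(pooled, 3), "SE:", round(pooled_se, 3))
    # A small estimate with a wide standard error gives a negative lower bound.
    print("interval for 0.05 +/- 0.10:", wald_interval(0.05, 0.10))
```

The pooled standard error is smaller than that of any single cohort, which is the motivation for pooling; the second call shows how a normal-approximation interval around a small estimate can include impossible negative heritabilities.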

Interpretation

GCTA estimates are often misinterpreted as "the total genetic contribution"; because they are often much lower than the twin study estimates, the twin studies are then presumed to be biased and the genetic contribution to a particular trait is presumed to be minor. This is incorrect, as GCTA estimates are lower bounds.
A more accurate interpretation is that GCTA estimates are the expected amount of variance that could be predicted by an indefinitely large GWAS using a simple additive linear model in a particular population at a particular time, given the limited selection of SNPs and a trait measured with a particular amount of precision. Hence, there are many ways to exceed GCTA estimates:
  1. SNP genotyping data is typically limited to 200k-1m of the most common or scientifically interesting SNPs, though 150 million+ have been documented by genome sequencing; as genotyping prices drop and arrays become more comprehensive, or whole-genome sequencing replaces SNP genotyping entirely, the expected narrow-sense heritability will increase as more genetic variants are included in the analysis. The selection can also be expanded considerably using haplotypes and imputation; e.g. Yang et al. 2015 finds that with more aggressive use of imputation to infer unobserved variants, the height GCTA estimate expands to 56% from 45%, and Hill et al. 2017 finds that expanding GCTA to cover rarer variants raises the intelligence estimates from ~30% to ~53% and explains all the heritability in their sample; for 4 traits in the UK Biobank, imputing raised the SNP heritability estimates. Additional genetic variants include de novo mutations/mutation load and structural variations such as copy-number variations.
  2. Narrow-sense heritability estimates assume simple additivity of effects, ignoring interactions. As some trait values will be due to these more complicated effects, the total genetic effect will exceed that of the subset measured by GCTA, and as the additive SNPs are found and measured, it will become possible to find interactions as well using more sophisticated statistical models.
  3. All correlation and heritability estimates are biased downwards towards zero by the presence of measurement error; the need to adjust for this leads to techniques such as Spearman's correction for measurement error, as the underestimate can be quite severe for traits where large-scale and accurate measurement is difficult and expensive, such as intelligence. For example, an intelligence GCTA estimate of 0.31, based on an intelligence measurement with imperfect test-retest reliability, would after correction be a true estimate of ~0.48, indicating that common SNPs alone explain roughly half of the variance. Hence, a GWAS with a better measurement of intelligence can expect to find more intelligence hits than indicated by a GCTA based on a noisier measurement.
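The measurement-error correction in the last item amounts to dividing the observed estimate by the reliability of the trait measurement. The short sketch below illustrates this; the reliability is not stated in the text above, so the value of 0.65 used here is an assumption, chosen because it reproduces the quoted corrected figure of ~0.48. The function name is hypothetical.

```python
def correct_for_attenuation(observed_h2, reliability):
    """Disattenuate a heritability estimate for phenotype measurement error.

    reliability is the share of measured trait variance that is true-score
    variance (for example a test-retest reliability), between 0 and 1.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return observed_h2 / reliability

if __name__ == "__main__":
    # 0.31 is the estimate quoted in the text; 0.65 is an assumed reliability.
    print(round(correct_for_attenuation(0.31, 0.65), 2))
```

Note that the division is by the reliability itself rather than its square root, because heritability is a proportion of variance and only the phenotype, not the genotype data, is assumed to be measured with error here.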

Implementations

The original "GCTA" software package is the most widely used; its primary functionality covers the GREML estimation of SNP heritability, but it also includes other functionality.
Other implementations and variant algorithms also exist.
GCTA estimates frequently fall in the range 0.1-0.5, consistent with broad-sense heritability estimates. Univariate GCTA has been applied to a wide range of traits.