Quantitative Trait Loci: Methods and Protocols

Methods in Molecular Biology, Volume 195
Quantitative Trait Loci: Methods and Protocols
Edited by Nicola J. Camp and Angela Cox
Humana Press

Association Studies
Jennifer H. Barrett

1. Introduction

A classical case-control study design is frequently used in genetic epidemiology to investigate the association between genotype and the presence or absence of disease. Association studies can also be useful in the investigation of quantitative traits. The aim of such studies is to test for association at the population level between the quantitative trait and genotype at a particular locus.

Whether investigating qualitative or quantitative traits, such studies depend on the prior identification of a candidate gene or genes. The genotyped locus could either be a polymorphism within a potentially trait-affecting gene or a marker in linkage disequilibrium with such a gene. Currently, screening of the whole genome is only feasible using linkage analysis, which is discussed elsewhere, because linkage extends over much greater distances than does linkage disequilibrium.

Quantitative trait association studies are based on a sample of unrelated subjects from the population. Various sampling designs are possible, including random sampling and sampling on the basis of an extreme phenotype. The advantages and disadvantages of these alternative designs are discussed in this chapter.

The basic method of analysis is analysis of variance (see Subheading 2.1), a standard statistical technique for testing for differences in mean between two or more groups on the basis of a comparison of between- and within-group variances. An alternative, if subjects are sampled on the basis of extreme phenotype, is to compare genotypes between groups with high and low trait values (see Subheading 2.2).
From: Methods in Molecular Biology, vol. 195: Quantitative Trait Loci: Methods and Protocols. Edited by: N. J. Camp and A. Cox. Humana Press, Inc., Totowa, NJ

2. Methods

2.1 Analysis of Variance and Linear Regression

The standard approach to the analysis of quantitative trait association studies assumes the following model. The phenotype $y_{ij}$ of individual i with genotype j at the locus of interest is given by

    $y_{ij} = \mu_j + e_i$    (1)

where $\mu_j$ is the mean for the jth genotype and $e_i$ represents residual environmental and possibly polygenic effects for individual i, assumed to be Normally distributed with mean 0 and variance $\sigma_e^2$.

The data required consist of measured phenotypes and genotypes on a sample of unrelated individuals. The parameters $\mu_j$ are estimated in the obvious way by the mean values of individuals with genotype j. The F-statistic from analysis of variance (ANOVA), the ratio of between- and within-genotype variances, is used to test for association between genotype and phenotype, because under the null hypothesis that all genotypes have the same mean and variance, this ratio should be close to 1. This approach has been called the measured-genotype test (1), in contrast to earlier biometrical methods that use information on the distribution of the phenotype only (i.e., with unmeasured genotype), discussed briefly in Note 1.

Equivalently, a linear regression analysis of phenotype on genotype can be carried out, possibly including as covariates other factors that may be related to phenotype. Where the genotype is determined by one biallelic polymorphism (with possible genotypes AA, AB, and BB), a test for trend is provided by regressing the phenotype on the number of copies of the A allele.

There are many examples of this type of approach in the literature. For example, O'Donnell et al. (2) used multiple linear regression to investigate the relationship between diastolic blood pressure and different genotypes of the angiotensin-converting enzyme (ACE) gene. Hegele et al. (3) used analysis of variance to demonstrate association between serum concentrations of creatinine and urea and the gene encoding angiotensinogen (AGT).

2.2 Analysis of Extreme Groups

An alternative approach is to use a sampling scheme that selects individuals on the basis of extreme phenotypes (4,5). There is a considerable literature on the use of such sampling schemes for sibling pair linkage studies (e.g., ref. 6). Extreme sampling is advocated to increase power and efficiency, as extremes are more informative. The approach is particularly useful when the phenotype is relatively easy to measure, so that large numbers of individuals can easily be screened to select extremes for genotyping.

In association studies adopting this method, individuals are randomly selected conditional on their phenotype being below a specified lower threshold or exceeding a specified upper threshold. Alternatively, the upper and lower n percentiles of a random sample from the population may be included. A cross-tabulation is then formed by classifying subjects by genotype and by high/low phenotype. The genotype frequencies are then compared between subjects with high and low trait values using a chi-squared test. For example, Hegele et al. (3) compared allele and genotype frequencies at the AGT locus in subjects in the lowest and highest quartiles of serum creatinine and urea levels.

3. Interpretation

In common with association studies for qualitative traits, a significant association does not demonstrate an effect of the polymorphism considered, because it may also arise through linkage disequilibrium with another locus. A further similarity is that population admixture can lead to spurious associations. For this reason, family-based approaches, such as the transmission-disequilibrium test for quantitative traits (7), have been developed (see Chapter 5).

3.1 Heterogeneity

Published results of associations with quantitative traits, as with qualitative traits, are not always in agreement. Because for
most complex traits the effect of any one locus is likely to be small, individual studies are often not sufficiently powerful to detect association. To address this issue, Juo et al. (8) carried out a meta-analysis of studies investigating association between apolipoprotein AI levels and variants of the apolipoprotein gene, which had produced conflicting results. This is a potentially useful approach, but may be flawed by publication bias, which is likely to be more of an issue in epidemiological studies than in clinical trials. There is also an assumption that patients are genetically and clinically homogeneous, with similar environmental exposures.

3.2 Using Extremes

An important consideration when using extreme sampling strategies (as outlined in Subheading 2.2) is that extremes may be untypical of the quantitative trait as a whole, in that they may be under the influence of other genes. A clear example of this, cited in ref. 4, is that studying individuals with achondroplastic dwarfism would be inappropriate if the primary interest were in identifying genes controlling height.

3.3 Power of Association Studies

An attractive feature of association studies is that they may require smaller sample sizes than methods based on linkage (9). Schork et al. (5) investigated analytically the power of the extreme sampling method (Subheading 2.2) to detect association between the trait and a single biallelic marker in linkage disequilibrium with a trait-affecting locus. Power depends on many factors, including locus-specific heritability, degree of linkage disequilibrium, allele frequencies, mode of inheritance, and choice of threshold. In some settings, overall sample sizes of fewer than 500 provided adequate power to detect association with a locus accounting for 10% of the trait variance.

The power of several methods of analysis, variants of those described here, has been compared in a simulation study (10). Under the models considered, ANOVA/linear regression (see Subheading 2.1) generally performed better than a variant of the extremes method (see Subheading 2.2) based on the same number of genotyped individuals, as most of the information on phenotype is lost by categorizing into "high" and "low" values. As with any method based on selective sampling, another drawback of the extremes method is that a larger number of subjects must be phenotyped to achieve the same sample size for analysis.

The same authors suggested a variation on ANOVA/linear regression, the truncated measured genotype (TMG) test, where only extremes are included in the analysis (see Note 4). This TMG test was found to be more powerful than ANOVA/linear regression for the same sample size of genotyped individuals, although, again, a larger number of subjects must be phenotyped to achieve this. These results are, however, dependent on the underlying genetic model. Allison et al. (4) showed that extreme sampling can actually lead to a decrease in power in the presence of another gene influencing the trait.

Page and Amos (10) also found that variants of ANOVA/linear regression and of the TMG test that are based on alleles were more powerful than the genotype-based methods discussed earlier. In these approaches, the phenotype of each individual contributes to two groups, one for each allele, or, in the case of homozygotes, contributes twice to one
group. Allele-based methods, which "double the sample size," are generally only valid under the assumption of Hardy–Weinberg equilibrium (11). Furthermore, the greater power of this approach is to be expected for the models used in these simulations, all of which assumed an additive effect of the trait allele, and may not apply more generally.

Long and Langley (12) investigated the power to detect association using a number of single nucleotide polymorphisms in the region of a quantitative trait locus, but excluding the functional locus itself. Their test statistic was based on ANOVA (see Subheading 2.1); the significance of the largest F-statistic obtained from any marker was estimated from its empirical distribution based on 1000 random permutations of the phenotype/marker data. From their simulations, they concluded that, using about 500 individuals, there was generally sufficient power to detect association if 5–10% of the phenotypic variation was attributable to the locus. Furthermore, tests using single markers had greater power than haplotype-based tests. The latter were based on comparing mean trait values across all distinct haplotypes, and the authors concede that other haplotype-based tests making use of additional information may perform better.

4. Software

The basic methods described in this chapter can be carried out in standard statistical software packages such as Stata (13), which is used here, SAS, or SPSS. The data would generally be expected to consist of one record for each subject, recording their measured trait value, their genotype, and any covariates of interest.

5. Worked Example

5.1 Analysis of Variance

An insertion/deletion (I/D) polymorphism of the ACE gene is associated with plasma ACE levels in some populations. Plasma ACE levels were measured and I/D genotype obtained for 300 Pima Indians to investigate the relationship in this population (14). The data consist of 300 records, including ACE levels (ranging up to 238 units) and genotype (II, ID, or DD). In Stata, ANOVA can be carried out by the command

    oneway ace_leve ace_geno, tabulate

where ace_leve and ace_geno are the variables for ACE levels and genotype, respectively. This produces Tables 1 and 2.

Table 1. Summary Data on ACE Levels According to Genotype

    ace_geno   Mean        Std. dev.   Freq.
    II         74.496732   31.729764   153
    ID         90.233871   39.484505   124
    DD         103.73913   46.564928    23
    Total      83.243333   37.475487   300

Table 2. Analysis of Variance Results for the Data in Table 1

    Source           SS           df    MS           F       Prob > F
    Between groups   27426.3358     2   13713.1679   10.38   0.0000
    Within groups    392492.901   297   1321.52492
    Total            419919.237   299   1404.41216

Table 1 is produced by specifying the tabulate option after the oneway command (for one-way analysis of variance) and provides useful summary information. In addition to the mean ACE levels within each genotype group (i.e., estimates of $\mu_1$, $\mu_2$, and $\mu_3$), the standard deviation and the number of subjects with each genotype are displayed. It can be seen that individuals with the DD genotype have much higher levels on average than those with the II genotype, with intermediate levels found in heterozygotes.

Table 2 is the basic ANOVA table. The total variability of the data is measured by the total sum of squares (419,919), i.e., the sum of squares of the differences between each of the observations and the overall mean. This figure can be separated into the between-genotype sum of squares (the sum, over all observations, of the squared difference between the observation's group mean and the overall mean) and the within-genotype sum of squares (the sum of squares of the differences between each observation and the mean for the corresponding genotype). These are used to estimate the corresponding variances, shown in the mean square (MS) column, by dividing by the number of degrees of freedom. [The number of degrees of freedom is one less than the number of groups or of observations within groups (i.e., 3−1 = 2 for between genotypes and 152+123+22 = 297 within genotypes).]
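
The decomposition just described can be checked without the raw data, because the between- and within-genotype sums of squares depend only on the group sizes, means, and standard deviations reported in Table 1. What follows is a minimal sketch in Python rather than part of the chapter's Stata workflow; the function name anova_from_summary and the dictionary layout are illustrative assumptions. The closed-form p-value used at the end is a general property of the F-distribution with two numerator degrees of freedom and would not apply to other designs.

```python
# Rebuild the one-way ANOVA table from per-genotype summaries.
# Each entry is (n, mean, standard deviation), copied from Table 1.
groups = {
    "II": (153, 74.496732, 31.729764),
    "ID": (124, 90.233871, 39.484505),
    "DD": (23, 103.73913, 46.564928),
}


def anova_from_summary(groups):
    """One-way ANOVA quantities computed from group sizes, means and SDs.

    Hypothetical helper for illustration; not a Stata or library routine.
    """
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

    # Between-genotype SS: each subject contributes the squared distance
    # of its group mean from the overall mean.
    ss_between = sum(n * (mean - grand_mean) ** 2
                     for n, mean, _ in groups.values())
    # Within-genotype SS: (n - 1) * variance recovers each group's sum of
    # squared deviations from its own mean.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    r_squared = ss_between / (ss_between + ss_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f_stat": f_stat,
        "r_squared": r_squared,
    }


result = anova_from_summary(groups)

# With exactly 2 numerator degrees of freedom, the upper tail of the
# F-distribution has the closed form (1 + 2F/d2) ** (-d2 / 2).
p_value = (1 + 2 * result["f_stat"] / result["df_within"]) ** (
    -result["df_within"] / 2
)

for name, value in result.items():
    print(f"{name}: {value:.4f}")
print(f"p_value: {p_value:.2e}")
```

Run on the Table 1 summaries, this should reproduce the sums of squares, the F-statistic of 10.38, and the R-squared of 0.0653 reported for the worked example, up to rounding in the published means and standard deviations.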
The F-statistic (10.38) is the ratio of these estimated variances. Under the null hypothesis of no difference between groups, its expected value is approximately 1 and it should follow an F-distribution with (2, 297) degrees of freedom. In this case, there is overwhelming evidence for a difference in level according to genotype: the differences in Table 1 are very unlikely to be the result of random variation.

The analysis of variance table (Table 2) can also be obtained by using the Stata command

    anova ace_leve ace_geno

This gives the additional information

    R-squared = 0.0653

indicating that the I/D genotype explains 6.5% of the variance in plasma ACE levels in this population. Slightly different output, but exactly the same F-test and estimate of R-squared, can alternatively be obtained by carrying out a regression analysis:

    xi: regress ace_leve i.ace_geno

The i. in front of the ACE genotype variable shows that this is to be treated as a categorical variable in the analysis. If, instead, interest was in testing for a trend in ACE levels with the number of D alleles, then genotype could be coded as 0, 1, or 2 to indicate the number of D alleles, and the following regression carried out:

    regress ace_leve ace_geno

This produces an F-statistic of 20.77 on (1, 298) degrees of freedom.

5.2 Analysis of Extremes

Using the same dataset, a new variable is created, recording the appropriate quantile for each subject's ACE level. In this example, quintiles are used, creating groups of approximately 60 subjects. This is easily done in Stata as follows:

    xtile acegp5=ace_leve, nq(5)

A chi-squared test is then carried out comparing the top and bottom quintiles:

    tab acegp5 ace_geno if acegp5==1 | acegp5==5, chi row

producing Table 3.

Table 3. Genotype Frequencies in Two Extreme Groups Defined by the Top and Bottom Quintiles of ACE Levels (a)

    Five quantiles            ace_geno
    of ace_leve    II           ID           DD           Total
    1              39 (62.90)   20 (32.26)    3 (4.84)     62 (100.00)
    5              17 (28.33)   33 (55.00)   10 (16.67)    60 (100.00)
    Total          56 (45.90)   53 (43.44)   13 (10.66)   122 (100.00)

    Entries are counts with row percentages in parentheses.
    (a) Pearson chi2(2) = 15.5722, Pr = 0.000

The chi-squared statistic of 15.57 on 2 degrees of freedom again indicates very strong evidence of association between ACE levels and genotype, even though only about 40% of the original subjects are used in the analysis. Nearly 63% of those with low ACE levels had the II genotype compared with only 28% of those with high levels, and the DD genotype was over three times as common in those with high levels as in those with low levels.

6. Notes

1. Commingling analysis. The model underlying ANOVA (see Subheading 2.1) assumes that the data consist of a mixture of Normal distributions, one corresponding to each genotype, each with the same variance. Even in the absence of genotype data, statistical methods can be used to test for evidence of a mixture of more than one Normal distribution. This "unmeasured genotype" approach is sometimes known as commingling analysis. Evidence for a mixture of two or three distributions is supportive of the hypothesis that a major gene underlies the trait, although, of course, environmental factors could also give rise to distinct distributions. Model fitting allows estimates to be made of parameters of interest such as $\mu_j$ and $\sigma_e^2$ and the proportion of subjects in each class. In the presence of genotype data in a candidate gene, the method of commingling analysis can be extended to condition on the measured polymorphism(s). In addition to testing for evidence of a mixture of distributions, this method also provides evidence of whether the measured genotype itself gives rise to the mixture or whether another polymorphism in the gene is a more likely explanation (15,16).

2. Distributional assumptions. In view of the underlying model for ANOVA, a Normalizing transformation may be applied to the data. It is important to note that the model assumes a Normal distribution within each genotype rather than overall. (In commingling analysis, Normalizing the data leads to a conservative test for mixture, as this
may remove skewness in the overall distribution of the data arising from the mixing of distributions.) The further assumption of a common within-genotype variance can be tested, and homogeneity of variance may sometimes be achieved by transformation. In the worked example in this chapter, there is some evidence for heterogeneity in the variances. One advantage of the extremes method outlined in Subheading 2.2 is that it does not rely on these distributional assumptions.

3. Nonparametric alternatives. Another nonparametric alternative to ANOVA is the Kruskal–Wallis test. In this approach, the complete set of N trait values is ranked from 1 to N, and the average rank in each genotype group is calculated. The test statistic is based on comparing the genotype-specific average ranks with the overall average rank of (N+1)/2. Under the null hypothesis of no genotype–phenotype association, the test statistic approximately follows a chi-squared distribution with two degrees of freedom (assuming three genotypes), and a significantly high value indicates that the distributions differ. Applying this method to the example in Subheading 5, the test statistic takes the value 18.2 (p=0.0001). This method is only slightly less powerful than ANOVA when the data are Normally distributed and has the advantage that distributional assumptions are not made. However, the test alone is not very informative, and, in general, the estimates provided by ANOVA are also useful.

4. Analysis of extremes. An alternative suggestion for the analysis of extreme samples, the TMG method mentioned earlier, is to use analysis of variance, ignoring the sampling scheme. The analysis of variance assumption of random sampling from a Normal distribution is violated, but it has been argued that, for large enough sample sizes, the significance level of the test is still correct (10). The analogs of this test and of those outlined in Subheadings 2.1 and 2.2 based on alleles rather than genotypes, where each individual's phenotype contributes twice to the analysis, violate the further assumption of independence of observations. Slatkin (17) suggested selecting individuals on the basis of unusually high (or low) trait values and testing (i) for a difference in genotype frequency between the selected sample and a random sample and (ii) for differences in phenotype distribution according to genotype within the selected sample. These two tests are approximately independent and so can be combined into one overall test. This approach is particularly powerful when a rare allele has a substantial effect on phenotype, even though the overall proportion of phenotypic variance attributable to the locus is small.

5. Family-based samples. Although association studies as described in this chapter are applicable to unrelated sets of cases and controls, extensions have been suggested to allow for relatedness between subjects. Tregouet et al. (18) suggested using estimating equations, a statistical method for estimating regression parameters based on correlated data. They found that, for nuclear families of equal size, the power of this approach was comparable to maximum likelihood and was similar to the power expected in a sample of the same number of unrelated individuals. However, the type I error rate could be substantially inflated in the presence of strong clustering if the number of families is relatively small (
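
As a cross-check of the extreme-groups comparison in Subheading 5.2, the Pearson chi-squared statistic can be recomputed directly from the genotype counts in the bottom and top quintiles. The sketch below is in Python rather than Stata, and the helper name pearson_chi2 is hypothetical. One assumption is flagged in the code: the DD count in the bottom quintile is taken to be 3, inferred from the row total of 62 and the reported row percentage of 4.84. The p-value uses the fact that a chi-squared variable with two degrees of freedom has upper tail exp(-x/2), so it would not carry over to tables of other dimensions.

```python
import math

# Genotype counts in the two extreme groups (bottom and top quintiles
# of ACE level). The bottom-quintile DD count of 3 is an inference from
# the row total (62) and row percentage (4.84), not a printed value.
observed = {
    "bottom quintile": {"II": 39, "ID": 20, "DD": 3},
    "top quintile": {"II": 17, "ID": 33, "DD": 10},
}


def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for an
    r x c table given as {row: {column: count}}.

    Hypothetical helper for illustration only.
    """
    rows = list(table)
    cols = list(table[rows[0]])
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_totals.values())

    chi2 = 0.0
    for r in rows:
        for c in cols:
            # Expected count if genotype were independent of group.
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (table[r][c] - expected) ** 2 / expected

    df = (len(rows) - 1) * (len(cols) - 1)
    return chi2, df


chi2, df = pearson_chi2(observed)

# Closed-form upper tail, valid only because df == 2 here.
p_value = math.exp(-chi2 / 2)

print(f"chi2({df}) = {chi2:.4f}, p = {p_value:.5f}")
```

With these counts the statistic should agree with the reported value of about 15.57 on 2 degrees of freedom, which supports the inferred DD count.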
