Functionality of QTL analysis application
Fred van Eeuwijk & Marcos Malosetti
9 January 2011
Input for QTL analysis
For QTL analysis, the following types of data are required:
A file with trait information at plot level (trait information at genotypic level can also be offered). The columns contain labelling information in the form of environmental factors (year, site/location, management), design factors (replicate, incomplete block within replicate, row, column) and genotypic factors (identifiers for individuals, families and groupings), followed by one or more trait columns. Preferred data file types are txt and csv.
Type of population
To allow the calculation of conditional QTL genotypes given marker information (marker phenotypes and map information), the basis of all QTL mapping methods, the type of breeding population needs to be given. Supported population types are F2, BC, DH, RILn, CP (cross pollinator), and AP (association panel). The information on the type of population can be included in the marker scores file.
A txt file with marker phenotypes. Various types of formats (MapQTL, R/qtl, Flapjack) are supported, but the Flapjack format is preferred.
The input information for a QTL analysis is completed by a map file, a simple txt file that contains the names and positions of the markers.
Exploration and data quality control
After the data (phenotypes, marker scores, map) have been read in, each type of data needs to undergo a quality check. Next, a number of quantities need to be calculated at trial level as preliminaries to a GxE analysis, which combines phenotypic information across trials.
Single trial analysis
The principal aim of phenotypic analyses at single trial level is to produce: 1) an estimate of heritability as a kind of upper limit for the fraction of variation explained by QTLs; 2) genotypic means (best linear unbiased estimates, or BLUEs); 3) weights. These phenotypic analyses should be able to deal with any kind of experimental design and should allow spatial analyses as an add-on to the standard analyses that adhere to the design. The output of the single trial phenotypic analyses should be two-way GxE tables of means and weights to be used in further GxE and multi-environment QTL analyses.
The analyses per trial allow data quality control at plot level via plots of residuals versus fitted values, histograms and box plots of residuals, and QQ plots. Similarly, data quality control and exploration are possible at the level of the genotypic means. For example, data summaries can be made containing numbers of observations, numbers of missing values, mean, standard deviation, variance, minimum, first quartile, median, third quartile and maximum. Furthermore, at the genotypic level it is useful to look at histograms and box plots per environment. All these numeric and graphical techniques help to identify outlying genotypes.
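As an illustration of the data summaries listed above, the following minimal sketch computes them for the genotypic means of one environment and flags candidate outliers. The function names, the use of None for missing values and the box plot rule with multiplier 1.5 are assumptions made for this example; they are not prescribed by the application.

```python
import statistics


def summarise(values):
    """Summarise the genotypic means of one environment; None marks a missing value."""
    observed = [v for v in values if v is not None]
    # Quartiles by linear interpolation between the sorted observations.
    q1, median, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return {
        "n_obs": len(observed),
        "n_missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "sd": statistics.stdev(observed),
        "variance": statistics.variance(observed),
        "min": min(observed),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(observed),
    }


def flag_outliers(values, k=1.5):
    """Return indices of values outside [q1 - k*IQR, q3 + k*IQR] (box plot rule)."""
    s = summarise(values)
    iqr = s["q3"] - s["q1"]
    low, high = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [i for i, v in enumerate(values) if v is not None and not low <= v <= high]
```

Calling summarise once per environment column of the GxE table of means gives the per-environment overview; flagged genotypes are candidates for inspection, not automatic removal.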
Data quality control for marker score file
To inspect the marker score file, a graphical genotype image can be made. Such an image helps in detecting non-random patterns of missing data, outlying genotypes and markers, suspicious allele frequencies, and excessive or repressed recombination. All these visual conclusions can be supported by chi-square type tests. An important test to detect violations of the standard population genetic assumptions prior to QTL analysis is the test for segregation distortion.
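A test for segregation distortion at a single marker can be sketched as a chi-square goodness-of-fit test of the observed genotype counts against the Mendelian ratios of the population type, for example 1:1 for DH and BC and 1:2:1 for a co-dominant marker in an F2. The function name and interface are hypothetical; the P-value uses the closed-form chi-square tail for 1 and 2 degrees of freedom, so no statistics library is needed.

```python
import math


def segregation_distortion(counts, ratios):
    """Chi-square test of observed genotype counts against expected ratios.

    counts: observed counts per genotype class, e.g. (48, 52)
    ratios: expected segregation ratios, e.g. (1, 1) or (1, 2, 1)
    Returns (chi-square statistic, P-value).
    """
    n = sum(counts)
    total = sum(ratios)
    expected = [n * r / total for r in ratios]
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    df = len(counts) - 1
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 df
    elif df == 2:
        p_value = math.exp(-stat / 2.0)  # chi-square tail, 2 df
    else:
        raise ValueError("only 2 or 3 genotype classes are handled in this sketch")
    return stat, p_value
```

For a DH marker scored 70 versus 30, segregation_distortion((70, 30), (1, 1)) gives a statistic of 16 and a P-value well below 0.001. When the test is applied to every marker, a multiple testing correction is needed.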
Identifying population structure and estimating kinship
In the case of association panels, the table of genotype by marker scores can also be used as the basis for the detection of population structure by either some type of cluster analysis, multi-dimensional scaling, principal coordinate analysis, or principal components analysis. Furthermore, the same marker information can serve as the basis for one or another estimate of a kinship matrix.
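The sketch below shows one simple way to derive both quantities from a complete genotype by marker matrix. It assumes numeric marker scores (for example 0, 1, 2 allele counts), uses the mean cross-product of column-centred scores as kinship estimate, and obtains principal component scores from a singular value decomposition. This is only one of many possible estimators, and the function name is hypothetical.

```python
import numpy as np


def kinship_and_pcs(scores, n_components=2):
    """Return a simple kinship estimate and principal component scores.

    scores: genotypes (rows) by markers (columns), numeric, no missing values.
    """
    X = np.asarray(scores, dtype=float)
    Xc = X - X.mean(axis=0)           # centre every marker
    kinship = Xc @ Xc.T / X.shape[1]  # average cross-product over markers
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :n_components] * s[:n_components]  # PC scores per genotype
    return kinship, pcs
```

Genotypes from the same subpopulation obtain similar scores on the leading components, so a scatter plot of the first two columns of pcs is a quick check for population structure.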
Imputing missing values for association panels
For the estimation of population structure or kinship, missing marker scores in the genotype by marker matrix can be imputed by a procedure that looks at the most frequent allele(s) in the set of closest genotypes, where the user defines the size of the set.
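A minimal sketch of such an imputation is given below. Several details are assumptions for the purpose of illustration: distance between genotypes is taken as the fraction of mismatching scores over the markers scored in both genotypes, only neighbours that have a score at the marker in question act as donors, and ties are broken by order of appearance.

```python
from collections import Counter


def impute(scores, k=3):
    """Impute missing scores (None) from the k closest genotypes.

    scores: list of genotypes, each a list of marker scores.
    Returns a new table; the input is left unchanged.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum(x != y for x, y in shared) / len(shared)

    imputed = [row[:] for row in scores]
    for i, row in enumerate(scores):
        # Rank all other genotypes from closest to most distant.
        others = sorted((j for j in range(len(scores)) if j != i),
                        key=lambda j: distance(row, scores[j]))
        for m, value in enumerate(row):
            if value is None:
                donors = [scores[j][m] for j in others if scores[j][m] is not None][:k]
                if donors:
                    imputed[i][m] = Counter(donors).most_common(1)[0][0]
    return imputed
```

The argument k corresponds to the user-defined size of the set of closest genotypes.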
Calculating genetic predictors
For regular breeding populations, i.e., not association panels, missing marker scores are not a problem, as the missing marker information can easily be calculated from flanking marker information, the type of population and the marker map, i.e., the distances between markers. Conditional genotype probabilities can be calculated at marker positions to alleviate the problem of missing marker information, but the same kind of calculation can be performed for any position in the genome. In that way marker/QTL genotype probabilities can be produced on a genomic grid of a chosen density. These probabilities are converted into so-called genetic predictors, explanatory variables that allow the detection of QTLs along the genome. Genetic predictors can be constructed for additive, dominance and epistatic effects. These predictors form the basis for the procedures of simple and composite interval mapping.
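For the simplest case, a DH or BC population with two fully informative flanking markers, the calculation can be sketched as follows. The sketch assumes the Haldane map function (no interference), marker scores coded +1 and -1 for the two parental alleles, and positions in cM; none of these choices is stated by the application, and the function names are hypothetical. The additive genetic predictor is the expected allele code at the evaluation position given the flanking markers.

```python
import math


def haldane(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def additive_predictor(left, right, pos_left, pos_right, pos):
    """Expected allele code (+1/-1) at pos, given flanking marker scores.

    left, right: scores (+1 or -1) of the markers at pos_left < pos_right.
    pos: evaluation position with pos_left <= pos <= pos_right.
    """
    r1 = haldane(pos - pos_left)   # left marker to evaluation position
    r2 = haldane(pos_right - pos)  # evaluation position to right marker
    if left == right:
        # Either no recombination or a double recombinant.
        same = (1 - r1) * (1 - r2)
        other = r1 * r2
    else:
        # Exactly one recombination, in either the right or the left interval.
        same = (1 - r1) * r2
        other = r1 * (1 - r2)
    p_same_as_left = same / (same + other)
    return left * (2.0 * p_same_as_left - 1.0)
```

Evaluating additive_predictor on a grid, for example every 2 cM between each pair of adjacent markers, yields for each genotype the series of explanatory variables used in interval mapping. A missing marker score is handled by moving to the nearest scored marker on that side.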
Checking map order and distances
The recombination frequencies between pairs of marker loci, as estimated from the marker scores in the population, can be compared with those expected from the map. This may help in assessing the appropriateness of the map.
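A minimal sketch of this comparison for a DH or BC population, where recombinants between two markers can be counted directly, is given below. The Haldane map function, the use of None for missing scores and the function names are assumptions for this example; for other population types the observed recombination frequency requires a likelihood-based estimate instead of a simple count.

```python
import math


def expected_recombination(distance_cm):
    """Recombination fraction implied by a map distance in cM (Haldane, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def observed_recombination(scores_a, scores_b):
    """Fraction of recombinants between two markers in a DH or BC population.

    Scores are parental allele codes; None marks a missing score.
    Returns None when no genotype is scored at both markers.
    """
    pairs = [(a, b) for a, b in zip(scores_a, scores_b)
             if a is not None and b is not None]
    if not pairs:
        return None
    return sum(a != b for a, b in pairs) / len(pairs)
```

Marker pairs for which the observed value is far from the expected value point to a possible ordering or distance problem in the map, or to scoring errors.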
Linkage Disequilibrium (LD) decay
For association panels it is useful to assess the relationship between linkage disequilibrium and genetic distance. This may be done per chromosome or averaged across chromosomes. The magnitude of LD decay gives information about the required marker density for an LD study.
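The relationship can be explored by computing a pairwise LD measure for all marker pairs on a chromosome and setting it against their map distance. The sketch below uses the squared correlation between numeric marker scores as LD measure, which is an assumption since the text does not name a measure, and expects complete data for a single chromosome. The function name is hypothetical.

```python
import numpy as np


def ld_by_distance(scores, positions):
    """Return map distances and squared correlations for all marker pairs.

    scores: genotypes (rows) by markers (columns), numeric, no missing values.
    positions: map positions of the markers on one chromosome.
    Monomorphic markers yield NaN and should be removed beforehand.
    """
    X = np.asarray(scores, dtype=float)
    pos = np.asarray(positions, dtype=float)
    r2 = np.corrcoef(X, rowvar=False) ** 2
    i, j = np.triu_indices(len(pos), k=1)  # every marker pair once
    return np.abs(pos[i] - pos[j]), r2[i, j]
```

Plotting the squared correlations against the distances, with a smoothed curve through the points, shows at which distance LD falls to background level.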
GxE analysis
After the single trial analyses, a GxE table of means and another of weights are available. Before embarking on a QTL analysis, the GxE structure in the means needs to be investigated, taking into account the weights. The GxE analysis is useful in identifying outlying genotypes and environments and should lead to the selection of an appropriate variance-covariance (VCOV) model for the genetic signal in the data, i.e., a good description of the genetic variances within trials and the genetic correlations between trials. The selected VCOV model is used in the QTL analysis to model the background genetic signal. The statistical techniques to explore the GxE are a scatter plot matrix (scatter plots of the performance in one environment versus that in another environment) and AMMI and GGE analysis (PCA on the double centred and the environment centred GxE means, respectively). The VCOV structure for use in the subsequent QTL analysis is selected by REML.
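The PCA step that AMMI and GGE analysis share can be sketched as a singular value decomposition of the centred GxE table of means. The sketch below is unweighted and assumes a complete table, so it is a simplification of the weighted analysis described above; the function name and the symmetric scaling of the scores are choices made for this example.

```python
import numpy as np


def pca_of_gxe(means, centring="double", n_axes=2):
    """PCA of a genotype (rows) by environment (columns) table of means.

    centring="double": remove genotype and environment main effects (AMMI).
    centring="environment": remove environment main effects only (GGE).
    Returns the centred table, genotype scores, environment scores and
    the singular values.
    """
    Y = np.asarray(means, dtype=float)
    centred = Y - Y.mean(axis=0, keepdims=True)  # remove environment means
    if centring == "double":
        centred = centred - centred.mean(axis=1, keepdims=True)
    elif centring != "environment":
        raise ValueError("centring must be 'double' or 'environment'")
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    scale = np.sqrt(s[:n_axes])  # split singular values over both sets of scores
    return centred, U[:, :n_axes] * scale, Vt[:n_axes].T * scale, s
```

A biplot of the first two columns of the genotype and environment scores displays the main interaction patterns; the squared singular values indicate how much of the centred variation each axis represents.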
QTL mapping strategy
After the GxE analysis by mixed models, QTL analysis is equivalent to selecting a set of genetic predictors in a multi-environment mixed model. The QTL analysis consists of an initial single marker scan, one or more composite interval mapping scans, in which co-factors correct for QTLs segregating elsewhere in the genome, and a final model selection, in which a multi-QTL model is selected by backward selection from an initial model based on the last composite interval mapping scan. For the first single QTL scan, the VCOV model of the GxE analysis serves as random background model for the residual genetic signal. For the later composite interval scans and the final model selection, the background model can usually be simplified.
Philosophy for testing and multiple test correction
The philosophy of the testing procedure is to test for QTL+QTLxE simultaneously. Only after the final model selection is it verified whether a QTL position really requires a QTLxE term. Likewise, testing for additive and dominance effects occurs simultaneously. After assessing whether QTLxE is required for a particular QTL, it is tested whether dominance is required, or whether additive effects suffice. The advantage of this procedure is that the multiple testing correction that is required to protect against false positive QTLs can be the same for any kind of population and number of environments. The correction that is implemented assesses the effective number of tests along the genome from the number of significant eigenvalues of the genotype by marker matrix of marker scores. The effective number of tests is subsequently incorporated in a Bonferroni procedure. This correction works for both linkage and association analyses.
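The correction can be sketched as follows. The text does not specify how significant eigenvalues are counted; the sketch assumes the counting rule of Li and Ji (2005), applied to the eigenvalues of the correlation matrix between markers, in which every eigenvalue of at least 1 counts as one test and the fractional part of every eigenvalue is added. The function names are hypothetical, and complete numeric marker scores are assumed.

```python
import numpy as np


def effective_number_of_tests(scores):
    """Effective number of independent tests from marker correlations.

    scores: genotypes (rows) by markers (columns), numeric, no missing values.
    Counting rule (assumed): Li and Ji (2005).
    """
    X = np.asarray(scores, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    # Rounding guards the integer/fraction split against floating point noise.
    eig = np.round(np.abs(np.linalg.eigvalsh(corr)), 8)
    return float(np.sum((eig >= 1.0) + (eig - np.floor(eig))))


def bonferroni_threshold(alpha, n_effective):
    """Per-test significance level for a genome-wide level alpha."""
    return alpha / n_effective
```

With fully independent markers the effective number of tests equals the number of markers, whereas completely correlated markers count as a single test. The threshold for the -log10(P) profile is then -log10 of the value returned by bonferroni_threshold.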
Presentation of final results
The output of the QTL analyses gives plots of -log10(P-values) for tests on QTL+QTLxE presence against genomic position for simple and composite interval mapping. In addition, plots are created of QTL presence in individual trials/environments. Point estimates for QTL position for a final model are printed together with point and interval estimates for QTL effects.
For association panels the same logic can be followed as for standard breeding populations. The main difference is that for association panels tests for QTLs can only be made at marker positions and not in between markers, as in simple and composite interval mapping. Furthermore, for association panels the relationships between the genotypes need to be included in the mixed model, whether in the form of a factor representing population structure, a set of principal components obtained from decomposition of the genotype by marker matrix, or a kinship matrix.