TY - JOUR
T1 - Cross-validation without doing cross-validation in genome-enabled prediction
AU - Gianola, Daniel
AU - Schon, Chris Carolin
N1 - Publisher Copyright:
© 2016 Gianola and Schon.
PY - 2016
Y1 - 2016
N2 - Cross-validation of methods is an essential component of genome-enabled prediction of complex traits. We develop formulae for computing the predictions that would be obtained when one or several cases are removed in the training process, to become members of testing sets, but by running the model using all observations only once. Prediction methods to which the developments apply include least squares, best linear unbiased prediction (BLUP) of markers, or genomic BLUP, reproducing kernels Hilbert spaces regression with single or multiple kernel matrices, and any member of a suite of linear regression methods known as "Bayesian alphabet." The approach used for Bayesian models is based on importance sampling of posterior draws. Proof of concept is provided by applying the formulae to a wheat data set representing 599 inbred lines genotyped for 1279 markers, and the target trait was grain yield. The data set was used to evaluate predictive mean-squared error, impact of alternative layouts on maximum likelihood estimates of regularization parameters, model complexity, and residual degrees of freedom stemming from various strengths of regularization, as well as two forms of importance sampling. Our results will facilitate carrying out extensive cross-validation without model retraining for most machines employed in genome-assisted prediction of quantitative traits.
AB - Cross-validation of methods is an essential component of genome-enabled prediction of complex traits. We develop formulae for computing the predictions that would be obtained when one or several cases are removed in the training process, to become members of testing sets, but by running the model using all observations only once. Prediction methods to which the developments apply include least squares, best linear unbiased prediction (BLUP) of markers, or genomic BLUP, reproducing kernels Hilbert spaces regression with single or multiple kernel matrices, and any member of a suite of linear regression methods known as "Bayesian alphabet." The approach used for Bayesian models is based on importance sampling of posterior draws. Proof of concept is provided by applying the formulae to a wheat data set representing 599 inbred lines genotyped for 1279 markers, and the target trait was grain yield. The data set was used to evaluate predictive mean-squared error, impact of alternative layouts on maximum likelihood estimates of regularization parameters, model complexity, and residual degrees of freedom stemming from various strengths of regularization, as well as two forms of importance sampling. Our results will facilitate carrying out extensive cross-validation without model retraining for most machines employed in genome-assisted prediction of quantitative traits.
KW - Cross-validation
KW - GenPred
KW - Genomic BLUP
KW - Genomic prediction
KW - Genomic selection
KW - Reproducing kernels
KW - Shared Data Resources
UR - http://www.scopus.com/inward/record.url?scp=84994572070&partnerID=8YFLogxK
U2 - 10.1534/g3.116.033381
DO - 10.1534/g3.116.033381
M3 - Article
C2 - 27489209
AN - SCOPUS:84994572070
SN - 2160-1836
VL - 6
SP - 3107
EP - 3128
JO - G3: Genes, Genomes, Genetics
JF - G3: Genes, Genomes, Genetics
IS - 10
ER -