A comparison of regression methods based on dimensional reduction for genomic prediction

J.A. da Costa, C.F. Azevedo, M. Nascimento, F.F. e Silva, M.D.V. de Resende, A.C.C. Nascimento
Published: May 31, 2021
Genet. Mol. Res. 20(2): GMR18877
DOI: https://doi.org/10.4238/gmr18877

Cite this Article:
J.A. da Costa, C.F. Azevedo, M. Nascimento, F.F. e Silva, M.D.V. de Resende, A.C.C. Nascimento (2021). A comparison of regression methods based on dimensional reduction for genomic prediction. Genet. Mol. Res. 20(2): GMR18877. https://doi.org/10.4238/gmr18877

About the Authors
J.A. da Costa, C.F. Azevedo, M. Nascimento, F.F. e Silva, M.D.V. de Resende, A.C.C. Nascimento
Corresponding Author: J.A. da Costa
Email: jaquicele.costa@ufv.br

ABSTRACT

Fitting a multiple linear regression model is often hampered by multicollinearity and high dimensionality, which make it impossible to obtain stable estimates by ordinary least squares. To overcome these problems, dimensionality reduction methods have been proposed because of their simple theory and easy application. We compared three dimensionality reduction methods: Principal Components Regression (PCR), Partial Least Squares (PLS), and Independent Components Regression (ICR). An important step in dimensionality reduction and prediction is selecting the number of components, as it affects the linear combinations of the explanatory variables. These linear combinations are entered into the model to predict the response with a reduced number of parameters. We examined the criteria for selecting the number of components. The dimensionality reduction methods were applied to genomic and phenotypic data. We evaluated 370 accessions of Asian rice, Oryza sativa, genotyped for 36,901 SNP markers, which were used to predict genomic values for the trait number of panicles per plant. This data set presented multicollinearity and high dimensionality. The computational time for each method was also recorded. Among the methods, PCR and ICR gave the highest accuracy values, with ICR standing out for providing the least biased estimates of genomic values. However, ICR required more computational time than the other methods.
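
As an illustration of the component-selection step, the following is a minimal sketch of PCR in which the number of components is chosen by k-fold cross-validated prediction error. It is not the authors' implementation: the data are synthetic markers coded 0/1/2 with more markers than individuals, the function names `pcr_fit_predict` and `select_n_components` are hypothetical, and cross-validated mean squared error is assumed here as the selection criterion, since the abstract does not state which criteria were examined. PLS and ICR would differ in how the components are constructed, not in the selection loop.

```python
import numpy as np


def pcr_fit_predict(X_train, y_train, X_test, k):
    """Principal components regression using the first k components."""
    # Center with training statistics only, so the test fold stays unseen.
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    # Rows of Vt are the principal directions (loadings) of the markers.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                      # p x k loadings
    T = Xc @ V_k                        # n x k component scores
    # Least squares on k uncorrelated scores instead of p collinear markers.
    gamma, *_ = np.linalg.lstsq(T, y_train - y_mean, rcond=None)
    return y_mean + (X_test - x_mean) @ V_k @ gamma


def select_n_components(X, y, max_k, n_folds=5, seed=0):
    """Return the k in 1..max_k with the lowest cross-validated MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_mse = np.zeros(max_k)
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        for k in range(1, max_k + 1):
            pred = pcr_fit_predict(X[train], y[train], X[test_idx], k)
            cv_mse[k - 1] += np.mean((y[test_idx] - pred) ** 2) / n_folds
    return int(np.argmin(cv_mse)) + 1, cv_mse


# Synthetic stand-in for a genomic data set (assumption): n << p.
rng = np.random.default_rng(1)
n, p = 60, 500
X = rng.integers(0, 3, size=(n, p)).astype(float)   # genotypes coded 0/1/2
beta = np.zeros(p)
beta[:20] = rng.normal(size=20)                      # 20 markers with effects
y = X @ beta + rng.normal(size=n)

best_k, cv_mse = select_n_components(X, y, max_k=10)
print("selected number of components:", best_k)
```

The SVD is recomputed for every candidate k to keep the sketch short; with data on the scale described above it would be cheaper to compute it once per fold and reuse the leading components.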

Key words: High dimensionality, Independent components, Multicollinearity, Partial least squares, Principal components.
