Research Article

Determination of the optimal number of markers and individuals in a training population necessary for maximum prediction accuracy in F2 populations by using genomic selection models

Published: November 21, 2016
Genet. Mol. Res. 15(4): gmr15048874 DOI:


Genomic selection is a useful technique to assist breeders in selecting the best genotypes accurately. Phenotypic selection in the F2 generation presents with low accuracy as each genotype is represented by one individual; thus, genomic selection can increase selection accuracy at this stage of the breeding program. This study aimed to establish the optimal number of individuals required to compose the training population and to establish the amount of markers necessary to obtain the maximum accuracy by genomic selection methods in F2 populations. F2 populations with 1000 individuals were simulated, and six traits were simulated with different heritability values (5, 20, 40, 60, 80 and 99%). Ridge regression best linear unbiased prediction was used in all analyses. Genomic selection models were set by varying the number of individuals in the training population (2 to 1000 individuals) and markers (2 to 3060 markers). Phenotypic accuracy, genotypic accuracy, genetic variance, residual variance, and heritability were evaluated. Greater the number of individuals in the training population, higher was the accuracy; the values of genotypic and residual variances and heritability were close to the optimum value. Higher the heritability of the trait, higher is the number of markers necessary to obtain maximum accuracy, ranging from 200 for the trait with 5% heritability to 900 for the trait with 99% heritability. Therefore, genomic selection models for prediction in F2 populations must consist of 200 to 900 markers of major effect on the trait and more than 600 individuals in the training population.



Plant selection has been carried out by mankind since early history; however, it intensified at the beginning of the last century, when breeding programs of major crops were established (Allard, 1999). Over the past 100 years, methods and technologies associated with plant selection have shown tremendous progress (Borém and Miranda, 2013). Selection methods have evolved from mass selection (Borém and Miranda, 2013), which involves simple phenotypic selection of individuals; or combined selection (Borém and Miranda, 2013), which takes into account the information between and within families; to reciprocal recurrent selection (Ordas et al., 2012), which includes a gradual increase in the frequency of favorable alleles by repeated selection cycles, without lowering the genetic variability of the population.

With the advent of molecular markers in the 80s, it was possible to improve selection accuracy by means of marker assisted selection (MAS) (He et al., 2014). Although MAS has led to a significant improvement in plant breeding, this technique is only effective for qualitative traits or for traits governed by a few genes, such as in the case of sudden death syndrome in soybean (Lightfoot, 2015), wheat rust (Yaniv et al., 2015), tolerance to salinity (Ashraf et al., 2012), and rice bacterial blight (Pandey et al., 2013). Within 30 years from its advent, techniques involving molecular markers evolved from employing isoenzymes (Dirlewanger et al., 1998), RAPD (Lynch and Milligan, 1994), RFLP (Langer and Maixner, 2004), AFLP (Frascaroli et al., 2013), microsatellites (Soldati et al., 2013) to single nucleotide polymorphism (SNP) (Belaj et al., 2012).

Among the variations found in the genome, SNP variations are the most widely distributed and abundant in the genome. With the development of SNP genotyping platforms and with the improvement of statistical methods, Meuwissen et al. (2001) presented a new approach based on multiple regression using markers as covariates, also known as genomic selection. The objective of genomic selection is to identify possible markers in linkage disequilibrium with the gene regions of interest. Since this pioneering study, several authors have used this technique to predict the genetic value in several plant species, such as corn (Beyene et al., 2015), soybeans (Zhang et al., 2016), wheat (Bassi et al., 2016), forest species (Cros et al., 2015), sugarcane (Gouy et al., 2013), and rice (Spindel et al., 2015).

Although there are several studies on genomic selection, only a few show the effect of different factors affecting prediction accuracy in genomic selection, such as the number of markers and individuals in the training population. Isidro et al. (2015) evaluated five criteria to determine the optimal number of individuals to compose the training population for five traits in wheat. The authors found that greater the number of individuals in the training population, greater was the value of prediction accuracy. However, other effects must also be taken into account, such as plant architecture and population structure. de Los Campos et al. (2013) carried out marker selection based on their importance to the trait as indicated by the results of the GWAS analysis, through meta-analysis. They found that marker selection was effective in humans, since accuracy was 7.5% higher for genomic selection models using 5k SNPs when compared with models using 400k SNPs.

The F2 generation is one of the most important stages in a plant breeding program because greater genetic variability and heterosis are found at this stage (Tang et al., 1993). Moreover, in the F2 generation, it is possible to estimate the allelic frequency for each gene by the Mendelian segregation, to evaluate possible deviations from the Hardy-Weinberg equilibrium (Falconer and Mackay, 1996), to estimate genotypic and environmental variance, and consequently, to estimate heritability (Tang et al., 1996). The use of genomic selection in F2 populations is still restricted to a few studies (Ren et al., 2015) and little is known about how the factors affecting the prediction accuracy can affect the estimate of genetic parameters in F2 populations. Therefore, the objectives of this study were to establish the optimal number of individuals and markers required to compose the training population in genomic selection models in order to capture maximum genetic variance and consequently achieve greater accuracy in F2 populations.


Data simulation

Simulation of F2 populations was performed using the simulation module of the GENES software (Cruz, 2013). This allowed for information on the genome, parents genotypes, populations of controlled crossings, and quantitative trait data to be generated.

Genome simulation

The simulated genome comprised 15 linkage groups, similar to a diploid species 2n = 2x = 30. Each linkage group was simulated with 200 cM, comprised 200 markers, spaced equidistantly (1 cM), totaling 3060 markers. These markers were assumed to be codominant and biallelic. Furthermore, 4 markers per linkage group were considered responsible for the control of phenotypic expression of quantitative traits, which were randomly inserted into the genome.

Parent simulation

Contrasting homozygous parents were simulated, i.e., parent 1 was coded as carrying allele A1 (code 2), and parent 2 was coded as carrying the alternative allele A2 (code 0) for all existing markers. Thus, the cross between parent 1 and 2 generated the F1 population with all markers being heterozygous and in an approximation stage (A1B1//A2B2).

Simulation of the mapping population

F2 populations were generated by the selfing of individuals from the F1 population. Each individual of the F1 population produced 5000 gametes, and when 2 of these gametes met at random, the first individual of the F2 population was generated. This process was repeated until the formation of all individuals in each population.

Each gamete was formed based on the following criteria: the allele of the first marker was randomly chosen (A1 or A2) to start gamete formation (initialization allele); the allele of the second marker was chosen taking into account the distance to the first gene, i.e, the crossing-over frequencies were counted, and the choice of which allele (B1 or B2) to constitute the gamete was based on the probabilities of each gamete P(A1B1) and P(A2B2), which are parental gametes, and P(A1B2) and P(A2B1), which are recombinant gametes. This process was carried out for every gene. Null interference, i.e., the crossing-over that occurred between genes A and B, was considered to not interfere with the following crossing-over between genes B and C. This ensured that all gametes formed were different owing to the random choice of the allele in the first gene, and to the probability conditioned to each allele for the next genes. Since all genes were simulated equidistantly at 1 cm, the recombination frequency was 1% for all genes, i.e., the probability of each gamete was: P(A1B1) = P(A2B2) = 0.49, and P(A1B2) = P(A2B1) = 0.005.

The F2 population simulation was encoded with 0, 1 and 2, in which 0 corresponded to homozygote individuals (A2A2), 1 corresponded to heterozygote individuals (A1A2), and 2 corresponded to homozygote individuals (A1A1) for a given locus.

Simulation of quantitative traits

For the simulation of quantitative traits, a value corresponding to the probability generated by a binomial distribution, of parameter p = q = 0.5, and n = 59 (generating a probability family of 60 elements) was first assigned as the importance of each locus. This value, which denominates the proportion of the genetic variance, explained by each QTL (PGV/QTL), reflects the importance of the locus to the genotypic mean, and consequently to the proportion of genetic variance of the trait explained by each QTL.

Each trait was simulated as being controlled by 60 QTLs distributed equidistantly in the genome (4 QTLs per linkage group). The effect of each QTL was defined as: A1A1 = μ + a; A1A2 = μ + d; A2A2 = μ - a, in which a is the additive effect of each gene, and d is the dominant effect of each gene. Since the value of d was defined as null, the mean degree of dominance (d/a) was equal to zero for all loci, i.e., all loci only presented an additive effect.

The genotypic value (GV) of each individual was defined by the equation:

GV= i=1 n (PGV/QT L i  x effect of QT L i )

(Equation 1)

The environmental effect was not correlated with the genotypic value, and was estimated following an N(0, σ 2 ) distribution. The value of σ 2 is calculated from the heritability of the trait and by the value of genotypic variance ( σ g 2 ) . It was simulated that traits with heritability of 5, 20, 40, 60, 80 and 99%. σ g 2 was calculated as being the variance of the genotypic value of individuals of the F2 population. Thus, the phenotypic value was calculated as:


(Equation 2)

In which u = 100 is the mean, and PV is the phenotypic value.

Data analysis

After population generation, the mapping process was carried out, starting with the analysis of individual loci segregation. Chi-square (χ2) tests were used to check if the markers were segregated per what is expected for an F2 population. It was verified if Linkage Groups were restored, with chromosome size, distance between markers and order of markers, and it was concluded that it was an F2 population with the desired simulation properties.

The ridge regression best linear unbiased prediction (RR-BLUP) method of genomic selection used in the analysis, which aims at estimating the effect for each of the covariates (SNP markers) included in the model. RR-BLUP assumes that all SNPs control phenotypic expression of QTL, and assumes homogeneous variance.

Genomic selection models ranging from 2 to 1000 individuals in the training population (TP) and in all the available markers, i.e., 3060 markers, were to verify the optimum number of individuals required to comprise the TP. The validation population was always composed of 200 individuals randomly chosen in the population. Trend graphs were plotted for genetic variance, residual variance, heritability, and genotypic and phenotypic accuracy.

To evaluate the number of markers necessary to capture the entire genetic variance, and consequently achieve greater accuracy, the number of markers in the genomic selection models varied from 2 to 3060. The TP comprised 800 individuals, and the validation population consisted of 200 individuals. Marker selection was carried out and the marker with the lowest effect was deleted from the original matrix and not used in subsequent analysis, i.e., the following model would have one marker less. This marker selection occurred until the model comprised only 2 markers. Trend graphs were plotted for genetic variance, residual variance, heritability and phenotypic accuracy.

Phenotypic and genotypic accuracies were estimated by the Pearson’s correlation between the phenotypic value and the genomic estimate breeding value (GEBV), and between the true genotypic value and the GEBV, respectively.

Genetic variance ( σ g 2 ) was estimated according to Falconer and Mackay (1996):

σ g 2 =2 i=1 n pq α i 2

(Equation 3)

in which α i 2 is the allele substitution effect for each locus. Heritability ( h 2 ) was estimated as:

h 2 = σ g 2 ( σ g 2 + σ 2 )

(Equation 4)

Software and hardware information

Simulations were carried out using the GENES software (Cruz, 2013), while the analyses of segregation and genomic selection tests were performed using the statistical R software (R Core Team, 2015). The RR-BLUP package was used to run the RR-BLUP model. Two high-performance computers (Intel Xeon, processor E5-26 12° generation 3.30 GHz, 64 and 96 Gb RAM, 1024 Gb hard drive) were used to perform the genomic selection analysis.


In order to evaluate how the size of the TP and the number of markers influenced the genomic estimate breeding value prediction using RR-BLUP, the size of the TP ranged from 2 to 1000 individuals, and from 2 to 3060 SNP markers.

Marker segregation test

The segregation test was carried out in order to verify whether the simulation process established a population with genetic characteristics of an F2 population, as proposed by Falconer and Mackay (1996). It was observed that the allelic and genotypic frequencies were close to the expected value for an F2 population, for all the markers that control the trait (QTL - Table 1) and the other simulated markers (Table S1). Segregation test, minor allele frequency (MAF), P value associated with the <inline-formula><math xmlns='' id='m12'> <mrow> <msup> <mi>χ</mi> <mn>2</mn> </msup> </mrow> </math></inline-formula>test of the evaluation of the Hardy-Weinberg equilibrium (HWE P value), and effect of markers associated with the quantitative trait loci (QTL).

QTL AA Aa aa Total maf HWE P value PGV/QTL AE (u+a)
M42 248 503 249 1000 0.4995 0.849491 0.00000000 0.000000
M83 220 525 255 1000 0.4825 0.104833 0.00000000 0.000000
M123 249 488 263 1000 0.4930 0.451512 0.00000000 0.000000
M165 254 502 244 1000 0.4950 0.896830 0.00000000 0.000000
M246 253 478 269 1000 0.4920 0.166462 0.00000000 0.000000
M287 275 480 245 1000 0.4850 0.215878 0.00000000 0.000000
M327 290 487 223 1000 0.4665 0.494414 0.00000000 0.000000
M369 255 506 239 1000 0.4920 0.698262 0.00000000 0.000001
M450 234 491 275 1000 0.4795 0.605211 0.00000000 0.000004
M491 248 499 253 1000 0.4975 0.950199 0.00000002 0.000022
M531 245 499 256 1000 0.4945 0.952612 0.00000011 0.000109
M573 231 507 262 1000 0.4845 0.635811 0.00000049 0.000486
M654 255 495 250 1000 0.4975 0.752424 0.00000194 0.001942
M695 238 503 259 1000 0.4895 0.838532 0.00000702 0.007021
M735 257 502 241 1000 0.4920 0.892912 0.00002310 0.023069
M777 255 502 243 1000 0.4940 0.895725 0.00006920 0.069208
M858 234 529 237 1000 0.4985 0.066591 0.00019000 0.190321
M899 247 520 233 1000 0.4930 0.203601 0.00048100 0.481401
M939 263 489 248 1000 0.4925 0.490986 0.00112300 1.123269
M981 253 509 238 1000 0.4925 0.564308 0.00242400 2.423897
M1062 251 490 259 1000 0.4960 0.528386 0.00484800 4.847794
M1103 264 494 242 1000 0.4890 0.715601 0.00900300 9.003046
M1143 232 508 260 1000 0.4860 0.595299 0.01555100 15.55072
M1185 265 517 218 1000 0.4765 0.251149 0.02501600 25.01637
M1266 246 491 263 1000 0.4915 0.575321 0.03752500 37.52455
M1307 269 493 238 1000 0.4845 0.679807 0.05253400 52.53438
M1347 273 487 240 1000 0.4835 0.430338 0.06869900 68.69880
M1389 254 498 248 1000 0.4970 0.900241 0.08396500 83.96520
M1470 248 497 255 1000 0.4965 0.850723 0.09596000 95.96023
M1511 254 481 265 1000 0.4945 0.230923 0.10257800 102.5782
M1551 249 501 250 1000 0.4995 0.949546 0.10257800 102.5782
M1593 275 470 255 1000 0.4900 0.059366 0.09596000 95.96023
M1674 234 497 269 1000 0.4825 0.879831 0.08396500 83.96520
M1715 260 499 241 1000 0.4905 0.958649 0.06869900 68.69880
M1755 239 509 252 1000 0.4935 0.565527 0.05253400 52.53438
M1797 252 505 243 1000 0.4955 0.749867 0.03752500 37.52455
M1878 241 501 258 1000 0.4915 0.942279 0.02501600 25.01637
M1919 239 495 266 1000 0.4865 0.769225 0.01555100 15.55072
M1959 253 519 228 1000 0.4875 0.221634 0.00900300 9.003046
M2001 252 508 240 1000 0.4940 0.609637 0.00484800 4.847794
M2082 238 505 257 1000 0.4905 0.743092 0.00242400 2.423897
M2123 255 480 265 1000 0.4950 0.206994 0.00112300 1.123269
M2163 254 504 242 1000 0.4940 0.796736 0.00048100 0.481401
M2205 229 499 272 1000 0.4785 0.996183 0.00019000 0.190321
M2286 254 495 251 1000 0.4985 0.752043 0.00006920 0.069208
M2327 280 501 219 1000 0.4695 0.855905 0.00002310 0.023069
M2367 268 492 240 1000 0.4860 0.630126 0.00000702 0.007021
M2409 256 504 240 1000 0.4920 0.793981 0.00000194 0.001942
M2490 255 508 237 1000 0.4910 0.605591 0.00000049 0.000486
M2531 244 502 254 1000 0.4950 0.896830 0.00000011 0.000109
M2571 244 492 264 1000 0.4900 0.621650 0.00000002 0.000022
M2613 256 502 242 1000 0.4930 0.894419 0.00000000 0.000004
M2694 242 496 262 1000 0.4900 0.809997 0.00000000 0.000001
M2735 257 504 239 1000 0.4910 0.792309 0.00000000 0.000000
M2775 260 501 239 1000 0.4895 0.938444 0.00000000 0.000000
M2817 261 491 248 1000 0.4935 0.572781 0.00000000 0.000000
M2898 241 539 220 1000 0.4895 0.013079 0.00000000 0.000000
M2939 255 499 246 1000 0.4955 0.951607 0.00000000 0.000000
M2979 232 505 263 1000 0.4845 0.728628 0.00000000 0.000000
M3021 236 510 254 1000 0.4910 0.520283 0.00000000 0.000000

M = marker. The number following the letter M stands for the number of the marker in the original table; PGV/QTL = proportion of genetic variance of each trait explained by each QTL; AE = additive effect.

The evaluation of the Hardy-Weinberg equilibrium was carried out using the chi-square test (χ2), which indicated the expected segregation for an F2 population (Table 1; and Table S1 shows the P value).

The proportion of the genetic variance of the trait explained by each QTL followed a binomial distribution, as expected in the simulation process (Table 1). The values of the additive effects ranged between the QTL, and they were high in QTL located on the median chromosomes, and low in the QTL located in the first (chromosome 1 over 5) and in the last chromosomes (chromosome 16 over 20) (Table 1).

Evaluation of the training population size

The genotypic variances presented similar behavior regardless of the simulated heritability (Figure 1). Similar results were observed for the residual variance. It was observed that the higher the number of individuals used in the reference population, the greater was the ability of the model to estimate variance components in a manner similar to the simulated parameters. It was also observed that heritability influences the number of individuals to compose the TP used to estimate accurate variance components (genotypic and residual).

Trend of genotypic variance (blue) and residual variance (red) in function of the number of individuals in the training population, for traits with different heritability: A. 5%; B. 20%; C. 40%; D. 60%; E. 80%; F. 99%.

The genotypic and phenotypic accuracies showed higher estimates in the TP with a high number of individuals (Figure 2). It was also observed that the higher the heritability of the trait, the higher were the estimates of phenotypic and genotypic accuracies (Figure 2). However, the TP with more than 600 individuals provided a small gain in phenotypic and genotypic accuracies. Genotypic accuracy had lower values than the phenotypic accuracy for all the heritabilities evaluated. However, the higher the value of the simulated heritability, the closer were the estimates of phenotypic and genotypic accuracies (Figure 2).

Trend of phenotypic accuracy, genotypic accuracy, and heritability in function of the number of individuals in the training population for traits with different heritability: A. 5%; B. 20%; C. 40%; D. 60%; E. 80%; F. 99%.

The estimated heritability showed similar behavior to the genotypic and residual variances, i.e., it presented variable values when few individuals were used in the TP (Figure 2). With the increase in the number of individuals in the TP, more stable values of the estimated heritability were observed, and these values were closer to the values of simulated heritability, except for the traits with 80 and 99% heritability, whose estimated heritability values were lower than the simulated heritability values (Figure 2).

Evaluation of the number of markers necessary to obtain genomic prediction in an F2 population

The genotypic variance was quadratic, i.e., it increased up to a certain number of markers, and then it gradually decreased with the increase in the number of markers (Figure 3). The optimal number of markers ranged according to the heritability of the trait (Table 2). It was also found that the higher the trait heritability, the greater the number of markers necessary to obtain the best genotypic variance estimate.

Trend of genotypic variance and residual variance in function of the number of markers used for the training of the genomic selection model for traits with different heritabilities: A. 5%; B. 20%; C. 40%; D. 60%; E. 80%; F. 99%.

Number of markers (NM) necessary to obtain the optimal value (OV) of genotypic variance <inline-formula><math xmlns='' id='m13'> <mrow> <mo stretchy="false">(</mo><msubsup> <mi>σ</mi> <mi>g</mi> <mn>2</mn> </msubsup> <mo stretchy="false">)</mo></mrow> </math></inline-formula>, residual variance <inline-formula><math xmlns='' id='m14'> <mrow> <mo stretchy="false">(</mo><msup> <mi>σ</mi> <mn>2</mn> </msup> <mo stretchy="false">)</mo></mrow> </math></inline-formula>, heritability <inline-formula><math xmlns='' id='m15'> <mrow> <mo stretchy="false">(</mo><msup> <mi>h</mi> <mn>2</mn> </msup> <mo stretchy="false">)</mo></mrow> </math></inline-formula>, and accuracy in an F<sub>2</sub> population.
σ g 2 α 2 h 2 Accuracy
OV C1 7466 2126 0.77 0.91
C2 2035 337 0.85 0.94
C3 897 115 0.88 0.96
C4 560 50 0.91 0.98
C5 669 62 0.91 0.98
C6 557 28 0.95 0.99
NM* C1 240-397 372-554 307-487 545-739
C2 258-415 479-634 408-579 589-741
C3 332-525 518-709 489-647 693-864
C4 410-573 602-757 522-711 704-870
C5 428-588 532-685 500-666 639-827
C6 486-674 585-746 557-710 729-884

*The interval corresponds to the 5% best values for each parameter evaluated. C1 to C6 correspond to each trait simulated by varying the heritability value (5, 20, 40, 60, 80 and 99%).

Residual variance presented a quadratic trend with positive concavity, i.e., it decreased up to a certain number of markers, and then increased with an increase in the number of markers (Figure 3). The optimal number of markers for residual variance increased with the increase in the heritability of the trait (Table 2).The value of the estimated heritability increased exponentially up to an optimal number of markers, and then decreased linearly (Figure 4). The number of markers for the maximum heritability point increased with an increase in the simulated heritability for each trait (Table 2). The decrease in estimated heritability with an increase in the number of markers was lower for traits with higher simulated heritability.

Trend of phenotypic accuracy and heritability in the function of the number of markers used for genomic selection model training for traits with different heritabilities: A. 5%; B. 20%; C. 40%; D. 60%; E. 80%; F. 99%.

Prediction accuracy of the training population presented an exponential increase up to a maximum point, and then a slight linear decrease (Figure 4). This decrease was lower for traits of high simulated heritability. The optimal number of markers to obtain greater accuracy increased with an increase in the heritability value of the trait (Table 2).


Marker segregation test

Using the allelic frequency, the gene frequency, and the Hardy-Weinberg equilibrium, it was verified that the simulated population indeed represented a population with all the characteristics of an F2 population, i.e., (A)p = (a)q = 0.5, (AA)p2 = (aa)q2 = 0.25, and (Aa)2pq = 0.5.

The great importance of recovering all the information from an F2 population using the simulation process is that the genetic variance, environmental variance, and heritability are easy to estimate in this type of population. According to Falconer and Mackay (1996), genetic variance in F2 population is estimated as follows:

σ g 2 =2pq α 2 + (2pqd) 2

(Equation 5)

in which d value was simulated as 0 for all loci, and thus the genetic variance is equal to the additive variance. This is easily calculated, since α 2 is the variance of the markers calculated using the RR-BLUP method. Thus, heritability can be estimated from the equation proposed by Falconer and Mackay (1996) for an F2 population:

h 2 = σ g 2 ( σ g 2 + σ 2 )

(Equation 6)

and σ 2 is the residual variance of the markers estimated by the RR-BLUP method.

Consequently, all the genetic and environmental parameters were accurately calculated, and hence, these parameters were the criteria for choosing the best genomic selection model, i.e., the model composed of the ideal number of individuals in the training population, and the number of markers required to accurately train the model.

Training population size versus estimated genetic value

Usually, the increase in the number of individuals in the TP increases the prediction accuracy of the genetic value (Desta and Ortiz, 2014). However, despite the increase in accuracy, when more than 600 individuals was used in the TP, this increment was very low, making it almost null for traits with a heritability of 80-99% in the present study.

Besides the number of individuals in the TP, population structure may influence the prediction by the genomic selection methods. Studies on oat (Asoro et al., 2011), corn (Ogutu et al., 2012), and beet (Würschum et al., 2013) showed that the use of a structured population together with large enough TP considerably increases prediction accuracy. Therefore, all the results of this study are valid for an F2 population. Other studies are required to evaluate other types of populations, such as backcrossing, RILs, half-sib families, and full-sib families, since each type of population has a different structure, influencing the prediction by genomic selection methods.

Isidro et al. (2015), when evaluating several methods to optimize the choice of individuals to compose the TP, found that the population structure and the trait architecture are the factors that most influence the TP performance. Thus, it is difficult to verify a standard size of the TP for the several possible heritabilities and different population types. Using the current study with an F2 population, it can be concluded that 600 individuals are enough, regardless of the trait architecture. However, for traits with low heritability, accuracy values are less stable, i.e., depending on the individuals of the TP, accuracy is higher or lower, and when heritability increases, the accuracy value is constant, regardless of the number of individuals of the TP. This was verified in this study, since the analyses were repeated 50 times for each TP size. Thus, prior knowledge of the trait under study may help researchers to design the experiment in order to obtain accurate results through genomic selection, and consequently reduce the cost of the breeding program.

Markers density versus estimated breeding value

It was found that a number of markers ranging from 200 to 900 is enough to capture all the genotypic variance of an F2 population, and consequently achieve maximum accuracy. This value varies depending on the trait heritability, since the higher the heritability, the greater is the number of markers necessary to obtain maximum accuracy. This fact can be explained by the effect of each QTL and their influence on the genomic selection methods. All traits were simulated with 60 QTL; however, the higher the heritability of the trait, the higher is the effect of each QTL. One of the characteristics of genomic selection methods is the capture of minor effect markers, mainly because the RR-BLUP method uses the same variance for all markers. This means that RR-BLUP cannot capture the full effect of major effect QTL, which thereby requires more markers to explain the genotypic variance of the trait. An alternative to improve the variance capture of QTL for major effects is the use of Bayesian methods, which assume specific variance for each marker, such as Bayes A and Bayes B (Gianola et al., 2009).

Erbe et al. (2013) evaluated Brown Swiss cattle by genotyping the animals using 777k chips, and observed that the genetic variance estimated via genomic selection models increased to 20k, becoming constant after this number of markers. They concluded that even with a population of infinite individuals for training and a large number of markers, it would not be possible to increase the accuracy for this population. Poland et al. (2012) found that 1827 SNPs were enough to capture all the genetic variance in wheat populations. In our study, the marker of lower effect was deleted in each iteration, and was observed that it was not necessary to use several markers to explain the genotypic variance of the trait, since by the simulation process, only 60 markers explained all the variation of the trait. Thus, prior knowledge of the trait may be important for the development of low density chips specific for a given trait or species. The development of this type of chip is important to reduce genotyping costs. In animal breeding, low density chips for cattle (Heaton et al., 2002; Boichard et al., 2012) and pigs (Wellmann et al., 2013) have been developed. Moreover, the genotyping cost of a low density chip is much lower than that of a high density chip (Habier et al., 2009).

In addition, the high accuracy for models using a small number of markers (200 to 900 SNPs), which was verified in this study, can be explained by the fact that individuals of the training population are highly correlated, since all of them are descended from the same parent, i.e., the individuals of the F2 population share alleles identical by descent (Poland et al., 2012). However, despite the reduced number of markers for each studied trait, these markers are different for each trait, making the construction of a multi-trait low-density chip very difficult (Habier et al., 2009). Therefore, in further studies, it is necessary to seek strategies that enable simultaneous marker selection for several traits, and thus build a multi-trait low-density chip.

The number of markers used in this study were low. However, several studies have shown that a small number of markers can be used for high-accuracy genomic selection for many traits (Bhering et al., 2015; Spindel et al., 2015). In the future, it is necessary to test the optimal number of markers and individuals, using a large number of markers.


The ideal number of individuals to compose the training population is strongly correlated with the heritability of the trait. However, a training population comprising more than 600 individuals ensures maximum accuracy, regardless of the heritability for an F2 population.

A genomic selection model that uses 300-800 markers is enough to capture all the genetic variance, and to decrease the residual variance, in order to obtain the maximum prediction accuracy of an F2 population.

Supplementary material

Table S1. Segregation test, minor allele frequency (MAF), P value associated with the χ 2 test of the evaluation of the Hardy-Weinberg equilibrium (hwe.p.value) for all SNPs.

About the Authors