Research Article

Time-series microarray data simulation modeled with a case-control label

Published: May 12, 2016
Genet. Mol. Res. 15(2): gmr7287 DOI: 10.4238/gmr.15027287

Abstract

With advances in molecular biology, microarray data have become an important resource for exploring complex human diseases. Although gene chip technology continues to develop, many barriers remain, including high cost, small sample sizes, complex procedures, poor repeatability, and dependence on the choice of data analysis method. Simulated data sidestep these problems and therefore play a vital role in the study of complex diseases. This study introduces a method for simulating microarray data that models the occurrence and development of general diseases. Five risk models are proposed on the basis of classical statistics and control theory. One or more of these models can be introduced into a baseline simulated dataset with a case-control label. In addition, time-series gene expression data can be generated to model the dynamic evolution of a disease. The prevalence of each model is estimated, and disease-associated genes are tested by significance analysis of microarrays. The source code, written in MATLAB, is freely and publicly available at http://sourceforge.net/projects/genesimulation/files/.
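
To make the workflow in the abstract concrete, the following is a minimal Python sketch of the general idea: generate a baseline expression dataset, attach a case-control label through a risk model, estimate the prevalence, and extend the data into a time series. It is not a port of the authors' MATLAB code. The abstract does not specify the five risk models, so the threshold rule used here, the Gaussian baseline, the linear drift, and all function and parameter names (`simulate_baseline`, `threshold_risk_labels`, `simulate_time_series`, `drift`) are assumptions made purely for illustration.

```python
import random
from statistics import mean


def simulate_baseline(n_samples, n_genes, rng):
    """Baseline expression matrix (samples x genes).

    Assumption: independent standard-normal values stand in for
    log-expression levels; the paper's baseline may differ.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for _ in range(n_samples)]


def threshold_risk_labels(expr, risk_genes, threshold):
    """Hypothetical risk model, not one of the paper's five.

    A sample is labeled a case (1) when the mean expression of its
    risk genes exceeds `threshold`, otherwise a control (0).
    """
    return [int(mean(row[g] for g in risk_genes) > threshold)
            for row in expr]


def simulate_time_series(expr, labels, risk_genes, n_timepoints, drift, rng):
    """Extend the baseline into a time series (time x samples x genes).

    Assumption: every value receives small Gaussian noise per step, and
    risk genes in case samples additionally shift by `drift` per step.
    """
    risk = set(risk_genes)
    series = [[row[:] for row in expr]]  # time point 0 is the baseline
    for _ in range(1, n_timepoints):
        previous = series[-1]
        current = []
        for row, label in zip(previous, labels):
            current.append([
                value + rng.gauss(0.0, 0.1)
                + (drift if label == 1 and g in risk else 0.0)
                for g, value in enumerate(row)
            ])
        series.append(current)
    return series


rng = random.Random(0)  # fixed seed so the run is reproducible
risk_genes = [0, 1, 2]  # illustrative choice of disease-associated genes

expr = simulate_baseline(n_samples=200, n_genes=50, rng=rng)
labels = threshold_risk_labels(expr, risk_genes, threshold=0.5)
prevalence = mean(labels)  # fraction of samples labeled as cases
series = simulate_time_series(expr, labels, risk_genes,
                              n_timepoints=5, drift=0.3, rng=rng)

print(f"estimated prevalence: {prevalence:.3f}")
```

In a sketch like this, the threshold controls the prevalence of the model, and the known list of risk genes provides a ground truth against which a test such as significance analysis of microarrays could be checked.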

About the Authors