A Big Data Analytics And Statistical Genetics Approach For Gene Expression–Based Biomarker Discovery In Neurodegenerative Disorders Using AI And Machine Learning
DOI:
https://doi.org/10.4238/3r8zn256
Abstract
Alzheimer disease (AD) and Parkinson disease (PD) are neurodegenerative disorders marked by progressive neuronal dysfunction and substantial molecular heterogeneity, which hinders early diagnosis and targeted intervention. Gene expression profiling is an effective way to discover transcriptomic biomarkers, but high dimensionality, cohort variability and the multiple-testing burden tend to undermine reproducibility. In this study, we used a combined big data analytics and statistical genetics framework to identify robust gene expression-based biomarkers from publicly available transcriptomic data of brain and peripheral blood samples (total n = 412; 238 cases and 174 controls). After quality control and normalisation to minimise batch effects, differential expression analysis was performed by moderated linear modelling with Benjamini-Hochberg false discovery rate (FDR) control. Genes were considered significant at FDR < 0.05 with an absolute log2 fold change of at least 1 and a confidence interval that did not cross zero. This filtering identified 326 significantly dysregulated genes, which were enriched in pathways associated with neuroinflammation, synaptic signalling, mitochondrial dysfunction and protein homeostasis. To refine the candidate biomarkers, we applied a machine learning pipeline combining Elastic Net regularisation, Random Forest importance ranking and stability selection, followed by classification with logistic regression, support vector machine and gradient boosting models. A 14-gene biomarker panel was selected consistently across resampling. In stratified nested cross-validation, the best-performing classifier achieved an area under the receiver operating characteristic curve (AUROC) of 0.91 ± 0.03, a sensitivity of 0.87 and a specificity of 0.85, and its performance remained stable in independent validation cohorts (AUROC = 0.88).
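For readers unfamiliar with the reported metrics, the following is a minimal sketch of how AUROC, sensitivity and specificity can be computed from classifier scores. It is not the authors' code: the function names, the 0.5 decision threshold and the toy labels and scores are assumptions made purely for illustration.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random case outscores a random control.

    Uses the pairwise (Mann-Whitney) formulation; tied scores count as 0.5.
    labels: 1 for case, 0 for control. scores: classifier outputs.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity at a fixed threshold (assumed 0.5 here).

    Assumes both classes are present in `labels`.
    """
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_case = s >= threshold
        if y == 1 and predicted_case:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_case:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical, not from the study): three cases, three controls.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores)
```

In a nested cross-validation setting, such metrics would be computed on each outer test fold and then summarised as a mean and standard deviation, which is one way a figure such as 0.91 ± 0.03 can arise.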
Combining effect size, FDR and confidence interval reporting improved biomarker reliability more than selection based on p-values alone. These results indicate that integrating stringent statistical genetics with interpretable machine learning algorithms increases the robustness and predictive capacity of gene expression-based biomarkers. The proposed framework offers a consistent and biologically grounded approach to AI-led biomarker discovery in neurodegenerative diseases, with prospective use in translational and precision medicine.
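The combined significance rule described in the abstract (FDR < 0.05, absolute log2 fold change of at least 1, confidence interval excluding zero) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the dictionary layout and the placeholder gene records are assumptions, and the p-values are taken as given rather than derived from a moderated linear model.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values stay monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(results, fdr=0.05, min_abs_lfc=1.0):
    """Keep genes passing all three criteria named in the abstract.

    results: list of dicts with keys gene, log2fc, ci_low, ci_high, p
    (a hypothetical layout chosen for this sketch).
    """
    q_values = benjamini_hochberg([r["p"] for r in results])
    selected = []
    for r, q in zip(results, q_values):
        ci_excludes_zero = r["ci_low"] > 0 or r["ci_high"] < 0
        if q < fdr and abs(r["log2fc"]) >= min_abs_lfc and ci_excludes_zero:
            selected.append(r["gene"])
    return selected


# Placeholder records (hypothetical values, not data from the study).
toy_results = [
    {"gene": "GENE_A", "log2fc": 1.4, "ci_low": 0.9, "ci_high": 1.9, "p": 0.005},
    {"gene": "GENE_B", "log2fc": 0.4, "ci_low": 0.1, "ci_high": 0.7, "p": 0.01},
    {"gene": "GENE_C", "log2fc": -1.2, "ci_low": -1.8, "ci_high": -0.6, "p": 0.03},
    {"gene": "GENE_D", "log2fc": 1.1, "ci_low": -0.2, "ci_high": 2.4, "p": 0.04},
]
toy_selected = select_genes(toy_results)
```

In this toy example all four genes pass the FDR threshold, yet GENE_B is dropped for its small effect size and GENE_D because its interval crosses zero, which illustrates why the combined rule is stricter than p-value selection alone.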
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

