COMPUTATIONAL IDENTIFICATION AND FUNCTIONAL ANNOTATION OF NON-CODING GENETIC VARIANTS USING WHOLE-GENOME SEQUENCING DATA

Sudeshna Chakraborty; Dr Ranjana Patnaik; Jayakodi. T; Kasthuri K; Ankit Sachdeva; Dr. Anbukkarasi; Dr. Maharshikumar B. Shukla

doi:10.4238/33049296

Authors

Sudeshna Chakraborty, Professor, School of Computer Science and Engineering, Galgotias University, India
Dr Ranjana Patnaik, Professor, Department of Biomedical Sciences, School of Biosciences and Technology, Galgotias University, India
Jayakodi. T, Assistant Professor, Meenakshi College of Allied Health Sciences, Meenakshi Academy of Higher Education and Research
Kasthuri K, Associate Professor, Department of Biochemistry, Meenakshi Medical College Hospital & Research Institute, Meenakshi Academy of Higher Education and Research
Ankit Sachdeva, Centre of Research Impact and Outcome, Chitkara University, Rajpura – 140417, Punjab, India, ORCID: https://orcid.org/0009-0004-5602-4682
Dr. Anbukkarasi, Associate Professor, Pathology, Sree Balaji Medical College and Hospital, Bharath Institute of Higher Education and Research
Dr. Maharshikumar B. Shukla, Associate Professor, Faculty of Science, Gokul Global University, Sidhpur, Gujarat, India, ORCID: 0009-0004-9071-023X

Abstract

Non-coding genetic variants remain a persistent challenge in genomics because of their regulatory complexity and their lack of direct protein-coding effects. This paper presents a complete computational pipeline for identifying and functionally annotating non-coding variants from whole-genome sequencing (WGS) data. The proposed pipeline combines variant filtering, regulatory region mapping, multi-dimensional feature extraction, and machine-learning-based classification to rank candidate variants by importance. Whole-genome variant datasets were obtained from publicly available repositories and annotated with conservation scores, chromatin accessibility profiles, transcription factor binding sites, and epigenomic signatures. An Extreme Gradient Boosting (XGBoost) classifier was used to distinguish functional from non-functional variants on the basis of these combined features. The model achieved an accuracy of 92.4% and outperformed established tools such as CADD and GWAVA. In addition, functional enrichment analysis showed that the prioritized variants are strongly associated with major regulatory pathways and disease-relevant gene networks. These results indicate that combining multi-omics data with explainable machine learning can improve the prediction of functional non-coding variation and yield biological insight.
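
To make the regulatory region mapping step concrete, the following is a minimal sketch, not the authors' implementation. It assumes regulatory annotations are available as labeled genomic intervals (0-based, half-open) and turns the overlaps for a single variant position into binary features of the kind a classifier could consume. All coordinates, labels, and function names (`build_index`, `annotate`, `feature_vector`) are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical regulatory intervals: (chrom, start, end, label), 0-based half-open.
# Coordinates and labels are invented for illustration, not taken from the paper.
REGIONS = [
    ("chr1", 100, 200, "enhancer"),
    ("chr1", 150, 300, "open_chromatin"),
    ("chr1", 500, 600, "tf_binding_site"),
    ("chr2", 50, 80, "promoter"),
]

# Hypothetical feature order for the binary feature vector.
FEATURES = ["enhancer", "open_chromatin", "tf_binding_site", "promoter"]


def build_index(regions):
    """Group intervals by chromosome and sort each group by start position."""
    index = defaultdict(list)
    for chrom, start, end, label in regions:
        index[chrom].append((start, end, label))
    for chrom in index:
        index[chrom].sort()
    return index


def annotate(index, chrom, pos):
    """Return the sorted labels of all intervals that contain position pos."""
    intervals = index.get(chrom, [])
    starts = [start for start, _, _ in intervals]
    # Only intervals that start at or before pos can contain it.
    hi = bisect_right(starts, pos)
    return sorted(label for _, end, label in intervals[:hi] if pos < end)


def feature_vector(index, chrom, pos, feature_names):
    """Encode overlap with each annotation type as a 0/1 feature."""
    labels = set(annotate(index, chrom, pos))
    return [1 if name in labels else 0 for name in feature_names]


INDEX = build_index(REGIONS)
print(annotate(INDEX, "chr1", 160))
print(feature_vector(INDEX, "chr1", 550, FEATURES))
```

In a real pipeline these binary overlaps would sit alongside continuous features such as conservation scores, and a dedicated interval library would likely replace the linear scan over candidate intervals; the sketch only shows the shape of the mapping from variant position to annotation features.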