Research Article

Locally linear embedding and neighborhood rough set-based gene selection for gene expression data classification

Published: August 30, 2016
Genet. Mol. Res. 15(3): gmr8990 DOI: 10.4238/gmr.15038990

Abstract

Cancer subtype recognition and feature selection are important problems in the diagnosis and treatment of tumors. Here, we propose a novel gene selection approach for gene expression data classification. First, two classical feature reduction methods, locally linear embedding (LLE) and rough set (RS), were summarized. Their advantages and disadvantages were analyzed, and an optimized model for tumor gene selection was developed based on LLE and neighborhood RS (NRS). Bhattacharyya distance was introduced to delete irrelevant genes, pair-wise redundancy analysis was performed to remove strongly correlated genes, and a wavelet soft threshold was determined to eliminate noise in the gene datasets. Next, prior optimized search processing was carried out. A new approach combining the dimension reduction of LLE with the feature reduction of NRS (LLE-NRS) was developed for selecting gene subsets, and the open-source software Weka was then applied to distinguish different tumor types and to verify the cross-validation classification accuracy of the proposed method. The experimental results demonstrated that gene subsets selected by the proposed LLE-NRS yielded higher classification accuracy than those selected by other related models, and that the proposed approach is feasible and effective for high-dimensional tumor classification.
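The preprocessing steps named above (relevance filtering by Bhattacharyya distance, pair-wise redundancy removal, and soft thresholding) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: each gene is modeled as a one-dimensional Gaussian per class, there are exactly two classes labeled 0 and 1, redundancy is measured by Pearson correlation, and the names `filter_genes`, `top_k`, and `max_corr` are hypothetical. The soft-threshold function is shown on its own; the wavelet transform that would produce the coefficients, as well as the LLE and NRS stages, are omitted.

```python
import math
from statistics import mean, pvariance


def bhattacharyya_distance(a, b, eps=1e-12):
    """Bhattacharyya distance between two samples, each modeled as a 1-D Gaussian (assumption)."""
    mu_a, mu_b = mean(a), mean(b)
    # eps guards against division by zero for constant genes
    var_a, var_b = pvariance(a) + eps, pvariance(b) + eps
    var_sum = var_a + var_b
    mean_term = 0.25 * (mu_a - mu_b) ** 2 / var_sum
    spread_term = 0.5 * math.log(var_sum / (2.0 * math.sqrt(var_a * var_b)))
    return mean_term + spread_term


def pearson(u, v):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((x - mu_u) * (y - mu_v) for x, y in zip(u, v))
    norm = math.sqrt(sum((x - mu_u) ** 2 for x in u) * sum((y - mu_v) ** 2 for y in v))
    return cov / norm if norm > 0 else 0.0


def soft_threshold(coefficient, t):
    """Shrink a (wavelet) coefficient toward zero by t; values within [-t, t] become zero."""
    return math.copysign(max(abs(coefficient) - t, 0.0), coefficient)


def filter_genes(expr, labels, top_k, max_corr):
    """Hypothetical helper: keep the top_k most class-separating genes, then drop any gene
    whose absolute correlation with an already kept gene exceeds max_corr.

    expr maps gene name -> list of expression values; labels holds 0/1 per sample.
    """
    scores = {}
    for gene, values in expr.items():
        class0 = [x for x, y in zip(values, labels) if y == 0]
        class1 = [x for x, y in zip(values, labels) if y == 1]
        scores[gene] = bhattacharyya_distance(class0, class1)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    kept = []
    for gene in ranked:
        if all(abs(pearson(expr[gene], expr[k])) <= max_corr for k in kept):
            kept.append(gene)
    return kept


if __name__ == "__main__":
    # Toy data, invented for illustration only
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    expr = {
        "g1": [1.0, 2.0, 1.0, 2.0, 8.0, 9.0, 8.0, 9.0],
        "g2": [1.0, 2.0, 1.0, 2.0, 7.0, 8.0, 7.0, 8.0],  # nearly a copy of g1
        "g3": [5.0, 5.1, 4.9, 5.0, 5.0, 5.1, 4.9, 5.0],  # does not separate the classes
    }
    print(filter_genes(expr, labels, top_k=2, max_corr=0.9))
```

In this toy setting, the relevance filter discards the gene that does not separate the classes, and the redundancy step then discards the near-duplicate of the top-ranked gene. In practice the thresholds would have to be tuned per dataset.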


About the Authors