Missing data estimation with Deep learning and Fuzzy C-means Dimensionality reduction for Microarray Gene Expression Data

K Ishthaq Ahamed, Shaheda Akthar

Data mining is a broad area of computer science that helps people, society, and industry solve problems. Data mining algorithms operate on data and produce results that depend on the application and the algorithm, and they are usually categorized by application and usage. Microarray gene expression data is medical data consisting of genes and their respective experimental conditions. Medical scientists work on microarray gene expression data to identify relations and dependencies between different types of genes. Two important issues are associated with these datasets: missing entries and high dimensionality. High dimensionality is a curse for processing the data and for applying data mining algorithms. The performance of data mining usually depends on the data and its characteristics; if the data has a large number of missing entries, applying data mining algorithms to it is not advisable, because their performance degrades. In this paper we propose a methodology to address both high dimensionality and missing entries. We use a fuzzy c-means clustering algorithm to reduce the dimensionality, and deep learning to estimate the missing entries in the datasets. The Spellman, Prostate, and Breast cancer real-world datasets are used in this paper. The methodology is applied to these datasets, and the results are compared with existing imputation algorithms: missForest, KNN (k-nearest neighbours), PMM (predictive mean matching), a tree-based approach, and random forest. The proposed model gives good results compared with these missing data imputation algorithms.
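
For readers unfamiliar with fuzzy c-means, the clustering step named in the abstract can be sketched roughly as follows. This is a minimal illustration of the standard algorithm in plain Python, not the authors' implementation: the abstract does not say how clusters are turned into a reduced representation, and the function name `fuzzy_c_means`, the fuzzifier `m = 2`, the stopping rule, and the toy `profiles` data are all assumptions made for this example.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means. Returns (centers, memberships).

    memberships[i][j] is the degree to which point i belongs to cluster j;
    each row sums to 1. The fuzzifier m must be greater than 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([value / total for value in row])

    centers = []
    exponent = 2.0 / (m - 1.0)
    for _ in range(n_iter):
        # Centers are means of the points, weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            weights = [u[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[k] for w, p in zip(weights, points)) / total
                for k in range(dim)
            ])

        # Memberships are recomputed from distances to the new centers.
        shift = 0.0
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centers]
            if min(dists) == 0.0:
                # The point coincides with at least one center:
                # share its membership among those centers only.
                zeros = [d == 0.0 for d in dists]
                new_row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                new_row = [
                    1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                    for d_j in dists
                ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row

        # Stop once no membership changed by more than tol.
        if shift < tol:
            break
    return centers, u


# Toy, made-up expression profiles forming two well-separated groups.
profiles = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centers, memberships = fuzzy_c_means(profiles, n_clusters=2)
# Hard label per profile: the cluster with the highest membership.
labels = [max(range(2), key=row.__getitem__) for row in memberships]
```

One plausible way to use such a result for dimensionality reduction, stated here only as an assumption, is to replace each group of similar profiles by its cluster center or to describe each profile by its membership vector.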

Volume 12 | Issue 6

Pages: 152-163

DOI: 10.5373/JARDCS/V12I6/S20201016