Mathematics

Joint dimensionality reduction and clustering with missing data

Published on - Advanced Machine Learning and Data Science

Authors: Yasmine Agliz, Vincent Audigier, Ndèye Niang, Mohamed Nadif

To address the challenge of clustering high-dimensional data, subspace clustering methods such as Reduced K-means (RKM) have been proposed. These methods identify clusters by simultaneously finding the low-dimensional subspace and the matching partition, which makes them particularly effective in high-dimensional settings. However, high-dimensional data also mechanically tend to contain missing values, which complicates subspace clustering. Two new methods are proposed, based on two widely used approaches to missing data: direct methods and multiple imputation (MI). First, the direct method RKPOD accounts for incomplete data through a criterion based on observed values and provides the partition and the associated subspace. Second, the MI-RKM method, applied to multiply imputed datasets, yields several partitions and associated subspaces, which are aggregated through Non-Negative Matrix Factorization and Multiple Factor Analysis, respectively. The two methods are evaluated in a study based on simulated data and a real dataset. In the simulations, under the missing completely at random (MCAR) and missing at random (MAR) mechanisms and with an appropriate initialization of the missing values, both methods recover the initial clusters and corresponding subspaces in the reference case, as well as in the unbalanced and overlapping case. However, in the scenario with correlated noise variables, both methods struggle slightly more to recover the subspace and the corresponding partition.
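
To make the idea of a direct method concrete, the following is a minimal sketch, not the authors' RKPOD implementation. It assumes a k-POD-style fill-in scheme: missing cells are repeatedly replaced by the current low-rank cluster fit, and an alternating Reduced K-means step is run on the completed matrix. The function name `rkm_observed`, its parameters, the random initialization, and the fixed iteration count are all illustrative assumptions.

```python
import numpy as np

def rkm_observed(X, n_clusters, n_dims, n_iter=50, seed=0):
    """Hypothetical sketch: Reduced K-means with k-POD-style fill-in.

    X may contain NaN for missing entries. Returns cluster labels,
    orthonormal loadings A (p x q) and reduced centroids F (K x q).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # Center each column on its observed mean (assumption: no fully missing column).
    Xc = X - np.nanmean(X, axis=0)
    # Initial fill: missing cells set to 0, i.e. the observed column mean.
    Z = np.where(observed, Xc, 0.0)
    n = Z.shape[0]
    labels = rng.integers(n_clusters, size=n)  # random starting partition
    for _ in range(n_iter):
        U = np.eye(n_clusters)[labels]          # n x K membership matrix
        sizes = np.maximum(U.sum(axis=0), 1.0)  # guard against empty clusters
        means = (U.T @ Z) / sizes[:, None]      # K x p centroids, full space
        # Loadings: leading eigenvectors of the between-cluster cross-product.
        B = (means * sizes[:, None]).T @ means
        _, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
        A = eigvecs[:, ::-1][:, :n_dims]
        F = means @ A                           # K x q centroids, reduced space
        # Assign each observation to the nearest centroid in the subspace.
        scores = Z @ A
        dist = ((scores[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Fill-in step: keep observed cells, replace missing cells by the fit.
        Z = np.where(observed, Xc, F[labels] @ A.T)
    return labels, A, F

# Usage on synthetic data: 3 clusters in a 2-D subspace, 4 noise variables,
# and about 10% of the cells removed completely at random (MCAR).
rng = np.random.default_rng(1)
centers = np.array([[5.0, 0.0], [-5.0, 0.0], [0.0, 5.0]])
signal = np.repeat(centers, 30, axis=0) + rng.normal(scale=0.5, size=(90, 2))
data = np.hstack([signal, rng.normal(size=(90, 4))])
data[rng.random(data.shape) < 0.10] = np.nan
labels, A, F = rkm_observed(data, n_clusters=3, n_dims=2)
```

Because the partition is initialized at random, a single run can end in a local optimum; in practice several starts would be compared on the observed-value loss. A multiple-imputation variant would instead run a complete-data RKM on each imputed dataset and aggregate the resulting partitions and subspaces afterwards.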